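I. Core Tech Stack and Data

Tech stack: Hive 2.3.9 (data processing), Python 3.8 (pandas), ECharts 5.4.3 (visualization)

Data: Alibaba Tianchi e-commerce user behavior data (5 million rows; download link: https://pan.baidu.com/s/1uGddx2BzRdNencKnyN4quQ)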

II. Step 1: Hive Data Processing (SQL can be copied and run as-is)

1. Create the external table

-- Create the HDFS path beforehand: hdfs dfs -mkdir -p /hive/data/
CREATE EXTERNAL TABLE user_behavior (
    user_id STRING, -- user ID
    item_id STRING, -- item ID
    behavior_type INT, -- 1=view / 2=favorite / 3=add to cart / 4=purchase
    time STRING -- event time (YYYY-MM-DD HH)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/hive/data/'
TBLPROPERTIES ("skip.header.line.count"="1"); -- skip the CSV header row

2. Load the cleaned data (cleaning is covered in Step 2)

LOAD DATA LOCAL INPATH '/home/hadoop/clean_data.csv' -- local file path
OVERWRITE INTO TABLE user_behavior;
-- If a permission error occurs: chmod 777 /home/hadoop/clean_data.csv
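Hive maps delimited text to table columns by position, not by name, so it can be worth confirming the CSV header order before running LOAD DATA. The following is a minimal sketch of such a check; `header_matches` and `EXPECTED_COLUMNS` are hypothetical names introduced here for illustration, and the expected order is taken from the table definition above.

```python
import csv
import io

# Column order declared by the user_behavior table; Hive assigns fields by position.
EXPECTED_COLUMNS = ["user_id", "item_id", "behavior_type", "time"]


def header_matches(csv_file, expected=EXPECTED_COLUMNS):
    """Return True if the first CSV row equals the expected column list."""
    header = next(csv.reader(csv_file), [])
    return [name.strip() for name in header] == list(expected)


# In-memory sample; for the real file, pass open("/home/hadoop/clean_data.csv", newline="").
sample = io.StringIO("user_id,item_id,behavior_type,time\n1,42,1,2014-12-06 02\n")
print(header_matches(sample))  # prints True
```

If the check fails, reordering the columns in pandas before writing the CSV is usually simpler than changing the table definition.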

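3. Three core queries (results saved to local disk)

① Behavior distribution (saved to /home/hadoop/res/behavior)

-- Hive writes \001-delimited files by default; emit commas so pandas can read them in Step 2
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/res/behavior'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT behavior_type, COUNT(*) AS cnt FROM user_behavior GROUP BY behavior_type;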

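② Hourly trend (saved to /home/hadoop/res/hour)

-- Comma-delimited output, same reason as in query ①
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/res/hour'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT SUBSTR(time,12,2) AS hour, COUNT(*) AS cnt FROM user_behavior GROUP BY SUBSTR(time,12,2);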
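Hive's SUBSTR counts positions from 1, so `SUBSTR(time,12,2)` picks characters 12 and 13 of a `YYYY-MM-DD HH` string, which are the two hour digits. The sketch below mirrors that indexing in Python to make the offset easy to verify; `hive_substr` is a hypothetical helper that only covers positive start positions, and the timestamp is a made-up sample.

```python
def hive_substr(text, start, length):
    """Mimic Hive SUBSTR for a positive, 1-based start position."""
    if start < 1:
        raise ValueError("this sketch only handles start >= 1")
    return text[start - 1:start - 1 + length]


stamp = "2014-12-06 20"  # hypothetical sample in the YYYY-MM-DD HH format
print(hive_substr(stamp, 12, 2))  # prints 20
```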

③ Purchase conversion (saved to /home/hadoop/res/conv)

-- One pass with conditional aggregation; output columns are view, cart, buy in that order
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/res/conv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT
  COUNT(DISTINCT CASE WHEN behavior_type=1 THEN user_id END) AS view_users,
  COUNT(DISTINCT CASE WHEN behavior_type=3 THEN user_id END) AS cart_users,
  COUNT(DISTINCT CASE WHEN behavior_type=4 THEN user_id END) AS buy_users
FROM user_behavior;
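The conversion query counts distinct users per stage, not events. As a way to sanity-check the Hive numbers on a small sample, the same counts can be sketched in pandas; `funnel_users` is a hypothetical helper, and the sample rows are invented for illustration only.

```python
import pandas as pd


def funnel_users(df):
    """Count distinct users per funnel stage (1=view, 3=add to cart, 4=purchase)."""
    stages = {"view": 1, "cart": 3, "buy": 4}
    return {
        name: df.loc[df["behavior_type"] == code, "user_id"].nunique()
        for name, code in stages.items()
    }


# Invented sample: u1 views twice and buys once, u2 views and adds to cart, u3 only views.
sample = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u1"],
    "behavior_type": [1, 1, 1, 3, 1, 4],
})
print(funnel_users(sample))  # prints {'view': 3, 'cart': 1, 'buy': 1}
```

Note that u1 buys without ever adding to cart, so the stages are not strict subsets of each other; the cart-to-purchase rate computed later can exceed 100% on data like this.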

III. Step 2: Python Cleaning + Metric Calculation (code runs as-is)

1. Data cleaning

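import pandas as pd

# Read raw data → drop duplicates → convert the time format
raw_df = pd.read_csv("user_behavior.csv")
# .copy() avoids a possible SettingWithCopyWarning on the assignment below
clean_df = raw_df.drop_duplicates().copy()
# Assumes the raw time column holds Unix timestamps in seconds
clean_df["time"] = pd.to_datetime(clean_df["time"], unit="s").dt.strftime("%Y-%m-%d %H")
clean_df.to_csv("clean_data.csv", index=False)
print(f"Cleaning done! {len(clean_df)} valid rows")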
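The cleaning step above only removes exact duplicates. If the raw file might also contain missing fields or behavior codes outside 1 to 4, one optional extension is sketched below; `drop_invalid_rows` and `VALID_BEHAVIORS` are hypothetical names, and whether such rows exist in the Tianchi data is an assumption, not something the original steps establish.

```python
import pandas as pd

# Behavior codes used by the Hive table: 1=view, 2=favorite, 3=add to cart, 4=purchase.
VALID_BEHAVIORS = {1, 2, 3, 4}


def drop_invalid_rows(df):
    """Drop rows with missing fields or an unknown behavior_type."""
    out = df.dropna(subset=["user_id", "item_id", "behavior_type", "time"])
    out = out[out["behavior_type"].isin(VALID_BEHAVIORS)]
    return out.reset_index(drop=True)


# Invented sample: the second row lacks a user_id, the third has an unknown code.
sample = pd.DataFrame({
    "user_id": ["u1", None, "u2", "u3"],
    "item_id": ["i1", "i2", "i3", "i4"],
    "behavior_type": [1, 2, 9, 4],
    "time": ["2014-12-06 02"] * 4,
})
print(len(drop_invalid_rows(sample)))  # prints 2
```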

2. Core metric calculation (outputs the business conclusions)

import pandas as pd

# Load the Hive results (replace with your own paths)
behavior = pd.read_csv("/home/hadoop/res/behavior/000000_0", names=["type", "cnt"])
hour = pd.read_csv("/home/hadoop/res/hour/000000_0", names=["hour", "cnt"])
conv = pd.read_csv("/home/hadoop/res/conv/000000_0", names=["view", "cart", "buy"])

# 1. Share of each behavior type
behavior["name"] = behavior["type"].map({1:"View",2:"Favorite",3:"Add to cart",4:"Purchase"})
behavior["ratio"] = (behavior["cnt"]/behavior["cnt"].sum()).round(3)
print("=== Behavior distribution ===")
print(behavior[["name", "cnt", "ratio"]]) # sample output: View 420000 0.84

# 2. Peak activity hours (top 3)
hour["hour"] = hour["hour"].astype(int)
print("\n=== Peak activity hours ===")
print(hour.nlargest(3, "cnt")[["hour", "cnt"]]) # sample output: hour 20 62000, hour 21 59000

# 3. Purchase conversion rates
cart_conv = (conv["cart"].iloc[0]/conv["view"].iloc[0]*100).round(1)
buy_conv = (conv["buy"].iloc[0]/conv["cart"].iloc[0]*100).round(1)
print("\n=== Conversion rates ===")
print(f"View → add to cart: {cart_conv}% | Add to cart → purchase: {buy_conv}%") # sample: 15.0%, 15.0%
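The script above reads a single file named `000000_0` from each result directory. When a query runs with more than one reducer, Hive can write several part files (`000000_0`, `000001_0`, and so on), so reading only the first would silently drop rows. The sketch below reads every part file in a directory; `load_hive_dir` is a hypothetical helper, and it assumes the files are comma-delimited, as the `pd.read_csv` calls above expect.

```python
import glob
import os

import pandas as pd


def load_hive_dir(directory, names):
    """Read every non-empty, non-hidden part file in a Hive output directory."""
    paths = sorted(
        p for p in glob.glob(os.path.join(directory, "*"))
        if os.path.isfile(p)
        and os.path.getsize(p) > 0
        and not os.path.basename(p).startswith((".", "_"))
    )
    if not paths:
        raise FileNotFoundError(f"no result files in {directory}")
    frames = [pd.read_csv(p, names=names) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Usage (path taken from the queries in Step 1):
# behavior = load_hive_dir("/home/hadoop/res/behavior", ["type", "cnt"])
```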

IV. Step 3: ECharts Interactive Visualization (open the HTML file directly)

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <script src="https://cdn.jsdelivr.net/npm/echarts@5.4.3/dist/echarts.min.js"></script>
    <style>
        .dashboard { display: grid; grid-template-columns:1fr 1fr; gap:20px; margin:20px; }
        .chart { width:100%; height:400px; border:1px solid #f5f5f5; border-radius:8px; padding:15px; }
    </style>
</head>
<body>
    <div class="dashboard">
        <div class="chart" id="pie"></div>
        <div class="chart" id="line"></div>
        <div class="chart" id="funnel"></div>
    </div>

    <script>
        // Initialize the charts
        const pie = echarts.init(document.getElementById("pie"));
        const line = echarts.init(document.getElementById("line"));
        const funnel = echarts.init(document.getElementById("funnel"));

        // Replace with your own Hive results (the values below are sample data)
        const pieData = [{name:"View",value:420000},{name:"Add to cart",value:50000},{name:"Purchase",value:25000},{name:"Favorite",value:5000}];
        const lineData = Array.from({length:24}, (_,i)=>({h:i, v:[18000,15000,14000,13000,12000,15000,20000,28000,35000,38000,42000,45000,43000,39000,40000,41000,43000,46000,52000,58000,62000,59000,51000,32000][i]}));
        const funnelData = [{name:"Viewing users",value:100000},{name:"Add-to-cart users",value:15000},{name:"Purchasing users",value:2250}];

        // 1. Behavior distribution pie chart
        pie.setOption({
            tooltip:{formatter:"{b}<br>Count: {c}<br>Share: {d}%"},
            series:[{type:"pie", radius:["40%","70%"], data:pieData, itemStyle:{borderRadius:8, borderColor:"#fff"}}]
        });

        // 2. Hourly trend line chart (supports zooming)
        line.setOption({
            tooltip:{trigger:"axis"},
            dataZoom:[{type:"inside"},{type:"slider"}],
            xAxis:{data:lineData.map(i=>`${i.h}:00`)},
            yAxis:{name:"Events"},
            series:[{type:"line", data:lineData.map(i=>i.v), smooth:true, lineStyle:{color:"#FF6B6B", width:2}}]
        });

        // 3. Conversion funnel chart
        funnel.setOption({
            tooltip:{formatter:p=>`${p.name}: ${p.value} users<br>Conversion: ${p.dataIndex>0?(p.value/funnelData[p.dataIndex-1].value*100).toFixed(1)+"%":"-"}`},
            series:[{type:"funnel", data:funnelData, itemStyle:{color:p=>["#4ECDC4","#45B7D1","#96CEB4"][p.dataIndex]}}]
        });

        // Resize the charts with the window
        window.onresize=()=>{pie.resize(); line.resize(); funnel.resize();};
    </script>
</body>
</html>
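The page above uses hard-coded sample values for `pieData`, `lineData`, and `funnelData`. One way to replace them with real results is to generate the three arrays from the Step 2 DataFrames and paste the JSON into the script. The sketch below assumes the column names used in Step 2 (`type`/`cnt`, `hour`/`cnt`, `view`/`cart`/`buy`); `to_echarts_payload`, `LABELS`, and the tiny sample frames are hypothetical and only for illustration.

```python
import json

import pandas as pd

LABELS = {1: "View", 2: "Favorite", 3: "Add to cart", 4: "Purchase"}


def to_echarts_payload(behavior, hour, conv):
    """Convert the three result frames into the pieData / lineData / funnelData shapes."""
    pie = [{"name": LABELS[int(t)], "value": int(c)}
           for t, c in zip(behavior["type"], behavior["cnt"])]
    # Hours with no events are filled with 0 so the line chart always has 24 points.
    hour = hour.assign(hour=hour["hour"].astype(int))
    by_hour = hour.set_index("hour")["cnt"].reindex(range(24), fill_value=0)
    line = [{"h": int(h), "v": int(v)} for h, v in by_hour.items()]
    row = conv.iloc[0]
    stages = [("Viewing users", "view"), ("Add-to-cart users", "cart"),
              ("Purchasing users", "buy")]
    funnel = [{"name": label, "value": int(row[col])} for label, col in stages]
    return {"pieData": pie, "lineData": line, "funnelData": funnel}


# Invented sample frames standing in for the Hive results.
behavior = pd.DataFrame({"type": [1, 4], "cnt": [8, 2]})
hour = pd.DataFrame({"hour": [20, 21], "cnt": [6, 4]})
conv = pd.DataFrame({"view": [5], "cart": [2], "buy": [1]})
payload = to_echarts_payload(behavior, hour, conv)
print(json.dumps(payload["funnelData"], ensure_ascii=False))
```

Each printed array can be pasted in place of the matching `const` in the page; serving the JSON from a file and loading it with `fetch` would also work but requires a local web server rather than opening the HTML directly.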

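V. Project Highlights

1. End-to-end delivery: cleaning 5 million rows → structured processing in Hive → metric calculation in Python → interactive display with ECharts;

2. Business value: identified 20:00-21:00 as the peak activity window and proposed an actionable recommendation: push coupons during that window to lift conversion;

3. Skills demonstrated: proficient use of Hive SQL, Python data processing, and ECharts visualization.

I hope this helps. If you have other good approaches, feel free to discuss them!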
