E-commerce User Behavior Analysis in 3 Steps: Hive + Python + ECharts
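I. Core Tech Stack and Data
Tech stack: Hive 2.3.9 (data processing), Python 3.8 (pandas), ECharts 5.4.3 (visualization)
Data: Alibaba Tianchi e-commerce user behavior dataset (5 million rows; download: https://pan.baidu.com/s/1uGddx2BzRdNencKnyN4quQ)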
II. Step 1: Hive Data Processing (SQL ready to copy and run)
1. Create the external table
CREATE EXTERNAL TABLE user_behavior (
user_id STRING, -- user ID
item_id STRING, -- item ID
behavior_type INT, -- 1=view / 2=favorite / 3=add to cart / 4=purchase
time STRING -- behavior time (YYYY-MM-DD HH)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/hive/data/' -- create the HDFS path beforehand: hdfs dfs -mkdir -p /hive/data/
TBLPROPERTIES ("skip.header.line.count"="1"); -- skip the CSV header row
2. Load the cleaned data (cleaning is covered in Step 2)
LOAD DATA LOCAL INPATH '/home/hadoop/clean_data.csv' -- local file path
OVERWRITE INTO TABLE user_behavior;
-- If you hit a permission error: chmod 777 /home/hadoop/clean_data.csv
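3. Three core queries (results saved locally)
① Behavior distribution (saved to /home/hadoop/res/behavior)
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/res/behavior'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' -- the default delimiter is \001 (Ctrl-A); use commas so pandas can read the output
SELECT behavior_type, COUNT(*) AS cnt FROM user_behavior GROUP BY behavior_type;
② Hourly trend (saved to /home/hadoop/res/hour)
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/res/hour'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT SUBSTR(time,12,2) AS hour, COUNT(*) AS cnt FROM user_behavior GROUP BY SUBSTR(time,12,2);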
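As a quick sanity check on the offsets in query ②: Hive's SUBSTR is 1-indexed, so SUBSTR(time,12,2) takes characters 12 and 13 of a "YYYY-MM-DD HH" string, which is the hour. Below is a minimal Python sketch of the same extraction. The helper name extract_hour and the sample timestamp are hypothetical, for illustration only.

```python
def extract_hour(ts: str) -> str:
    """Mirror Hive's SUBSTR(time, 12, 2) on a 'YYYY-MM-DD HH' string."""
    start = 12 - 1  # Hive positions are 1-indexed, Python slices are 0-indexed
    return ts[start:start + 2]


# Made-up timestamp in the format produced by the cleaning step
print(extract_hour("2014-12-06 20"))
```

If the cleaned time column ever changes format (for example, by adding minutes), the offsets in the SQL need to be rechecked in the same way.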
③ Purchase conversion (saved to /home/hadoop/res/conv)
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/res/conv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-- One pass with conditional aggregation instead of three scalar subqueries
SELECT
COUNT(DISTINCT CASE WHEN behavior_type=1 THEN user_id END) AS view_users,
COUNT(DISTINCT CASE WHEN behavior_type=3 THEN user_id END) AS cart_users,
COUNT(DISTINCT CASE WHEN behavior_type=4 THEN user_id END) AS buy_users
FROM user_behavior;
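Query ③ counts distinct users at each funnel stage. To cross-check the Hive result on a small extract, the same logic can be sketched in pandas. The function name funnel_counts and the sample rows are made up for illustration; only the column names user_id and behavior_type come from the table definition.

```python
import pandas as pd


def funnel_counts(df: pd.DataFrame) -> dict:
    """Distinct users per funnel stage (1=view, 3=add to cart, 4=purchase)."""
    stages = {"view": 1, "cart": 3, "buy": 4}
    return {
        name: int(df.loc[df["behavior_type"] == code, "user_id"].nunique())
        for name, code in stages.items()
    }


# Tiny made-up sample: u1 views twice and buys, u2 views and adds to cart, u3 only views
sample = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u1"],
    "behavior_type": [1, 1, 1, 3, 1, 4],
})
counts = funnel_counts(sample)
print(counts)
```

Note that these are stage-level distinct counts, not a strict per-user path: a user who purchases without adding to cart is still counted under "buy".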
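III. Step 2: Python Cleaning + Metric Calculation (code runs as-is)
1. Data cleaning
import pandas as pd
# Read the raw data -> drop duplicates -> convert the time format
raw_df = pd.read_csv("user_behavior.csv")
clean_df = raw_df.drop_duplicates().copy()
# Assumes the raw time column holds Unix timestamps in seconds
clean_df["time"] = pd.to_datetime(clean_df["time"], unit="s").dt.strftime("%Y-%m-%d %H")
clean_df.to_csv("clean_data.csv", index=False)
print(f"Cleaning done! {len(clean_df)} valid rows")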
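Before loading clean_data.csv into Hive, it can help to confirm that the file matches what the table expects. The following is a minimal sketch of such a check, assuming the four columns from the table definition. The function name validate_clean and the sample rows are hypothetical.

```python
import pandas as pd

# 'YYYY-MM-DD HH', the format the Hive queries rely on
TIME_REGEX = r"^\d{4}-\d{2}-\d{2} \d{2}$"


def validate_clean(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    problems = []
    if df.duplicated().any():
        problems.append("duplicate rows remain")
    bad_time = ~df["time"].astype(str).str.match(TIME_REGEX)
    if bad_time.any():
        problems.append(f"{int(bad_time.sum())} rows with malformed time")
    bad_type = ~df["behavior_type"].isin([1, 2, 3, 4])
    if bad_type.any():
        problems.append(f"{int(bad_type.sum())} rows with unknown behavior_type")
    return problems


# Made-up rows: the third one has a truncated time and an invalid behavior code
sample = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "item_id": ["i1", "i2", "i3"],
    "behavior_type": [1, 4, 9],
    "time": ["2014-12-06 20", "2014-12-07 09", "2014-12-06"],
})
issues = validate_clean(sample)
print(issues)
```

In practice this would be called on clean_df right before to_csv, so that format problems surface in Python rather than as NULLs or odd hour values in Hive.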
2. Core metric calculation (prints the business conclusions)
import pandas as pd
# Load the Hive results (replace with your own paths)
behavior = pd.read_csv("/home/hadoop/res/behavior/000000_0", names=["type", "cnt"])
hour = pd.read_csv("/home/hadoop/res/hour/000000_0", names=["hour", "cnt"])
conv = pd.read_csv("/home/hadoop/res/conv/000000_0", names=["view", "cart", "buy"])
# 1. Behavior distribution share
behavior["name"] = behavior["type"].map({1:"View",2:"Favorite",3:"Add to cart",4:"Purchase"})
behavior["ratio"] = (behavior["cnt"]/behavior["cnt"].sum()).round(3)
print("=== Behavior distribution ===")
print(behavior[["name", "cnt", "ratio"]]) # sample output: View 420000 0.84
# 2. Peak activity hours (top 3)
hour["hour"] = hour["hour"].astype(int)
print("\n=== Peak activity hours ===")
print(hour.nlargest(3, "cnt")[["hour", "cnt"]]) # sample output: 20:00 62000, 21:00 59000
# 3. Purchase conversion rates
cart_conv = (conv["cart"].iloc[0]/conv["view"].iloc[0]*100).round(1)
buy_conv = (conv["buy"].iloc[0]/conv["cart"].iloc[0]*100).round(1)
print("\n=== Conversion rates ===")
print(f"View -> cart: {cart_conv}% | Cart -> purchase: {buy_conv}%") # sample: 15.0%, 15.0%
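The ECharts page in Step 3 expects three arrays named pieData, lineData, and funnelData. Rather than typing the Hive results in by hand, one option is to build them from the DataFrames above and dump them as JSON. This is only a sketch: the function name to_echarts_payload, the English labels, and the sample numbers are assumptions for illustration.

```python
import json

import pandas as pd

# Display labels for behavior codes (an assumption; rename as needed)
NAMES = {1: "View", 2: "Favorite", 3: "Add to cart", 4: "Purchase"}


def to_echarts_payload(behavior: pd.DataFrame, hour: pd.DataFrame, conv: pd.DataFrame) -> dict:
    """Shape the three Hive result tables into the arrays used by the ECharts page."""
    pie = [{"name": NAMES[int(t)], "value": int(c)}
           for t, c in zip(behavior["type"], behavior["cnt"])]
    # Sort by hour so the line chart's x axis runs in order
    ordered = hour.assign(hour=hour["hour"].astype(int)).sort_values("hour")
    line = [{"h": int(h), "v": int(v)} for h, v in zip(ordered["hour"], ordered["cnt"])]
    row = conv.iloc[0]
    funnel = [
        {"name": "Viewing users", "value": int(row["view"])},
        {"name": "Cart users", "value": int(row["cart"])},
        {"name": "Purchasing users", "value": int(row["buy"])},
    ]
    return {"pieData": pie, "lineData": line, "funnelData": funnel}


# Made-up miniature inputs with the same column names as the script above
behavior = pd.DataFrame({"type": [1, 3, 4], "cnt": [10, 3, 1]})
hour = pd.DataFrame({"hour": [21, 20], "cnt": [5, 9]})
conv = pd.DataFrame({"view": [3], "cart": [2], "buy": [1]})

payload = to_echarts_payload(behavior, hour, conv)
print(json.dumps(payload, ensure_ascii=False))
```

The printed JSON can be pasted over the sample constants in the HTML page, or written to a file and loaded with fetch if the page is served over HTTP.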
IV. Step 3: Interactive ECharts Visualization (open the HTML file directly)
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<script src="https://cdn.jsdelivr.net/npm/echarts@5.4.3/dist/echarts.min.js"></script>
<style>
.dashboard { display: grid; grid-template-columns:1fr 1fr; gap:20px; margin:20px; }
.chart { width:100%; height:400px; border:1px solid #f5f5f5; border-radius:8px; padding:15px; }
</style>
</head>
<body>
<div class="dashboard">
<div class="chart" id="pie"></div>
<div class="chart" id="line"></div>
<div class="chart" id="funnel"></div>
</div>
<script>
// Initialize the charts
const pie = echarts.init(document.getElementById("pie"));
const line = echarts.init(document.getElementById("line"));
const funnel = echarts.init(document.getElementById("funnel"));
// Replace with your own Hive results (sample data below)
const pieData = [{name:"View",value:420000},{name:"Add to cart",value:50000},{name:"Purchase",value:25000},{name:"Favorite",value:5000}];
const lineData = Array.from({length:24}, (_,i)=>({h:i, v:[18000,15000,14000,13000,12000,15000,20000,28000,35000,38000,42000,45000,43000,39000,40000,41000,43000,46000,52000,58000,62000,59000,51000,32000][i]}));
const funnelData = [{name:"Viewing users",value:100000},{name:"Cart users",value:15000},{name:"Purchasing users",value:2250}];
// 1. Behavior distribution pie chart
pie.setOption({
tooltip:{formatter:"{b}<br>Count: {c}<br>Share: {d}%"},
series:[{type:"pie", radius:["40%","70%"], data:pieData, itemStyle:{borderRadius:8, borderColor:"#fff"}}]
});
// 2. Hourly trend line chart (zoomable)
line.setOption({
tooltip:{trigger:"axis"},
dataZoom:[{type:"inside"},{type:"slider"}],
xAxis:{data:lineData.map(i=>`${i.h}:00`)},
yAxis:{name:"Actions"},
series:[{type:"line", data:lineData.map(i=>i.v), smooth:true, lineStyle:{color:"#FF6B6B", width:2}}]
});
// 3. Conversion funnel chart
funnel.setOption({
tooltip:{formatter:p=>`${p.name}: ${p.value} users<br>Conversion: ${p.dataIndex>0?(p.value/funnelData[p.dataIndex-1].value*100).toFixed(1)+"%":"-"}`},
series:[{type:"funnel", data:funnelData, itemStyle:{color:p=>["#4ECDC4","#45B7D1","#96CEB4"][p.dataIndex]}}]
});
// Resize the charts with the window
window.onresize=()=>{pie.resize(); line.resize(); funnel.resize();};
</script>
</body>
</html>
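V. Project Highlights
1. End-to-end pipeline: cleaning 5 million rows → structured processing in Hive → metric calculation in Python → interactive display with ECharts;
2. Business value: identifies 20:00-21:00 as the peak activity window (per the sample output above) and recommends pushing coupons in that window to lift conversion;
3. Skills demonstrated: Hive SQL, Python data processing, and ECharts visualization.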
I hope this helps! If you have other approaches or methods, feel free to discuss them in the comments.