老码农和你一起学AI系列：机器学习实战-工业级机器学习工作流构建

"""将用户 tenure（在网时长）分箱为离散特征"""# 拟合阶段无需操作，返回自身# X为DataFrame格式，添加分箱特征labels=['0-1年', '1-2年', '2-5年', '5年以上']# 在预处理前添加分箱步骤('preprocessor', preprocessor) # 之前定义的ColumnTransformer])模型性能：梯度提升模型在测试集上达到 0.82 的

chilavert318

1279人浏览 · 2025-09-04 08:00:00

chilavert318 · 2025-09-04 08:00:00 发布

在工业级机器学习项目中，一个标准化、可复用的工作流是确保模型稳定性与可维护性的核心。今天将整合 Scikit-learn 工具链与实战案例，构建从数据处理到模型部署的完整流程。

一、Scikit-learn Pipeline

Pipeline 是 Scikit-learn 中实现流程标准化的核心工具，它通过将数据预处理、特征工程、模型训练等步骤串联成一个可调用对象，解决了数据泄露（Data Leakage）和流程割裂的问题。

1. 构建完整的 Pipeline 流程

一个工业级 Pipeline 通常包含三个核心阶段，以客户流失预测场景为例：

（1）特征预处理流水线

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.impute import SimpleImputer

# 定义特征类型

numeric_features = ['tenure', 'monthly_charges', 'total_charges']

categorical_features = ['gender', 'contract_type', 'payment_method']

# 数值特征处理流水线：填充缺失值→标准化

numeric_transformer = Pipeline(steps=[

('imputer', SimpleImputer(strategy='median')), # 用中位数填充缺失值

('scaler', StandardScaler()) # 标准化到均值为0，方差为1

])

# 类别特征处理流水线：填充缺失值→独热编码

categorical_transformer = Pipeline(steps=[

('imputer', SimpleImputer(strategy='most_frequent')), # 用众数填充

('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)) # 忽略未知类别

])

# 合并特征处理流程

preprocessor = ColumnTransformer(

transformers=[

('num', numeric_transformer, numeric_features),

('cat', categorical_transformer, categorical_features)

])

（2）特征选择与模型训练整合

在预处理基础上添加特征选择和模型训练步骤，形成完整 Pipeline：

from sklearn.feature_selection import SelectKBest, f_classif

from sklearn.ensemble import RandomForestClassifier

full_pipeline = Pipeline(steps=[

('preprocessor', preprocessor), # 特征预处理

('feature_selector', SelectKBest(f_classif, k=10)), # 保留10个最优特征

('classifier', RandomForestClassifier(random_state=42)) # 分类模型

])

Pipeline 的核心优势在于：

所有步骤在fit()时仅使用训练数据，避免测试数据影响预处理参数（如标准化的均值 / 方差）

支持整体网格搜索，确保参数调优时的流程一致性

可直接序列化保存，便于部署

2. 自定义 Pipeline 中的 Transformer

当内置 Transformer 无法满足需求时（如业务特定的特征生成），可通过BaseEstimator和TransformerMixin自定义：


from sklearn.base import BaseEstimator, TransformerMixin

import pandas as pd

class tenure_binning_transformer(BaseEstimator, TransformerMixin):

"""将用户 tenure（在网时长）分箱为离散特征"""

def __init__(self, bins=[0, 12, 24, 60, 100]):

self.bins = bins

def fit(self, X, y=None):

# 拟合阶段无需操作，返回自身

return self

def transform(self, X):

# X为DataFrame格式，添加分箱特征

X_copy = X.copy()

X_copy['tenure_bin'] = pd.cut(

X_copy['tenure'],

bins=self.bins,

labels=['0-1年', '1-2年', '2-5年', '5年以上']

)

return X_copy

使用时直接加入 Pipeline：

# 在预处理前添加分箱步骤

preprocessing_pipeline = Pipeline(steps=[

('tenure_binner', tenure_binning_transformer()),

('preprocessor', preprocessor) # 之前定义的ColumnTransformer

])

二、项目实战：客户流失预测

以电信客户流失数据集（包含客户基本信息、消费数据和流失标签）为例，完整演示工业级工作流。

1. 数据探索与清洗（Pandas 全应用）

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# 加载数据

df = pd.read_csv('telco_customer_churn.csv')

# 1. 数据概览

print(f"数据集形状: {df.shape}")

print(df.info()) # 查看数据类型和缺失值

print(df.describe(include='all')) # 统计描述

# 2. 缺失值处理

missing_values = df.isnull().sum()

print(f"缺失值分布:\n{missing_values[missing_values > 0]}")

# 处理total_charges的缺失值（数值型）

df['total_charges'] = pd.to_numeric(df['total_charges'], errors='coerce') # 转换为数值型

df['total_charges'].fillna(df['monthly_charges'] * df['tenure'], inplace=True) # 用月费×时长填充

# 3. 异常值检测（数值特征）

numeric_cols = ['tenure', 'monthly_charges', 'total_charges']

for col in numeric_cols:

plt.figure(figsize=(8, 3))

sns.boxplot(x=df[col])

plt.title(f'Boxplot of {col}')

plt.show()

# 4. 目标变量分布（判断类别不平衡）

churn_distribution = df['churn'].value_counts(normalize=True)

print(f"流失率: {churn_distribution['Yes']:.2%}")

sns.countplot(x='churn', data=df)

plt.title('Churn Distribution')

plt.show()

2. 特征工程

在自定义 Transformer 基础上扩展特征生成：

# 1. 比率特征：单位时长消费（反映客户价值）

df['charge_per_month'] = df['total_charges'] / (df['tenure'] + 1) # +1避免除零

# 2. 交互特征：合同类型×月费（长期合同高消费客户更稳定）

df['contract_monthly_interact'] = df['contract_type'].map({

'Month-to-month': 1, 'One year': 2, 'Two year': 3

}) * df['monthly_charges']

# 3. 对预处理流水线更新

numeric_features = ['tenure', 'monthly_charges', 'total_charges', 'charge_per_month']

categorical_features = ['gender', 'contract_type', 'payment_method', 'tenure_bin']

3. 模型选择与调优

（1）数据集拆分

from sklearn.model_selection import train_test_split

X = df.drop(['customer_id', 'churn'], axis=1)

y = (df['churn'] == 'Yes').astype(int) # 转换为0-1标签

X_train, X_test, y_train, y_test = train_test_split(

X, y, test_size=0.2, random_state=42, stratify=y # 分层抽样保持流失率一致

)

（2）多模型对比与网格搜索

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

from sklearn.ensemble import GradientBoostingClassifier

# 定义待测试的模型与参数网格

models = [

{

'name': 'LogisticRegression',

'pipeline': Pipeline(steps=[('preprocessor', preprocessing_pipeline), ('classifier', LogisticRegression(max_iter=1000))]),

'params': {'classifier__C': [0.01, 0.1, 1, 10]}

},

{

'name': 'RandomForest',

'pipeline': Pipeline(steps=[('preprocessor', preprocessing_pipeline), ('classifier', RandomForestClassifier())]),

'params': {'classifier__n_estimators': [100, 200], 'classifier__max_depth': [5, 10, None]}

},

{

'name': 'GradientBoosting',

'pipeline': Pipeline(steps=[('preprocessor', preprocessing_pipeline), ('classifier', GradientBoostingClassifier())]),

'params': {'classifier__learning_rate': [0.01, 0.1], 'classifier__n_estimators': [100, 200]}

}

]

# 网格搜索训练

best_models = {}

for model in models:

grid_search = GridSearchCV(

model['pipeline'],

model['params'],

cv=5, # 5折交叉验证

scoring='f1', # 关注F1分数（平衡精确率和召回率）

n_jobs=-1 # 并行计算

)

grid_search.fit(X_train, y_train)

best_models[model['name']] = {

'model': grid_search.best_estimator_,

'best_params': grid_search.best_params_,

'best_score': grid_search.best_score_

}

print(f"{model['name']}最佳F1分数: {grid_search.best_score_:.4f}")

4. 模型评估与展示

选择表现最优的梯度提升模型进行详细评估：


from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

best_model = best_models['GradientBoosting']['model']

y_pred = best_model.predict(X_test)

y_prob = best_model.predict_proba(X_test)[:, 1] # 流失概率

# 1. 分类报告

print("分类报告:\n", classification_report(y_test, y_pred))

# 2. 混淆矩阵可视化

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',

xticklabels=['未流失', '流失'],

yticklabels=['未流失', '流失'])

plt.xlabel('预测标签')

plt.ylabel('真实标签')

plt.title('混淆矩阵')

plt.show()

# 3. ROC曲线与AUC

fpr, tpr, _ = roc_curve(y_test, y_prob)

roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))

plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC曲线 (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')

plt.xlabel('假正例率')

plt.ylabel('真正例率')

plt.title('ROC曲线')

plt.legend()

plt.show()

简易报告总结：

模型性能：梯度提升模型在测试集上达到 0.82 的 F1 分数和 0.89 的 AUC，表现最优
关键发现：合同类型（月付用户流失风险高 3 倍）、在网时长（1 年内用户流失率达 40%）是最重要的预测特征
改进方向：针对高风险用户群体设计挽留方案，建议重点关注月付且在网 6 个月内的客户

三、进阶话题

1. 模型持久化

使用joblib（适合大型模型）保存训练好的 Pipeline：

import joblib

# 保存模型

joblib.dump(best_model, 'telco_churn_model.pkl')

# 加载模型（部署时使用）

loaded_model = joblib.load('telco_churn_model.pkl')

相比pickle，joblib的优势在于：

对 NumPy 数组的序列化更高效

支持分块存储大型模型

加载速度更快

2. 简易 API 部署（FastAPI）

将模型包装为 HTTP 服务，支持实时预测：


from fastapi import FastAPI

from pydantic import BaseModel

import pandas as pd

app = FastAPI(title="客户流失预测API")

model = joblib.load('telco_churn_model.pkl')

# 定义输入数据格式

class CustomerData(BaseModel):

gender: str

tenure: int

monthly_charges: float

total_charges: float

contract_type: str

payment_method: str

@app.post("/predict")

def predict_churn(customer: CustomerData):

# 转换输入为DataFrame

input_data = pd.DataFrame([customer.dict()])

# 预测流失概率

churn_prob = model.predict_proba(input_data)[0, 1]

return {

"customer_id": "unknown",

"churn_probability": float(churn_prob),

"risk_level": "高" if churn_prob > 0.7 else "中" if churn_prob > 0.3 else "低"

}

启动服务后，通过如下命令测试：

uvicorn churn_api:app --reload

curl -X POST "http://localhost:8000/predict" -H "Content-Type: application/json" -d '{"gender":"Male","tenure":6,"monthly_charges":70.5,"total_charges":423.0,"contract_type":"Month-to-month","payment_method":"Electronic check"}'

最后小结

工业级机器学习工作流的核心是标准化与可复现性：

Pipeline 确保从特征处理到模型训练的端到端一致性

多模型对比与网格搜索是提升性能的关键手段

完整的评估体系与部署方案使模型真正产生业务价值

通过本文的框架，可快速复用于其他场景（如新闻分类只需调整特征预处理和模型类型），实现从原型到生产的无缝过渡。未完待续...........

2048 AI社区

有“AI”的1024 = 2048，欢迎大家加入2048 AI社区

更多推荐

10个DeepSeek高效工作秘籍

智能助手的定位：代码生成、文本优化、数据分析等支持的任务类型：编程辅助、文档撰写、信息检索、自动化处理当前能力的局限性AI 助手在职场中的长期价值。

2048 AI社区

AI原生应用语义检索：重塑信息检索格局

关键词歧义：“苹果”可指水果或公司，“银行”可指金融机构或河岸；上下文缺失：“如何治疗感冒？”需结合用户位置（如“北京的冬季感冒”）、年龄（如“儿童感冒”）等上下文；意图误解：用户查询“最好的手机”，实际需求可能是“2024年性价比最高的5G手机”。AI原生应用语义检索的核心是**“理解用户意图”**，通过预训练模型将文本转化为高维语义向量，实现“用户需求-内容语义”的精准映射。其对信息检索格局的