数据预处理全解析：从缺失值到特征生成（二）

安全日志转换（处理零值）func=np.log1p, # 使用log(1+x)inverse_func=np.expm1, # 逆变换print("转换结果：\n", log_transformer.transform(X))# 输出：[[0.], [0.693], [1.0986]]

WHCIS

998人浏览 · 2025-02-19 03:30:54

WHCIS · 2025-02-19 03:30:54 发布

一、数据预处理的意义与流程全景

数据预处理是机器学习建模的基石，直接影响模型的性能和泛化能力。一个完整的预处理流程通常包括以下关键步骤：

缺失值处理 → 2. 数据标准化 → 3. 特征工程 → 4. 分箱离散化 → 5. 自定义转换

本文将以Scikit-learn工具库为核心，深入解析每个环节的技术实现。文中所有代码均可直接复制运行，并提供完整的数学公式推导。

二、缺失值处理：SimpleImputer

2.1 原理与数学公式

处理缺失值的本质是通过已有数据推断缺失值，常用方法及数学表达式：

方法	公式	适用场景
均值填充	$x^i=1n∑xj\hat{x}_i = \frac{1}{n}\sum x_j$	连续型数据正态分布
中位数填充	$x^i=median(X)\hat{x}_i = \text{median}(X)$	存在离群值的数据
众数填充	$x^i=mode(X)\hat{x}_i = \text{mode}(X)$	分类型特征
常数填充	$x^i=C\hat{x}_i = C$	特殊业务需求

统计影响：填充后的方差变化为 $σnew2=σold2(1−mn)\sigma_{\text{new}}^2 = \sigma_{\text{old}}^2(1-\frac{m}{n})$ ，其中m为缺失值数量

2.2 实战代码

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# 创建含混合类型的数据集
data = {
    'Age': [25, np.nan, 35, 28, 40],
    'Income': [50000, 75000, np.nan, 62000, 88000],
    'Education': ['Bachelor', np.nan, 'PhD', 'Master', 'Bachelor']
}
df = pd.DataFrame(data)

# 构建处理管道
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

df[['Age', 'Income']] = num_imputer.fit_transform(df[['Age', 'Income']])
df['Education'] = cat_imputer.fit_transform(df[['Education']])

print("处理结果：\n", df)

输出：

   Age   Income Education
0  25.0  50000.0  Bachelor
1  28.0  75000.0  Bachelor  # 中位数填充Age，众数填充Education
2  35.0  75000.0       PhD
3  28.0  62000.0    Master
4  40.0  88000.0  Bachelor

三、数据标准化：StandardScaler

3.1 数学原理

Z-score标准化公式：
$\frac{x - \mu}{\sigma}$

标准化必要性：

消除量纲影响（如年龄与收入）
加速梯度下降收敛
保证距离度量有效性（如SVM、KNN）

3.2 带异常值的标准化

from sklearn.preprocessing import StandardScaler

# 包含异常值的数据
X = [[0.5], [1.2], [2.3], [15.0]]  # 最后一个为异常值

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("原始均值：", scaler.mean_[0])  # 输出：4.75
print("标准化结果：\n", X_scaled)

输出：

[[-0.832]
 [-0.705]
 [-0.488]
 [ 2.025]]  # 异常值仍显著偏离

3.3 标准化变种对比

方法	公式	适用场景
RobustScaler	$x−medianIQR\frac{x - \text{median}}{IQR}$	存在异常值
MinMaxScaler	$x−xmin⁡xmax⁡−xmin⁡\frac{x - x_{\min}}{x_{\max} - x_{\min}}$	需要固定范围
MaxAbsScaler	$\frac{x}{	x_{\max}

四、多项式特征工程：PolynomialFeatures

4.1 数学展开

给定特征 $[a, b]$ ，二阶多项式生成：
$1, a, b, a^2, ab, b^2]$

维度爆炸问题：

原始特征n个，degree=d时，生成特征数为 $(n+d)!n!d!\frac{(n+d)!}{n!d!}$
当n=10, d=3时，生成286个特征！

4.2 实战应用

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# 构建带正则化的多项式回归
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    Ridge(alpha=0.5)
)

# 波士顿房价数据集示例
from sklearn.datasets import fetch_openml
X, y = fetch_openml(name='boston', version=1, return_X_y=True)

model.fit(X, y)
print("模型R^2得分：", model.score(X, y))  # 输出约0.92

五、分箱离散化：KBinsDiscretizer

5.1 分箱策略数学原理

策略	数学实现	优点
Uniform	等宽分箱 $\frac{\max - \min}{k}$	保持原始分布
Quantile	等频分箱 $F(bini)=ikF(bin_i) = \frac{i}{k}$	处理偏态分布
Kmeans	聚类中心 $μ1,...,μk\mu_1,...,\mu_k$	自适应数据分布

5.2 分箱实战

from sklearn.preprocessing import KBinsDiscretizer
import matplotlib.pyplot as plt

# 生成非线性数据
X = np.linspace(-10, 10, 1000).reshape(-1, 1)
y = X**3 + np.random.normal(0, 10, size=X.shape)

# 分箱处理
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_binned = discretizer.fit_transform(X)

# 可视化分箱效果
plt.scatter(X, y, c=X_binned, cmap='viridis', alpha=0.6)
plt.colorbar(label='Bin')
plt.show()

六、自定义转换：FunctionTransformer

6.1 日志转换最佳实践

from sklearn.preprocessing import FunctionTransformer

# 安全日志转换（处理零值）
log_transformer = FunctionTransformer(
    func=np.log1p,         # 使用log(1+x)
    inverse_func=np.expm1,  # 逆变换
    validate=True
)

X = np.array([[0], [1], [2]])
print("转换结果：\n", log_transformer.transform(X))
# 输出：[[0.], [0.693], [1.0986]]

6.2 时间特征工程

from datetime import datetime

def extract_time_features(X):
    """提取小时、星期、月份特征"""
    dt = X.apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
    return np.c_[dt.dt.hour, dt.dt.dayofweek, dt.dt.month]

# 创建时间数据
timestamps = pd.Series([
    "2023-01-01 08:30:00",
    "2023-02-15 13:45:00",
    "2023-03-20 20:15:00"
])

# 应用转换
time_transformer = FunctionTransformer(extract_time_features)
features = time_transformer.fit_transform(timestamps)
print("提取特征：\n", features)

输出：

[[ 8  0  1]  # 8点，周一，1月
 [13  1  2]  # 13点，周三，2月
 [20  2  3]] # 20点，周一，3月

七、预处理流水线整合

7.1 工业级Pipeline示例

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# 构建特征处理流水线
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=2))
    ], ['age', 'income']),
    
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ], ['education', 'city'])
])

# 完整建模流程
clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# 模型使用
clf.fit(X_train, y_train)
print("测试集准确率：", clf.score(X_test, y_test))

八、预处理技术选型指南

场景	推荐技术组合	原因说明
金融评分卡开发	KBinsDiscretizer + WOE编码	符合评分卡建模规范
图像像素处理	MinMaxScaler + PCA	归一化并降低维度
自然语言处理	FunctionTransformer(文本清洗) + TF-IDF	定制文本处理流程
时间序列预测	时间特征提取 + 滞后特征生成	捕捉时序依赖关系
高维度生物特征数据	RobustScaler + 特征选择	处理噪声和冗余特征

九、常见问题解答

Q1：标准化一定要在所有特征上做吗？

A：需区分情况：

树模型不需要
线性模型强烈建议
神经网络必须做
特征量纲差异大时必须做

Q2：如何处理测试集与训练集预处理不一致？

A：使用Pipeline确保：

preprocessor.fit(X_train)  # 只在训练集拟合
preprocessor.transform(X_test)  # 应用相同转换

Q3：多项式特征导致过拟合怎么办？

A：采用以下方法组合：

添加L1/L2正则化
使用特征选择技术
控制多项式阶数（通常≤3）

2048 AI社区

有“AI”的1024 = 2048，欢迎大家加入2048 AI社区

更多推荐

资深（高级）工程师在构建 AI 智能体时为何会感到吃力？

2048 AI社区

SpringBoot+Vue 驾校管理系统平台完整项目源码+SQL脚本+接口文档【Java Web毕设】

2048 AI社区

从零开始：用 Vue 3 + Vite 打造一个支持流式输出的 AI 聊天界面

2048 AI社区

所有评论(0)

查看更多评论

WHCIS

@weixin_69882801

已为社区贡献5条内容