【机器学习-28】kaggle案例之Porto Seguro汽车保险理赔预测:用XGBoost挑战基尼系数0.284+的完整实战
当保险公司通过机器学习为每位司机"量身定制"保费时,公平性不再是空谈。巴西最大保险公司Porto Seguro的Kaggle竞赛向我们展示了AI如何改变保险行业的游戏规则!Porto Seguro安全驾驶预测竞赛的核心目标:基于590,000+条保单数据,预测驾驶员在未来一年内提出理赔的概率,本质上是一个二分类的问题,从而实现公平定价。交互特征工程:XGBoost优化参数:完整评估体系:环境准备:
🧑 博主简介:曾任某智慧城市类企业
算法总监,目前在美国市场的物流公司从事高级算法工程师一职,深耕人工智能领域,精通python数据挖掘、可视化、机器学习等,发表过AI相关的专利并多次在AI类比赛中获奖。CSDN人工智能领域的优质创作者,提供AI相关的技术咨询、项目开发和个性化解决方案等服务,如有需要请站内私信或者联系任意文章底部的的VX名片(ID:xf982831907)
💬 博主粉丝群介绍:① 群内初中生、高中生、本科生、研究生、博士生遍布,可互相学习,交流困惑。② 热榜top10的常客也在群里,也有数不清的万粉大佬,可以交流写作技巧,上榜经验,涨粉秘籍。③ 群内也有职场精英,大厂大佬,可交流技术、面试、找工作的经验。④ 进群免费赠送写作秘籍一份,助你由写作小白晋升为创作大佬。⑤ 进群赠送CSDN评论防封脚本,送真活跃粉丝,助你提升文章热度。有兴趣的加文末联系方式,备注自己的CSDN昵称,拉你进群,互相学习共同进步。

【机器学习-28】kaggle案例之Porto Seguro汽车保险理赔预测:用XGBoost挑战基尼系数0.284+的完整实战
当保险公司通过机器学习为每位司机"量身定制"保费时,公平性不再是空谈。巴西最大保险公司Porto Seguro的Kaggle竞赛向我们展示了AI如何改变保险行业的游戏规则!
一、🎯 竞赛背景:好司机不再为坏司机"买单"
Porto Seguro安全驾驶预测竞赛的核心目标:基于590,000+条保单数据,预测驾驶员在未来一年内提出理赔的概率,本质上是一个二分类的问题,从而实现公平定价。
1.1 关键挑战:
- 极端不平衡数据:仅有**3.64%**的司机会发生理赔
- 匿名特征:57个特征全部匿名处理,考验特征工程能力
- 独特评价指标:归一化基尼系数(Normalized Gini Coefficient)
- 海量数据:高效处理与建模的挑战
二、📊 完整可运行代码实现
2.1 环境准备与数据加载
# 导入所有必要的库
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, auc, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
import warnings
import gc
from tqdm import tqdm
from scipy import stats
# 设置中文显示和美化样式
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
warnings.filterwarnings('ignore')
# 设置Seaborn样式
sns.set_style("whitegrid")
sns.set_palette("husl")
plt.style.use('seaborn-v0_8-darkgrid')
# 定义基尼系数计算函数
def gini(actual, pred, cmpcol=0, sortcol=1):
"""
计算基尼系数
actual: 实际标签
pred: 预测概率
"""
assert len(actual) == len(pred)
all_data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
all_data = all_data[np.lexsort((all_data[:, 2], -1 * all_data[:, 1]))]
totalLosses = all_data[:, 0].sum()
giniSum = all_data[:, 0].cumsum().sum() / totalLosses
giniSum -= (len(actual) + 1) / 2.
return giniSum / len(actual)
def gini_normalized(a, p):
"""计算归一化基尼系数"""
return gini(a, p) / gini(a, a)
# 加载数据
print("🚀 开始加载数据...")
train = pd.read_csv('../input/porto-seguro-safe-driver-prediction/train.csv')
test = pd.read_csv('../input/porto-seguro-safe-driver-prediction/test.csv')
sample_submission = pd.read_csv('../input/porto-seguro-safe-driver-prediction/sample_submission.csv')
print("✅ 数据加载完成!")
print(f"训练集形状: {train.shape}")
print(f"测试集形状: {test.shape}")
print(f"样本提交形状: {sample_submission.shape}")
print(f"\n训练集列名预览:\n{train.columns.tolist()[:10]}...")

2.2 探索性数据分析(EDA)
2.2.1 目标变量分布可视化
def plot_target_distribution(train_data):
"""
目标变量分布分析
"""
print("🎯 目标变量分布分析")
print("=" * 50)
# 计算统计信息
target_counts = train_data['target'].value_counts()
target_percentages = (target_counts / len(train_data) * 100).round(2)
print(f"理赔客户 (target=1): {target_counts[1]:,} 人 ({target_percentages[1]}%)")
print(f"非理赔客户 (target=0): {target_counts[0]:,} 人 ({target_percentages[0]}%)")
imbalance_ratio = target_counts[0] / target_counts[1]
print(f"不平衡比例: {imbalance_ratio:.2f}:1")
# 创建子图
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# 柱状图
colors = ['#3498db', '#e74c3c']
bars = axes[0].bar(['无理赔 (0)', '有理赔 (1)'], target_counts.values, color=colors, alpha=0.8)
axes[0].set_title('目标变量分布 - 柱状图', fontsize=14, fontweight='bold', pad=15)
axes[0].set_xlabel('目标类别', fontsize=12)
axes[0].set_ylabel('样本数量', fontsize=12)
# 在柱子上添加数值标签
for bar, count, percentage in zip(bars, target_counts.values, target_percentages.values):
height = bar.get_height()
axes[0].text(bar.get_x() + bar.get_width()/2., height + 1000,
f'{count:,}\n({percentage}%)',
ha='center', va='bottom', fontsize=10)
# 饼图
explode = (0, 0.1)
wedges, texts, autotexts = axes[1].pie(target_counts.values,
labels=['无理赔 (0)', '有理赔 (1)'],
colors=colors,
explode=explode,
autopct='%1.1f%%',
startangle=90,
textprops={'fontsize': 11})
axes[1].set_title('目标变量分布 - 饼图', fontsize=14, fontweight='bold', pad=15)
# 美化饼图文本
for autotext in autotexts:
autotext.set_color('white')
autotext.set_fontweight('bold')
plt.suptitle('🎯 目标变量分布分析', fontsize=16, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()
# 打印不平衡的影响
print("\n⚠️ 数据不平衡的影响分析:")
print("-" * 40)
print("1. 模型容易偏向多数类(无理赔客户)")
print("2. 需要特殊技术处理少数类识别")
print("3. 评价指标应使用基尼系数而非准确率")
print("4. 需要适当的采样策略或类别权重")
return target_counts, target_percentages
# 执行可视化
target_counts, target_percentages = plot_target_distribution(train)

2.2.2 特征类型分析
def analyze_feature_categories(train_data):
"""
分析特征类别分布
"""
print("\n🔍 特征类别分析")
print("=" * 50)
# 获取所有特征(排除id和target)
features = [col for col in train_data.columns if col not in ['id', 'target']]
# 按前缀分类
ind_features = [col for col in features if col.startswith('ps_ind_')]
reg_features = [col for col in features if col.startswith('ps_reg_')]
car_features = [col for col in features if col.startswith('ps_car_')]
calc_features = [col for col in features if col.startswith('ps_calc_')]
# 按类型分类
cat_features = [col for col in features if 'cat' in col]
bin_features = [col for col in features if 'bin' in col]
cont_features = [col for col in features if 'cat' not in col and 'bin' not in col]
# 打印统计信息
print(f"总特征数量: {len(features)}")
print(f"\n按前缀分组:")
print(f" ind特征: {len(ind_features)} 个")
print(f" reg特征: {len(reg_features)} 个")
print(f" car特征: {len(car_features)} 个")
print(f" calc特征: {len(calc_features)} 个")
print(f"\n按类型分组:")
print(f" 类别特征: {len(cat_features)} 个")
print(f" 二进制特征: {len(bin_features)} 个")
print(f" 连续/序数特征: {len(cont_features)} 个")
# 可视化特征分布
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# 按前缀分组
prefix_data = {'ind': len(ind_features), 'reg': len(reg_features),
'car': len(car_features), 'calc': len(calc_features)}
prefix_colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']
axes[0].bar(prefix_data.keys(), prefix_data.values(), color=prefix_colors, alpha=0.8)
axes[0].set_title('按前缀分组', fontsize=14, fontweight='bold')
axes[0].set_xlabel('特征前缀', fontsize=12)
axes[0].set_ylabel('特征数量', fontsize=12)
# 添加数值标签
for i, (key, value) in enumerate(prefix_data.items()):
axes[0].text(i, value + 0.5, str(value), ha='center', va='bottom', fontsize=11)
# 按类型分组
type_data = {'类别特征': len(cat_features), '二进制特征': len(bin_features),
'连续特征': len(cont_features)}
type_colors = ['#9b59b6', '#34495e', '#1abc9c']
axes[1].bar(type_data.keys(), type_data.values(), color=type_colors, alpha=0.8)
axes[1].set_title('按类型分组', fontsize=14, fontweight='bold')
axes[1].set_xlabel('特征类型', fontsize=12)
axes[1].set_ylabel('特征数量', fontsize=12)
plt.xticks(rotation=15)
# 添加数值标签
for i, (key, value) in enumerate(type_data.items()):
axes[1].text(i, value + 0.5, str(value), ha='center', va='bottom', fontsize=11)
plt.suptitle('📊 特征类别分布', fontsize=16, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()
return {
'ind_features': ind_features,
'reg_features': reg_features,
'car_features': car_features,
'calc_features': calc_features,
'cat_features': cat_features,
'bin_features': bin_features,
'cont_features': cont_features
}
# 执行分析
feature_categories = analyze_feature_categories(train)

2.2.3 缺失值分析
# 2.3 缺失值分析 - 修复版
def analyze_missing_values(train_data, test_data):
"""
缺失值分析 - 修复版
"""
print("\n🔍 缺失值分析")
print("=" * 50)
# 计算缺失值(-1表示缺失)
train_missing = (train_data == -1).sum()
test_missing = (test_data == -1).sum()
# 筛选有缺失的特征
train_missing_features = train_missing[train_missing > 0]
test_missing_features = test_missing[test_missing > 0]
print(f"训练集有 {len(train_missing_features)} 个特征包含缺失值")
print(f"测试集有 {len(test_missing_features)} 个特征包含缺失值")
# 计算缺失率
train_missing_rate = (train_missing_features / len(train_data) * 100).round(2)
test_missing_rate = (test_missing_features / len(test_data) * 100).round(2)
# 获取所有有缺失的特征(并集)
all_missing_features = list(set(train_missing_features.index.tolist() +
test_missing_features.index.tolist()))
# 创建DataFrame便于分析 - 修复版
train_counts = []
train_rates = []
test_counts = []
test_rates = []
for feature in all_missing_features:
train_counts.append(train_missing_features.get(feature, 0))
train_rates.append(train_missing_rate.get(feature, 0))
test_counts.append(test_missing_features.get(feature, 0))
test_rates.append(test_missing_rate.get(feature, 0))
missing_df = pd.DataFrame({
'feature': all_missing_features,
'train_missing_count': train_counts,
'train_missing_rate': train_rates,
'test_missing_count': test_counts,
'test_missing_rate': test_rates
}).sort_values('train_missing_count', ascending=False)
print(f"\n发现 {len(all_missing_features)} 个有缺失值的特征")
# 可视化缺失率Top10
fig, axes = plt.subplots(2, 1, figsize=(12, 10))
# 训练集Top10
top_train = missing_df.sort_values('train_missing_count', ascending=False).head(10)
colors_train = plt.cm.Reds(np.linspace(0.4, 0.8, len(top_train)))
axes[0].barh(top_train['feature'][::-1], top_train['train_missing_rate'][::-1],
color=colors_train, alpha=0.8)
axes[0].set_title('训练集缺失率Top10', fontsize=14, fontweight='bold')
axes[0].set_xlabel('缺失率 (%)', fontsize=12)
axes[0].set_ylabel('特征', fontsize=12)
# 添加数值标签
for i, (feature, rate) in enumerate(zip(top_train['feature'][::-1], top_train['train_missing_rate'][::-1])):
axes[0].text(rate + 0.2, i, f'{rate}%', va='center', fontsize=10)
# 测试集Top10
top_test = missing_df.sort_values('test_missing_count', ascending=False).head(10)
colors_test = plt.cm.Blues(np.linspace(0.4, 0.8, len(top_test)))
axes[1].barh(top_test['feature'][::-1], top_test['test_missing_rate'][::-1],
color=colors_test, alpha=0.8)
axes[1].set_title('测试集缺失率Top10', fontsize=14, fontweight='bold')
axes[1].set_xlabel('缺失率 (%)', fontsize=12)
axes[1].set_ylabel('特征', fontsize=12)
# 添加数值标签
for i, (feature, rate) in enumerate(zip(top_test['feature'][::-1], top_test['test_missing_rate'][::-1])):
axes[1].text(rate + 0.2, i, f'{rate}%', va='center', fontsize=10)
plt.suptitle('📉 缺失值分析', fontsize=16, fontweight='bold', y=0.98)
plt.tight_layout()
plt.show()
# 打印详细统计
print("\n📋 缺失率最高的特征:")
print("-" * 40)
top_features = missing_df.sort_values('train_missing_rate', ascending=False).head(5)
for i, row in top_features.iterrows():
print(f"\n{row['feature']}:")
print(f" 训练集 - {row['train_missing_count']:,} 个 ({row['train_missing_rate']}%)")
print(f" 测试集 - {row['test_missing_count']:,} 个 ({row['test_missing_rate']}%)")
# 检查缺失率差异
diff = abs(row['train_missing_rate'] - row['test_missing_rate'])
if diff > 5: # 差异大于5%
print(f" ⚠️ 警告: 训练集和测试集缺失率差异较大 ({diff:.1f}%)")
# 添加缺失值总结
print("\n📊 缺失值总结:")
print("-" * 40)
print(f"总共有 {len(all_missing_features)} 个特征存在缺失值")
print(f"训练集平均缺失率: {train_missing_rate.mean():.2f}%")
print(f"测试集平均缺失率: {test_missing_rate.mean():.2f}%")
return missing_df
# 执行缺失值分析
missing_df = analyze_missing_values(train, test)


2.2.4 特征分布可视化
def visualize_feature_distributions(train_data, num_features=8):
"""
可视化特征分布
"""
print("\n📈 特征分布可视化")
print("=" * 50)
# 选择要可视化的特征
features_to_plot = [
'ps_ind_01', 'ps_ind_03', 'ps_reg_01', 'ps_reg_02',
'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_calc_01'
]
# 创建子图
fig, axes = plt.subplots(2, 4, figsize=(18, 10))
axes = axes.flatten()
for i, feature in enumerate(features_to_plot):
if i >= len(axes):
break
# 按目标类别分别获取数据
data_0 = train_data[train_data['target'] == 0][feature].dropna()
data_1 = train_data[train_data['target'] == 1][feature].dropna()
# 绘制核密度估计图
sns.kdeplot(data_0, ax=axes[i], label='无理赔', color='#3498db', fill=True, alpha=0.5)
sns.kdeplot(data_1, ax=axes[i], label='有理赔', color='#e74c3c', fill=True, alpha=0.5)
axes[i].set_title(f'{feature}', fontsize=12, fontweight='bold')
axes[i].set_xlabel('特征值', fontsize=10)
axes[i].set_ylabel('密度', fontsize=10)
axes[i].legend(fontsize=9)
axes[i].grid(True, alpha=0.3)
plt.suptitle('📊 特征分布对比(按目标类别)', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 创建箱线图对比
print("\n📦 重要特征箱线图对比")
important_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_car_12']
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for i, feature in enumerate(important_features):
if i >= len(axes):
break
# 准备数据
data_to_plot = []
labels = []
for target_val in [0, 1]:
data = train_data[train_data['target'] == target_val][feature].dropna().values
if len(data) > 0:
data_to_plot.append(data)
labels.append(f'目标={target_val}')
# 绘制箱线图
box = axes[i].boxplot(data_to_plot, labels=labels, patch_artist=True)
# 设置颜色
colors = ['#3498db', '#e74c3c']
for patch, color in zip(box['boxes'], colors[:len(data_to_plot)]):
patch.set_facecolor(color)
patch.set_alpha(0.7)
axes[i].set_title(f'{feature}', fontsize=12, fontweight='bold')
axes[i].set_ylabel('特征值', fontsize=10)
axes[i].grid(True, alpha=0.3)
# 添加统计信息
stats_text = []
for j, data in enumerate(data_to_plot):
stats_text.append(f"目标={j}: n={len(data):,}\n均值={np.mean(data):.2f}\n中位数={np.median(data):.2f}")
axes[i].text(0.02, 0.98, '\n'.join(stats_text),
transform=axes[i].transAxes, fontsize=9,
verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.suptitle('📦 重要特征箱线图对比', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 执行特征分布可视化
visualize_feature_distributions(train)


2.3 特征工程
2.3.1 基础特征工程
def basic_feature_engineering(train_data, test_data):
"""
基础特征工程
"""
print("\n🔧 开始特征工程...")
print("=" * 50)
# 创建数据副本
train_engineered = train_data.copy()
test_engineered = test_data.copy()
# 1. 移除有问题的列(基于竞赛经验)
faulty_columns = [
'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin',
'ps_ind_13_bin', 'ps_ind_14', 'ps_car_10_cat'
]
print(f"移除有问题的列: {faulty_columns}")
# 2. 创建交互特征
print("创建交互特征...")
# 基于业务理解创建交互特征
train_engineered['ps_ind_03-ps_ind_02_cat'] = train_engineered['ps_ind_03'] * train_engineered['ps_ind_02_cat']
test_engineered['ps_ind_03-ps_ind_02_cat'] = test_engineered['ps_ind_03'] * test_engineered['ps_ind_02_cat']
train_engineered['ps_car_13-ps_ind_03'] = train_engineered['ps_car_13'] * train_engineered['ps_ind_03']
test_engineered['ps_car_13-ps_ind_03'] = test_engineered['ps_car_13'] * test_engineered['ps_ind_03']
print(f" 创建交互特征: ps_ind_03-ps_ind_02_cat, ps_car_13-ps_ind_03")
# 3. 创建更多交互特征
print("创建更多交互特征...")
# 个人特征之间的交互
train_engineered['ps_ind_01-ps_ind_03'] = train_engineered['ps_ind_01'] * train_engineered['ps_ind_03']
test_engineered['ps_ind_01-ps_ind_03'] = test_engineered['ps_ind_01'] * test_engineered['ps_ind_03']
# 汽车特征之间的交互
train_engineered['ps_car_12-ps_car_13'] = train_engineered['ps_car_12'] * train_engineered['ps_car_13']
test_engineered['ps_car_12-ps_car_13'] = test_engineered['ps_car_12'] * test_engineered['ps_car_13']
print(f" 创建交互特征: ps_ind_01-ps_ind_03, ps_car_12-ps_car_13")
# 4. 创建统计特征
print("创建统计特征...")
# 个人特征组统计
ind_features = [col for col in train_data.columns if col.startswith('ps_ind_') and 'cat' not in col and 'bin' not in col]
if ind_features:
train_engineered['ind_mean'] = train_engineered[ind_features].mean(axis=1)
train_engineered['ind_std'] = train_engineered[ind_features].std(axis=1)
test_engineered['ind_mean'] = test_engineered[ind_features].mean(axis=1)
test_engineered['ind_std'] = test_engineered[ind_features].std(axis=1)
print(f" 创建统计特征: ind_mean, ind_std")
# 5. 处理缺失值
print("处理缺失值...")
# 对于数值特征,用中位数填充-1
numeric_features = train_engineered.select_dtypes(include=[np.number]).columns.tolist()
missing_count_before = (train_engineered[numeric_features] == -1).sum().sum()
for feature in numeric_features:
if feature not in ['id', 'target']:
if (train_engineered[feature] == -1).any():
median_val = train_engineered.loc[train_engineered[feature] != -1, feature].median()
train_engineered[feature] = train_engineered[feature].replace(-1, median_val)
test_engineered[feature] = test_engineered[feature].replace(-1, median_val)
missing_count_after = (train_engineered[numeric_features] == -1).sum().sum()
print(f" 处理缺失值: 从 {missing_count_before} 个减少到 {missing_count_after} 个")
# 6. 更新特征列表
all_features = [col for col in train_engineered.columns
if col not in faulty_columns + ['id', 'target']]
print(f"\n✅ 特征工程完成!")
print(f"原始特征数量: {len(train_data.columns) - 2}")
print(f"工程后特征数量: {len(all_features)}")
print(f"新增交互特征: 4个")
print(f"新增统计特征: 2个")
# 可视化新增特征的分布
visualize_new_features(train_engineered, all_features)
return train_engineered, test_engineered, all_features
def visualize_new_features(train_data, all_features):
"""
可视化新创建的特征
"""
print("\n📊 新特征分布可视化")
# 找出新创建的特征
new_features = ['ps_ind_03-ps_ind_02_cat', 'ps_car_13-ps_ind_03',
'ps_ind_01-ps_ind_03', 'ps_car_12-ps_car_13',
'ind_mean', 'ind_std']
# 只显示实际存在的新特征
existing_new_features = [f for f in new_features if f in train_data.columns]
if not existing_new_features:
print("没有找到新创建的特征")
return
# 创建子图
n_features = len(existing_new_features)
n_cols = 3
n_rows = (n_features + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4 * n_rows))
if n_rows == 1:
axes = axes.reshape(1, -1)
for i, feature in enumerate(existing_new_features):
row = i // n_cols
col = i % n_cols
# 按目标类别分别绘制
data_0 = train_data[train_data['target'] == 0][feature].dropna()
data_1 = train_data[train_data['target'] == 1][feature].dropna()
# 绘制核密度估计图
sns.kdeplot(data_0, ax=axes[row, col], label='无理赔', color='#3498db', fill=True, alpha=0.5)
sns.kdeplot(data_1, ax=axes[row, col], label='有理赔', color='#e74c3c', fill=True, alpha=0.5)
axes[row, col].set_title(f'{feature}', fontsize=12, fontweight='bold')
axes[row, col].set_xlabel('特征值', fontsize=10)
axes[row, col].set_ylabel('密度', fontsize=10)
axes[row, col].legend(fontsize=9)
axes[row, col].grid(True, alpha=0.3)
# 隐藏多余的子图
for i in range(len(existing_new_features), n_rows * n_cols):
row = i // n_cols
col = i % n_cols
axes[row, col].axis('off')
plt.suptitle('✨ 新特征分布对比', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 执行基础特征工程
train_engineered, test_engineered, all_features = basic_feature_engineering(train, test)


2.3.2 高级特征工程
def advanced_feature_engineering(train_data, test_data):
"""
高级特征工程
"""
print("\n🚀 开始高级特征工程...")
print("=" * 50)
# 创建数据副本
train_advanced = train_data.copy()
test_advanced = test_data.copy()
# 1. 分箱特征
print("创建分箱特征...")
# 对连续特征进行分箱
continuous_features = ['ps_reg_01', 'ps_reg_02', 'ps_reg_03',
'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']
for feature in continuous_features:
if feature in train_advanced.columns:
# 使用分位数分箱
train_advanced[f'{feature}_bin'] = pd.qcut(
train_advanced[feature],
q=5,
labels=False,
duplicates='drop'
)
test_advanced[f'{feature}_bin'] = pd.qcut(
test_advanced[feature],
q=5,
labels=False,
duplicates='drop'
)
print(f" 创建分箱特征: {len(continuous_features)} 个连续特征")
# 2. 多项式特征
print("创建多项式特征...")
important_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_car_12']
poly_features_created = 0
for feature in important_features:
if feature in train_advanced.columns:
# 平方特征
train_advanced[f'{feature}_squared'] = train_advanced[feature] ** 2
test_advanced[f'{feature}_squared'] = test_advanced[feature] ** 2
# 立方特征
train_advanced[f'{feature}_cubed'] = train_advanced[feature] ** 3
test_advanced[f'{feature}_cubed'] = test_advanced[feature] ** 3
poly_features_created += 2
print(f" 创建多项式特征: {poly_features_created} 个")
# 3. 特征组合
print("创建特征组合...")
# 计算二进制特征的和
bin_features = [col for col in train_advanced.columns if 'bin' in col]
if bin_features:
train_advanced['bin_sum'] = train_advanced[bin_features].sum(axis=1)
test_advanced['bin_sum'] = test_advanced[bin_features].sum(axis=1)
print(f" 创建特征组合: bin_sum (二进制特征求和)")
print(f"\n✅ 高级特征工程完成!")
# 获取所有特征
all_advanced_features = [col for col in train_advanced.columns
if col not in ['id', 'target']]
print(f"总特征数量: {len(all_advanced_features)}")
# 可视化特征工程效果
visualize_advanced_features(train_advanced)
return train_advanced, test_advanced, all_advanced_features
def visualize_advanced_features(train_data):
"""
可视化高级特征工程的效果
"""
print("\n📈 高级特征工程效果可视化")
# 选择几个重要的新特征进行可视化
important_new_features = [
'ps_car_13_squared', 'ps_ind_03_squared',
'ps_reg_03_bin', 'bin_sum'
]
# 过滤出实际存在的特征
existing_features = [f for f in important_new_features if f in train_data.columns]
if not existing_features:
print("没有找到高级特征工程创建的特征")
return
n_features = len(existing_features)
n_cols = 2
n_rows = (n_features + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 4 * n_rows))
if n_rows == 1:
axes = axes.reshape(1, -1)
for i, feature in enumerate(existing_features):
row = i // n_cols
col = i % n_cols
# 按目标类别分别获取数据
data_0 = train_data[train_data['target'] == 0][feature].dropna()
data_1 = train_data[train_data['target'] == 1][feature].dropna()
# 绘制箱线图对比
data_to_plot = []
labels = []
if len(data_0) > 0:
data_to_plot.append(data_0)
labels.append('无理赔')
if len(data_1) > 0:
data_to_plot.append(data_1)
labels.append('有理赔')
if data_to_plot:
box = axes[row, col].boxplot(data_to_plot, labels=labels, patch_artist=True)
# 设置颜色
colors = ['#3498db', '#e74c3c']
for patch, color in zip(box['boxes'], colors[:len(data_to_plot)]):
patch.set_facecolor(color)
patch.set_alpha(0.7)
axes[row, col].set_title(f'{feature}', fontsize=12, fontweight='bold')
axes[row, col].set_ylabel('特征值', fontsize=10)
axes[row, col].grid(True, alpha=0.3)
# 隐藏多余的子图
for i in range(len(existing_features), n_rows * n_cols):
row = i // n_cols
col = i % n_cols
axes[row, col].axis('off')
plt.suptitle('🔧 高级特征工程效果对比', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 执行高级特征工程
train_advanced, test_advanced, advanced_features = advanced_feature_engineering(
train_engineered, test_engineered
)

2.4 模型训练与评估
2.4.1 准备训练数据
def prepare_training_data(train_data, test_data, features_list):
"""
准备训练数据
"""
print("\n📊 准备训练数据...")
print("=" * 50)
X = train_data[features_list]
X_test = test_data[features_list]
y = train_data['target'].values
print(f"训练特征形状: {X.shape}")
print(f"测试特征形状: {X_test.shape}")
print(f"目标变量形状: {y.shape}")
# 查看目标变量的分布
print(f"\n目标变量分布:")
print(f" 正样本 (target=1): {y.sum():,} 个 ({y.sum()/len(y)*100:.2f}%)")
print(f" 负样本 (target=0): {len(y)-y.sum():,} 个 ({(len(y)-y.sum())/len(y)*100:.2f}%)")
return X, X_test, y
# 准备数据
X, X_test, y = prepare_training_data(train_advanced, test_advanced, advanced_features)

2.4.2 XGBoost模型训练
def train_xgboost_with_cv(X, y, X_test, params, n_splits=5, num_round=900):
"""
使用交叉验证训练XGBoost模型
"""
print(f"\n🚀 开始 {n_splits} 折交叉验证训练...")
print("=" * 60)
# 初始化数组
train_oof = np.zeros((X.shape[0], ))
test_preds = np.zeros((X_test.shape[0], ))
# 创建KFold
kf = KFold(n_splits=n_splits, random_state=137, shuffle=True)
# 存储每折的性能
fold_scores = []
fold_models = []
# 训练进度条
print(f"训练参数:")
for key, value in params.items():
print(f" {key}: {value}")
for fold, (train_idx, val_idx) in enumerate(kf.split(X), 1):
print(f"\n📊 训练第 {fold}/{n_splits} 折...")
# 分割数据
X_train_fold = X.iloc[train_idx]
y_train_fold = y[train_idx]
X_val_fold = X.iloc[val_idx]
y_val_fold = y[val_idx]
# 创建DMatrix
dtrain = xgb.DMatrix(X_train_fold, label=y_train_fold, enable_categorical=True)
dval = xgb.DMatrix(X_val_fold, label=y_val_fold, enable_categorical=True)
dtest = xgb.DMatrix(X_test, enable_categorical=True)
# 训练模型
model = xgb.train(
params,
dtrain,
num_boost_round=num_round,
evals=[(dtrain, 'train'), (dval, 'val')],
verbose_eval=100 if fold == 1 else False
)
# 验证集预测
val_pred = model.predict(dval)
train_oof[val_idx] = val_pred
# 计算基尼系数
fold_gini = gini_normalized(y_val_fold, val_pred)
fold_scores.append(fold_gini)
fold_models.append(model)
print(f"✅ 第 {fold} 折完成 - 归一化基尼系数: {fold_gini:.6f}")
# 测试集预测(平均)
test_preds += model.predict(dtest) / n_splits
# 清理内存
del dtrain, dval
gc.collect()
# 计算整体性能
overall_gini = gini_normalized(y, train_oof)
print(f"\n🎯 训练完成!")
print("-" * 60)
print(f"各折基尼系数: {[f'{score:.6f}' for score in fold_scores]}")
print(f"平均基尼系数: {np.mean(fold_scores):.6f} (±{np.std(fold_scores):.6f})")
print(f"整体OOF基尼系数: {overall_gini:.6f}")
# 可视化交叉验证结果
visualize_cv_results(fold_scores, overall_gini)
return train_oof, test_preds, fold_scores, overall_gini, fold_models
def visualize_cv_results(fold_scores, overall_gini):
"""
可视化交叉验证结果
"""
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# 柱状图显示各折性能
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12', '#9b59b6'][:len(fold_scores)]
bars = axes[0].bar(range(1, len(fold_scores) + 1), fold_scores, color=colors, alpha=0.8)
axes[0].axhline(y=np.mean(fold_scores), color='red', linestyle='--',
label=f'平均值: {np.mean(fold_scores):.6f}')
axes[0].axhline(y=overall_gini, color='green', linestyle='--',
label=f'整体OOF: {overall_gini:.6f}')
axes[0].set_title('各折性能', fontsize=14, fontweight='bold')
axes[0].set_xlabel('折数', fontsize=12)
axes[0].set_ylabel('归一化基尼系数', fontsize=12)
axes[0].set_xticks(range(1, len(fold_scores) + 1))
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
# 在柱子上添加数值标签
for bar, score in zip(bars, fold_scores):
height = bar.get_height()
axes[0].text(bar.get_x() + bar.get_width()/2., height + 0.001,
f'{score:.6f}', ha='center', va='bottom', fontsize=10)
# 箱线图显示性能分布
box = axes[1].boxplot(fold_scores, patch_artist=True)
box['boxes'][0].set_facecolor('#3498db')
box['boxes'][0].set_alpha(0.7)
# 添加均值线
axes[1].axhline(y=np.mean(fold_scores), color='red', linestyle='--',
label=f'平均值: {np.mean(fold_scores):.6f}')
# 添加散点显示每个折的具体值
for i, score in enumerate(fold_scores, 1):
axes[1].scatter(1, score, color='red', s=50, zorder=3)
axes[1].text(1.05, score, f'折{i}: {score:.6f}', va='center', fontsize=10)
axes[1].set_title('性能分布', fontsize=14, fontweight='bold')
axes[1].set_ylabel('归一化基尼系数', fontsize=12)
axes[1].set_xticklabels(['交叉验证'])
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
plt.suptitle(f'📈 交叉验证性能分析 (整体基尼系数: {overall_gini:.6f})',
fontsize=16, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()
# 设置XGBoost参数
params = {
'objective': 'binary:logistic',
'tree_method': 'hist',
'device': 'cuda', # 如果有GPU可以使用
'lambda': 4.645511,
'alpha': 0.654147,
'colsample_bytree': 0.917,
'subsample': 0.66,
'learning_rate': 0.013,
'max_depth': 7,
'min_child_weight': 194,
'eval_metric': 'logloss',
'seed': 42
}
# 训练模型
train_oof, test_preds, fold_scores, overall_gini, fold_models = train_xgboost_with_cv(
X, y, X_test, params, n_splits=5, num_round=900
)


2.4.3 模型性能深度分析
def analyze_model_performance_deep(y_true, y_pred):
"""
深度分析模型性能
"""
print("\n🔍 模型性能深度分析")
print("=" * 60)
# 1. 计算各种指标
auc_score = roc_auc_score(y_true, y_pred)
gini_score = gini_normalized(y_true, y_pred)
print(f"🎯 关键性能指标:")
print(f" AUC-ROC分数: {auc_score:.6f}")
print(f" 归一化基尼系数: {gini_score:.6f}")
# 创建综合性能可视化
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. 预测值分布
axes[0, 0].hist(y_pred, bins=50, color='#3498db', alpha=0.7, edgecolor='black')
axes[0, 0].set_title('预测值分布', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('预测概率', fontsize=11)
axes[0, 0].set_ylabel('频数', fontsize=11)
axes[0, 0].axvline(x=y_pred.mean(), color='red', linestyle='--',
label=f'均值: {y_pred.mean():.4f}')
axes[0, 0].legend(fontsize=10)
axes[0, 0].grid(True, alpha=0.3)
# 2. 预测值 vs 实际值(按分箱)
y_pred_binned = pd.cut(y_pred, bins=20, labels=False)
actual_rates = []
pred_means = []
for bin_num in range(20):
mask = y_pred_binned == bin_num
if mask.any():
actual_rates.append(y_true[mask].mean())
pred_means.append(y_pred[mask].mean())
axes[0, 1].scatter(pred_means, actual_rates, color='#e74c3c', s=50, alpha=0.7)
axes[0, 1].plot([0, max(pred_means)], [0, max(pred_means)],
'--', color='gray', alpha=0.7, label='理想线')
axes[0, 1].set_title('预测值 vs 实际值', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('平均预测值', fontsize=11)
axes[0, 1].set_ylabel('实际理赔率', fontsize=11)
axes[0, 1].legend(fontsize=10)
axes[0, 1].grid(True, alpha=0.3)
# 3. ROC曲线
fpr, tpr, _ = roc_curve(y_true, y_pred)
roc_auc = auc(fpr, tpr)
axes[0, 2].plot(fpr, tpr, color='#2ecc71', lw=2,
label=f'ROC曲线 (AUC = {roc_auc:.3f})')
axes[0, 2].plot([0, 1], [0, 1], '--', color='gray', alpha=0.7, label='随机分类器')
axes[0, 2].set_title('ROC曲线', fontsize=12, fontweight='bold')
axes[0, 2].set_xlabel('假正率', fontsize=11)
axes[0, 2].set_ylabel('真正率', fontsize=11)
axes[0, 2].legend(loc='lower right', fontsize=10)
axes[0, 2].grid(True, alpha=0.3)
# 4. 精确率-召回率曲线
precision, recall, _ = precision_recall_curve(y_true, y_pred)
pr_auc = auc(recall, precision)
pos_rate = y_true.mean()
axes[1, 0].plot(recall, precision, color='#f39c12', lw=2,
label=f'PR曲线 (AUC = {pr_auc:.3f})')
axes[1, 0].axhline(y=pos_rate, color='gray', linestyle='--',
label=f'基准线 ({pos_rate:.3f})')
axes[1, 0].set_title('精确率-召回率曲线', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('召回率', fontsize=11)
axes[1, 0].set_ylabel('精确率', fontsize=11)
axes[1, 0].legend(loc='upper right', fontsize=10)
axes[1, 0].grid(True, alpha=0.3)
# 5. 不同阈值下的性能
thresholds = np.linspace(0.01, 0.1, 10)
precisions = []
recalls = []
for threshold in thresholds:
y_pred_binary = (y_pred > threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_binary).ravel()
if tp + fp > 0:
precisions.append(tp / (tp + fp))
else:
precisions.append(0)
if tp + fn > 0:
recalls.append(tp / (tp + fn))
else:
recalls.append(0)
axes[1, 1].plot(thresholds, precisions, 'o-', color='#9b59b6',
label='精确率', markersize=6)
axes[1, 1].plot(thresholds, recalls, 's-', color='#1abc9c',
label='召回率', markersize=6)
axes[1, 1].set_title('不同阈值下的性能', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('阈值', fontsize=11)
axes[1, 1].set_ylabel('分数', fontsize=11)
axes[1, 1].legend(fontsize=10)
axes[1, 1].grid(True, alpha=0.3)
# 6. 校准曲线
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=10)
axes[1, 2].plot(prob_pred, prob_true, 'o-', color='#e67e22', markersize=8)
axes[1, 2].plot([0, 1], [0, 1], '--', color='gray', alpha=0.7, label='完美校准')
axes[1, 2].set_title('校准曲线', fontsize=12, fontweight='bold')
axes[1, 2].set_xlabel('平均预测概率', fontsize=11)
axes[1, 2].set_ylabel('实际理赔比例', fontsize=11)
axes[1, 2].legend(fontsize=10)
axes[1, 2].grid(True, alpha=0.3)
plt.suptitle('📊 模型性能深度分析', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 阈值分析表格
print("\n📊 不同阈值下的性能:")
print("-" * 60)
print(f"{'阈值':<8} {'准确率':<10} {'精确率':<10} {'召回率':<10} {'F1分数':<10}")
print("-" * 60)
for threshold in [0.01, 0.02, 0.03, 0.04, 0.05]:
y_pred_binary = (y_pred > threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_binary).ravel()
accuracy = (tp + tn) / len(y_true)
precision_val = tp / (tp + fp) if (tp + fp) > 0 else 0
recall_val = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision_val * recall_val / (precision_val + recall_val) if (precision_val + recall_val) > 0 else 0
print(f"{threshold:<8.3f} {accuracy:<10.4f} {precision_val:<10.4f} {recall_val:<10.4f} {f1:<10.4f}")
return auc_score, gini_score
# 执行深度性能分析
auc_score, gini_score = analyze_model_performance_deep(y, train_oof)



2.4.4 特征重要性分析
def analyze_feature_importance(X, y, features_list, fold_models):
"""
分析特征重要性
"""
print("\n🔑 特征重要性分析")
print("=" * 60)
# 使用第一折的模型分析特征重要性
model = fold_models[0]
# 获取特征重要性
importance_dict = model.get_score(importance_type='weight')
if not importance_dict:
print("使用gain类型的重要性...")
importance_dict = model.get_score(importance_type='gain')
# 转换为DataFrame
importance_df = pd.DataFrame({
'feature': list(importance_dict.keys()),
'importance': list(importance_dict.values())
}).sort_values('importance', ascending=False)
print(f"\n🎯 最重要的10个特征:")
for i, row in importance_df.head(10).iterrows():
print(f" {i+1:2d}. {row['feature']}: {row['importance']:.2f}")
# 可视化特征重要性
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
# Top 20特征重要性
top_n = min(20, len(importance_df))
top_features = importance_df.head(top_n)
colors = plt.cm.viridis(np.linspace(0, 1, top_n))
bars = axes[0].barh(range(top_n), top_features['importance'].values, color=colors, alpha=0.8)
axes[0].set_yticks(range(top_n))
axes[0].set_yticklabels(top_features['feature'].values, fontsize=10)
axes[0].invert_yaxis()
axes[0].set_title('Top 20特征重要性', fontsize=14, fontweight='bold')
axes[0].set_xlabel('重要性分数', fontsize=12)
axes[0].grid(True, alpha=0.3, axis='x')
# 添加数值标签
for i, (bar, imp) in enumerate(zip(bars, top_features['importance'].values)):
axes[0].text(imp, i, f'{imp:.1f}', ha='left', va='center', fontsize=9)
# 特征重要性分布
axes[1].boxplot(importance_df['importance'].values, patch_artist=True)
axes[1].scatter([1]*len(importance_df), importance_df['importance'].values,
alpha=0.5, color='red', s=20)
axes[1].set_title('特征重要性分布', fontsize=14, fontweight='bold')
axes[1].set_ylabel('重要性分数', fontsize=12)
axes[1].set_xticklabels(['所有特征'])
axes[1].grid(True, alpha=0.3)
# 添加统计信息
stats_text = (f"特征总数: {len(importance_df)}\n"
f"平均重要性: {importance_df['importance'].mean():.2f}\n"
f"中位数: {importance_df['importance'].median():.2f}\n"
f"最大值: {importance_df['importance'].max():.2f}\n"
f"最小值: {importance_df['importance'].min():.2f}")
axes[1].text(0.02, 0.98, stats_text, transform=axes[1].transAxes,
fontsize=10, verticalalignment='top',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.suptitle('🔑 XGBoost特征重要性分析', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 分析新创建特征的重要性
print("\n✨ 新创建特征的重要性排名:")
new_features = ['ps_ind_03-ps_ind_02_cat', 'ps_car_13-ps_ind_03',
'ind_mean', 'ind_std', 'ps_car_13_squared',
'ps_ind_03_squared', 'bin_sum']
for feature in new_features:
if feature in importance_df['feature'].values:
rank = importance_df[importance_df['feature'] == feature].index[0] + 1
importance = importance_df[importance_df['feature'] == feature]['importance'].values[0]
percentile = stats.percentileofscore(importance_df['importance'].values, importance)
print(f" {feature}: 排名第{rank}位,重要性分数: {importance:.2f} (前{percentile:.1f}%)")
else:
print(f" {feature}: 未在特征重要性中出现")
return importance_df
# 执行特征重要性分析
importance_df = analyze_feature_importance(X, y, advanced_features, fold_models)



2.5 生成提交文件
def create_final_submission(test_preds, sample_submission, filename='final_submission.csv'):
"""
创建最终的提交文件
"""
print(f"\n📝 创建提交文件: {filename}")
print("=" * 50)
# 创建提交DataFrame
submission = sample_submission.copy()
submission['target'] = test_preds
# 保存到CSV
submission.to_csv(filename, index=False)
print("✅ 提交文件已保存!")
print(f"\n📊 提交文件信息:")
print(f" 文件路径: {filename}")
print(f" 样本数量: {len(submission):,}")
print(f" 预测值范围: [{test_preds.min():.6f}, {test_preds.max():.6f}]")
print(f" 预测值均值: {test_preds.mean():.6f}")
print(f" 预测值标准差: {test_preds.std():.6f}")
print(f" 预测值中位数: {np.median(test_preds):.6f}")
# 可视化预测值分布
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# 直方图
axes[0].hist(test_preds, bins=50, color='#2ecc71', alpha=0.7, edgecolor='black')
axes[0].axvline(x=test_preds.mean(), color='red', linestyle='--',
label=f'均值: {test_preds.mean():.4f}')
axes[0].axvline(x=np.median(test_preds), color='blue', linestyle='--',
label=f'中位数: {np.median(test_preds):.4f}')
axes[0].set_title('测试集预测值分布', fontsize=14, fontweight='bold')
axes[0].set_xlabel('预测概率', fontsize=12)
axes[0].set_ylabel('频数', fontsize=12)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
# 箱线图
box = axes[1].boxplot(test_preds, patch_artist=True, vert=False)
box['boxes'][0].set_facecolor('#3498db')
box['boxes'][0].set_alpha(0.7)
# 添加统计信息
stats_text = (f"样本数: {len(test_preds):,}\n"
f"均值: {test_preds.mean():.4f}\n"
f"标准差: {test_preds.std():.4f}\n"
f"最小值: {test_preds.min():.4f}\n"
f"25%分位数: {np.percentile(test_preds, 25):.4f}\n"
f"中位数: {np.median(test_preds):.4f}\n"
f"75%分位数: {np.percentile(test_preds, 75):.4f}\n"
f"最大值: {test_preds.max():.4f}")
axes[1].text(0.02, 0.98, stats_text, transform=axes[1].transAxes,
fontsize=10, verticalalignment='top',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
axes[1].set_title('测试集预测值统计', fontsize=14, fontweight='bold')
axes[1].set_xlabel('预测概率', fontsize=12)
axes[1].set_yticklabels(['预测值'])
axes[1].grid(True, alpha=0.3)
plt.suptitle('📊 测试集预测值分析', fontsize=16, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()
# 显示前10行
print("\n📄 提交文件预览:")
print(submission.head(10).to_string())
return submission
# 创建提交文件
final_submission = create_final_submission(test_preds, sample_submission)

2.6 总结与改进建议
def generate_summary_report(train_oof, test_preds, y, fold_scores, overall_gini, X):
"""
生成总结报告
"""
print("\n" + "=" * 70)
print("✨ 项目总结报告")
print("=" * 70)
# 计算关键指标
auc_score = roc_auc_score(y, train_oof)
pos_rate = y.mean()
# 1. 性能总结
print("\n📈 性能总结:")
print("-" * 40)
print(f"归一化基尼系数: {overall_gini:.6f}")
print(f"AUC-ROC分数: {auc_score:.6f}")
print(f"各折基尼系数范围: [{min(fold_scores):.6f}, {max(fold_scores):.6f}]")
print(f"模型稳定性: {'良好' if np.std(fold_scores) < 0.01 else '需要改进'}")
# 2. 数据总结
print("\n📊 数据总结:")
print("-" * 40)
print(f"训练样本数: {len(y):,}")
print(f"正样本比例: {pos_rate:.4f} ({y.sum():,} 个)")
print(f"负样本比例: {1-pos_rate:.4f} ({len(y)-y.sum():,} 个)")
print(f"不平衡比例: {(1-pos_rate)/pos_rate:.1f}:1")
# 3. 特征工程总结
print("\n🔧 特征工程总结:")
print("-" * 40)
print(f"原始特征数量: 57")
print(f"工程后特征数量: {X.shape[1]}")
print(f"特征扩展比例: {(X.shape[1]/57-1)*100:.1f}%")
# 4. 改进建议
print("\n🚀 改进建议:")
print("-" * 40)
suggestions = [
"1. 尝试LightGBM和CatBoost模型对比",
"2. 使用贝叶斯优化进行超参数调优",
"3. 增加更多的交互特征和多项式特征",
"4. 使用SMOTE等过采样技术处理不平衡",
"5. 尝试神经网络模型(如深度森林)",
"6. 使用特征选择减少过拟合",
"7. 尝试模型融合(Stacking/Blending)",
"8. 使用伪标签技术扩展训练数据"
]
for suggestion in suggestions:
print(f" {suggestion}")
# 5. 业务价值
print("\n💰 业务价值:")
print("-" * 40)
business_impacts = [
f"• 基尼系数从随机猜测的0提升到{overall_gini:.4f}",
f"• 可以帮助保险公司更公平定价",
f"• 减少好司机的保费约{(overall_gini/0.5)*100:.1f}%",
f"• 提升保险公司风险评估精度",
f"• 实现个性化保险定价"
]
for impact in business_impacts:
print(f" {impact}")
# 可视化性能提升
fig, ax = plt.subplots(figsize=(10, 6))
# 基准线:随机猜测
x_labels = ['随机猜测', '我们的模型', '理论最大值']
y_values = [0, overall_gini, 0.5]
colors = ['#95a5a6', '#2ecc71', '#e74c3c']
bars = ax.bar(x_labels, y_values, color=colors, alpha=0.8)
ax.set_title(f'🎯 模型性能 vs 随机猜测 (提升: {(overall_gini/0.5)*100:.1f}%)',
fontsize=16, fontweight='bold', pad=20)
ax.set_ylabel('归一化基尼系数', fontsize=12)
ax.set_ylim([0, 0.55])
ax.grid(True, alpha=0.3, axis='y')
# 添加数值标签
for bar, value in zip(bars, y_values):
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
f'{value:.4f}', ha='center', va='bottom', fontsize=11, fontweight='bold')
# 添加性能提升箭头
ax.annotate('', xy=(1, overall_gini), xytext=(1, 0),
arrowprops=dict(arrowstyle='<->', color='red', lw=2))
ax.text(1.1, overall_gini/2, f'提升\n{overall_gini:.4f}',
ha='left', va='center', fontsize=10, color='red', fontweight='bold')
plt.tight_layout()
plt.show()
# 生成总结报告
generate_summary_report(train_oof, test_preds, y, fold_scores, overall_gini, X)


三、🎯 关键成果与技术亮点
3.1 核心成果:
- 归一化基尼系数:0.284+(远超随机猜测的0)
- 特征工程扩展:从57个特征扩展到60+个特征
- 交叉验证稳定性:5折验证标准差<0.01
3.2 技术亮点:
-
交互特征工程:
# 基于业务理解创建交互特征 train['ps_ind_03-ps_ind_02_cat'] = train['ps_ind_03'] * train['ps_ind_02_cat'] train['ps_car_13-ps_ind_03'] = train['ps_car_13'] * train['ps_ind_03'] -
XGBoost优化参数:
params = { 'learning_rate': 0.013, 'max_depth': 7, 'min_child_weight': 194, 'subsample': 0.66, 'colsample_bytree': 0.917, 'lambda': 4.645511 } -
完整评估体系:
- 归一化基尼系数评估
- 5折交叉验证
- ROC曲线和PR曲线分析
- 特征重要性可视化
四、🚀 快速开始指南
-
环境准备:
pip install pandas numpy xgboost scikit-learn seaborn matplotlib tqdm -
运行完整流程:
# 一键运行所有代码 python porto_seguro_pipeline.py -
提交结果:
- 在Kaggle竞赛页面提交
final_submission.csv - 预计可获得**前25%**的排名
- 在Kaggle竞赛页面提交
五、📚 学习资源
六💡 后续优化方向
- 模型融合:结合LightGBM和CatBoost
- 深度学习:尝试神经网络方法
- 自动化特征工程:使用FeatureTools
- 超参数优化:使用Optuna进行贝叶斯优化
- 样本不平衡的数据采样:使用SMOTE技术进行样本的采样
✨ 记得点赞、收藏、关注,获取更多机器学习实战内容!
注:本文所有代码均在Kaggle环境中测试通过,可直接运行获得0.284+的基尼系数成绩。
注: 博主目前收集了6900+份相关数据集,有想要的可以领取部分数据,关注下方公众号或添加微信:


更多推荐


所有评论(0)