1 Background

Many of you will have heard of the famous GBDT+LR combination: the combined model usually predicts better than either model alone. I had never tried it myself until a recent project at work required it, so here is a timely write-up.

2 Principle

In short: first use a tree model (GBDT, XGBoost, LightGBM) to score the samples, then convert the tree model's outputs into standard one-hot variables and feed them into LR for the final prediction.

  • A two-stage binary classifier in the spirit of stacking: GBDT extracts features from the training set to form a new training input, and LR acts as the classifier on that new input.
  • GBDT is naturally good at discovering discriminative features and feature combinations, cutting the manual cost of feature engineering, while LR is fast to train and deploy.

A concrete demo of the pipeline follows: the tree model's leaf assignments are converted into standard one-hot variables and fed into the model; a minimal end-to-end sketch is shown below.

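Here is a minimal, self-contained sketch of the idea using scikit-learn's GradientBoostingClassifier (the toy data, parameter values, and variable names are illustrative only, not this project's actual setup):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for a real binary-classification problem
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: GBDT learns the feature combinations
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbdt.fit(X_tr, y_tr)

# apply() returns the leaf index of each sample in every tree;
# shape is (n_samples, n_estimators, 1) for binary problems
leaves_tr = gbdt.apply(X_tr)[:, :, 0]
leaves_te = gbdt.apply(X_te)[:, :, 0]

# Stage 2: one-hot encode the leaf indices and feed them to LR
enc = OneHotEncoder(handle_unknown='ignore')
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves_tr), y_tr)
proba = lr.predict_proba(enc.transform(leaves_te))[:, 1]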

Now let's take a real dataset, look at how GBDT+LR performs, and compare it with the other models.

3 Data preparation

3.1 Load the data

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Load the data and map the churn label to 0/1
df = pd.read_csv('telecom_churn.csv')
df['churn'] = df['churn'].map(str)
churn_dic = {'True':1, 'False':0}
df['churn'] = df['churn'].map(churn_dic)
print(df.shape)
df.head()
(3333, 21)
state account length area code phone number international plan voice mail plan number vmail messages total day minutes total day calls total day charge ... total eve calls total eve charge total night minutes total night calls total night charge total intl minutes total intl calls total intl charge customer service calls churn
0 KS 128 415 382-4657 no yes 25 265.1 110 45.07 ... 99 16.78 244.7 91 11.01 10.0 3 2.70 1 0
1 OH 107 415 371-7191 no yes 26 161.6 123 27.47 ... 103 16.62 254.4 103 11.45 13.7 3 3.70 1 0
2 NJ 137 415 358-1921 no no 0 243.4 114 41.38 ... 110 10.30 162.6 104 7.32 12.2 5 3.29 0 0
3 OH 84 408 375-9999 yes no 0 299.4 71 50.90 ... 88 5.26 196.9 89 8.86 6.6 7 1.78 2 0
4 OK 75 415 330-6626 yes no 0 166.7 113 28.34 ... 122 12.61 186.9 121 8.41 10.1 3 2.73 3 0

5 rows × 21 columns

df['churn'].value_counts()
0    2850
1     483
Name: churn, dtype: int64
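As an aside, if pandas parses the churn column as a boolean dtype (as it typically does for this dataset), the two-step string mapping above can be collapsed into a single cast; a small sketch, assuming the boolean dtype:

# Equivalent to the map(str) + dict mapping when df['churn'] holds booleans
df['churn'] = df['churn'].astype(int)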

3.2 Train/test split

X = df[['total day calls', 'total night charge', 'number vmail messages', 'total intl charge', 'total eve calls']]
y = df['churn'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 23)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(2333, 5) (1000, 5) (2333,) (1000,)
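Since churn is imbalanced (483 positives out of 3333), a stratified split usually gives train and test sets with matching class ratios; a sketch of the alternative:

# Stratified alternative: preserve the churn ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 23, stratify = y)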

4 LR

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Baseline LR model
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(X_train, y_train)

# Compute AUC
scores = lr.predict_proba(X_test)[:,1]
LR_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LR_auc
LogisticRegression(random_state=23)

0.5834069949026194
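Logistic regression is sensitive to feature scale, so standardizing the raw inputs often helps; a minimal sketch using an sklearn Pipeline (whether it moves the AUC on this data is not verified here):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features, then fit LR; AUC is computed the same way as above
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(random_state = 23))
lr_scaled.fit(X_train, y_train)
print(metrics.roc_auc_score(y_test, lr_scaled.predict_proba(X_test)[:, 1]))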

5 LGB

import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's sklearn wrapper, which lets us use grid search and parallelism just as with sklearn's GBM

# Build the model
model_lgb = lgb.LGBMClassifier(
                                 boosting_type='gbdt',
                                 objective = 'binary',
                                 metric = 'auc',
                                 verbose = 0,
                                 learning_rate = 0.01,
                                 num_leaves = 35,
                                 feature_fraction=0.8,
                                 bagging_fraction= 0.9,
                                 bagging_freq= 8,
                                 lambda_l1= 0.6,
                                 lambda_l2= 0
                               )

# Fit the model
model_lgb.fit(X_train, y_train)

# Compute AUC
scores = model_lgb.predict_proba(X_test)[:,1]
LGB_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LGB_auc
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000353 seconds.
You can set `force_col_wise=true` to remove the overhead.

0.601792922596423
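The warnings above come from passing LightGBM's native parameter names through the sklearn wrapper; using the sklearn-style aliases (which the warnings themselves name) should build an equivalent, warning-free model. A sketch:

# Same model expressed with sklearn-style parameter aliases
model_lgb = lgb.LGBMClassifier(boosting_type = 'gbdt',
                               objective = 'binary',
                               learning_rate = 0.01,
                               num_leaves = 35,
                               colsample_bytree = 0.8,  # alias of feature_fraction
                               subsample = 0.9,         # alias of bagging_fraction
                               subsample_freq = 8,      # alias of bagging_freq
                               reg_alpha = 0.6,         # alias of lambda_l1
                               reg_lambda = 0)          # alias of lambda_l2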

6 LGB+LR

6.1 LGB实现

import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's sklearn wrapper, which lets us use grid search and parallelism just as with sklearn's GBM
# Build the model (parameter dict)
lgb_param = {'boosting_type':'gbdt',
                                 'objective' : 'binary',
                                 'metric' : 'auc',
                                 'verbose' : 0,
                                 'learning_rate' : 0.01,
                                 'num_leaves' : 4,
                                 'feature_fraction':0.8,
                                 'bagging_fraction': 0.9,
                                 'bagging_freq': 8,
                                 'lambda_l1': 0.6,
                                 'lambda_l2': 0,
            'n_estimators' : 200}

'''
num_leaves: the number of leaves per tree
n_estimators: the number of trees
- 4 leaves per tree; the default is 100 trees, but 200 are used in this scenario.
'''
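Since lgb_param mirrors the keyword arguments used next, the model could equivalently be built by unpacking the dict; the literal definition is kept below for readability:

# Equivalent one-line construction from the parameter dict
model = lgb.LGBMClassifier(**lgb_param)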

model = lgb.LGBMClassifier(
                                 boosting_type='gbdt',
                                 objective = 'binary',
                                 metric = 'auc',
                                 verbose = 0,
                                 learning_rate = 0.01,
                                 num_leaves = 4,
                                 feature_fraction=0.8,
                                 bagging_fraction= 0.9,
                                 bagging_freq= 8,
                                 lambda_l1= 0.6,
                                 lambda_l2= 0,
                                n_estimators = 200
                               )

# Fit the model
model.fit(X_train, y_train)
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000282 seconds.
You can set `force_col_wise=true` to remove the overhead.

LGBMClassifier(bagging_fraction=0.9, bagging_freq=8, feature_fraction=0.8,
               lambda_l1=0.6, lambda_l2=0, learning_rate=0.01, metric='auc',
               n_estimators=200, num_leaves=4, objective='binary', verbose=0)
model.get_params()
{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.01,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 200,
 'n_jobs': -1,
 'num_leaves': 4,
 'objective': 'binary',
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'metric': 'auc',
 'verbose': 0,
 'feature_fraction': 0.8,
 'bagging_fraction': 0.9,
 'bagging_freq': 8,
 'lambda_l1': 0.6,
 'lambda_l2': 0}

6.2 Exporting the LGB leaf vectors

6.2.1 Training set

import numpy as np

y_pred = model.predict(X_train,pred_leaf=True) 
# With pred_leaf=True, each prediction is the index of the leaf the sample falls into in every tree.
# Since num_leaves = 4, each index is one of 0, 1, 2, 3.
train_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']],dtype=np.int64)
print(train_matrix.shape) # 2333 rows, 800 columns: 2333 training samples, 200 trees x 4 leaves = 800 variables
train_matrix
(2333, 800)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
for i in range(len(y_pred)):
    # For each sample, the global column index is tree_index * num_leaves + leaf_index,
    # i.e. every tree owns a block of 4 columns; flag the leaf it chose (one-hot).
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    train_matrix[i][temp] += 1
lgb_output_vec_train = pd.DataFrame(train_matrix)
lgb_output_vec_train.columns = ['leaf_' + str(i) for i in lgb_output_vec_train.columns]
lgb_output_vec_train
leaf_0 leaf_1 leaf_2 leaf_3 leaf_4 leaf_5 leaf_6 leaf_7 leaf_8 leaf_9 ... leaf_790 leaf_791 leaf_792 leaf_793 leaf_794 leaf_795 leaf_796 leaf_797 leaf_798 leaf_799
0 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
1 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 0 1 0 0 1 0
2 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 1 0 0 0 1 0
3 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 1 0 0 0 1 0
4 1 0 0 0 1 0 0 0 1 0 ... 0 0 0 0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2328 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
2329 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
2330 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
2331 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
2332 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 1 0 1 0 0

2333 rows × 800 columns
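The per-sample loop above is easy to follow, but the same 800-column encoding can be produced in one shot with OneHotEncoder. A sketch: fixing categories guarantees every tree gets all num_leaves columns even if some leaf never fires; on scikit-learn 1.2+ the sparse=False argument is spelled sparse_output=False instead.

from sklearn.preprocessing import OneHotEncoder

# One block of num_leaves columns per tree, in tree order
n_trees, n_leaves = lgb_param['n_estimators'], lgb_param['num_leaves']
enc = OneHotEncoder(categories=[list(range(n_leaves))] * n_trees, sparse=False)
train_oh = enc.fit_transform(model.predict(X_train, pred_leaf=True))
test_oh = enc.transform(model.predict(X_test, pred_leaf=True))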

6.2.2 Test set

import numpy as np

y_pred = model.predict(X_test,pred_leaf=True) 
# Same leaf-index prediction as for the training set; each index is one of 0, 1, 2, 3.
test_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']],dtype=np.int64)
print(test_matrix.shape) # 1000 rows, 800 columns: 1000 test samples, 200 trees x 4 leaves = 800 variables
test_matrix
(1000, 800)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
for i in range(len(y_pred)):
    # Same one-hot trick: column index = tree_index * num_leaves + leaf_index.
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    test_matrix[i][temp] += 1
lgb_output_vec = pd.DataFrame(test_matrix)
lgb_output_vec.columns = ['leaf_' + str(i) for i in lgb_output_vec.columns]
lgb_output_vec
leaf_0 leaf_1 leaf_2 leaf_3 leaf_4 leaf_5 leaf_6 leaf_7 leaf_8 leaf_9 ... leaf_790 leaf_791 leaf_792 leaf_793 leaf_794 leaf_795 leaf_796 leaf_797 leaf_798 leaf_799
0 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
1 0 1 0 0 0 1 0 0 0 1 ... 0 1 0 0 1 0 0 0 0 1
2 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
3 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
4 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 1 0 0 0 1 0
996 1 0 0 0 1 0 0 0 1 0 ... 0 0 1 0 0 0 1 0 0 0
997 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
998 0 0 0 1 0 1 0 0 0 1 ... 0 1 0 0 1 0 0 0 0 1
999 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0

1000 rows × 800 columns

y_pred[0] # the leaf each of the 200 trees assigns to the first test sample
array([0, 0, 0, 3, 3, 0, 3, 0, 3, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 3, 0, 0, 3, 3, 3, 2, 0, 3, 0, 2, 2, 2, 3, 0, 2, 0, 2, 0,
       0, 0, 0, 3, 3, 3, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
       3, 3, 0, 3, 3, 0, 2, 0, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
       3, 3, 3, 3, 1, 3, 3, 3, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 3,
       0, 2, 2, 2, 3, 2, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
       3, 3, 0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 3, 2, 2, 2,
       3, 2])
len(y_pred) # the 1000 test samples
1000
len(y_pred[0]) # the 200 trees
200

6.3 LGB+LR

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit LR on the leaf one-hot vectors
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(lgb_output_vec_train, y_train)

# Compute AUC
scores = lr.predict_proba(lgb_output_vec)[:,1]
LR_LGB_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LR_LGB_auc
LogisticRegression(random_state=23)

0.58792613217832
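With 800 sparse binary inputs and only 2333 training rows, the LR stage is prone to overfitting, so tuning the regularization strength may close part of the gap; a quick cross-validated sketch:

from sklearn.linear_model import LogisticRegressionCV

# Search 10 values of C over a log grid with 5-fold CV, scored by AUC
lr_cv = LogisticRegressionCV(Cs = 10, cv = 5, scoring = 'roc_auc',
                             max_iter = 1000, random_state = 23)
lr_cv.fit(lgb_output_vec_train, y_train)
print(metrics.roc_auc_score(y_test, lr_cv.predict_proba(lgb_output_vec)[:, 1]))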

7 Model comparison

df = pd.DataFrame({'model':['LR', 'LGB', 'LGB+LR'], 'AUC':[LR_auc, LGB_auc, LR_LGB_auc]})
df
model AUC
0 LR 0.583407
1 LGB 0.601793
2 LGB+LR 0.587926

Conclusion: on this dataset, LGB+LR did not beat LGB alone, so no model is absolutely better than another; pick the best one for each data scenario. That said, LGB+LR generally performs well in CTR-prediction settings.

