1 Background

Many of you will have heard of the famous GBDT+LR combination: the combined model usually predicts better than either model alone. I had never tried it myself until a recent project at work required it, so here is a timely write-up.

2 Principle

In short: first use a tree model (GBDT, XGBoost, LightGBM) to score the samples, then convert the tree model's outputs into standard one-hot variables and feed them into LR for the final prediction.

  • A two-stage binary classifier in the spirit of stacking: GBDT extracts features from the training set to form a new training input, and LR acts as the classifier on that new input.
  • GBDT is naturally good at discovering discriminative features and feature combinations, cutting the manual cost of feature engineering, while LR is fast to train and deploy.

A concrete demo of the pipeline follows: the tree model's leaf assignments are converted into standard one-hot variables and fed into the model; a minimal end-to-end sketch is shown below.

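Here is a minimal, self-contained sketch of the idea using scikit-learn's GradientBoostingClassifier (the toy data, parameter values, and variable names are illustrative only, not this project's actual setup):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for a real binary-classification problem
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: GBDT learns the feature combinations
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbdt.fit(X_tr, y_tr)

# apply() returns the leaf index of each sample in every tree;
# shape is (n_samples, n_estimators, 1) for binary problems
leaves_tr = gbdt.apply(X_tr)[:, :, 0]
leaves_te = gbdt.apply(X_te)[:, :, 0]

# Stage 2: one-hot encode the leaf indices and feed them to LR
enc = OneHotEncoder(handle_unknown='ignore')
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves_tr), y_tr)
proba = lr.predict_proba(enc.transform(leaves_te))[:, 1]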

Now let's take a real dataset, look at how GBDT+LR performs, and compare it with the other models.

3 Data preparation

3.1 Load the data

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Load the data and map the churn label to 0/1
df = pd.read_csv('telecom_churn.csv')
df['churn'] = df['churn'].map(str)
churn_dic = {'True':1, 'False':0}
df['churn'] = df['churn'].map(churn_dic)
print(df.shape)
df.head()
(3333, 21)
state account length area code phone number international plan voice mail plan number vmail messages total day minutes total day calls total day charge ... total eve calls total eve charge total night minutes total night calls total night charge total intl minutes total intl calls total intl charge customer service calls churn
0 KS 128 415 382-4657 no yes 25 265.1 110 45.07 ... 99 16.78 244.7 91 11.01 10.0 3 2.70 1 0
1 OH 107 415 371-7191 no yes 26 161.6 123 27.47 ... 103 16.62 254.4 103 11.45 13.7 3 3.70 1 0
2 NJ 137 415 358-1921 no no 0 243.4 114 41.38 ... 110 10.30 162.6 104 7.32 12.2 5 3.29 0 0
3 OH 84 408 375-9999 yes no 0 299.4 71 50.90 ... 88 5.26 196.9 89 8.86 6.6 7 1.78 2 0
4 OK 75 415 330-6626 yes no 0 166.7 113 28.34 ... 122 12.61 186.9 121 8.41 10.1 3 2.73 3 0

5 rows × 21 columns

df['churn'].value_counts()
0    2850
1     483
Name: churn, dtype: int64
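As an aside, if pandas parses the churn column as a boolean dtype (as it typically does for this dataset), the two-step string mapping above can be collapsed into a single cast; a small sketch, assuming the boolean dtype:

# Equivalent to the map(str) + dict mapping when df['churn'] holds booleans
df['churn'] = df['churn'].astype(int)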

3.2 Train/test split

X = df[['total day calls', 'total night charge', 'number vmail messages', 'total intl charge', 'total eve calls']]
y = df['churn'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 23)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(2333, 5) (1000, 5) (2333,) (1000,)
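Since churn is imbalanced (483 positives out of 3333), a stratified split usually gives train and test sets with matching class ratios; a sketch of the alternative:

# Stratified alternative: preserve the churn ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 23, stratify = y)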

4 LR

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Baseline LR model
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(X_train, y_train)

# Compute AUC
scores = lr.predict_proba(X_test)[:,1]
LR_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LR_auc
LogisticRegression(random_state=23)

0.5834069949026194
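Logistic regression is sensitive to feature scale, so standardizing the raw inputs often helps; a minimal sketch using an sklearn Pipeline (whether it moves the AUC on this data is not verified here):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features, then fit LR; AUC is computed the same way as above
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(random_state = 23))
lr_scaled.fit(X_train, y_train)
print(metrics.roc_auc_score(y_test, lr_scaled.predict_proba(X_test)[:, 1]))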

5 LGB

import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's sklearn wrapper, which lets us use grid search and parallelism just as with sklearn's GBM

# Build the model
model_lgb = lgb.LGBMClassifier(
                                 boosting_type='gbdt',
                                 objective = 'binary',
                                 metric = 'auc',
                                 verbose = 0,
                                 learning_rate = 0.01,
                                 num_leaves = 35,
                                 feature_fraction=0.8,
                                 bagging_fraction= 0.9,
                                 bagging_freq= 8,
                                 lambda_l1= 0.6,
                                 lambda_l2= 0
                               )

# Fit the model
model_lgb.fit(X_train, y_train)

# Compute AUC
scores = model_lgb.predict_proba(X_test)[:,1]
LGB_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LGB_auc
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000353 seconds.
You can set `force_col_wise=true` to remove the overhead.

0.601792922596423
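The warnings above come from passing LightGBM's native parameter names through the sklearn wrapper; using the sklearn-style aliases (which the warnings themselves name) should build an equivalent, warning-free model. A sketch:

# Same model expressed with sklearn-style parameter aliases
model_lgb = lgb.LGBMClassifier(boosting_type = 'gbdt',
                               objective = 'binary',
                               learning_rate = 0.01,
                               num_leaves = 35,
                               colsample_bytree = 0.8,  # alias of feature_fraction
                               subsample = 0.9,         # alias of bagging_fraction
                               subsample_freq = 8,      # alias of bagging_freq
                               reg_alpha = 0.6,         # alias of lambda_l1
                               reg_lambda = 0)          # alias of lambda_l2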

6 LGB+LR

6.1 LGB实现

import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's sklearn wrapper, which lets us use grid search and parallelism just as with sklearn's GBM
# Build the model (parameter dict)
lgb_param = {'boosting_type':'gbdt',
                                 'objective' : 'binary',
                                 'metric' : 'auc',
                                 'verbose' : 0,
                                 'learning_rate' : 0.01,
                                 'num_leaves' : 4,
                                 'feature_fraction':0.8,
                                 'bagging_fraction': 0.9,
                                 'bagging_freq': 8,
                                 'lambda_l1': 0.6,
                                 'lambda_l2': 0,
            'n_estimators' : 200}

'''
num_leaves: the number of leaves per tree
n_estimators: the number of trees
- 4 leaves per tree; the default is 100 trees, but 200 are used in this scenario.
'''
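Since lgb_param mirrors the keyword arguments used next, the model could equivalently be built by unpacking the dict; the literal definition is kept below for readability:

# Equivalent one-line construction from the parameter dict
model = lgb.LGBMClassifier(**lgb_param)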

model = lgb.LGBMClassifier(
                                 boosting_type='gbdt',
                                 objective = 'binary',
                                 metric = 'auc',
                                 verbose = 0,
                                 learning_rate = 0.01,
                                 num_leaves = 4,
                                 feature_fraction=0.8,
                                 bagging_fraction= 0.9,
                                 bagging_freq= 8,
                                 lambda_l1= 0.6,
                                 lambda_l2= 0,
                                n_estimators = 200
                               )

# Fit the model
model.fit(X_train, y_train)
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000282 seconds.
You can set `force_col_wise=true` to remove the overhead.

LGBMClassifier(bagging_fraction=0.9, bagging_freq=8, feature_fraction=0.8,
               lambda_l1=0.6, lambda_l2=0, learning_rate=0.01, metric='auc',
               n_estimators=200, num_leaves=4, objective='binary', verbose=0)
model.get_params()
{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.01,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 200,
 'n_jobs': -1,
 'num_leaves': 4,
 'objective': 'binary',
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'metric': 'auc',
 'verbose': 0,
 'feature_fraction': 0.8,
 'bagging_fraction': 0.9,
 'bagging_freq': 8,
 'lambda_l1': 0.6,
 'lambda_l2': 0}

6.2 Exporting the LGB leaf vectors

6.2.1 Training set

import numpy as np

y_pred = model.predict(X_train,pred_leaf=True) 
# With pred_leaf=True, each prediction is the index of the leaf the sample falls into in every tree.
# Since num_leaves = 4, each index is one of 0, 1, 2, 3.
train_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']],dtype=np.int64)
print(train_matrix.shape) # 2333 rows, 800 columns: 2333 training samples, 200 trees x 4 leaves = 800 variables
train_matrix
(2333, 800)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
for i in range(len(y_pred)):
    # For each sample, the global column index is tree_index * num_leaves + leaf_index,
    # i.e. every tree owns a block of 4 columns; flag the leaf it chose (one-hot).
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    train_matrix[i][temp] += 1
lgb_output_vec_train = pd.DataFrame(train_matrix)
lgb_output_vec_train.columns = ['leaf_' + str(i) for i in lgb_output_vec_train.columns]
lgb_output_vec_train
leaf_0 leaf_1 leaf_2 leaf_3 leaf_4 leaf_5 leaf_6 leaf_7 leaf_8 leaf_9 ... leaf_790 leaf_791 leaf_792 leaf_793 leaf_794 leaf_795 leaf_796 leaf_797 leaf_798 leaf_799
0 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
1 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 0 1 0 0 1 0
2 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 1 0 0 0 1 0
3 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 1 0 0 0 1 0
4 1 0 0 0 1 0 0 0 1 0 ... 0 0 0 0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2328 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
2329 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
2330 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
2331 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
2332 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 1 0 1 0 0

2333 rows × 800 columns
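The per-sample loop above is easy to follow, but the same 800-column encoding can be produced in one shot with OneHotEncoder. A sketch: fixing categories guarantees every tree gets all num_leaves columns even if some leaf never fires; on scikit-learn 1.2+ the sparse=False argument is spelled sparse_output=False instead.

from sklearn.preprocessing import OneHotEncoder

# One block of num_leaves columns per tree, in tree order
n_trees, n_leaves = lgb_param['n_estimators'], lgb_param['num_leaves']
enc = OneHotEncoder(categories=[list(range(n_leaves))] * n_trees, sparse=False)
train_oh = enc.fit_transform(model.predict(X_train, pred_leaf=True))
test_oh = enc.transform(model.predict(X_test, pred_leaf=True))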

6.2.2 Test set

import numpy as np

y_pred = model.predict(X_test,pred_leaf=True) 
# Same leaf-index prediction as for the training set; each index is one of 0, 1, 2, 3.
test_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']],dtype=np.int64)
print(test_matrix.shape) # 1000 rows, 800 columns: 1000 test samples, 200 trees x 4 leaves = 800 variables
test_matrix
(1000, 800)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
for i in range(len(y_pred)):
    # Same one-hot trick: column index = tree_index * num_leaves + leaf_index.
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    test_matrix[i][temp] += 1
lgb_output_vec = pd.DataFrame(test_matrix)
lgb_output_vec.columns = ['leaf_' + str(i) for i in lgb_output_vec.columns]
lgb_output_vec
leaf_0 leaf_1 leaf_2 leaf_3 leaf_4 leaf_5 leaf_6 leaf_7 leaf_8 leaf_9 ... leaf_790 leaf_791 leaf_792 leaf_793 leaf_794 leaf_795 leaf_796 leaf_797 leaf_798 leaf_799
0 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
1 0 1 0 0 0 1 0 0 0 1 ... 0 1 0 0 1 0 0 0 0 1
2 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
3 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
4 1 0 0 0 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 1 0 0 0 1 0
996 1 0 0 0 1 0 0 0 1 0 ... 0 0 1 0 0 0 1 0 0 0
997 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0
998 0 0 0 1 0 1 0 0 0 1 ... 0 1 0 0 1 0 0 0 0 1
999 1 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 1 0 0 1 0

1000 rows × 800 columns

y_pred[0] # the leaf each of the 200 trees assigns to the first test sample
array([0, 0, 0, 3, 3, 0, 3, 0, 3, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 3, 0, 0, 3, 3, 3, 2, 0, 3, 0, 2, 2, 2, 3, 0, 2, 0, 2, 0,
       0, 0, 0, 3, 3, 3, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
       3, 3, 0, 3, 3, 0, 2, 0, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
       3, 3, 3, 3, 1, 3, 3, 3, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 3,
       0, 2, 2, 2, 3, 2, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
       3, 3, 0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 3, 2, 2, 2,
       3, 2])
len(y_pred) # the 1000 test samples
1000
len(y_pred[0]) # the 200 trees
200

6.3 LGB+LR

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit LR on the leaf one-hot vectors
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(lgb_output_vec_train, y_train)

# Compute AUC
scores = lr.predict_proba(lgb_output_vec)[:,1]
LR_LGB_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LR_LGB_auc
LogisticRegression(random_state=23)

0.58792613217832
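With 800 sparse binary inputs and only 2333 training rows, the LR stage is prone to overfitting, so tuning the regularization strength may close part of the gap; a quick cross-validated sketch:

from sklearn.linear_model import LogisticRegressionCV

# Search 10 values of C over a log grid with 5-fold CV, scored by AUC
lr_cv = LogisticRegressionCV(Cs = 10, cv = 5, scoring = 'roc_auc',
                             max_iter = 1000, random_state = 23)
lr_cv.fit(lgb_output_vec_train, y_train)
print(metrics.roc_auc_score(y_test, lr_cv.predict_proba(lgb_output_vec)[:, 1]))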

7 Model comparison

df = pd.DataFrame({'model':['LR', 'LGB', 'LGB+LR'], 'AUC':[LR_auc, LGB_auc, LR_LGB_auc]})
df
model AUC
0 LR 0.583407
1 LGB 0.601793
2 LGB+LR 0.587926

Conclusion: on this dataset, LGB+LR did not beat LGB alone, so no model is absolutely better than another; pick the best one for each data scenario. That said, LGB+LR generally performs well in CTR-prediction settings.

