Boruta特征筛选

文章目录前言Boruta介绍1.读入数据2.利用筛选的特征进行建模总结前言Boruta介绍- Boruta算法是一种特征选择方法，使用特征的重要性来选取特征网址：https://github.com/scikit-learn-contrib/boruta_py安装：pip install Boruta提示：以下是本篇文章正文内容1.读入数据代码如下（示例）：import numpy as npfr

sunnuan01

5909人浏览 · 2021-10-03 16:36:42

sunnuan01 · 2021-10-03 16:36:42 发布

文章目录

前言Boruta介绍
- 1.读入数据
- 2.利用筛选的特征进行建模
总结

前言Boruta介绍

- Boruta算法是一种特征选择方法，使用特征的重要性来选取特征

网址：https://github.com/scikit-learn-contrib/boruta_py
安装：pip install Boruta

提示：以下是本篇文章正文内容

1.读入数据

代码如下（示例）：

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
import pandas as pd

data = pd.read_csv(r'F:\教师培训\ppd7\df_Master_merge_clean.csv',encoding='gb18030')
# 特征筛选，原始特征420维度，首先利用boruta算法进行筛选
# 处理数据去掉['Idx', 'target', 'sample_status']

pd_x = data[data.target.notnull()].drop(columns=['Idx', 'target', 'sample_status', 'ListingInfo'])
x = pd_x.values
pd_y = data[data.target.notnull()]['target']
y = np.array(pd_y).ravel()
# 先定义一个随机森林分类器，RandomForest， lightgbm，xgboost也可以
rfc = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
feat_selector = BorutaPy(rfc,n_estimators='auto',random_state=1, max_iter=10)
feat_selector.fit(x, y)

#获取筛选到的特征
dic_ft_select = pd.DataFrame({'name':pd_x.columns, 'select':feat_selector.support_})
selected_feat_name1 = dic_ft_select[dic_ft_select.select]['name']

2.利用筛选的特征进行建模

def roc_auc_plot(clf,x_train,y_train,x_test, y_test):
    train_auc = roc_auc_score(y_train,clf.predict_proba(x_train)[:,1])
    train_fpr, train_tpr, _ = roc_curve(y_train,clf.predict_proba(x_train)[:,1])
    train_ks = abs(train_fpr-train_tpr).max()
    print('train_ks = ', train_ks)
    print('train_auc = ', train_auc)
    
    test_auc = roc_auc_score(y_test,clf.predict_proba(x_test)[:,1])
    test_fpr, test_tpr, _ = roc_curve(y_test,clf.predict_proba(x_test)[:,1])
    test_ks = abs(test_fpr-test_tpr).max()
    print('test_ks = ', test_ks)
    print('test_auc = ', test_auc)
    
    from matplotlib import pyplot as plt
    plt.plot(train_fpr,train_tpr,label = 'train_roc')
    plt.plot(test_fpr,test_tpr,label = 'test_roc')
    plt.plot([0,1],[0,1],'k--', c='r')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC Curve')
    plt.legend(loc = 'best')
    plt.show()

x2 = pd_x[selected_feat_name1]
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
x_train,x_test, y_train, y_test = train_test_split(x2,y,random_state=2,test_size=0.2)

import lightgbm as lgb

lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                boosting_type='gbdt',
                               learning_rate=0.04,
                               min_child_samples=68,
                               min_child_weight=0.01,
                                  max_depth=4,
                              num_leaves=16,
                              colsample_bytree=0.8,
                              subsample=0.8,
                              reg_alpha=0.7777777777777778,
                              reg_lambda=0.3,
                               objective='binary')

clf = lgb_model.fit(x_train, y_train,
              eval_set=[(x_train, y_train),(x_test,y_test)],
              eval_metric='auc',early_stopping_rounds=100)
roc_auc_plot(clf,x_train,y_train,x_test, y_test)

在这里插入图片描述

总结

因为去掉了一部分特征，所以模型的auc有所下降

2048 AI社区

有“AI”的1024 = 2048，欢迎大家加入2048 AI社区

更多推荐

小白也能看懂！手把手教你入门MCP协议，解锁大模型本地应用，速收藏！

2048 AI社区

Kimi新架构训练效率提升25%！马斯克夸赞

月之暗面刚刚发布了新模型架构𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔。在不同模型尺寸上，训练效率均提升了25%。有人声称这一创新，将注意力旋转了90°。马斯克也对这一创新表示惊叹。AI大神Karpathy直言，我们对Transformer开山之作《Attention is All You Need》这篇论文的理解还是不够。月之暗面团队提出注意力残差机制，巧妙化解了