XGB Model Interpretability: Hands-On with the SHAP Package

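Interpretable machine learning has gradually become an important research direction in machine learning over the past few years. As data scientists, we need to guard against bias in our models and help decision-makers understand how to use them correctly. The more demanding the setting, the more a model must provide evidence of how it works and of how it avoids errors.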

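SHAP is a "model explanation" package written in Python that can explain the output of any machine learning model. Its name comes from SHapley Additive exPlanations. Inspired by cooperative game theory, SHAP builds an additive explanation model in which every feature is treated as a "contributor". For each sample the model produces a prediction, and the SHAP values are the amounts of that prediction allocated to each feature of the sample; together with a base value (the average model output) they add up to the prediction.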
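To make the "contributor" idea concrete, here is a minimal sketch that computes exact Shapley values for a tiny hand-written value function by enumerating every coalition of features. The function `toy_value`, the constant `BASE_VALUE`, and all the numbers are hypothetical and chosen only for illustration; they are not part of the SHAP package or of the dataset used here, and the SHAP library itself relies on much faster model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values, enumerating every coalition of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Classic Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of f when it joins coalition s
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


# Hypothetical toy "model": the output given which features are known.
BASE_VALUE = 100.0  # output when no feature is known (the base value)


def toy_value(known):
    v = BASE_VALUE
    if "potential" in known:
        v += 40.0
    if "reactions" in known:
        v += 20.0
    if "potential" in known and "reactions" in known:
        v += 10.0  # interaction effect, shared between the two features
    if "stamina" in known:
        v -= 15.0
    return v


features = ["potential", "reactions", "stamina"]
phi = shapley_values(toy_value, features)
full_prediction = toy_value(frozenset(features))

for name in features:
    print(f"{name}: {phi[name]:.1f}")
print(f"base + sum of contributions = {BASE_VALUE + sum(phi.values()):.1f}")
print(f"full prediction             = {full_prediction:.1f}")
```

In this toy case the interaction bonus of 10 is split evenly, so "potential" receives 45, "reactions" receives 25, and "stamina" receives -15. The base value plus these three contributions reproduces the full prediction of 155, which is the additive property that the name SHAP refers to.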

Dataset (estimating football players' market value): http://sofasofa.io/competition.php?id=7

# -*- coding: utf-8 -*-
"""
Created on Tue Mar 23 13:50:38 2021
@author: bo.chen
"""
import shap
from xgboost import XGBRegressor as XGBR
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder

plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'

# Load the data ------
data = pd.read_csv(r'C:\Users\bo.chen\Desktop\data\train.csv', parse_dates=[3])

# Encode the categorical variables ------
pd.set_option('display.max_rows', 500)  # raise the maximum number of displayed rows
print(data.dtypes)  # find the categorical features to one-hot encode later

# Feature columns, with irrelevant variables dropped
col = ['club', 'league', 'height_cm', 'weight_kg',
       'nationality', 'potential', 'pac', 'sho', 'pas', 'dri', 'def', 'phy',
       'international_reputation', 'skill_moves', 'weak_foot', 'preferred_foot', 'crossing', 'finishing',
       'heading_accuracy', 'short_passing', 'volleys', 'dribbling', 'curve',
       'free_kick_accuracy', 'long_passing', 'ball_control', 'acceleration',
       'sprint_speed', 'agility', 'reactions', 'balance', 'shot_power',
       'jumping', 'stamina', 'strength', 'long_shots', 'aggression',
       'interceptions', 'positioning', 'vision', 'penalties', 'marking',
       'standing_tackle', 'sliding_tackle', 'gk_diving', 'gk_handling',
       'gk_kicking', 'gk_positioning', 'gk_reflexes', 'rw', 'rb', 'st', 'lw',
       'cf', 'cam', 'cm', 'cdm', 'cb', 'lb', 'gk']
x = data[col].values
y = data['y'].values

enc = OneHotEncoder()
array_1 = enc.fit_transform(data[['work_rate_att', 'work_rate_def']]).toarray()
# Names of the new one-hot columns (use get_feature_names on scikit-learn < 1.0)
enc_lab = enc.get_feature_names_out(['work_rate_att', 'work_rate_def']).tolist()
x_new = np.hstack([x, array_1])  # combined feature array
# col_new = [i for i in col if i not in ['work_rate_att', 'work_rate_def']]
col.extend(enc_lab)  # combined feature names, used for plotting later

# Fit the model ------
model = XGBR(max_depth=4, learning_rate=0.05, n_estimators=150)
model.fit(x_new, y)

# Feature importance ------
plt.bar(range(len(col)), model.feature_importances_)
plt.xticks(range(len(col)), col, rotation=-45, fontsize=5)
plt.show()

# Create the SHAP explainer ------
explainer = shap.TreeExplainer(model)
df = pd.DataFrame(x_new, columns=col)
shap_values = explainer.shap_values(x_new)  # SHAP value of every sample on each of the 66 features
print(shap_values.shape)  # shape of the SHAP value matrix: (10441, 66)

# SHAP visualisation ------
# SHAP values of the features for a single sample:
# look at the predicted value of one player and how each feature affects that prediction.
j = 0  # use the first sample as the example

# Visualise the explanation of the first prediction with the JS front end;
# pass matplotlib=True instead if you do not want to use JS:
# shap.initjs()
# shap.force_plot(explainer.expected_value, shap_values[j], x_new[j])
shap.force_plot(explainer.expected_value, shap_values[j], x_new[j], matplotlib=True)

shap.summary_plot(shap_values, df)
shap.summary_plot(shap_values, df, plot_type='bar')
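The bar-type summary plot at the end of the script ranks features by their mean absolute SHAP value across all samples. The sketch below reproduces that ranking by hand with NumPy on a small made-up matrix. The array `shap_matrix` and its numbers are hypothetical stand-ins for the real `shap_values` array and do not come from the dataset.

```python
import numpy as np

# Hypothetical SHAP matrix: 4 samples x 3 features (made-up numbers).
feature_names = ["potential", "reactions", "stamina"]
shap_matrix = np.array([
    [30.0, -5.0, 1.0],
    [-20.0, 10.0, -2.0],
    [25.0, -8.0, 0.5],
    [-15.0, 6.0, -1.5],
])

# Global importance of a feature = mean of |SHAP value| over all samples.
mean_abs = np.abs(shap_matrix).mean(axis=0)

# Sort features from most to least important.
order = np.argsort(mean_abs)[::-1]
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]

for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking the absolute value first matters: a feature that pushes some predictions up and others down would otherwise average out towards zero and look unimportant. With the real data, replacing `shap_matrix` by `shap_values` and `feature_names` by `col` should give the same ordering as the bar chart.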

Reference: https://blog.csdn.net/u010970317/article/details/109120788

Source: https://www.icode9.com/content-4-900901.html
