SHAP Values — A mini study for interpreting your models

Francke Peixoto
Jul 18, 2020

SHAP — SHapley Additive exPlanations

SHAP is a technique used to interpret “black-box models” and was developed by Scott M. Lundberg.

SHAP measures the impact of each variable while taking its interactions with the other variables into account.
Shapley values compute the importance of a feature by comparing what a model predicts with and without that feature. However, since the order in which a model sees features can affect its predictions, this is done over every possible ordering, so that features are compared fairly. (source)
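As a minimal illustration of this "over every possible ordering" idea, here is a toy sketch (my own example, not the SHAP library and not the Titanic model): brute-force Shapley values for a made-up 3-feature linear model, averaging each feature's marginal contribution over all feature orderings.

from itertools import permutations
import numpy as np

def toy_model(x):
    # hypothetical stand-in for a trained model: a simple linear score
    return 2.0 * x[0] + 1.0 * x[1] - 3.0 * x[2]

def shapley_brute_force(model, x, baseline):
    n = len(x)
    contrib = np.zeros(n)
    orders = list(permutations(range(n)))
    for order in orders:
        z = baseline.copy()           # start with no features "revealed"
        prev = model(z)
        for f in order:
            z[f] = x[f]               # reveal feature f with its real value
            cur = model(z)
            contrib[f] += cur - prev  # marginal contribution in this ordering
            prev = cur
    return contrib / len(orders)      # average over all orderings

x_example = np.array([1.0, 2.0, 0.5])
baseline = np.zeros(3)
print(shapley_brute_force(toy_model, x_example, baseline))  # [ 2.   2.  -1.5]

The contributions always sum to model(x) minus model(baseline); for a linear model each feature's Shapley value is simply its own term. SHAP computes the same quantity efficiently for real models instead of enumerating every ordering.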

In [1]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('../input/titanic-machine-learning-from-disaster/train.csv')

# Impute missing values and drop the columns that will not be used as features
data.Age.fillna(value=data.Age.median(), inplace=True)
data.Embarked.fillna(value='S', inplace=True)
dropColumns = ["PassengerId", "Name", "Ticket", "Cabin", "Embarked"]
for col in dropColumns:
    data.drop(columns=[col], inplace=True)

y = data.Survived
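A quick sanity check (my addition, not in the original notebook) that the imputation and column drops left the frame as expected:

print(data.isnull().sum())  # no missing values remain in the kept columns
print(data.shape)           # same number of rows; only the dropped columns are gone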

In [2]:

feature_names = ["Pclass","Sex","Age","SibSp","Parch","Fare"]
dummies = pd.get_dummies(data[feature_names])  # one-hot encode Sex into Sex_female / Sex_male
dummies.head(2)

Out[2]:

   Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male
0       3  22.0      1      0   7.2500           0         1
1       1  38.0      1      0  71.2833           1         0

In [3]:

x = dummies
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=1)

In [4]:

model = RandomForestClassifier(random_state=0).fit(train_x, train_y)

Examining the SHAP values for a few rows of our dataset.

In [5]:

row = 50
data_prediction = val_x.iloc[row]                          # a single passenger from the validation set
data_prediction_array = data_prediction.values.reshape(1, -1)
model.predict_proba(data_prediction_array)                 # predicted probabilities for this passenger

Out[5]:

array([[0.7, 0.3]])
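The two columns follow the order of model.classes_: column 0 is P(Survived = 0) and column 1 is P(Survived = 1), so this passenger gets a 30% survival probability. A quick check (not in the original notebook) to confirm the order:

print(model.classes_)  # [0 1]: column 0 = did not survive, column 1 = survived

This is also why the class-1 entries (explainer.expected_value[1] and shap_values[1]) are used in the force plots below.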

Using SHAP values for this single prediction.

The shap package is used to calculate SHAP values; if it is not installed, run !pip install shap first.

In [6]:

import shap 
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data_prediction)
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_prediction)
Setting feature_perturbation = "tree_path_dependent" because no background data was given.

Out[6]:

[Force plot: SibSp = 0, Fare = 15.05, Pclass = 2, Sex_male = 1, Sex_female = 0 and Parch = 0 push the prediction from the base value to a model output of 0.30]
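About the message above: because no background data was passed, TreeExplainer falls back to the tree-path-dependent perturbation. A sketch (my addition, not in the original notebook) of the alternative, passing the training set as background to get the interventional behavior:

explainer_bg = shap.TreeExplainer(model, data=train_x, feature_perturbation="interventional")
shap_values_bg = explainer_bg.shap_values(data_prediction_array)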

Kernel SHAP

Predictions on the test set

In [7]:

k_explainer = shap.KernelExplainer(model.predict_proba, train_x)
k_shap_values = k_explainer.shap_values(data_prediction)
shap.initjs()
shap.force_plot(k_explainer.expected_value[1], k_shap_values[1], data_prediction)
l1_reg="auto" is deprecated and in the next version (v0.29) the behavior will change from a conditional use of AIC to simply "num_features(10)"!

Out[7]:

[Kernel SHAP force plot: SibSp = 0, Fare = 15.05, Pclass = 2, Sex_male = 1, Sex_female = 0 and Age = 28 push the prediction from the base value to a model output of 0.30]
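KernelExplainer is model-agnostic but much slower than TreeExplainer, especially with the full training set as background data. A common speed-up (a sketch, not in the original notebook) is to summarize the background with shap.kmeans or shap.sample:

background = shap.kmeans(train_x, 10)     # 10 weighted centroids instead of all training rows
# background = shap.sample(train_x, 100)  # or: a random sample of 100 rows
k_explainer_fast = shap.KernelExplainer(model.predict_proba, background)
k_shap_values_fast = k_explainer_fast.shap_values(data_prediction)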

In [8]:

import eli5 #!pip install eli5 
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(model, random_state=1).fit(val_x, val_y)
eli5.show_weights(perm, feature_names = val_x.columns.tolist())

Out[8]:

Weight             Feature
0.0709 ± 0.0256    Sex_male
0.0637 ± 0.0192    Pclass
0.0556 ± 0.0238    Sex_female
0.0197 ± 0.0231    Age
0.0170 ± 0.0174    SibSp
0.0090 ± 0.0401    Fare
-0.0045 ± 0.0127   Parch
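eli5's PermutationImportance shuffles one column at a time and measures how much the validation score drops. The same idea is available natively in scikit-learn (a sketch, assuming scikit-learn >= 0.22; the exact numbers may differ from the table above):

from sklearn.inspection import permutation_importance

result = permutation_importance(model, val_x, val_y, n_repeats=5, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"{val_x.columns[i]}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")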

In [9]:

from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots #!pip install pdpbox
feature_names = dummies.columns
pdp_isolate_ = pdp.pdp_isolate(model=model, dataset=val_x, model_features=feature_names, feature='Age')
pdp.pdp_plot(pdp_isolate_, 'Age')
plt.show()

Getting the SHAP values for the validation data.

In [10]:

shap_values = explainer.shap_values(val_x)
shap.summary_plot(shap_values[1], val_x)
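The summary plot shows, for every validation row, how each feature pushed the prediction up or down. For a condensed, importance-only view (a variant not used in the original notebook), the same call accepts plot_type="bar":

shap.summary_plot(shap_values[1], val_x, plot_type="bar")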

Shapley (SHAP) values compute the importance of each feature by comparing the model's predictions with and without it.

In [11]:

shap_values = explainer.shap_values(x)
shap.dependence_plot('Age', shap_values[1], x, interaction_index="Pclass")

The SHAP library is a powerful tool for exploring the patterns that a learning algorithm has identified.

Sources:

  1. https://blog.datascienceheroes.com/how-to-interpret-shap-values-in-r/

Notebook

Written by Francke Peixoto

Software Engineer | Data Engineer | Data & Analytics Enthusiast | Machine Learning | Azure | Fullstack Developer | Systems Analyst | .Net — Acta, non verba
