I suspect the issue is that the shape of the shap_values array differs slightly depending on the model used (e.g., XGBoost vs. RandomForestClassifier).
You can generate the SHAP analysis plots successfully by simply indexing into the shap_values array.
Since I don't have your data, I generated a sample dataset as an example for your reference:
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Generate sample data
np.random.seed(42)
features = pd.DataFrame({
"feature_1": np.random.randint(18, 70, size=100),
"feature_2": np.random.randint(30000, 100000, size=100),
"feature_3": np.random.randint(1, 4, size=100),
"feature_4": np.random.randint(300, 850, size=100),
"feature_5": np.random.randint(1000, 50000, size=100)
})
target = np.random.randint(0, 2, size=100)
features_names = features.columns.tolist()
# The following code is just like your example.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)
# shap_values has shape (n_samples, n_features, n_classes),
# so select the slice for one class before plotting.
shap.summary_plot(shap_values[:, :, 0], X_test, feature_names=features_names)
shap.summary_plot(shap_values[:, :, 0], X_test, feature_names=features_names, plot_type="bar")
With the above, you can run the SHAP analysis successfully by simply changing shap_values to shap_values[:, :, 0].
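Since the return type of explainer.shap_values varies with the SHAP version and the model (older versions return a list of per-class arrays for classifiers, newer versions a single 3-D array), a small helper can normalize it before plotting. Note that shap_values_for_class is my own name for this sketch, not part of the shap API:

```python
import numpy as np

def shap_values_for_class(shap_values, class_index=0):
    """Return a 2-D (n_samples, n_features) array for one class,
    whatever format explainer.shap_values produced."""
    # Older SHAP versions return a list with one 2-D array per class.
    if isinstance(shap_values, list):
        return np.asarray(shap_values[class_index])
    values = np.asarray(shap_values)
    # Newer versions return one (n_samples, n_features, n_classes) array.
    if values.ndim == 3:
        return values[:, :, class_index]
    # Regressors (and binary XGBoost) already give a 2-D array.
    return values
```

You can then call shap.summary_plot(shap_values_for_class(shap_values, 0), X_test, ...) without worrying about which format your SHAP version produced.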
As for what the third dimension of shap_values represents when using RandomForestClassifier: it indexes the model's output classes, so shap_values[:, :, 0] holds the SHAP values for class 0 and shap_values[:, :, 1] those for class 1.