The issue most likely occurs because SHAP's scatter function does not handle missing data well when the SHAP values are computed from an xgb.DMatrix: the sparse matrix may be converted to a dense one, which turns missing entries into zeros. To display missing values correctly (for example, as rug-plot markers), compute the SHAP values from the raw input data (a numpy array or pandas.DataFrame) rather than from the xgb.DMatrix. The model can still be trained with a DMatrix; passing the original X to the SHAP explainer preserves the NaN values, so the scatter plot can show them accurately.