This repository contains the code for the third arXiv version of the paper "On marginal feature attributions of tree-based models" (https://doi.org/10.3934/fods.2024021).

This repository contains the following folders:
- TreeSHAP_sanity_check: Confirms the computations in Example 3.1 regarding path-dependent TreeSHAP.
- Synthetic_model_shapley and Synthetic_model_owen: These folders contain the running times recorded in our experiments with synthetic data in Section 4.1, along with Python scripts for recreating the figures in that section.
- models_metrics: In Section 4.2, we experiment with four public datasets. For each of them, a triple consisting of LightGBM, CatBoost and XGBoost models is trained. The model files are provided in this folder along with the Jupyter notebook r2_score.ipynb, which replicates their metrics as they appear in Table 5 of the paper.
- Retrieve_splits: The notebook Retrieve_splits.ipynb takes a saved LightGBM, CatBoost or XGBoost model, decomposes it into its constituent trees, and creates a dictionary for each decision tree containing information such as the distinct features appearing in the tree, the tree's depth, and the regions cut by the tree. This procedure is carried out for the ensembles trained for our experiments in Section 4.2, and the results appear in Table 6. The process can be repeated for any trained LightGBM, CatBoost or XGBoost model by importing the script EnsembleParser.py.
- Explainer and MIC_based_grouping: In Section 4.3, a proprietary implementation of Algorithm 3.12 is used to explain the four CatBoost models previously trained on public datasets. The corresponding look-up tables of marginal Shapley values are stored in the Explainer folder. As a sanity check, the efficiency property of Shapley values is verified for the outputs of the algorithm in the notebook explanations.ipynb. Moreover, look-up tables containing marginal Owen values for these models were generated through proprietary code based on Theorem F.1. They are available in the same folder, and the efficiency axiom is checked for them as well in explanations.ipynb. The partitions of the features of the public datasets used for computing Owen values are available in the folder MIC_based_grouping. These are obtained from a hierarchical clustering procedure outlined in the notebook grouping.ipynb.
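
The efficiency check mentioned for the Explainer folder can be illustrated with a minimal, self-contained sketch. It is not the proprietary implementation of Algorithm 3.12 and does not read the look-up tables in this repository. It assumes a hypothetical toy tree (`toy_model`), a small made-up background sample (`background`), and brute-force enumeration of all feature subsets in place of the paper's algorithm. All function and variable names are illustrative only. The sketch computes marginal Shapley values and verifies that they sum to the model output at the explained point minus the average model output over the background sample.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical stand-in for a trained tree: a depth-2 tree on 3 features."""
    if x[0] <= 0.5:
        return 1.0 if x[1] <= 0.5 else 3.0
    return 2.0 if x[2] <= 0.5 else 6.0


def marginal_game(model, x, background, subset):
    """v(S): average model output with features in S fixed to x
    and the remaining features taken from each background row."""
    total = 0.0
    for row in background:
        z = [x[i] if i in subset else row[i] for i in range(len(x))]
        total += model(z)
    return total / len(background)


def marginal_shapley(model, x, background):
    """Brute-force Shapley values of the marginal game (exponential in n)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition S not containing i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                with_i = marginal_game(model, x, background, s | {i})
                without_i = marginal_game(model, x, background, s)
                phi[i] += weight * (with_i - without_i)
    return phi


# Made-up background sample and point to explain (assumptions for illustration).
background = [[0.1, 0.9, 0.3], [0.7, 0.2, 0.8], [0.4, 0.4, 0.6], [0.9, 0.6, 0.1]]
x = [0.8, 0.3, 0.9]

phi = marginal_shapley(toy_model, x, background)
baseline = sum(toy_model(row) for row in background) / len(background)

# Efficiency: the attributions sum to f(x) minus the average background output.
efficiency_gap = abs(sum(phi) - (toy_model(x) - baseline))
print("Shapley values:", phi)
print("efficiency gap:", efficiency_gap)
```

For a real ensemble, the same comparison would be made between the row sums of a look-up table and the model predictions minus their mean over the background data, up to floating-point tolerance.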