# Tree-based-paper

This repository contains the code accompanying the third arXiv version of the paper

*On marginal feature attributions of tree-based models* (https://doi.org/10.3934/fods.2024021)

This repository contains the following folders:

  1. `TreeSHAP_sanity_check`: Confirms the computations done in Example 3.1 regarding path-dependent TreeSHAP.
  2. `Synthetic_model_shapley` and `Synthetic_model_owen`: These folders contain the times recorded for our experiments with synthetic data in Section 4.1, along with Python scripts for recreating the figures in that section.
  3. `models_metrics`: In Section 4.2, we experiment with four public datasets. For each of them, a triple consisting of a LightGBM, a CatBoost, and an XGBoost model is trained. The model files are provided in this folder, along with the Jupyter notebook `r2_score.ipynb`, which replicates their metrics as reported in Table 5 of the paper.
  4. `Retrieve_splits`: The notebook `Retrieve_splits.ipynb` takes a saved LightGBM, CatBoost, or XGBoost model, decomposes it into its constituent trees, and builds a dictionary for each decision tree containing information such as the distinct features appearing in the tree, the tree's depth, and the regions cut by the tree. This procedure is carried out for the ensembles trained for our experiments in Section 4.2, and the results appear in Table 6. The process can be repeated for any trained LightGBM, CatBoost, or XGBoost model by importing the script `EnsembleParser.py`.
  5. `Explainer` and `MIC_based_grouping`: In Section 4.3, a proprietary implementation of Algorithm 3.12 is used to explain the four CatBoost models previously trained on the public datasets. The corresponding look-up tables of marginal Shapley values are stored in the `Explainer` folder, and, as a sanity check, the efficiency property of Shapley values is verified for the algorithm's outputs in the notebook `explanations.ipynb`. Moreover, look-up tables of marginal Owen values for these models, generated by proprietary code based on Theorem F.1, are available in the same folder; the efficiency axiom is checked for them as well in `explanations.ipynb`. The partitions of the features of the public datasets used for computing Owen values are available in the folder `MIC_based_grouping`; they are obtained from a hierarchical clustering procedure outlined in the notebook `grouping.ipynb`.
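
The per-tree decomposition in item 4 can be illustrated on LightGBM's JSON dump format, where internal nodes carry a `split_feature` key and leaves carry only `leaf_value`. The sketch below is not the repository's `EnsembleParser.py`; it only shows, on a tiny hand-written tree, how the distinct split features and depth of one tree might be collected:

```python
# A tiny hand-written tree in LightGBM's dump_model() JSON layout
# (internal nodes have "split_feature"; leaves have only "leaf_value").
tree = {
    "tree_structure": {
        "split_feature": 0,
        "left_child": {"leaf_value": -1.0},
        "right_child": {
            "split_feature": 2,
            "left_child": {"leaf_value": 0.5},
            "right_child": {"leaf_value": 1.5},
        },
    }
}

def tree_summary(tree_dict):
    """Distinct split features and depth of one tree (root at depth 0)."""
    features, max_depth = set(), 0

    def walk(node, depth):
        nonlocal max_depth
        max_depth = max(max_depth, depth)
        if "split_feature" in node:  # internal node: recurse into children
            features.add(node["split_feature"])
            walk(node["left_child"], depth + 1)
            walk(node["right_child"], depth + 1)

    walk(tree_dict["tree_structure"], 0)
    return {"features": sorted(features), "depth": max_depth}

summary = tree_summary(tree)
# summary == {"features": [0, 2], "depth": 2}
```

For a real model, the list of such dictionaries would come from iterating over `booster.dump_model()["tree_info"]`; the notebook itself also records the regions cut by each tree.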
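
The efficiency check performed in `explanations.ipynb` (item 5) amounts to verifying that, for each observation, the per-feature attributions sum to the model's prediction minus its average prediction. A minimal sketch with hypothetical numbers (not the repository's look-up tables):

```python
# Hypothetical model outputs for three observations.
predictions = [2.7, 1.1, 3.4]
baseline = sum(predictions) / len(predictions)  # average model output: 2.4

# Hypothetical per-feature marginal attributions (3 features per row).
shapley = [
    [0.4, 0.1, -0.2],
    [-0.6, -0.3, -0.4],
    [0.7, 0.2, 0.1],
]

# Efficiency axiom: attributions sum to prediction minus baseline.
for pred, phi in zip(predictions, shapley):
    assert abs(sum(phi) - (pred - baseline)) < 1e-9, "efficiency violated"
```

The same check applies verbatim to Owen values, which also satisfy the efficiency axiom.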