`KernelShap` redundant/incorrect API for OHE of the categorical features - grouping vs aggregation by summation #879

RobertSamoilescu · 2023-02-23T17:19:57Z

There are multiple ways to deal with one-hot encoding (OHE) of categorical features when using the KernelShap explainer.

The first method is straightforward and incorporates the preprocessor into the prediction function. This approach would look like this:

def predictor(X):
   return model.predict_proba(preprocessor.transform(X))

The second method is to use grouping on the one-hot-encoded representation. Grouping lets KernelShap treat multiple columns as one. For example, if we have a categorical column fc with 3 categories, its OHE representation will result in 3 columns fc_1, fc_2, fc_3. Without grouping, KernelShap will treat each fc_i as an individual player/feature and compute 3 values instead of one. With grouping, we can tell KernelShap to treat all 3 columns as one, [fc_1, fc_2, fc_3], i.e. as a single player. Note that the Shap values should match the ones from the first approach (i.e., incorporating the preprocessor).

alibi exposes a third method to compute the Shap values, using aggregation by summation. Following the same example as in the second method, given the OHE representation as input, KernelShap computes the Shap values for each fc_i and then aggregates them into a single value by summing them up (i.e., fc = fc_1 + fc_2 + fc_3). Unfortunately, this method is not correct, since the results obtained by summation won't converge to the true Shap values as they do for the first two methods. This is probably a heuristic borrowed from TreeShap, which cannot use the first two approaches. That being said, we should consider removing this approach since it is redundant and incorrect: the same aggregation can be achieved correctly via the second method.
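To illustrate why summation and grouping can disagree, here is a minimal sketch that computes exact Shapley values by brute force on a toy problem. It does not use alibi or shap: the `model`, the instance `x`, the single-row `background`, and the helper names are all assumptions made up for this example, and exact enumeration stands in for the KernelShap estimator.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_players):
    """Exact Shapley values obtained by enumerating every coalition."""
    phi = [0.0] * n_players
    for i in range(n_players):
        others = [p for p in range(n_players) if p != i]
        for size in range(n_players):
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for subset in combinations(others, size):
                coalition = set(subset)
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


# Hypothetical model on [x0, fc_1, fc_2, fc_3]. On valid one-hot rows it
# equals x0 * weight_of_category, but it is not linear in the OHE columns.
WEIGHTS = (1.0, 2.0, 3.0)


def model(row):
    x0, *ohe = row
    return x0 * max(w * c for w, c in zip(WEIGHTS, ohe))


x = [1.0, 0.0, 0.0, 1.0]           # instance to explain: x0=1, category 3
background = [0.0, 1.0, 0.0, 0.0]  # single background row: x0=0, category 1


def make_value_fn(groups):
    """Value of a coalition: its players' columns come from x, the rest from background."""
    def value_fn(coalition):
        row = list(background)
        for player in coalition:
            for col in groups[player]:
                row[col] = x[col]
        return model(row)
    return value_fn


# Grouping: the three OHE columns act as one player.
grouped = shapley_values(make_value_fn([[0], [1, 2, 3]]), 2)

# Aggregation by summation: one player per column, then sum the OHE columns.
per_column = shapley_values(make_value_fn([[0], [1], [2], [3]]), 4)
summed = [per_column[0], sum(per_column[1:])]

print("grouped:", grouped)
print("summed: ", summed)
```

In this toy setting both attributions add up to the same total, f(x) - f(background) = 3, but the split differs: grouping assigns 1 to fc, while summation assigns 7/6. The per-column game evaluates the model on rows that are not valid one-hot vectors (for example fc_1 = fc_3 = 1), which the grouped game never does.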

Furthermore, cat_vars_start_idx and cat_vars_enc_dim, which perform the aggregation for KernelShap and TreeShap, are parameters of the explain method. We should consider moving those parameters into the fit or __init__ method for the following reasons:

- If we remove them from KernelShap, we would have some symmetry between TreeShap and KernelShap in how they deal with categorical features (for KernelShap, groups and group_names are arguments of the fit method).
- The arguments are not actually used in the computation of the Shap values; they are only used in the _build_explanation method.
- Those arguments should probably be fixed once the explainer is initialized or fitted. If someone is interested in experimenting with various groups, then another explainer can be initialized (if we decide to move them to __init__) or the explainer can be refitted.
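As a rough sketch of how existing cat_vars_start_idx / cat_vars_enc_dim values could be translated into grouping information, the helper below builds one list of column indices per feature. `build_groups` is a hypothetical name and is not part of alibi; the assumption here is that groups are expressed as lists of column indices in the OHE representation.

```python
def build_groups(n_columns, cat_vars_start_idx, cat_vars_enc_dim, feature_names=None):
    """Hypothetical helper: derive column groups from OHE start indices and widths.

    Columns not covered by a categorical variable form single-column groups.
    """
    widths = dict(zip(cat_vars_start_idx, cat_vars_enc_dim))
    groups, group_names = [], []
    col = 0
    while col < n_columns:
        width = widths.get(col, 1)
        if col + width > n_columns:
            raise ValueError(f"encoding starting at column {col} exceeds n_columns")
        groups.append(list(range(col, col + width)))
        idx = len(groups) - 1
        group_names.append(feature_names[idx] if feature_names else f"feature_{idx}")
        col += width
    return groups, group_names


# One numerical column, then fc encoded over 3 columns, then a 2-column variable.
groups, group_names = build_groups(6, [1, 4], [3, 2], ["age", "fc", "colour"])
print(groups)
print(group_names)
```

The resulting lists would then play the role of the groups and group_names arguments mentioned above, so the explainer is fitted with a fixed grouping instead of receiving aggregation parameters at explain time.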

RobertSamoilescu added TreeShap KernelShap labels Feb 23, 2023
