Merge early stopping with shaprfecv main class (#263)
Reinier Koops authored Jul 20, 2024
1 parent 08974ea commit 45f383b
Showing 6 changed files with 450 additions and 402 deletions.
3 changes: 1 addition & 2 deletions docs/api/feature_elimination.md
@@ -2,7 +2,6 @@

This module focuses on feature elimination and it contains two classes:

- [ShapRFECV][probatus.feature_elimination.feature_elimination.ShapRFECV]: Perform Backwards Recursive Feature Elimination, using SHAP feature importance. It supports binary classification models and hyperparameter optimization at every feature elimination step.
- [EarlyStoppingShapRFECV][probatus.feature_elimination.feature_elimination.EarlyStoppingShapRFECV]: adds support to early stopping of the model fitting process. It can be an alternative regularization technique to hyperparameter optimization of the number of base trees in gradient boosted tree models. Particularly useful when dealing with large datasets.
- [ShapRFECV][probatus.feature_elimination.feature_elimination.ShapRFECV]: Perform Backwards Recursive Feature Elimination, using SHAP feature importance. It supports binary classification and regression models, as well as hyperparameter optimization at every feature elimination step. For LightGBM, XGBoost and CatBoost it also supports early stopping of the model fitting process, which can serve as an alternative regularization technique to hyperparameter optimization of the number of base trees in gradient boosted tree models. This is particularly useful when dealing with large datasets.

::: probatus.feature_elimination.feature_elimination
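
To make the merged behaviour concrete, here is a minimal usage sketch (an editorial illustration, not part of the commit). The `ShapRFECV` parameters mirror those shown in the notebook diff below; the dataset, model settings and feature names are assumptions.

```python
# Minimal sketch: ShapRFECV with early stopping on a LightGBM classifier.
# Dataset and hyperparameters are illustrative assumptions.
import lightgbm
import pandas as pd
from sklearn.datasets import make_classification

from probatus.feature_elimination import ShapRFECV

# Toy binary-classification data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

model = lightgbm.LGBMClassifier(n_estimators=200, max_depth=3)

# eval_metric and early_stopping_rounds switch on early stopping during fitting,
# which previously required the separate EarlyStoppingShapRFECV class.
shap_elimination = ShapRFECV(
    model=model,
    step=0.2,
    cv=5,
    scoring="roc_auc",
    eval_metric="auc",
    early_stopping_rounds=5,
    n_jobs=-1,
)
report = shap_elimination.fit_compute(X, y)
```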
12 changes: 6 additions & 6 deletions docs/tutorials/nb_shap_feature_elimination.ipynb
@@ -32,7 +32,7 @@
"\n",
"- Removing lowest [SHAP](https://shap.readthedocs.io/en/latest/) importance feature does not always translate to choosing the feature with the lowest impact on a model's performance. Shap importance illustrates how strongly a given feature affects the output of the model, while disregarding correctness of this prediction.\n",
"- Currently, the functionality only supports tree-based & linear binary classifiers, in the future the scope might be extended.\n",
"- For large datasets, performing hyperparameter optimization can be very computationally expensive. For gradient boosted tree models, one alternative is to use early stopping of the training step. For this, see [EarlyStoppingShapRFECV](#EarlyStoppingShapRFECV)\n",
"- For large datasets, performing hyperparameter optimization can be very computationally expensive. For gradient boosted tree models, one alternative is to use early stopping of the training step. For this use the parameters early_stopping_rounds and eval_metric.\n",
"\n",
"## Setup the dataset\n",
"\n",
@@ -11232,13 +11232,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## EarlyStoppingShapRFECV\n",
"## Early Stopping ShapRFECV\n",
"\n",
"[Early stopping](https://en.wikipedia.org/wiki/Early_stopping) is a type of regularization, common in [gradient boosted trees](https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting). Supported packages are: [LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html), [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html) and [CatBoost](https://catboost.ai/en/docs/). It consists of measuring how well the model performs after each base learner is added to the ensemble tree, using a relevant scoring metric. If this metric does not improve after a certain number of training steps, the training can be stopped before the maximum number of base learners is reached. \n",
"\n",
"Early stopping is thus a way of mitigating overfitting in a relatively cheaply, without having to find the ideal regularization hyperparameters. It is particularly useful for handling large datasets, since it reduces the number of training steps which can decrease the modelling time.\n",
"\n",
"`EarlyStoppingShapRFECV` is a child of `ShapRFECV` with limited support for early stopping and the example below shows how to use it with LightGBM."
"Early Stopping requires parameters early_stopping_rounds eval_metric in `ShapRFECV` class and at the moment only supports the three aforementioned libraries. See the example below how to use it with LightGBM."
]
},
{
@@ -192329,12 +192329,12 @@
],
"source": [
"%%timeit -n 10\n",
"from probatus.feature_elimination import EarlyStoppingShapRFECV\n",
"from probatus.feature_elimination import ShapRFECV\n",
"\n",
"model = lightgbm.LGBMClassifier(n_estimators=200, max_depth=3)\n",
"\n",
"# Run feature elimination\n",
"shap_elimination = EarlyStoppingShapRFECV(\n",
"shap_elimination = ShapRFECV(\n",
" model=search, step=0.2, cv=10, scoring=\"roc_auc\", eval_metric=\"auc\", early_stopping_rounds=5, n_jobs=3\n",
")\n",
"report = shap_elimination.fit_compute(X, y)"
@@ -192370,7 +192370,7 @@
"source": [
"As it is hinted in the example above, with large datasets and simple base learners, early stopping can be a much faster alternative to hyperparameter optimization of the ideal number of trees.\n",
"\n",
"Note that although `EarlyStoppingShapRFECV` supports hyperparameter search models as input, early stopping is used only during the Shapley value estimation step, and not during hyperparameter search. For this reason, _if you are not using early stopping, you should use the parent class, `ShapRFECV`, instead of `EarlyStoppingShapRFECV`_."
"Note that although Early Stopping `ShapRFECV` supports hyperparameter search models as input, early stopping is used only during the Shapley value estimation step, and not during hyperparameter search."
]
}
],
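
The snippet below is an illustrative sketch (not part of the commit) of the scenario described in the note above: a hyperparameter search object passed as `model`, with early stopping applied only while fitting models for SHAP value estimation. The search space and CV settings are assumptions.

```python
# Sketch: passing a hyperparameter search object as `model` to ShapRFECV.
# Early stopping is applied only during the Shapley value estimation step,
# not inside the hyperparameter search itself.
import lightgbm
from sklearn.model_selection import RandomizedSearchCV

from probatus.feature_elimination import ShapRFECV

# Assumed search space; any sklearn-compatible search object should work.
param_distributions = {"num_leaves": [15, 31, 63], "learning_rate": [0.05, 0.1]}
search = RandomizedSearchCV(
    lightgbm.LGBMClassifier(n_estimators=200, max_depth=3),
    param_distributions,
    n_iter=4,
    cv=3,
    scoring="roc_auc",
    random_state=42,
)

shap_elimination = ShapRFECV(
    model=search, step=0.2, cv=10, scoring="roc_auc",
    eval_metric="auc", early_stopping_rounds=5, n_jobs=3,
)
# report = shap_elimination.fit_compute(X, y)  # X, y as prepared earlier in the notebook
```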