# sklearn_ensemble_cv

`sklearn_ensemble_cv` is a Python module for performing accurate and efficient ensemble cross-validation methods from various projects.
- The module builds on scikit-learn/`sklearn` to provide maximum flexibility over the choice of base predictor.
- The module includes functions for creating ensembles of models, training the ensembles using cross-validation, and making predictions with the ensembles.
- The module also includes utilities for evaluating the performance of the ensembles and of the individual models that make up the ensembles.
```python
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn_ensemble_cv import ECV

# Hyperparameters for the base regressor
grid_regr = {
    'max_depth': np.array([6, 7], dtype=int),
}
# Hyperparameters for the ensemble
grid_ensemble = {
    'max_features': np.array([0.9, 1.]),
    'max_samples': np.array([0.6, 0.7]),
    'n_jobs': -1  # use all processors for fitting each ensemble
}

# Build 50 trees and extrapolate risk estimates up to 100 trees
# (X_train and y_train are the training data, defined elsewhere)
res_ecv, info_ecv = ECV(
    X_train, y_train, DecisionTreeRegressor, grid_regr, grid_ensemble,
    M=50, M_max=100, return_df=True
)
```
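The example above assumes `X_train` and `y_train` already exist. For a quick self-contained run, synthetic regression data can be generated first (a hypothetical data setup, not part of the package):

```python
import numpy as np

# Hypothetical synthetic data standing in for X_train / y_train
rng = np.random.default_rng(0)
n, p = 200, 5
X_train = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y_train = X_train @ beta + 0.5 * rng.standard_normal(n)
```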
It currently supports bagging- and subagging-type ensembles under square loss. The hyperparameters of the base predictor are listed in `sklearn.tree.DecisionTreeRegressor`, and the hyperparameters of the ensemble are listed in `sklearn.ensemble.BaggingRegressor`. Using other sklearn regressors (`regr.is_regressor = True`) as base predictors is also supported.
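For context, bagging draws bootstrap samples with replacement, while subagging draws subsamples without replacement; in plain `sklearn` this distinction corresponds to the `bootstrap` flag of `BaggingRegressor`. A minimal sketch using only `sklearn` (independent of this package's `ECV` interface; the data here is synthetic):

```python
import numpy as np
from sklearn.base import is_regressor
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = X[:, 0] + 0.1 * rng.standard_normal(200)

base = DecisionTreeRegressor(max_depth=3)
assert is_regressor(base)  # any sklearn regressor qualifies as a base predictor

# Bagging: each tree is fit on a bootstrap resample (with replacement)
bagging = BaggingRegressor(base, n_estimators=10, bootstrap=True,
                           random_state=0).fit(X, y)

# Subagging: each tree is fit on a 60% subsample (without replacement)
subagging = BaggingRegressor(base, n_estimators=10, max_samples=0.6,
                             bootstrap=False, random_state=0).fit(X, y)
```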
This project is currently in development. More CV methods will be added shortly.
- split CV
- K-fold CV
- ECV
- GCV
- CGCV
- CGCV non-square loss
- ALOCV
Check out Jupyter Notebooks in the tutorials folder:
| Name | Description |
|---|---|
| `basics.ipynb` | Basics of applying ECV/CGCV to risk estimation and hyperparameter tuning for ensemble learning. |
| `cgcv_l1_huber.ipynb` | Custom CGCV for M-estimators: l1-regularized Huber ensembles. |
| `multitask.ipynb` | Applying ECV to risk estimation and hyperparameter tuning for multi-task ensemble learning. |
| `random_forests.ipynb` | Applying ECV to model selection for random forests via a simple utility function. |
The code is tested with `scikit-learn == 1.3.1`.

The documentation is available.
The module can be installed via PyPI:

```bash
pip install sklearn-ensemble-cv
```
If you find this package useful for your research, please consider citing our research paper:
| Method | Reference |
|---|---|
| ECV | Du, J. H., Patil, P., Roeder, K., & Kuchibhotla, A. K. (2024). Extrapolated cross-validation for randomized ensembles. Journal of Computational and Graphical Statistics, 1-12. |
| GCV | Du, J. H., Patil, P., & Kuchibhotla, A. K. (2023). Subsample ridge ensembles: equivalences and generalized cross-validation. In Proceedings of the 40th International Conference on Machine Learning (pp. 8585-8631). <br> Patil, P., & Du, J. H. (2024). Generalized equivalences between subsampling and ridge regularization. Advances in Neural Information Processing Systems, 36. |
| CGCV | Bellec, P. C., Du, J. H., Koriyama, T., Patil, P., & Tan, K. (2024). Corrected generalized cross-validation for finite ensembles of penalized estimators. Journal of the Royal Statistical Society Series B: Statistical Methodology, qkae092. |
| CGCV (non-square loss) | Koriyama, T., Patil, P., Du, J. H., Tan, K., & Bellec, P. C. (2024). Precise asymptotics of bagging regularized M-estimators. arXiv preprint arXiv:2409.15252. |