Topics covered:
1. What is boosting?
2. What is a boosting algorithm?
3. Examples of boosting algorithms
For an end-to-end implementation, refer to: HealthCare Predicting Length of stay with boosting algorithms
GitHub to pull the code: Github/swapanroy
Boosting (from "to boost", meaning to help or encourage something to increase or improve) is a machine learning method for improving the predictive accuracy and performance of models.
Ensemble learning, of which boosting is one form, has become one of the most promising approaches in the machine learning domain. An ensemble method generates multiple predictions from individual classifiers and combines them, for example by averaging or by majority voting.
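To make the voting idea concrete, here is a minimal sketch of majority voting in plain Python. The three "classifiers" are just hypothetical lists of predicted labels, and `majority_vote` is an illustrative helper, not a library function.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine label predictions from several classifiers by majority vote.

    predictions: list of lists, one inner list of labels per classifier.
    On a tie, the label that was seen first (in classifier order) wins.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(clf_preds[i] for clf_preds in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Three hypothetical classifiers voting on four samples
clf_a = [0, 1, 1, 0]
clf_b = [0, 1, 0, 0]
clf_c = [1, 1, 1, 0]

ensemble_prediction = majority_vote([clf_a, clf_b, clf_c])
print(ensemble_prediction)
```

Each sample gets the label that most classifiers agree on, so a single classifier's mistake (for example `clf_c` on the first sample) is outvoted.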
Two implementations of boosting algorithms
AdaBoost, or Adaptive Boosting, is a boosting ensemble model and a statistical classification meta-algorithm; the name refers to a particular method of training a boosted classifier. The method automatically adjusts its parameters to the data based on the actual performance in the current iteration, typically by increasing the weights of misclassified samples.
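As a rough illustration of this adaptive step, the sketch below performs one weight update of the classic binary AdaBoost formulation. It is a simplified teaching example with a hypothetical helper `adaboost_reweight`; scikit-learn's multiclass implementation differs in its details.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One round of the classic binary AdaBoost weight update.

    weights: current sample weights, assumed to sum to 1.
    misclassified: booleans, True where the weak learner was wrong.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner
    err = sum(w for w, miss in zip(weights, misclassified) if miss)
    # Learner weight: larger when the learner makes fewer mistakes
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase weights of misclassified samples, decrease the rest
    updated = [w * math.exp(alpha if miss else -alpha)
               for w, miss in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five equally weighted samples; the weak learner gets the third one wrong
weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.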
CatBoost - Provides a gradient boosting framework that, among other features, handles categorical features using a permutation-driven alternative to the classical algorithm. For each data point, CatBoost calculates the residual using a model trained on other data points only, so the residual for each data point comes from a model that has not seen it. These residuals are used as targets, and the general model is trained for as many rounds as there are iterations. Because many models have to be maintained by definition, this approach is computationally expensive and can take a long time to train.
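The permutation-driven idea can be illustrated with a simplified ordered target statistic: each row's category is encoded using only the targets of rows that come before it in a random permutation. The sketch below is an assumption-laden illustration of that idea, not CatBoost's actual implementation; the helper name, the `prior` value and the toy data are all hypothetical.

```python
def ordered_target_statistic(categories, targets, prior, prior_weight=1.0):
    """Encode a categorical column using only the 'history' of each row.

    Rows are assumed to be already arranged in a random permutation.
    For each row, the encoding is a smoothed mean of the targets of
    earlier rows with the same category, so a row never sees its own target.
    """
    sums = {}    # running sum of targets per category
    counts = {}  # running count of rows per category
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        encoded.append((s + prior * prior_weight) / (c + prior_weight))
        # Only now add the current row to the history
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

categories = ['A', 'B', 'A', 'A', 'B']
targets = [1, 0, 0, 1, 1]
encoded = ordered_target_statistic(categories, targets, prior=0.5)
print(encoded)
```

Because each value depends only on earlier rows, the encoding does not leak a row's own target into its feature, which is the motivation for the permutation-driven approach.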
Data Source: Kaggle- HealthCare data to predict length of stay
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier, Pool
from catboost.utils import get_confusion_matrix
from sklearn.metrics import confusion_matrix, classification_report, make_scorer, accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
df_train = pd.read_csv('../input/av-healthcare-analytics-ii/healthcare/train_data.csv', sep=',')

# Plot how many records fall into each length-of-stay bucket
plt.figure(figsize=(20, 5))
x = sns.countplot(x=df_train['Stay'], order=df_train['Stay'].value_counts().index)
for i in x.containers:
    x.bar_label(i)

# df_test is assumed to have been loaded the same way as df_train (its path is not shown here)
df_merge = [df_train, df_test]
df_merge[0]

# Map the ordinal age and stay buckets to integers
age_value = {'0-10': 0, '11-20': 1, '21-30': 2, '31-40': 3, '41-50': 4,
             '51-60': 5, '61-70': 6, '71-80': 7, '81-90': 8, '91-100': 9}
stay_value = {'0-10': 0, '11-20': 1, '21-30': 2, '31-40': 3, '41-50': 4,
              '51-60': 5, '61-70': 6, '71-80': 7, '81-90': 8, '91-100': 9,
              'More than 100 Days': 10}

df_merge[0]['Age'] = df_merge[0]['Age'].replace(age_value)
df_merge[0]['Stay'] = df_merge[0]['Stay'].replace(stay_value)
df_merge[1]['Age'] = df_merge[1]['Age'].replace(age_value)
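The bucket-to-integer mapping can be sanity-checked on a small toy frame. This sketch uses `Series.map` rather than `replace`, so any label missing from the dictionary shows up as NaN instead of passing through silently. The toy DataFrame, the `'unknown'` label and the `Age_code` column are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

age_value = {'0-10': 0, '11-20': 1, '21-30': 2, '31-40': 3, '41-50': 4,
             '51-60': 5, '61-70': 6, '71-80': 7, '81-90': 8, '91-100': 9}

# Toy data with one label that is not in the mapping
toy = pd.DataFrame({'Age': ['0-10', '41-50', '91-100', 'unknown']})

# map() returns NaN for labels that are missing from the dictionary
toy['Age_code'] = toy['Age'].map(age_value)

# Count the rows that could not be encoded
unmapped_count = int(toy['Age_code'].isna().sum())
print(toy)
print('Unmapped rows:', unmapped_count)
```

A non-zero count is a hint that the data contains buckets the mapping does not cover.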
An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on the difficult cases.
ada_classifier = AdaBoostClassifier(n_estimators=5)
ada_classifier.fit(X_train, y_train)
pred_ada = ada_classifier.predict(X_test)
# 12-fold cross-validation (note that it is run on the test split here)
scores = cross_val_score(ada_classifier, X_test, y_test, cv=12)
print('Accuracy score',round(scores.mean() * 100,2))
print('Confusion Matrix\n',confusion_matrix(y_test, pred_ada))
Accuracy score 33.8
A gradient boosting algorithm works by building simple (weak) prediction models sequentially, where each model tries to predict the error left over by the previous models. It is used in search, recommendation systems, personal assistants, self-driving cars, and weather prediction.
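The "fit the leftover error" loop can be sketched in a few lines for a regression problem with squared error. This is a minimal illustration on synthetic data using scikit-learn's `DecisionTreeRegressor` as the weak learner; the data, the tree depth and the number of rounds are arbitrary assumptions, and it is not how CatBoost is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
n_rounds = 20

# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # Residuals are the errors left over by the ensemble so far
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add a damped version of the new tree's correction
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print('MSE before boosting:', initial_mse)
print('MSE after boosting:', final_mse)
```

Each shallow tree is weak on its own, but because every round targets what the previous rounds got wrong, the training error shrinks as trees are added.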
Key CatBoostClassifier parameters:
- iterations - An integer specifying the number of trees to train. The default is 1000.
- learning_rate - The step size used during training. The default is 0.03, unless CatBoost selects a value automatically based on the dataset and the number of iterations.
- l2_leaf_reg - A float specifying the coefficient of the L2 regularization term of the loss function. The default value is 3.
- loss_function - A string specifying the metric optimized during training, for example 'MultiClass'. The gradient boosting algorithm tries to minimize or maximize this value, depending on the metric.
- eval_metric - A string specifying the metric evaluated on the evaluation set during training, used for overfitting detection and best-model selection. It accepts the loss function names as well as additional metrics such as 'Accuracy'.
model = CatBoostClassifier(iterations=1000,
                           learning_rate=0.3,
                           depth=10,
                           l2_leaf_reg=3,
                           random_strength=2,
                           loss_function='MultiClass',
                           eval_metric='MultiClass')
# eval_dataset is a catboost Pool holding the validation features and labels (its construction is not shown here)
model.fit(X_train,
          y_train,
          eval_set=eval_dataset,
          verbose=True)
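print(model.get_best_score())
cm = get_confusion_matrix(model, eval_dataset)
print(cm)
# Note: this ratio uses only the top-left 2x2 block of the matrix, i.e. the first two
# of the 11 stay classes. Overall multiclass accuracy would be np.trace(cm) / cm.sum().
predict_accuracy_on_test_set = (cm[0,0] + cm[1,1]) / (cm[0,0] + cm[1,1] + cm[1,0] + cm[0,1])
ax = sns.heatmap(cm, linewidths=1)
plt.show()
print("catboost Acc : ", predict_accuracy_on_test_set)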
Accuracy : 0.40188104509483735
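For a problem with more than two classes, overall accuracy is the sum of the confusion matrix diagonal divided by the sum of all entries. The sketch below shows this on a small made-up 3x3 matrix; the numbers are illustrative and are not results from the healthcare dataset. The per-class figure assumes that rows hold the true classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (assumed: rows = true class, columns = predicted class)
cm = np.array([[50, 10,  5],
               [ 8, 30, 12],
               [ 4,  6, 25]])

# Correct predictions sit on the diagonal
overall_accuracy = np.trace(cm) / cm.sum()

# Fraction of each true class that was predicted correctly
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print('Overall accuracy:', overall_accuracy)
print('Per-class recall:', per_class_recall)
```

Looking at the per-class values alongside the overall figure helps when classes are imbalanced, as length-of-stay buckets often are.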
References:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
https://github.com/catboost/catboost
Photo by Jen Theodore on Unsplash
Cross-posting on https://dev.to/swapanroy/boosting-algorithms-1jmb
Revision date: 12/28/2022