Types Of Ensemble Methods
Ensemble Methods can be used for various reasons, mainly to:
- Decrease Variance (Bagging)
- Decrease Bias (Boosting)
- Improve Predictions (Stacking)
Ensemble Methods can also be divided into two groups:
- Sequential Learners, where models are generated one after another and each new model learns from the mistakes of its predecessors. This exploits the dependency between models, typically by giving mislabeled examples higher weights (e.g. AdaBoost).
- Parallel Learners, where base models are generated independently and in parallel. This exploits the independence between models by averaging out their individual errors (e.g. Random Forest). A minimal sketch contrasting the two approaches follows below.
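To make the contrast concrete, here is a minimal sketch that trains one parallel learner (Random Forest) and one sequential learner (AdaBoost) from scikit-learn on a synthetic dataset. The dataset, hyperparameters, and accuracy comparison are illustrative choices only, not a prescribed recipe.

```python
# Minimal sketch: a parallel ensemble (Random Forest) vs. a sequential one (AdaBoost).
# Dataset and hyperparameters are arbitrary, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Parallel learner: trees are built independently and their votes are combined.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Sequential learner: each new weak model focuses on examples the previous ones got wrong.
ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Random Forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))
print("AdaBoost accuracy:     ", accuracy_score(y_test, ada.predict(X_test)))
```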
Boosting in Ensemble Methods
Just as humans learn from their mistakes and try not to repeat them, boosting works the same way. You start by creating a model from the training data. Then you create a second model that tries to correct the errors of the first, and continue adding models in this fashion.
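One concrete way to "learn from the previous model's errors" is to fit the next model to the residuals of the current one, which is the view taken by gradient boosting (AdaBoost, covered below, instead reweights the training examples). The sketch below shows two rounds of this idea on made-up regression data; the dataset and tree depth are assumptions for illustration.

```python
# Sketch of sequential error correction via residual fitting (the gradient-boosting view).
# Data and tree depth are illustrative choices.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Model 1: a weak learner fit to the raw targets.
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)          # the first model's mistakes

# Model 2: trained to predict those mistakes.
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# Combined prediction: first guess plus the correction learned from its errors.
y_pred = tree1.predict(X) + tree2.predict(X)
print("MSE of model 1 alone:  ", np.mean((y - tree1.predict(X)) ** 2))
print("MSE after correction:  ", np.mean((y - y_pred) ** 2))
```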
Types of Boosting Algorithms
- AdaBoost (Adaptive Boosting)
- Gradient Tree Boosting
- XGBoost
AdaBoost
AdaBoost can be applied on top of any base classifier to learn from its shortcomings and produce a more accurate model; for this reason it is often called the "best out-of-the-box classifier". In practice the base learner is usually a decision stump, a one-level decision tree.
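As a sketch of what "on top of any classifier" means in practice, the snippet below wraps scikit-learn's AdaBoostClassifier around an explicit decision stump and compares it with the stump alone. The dataset and the number of estimators are illustrative; note that scikit-learn versions before 1.2 name the argument `base_estimator` rather than `estimator`.

```python
# Sketch: AdaBoost layered on top of a chosen base classifier (here a decision stump).
# scikit-learn < 1.2 uses the argument name `base_estimator` instead of `estimator`.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)                  # a weak learner on its own
ada = AdaBoostClassifier(estimator=stump, n_estimators=200)  # many stumps, each correcting the last

print("Stump alone:", cross_val_score(stump, X, y, cv=5).mean())
print("AdaBoost:   ", cross_val_score(ada, X, y, cv=5).mean())
```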
AdaBoost assigns a weight to each training example to determine its significance during training. Data points with high weights have a larger say in the next model that is trained; data points with low weights have minimal influence.
Initially, every data point receives the same sample weight w: $$ w = \frac{1}{N} \in [0, 1] $$ where N is the total number of data points. After training a stump, we calculate its actual influence in classifying the data points using the formula: $$ \alpha_t = \frac{1}{2} \ln \frac{1-TotalError}{TotalError} $$ Alpha is how much influence this stump will have in the final classification. Total Error is the sum of the weights of the misclassified data points; on the first round, when every weight equals 1/N, this reduces to the number of misclassifications divided by the training set size.
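As a worked example of these two formulas, the snippet below initializes the weights for a small training set and computes alpha for an assumed Total Error; the numbers are made up purely for illustration.

```python
# Worked example of the initial weights and the stump's influence (alpha).
# The error value is assumed for illustration.
import numpy as np

N = 10                    # number of training points
w = np.full(N, 1.0 / N)   # initial sample weights, each 1/N
print(w)                  # [0.1 0.1 ... 0.1]

total_error = 0.3         # assumed weighted error of the stump (3 of 10 points, equal weights)
alpha = 0.5 * np.log((1 - total_error) / total_error)
print(alpha)              # ~0.4236: this stump gets a positive, moderate say
```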
After computing alpha for the stump, we update the sample weights, which we had initially taken as 1/N for every data point, using the following formula: $$ w_i^{new} = w_i^{old} \times e^{\pm \alpha_t} $$ The exponent is $+\alpha_t$ for points the stump misclassified, so their weights increase, and $-\alpha_t$ for points it classified correctly, so their weights decrease. The weights are then normalized so that they sum to 1 before the next stump is trained.
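Continuing the toy example above, here is a sketch of the weight update and renormalization. Which points are misclassified is assumed for illustration.

```python
# Continuing the toy example: update and renormalize the sample weights.
# The set of misclassified points is assumed for illustration.
import numpy as np

N = 10
w = np.full(N, 1.0 / N)
alpha = 0.4236

misclassified = np.zeros(N, dtype=bool)
misclassified[:3] = True                 # pretend the stump got the first 3 points wrong

# e^{+alpha} for mistakes (weight grows), e^{-alpha} for correct points (weight shrinks).
w = w * np.exp(np.where(misclassified, alpha, -alpha))
w = w / w.sum()                          # renormalize so the weights sum to 1

print(w)  # misclassified points now carry ~0.167 each, correct ones ~0.071 each
```

After this update the misclassified points carry a larger share of the total weight, so the next stump is pushed to get them right.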