Introduction to Ensemble Learning

Learning Objectives

After this unit, students should be able to

  • appreciate the need for combining diverse machine learning models.
  • describe ways to achieve diversity among machine learning models.
  • differentiate between weak and unstable learners.
  • compare and contrast bagging and boosting.

Different machine learning algorithms come with different assumptions about the patterns in the underlying data. These assumptions can limit the algorithm's effectiveness if they don't match the real-world data. Additionally, data can be complex, with some features having simple relationships and others having more complicated ones. A single algorithm might not be able to handle all these complexities. Even if an algorithm performs well on test data, it might still be biased by the training data. To address these problems, we can use multiple algorithms that look for different patterns in different parts of the data. This can help reduce the impact of assumptions and make the model more adaptable to new data.

Diversity

Ensemble models in machine learning refer to methods that combine predictions from multiple models to produce a more accurate and robust prediction than any single model could achieve alone. The strength of an ensemble depends on how diverse the individual models are. If all models are similar, the ensemble will not offer much improvement over individual models. Diversity can be achieved through one or more of the following means, each of which enhances the effectiveness of the ensemble.

  • Diverse Models. Ensemble models reduce the inductive bias from individual model assumptions by incorporating diverse models in them. Using diverse model types allows the ensemble to better capture different data patterns and improve its ability to generalize to new, unseen data. This can involve a mix of linear and non-linear models, parametric and non-parametric models, and models with both high bias and high variance. These combinations help create a more balanced and robust ensemble model.

  • Diverse Hyperparameters. Diversity in ensembles can be achieved by adjusting hyperparameters within the same model type. This involves training multiple models with different hyperparameter configurations, resulting in models that behave differently even when trained on the same dataset. This strategy enables the ensemble to capture a wider range of patterns, improving generalisation and robustness. For instance, decision trees can be trained with varying values of max_depth, gradient descent models can use different learning rates, and Bayesian models can employ diverse prior distributions (a minimal sketch follows this list).

  • Diverse Representations. Diversity in ensembles can be achieved by using different data representations, altering how the input data is processed and fed into the models. Each model in the ensemble may see a different version of the data, either through transformations, feature selection, or modifications of the input data. This approach helps the ensemble capture a wider variety of patterns, relationships, and structures in the data, leading to improved generalisation and robustness.

  • Diverse Data. Diversity in ensembles can be achieved by using different samples from the same population. This ensures that each model learns different patterns, which leads to a more diverse set of predictions when these models are combined. We will learn more about this in the next unit.
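
As an illustration of the hyperparameter route to diversity, the sketch below trains several decision trees that differ only in their max_depth setting. It assumes scikit-learn is available; the toy dataset, the depth values, and the random seeds are arbitrary choices for illustration.

```python
# A minimal sketch of hyperparameter diversity: several decision trees
# trained on the same data but with different max_depth values.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same model type, different hyperparameters -> different behaviour.
ensemble = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
            for d in (1, 3, 5, None)]

for model in ensemble:
    print(model.get_depth(), model.score(X_test, y_test))
```

Shallow trees tend to underfit and deep trees tend to overfit, so their errors differ, which is exactly the kind of disagreement an ensemble can exploit.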

Voting

Voting in ensemble models refers to a technique used to combine the predictions from multiple models in an ensemble to make a final decision or prediction. The individual models make their own predictions, and the ensemble then aggregates these predictions in a way that typically improves the overall accuracy, robustness, and generalisation of the final model. There are two types of voting strategies:

  • Majority Voting. Majority voting in ensemble models is a simple and commonly used technique for combining the predictions of multiple models in classification problems. It works by having each model in the ensemble make a prediction (or classify a data point into a class), and the final prediction is the class that receives the majority of votes from the individual models. If a tie occurs (equal number of votes for different classes), different tie-breaking strategies can be applied (e.g., random selection, or using the prediction from a pre-specified "stronger" model).

  • Averaging. Averaging in ensemble models is a technique primarily used for combining the predictions of multiple models, particularly in regression tasks. Unlike majority voting, which is used for classification, averaging aggregates predictions by calculating the mean (or average) of the outputs from all models in the ensemble. In cases where the models in the ensemble differ in their effectiveness, weighted averaging can be used in place of simple averaging. Both strategies are sketched below.
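
The following sketch shows majority (hard) voting with scikit-learn's VotingClassifier and simple/weighted averaging with numpy. The choice of base classifiers, the toy dataset, and the weights are illustrative assumptions rather than recommendations.

```python
# A minimal sketch of both voting strategies, assuming scikit-learn and numpy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Majority (hard) voting over three diverse classifiers.
voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    voting="hard",
)
voter.fit(X, y)
print(voter.predict(X[:5]))

# Averaging for regression: mean of the individual model outputs.
predictions = np.array([[2.1, 1.9, 2.0],    # model 1's predictions
                        [2.5, 1.7, 2.2],    # model 2's predictions
                        [2.3, 1.8, 2.1]])   # model 3's predictions
print(predictions.mean(axis=0))                                   # simple averaging
print(np.average(predictions, axis=0, weights=[0.5, 0.3, 0.2]))   # weighted averaging
```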

Terminology

Machine learning models are typically referred to as learners in the ensemble learning literature. We will use the two terms interchangeably in this section.

Bagging

Bagging (Bootstrap Aggregating) is an ensemble learning technique used to improve the accuracy and stability of machine learning algorithms by reducing variance and preventing overfitting. Bagging trains multiple models (often of the same type) on different bootstrapped subsets of the training data, usually of the same size as the original dataset, and then combines their predictions to produce a final output.

Bootstrap Sampling

Bootstrap sampling refers to drawing data points from the training data with replacement. This means some data points may be repeated in a subset, while others may be left out. Each model in the ensemble is trained on a different bootstrapped sample.
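
A minimal sketch of bootstrap sampling with numpy is shown below; the dataset of ten integers and the seed are purely illustrative.

```python
# Drawing a bootstrap sample: indices are sampled with replacement,
# and the sample size matches the original dataset.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                      # a tiny dataset of 10 points

indices = rng.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[indices]

print(bootstrap_sample)                   # some points repeat...
print(set(data) - set(bootstrap_sample))  # ...and some are left out
```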

Bagging algorithms work well on unstable learners. A machine learning model is said to be unstable if a small change in the training data results in a large deviation in the output of the learner. In other words, unstable learners are machine learning models that tend to overfit the training data. Since each unstable learner is exposed to different data, each learner picks up the peculiar patterns of its respective sample. Additionally, when the training data is noisy or contains outliers, bagging can help by ensuring that different models in the ensemble are exposed to different data points, leading to more stable predictions. The training of the bagged models can be easily parallelised.
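
The sketch below, assuming scikit-learn, compares a single decision tree (an unstable learner) with a bagged ensemble of such trees; BaggingClassifier uses a decision tree as its default base estimator, and n_jobs=-1 trains the trees in parallel. The dataset and hyperparameters are illustrative.

```python
# A minimal bagging sketch: one unstable learner versus a bagged ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
bagged = BaggingClassifier(n_estimators=50,     # 50 trees, each on its own bootstrap sample
                           n_jobs=-1,           # train the trees in parallel
                           random_state=0).fit(X_train, y_train)

print("single tree :", single_tree.score(X_test, y_test))
print("bagged trees:", bagged.score(X_test, y_test))
```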

Random Forest

Random forest is a classic example of a variant of bagged decision trees. It builds multiple decision trees, each on a different bootstrapped sample of the data. At each split, the trees also consider only a random subset of the features, which further increases diversity. The final prediction is made by majority voting (for classification) or averaging (for regression) across all the trees.
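
A minimal random forest sketch, assuming scikit-learn, is shown below; max_features controls the per-split feature subsampling and bootstrap enables the per-tree bootstrap samples. The dataset and hyperparameter values are illustrative.

```python
# Random forest: bagged decision trees plus random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=200,
                                max_features="sqrt",   # random feature subset per split
                                bootstrap=True,        # bootstrap sample per tree
                                random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```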

Boosting

Boosting is an ensemble learning technique that focuses on converting weak learners into strong learners by training them sequentially. Unlike bagging, which trains models independently and in parallel, boosting trains models in sequence, with each new model trying to correct the errors made by the previous models. In contrast to unstable learners, weak learners are machine learning models that perform only marginally better than random guessing.

Boosting starts with a weak learner trained on the entire training dataset. Focus is then given to the datapoints that are misclassified by this model. A new model is trained to correct the errors made by the previous model. This process continues sequentially, with each model focusing more on the mistakes made by earlier models. The final prediction is typically made by computing a weighted sum over the outputs of the individual models.

AdaBoost

The AdaBoost (Adaptive Boosting) algorithm assigns a weight to each training datapoint. Initially, all datapoints are equally weighted. After each round of training, the weights of misclassified points are increased, and the next model is trained on this re-weighted dataset. The goal is to focus on the harder-to-classify instances as boosting progresses.
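
A minimal AdaBoost sketch, assuming scikit-learn, is given below; its default base learner is a depth-1 decision tree (a decision stump), a typical weak learner. The number of estimators and the learning rate are illustrative choices.

```python
# AdaBoost: each round re-weights misclassified points before fitting the next stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```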

Gradient Boosting

Gradient Boosting trains new models to correct the residual errors (the difference between the true values and the predicted values) of the previous models. Each subsequent model is trained to minimize the errors using a gradient descent approach, where the new model boosts the performance of the ensemble by reducing the overall error.
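
To make the residual-fitting idea concrete, the sketch below implements gradient boosting for squared-error loss by hand, fitting each new tree to the current residuals. It assumes scikit-learn and numpy; a library implementation such as scikit-learn's GradientBoostingRegressor follows the same principle with more machinery. The toy data, tree depth, learning rate, and number of rounds are arbitrary illustrative choices.

```python
# Hand-rolled gradient boosting with squared-error loss:
# for this loss the negative gradient is simply the residual y - prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant prediction
trees = []

for _ in range(100):
    residuals = y - prediction                       # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # nudge predictions toward y
    trees.append(tree)

print("training mean squared error:", np.mean((y - prediction) ** 2))
```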