Introduction

Ensemble methods are techniques often used to solve complex machine learning problems. In simple terms, ensembling combines several different, independently trained models (often called "weak learners") to produce a single outcome. The hypothesis is that the combined model generalizes better than any individual model, that is, it has a lower generalization error. Three of the most popular approaches to ensemble modeling are Bagging, Boosting, and Voting.
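The intuition behind this hypothesis can be checked with a little arithmetic. The sketch below assumes an idealized binary setting that rarely holds in practice: every model is correct with the same probability, and the models make their errors independently. The function name `majority_vote_accuracy` is hypothetical and is not part of scikit-learn.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is correct with the same probability and that the
    models' errors are independent of one another.
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )

# Compare a single model with ensembles of 3 and 11 models, each 70% accurate.
for n in (1, 3, 11):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

Under these assumptions the majority is right more often than any single member. Real models trained on the same data tend to make correlated errors, so the gain in practice is usually smaller.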

In this guide, you will learn how to implement the following ensemble modeling techniques using scikit-learn:

- Bagged Decision Trees
- Random Forest
- AdaBoost
- Stochastic Gradient Boosting
- Voting

We will begin by understanding the Problem Statement and the data.

In this guide, we will try to recognize letters, which is one of the earliest applications of machine learning. In this problem, we will build a model that uses statistics of images of four letters in the Roman alphabet - A, B, P, and R - to predict which letter an image corresponds to.

The data comes from the UCI Machine Learning Repository, and contains 3116 records of 17 variables:

- letter - The letter that the image corresponds to (A, B, P or R). This is the target variable.
- xbox - The horizontal position of where the smallest box covering the letter shape begins.
- ybox - The vertical position of where the smallest box covering the letter shape begins.
- width - The width of the smallest box.
- height - The height of the smallest box.
- onpix - The total number of "on" pixels in the character image.
- xbar - The mean horizontal position of all of the "on" pixels.
- ybar - The mean vertical position of all of the "on" pixels.
- x2bar - The mean squared horizontal position of all of the "on" pixels in the image.
- y2bar - The mean squared vertical position of all of the "on" pixels in the image.
- xybar - The mean of the product of the horizontal and vertical position of all of the "on" pixels in the image.
- x2ybar - The mean of the product of the squared horizontal position and the vertical position of all of the "on" pixels.
- xy2bar - The mean of the product of the horizontal position and the squared vertical position of all of the "on" pixels.
- xedge - The mean number of edges (the number of times an "off" pixel is followed by an "on" pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image.
- xedgeycor - The mean of the product of the number of horizontal edges at each vertical position and the vertical position.
- yedge - The mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image.
- yedgexcor - The mean of the product of the number of vertical edges at each horizontal position and the horizontal position.
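To make a few of these definitions concrete, the sketch below computes raw versions of `onpix`, `xbar`, `ybar`, and `xybar` for a tiny binary image. The 4x4 image and the function `pixel_statistics` are hypothetical, and the sketch follows the descriptions above literally; it does not reproduce whatever scaling the dataset applies to arrive at its integer values.

```python
# A tiny hypothetical 4x4 binary image (1 = "on" pixel), not taken from the dataset.
image = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

def pixel_statistics(img):
    """Return raw onpix, xbar, ybar and xybar for a binary image.

    x is the column index and y is the row index of each "on" pixel.
    """
    on = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    n = len(on)
    return {
        "onpix": n,                                # number of "on" pixels
        "xbar": sum(x for x, _ in on) / n,         # mean horizontal position
        "ybar": sum(y for _, y in on) / n,         # mean vertical position
        "xybar": sum(x * y for x, y in on) / n,    # mean of the x*y product
    }

stats = pixel_statistics(image)
print(stats)
```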

We will work through the following steps:

*Step 1 - Loading the required libraries and modules.*

*Step 2 - Loading the data and performing basic data checks.*

*Step 3 - Creating arrays for the features and the response variable.*

*Step 4 - Building and evaluating a single algorithm.*

*Step 5 - Building, predicting and evaluating the various ensemble models.*

The following sections will cover these steps.

```python
# Import required libraries
import pandas as pd
import numpy as np

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
```

The *first line of code* reads in the data as a pandas DataFrame, while the *second line* prints its shape: 3116 observations of 17 variables. The *third line* displays the first 10 records. Note that this is a multi-class classification problem with four classes, or letters, to predict: A, B, P, and R.

```python
# Load data
df = pd.read_csv('letters.csv')
print(df.shape)
df.head(10)
```

Output:

```
(3116, 17)
```

|   | letter | xbox | ybox | width | height | onpix | xbar | ybar | x2bar | y2bar | xybar | x2ybar | xy2bar | xedge | xedgeycor | yedge | yedgexcor |
|---|--------|------|------|-------|--------|-------|------|------|-------|-------|-------|--------|--------|-------|-----------|-------|-----------|
| 0 | B | 4 | 2 | 5 | 4 | 4 | 8 | 7 | 6 | 6 | 7 | 6 | 6 | 2 | 8 | 7 | 10 |
| 1 | A | 1 | 1 | 3 | 2 | 1 | 8 | 2 | 2 | 2 | 8 | 2 | 8 | 1 | 6 | 2 | 7 |
| 2 | R | 5 | 9 | 5 | 7 | 6 | 6 | 11 | 7 | 3 | 7 | 3 | 9 | 2 | 7 | 5 | 11 |
| 3 | B | 5 | 9 | 7 | 7 | 10 | 9 | 8 | 4 | 4 | 6 | 8 | 6 | 6 | 11 | 8 | 7 |
| 4 | P | 3 | 6 | 4 | 4 | 2 | 4 | 14 | 8 | 1 | 11 | 6 | 3 | 0 | 10 | 4 | 8 |
| 5 | R | 8 | 10 | 8 | 6 | 6 | 7 | 7 | 3 | 5 | 8 | 4 | 8 | 6 | 6 | 7 | 7 |
| 6 | R | 2 | 6 | 4 | 4 | 3 | 6 | 7 | 5 | 5 | 6 | 5 | 7 | 3 | 7 | 5 | 8 |
| 7 | A | 3 | 7 | 5 | 5 | 3 | 12 | 2 | 3 | 2 | 10 | 2 | 9 | 2 | 6 | 3 | 8 |
| 8 | P | 8 | 14 | 7 | 8 | 4 | 5 | 10 | 6 | 3 | 12 | 5 | 4 | 4 | 10 | 4 | 8 |
| 9 | P | 6 | 10 | 8 | 8 | 7 | 8 | 5 | 7 | 5 | 7 | 6 | 6 | 3 | 9 | 8 | 9 |

The *first line of code* creates an array, 'y', holding the target variable. The *second line* creates an array, 'x', holding all the features, that is, every column except the target variable 'letter'.

```python
# Create arrays for the features and the response variable
y = df['letter'].values
x = df.drop('letter', axis=1).values
```

The goal of ensemble modeling is to improve on the performance of an individual model by combining multiple models. So, we will first set a baseline by building a single model. In our case, this is a logistic regression classifier.

The *first line of code* creates the training and test sets, with the 'test_size' argument specifying the fraction of the data to be held out for testing (here 0.3, or 30 percent). The *second line* instantiates the logistic regression model, while the *third line* fits it on the training data. The *fourth line* generates predictions on the test data, while the *fifth to seventh lines of code* print the confusion matrix, the classification report, and the accuracy computed from the confusion matrix.

```python
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
(216+221+214+219)/(227+244+224+240)
```

Output:

```
[[216   3   1   7]
 [  0 221   0  23]
 [  2   4 214   4]
 [  0  21   0 219]]
             precision    recall  f1-score   support

          A       0.99      0.95      0.97       227
          B       0.89      0.91      0.90       244
          P       1.00      0.96      0.97       224
          R       0.87      0.91      0.89       240

avg / total       0.93      0.93      0.93       935

0.93048128342246
```

We see that the accuracy of the single model on the test set is 93 percent. We will now build various ensemble models and see whether they improve on this performance.
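The figures in the classification report can be recomputed by hand from the confusion matrix, which shows where the 93 percent comes from. The sketch below hard-codes the matrix printed in the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
labels = ["A", "B", "P", "R"]

# Confusion matrix copied from the output: rows are true letters, columns are predictions.
cm = [
    [216, 3, 1, 7],
    [0, 221, 0, 23],
    [2, 4, 214, 4],
    [0, 21, 0, 219],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal = correct predictions
accuracy = correct / total

# Recall divides by the row total (true count); precision by the column total (predicted count).
recall = {lab: cm[i][i] / sum(cm[i]) for i, lab in enumerate(labels)}
precision = {lab: cm[i][i] / sum(row[i] for row in cm) for i, lab in enumerate(labels)}

print(f"accuracy = {accuracy:.4f}")
for lab in labels:
    print(lab, f"precision={precision[lab]:.2f}", f"recall={recall[lab]:.2f}")
```

The off-diagonal entries show that most errors are confusions between B and R, which is why those two letters have the lowest precision and recall.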

Bagging, or Bootstrap Aggregation, is an ensemble method that trains the same algorithm many times on different bootstrap samples, that is, samples drawn with replacement from the training data. The final prediction aggregates the predictions of all the sub-models, by averaging for regression or by voting for classification. The two most popular bagging ensemble techniques are Bagged Decision Trees and Random Forest.

This method performs best with algorithms that have high variance, such as decision trees. In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator.
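The resampling step that gives bagging its name can be sketched with the standard library alone. The helper `bootstrap_sample` is hypothetical, and the list of integers merely stands in for the row indices of a training set; this is an illustration of the idea, not how BaggingClassifier is implemented.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(10)      # fixed seed so the sketch is repeatable
rows = list(range(1000))     # stand-in for the indices of 1,000 training rows

# Each sub-model would be trained on its own bootstrap sample.
samples = [bootstrap_sample(rows, rng) for _ in range(5)]
for i, s in enumerate(samples):
    unique_share = len(set(s)) / len(rows)
    print(f"sample {i}: {len(s)} rows, {unique_share:.0%} of the original rows appear")
```

Because rows are drawn with replacement, each sample contains on average about 63 percent of the distinct original rows, so every sub-model sees a slightly different view of the data.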

The *first line of code* creates the kfold cross validation framework. The *second line* instantiates the BaggingClassifier() model, with Decision Tree as the base estimator and 100 as the number of trees. The *third line* generates the cross validated scores on the data, while the *fourth line* prints the mean cross-validation accuracy score.

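```python
# Bagged Decision Trees for Classification
# random_state is omitted from KFold: it has no effect without shuffle=True,
# and recent scikit-learn versions raise an error if it is set
kfold = model_selection.KFold(n_splits=10)
# The base estimator is passed positionally because its keyword name differs across scikit-learn versions
model_1 = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=10)
results_1 = model_selection.cross_val_score(model_1, x, y, cv=kfold)
print(results_1.mean())
```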

Output:

`0.971121897931`

The accuracy of the BaggingClassifier ensemble is 97.11 percent, a significant improvement over the single logistic regression model.
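The score above is the mean over 10 folds. How an unshuffled 10-fold split partitions the 3116 rows can be sketched as follows. The generator `kfold_indices` is a hypothetical stand-in meant to mirror the behaviour of contiguous, unshuffled folds; it is not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(3116, 10))
print([len(test) for _, test in folds])
```

Each row is used for testing exactly once and for training in the other nine folds, so the mean score uses all of the data for evaluation.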

Random Forest is an extension of bagged decision trees. As in bagging, each tree is trained on a sample of the training dataset taken with replacement. In addition, each split in a tree considers only a random subset of the features, which reduces the correlation between the individual decision trees. In scikit-learn, a random forest model is constructed by using the RandomForestClassifier class.

The *first line of code* creates the kfold cross validation object. The *second line* instantiates the RandomForestClassifier() ensemble. The *third line* generates the cross validated scores on the data, while the *fourth line* prints the mean cross-validation accuracy score.

```python
# Random Forest Classification
# random_state is omitted from KFold because it only applies when shuffle=True
kfold_rf = model_selection.KFold(n_splits=10)
model_rf = RandomForestClassifier(n_estimators=100, max_features=5)
results_rf = model_selection.cross_val_score(model_rf, x, y, cv=kfold_rf)
print(results_rf.mean())
```

Output:

`0.9890943194`

The accuracy of the RandomForestClassifier ensemble is 98.90 percent, a significant improvement over the other models.
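The `max_features=5` argument limits how many of the 16 features each split may consider. The sketch below illustrates only that sampling step; `candidate_features` is a hypothetical helper, and the actual tree-building logic inside RandomForestClassifier is not shown.

```python
import random

feature_names = [
    "xbox", "ybox", "width", "height", "onpix", "xbar", "ybar", "x2bar",
    "y2bar", "xybar", "x2ybar", "xy2bar", "xedge", "xedgeycor", "yedge", "yedgexcor",
]

def candidate_features(names, max_features, rng):
    """Pick the random subset of features one split is allowed to consider."""
    return rng.sample(names, max_features)

rng = random.Random(10)   # fixed seed so the sketch is repeatable
for split in range(3):
    print(f"split {split}:", candidate_features(feature_names, 5, rng))
```

Since different splits see different candidate features, the trees end up less alike than trees grown on bootstrap samples alone.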

In Boosting, multiple models are trained sequentially, and each model learns from the errors of its predecessors. In this guide, we will implement two boosting techniques: AdaBoost and Stochastic Gradient Boosting.

AdaBoost, short for 'Adaptive Boosting', is the first practical boosting algorithm proposed by Freund and Schapire in 1996. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one.
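How AdaBoost "adapts" can be seen in a single reweighting step. The sketch below implements the classic discrete update for a binary problem; the function `adaboost_round` and the example inputs are hypothetical. scikit-learn's AdaBoostClassifier handles the multi-class case and differs in detail, so treat this as an illustration of the idea only.

```python
import math

def adaboost_round(weights, is_correct):
    """One discrete AdaBoost reweighting step for a binary problem.

    weights: current sample weights, summing to 1.
    is_correct: whether the current weak learner classified each sample correctly.
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    if not 0 < err < 0.5:
        raise ValueError("the weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples are scaled up, correctly classified ones are scaled down.
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

weights = [0.1] * 10
is_correct = [True] * 8 + [False] * 2   # hypothetical learner: 2 mistakes out of 10
alpha, weights = adaboost_round(weights, is_correct)
print(round(alpha, 3), [round(w, 4) for w in weights])
```

After the update, the two misclassified samples carry half of the total weight, so the next weak learner is pushed to get them right.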

In scikit-learn, an AdaBoost model is constructed by using the AdaBoostClassifier class. After the import statement, the *first line of code* creates the kfold cross validation object. The *second line* instantiates the AdaBoostClassifier() ensemble. The *third line* generates the cross validated scores on the data, while the *fourth line* prints the mean cross-validation accuracy score.

```python
from sklearn.ensemble import AdaBoostClassifier
# random_state is omitted from KFold because it only applies when shuffle=True
kfold_ada = model_selection.KFold(n_splits=10)
model_ada = AdaBoostClassifier(n_estimators=30, random_state=10)
results_ada = model_selection.cross_val_score(model_ada, x, y, cv=kfold_ada)
print(results_ada.mean())
```

Output:

`0.848199563031`

The accuracy of the AdaBoostClassifier ensemble is 84.82 percent, which is lower than the other models.

In scikit-learn, a stochastic gradient boosting model is constructed by using the GradientBoostingClassifier class. Note that with the default `subsample=1.0` every tree is fit on the full training set; setting `subsample` below 1.0 gives the stochastic variant. The steps to perform this ensembling technique are almost exactly like the ones discussed above, the only exception being the line that instantiates the model.

Running the code and looking at the output below, we observe that the accuracy of the ensemble is 98.39 percent.

```python
from sklearn.ensemble import GradientBoostingClassifier
# random_state is omitted from KFold because it only applies when shuffle=True
kfold_sgb = model_selection.KFold(n_splits=10)
model_sgb = GradientBoostingClassifier(n_estimators=100, random_state=10)
results_sgb = model_selection.cross_val_score(model_sgb, x, y, cv=kfold_sgb)
print(results_sgb.mean())
```

Output:

`0.983958900157`
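The idea of each model learning from its predecessors' errors is easiest to see in a regression setting with squared loss, where every new model is fit to the current residuals. The sketch below is a deliberately simplified illustration using one-split "stumps" on toy one-dimensional data. All names (`fit_stump`, `gradient_boost`) and the data are hypothetical, no subsampling is done, and this is not how GradientBoostingClassifier is implemented.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:   # drop the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps sequentially, each one to the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps, history = [], [mse(ys, preds)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return base, stumps, history

xs = list(range(10))
ys = [x * x for x in xs]              # toy target
base, stumps, history = gradient_boost(xs, ys)
print(f"training MSE: {history[0]:.2f} -> {history[-1]:.4f}")
```

The training error shrinks with every round because each stump corrects part of what the previous ones got wrong; the learning rate controls how large each correction is.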

Voting is a simple but extremely effective ensemble technique that works by combining the predictions from multiple machine learning algorithms. In scikit-learn, it is constructed by using the VotingClassifier class.

The *first line of code* creates the kfold cross validation object. The *second to eighth lines of code* instantiate three models - Logistic Regression, Decision Tree, and Support Vector Machine - and append them to a list named 'estimators'.

The *ninth line* instantiates the VotingClassifier() ensemble. The *tenth line* generates the cross validated scores on the data, while the *last line of code* prints the mean cross-validation accuracy score.

Looking at the output below, we observe that the accuracy of the ensemble is 98.52 percent.

```python
# random_state is omitted from KFold because it only applies when shuffle=True
kfold_vc = model_selection.KFold(n_splits=10)

# Lines 2 to 8
estimators = []
mod_lr = LogisticRegression()
estimators.append(('logistic', mod_lr))
mod_dt = DecisionTreeClassifier()
estimators.append(('cart', mod_dt))
mod_sv = SVC()
estimators.append(('svm', mod_sv))

# Lines 9 to 11
ensemble = VotingClassifier(estimators)
results_vc = model_selection.cross_val_score(ensemble, x, y, cv=kfold_vc)
print(results_vc.mean())
```

Output:

`0.985241982027`
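By default, VotingClassifier uses "hard" voting, where each model casts one vote per sample and the most common label wins. The sketch below shows that combination step on made-up predictions. The function `hard_vote` is hypothetical, and breaking ties in favour of the label that sorts first is an assumption made here for determinism.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    Ties go to the label that sorts first.
    """
    combined = []
    for votes in zip(*predictions_per_model):   # all models' votes for one sample
        counts = Counter(votes)
        combined.append(max(sorted(counts), key=counts.get))
    return combined

# Hypothetical predictions from three models for four images.
logistic = ["A", "B", "P", "R"]
cart = ["A", "R", "P", "R"]
svm = ["A", "B", "R", "B"]

print(hard_vote([logistic, cart, svm]))   # -> ['A', 'B', 'P', 'R']
```

In this example each model makes at most two mistakes on different images, and the vote recovers from all of them, which is the effect the ensemble relies on.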

In this guide, you have learned about Ensemble Modeling with scikit-learn. The performance of the models implemented in the guide is summarized below:

- Logistic Regression: Accuracy of 93 percent
- Bagged Decision Trees: Accuracy of 97.11 percent
- Random Forest: Accuracy of 98.90 percent
- AdaBoost: Accuracy of 84.82 percent
- Stochastic Gradient Boosting: Accuracy of 98.39 percent
- Voting Classifier: Accuracy of 98.52 percent

The single logistic regression model achieved a good accuracy of 93 percent, but every ensemble model except AdaBoost outperformed this benchmark, scoring more than 97 percent. Performance could be improved further through hyperparameter tuning or by trying different algorithms. However, the aim of this guide was to demonstrate how ensemble modeling can lead to better performance, which it did for this problem.

To learn more about building deep learning models using **keras**, please refer to the following guides:

1. Regression with Keras
2. Classification with Keras
