Ensemble Modeling with scikit-learn

In this guide, you will learn how to implement Bagged Decision Trees, Random Forest, AdaBoost, Stochastic Gradient Boosting, and Voting using scikit-learn.

Jun 11, 2019 • 16 Minute Read

Introduction

Ensemble methods are advanced techniques often used to solve complex machine learning problems. In simple terms, an ensemble combines several different models (often referred to as "weak learners" or base models) to produce a single outcome. The hypothesis is that combining multiple models produces better results by decreasing the generalization error. Three of the most popular methods for ensemble modeling are Bagging, Boosting, and Voting.
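To see why combining models can help, consider an idealized case that this guide does not analyze itself: n classifiers that each give the right answer with probability p and make their errors independently of one another. The sketch below computes how often a simple majority of them is right. The function name `majority_accuracy` and the independence assumption are illustrative only; real models trained on the same data are correlated, so the gain in practice is usually smaller.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes a binary decision, an odd n (so ties cannot occur), and that each
    voter is correct with the same probability p, independently of the others.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


# Accuracy of the majority vote for growing ensembles of 60%-accurate voters.
for n in (1, 5, 25, 101):
    print(n, round(majority_accuracy(0.6, n), 4))
```

Under these assumptions the majority becomes more reliable as voters are added, which is the intuition behind the methods covered in the rest of the guide.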

In this guide, you will learn how to implement the following ensemble modeling techniques using scikit-learn:

Bagged Decision Trees
Random Forest
AdaBoost
Stochastic Gradient Boosting
Voting

We will begin by understanding the Problem Statement and the data.

Problem Statement

In this guide, we will try to recognize letters, which is one of the earliest applications of machine learning. In this problem, we will build a model that uses statistics of images of four letters in the Roman alphabet - A, B, P, and R - to predict which letter an image corresponds to.

The data comes from the UCI Machine Learning Repository, and contains 3116 records of 17 variables:

letter - The letter that the image corresponds to (A, B, P or R). This is the target variable.
xbox - The horizontal position of where the smallest box covering the letter shape begins.
ybox - The vertical position of where the smallest box covering the letter shape begins.
width - The width of the smallest box.
height - The height of the smallest box.
onpix - The total number of "on" pixels in the character image.
xbar - The mean horizontal position of all of the "on" pixels.
ybar - The mean vertical position of all of the "on" pixels.
x2bar - The mean squared horizontal position of all of the "on" pixels in the image.
y2bar - The mean squared vertical position of all of the "on" pixels in the image.
xybar - The mean of the product of the horizontal and vertical position of all of the "on" pixels in the image.
x2ybar - The mean of the product of the squared horizontal position and the vertical position of all of the "on" pixels.
xy2bar - The mean of the product of the horizontal position and the squared vertical position of all of the "on" pixels.
xedge - The mean number of edges (the number of times an "off" pixel is followed by an "on" pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image.
xedgeycor - The mean of the product of the number of horizontal edges at each vertical position and the vertical position.
yedge - The mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image.
yedgexcor - The mean of the product of the number of vertical edges at each horizontal position and the horizontal position.
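To make a few of these definitions concrete, the sketch below computes some of them for a small, made-up binary image. The 7x5 grid is hypothetical and the values are in raw pixel coordinates; the dataset's own values appear to be rescaled integers, so this is only an illustration of the definitions, not a reproduction of how the dataset was built.

```python
# A hypothetical 7x5 binary image of a letter-like shape (1 = "on" pixel).
image = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]

# Coordinates (x = column, y = row) of every "on" pixel.
on_pixels = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
xs = [x for x, _ in on_pixels]
ys = [y for _, y in on_pixels]

onpix = len(on_pixels)                    # total number of "on" pixels
width = max(xs) - min(xs) + 1             # width of the smallest covering box
height = max(ys) - min(ys) + 1            # height of the smallest covering box
xbar = sum(xs) / onpix                    # mean horizontal position
ybar = sum(ys) / onpix                    # mean vertical position
x2bar = sum(x * x for x in xs) / onpix    # mean squared horizontal position

print(onpix, width, height, xbar, ybar, x2bar)
```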

Steps

This guide follows these steps:

Step 1 - Loading the required libraries and modules.

Step 2 - Loading the data and performing basic data checks.

Step 3 - Creating arrays for the features and the response variable.

Step 4 - Building and evaluating a single algorithm.

Step 5 - Building, predicting and evaluating the various ensemble models.

The following sections will cover these steps.

Step 1 - Loading the Required Libraries and Modules

# Import required libraries
import pandas as pd
import numpy as np

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
    

Step 2 - Reading the Data and Performing Basic Data Checks

The first line of code reads in the data as a pandas DataFrame, while the second line prints the shape: 3116 observations of 17 variables. The third line displays the first 10 records of the data. Note that this is a multi-class classification problem with 4 classes, or letters, to predict: A, B, P, and R.

# Load data
df = pd.read_csv('letters.csv')
print(df.shape)
df.head(10)
    

Output:

(3116, 17)

|   | letter | xbox | ybox | width | height | onpix | xbar | ybar | x2bar | y2bar | xybar | x2ybar | xy2bar | xedge | xedgeycor | yedge | yedgexcor |
|---|--------|------|------|-------|--------|-------|------|------|-------|-------|-------|--------|--------|-------|-----------|-------|-----------|
| 0 | B      | 4    | 2    | 5     | 4      | 4     | 8    | 7    | 6     | 6     | 7     | 6      | 6      | 2     | 8         | 7     | 10        |
| 1 | A      | 1    | 1    | 3     | 2      | 1     | 8    | 2    | 2     | 2     | 8     | 2      | 8      | 1     | 6         | 2     | 7         |
| 2 | R      | 5    | 9    | 5     | 7      | 6     | 6    | 11   | 7     | 3     | 7     | 3      | 9      | 2     | 7         | 5     | 11        |
| 3 | B      | 5    | 9    | 7     | 7      | 10    | 9    | 8    | 4     | 4     | 6     | 8      | 6      | 6     | 11        | 8     | 7         |
| 4 | P      | 3    | 6    | 4     | 4      | 2     | 4    | 14   | 8     | 1     | 11    | 6      | 3      | 0     | 10        | 4     | 8         |
| 5 | R      | 8    | 10   | 8     | 6      | 6     | 7    | 7    | 3     | 5     | 8     | 4      | 8      | 6     | 6         | 7     | 7         |
| 6 | R      | 2    | 6    | 4     | 4      | 3     | 6    | 7    | 5     | 5     | 6     | 5      | 7      | 3     | 7         | 5     | 8         |
| 7 | A      | 3    | 7    | 5     | 5      | 3     | 12   | 2    | 3     | 2     | 10    | 2      | 9      | 2     | 6         | 3     | 8         |
| 8 | P      | 8    | 14   | 7     | 8      | 4     | 5    | 10   | 6     | 3     | 12    | 5      | 4      | 4     | 10        | 4     | 8         |
| 9 | P      | 6    | 10   | 8     | 8      | 7     | 8    | 5    | 7     | 5     | 7     | 6      | 6      | 3     | 9         | 8     | 9         |
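Beyond the shape and the first few rows, a couple of further basic checks are often worth running. The sketch below is one possible way to do so and is not part of the original steps: it counts missing values and records per class. Because 'letters.csv' is not bundled here, it uses a small stand-in DataFrame (`df_sample`, a hypothetical name) built from the first five records and a few of the columns shown above.

```python
import pandas as pd

# Stand-in for pd.read_csv('letters.csv'): the first five records and a few
# of the columns from the output above.
df_sample = pd.DataFrame(
    {
        "letter": ["B", "A", "R", "B", "P"],
        "xbox": [4, 1, 5, 5, 3],
        "ybox": [2, 1, 9, 9, 6],
        "width": [5, 3, 5, 7, 4],
        "height": [4, 2, 7, 7, 4],
    }
)

# Number of missing values per column.
missing_per_column = df_sample.isnull().sum()
print(missing_per_column)

# Number of records per class; a strong imbalance would make accuracy misleading.
class_counts = df_sample["letter"].value_counts()
print(class_counts)

# Summary statistics for the numeric features.
print(df_sample.describe())
```

On the full data, the same three calls can be applied to `df` directly.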
    

Step 3 - Creating Arrays for the Features and the Response Variable

The first line of code creates a NumPy array of the target variable, 'y'. The second line creates an array, 'x', of all the features by dropping the target variable 'letter'.

# Create arrays for the features and the response variable
y = df['letter'].values
x = df.drop('letter', axis=1).values
    

Step 4 - Building and Evaluating a Single Algorithm

The goal of ensemble modeling is to improve on the performance of an individual model by combining multiple models. So, we will set a baseline performance measure by starting with one algorithm. In our case, we will build a Logistic Regression model.

The first line of code creates the training and test sets, with the 'test_size' argument specifying the proportion of the data to be kept in the test set (here 0.3, or 30 percent). The second line instantiates the Logistic Regression algorithm, while the third line fits the model on the training dataset. The fourth line generates predictions on the test data, while the fifth to seventh lines of code print the confusion matrix, the classification report, and the accuracy computed from the diagonal of the confusion matrix.

# Hold out 30 percent of the data for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Accuracy: correct predictions (matrix diagonal) divided by all test records
print((216+221+214+219)/(227+244+224+240))
    

Output:

[[216   3   1   7]
 [  0 221   0  23]
 [  2   4 214   4]
 [  0  21   0 219]]
             precision    recall  f1-score   support

          A       0.99      0.95      0.97       227
          B       0.89      0.91      0.90       244
          P       1.00      0.96      0.97       224
          R       0.87      0.91      0.89       240

avg / total       0.93      0.93      0.93       935

0.93048128342246
    

We see that the accuracy of the single model is 93 percent. We will now build various ensemble models and see whether they improve on this performance.
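The figures in the report can be traced back to the confusion matrix. As a worked example, the sketch below recomputes the accuracy and the per-class precision and recall from the matrix printed above, using plain Python. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order A, B, P, R.

```python
# Confusion matrix from the output above; rows are true classes and columns
# are predicted classes, in the order A, B, P, R.
labels = ["A", "B", "P", "R"]
cm = [
    [216, 3, 1, 7],
    [0, 221, 0, 23],
    [2, 4, 214, 4],
    [0, 21, 0, 219],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall: correct predictions for a class divided by its row total.
recall = {lab: cm[i][i] / sum(cm[i]) for i, lab in enumerate(labels)}
# Precision: correct predictions for a class divided by its column total.
precision = {
    lab: cm[i][i] / sum(row[i] for row in cm) for i, lab in enumerate(labels)
}

print(round(accuracy, 4))
for lab in labels:
    print(lab, round(precision[lab], 2), round(recall[lab], 2))
```

The off-diagonal entries also show where the model struggles: most of the errors are B images predicted as R (23) and R images predicted as B (21).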

Bagging

Bagging, or Bootstrap Aggregation, is an ensemble method that trains the same algorithm many times on different subsets sampled, with replacement, from the training data. The final prediction is then aggregated across the predictions of all the sub-models: by majority vote or averaged class probabilities for classification, and by averaging for regression. The two most popular bagging ensemble techniques are Bagged Decision Trees and Random Forest.
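The sampling step can be illustrated without any machine learning library. The following minimal sketch draws one bootstrap sample of row indices and reports how many distinct rows it contains. The helper name `bootstrap_sample` and the choice of 1000 rows are assumptions made for illustration. For large datasets, roughly 63 percent of the rows (1 - 1/e) are expected to appear in any one bootstrap sample, and the rest are "out-of-bag" for that sub-model.

```python
import random

rng = random.Random(10)  # fixed seed so the sketch is repeatable

n_rows = 1000
row_ids = list(range(n_rows))


def bootstrap_sample(ids, rng):
    """Draw len(ids) row indices with replacement."""
    return rng.choices(ids, k=len(ids))


sample = bootstrap_sample(row_ids, rng)
unique_fraction = len(set(sample)) / n_rows
out_of_bag = set(row_ids) - set(sample)

print(f"distinct rows in the sample: {unique_fraction:.3f}")
print(f"rows left out (out-of-bag): {len(out_of_bag)}")
```

Because every sub-model sees a different sample, their errors differ, and aggregating them tends to reduce variance.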

Bagging Classifier

This method performs best with algorithms that have high variance, such as decision trees. In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator.

The first line of code creates the kfold cross validation framework. The second line instantiates the BaggingClassifier() model, with Decision Tree as the base estimator and 100 as the number of trees. The third line generates the cross validated scores on the data, while the fourth line prints the mean cross-validation accuracy score.

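# Bagged Decision Trees for Classification
# random_state is not passed to KFold: it only has an effect with shuffle=True,
# and recent scikit-learn versions raise an error if it is set without it.
kfold = model_selection.KFold(n_splits=10)
# The base estimator is passed positionally because the keyword was renamed
# from base_estimator to estimator in newer scikit-learn releases.
model_1 = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=10)
results_1 = model_selection.cross_val_score(model_1, x, y, cv=kfold)
print(results_1.mean())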
    

Output:

      0.971121897931

The accuracy of the BaggingClassifier ensemble is 97.11 percent, about four percentage points higher than the single logistic regression model.
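A more like-for-like way to judge what bagging adds is to score a single decision tree and the bagged trees with the same cross-validation splits. The sketch below does this on synthetic data from `make_classification`, which stands in for 'letters.csv'; the dataset sizes, the variable names, and the use of shuffled 5-fold splits are all assumptions for illustration, so the scores will not match those in this guide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the letters dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=16, n_informative=8,
    n_classes=4, random_state=10,
)

# Shuffled folds, so random_state is meaningful here.
cv = KFold(n_splits=5, shuffle=True, random_state=10)

single_tree = DecisionTreeClassifier(random_state=10)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=10), n_estimators=50, random_state=10
)

tree_scores = cross_val_score(single_tree, X_demo, y_demo, cv=cv)
bag_scores = cross_val_score(bagged_trees, X_demo, y_demo, cv=cv)

print("single tree :", tree_scores.mean())
print("bagged trees:", bag_scores.mean())
```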

Random Forest

Random Forest is an extension of bagged decision trees. As in bagging, each tree is trained on a sample of the training dataset taken with replacement; in addition, only a random subset of the features is considered at each split, which reduces the correlation between the individual decision trees. In scikit-learn, a random forest model is constructed by using the RandomForestClassifier class.

The first line of code creates the kfold cross validation object. The second line instantiates the RandomForestClassifier() ensemble. The third line generates the cross validated scores on the data, while the fourth line prints the mean cross-validation accuracy score.

# Random Forest Classification
# KFold without shuffling; random_state only applies when shuffle=True
kfold_rf = model_selection.KFold(n_splits=10)
model_rf = RandomForestClassifier(n_estimators=100, max_features=5)
results_rf = model_selection.cross_val_score(model_rf, x, y, cv=kfold_rf)
print(results_rf.mean())
    

Output:

      0.9890943194

The accuracy of the RandomForestClassifier ensemble is 98.90 percent, the highest of the models built so far.
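Two by-products of a fitted random forest are often useful: the out-of-bag score, which evaluates each row using only the trees that did not see it during training, and the feature importances. The sketch below shows how one might read both. It uses synthetic data as a stand-in for the letters dataset (an assumption, as are the variable names), so the numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-class data standing in for the letters dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=16, n_informative=8,
    n_classes=4, random_state=10,
)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features=5,     # features considered at each split, as in the guide
    oob_score=True,     # score each row using only trees that did not see it
    random_state=10,
)
forest.fit(X_demo, y_demo)

print("out-of-bag accuracy:", forest.oob_score_)

# Importances are non-negative and sum to 1 across features.
ranked = sorted(
    enumerate(forest.feature_importances_), key=lambda t: t[1], reverse=True
)
for index, importance in ranked[:5]:
    print(f"feature {index}: {importance:.3f}")
```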

Boosting

In Boosting, multiple models are trained sequentially, and each model learns from the errors of its predecessors. In this guide, we will implement two boosting techniques: AdaBoost and Gradient Boosting.

Adaptive Boosting or AdaBoost

AdaBoost, short for 'Adaptive Boosting', is the first practical boosting algorithm proposed by Freund and Schapire in 1996. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one.

In scikit-learn, an AdaBoost model is constructed by using the AdaBoostClassifier class. After the import statement, the first line of code creates the kfold cross validation object. The second line instantiates the AdaBoostClassifier() ensemble. The third line generates the cross validated scores on the data, while the fourth line prints the mean cross-validation accuracy score.

from sklearn.ensemble import AdaBoostClassifier
# KFold without shuffling; random_state only applies when shuffle=True
kfold_ada = model_selection.KFold(n_splits=10)
model_ada = AdaBoostClassifier(n_estimators=30, random_state=10)
results_ada = model_selection.cross_val_score(model_ada, x, y, cv=kfold_ada)
print(results_ada.mean())
    

Output:

      0.848199563031

The accuracy of the AdaBoostClassifier ensemble is 84.82 percent, which is lower than the other models.
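The idea of learning from a predecessor's errors can be made concrete with one round of the classic binary AdaBoost reweighting. The sketch below is a hand-worked illustration with five made-up training points, not the internals of AdaBoostClassifier, which uses a multi-class variant of the algorithm. The names `correct`, `weights`, and `alpha` are chosen for this example.

```python
import math

# Five training points; True marks the ones the current weak learner got right.
correct = [True, True, False, True, True]

n = len(correct)
weights = [1 / n] * n  # start with uniform weights

# Weighted error of the weak learner.
error = sum(w for w, ok in zip(weights, correct) if not ok)

# The learner's say in the final vote; larger when its error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weight of misclassified points, lower the rest, then renormalize.
updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
norm = sum(updated)
updated = [w / norm for w in updated]

print("error:", error, "alpha:", round(alpha, 4))
print("new weights:", [round(w, 4) for w in updated])
```

After the update, the misclassified point carries half of the total weight, so the next weak learner is pushed to get it right.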

Stochastic Gradient Boosting

In scikit-learn, a gradient boosting model is constructed by using the GradientBoostingClassifier class. The steps are almost exactly like the ones discussed above; the only difference is the line that instantiates the model. Note that with the default subsample=1.0, every tree is fitted on the full training set; setting subsample to a value below 1.0 gives the stochastic variant, in which each tree is fitted on a random fraction of the rows.

Running the code and looking at the output below, we observe that the accuracy of the ensemble is 98.39 percent.

from sklearn.ensemble import GradientBoostingClassifier
# KFold without shuffling; random_state only applies when shuffle=True
kfold_sgb = model_selection.KFold(n_splits=10)
model_sgb = GradientBoostingClassifier(n_estimators=100, random_state=10)
results_sgb = model_selection.cross_val_score(model_sgb, x, y, cv=kfold_sgb)
print(results_sgb.mean())
    

Output:

      0.983958900157
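The "stochastic" part of stochastic gradient boosting comes from fitting each tree on a random fraction of the training rows, which GradientBoostingClassifier controls through its `subsample` parameter. The sketch below compares the default (1.0, no subsampling) with 0.8 on synthetic stand-in data. The value 0.8, the dataset, and the fold settings are assumptions for illustration, and which setting scores higher depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic 4-class data standing in for the letters dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=16, n_informative=8,
    n_classes=4, random_state=10,
)
cv = KFold(n_splits=3, shuffle=True, random_state=10)

subsample_scores = {}
for subsample in (1.0, 0.8):
    model = GradientBoostingClassifier(
        n_estimators=50, subsample=subsample, random_state=10
    )
    subsample_scores[subsample] = cross_val_score(model, X_demo, y_demo, cv=cv).mean()

for subsample, score in subsample_scores.items():
    print(f"subsample={subsample}: {score:.4f}")
```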

Voting Ensemble

Voting is a simple but effective ensemble technique that works by combining the predictions from multiple machine learning algorithms. In scikit-learn, it is constructed by using the VotingClassifier class, which by default predicts the class chosen by the majority of its estimators (hard voting).

The first line of code creates the kfold cross validation object. The second to eighth lines of code instantiate three models - Logistic Regression, Decision Tree, and Support Vector Machine - and append them to a list named 'estimators'.

The ninth line instantiates the VotingClassifier() ensemble. The tenth line generates the cross validated scores on the data, while the last line of code prints the mean cross-validation accuracy score.

Looking at the output below, we observe that the accuracy of the ensemble is 98.52 percent.

# KFold without shuffling; random_state only applies when shuffle=True
kfold_vc = model_selection.KFold(n_splits=10)

# Lines 2 to 8
estimators = []
mod_lr = LogisticRegression()
estimators.append(('logistic', mod_lr))
mod_dt = DecisionTreeClassifier()
estimators.append(('cart', mod_dt))
mod_sv = SVC()
estimators.append(('svm', mod_sv))

# Lines 9 to 11
ensemble = VotingClassifier(estimators)
results_vc = model_selection.cross_val_score(ensemble, x, y, cv=kfold_vc)
print(results_vc.mean())
    

Output:

      0.985241982027
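VotingClassifier supports two ways of combining predictions: hard voting (the default, a majority vote over predicted labels) and soft voting (voting='soft', which averages predicted class probabilities). The plain-Python sketch below shows how the two rules can disagree. The three classifiers' outputs are made up for illustration, and `hard_vote` and `soft_vote` are hypothetical helper names, not scikit-learn functions.

```python
from collections import Counter

# Hypothetical outputs of three classifiers for one image.
predicted_labels = ["B", "R", "R"]
predicted_probs = [
    {"A": 0.05, "B": 0.80, "P": 0.05, "R": 0.10},
    {"A": 0.05, "B": 0.40, "P": 0.05, "R": 0.50},
    {"A": 0.10, "B": 0.35, "P": 0.10, "R": 0.45},
]


def hard_vote(labels):
    """Return the label predicted by the most classifiers."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probs):
    """Return the label with the highest average predicted probability."""
    classes = probs[0].keys()
    mean_prob = {c: sum(p[c] for p in probs) / len(probs) for c in classes}
    return max(mean_prob, key=mean_prob.get)


print("hard vote:", hard_vote(predicted_labels))
print("soft vote:", soft_vote(predicted_probs))
```

Here two of the three classifiers narrowly prefer R, so the hard vote is R, while the one confident B prediction tips the averaged probabilities to B. In scikit-learn, soft voting requires every estimator to provide probabilities, which for SVC means constructing it with probability=True.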

Conclusion

In this guide, you have learned about Ensemble Modeling with scikit-learn. The performance of the models implemented in the guide is summarized below:

Logistic Regression: Accuracy of 93 percent
Bagged Decision Trees: Accuracy of 97.11 percent
Random Forest: Accuracy of 98.90 percent
AdaBoost: Accuracy of 84.82 percent
Stochastic Gradient Boosting: Accuracy of 98.39 percent
Voting Classifier: Accuracy of 98.52 percent

The single Logistic Regression model achieved a good accuracy of 93 percent, and all the ensemble models outperformed this benchmark with scores above 97 percent, the only exception being Adaptive Boosting. Keep in mind that the baseline was measured on a single hold-out test set while the ensembles were scored with 10-fold cross-validation, so the comparison is indicative rather than exact. Model performance could be improved further through steps such as hyperparameter tuning and trying different algorithms. However, the aim of this guide was to demonstrate how ensemble modeling can lead to better performance, which has been established for this problem statement.
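As a pointer for the hyperparameter tuning mentioned above, the sketch below shows one possible way to search over two AdaBoost settings with GridSearchCV. The parameter grid, the synthetic stand-in data, and the fold settings are assumptions chosen to keep the example small; this guide did not run such a search, so no claim is made about how much it would help on the letters data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic 4-class data standing in for the letters dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=16, n_informative=8,
    n_classes=4, random_state=10,
)

# Candidate settings; every combination is scored by cross-validation.
param_grid = {
    "n_estimators": [30, 100],
    "learning_rate": [0.5, 1.0],
}
search = GridSearchCV(
    AdaBoostClassifier(random_state=10),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=10),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```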
