Ensemble methods are advanced techniques, often used to solve complex machine learning problems. In simple terms, ensemble modeling is a process in which multiple different and independent models (also referred to as "weak learners") are combined to produce a single outcome. The hypothesis is that combining multiple models can produce better results by decreasing the generalization error. Three of the most popular methods for ensemble modeling are Bagging, Boosting, and Voting.
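As a toy illustration of why combining models can decrease error, consider three independent classifiers that are each correct 70 percent of the time; a majority vote among them is correct whenever at least two of the three are correct, which works out to roughly 78 percent. The short sketch below is a hypothetical calculation (it assumes independence and is not part of this guide's dataset):

from math import comb

# Assumed setup: three independent classifiers, each correct 70% of the time
p = 0.7
n = 3

# The majority vote is correct whenever at least 2 of the 3 classifiers are correct
majority_accuracy = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                        for k in range(n // 2 + 1, n + 1))
print(majority_accuracy)  # ~0.784, better than any single 0.7-accuracy model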
In this guide, you will learn how to implement the following ensemble modeling techniques using scikit-learn:
Bagging
Boosting
Voting
We will begin by understanding the Problem Statement and the data.
In this guide, we will try to recognize letters, which is one of the earliest applications of machine learning. In this problem, we will build a model that uses statistics of images of four letters in the Roman alphabet - A, B, P, and R - to predict which letter an image corresponds to.
The data comes from the UCI Machine Learning Repository, and contains 3116 records of 17 variables:
yedgexcor = The mean of the product of the number of vertical edges at each horizontal position and the horizontal position.
In this guide, we will follow the following steps:
Step 1 - Loading the required libraries and modules.
Step 2 - Loading the data and performing basic data checks.
Step 3 - Creating arrays for the features and the response variable.
Step 4 - Building and evaluating a single algorithm.
Step 5 - Building, predicting and evaluating the various ensemble models.
The following sections will cover these steps.
# Import required libraries
import pandas as pd
import numpy as np

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
The first line of code reads in the data as a pandas dataframe, while the second line prints the shape - 3116 observations of 17 variables. The third line displays the first 10 records of the data. Note that this is a multi-class classification problem with 4 classes, or letters, to predict - A, B, P, and R.
# Load data
df = pd.read_csv('letters.csv')
print(df.shape)
df.head(10)
Output:
(3116, 17)

|   | letter | xbox | ybox | width | height | onpix | xbar | ybar | x2bar | y2bar | xybar | x2ybar | xy2bar | xedge | xedgeycor | yedge | yedgexcor |
|---|--------|------|------|-------|--------|-------|------|------|-------|-------|-------|--------|--------|-------|-----------|-------|-----------|
| 0 | B | 4 | 2 | 5 | 4 | 4 | 8 | 7 | 6 | 6 | 7 | 6 | 6 | 2 | 8 | 7 | 10 |
| 1 | A | 1 | 1 | 3 | 2 | 1 | 8 | 2 | 2 | 2 | 8 | 2 | 8 | 1 | 6 | 2 | 7 |
| 2 | R | 5 | 9 | 5 | 7 | 6 | 6 | 11 | 7 | 3 | 7 | 3 | 9 | 2 | 7 | 5 | 11 |
| 3 | B | 5 | 9 | 7 | 7 | 10 | 9 | 8 | 4 | 4 | 6 | 8 | 6 | 6 | 11 | 8 | 7 |
| 4 | P | 3 | 6 | 4 | 4 | 2 | 4 | 14 | 8 | 1 | 11 | 6 | 3 | 0 | 10 | 4 | 8 |
| 5 | R | 8 | 10 | 8 | 6 | 6 | 7 | 7 | 3 | 5 | 8 | 4 | 8 | 6 | 6 | 7 | 7 |
| 6 | R | 2 | 6 | 4 | 4 | 3 | 6 | 7 | 5 | 5 | 6 | 5 | 7 | 3 | 7 | 5 | 8 |
| 7 | A | 3 | 7 | 5 | 5 | 3 | 12 | 2 | 3 | 2 | 10 | 2 | 9 | 2 | 6 | 3 | 8 |
| 8 | P | 8 | 14 | 7 | 8 | 4 | 5 | 10 | 6 | 3 | 12 | 5 | 4 | 4 | 10 | 4 | 8 |
| 9 | P | 6 | 10 | 8 | 8 | 7 | 8 | 5 | 7 | 5 | 7 | 6 | 6 | 3 | 9 | 8 | 9 |
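Step 2 also calls for basic data checks. A quick, optional check (not shown in the original code) is to confirm that the four letter classes are reasonably balanced and that there are no missing values:

# Optional data checks: class distribution and missing values
print(df['letter'].value_counts())
print(df.isnull().sum().sum())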
The first line of code creates an array for the target variable 'y'. The second line creates the array of features 'x', which contains all the variables except the target variable 'letter'.
# Create arrays for the features and the response variable
y = df['letter'].values
x = df.drop('letter', axis=1).values
The goal of ensemble modeling is to improve performance over an individual model by combining multiple models. So, we will set the baseline performance measure by starting with one algorithm. In our case, we will start with a logistic regression model.
The first line of code creates the training and test sets, with the 'test_size' argument specifying the percentage of data to be kept in the test set. The second line instantiates the logistic regression algorithm, while the third line fits the model on the training dataset. The fourth line generates predictions on the test data, while the remaining lines print the confusion matrix and classification report and compute the accuracy by hand from the confusion matrix.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Accuracy by hand: correct predictions (diagonal of the confusion matrix) / total test observations
(216+221+214+219)/(227+244+224+240)
Output:
[[216   3   1   7]
 [  0 221   0  23]
 [  2   4 214   4]
 [  0  21   0 219]]
             precision    recall  f1-score   support

          A       0.99      0.95      0.97       227
          B       0.89      0.91      0.90       244
          P       1.00      0.96      0.97       224
          R       0.87      0.91      0.89       240

avg / total       0.93      0.93      0.93       935

0.93048128342246
We see that the accuracy of the single model is 93 percent. We will now build various ensemble models and see if they improve the performance.
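The manual calculation in the last line of the code above sums the diagonal of the confusion matrix (the correct predictions) and divides by the total number of test observations. As a sanity check, the same figure can be obtained directly with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Accuracy = correct predictions / total test observations
print(accuracy_score(y_test, y_pred))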
Bagging, or Bootstrap Aggregation, is an ensemble method that involves training the same algorithm many times on different subsets sampled with replacement from the training data. The final prediction is then aggregated across the predictions of all the sub-models (averaged for regression, voted on for classification). The two most popular bagging ensemble techniques are Bagged Decision Trees and Random Forest.
This method performs best with algorithms that have high variance, for example, decision trees. In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator.
The first line of code creates the kfold cross validation framework. The second line instantiates the BaggingClassifier() model, with Decision Tree as the base estimator and 100 as the number of trees. The third line generates the cross validated scores on the data, while the fourth line prints the mean cross-validation accuracy score.
# Bagged Decision Trees for Classification
# Note: recent scikit-learn versions require shuffle=True when a random_state is set for KFold,
# and in scikit-learn 1.2+ the 'base_estimator' argument of BaggingClassifier is named 'estimator'
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=10)
model_1 = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100, random_state=10)
results_1 = model_selection.cross_val_score(model_1, x, y, cv=kfold)
print(results_1.mean())
Output:
0.971121897931
The accuracy of the BaggingClassifier ensemble is 97.11 percent, a significant improvement over the single logistic regression model.
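As a side note, bagging offers a built-in estimate of generalization performance: each tree is trained on a bootstrap sample, so the observations left out of that sample ('out-of-bag' observations) can serve as a validation set. A possible variation of the model above (illustrative, not part of the original guide) is:

# Optional: use the out-of-bag observations as a built-in validation set
model_oob = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                              n_estimators=100, oob_score=True, random_state=10)
model_oob.fit(x, y)
print(model_oob.oob_score_)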
Random Forest is an extension of bagged decision trees: in addition to training each tree on a bootstrap sample of the training data (sampling with replacement), each split considers only a random subset of the features. The trees are constructed this way with the objective of reducing the correlation between the individual decision trees. In scikit-learn, a random forest model is constructed by using the RandomForestClassifier class.
The first line of code creates the kfold cross validation object. The second line instantiates the RandomForestClassifier() ensemble. The third line generates the cross validated scores on the data, while the fourth line prints the mean cross-validation accuracy score.
# Random Forest Classification
kfold_rf = model_selection.KFold(n_splits=10, shuffle=True, random_state=10)
model_rf = RandomForestClassifier(n_estimators=100, max_features=5)
results_rf = model_selection.cross_val_score(model_rf, x, y, cv=kfold_rf)
print(results_rf.mean())
Output:
0.9890943194
The accuracy of the RandomForestClassifier ensemble is 98.91 percent, a significant improvement over the other models.
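A useful by-product of a fitted random forest is its feature importance scores, which indicate how much each image statistic contributes to the predictions. A short optional inspection (not part of the original guide) could look like this:

# Optional: fit the random forest on the full data and rank the feature importances
feature_names = df.drop('letter', axis=1).columns
model_rf.fit(x, y)
for name, importance in sorted(zip(feature_names, model_rf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(name, round(importance, 3))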
In Boosting, multiple models are trained sequentially, and each model learns from the errors of its predecessors. In this guide, we will implement two boosting techniques: AdaBoost and Gradient Boosting.
AdaBoost, short for 'Adaptive Boosting', is the first practical boosting algorithm proposed by Freund and Schapire in 1996. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one.
In scikit-learn, an adaboost model is constructed by using the AdaBoostClassifier class. The first line of code creates the kfold cross validation object. The second line instantiates the AdaBoostClassifier() ensemble. The third line generates the cross validated scores on the data, while the fourth line prints the mean cross-validation accuracy score.
from sklearn.ensemble import AdaBoostClassifier

kfold_ada = model_selection.KFold(n_splits=10, shuffle=True, random_state=10)
model_ada = AdaBoostClassifier(n_estimators=30, random_state=10)
results_ada = model_selection.cross_val_score(model_ada, x, y, cv=kfold_ada)
print(results_ada.mean())
Output:
0.848199563031
The accuracy of the AdaBoostClassifier ensemble is 84.82 percent, which is lower than the other models.
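One likely reason for the weaker score is that AdaBoostClassifier uses a depth-one decision tree (a 'stump') as its default base learner, which can underfit a 16-feature, four-class problem. A possible variation (an illustrative sketch, not from the original guide) is to allow deeper base trees and more boosting rounds:

# Optional variation: deeper base trees and more boosting rounds
# (in scikit-learn 1.2+ the 'base_estimator' argument is named 'estimator')
model_ada2 = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
                                n_estimators=100, random_state=10)
results_ada2 = model_selection.cross_val_score(model_ada2, x, y, cv=kfold_ada)
print(results_ada2.mean())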
In scikit-learn, a gradient boosting model is constructed by using the GradientBoostingClassifier class; it becomes stochastic gradient boosting when each tree is trained on a random subsample of the rows. The steps to perform this ensembling technique are almost exactly like the ones discussed above, the only exception being the line that instantiates the model.
Running the code and looking at the output below, we observe that the accuracy of the ensemble is 98.40 percent.
from sklearn.ensemble import GradientBoostingClassifier

kfold_sgb = model_selection.KFold(n_splits=10, shuffle=True, random_state=10)
model_sgb = GradientBoostingClassifier(n_estimators=100, random_state=10)
results_sgb = model_selection.cross_val_score(model_sgb, x, y, cv=kfold_sgb)
print(results_sgb.mean())
Output:
0.983958900157
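As noted above, the model becomes stochastic gradient boosting only when each tree is fit on a random fraction of the training rows, controlled by the subsample argument (the default of 1.0 uses every row). A variation along those lines (illustrative, not from the original guide) would be:

# Optional variation: stochastic gradient boosting, each tree sees 80% of the rows
model_sgb2 = GradientBoostingClassifier(n_estimators=100, subsample=0.8, random_state=10)
results_sgb2 = model_selection.cross_val_score(model_sgb2, x, y, cv=kfold_sgb)
print(results_sgb2.mean())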
Voting is a simple but extremely effective ensemble technique that works by combining the predictions from multiple machine learning algorithms. In scikit-learn, it is constructed by using the VotingClassifier class.
The first line of code creates the kfold cross validation object. The second to eighth lines of code instantiate three models - Logistic Regression, Decision Tree, and Support Vector Machine - and append these algorithms to a list called 'estimators'.
The ninth line instantiates the VotingClassifier() ensemble with this list. The tenth line generates the cross validated scores on the data, while the last line of code prints the mean cross-validation accuracy score.
Looking at the output below, we observe that the accuracy of the ensemble is 98.52 percent.
kfold_vc = model_selection.KFold(n_splits=10, shuffle=True, random_state=10)

# Lines 2 to 8: instantiate the three models and collect them in the 'estimators' list
estimators = []
mod_lr = LogisticRegression()
estimators.append(('logistic', mod_lr))
mod_dt = DecisionTreeClassifier()
estimators.append(('cart', mod_dt))
mod_sv = SVC()
estimators.append(('svm', mod_sv))

# Lines 9 to 11: build the voting ensemble and evaluate it with cross-validation
ensemble = VotingClassifier(estimators)
results_vc = model_selection.cross_val_score(ensemble, x, y, cv=kfold_vc)
print(results_vc.mean())
Output:
0.985241982027
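By default, VotingClassifier uses hard voting, i.e. a majority vote over the predicted class labels. A variation worth trying is soft voting, which averages the predicted class probabilities; every estimator must then support predict_proba, so SVC needs probability=True. An illustrative sketch (not part of the original guide):

# Optional variation: soft voting averages the predicted class probabilities
estimators_soft = [('logistic', LogisticRegression()),
                   ('cart', DecisionTreeClassifier()),
                   ('svm', SVC(probability=True))]
ensemble_soft = VotingClassifier(estimators_soft, voting='soft')
results_soft = model_selection.cross_val_score(ensemble_soft, x, y, cv=kfold_vc)
print(results_soft.mean())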
In this guide, you have learned about Ensemble Modeling with scikit-learn. The performance of the models implemented in the guide is summarized below:
Logistic Regression: Accuracy of 93.05 percent
Bagged Decision Trees: Accuracy of 97.11 percent
Random Forest: Accuracy of 98.91 percent
AdaBoost: Accuracy of 84.82 percent
Gradient Boosting: Accuracy of 98.40 percent
Voting Classifier: Accuracy of 98.52 percent
The single Logistic Regression model achieved a good accuracy of 93 percent, but all the ensemble models outperformed this benchmark and scored more than 97 percent, with the only exception being AdaBoost. There are other iterations that can also be done to improve model performance, such as hyperparameter tuning and trying different algorithms. However, the aim of this guide was to demonstrate how ensemble modeling can lead to better performance, which has been established for this problem statement.
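As an example of the kind of further iteration mentioned above, a simple grid search over a few random forest hyperparameters could be set up as follows (the parameter grid here is purely illustrative):

from sklearn.model_selection import GridSearchCV

# Illustrative hyperparameter grid for the random forest model
param_grid = {'n_estimators': [100, 200], 'max_features': [3, 5, 7]}
grid = GridSearchCV(RandomForestClassifier(random_state=10), param_grid, cv=5)
grid.fit(x, y)
print(grid.best_params_, grid.best_score_)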
To learn more about building machine learning models using scikit-learn, please refer to the following guides:
To learn more about building deep learning models using Keras, please refer to the following guides: 1. Regression with Keras 2. Classification with Keras