Ensemble methods* are techniques that combine the decisions from several base machine learning (ML) models to find a predictive model to achieve optimum results. Consider the fable of the blind men and the elephant depicted in the image below. The blind men are each describing an elephant from their own point of view. Their descriptions are all correct but incomplete. Their understanding of the elephant would be more accurate and realistic if they came together to discuss and combined their descriptions.
The main principle of ensemble methods is to combine weak and strong learners to form strong and versatile learners.
This guide will introduce you to the two main methods of ensemble learning: bagging and boosting. Bagging is a parallel ensemble, while boosting is sequential.
This guide will use the Iris dataset from the sci-kit learn dataset library.
But first, let's talk about bootstrapping and decision trees, both of which are essential for ensemble methods.
The bootstrap method refers to creating small multiple subsets of data from an entire dataset. These subsets of data are randomly sampled and replaced. The replacement of the sample is known as resampling.
Each section of the subsets will have equal probability. It will change the mean and standard deviation of the dataset, making the model more robust. The base learners and classifiers in the ensemble method will be mapped onto these subsets.
In machine learning, decision trees have a huge impact on decision-based analysis problems. They cover both classification and regression. As the name implies, they use a tree-like model containing nodes and leaves.
The decision tree below determines whether a person is fit or unfit.
In the above image, the model's root is at the top—it's an upside-down tree! Here, conditions are internal nodes and outcomes are leaf nodes.
Notice that the decision tree contains several if-else statements in a strict order, making the model inflexible. It gives rise to overfitting, which occurs when a function fits the data too well. That means the model will be accurate only on the data it's been trained on. The model fails to work if there is a slight change in the training data or new data.
Using one decision tree is can be problematic and might not be stable enough; however, using multiple decision trees and combining their results will do great. Combining multiple classifiers in a prediction model is called ensembling. The simple rule of ensemble methods is to reduce the error by reducing the variance.
In this section, you will learn about ensemble techniques including bagging and boosting.
The term "bagging" comes from the words Bootstrap Aggregator.
It fits the base learners (classifiers) on each random subset taken from the original dataset (bootstrapping). Due to the parallel ensemble, all of the classifiers in a training set are independent of each other so that each model will inherit slightly different features.
Next, bagging combines the results of all the learners and adds (aggregates) their prediction by averaging (mean) their outputs to get to final results.
The Random Forest (RF) algorithm can solve the problem of overfitting in decision trees. Random orest is the ensemble of the decision trees. It builds a forest of many random decision trees.
The process of RF and Bagging is almost the same. RF selects only the best features from the subset to split the node.
The diverse outcomes reduce the variance to give smooth predictions.
The boosting technique follows a sequential order. The output of one base learner will be input to another. If a base classifier is misclassified (red box), its weight will get increased (over-weighting) and the next base learner will classify more correctly.
The next logical step is to combine the classifiers to predict the results.
Gradient Descent Boosting, AdaBoost, and XGbooost are some extensions over boosting methods.
Gradient boosting minimizes the loss but adds gradient optimization in the iteration, whereas Adaptive Boosting, or AdaBoost, tweaks the instance of weights for every new predictor.
Boosting—or any other ensemble method, for that matter—will reduce the likelihood of overfitting of the model.
The table below shows the similarities and the differences between the ensemble methods.
Time to implement the code.
The main drivers of building the classifier model are Sci-Kit Learn's
Before starting the classifier implementation, it is important to call the Decision Tree from
sklearn.tree. Only then will the
sklearn.ensemble the function work. Read about the parameters of different ensemble methods here.
1from sklearn.tree import DecisionTreeClassifier 2from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier,AdaBoostClassifier 3from sklearn import datasets # import inbuild datasets 4 5from sklearn.model_selection import train_test_split 6from sklearn import metrics 7 8from sklearn.model_selection import cross_val_score, cross_val_predict 9from sklearn.metrics import confusion_matrix
Load the Iris data from ski-kit learn datasets.
1iris = datasets.load_iris() 2X = iris.data 3y = iris.target
The next step is to split the data set into 70% training and 30% testing using the hold-out method.
1X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
score as it will help you check whether the model is overfitting. A standard process to implement any classifier model is calling the classifier function, fitting the classifier on training data and target data, and finally predicting outcomes on X_test.
1score= 2classifier = DecisionTreeClassifier()
1classifier.score(X_train, y_train),classifier.score(X_test, y_test)
Well, the test score seems good, but the
1.0 score shows the model is overfitting!
Ensemble models to the rescue!
1rf = RandomForestClassifier(n_estimators=100)
1bag_clf = BaggingClassifier(base_estimator=rf, n_estimators=100, 2 bootstrap=True, n_jobs=-1, 3 random_state=42) 4 5bag_clf.fit(X_train, y_train)
The accuracy is around 98%, and the model solves the problem of overfitting. Amazing!
Let's check boosting algorithms before predicting the species.
Sci-kit learn's gradient boosting defaults to the decision tree only. Hence, it is also known as Gradient Boosting Decision Tree.
1gb = GradientBoostingClassifier(n_estimators=100).fit(X_train, y_train) 2gb.fit(X_train, y_train)
1gb.score (X_test,y_test),gb.score (X_train,y_train)
It is overfitting! Same as with the decision tree. Let's check the situation with AdaBoost.
1ad = AdaBoostClassifier(n_estimators=100, learning_rate=0.03) 2 3ad.fit(X_train, y_train)
1ad.score(X_train, y_train), ad.score(X_test, y_test)
Well! The model doesn't overfit, but the test score is not as good as what RF gives.
Predict the target species using the RF classifier.
1y_pred=bag_clf.predict(X_test) 2print(classification_report(y_test,y_pred)) 3print('\n') 4print(confusion_matrix(y_test,y_pred))
So, is bagging better than boosting?
Well, there is no clear winner. It depends on the data and problem statement. Finding a trade-off between variance and bias is necessary to have good results. If the model is performing great on training data but not on test data, then there’s a huge variance that needs to be dealt with. Similarly, if the model is not learning enough on training data, it cannot be used for generalization.
I recommend playing around with your dataset and seeing the changes in the test data score that happen when you tune the parameters. You can also experiment with different ensembles like XGBoost.
Feel free to contact me for help with any machine learning projects.