Ensemble methods are advanced techniques often used to solve complex machine learning problems. In simple terms, an ensemble method combines several different and independent models (also referred to as "weak learners") to produce a single prediction. The hypothesis is that combining multiple models can produce better results by decreasing generalization error.
Three of the most popular methods for ensemble modeling are bagging, boosting, and stacking. In this guide, you will learn how to implement these techniques with R.
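To make the idea concrete, here is a minimal sketch, not part of the original guide, of the simplest possible ensemble: three independent classifiers vote on each observation and the majority class wins. The prediction vectors are hypothetical and used purely for illustration.

# Hypothetical class predictions ("Yes"/"No") from three independent models
pred_model_1 <- c("Yes", "No", "Yes", "Yes")
pred_model_2 <- c("Yes", "Yes", "No", "Yes")
pred_model_3 <- c("No", "No", "Yes", "Yes")

# Majority vote: for each observation, take the most frequent predicted class
votes <- data.frame(pred_model_1, pred_model_2, pred_model_3, stringsAsFactors = FALSE)
majority_vote <- apply(votes, 1, function(x) names(which.max(table(x))))
majority_vote   # "Yes" "No" "Yes" "Yes"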
In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:
Marital_status
: Whether the applicant is married ("Yes") or not ("No")
Is_graduate
: Whether the applicant is a graduate ("Yes") or not ("No")
Income
: Annual Income of the applicant (in USD)
Loan_amount
: Loan amount (in USD) for which the application was submitted
Credit_score
: Whether the applicant's credit score is good ("Good") or not ("Bad")
Approval_status
: Whether the loan application was approved ("Yes") or not ("No")
Age
: The applicant's age in years
Sex
: Whether the applicant is a male ("M") or a female ("F")
Investment
: Total investment in stocks and mutual funds (in USD) as declared by the applicant
Purpose
: Purpose of applying for the loan
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(caretEnsemble)
library(ROSE)

dat <- read_csv("data_set.csv")

glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "...
The output shows that the dataset has four numerical variables (labeled as int) and six character variables (labeled as chr). We will convert the character variables into factor variables using the lines of code below.
names <- c(1, 2, 5, 6, 8, 10)
dat[,names] <- lapply(dat[,names], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
The goal of ensemble modeling is to improve performance over a baseline model by combining multiple models. We will therefore establish the baseline performance measure with a single algorithm; in our case, a logistic regression model.
We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation method for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line loads the caTools package, which is used for data partitioning, while the third to fifth lines create the training and test datasets. The training set contains 70 percent of the data (420 observations of 10 variables) and the test set contains the remaining 30 percent (180 observations of 10 variables).
set.seed(100)
library(caTools)

spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl==TRUE)
test = subset(dat, spl==FALSE)

print(dim(train)); print(dim(test))
Output:
[1] 420  10

[1] 180  10
To fit the logistic regression model, the first line of code below trains the algorithm with the glm() function. The second line uses the trained model to make predictions on the test data. The third line generates the confusion matrix, while the fourth line computes the accuracy on the test data.
model_glm = glm(approval_status ~ . , family="binomial", data = train)

# Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test set
table(test$approval_status, predictTest >= 0.5)
158/nrow(test) # Accuracy - 88%
Output:
[1] 0.8777778
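The code above hard-codes the number of correct predictions (158). As a small, hypothetical convenience that is not part of the original guide, you can compute the accuracy directly from the confusion matrix so the figure updates automatically if the model changes.

# Compute accuracy from the confusion matrix instead of hard-coding counts
conf_mat <- table(test$approval_status, predictTest >= 0.5)
sum(diag(conf_mat)) / sum(conf_mat)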
We see that the accuracy of the single model is 88 percent. We will now build various ensemble models and see if they improve performance.
Bagging, or bootstrap aggregation, is an ensemble method that involves training the same algorithm many times using different subsets sampled from the training data. The final output prediction is then averaged across the predictions of all the sub-models. The two most popular bagging ensemble techniques are Bagged Decision Trees and Random Forest.
This method performs best with algorithms that have high variance, for example, decision trees. We start by setting the seed in the first line of code. The second line specifies the parameters used to control the model training process; the argument sampling="rose" applies ROSE sampling within each resample to deal with class imbalance. The third line trains the bagged tree algorithm, with the argument method="treebag" specifying the algorithm. The fourth to sixth lines of code generate predictions on the test data, create the confusion matrix, and compute the accuracy, which comes out to 79 percent. This is not better than the logistic regression model.
set.seed(100)

control1 <- trainControl(sampling="rose", method="repeatedcv", number=5, repeats=5)

bagCART_model <- train(approval_status ~., data=train, method="treebag", metric="Accuracy", trControl=control1)

# Predictions on the test set
predictTest = predict(bagCART_model, newdata = test)

# Confusion matrix on test set
table(test$approval_status, predictTest)
142/nrow(test) # Accuracy - 78.9%
Output:
     predictTest
      No Yes
  No  33  24
  Yes 14 109

[1] 0.7888889
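The sampling="rose" argument used above is most useful when the two classes of approval_status are imbalanced. As a quick, hypothetical check that is not part of the original guide, you can inspect the class distribution in the training data:

# Check how balanced the target variable is in the training data
table(train$approval_status)
prop.table(table(train$approval_status))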
Random Forest is an extension of bagged decision trees. In addition to sampling the training data with replacement, a random forest considers only a random subset of the features at each split, which reduces the correlation between the individual decision trees.
We follow the same steps as above, with the exception that while training the algorithm, we set method="rf" to specify that a random forest model is to be built. The accuracy of the random forest ensemble is 92 percent, a significant improvement over the models built earlier.
set.seed(100)

control1 <- trainControl(sampling="rose", method="repeatedcv", number=5, repeats=5)

rf_model <- train(approval_status ~., data=train, method="rf", metric="Accuracy", trControl=control1)

predictTest = predict(rf_model, newdata = test, type = "raw")

# Confusion matrix on test set
table(test$approval_status, predictTest)
165/nrow(test) # Accuracy - 91.67%
Output:
[1] 0.9166667
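The main tuning parameter for method="rf" is mtry, the number of features considered at each split. The snippet below is a hedged sketch, not part of the original guide, of how you could tune it with caret's tuneGrid argument; the candidate values are purely illustrative.

# Try several values of mtry and keep the one with the best resampled accuracy
set.seed(100)
rf_grid <- expand.grid(mtry = c(2, 4, 6))
rf_tuned <- train(approval_status ~., data=train, method="rf", metric="Accuracy",
                  trControl=control1, tuneGrid=rf_grid)
print(rf_tuned)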
In boosting, multiple models are trained sequentially and each model learns from the errors of its predecessors, with the goal of converting a set of weak learners into a strong one. In this guide, we will implement a gradient boosting algorithm.
We follow the same steps as above, with the exception that while training the algorithm, we set method="gbm" to specify that a gradient boosting model is to be built. The accuracy of the model is 78.9 percent, which is lower than the baseline accuracy.
set.seed(100)
control2 <- trainControl(sampling="rose", method="repeatedcv", number=5, repeats=5)

gbm_model <- train(approval_status ~., data=train, method="gbm", metric="Accuracy", trControl=control2)

predictTest = predict(gbm_model, newdata = test)

table(test$approval_status, predictTest)
142/nrow(test) # Accuracy - 78.9%
Output:
[1] 0.7888889
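method="gbm" has several tuning parameters: the number of trees, the interaction depth, the shrinkage (learning rate), and the minimum node size. As a hedged sketch that is not part of the original guide, you could search over a small grid of these values; the candidate values below are illustrative.

# Illustrative grid of gbm hyperparameters for caret to search over
set.seed(100)
gbm_grid <- expand.grid(n.trees = c(100, 300),
                        interaction.depth = c(2, 4),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
gbm_tuned <- train(approval_status ~., data=train, method="gbm", metric="Accuracy",
                   trControl=control2, tuneGrid=gbm_grid, verbose=FALSE)
print(gbm_tuned)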
In this approach, the predictions of multiple caret models are combined using the caretEnsemble package. Given a list of trained models, the caretStack() function is then used to specify a higher-order model that learns how to best combine the predictions of the sub-models. The first line of code below sets the seed, while the second line specifies the control parameters to be used while modeling. The third line defines the list of algorithms to use, and the fourth line uses the caretList() function to train them. The fifth line collects the resampling results, and the sixth line prints their summary. The output shows that the support vector machine (svmRadial) algorithm creates the most accurate model, with a mean accuracy of 92.7 percent.
set.seed(100)

control_stacking <- trainControl(method="repeatedcv", number=5, repeats=2, savePredictions=TRUE, classProbs=TRUE)

algorithms_to_use <- c('rpart', 'glm', 'knn', 'svmRadial')

stacked_models <- caretList(approval_status ~., data=dat, trControl=control_stacking, methodList=algorithms_to_use)

stacking_results <- resamples(stacked_models)

summary(stacking_results)
Output:
Call:
summary.resamples(object = stacking_results)

Models: rpart, glm, knn, svmRadial
Number of resamples: 10

Accuracy
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
rpart     0.8333333 0.8562500 0.9000000 0.8975000 0.9250000 0.9833333    0
glm       0.8333333 0.8729167 0.8958333 0.8925000 0.9208333 0.9416667    0
knn       0.5666667 0.6020833 0.6583333 0.6375000 0.6645833 0.6750000    0
svmRadial 0.8833333 0.9041667 0.9208333 0.9266667 0.9395833 0.9916667    0

Kappa
                Min.     1st Qu.      Median        Mean    3rd Qu.      Max. NA's
rpart      0.6148909  0.66732995 0.780053451  0.76715842 0.83288101 0.9620253    0
glm        0.6148909  0.71001041 0.767803348  0.75310834 0.81481075 0.8602794    0
knn       -0.2055641 -0.09025536 0.006059649 -0.02267757 0.05209491 0.1034483    0
svmRadial  0.7448360  0.78978371 0.826301181  0.83812274 0.86541005 0.9808795    0
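Stacking tends to work best when the sub-models' predictions are not highly correlated with each other. As an optional check that is not part of the original guide, the caretEnsemble package provides modelCor() for inspecting the correlation between the base models' resampled results:

# Correlation between the base models' resampled accuracies;
# low correlations suggest the models make different errors and may stack well
modelCor(stacking_results)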
Let’s combine the predictions of the classifiers using a simple logistic regression model. The output shows that the accuracy is 91.7 percent.
# stack using glm
stackControl <- trainControl(method="repeatedcv", number=5, repeats=3, savePredictions=TRUE, classProbs=TRUE)

set.seed(100)

glm_stack <- caretStack(stacked_models, method="glm", metric="Accuracy", trControl=stackControl)

print(glm_stack)
Output:
A glm ensemble of 2 base models: lda, rpart, glm, knn, svmRadial

Ensemble results:
Generalized Linear Model

1200 samples
   5 predictor
   2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 960, 960, 960, 960, 960, 960, ...
Resampling results:

  Accuracy   Kappa
  0.9166667  0.8128156
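The accuracy reported above is the cross-validated accuracy of the stacked model. The original guide does not evaluate the stack on the held-out test set, but something along these lines should work if you want to (a hypothetical addition; note that the sub-models above were trained on the full dataset dat, so this is not a strictly independent evaluation):

# Predictions from the stacked ensemble on the held-out test set
# (depending on the caretEnsemble version, this returns classes or probabilities)
stack_pred <- predict(glm_stack, newdata = test)
head(stack_pred)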
In this guide, you have learned about ensemble modeling with R. The performance of the models implemented in the guide is summarized below:
Logistic Regression: Accuracy of 87.8 percent
Bagged Decision Trees: Accuracy of 78.9 percent
Random Forest: Accuracy of 91.7 percent
Stochastic Gradient Boosting: Accuracy of 78.9 percent
To learn more about data science using R, please refer to the following guides: