Deepika Singh

Ensemble Modeling with R

  • Nov 21, 2019
  • 14 Min read
  • 18 Views

Introduction

Ensemble methods are advanced techniques often used to solve complex machine learning problems. In simple terms, an ensemble method is a process where different and independent models (also referred to as the "weak learners") are combined to produce an outcome. The hypothesis is that combining multiple models can produce better results by decreasing generalization error.
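As a rough illustration of why combining models can help, the calculation below assumes three classifiers that are each correct 70 percent of the time and whose errors are independent. Both numbers are assumptions chosen for illustration, and real models rarely have fully independent errors. A majority vote is correct whenever at least two of the three classifiers are correct.

```r
# Assumed for illustration: 3 independent classifiers, each 70% accurate
p <- 0.7
n_models <- 3

# P(at least 2 of 3 are correct) under a binomial model
majority_acc <- sum(dbinom(2:n_models, size = n_models, prob = p))
print(majority_acc)  # 0.784
```

Under these idealized assumptions the vote is right 78.4 percent of the time, which is better than any single member. The gain shrinks as the models' errors become more correlated.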

Three of the most popular methods for ensemble modeling are bagging, boosting, and stacking. In this guide, you will learn how to implement these techniques with R.

Data

In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No")

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  3. Income: Annual Income of the applicant (in USD)

  4. Loan_amount: Loan amount (in USD) for which the application was submitted

  5. Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not

  6. approval_status: Whether the loan application was approved ("Yes") or not ("No")

  7. Age: The applicant's age in years

  8. Sex: Whether the applicant is a male ("M") or a female ("F")

  9. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

  10. Purpose: Purpose of applying for the loan

Let's start by loading the required libraries and the data.

library(plyr)
library(readr)
library(dplyr)
library(caret)
library(caretEnsemble)
library(ROSE)

dat <- read_csv("data_set.csv")

glimpse(dat)

Output:

Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "...

The output shows that the dataset has four numerical variables (labeled as int) and six character variables (labeled as chr). We will convert the character variables into factor variables using the lines of code below.

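# Column positions of the six character variables
# (named factor_cols to avoid masking the base function names())
factor_cols <- c(1, 2, 5, 6, 8, 10)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)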
glimpse(dat)

Output:

Observations: 600
Variables: 10
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
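Selecting columns by position works, but it breaks silently if the column order changes. A minimal alternative sketch is to pick the columns by type instead. The small data frame below is a hypothetical stand-in for `dat`, with made-up rows.

```r
# Hypothetical toy data frame standing in for `dat`
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Income = c(30000L, 89900L, 133300L),
  Sex = c("F", "M", "M"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving numeric columns untouched
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```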
 

Building and Evaluating a Single Algorithm

The goal of ensemble modeling is to improve performance over a baseline model by combining multiple models. So, we will set the baseline performance measure by starting with a single algorithm. In our case, we will build a logistic regression model.

We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation method for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line loads the caTools package that will be used for data partitioning, while the third to fifth lines create the training and test datasets. The train set contains 70 percent of the data (420 observations of 10 variables) and the test set contains the remaining 30 percent (180 observations of 10 variables).

set.seed(100)
library(caTools)

spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl==TRUE)
test = subset(dat, spl==FALSE)

print(dim(train)); print(dim(test))

Output:

[1] 420  10

[1] 180  10
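The sample.split() function preserves the relative ratio of the labels in both partitions. The base R sketch below illustrates that idea of a stratified 70/30 split; it is not the caTools implementation. The label vector is hypothetical: its class counts (190 "No", 410 "Yes") are an assumption chosen for illustration.

```r
set.seed(100)

# Hypothetical label vector: 190 "No" and 410 "Yes" (assumed counts)
y <- factor(rep(c("No", "Yes"), times = c(190, 410)))

# Within each class, draw 70% of the row indices for training
train_idx <- unlist(lapply(split(seq_along(y), y), function(ix) {
  sample(ix, size = round(0.7 * length(ix)))
}))
is_train <- seq_along(y) %in% train_idx

table(y[is_train])   # No: 133, Yes: 287
table(y[!is_train])  # No: 57,  Yes: 123
```

Because the sampling is done separately within each class, both partitions keep the same class proportions as the full data.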

Build, Predict and Evaluate the Model

The first line of code below fits the logistic regression model on the training data. The second line uses the fitted model to predict, for each observation in the test data, the probability that the loan is approved. The third line generates the confusion matrix using a probability threshold of 0.5, while the last line computes the accuracy on the test data.

model_glm = glm(approval_status ~ . , family="binomial", data = train)

#Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

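# Confusion matrix on test set (rows: actual, columns: predicted probability >= 0.5)
table(test$approval_status, predictTest >= 0.5)
# Accuracy: share of test rows where the thresholded prediction matches the actual label
mean((predictTest >= 0.5) == (test$approval_status == "Yes")) # 158/180, about 88%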

Output:

[1] 0.8777778

We see that the accuracy of the single model is 88 percent. We will now build various ensemble models and see if they improve performance.
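One detail of the prediction step worth spelling out is the type = "response" argument, which asks predict() for probabilities rather than log-odds. The sketch below uses the built-in mtcars data as a stand-in (an assumption for illustration, not the loan data) to show how the two scales relate.

```r
# Built-in mtcars data used as a stand-in; `am` is a 0/1 outcome
fit <- glm(am ~ wt, family = binomial, data = mtcars)

log_odds <- predict(fit, newdata = mtcars, type = "link")      # linear predictor
prob     <- predict(fit, newdata = mtcars, type = "response")  # probability that am == 1

# The two scales are related by the logistic function
all.equal(unname(plogis(log_odds)), unname(prob))  # TRUE

# Thresholding at 0.5 turns probabilities into class predictions
table(actual = mtcars$am, predicted = prob >= 0.5)
```

Thresholding the log-odds at 0.5 instead of the probabilities would give different, and incorrect, class predictions.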

Bagging

Bagging, or bootstrap aggregation, is an ensemble method that involves training the same algorithm many times on different bootstrap samples, that is, samples drawn with replacement from the training data. The final prediction is obtained by averaging the predictions of all the sub-models or, for classification, by taking a majority vote. The two most popular bagging ensemble techniques are Bagged Decision Trees and Random Forest.
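To make the mechanics concrete, here is a minimal from-scratch sketch of bagging with a majority vote. It is not what caret does internally. The toy dataset (a two-class subset of the built-in iris data) and the number of trees are assumptions chosen for illustration.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(100)

# Toy binary problem built from the built-in iris data (an assumption for illustration)
toy <- iris[iris$Species != "setosa", ]
toy$Species <- factor(toy$Species)

n_models <- 25

# Fit one classification tree per bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_models), function(i) {
  idx <- sample(nrow(toy), replace = TRUE)
  rpart(Species ~ ., data = toy[idx, ], method = "class")
})

# Collect each tree's class prediction (one column per tree)
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = toy, type = "class"))
})

# Majority vote across trees for each row
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged_pred == toy$Species)  # training accuracy of the bagged ensemble
```

The accuracy here is measured on the same rows the trees were trained on, so it is optimistic. A held-out set would give a fairer estimate.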

Bagged Decision Tree

This method performs best with algorithms that have high variance, for example, decision trees. We start by setting the seed in the first line of code. The second line specifies the parameters used to control the model training process (here, five-fold cross-validation repeated five times, with ROSE sampling applied to address class imbalance), while the third line trains the bagged tree algorithm. The argument method="treebag" specifies the algorithm. The fourth to sixth lines of code generate predictions on the test data, create the confusion matrix, and compute the accuracy, which comes out to be 79 percent. This is lower than the 88 percent achieved by the logistic regression model.

set.seed(100)

control1 <- trainControl(sampling="rose",method="repeatedcv", number=5, repeats=5)

bagCART_model <- train(approval_status ~., data=train, method="treebag", metric="Accuracy", trControl=control1)

#Predictions on the test set
predictTest = predict(bagCART_model, newdata = test)

# Confusion matrix on test set
table(test$approval_status, predictTest)
mean(predictTest == test$approval_status) # Accuracy: 142/180 = 78.9%

Output:

     predictTest
       No Yes
  No   33  24
  Yes  14 109

[1] 0.7888889
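Accuracy alone hides how the errors are distributed across the two classes. As a short worked example, the sketch below re-enters the confusion matrix printed above and derives accuracy, sensitivity, and specificity from it. Treating "Yes" as the positive class is an assumption made here for illustration.

```r
# Confusion matrix reported above (rows: actual, columns: predicted)
cm <- matrix(c(33, 14, 24, 109), nrow = 2,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)               # (33 + 109) / 180
sensitivity <- cm["Yes", "Yes"] / sum(cm["Yes", ])   # approved loans correctly predicted
specificity <- cm["No", "No"] / sum(cm["No", ])      # rejected loans correctly predicted

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.789, sensitivity 0.886, specificity 0.579
```

The model identifies approved applications far better than rejected ones, which the single accuracy figure does not reveal.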

Random Forest

Random Forest is an extension of bagged decision trees. As in bagging, each tree is grown on a sample of the training data drawn with replacement; in addition, only a random subset of the predictors is considered at each split. This reduces the correlation between the individual decision trees.

We follow the same steps as above, with the exception that while training the algorithm, we set method="rf" to specify that a random forest model is to be built. The accuracy of the random forest ensemble is 91.7 percent, an improvement over both the logistic regression baseline (87.8 percent) and the bagged decision trees (78.9 percent).

set.seed(100)

control1 <- trainControl(sampling="rose",method="repeatedcv", number=5, repeats=5)

rf_model <- train(approval_status ~., data=train, method="rf", metric="Accuracy", trControl=control1)

predictTest = predict(rf_model, newdata = test, type = "raw")

# Confusion matrix on test set
table(test$approval_status, predictTest)
mean(predictTest == test$approval_status) # Accuracy: 165/180 = 91.67%

Output:

[1] 0.9166667

Boosting

In boosting, multiple models are trained sequentially and each model learns from the errors of its predecessors. In this guide, we will implement a gradient boosting algorithm.

Stochastic Gradient Boosting

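Gradient boosting converts a set of weak learners, typically shallow decision trees, into a strong one by fitting each new tree to the errors of the current ensemble. In the stochastic variant, each tree is fit on a random subsample of the training data.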

We follow the same steps as above, with the exception that while training the algorithm, we set method="gbm" to specify that a gradient boosting model is to be built. The accuracy of the model is 78.9 percent, which is lower than the baseline accuracy of 87.8 percent.

set.seed(100)
control2 <- trainControl(sampling="rose",method="repeatedcv", number=5, repeats=5)

gbm_model <- train(approval_status ~., data=train, method="gbm", metric="Accuracy", trControl=control2)

predictTest = predict(gbm_model, newdata = test)

table(test$approval_status, predictTest)
mean(predictTest == test$approval_status) # Accuracy: 142/180 = 78.9%

Output:

[1] 0.7888889
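To illustrate the sequential idea behind boosting, here is a simplified from-scratch sketch. It is not the gbm implementation: it uses squared-error regression on the built-in mtcars data rather than classification, and the number of rounds, shrinkage, and subsample fraction are all assumptions chosen for illustration. Each round fits a one-split tree to the current residuals on a random subsample and adds a shrunken version of its predictions to the ensemble.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(100)

# Stand-in regression problem (assumption): predict mpg from three predictors
x <- mtcars[, c("wt", "hp", "disp")]
y <- mtcars$mpg

n_rounds  <- 50    # number of boosting rounds (assumed)
shrinkage <- 0.1   # learning rate (assumed)
frac      <- 0.8   # subsample fraction per round (assumed)

pred <- rep(mean(y), length(y))  # start from a constant prediction
mse  <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  d <- cbind(x, res = y - pred)                          # residuals of the current ensemble
  idx <- sample(nrow(d), size = floor(frac * nrow(d)))   # stochastic part: random subsample
  stump <- rpart(res ~ ., data = d[idx, ], method = "anova",
                 control = rpart.control(maxdepth = 1, minsplit = 5, cp = 0))
  pred <- pred + shrinkage * predict(stump, newdata = x) # add a shrunken correction
  mse[m] <- mean((y - pred)^2)
}

mse[c(1, n_rounds)]  # training error after the first and the last round
```

Each individual stump is a weak model, but because every round corrects what the previous rounds got wrong, the training error of the combined model falls well below that of the constant starting prediction.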

Stacking

In this approach, several caret models are trained with the caretEnsemble package, and the caretStack() function then fits a higher-order model that learns how best to combine their predictions. In the code below, the first line sets the seed, while the second line specifies the control parameters to be used while modeling. The third line defines the list of algorithms to use, and the fourth line uses the caretList() function to train them; note that the models are trained on the full dataset, dat. The fifth line collects the resampling results, and the sixth line prints their summary. The output shows that the SVM algorithm creates the most accurate model, with a mean accuracy of 92.7 percent.

set.seed(100)

control_stacking <- trainControl(method="repeatedcv", number=5, repeats=2, savePredictions=TRUE, classProbs=TRUE)

algorithms_to_use <- c('rpart', 'glm', 'knn', 'svmRadial')

stacked_models <- caretList(approval_status ~., data=dat, trControl=control_stacking, methodList=algorithms_to_use)

stacking_results <- resamples(stacked_models)

summary(stacking_results)

Output:

Call:
summary.resamples(object = stacking_results)

Models: rpart, glm, knn, svmRadial 
Number of resamples: 10 

Accuracy 
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
rpart     0.8333333 0.8562500 0.9000000 0.8975000 0.9250000 0.9833333    0
glm       0.8333333 0.8729167 0.8958333 0.8925000 0.9208333 0.9416667    0
knn       0.5666667 0.6020833 0.6583333 0.6375000 0.6645833 0.6750000    0
svmRadial 0.8833333 0.9041667 0.9208333 0.9266667 0.9395833 0.9916667    0

Kappa 
                Min.     1st Qu.      Median        Mean    3rd Qu.      Max. NA's
rpart      0.6148909  0.66732995 0.780053451  0.76715842 0.83288101 0.9620253    0
glm        0.6148909  0.71001041 0.767803348  0.75310834 0.81481075 0.8602794    0
knn       -0.2055641 -0.09025536 0.006059649 -0.02267757 0.05209491 0.1034483    0
svmRadial  0.7448360  0.78978371 0.826301181  0.83812274 0.86541005 0.9808795    0

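Let’s combine the predictions of the classifiers using a simple logistic regression model. The output shows that the accuracy is 91.7 percent. Note that this is a cross-validated accuracy computed on the full dataset, so it is not directly comparable to the test-set accuracies reported for the earlier models.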

# stack using glm
stackControl <- trainControl(method="repeatedcv", number=5, repeats=3, savePredictions=TRUE, classProbs=TRUE)

set.seed(100)

glm_stack <- caretStack(stacked_models, method="glm", metric="Accuracy", trControl=stackControl)

print(glm_stack)

Output:

A glm ensemble of 2 base models: lda, rpart, glm, knn, svmRadial

Ensemble results:
Generalized Linear Model 

1200 samples
   5 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 960, 960, 960, 960, 960, 960, ... 
Resampling results:

  Accuracy   Kappa    
  0.9166667  0.8128156
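For readers who want to see what stacking does under the hood, here is a minimal from-scratch sketch. It is not the caretEnsemble implementation. The toy dataset (a two-class subset of the built-in iris data), the two base learners, and the choice of five folds are assumptions made for illustration. The base models produce out-of-fold probabilities, and a logistic regression meta-model is then fit on those probabilities.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(100)

# Toy binary problem from the built-in iris data (an assumption for illustration)
toy <- iris[iris$Species != "setosa", ]
toy$y <- as.integer(toy$Species == "virginica")
toy$Species <- NULL

k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(toy)))  # random fold assignment

# Out-of-fold predicted probabilities from each base learner
oof <- data.frame(p_glm = numeric(nrow(toy)), p_tree = numeric(nrow(toy)))

for (f in seq_len(k)) {
  tr <- toy[fold != f, ]
  te <- toy[fold == f, ]
  m_glm  <- glm(y ~ Sepal.Length + Sepal.Width, family = binomial, data = tr)
  m_tree <- rpart(factor(y) ~ Petal.Length + Petal.Width, data = tr, method = "class")
  oof$p_glm[fold == f]  <- predict(m_glm, newdata = te, type = "response")
  oof$p_tree[fold == f] <- predict(m_tree, newdata = te, type = "prob")[, "1"]
}

# Meta-model: logistic regression on the base learners' out-of-fold probabilities
oof$y <- toy$y
meta <- glm(y ~ p_glm + p_tree, family = binomial, data = oof)

stack_pred <- as.integer(predict(meta, type = "response") >= 0.5)
mean(stack_pred == oof$y)  # accuracy of the stacked model on the out-of-fold predictions
```

Using out-of-fold predictions matters: if the meta-model were trained on predictions the base learners made for their own training rows, it would learn to trust overfit models too much.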

Conclusion

In this guide, you have learned about ensemble modeling with R. The performance of the models implemented in the guide is summarized below:

  1. Logistic Regression: Accuracy of 87.8 percent

  2. Bagged Decision Trees: Accuracy of 78.9 percent

  3. Random Forest: Accuracy of 91.7 percent

  4. Stochastic Gradient Boosting: Accuracy of 78.9 percent

  5. Stacking: Accuracy of 91.7 percent
