Deepika Singh

# Ensemble Modeling with R

• Nov 21, 2019
• 5,238 Views

## Introduction

Ensemble methods are advanced techniques often used to solve complex machine learning problems. In simple terms, an ensemble method is a process where different and independent models (also referred to as the "weak learners") are combined to produce an outcome. The hypothesis is that combining multiple models can produce better results by decreasing generalization error.

Three of the most popular methods for ensemble modeling are bagging, boosting, and stacking. In this guide, you will learn how to implement these techniques with R.

## Data

In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:

1. `Marital_status`: Whether the applicant is married ("Yes") or not ("No")

2. `Is_graduate`: Whether the applicant is a graduate ("Yes") or not ("No")

3. `Income`: Annual Income of the applicant (in USD)

4. `Loan_amount`: Loan amount (in USD) for which the application was submitted

5. `Credit_score`: Whether the applicant's credit score is satisfactory (recorded as "Satisfactory" in the data) or not

6. `approval_status`: Whether the loan application was approved ("Yes") or not ("No")

7. `Age`: The applicant's age in years

8. `Sex`: Whether the applicant is a male ("M") or a female ("F")

9. `Investment`: Total investment in stocks and mutual funds (in USD) as declared by the applicant

10. `Purpose`: Purpose of applying for the loan

```r
library(plyr)
library(dplyr)
library(caret)
library(caretEnsemble)
library(ROSE)

# `dat` is assumed to already hold the loan data as a data frame
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "...
```

The output shows that the dataset has four numerical variables (labeled as `int`) and six character variables (labeled as `chr`). We will convert the character variables into factor variables using the lines of code below.

```r
# Column positions of the six character variables
factor_cols <- c(1, 2, 5, 6, 8, 10)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
```
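Selecting columns by position can break silently if the column order ever changes. As a possible alternative, the minimal sketch below converts every character column by type instead. It runs on a small hypothetical data frame, `toy`, which is an assumption made for illustration and is not part of the loan data.

```r
# Hypothetical toy data frame standing in for `dat`
toy <- data.frame(
  Sex = c("F", "M", "F"),
  Income = c(30000L, 89900L, 133300L),
  Purpose = c("Education", "Travel", "Others"),
  stringsAsFactors = FALSE
)

# Find the character columns by type rather than by position
chr_cols <- sapply(toy, is.character)

# Convert only those columns to factors
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

The same two lines should work on `dat` as long as all of its character columns are meant to be categorical.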

## Building and Evaluating a Single Algorithm

The goal of ensemble modeling is to improve performance over a baseline model by combining multiple models. So, we will set the baseline performance measure by starting with one algorithm. In our case, we will build a logistic regression model.

We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation method for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line loads the `caTools` package that will be used for data partitioning, while the third to fifth lines create the training and test datasets. The train set contains 70 percent of the data (420 observations of 10 variables) and the test set contains the remaining 30 percent (180 observations of 10 variables).

```r
set.seed(100)
library(caTools)

spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl == TRUE)
test = subset(dat, spl == FALSE)

print(dim(train)); print(dim(test))
```

Output:

```
[1] 420  10
[1] 180  10
```
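`sample.split()` keeps the ratio of the outcome classes roughly the same in both subsets. The idea can be sketched in base R on a hypothetical outcome vector `y`. This is only an illustration of stratified splitting under that assumption, not the internals of `caTools`.

```r
set.seed(100)

# Hypothetical outcome: 70 "Yes" and 30 "No"
y <- factor(rep(c("Yes", "No"), times = c(70, 30)))

# Within each class, mark 70 percent of the rows for training
in_train <- logical(length(y))
for (cls in levels(y)) {
  rows <- which(y == cls)
  n_train <- round(0.7 * length(rows))
  in_train[rows[sample.int(length(rows), n_train)]] <- TRUE
}

table(y[in_train])   # class counts in the training set
table(y[!in_train])  # class counts in the test set
```

Because the sampling is done separately within each class, both subsets keep the 70/30 balance of the full vector.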

## Build, Predict and Evaluate the Model

The first line of code below fits the logistic regression model on the training data. The second line uses the fitted model to predict probabilities for the test data. The third line generates the confusion matrix using a 0.5 probability cutoff, while the fourth line computes the accuracy on the test data.

```r
# Fit the logistic regression model on the training set
model_glm = glm(approval_status ~ . , family = "binomial", data = train)

# Predicted probabilities on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on the test set (0.5 probability cutoff)
table(test$approval_status, predictTest >= 0.5)
158/nrow(test) # Accuracy - 88% (158 correct predictions out of 180)
```

Output:

```
[1] 0.8777778
```
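The accuracy above is typed in by hand from the confusion matrix. As a rough sketch of how the same number could be derived programmatically, the example below computes accuracy from the diagonal of a confusion matrix. The vectors `actual` and `prob` are hypothetical values chosen for illustration, not results from the loan data.

```r
# Hypothetical actual labels and predicted probabilities
actual <- factor(c("Yes", "No", "Yes", "Yes", "No", "Yes"))
prob   <- c(0.90, 0.20, 0.40, 0.70, 0.60, 0.80)

# Turn probabilities into class labels with a 0.5 cutoff
predicted <- factor(ifelse(prob >= 0.5, "Yes", "No"), levels = levels(actual))

# Rows are actual classes, columns are predicted classes
conf_matrix <- table(actual, predicted)
print(conf_matrix)

# Correct predictions sit on the diagonal (4 of 6 here)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
```

Fixing the factor levels of `predicted` keeps the matrix square even if one class is never predicted, so the diagonal always lines up with the correct predictions.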

We see that the accuracy of the single model is 88 percent. We will now build various ensemble models and see if they improve performance.

## Bagging

Bagging, or bootstrap aggregation, is an ensemble method that involves training the same algorithm many times by using different subsets sampled from the training data. The final output prediction is then averaged across the predictions of all the sub-models. The two most popular bagging ensemble techniques are `Bagged Decision Trees` and `Random Forest`.
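To make the mechanics concrete, here is a minimal base R sketch of bootstrap aggregation. It is an illustration under stated assumptions rather than what `caret` does internally: the data frame `toy` is synthetic, and logistic regression is used as the repeated algorithm only to keep the example free of extra packages.

```r
set.seed(100)

# Hypothetical toy data: two numeric predictors and a binary outcome
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "Yes", "No"))

# Train the same algorithm on B bootstrap samples of the rows
B <- 25
models <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(toy), replace = TRUE)  # sample rows with replacement
  glm(y ~ x1 + x2, family = "binomial", data = toy[idx, ])
})

# Average the predicted probabilities across all sub-models
probs <- sapply(models, predict, newdata = toy, type = "response")
bagged_prob <- rowMeans(probs)
bagged_class <- ifelse(bagged_prob >= 0.5, "Yes", "No")

mean(bagged_class == toy$y)  # training accuracy of the bagged ensemble
```

Each column of `probs` holds the predictions of one sub-model, so `rowMeans()` performs the averaging step described above.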

### Bagged Decision Tree

Bagging performs best with algorithms that have high variance, such as decision trees. We start by setting the seed in the first line of code. The second line specifies the parameters used to control the model training process: five-fold cross-validation repeated five times, with `sampling="rose"` asking `caret` to rebalance the classes within each resample using the `ROSE` package. The third line trains the bagged tree algorithm, where the argument `method="treebag"` specifies the algorithm. The fourth to sixth lines of code generate predictions on the test data, create the confusion matrix, and compute the accuracy, which comes out to be 79 percent. This is lower than the 88 percent achieved by the logistic regression model.

```r
set.seed(100)

control1 <- trainControl(sampling = "rose", method = "repeatedcv", number = 5, repeats = 5)

bagCART_model <- train(approval_status ~ ., data = train, method = "treebag", metric = "Accuracy", trControl = control1)

# Predictions on the test set
predictTest = predict(bagCART_model, newdata = test)

# Confusion matrix on the test set
table(test$approval_status, predictTest)
142/nrow(test) # Accuracy - 78.9%
```

Output:

```
     predictTest
       No Yes
  No   33  24
  Yes  14 109

[1] 0.7888889
```

### Random Forest

Random Forest is an extension of bagged decision trees. As in bagging, each tree is grown on a sample of the training data taken with replacement. In addition, only a random subset of the predictors is considered at each split, which reduces the correlation between the individual decision trees.

We follow the same steps as above, with the exception that while training the algorithm, we set `method="rf"` to specify that a random forest model is to be built. The accuracy of the random forest ensemble is 92 percent, about four percentage points higher than the logistic regression baseline and well above the bagged decision trees.

```r
set.seed(100)

control1 <- trainControl(sampling = "rose", method = "repeatedcv", number = 5, repeats = 5)

rf_model <- train(approval_status ~ ., data = train, method = "rf", metric = "Accuracy", trControl = control1)

# Predictions on the test set
predictTest = predict(rf_model, newdata = test, type = "raw")

# Confusion matrix on the test set
table(test$approval_status, predictTest)
165/nrow(test) # Accuracy - 91.67%
```

Output:

```
[1] 0.9166667
```

## Boosting

In boosting, multiple models are trained sequentially and each model learns from the errors of its predecessors. In this guide, we will implement a gradient boosting algorithm.

For classification problems, boosting aims to convert a set of weak classifiers into a strong one.

We follow the same steps as above, with the exception that while training the algorithm, we set `method="gbm"` to specify that a gradient boosting model is to be built. The accuracy of the model is 78.9 percent, which is lower than the baseline accuracy of 87.8 percent.

```r
set.seed(100)
control2 <- trainControl(sampling = "rose", method = "repeatedcv", number = 5, repeats = 5)

gbm_model <- train(approval_status ~ ., data = train, method = "gbm", metric = "Accuracy", trControl = control2)

# Predictions on the test set
predictTest = predict(gbm_model, newdata = test)

# Confusion matrix on the test set
table(test$approval_status, predictTest)
142/nrow(test) # Accuracy - 78.9%
```

Output:

```
[1] 0.7888889
```
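The idea that each model learns from the errors of its predecessors can be illustrated with a small hand-written sketch. It is not the `gbm` implementation: it assumes a synthetic one-dimensional regression problem with squared-error loss, and the helpers `fit_stump()` and `predict_stump()` are hypothetical names introduced only for this example.

```r
# Hypothetical weak learner: a one-split regression stump
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between values
  sse <- sapply(splits, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

set.seed(100)
x <- runif(200, 0, 6)
y <- sin(x) + rnorm(200, sd = 0.2)

pred <- rep(mean(y), length(y))  # start from a constant prediction
rate <- 0.1                      # learning rate (shrinkage)
n_rounds <- 100
mse <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  resid <- y - pred              # errors of the current ensemble
  stump <- fit_stump(x, resid)   # next learner is fit to those errors
  pred <- pred + rate * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)
}

c(first_round = mse[1], last_round = mse[n_rounds])
```

The training error shrinks from round to round because every new stump targets whatever the previous rounds still get wrong.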

## Stacking

In this approach, the predictions of multiple `caret` models are combined using the `caretEnsemble` package. The `caretList()` function trains a list of base models, and the `caretStack()` function then fits a higher-order model that learns how best to combine their predictions.

The first line of code sets the seed, while the second line specifies the control parameters to be used while modeling. The third line defines the list of algorithms to use, and the fourth line uses the `caretList()` function to train them. Note that these models are trained on the full dataset, `dat`, so the accuracies reported below are cross-validated estimates rather than test set results. The fifth line collects the resampling results, and the sixth line prints their summary. The output shows that the `SVM` algorithm creates the most accurate model, with a mean accuracy of 92.7 percent.

```r
set.seed(100)

control_stacking <- trainControl(method = "repeatedcv", number = 5, repeats = 2, savePredictions = TRUE, classProbs = TRUE)

algorithms_to_use <- c('rpart', 'glm', 'knn', 'svmRadial')

stacked_models <- caretList(approval_status ~ ., data = dat, trControl = control_stacking, methodList = algorithms_to_use)

stacking_results <- resamples(stacked_models)

summary(stacking_results)
```

Output:

```
Call:
summary.resamples(object = stacking_results)

Number of resamples: 10

Accuracy
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
rpart     0.8333333 0.8562500 0.9000000 0.8975000 0.9250000 0.9833333    0
glm       0.8333333 0.8729167 0.8958333 0.8925000 0.9208333 0.9416667    0
knn       0.5666667 0.6020833 0.6583333 0.6375000 0.6645833 0.6750000    0
svmRadial 0.8833333 0.9041667 0.9208333 0.9266667 0.9395833 0.9916667    0

Kappa
                Min.     1st Qu.      Median        Mean    3rd Qu.      Max. NA's
rpart      0.6148909  0.66732995 0.780053451  0.76715842 0.83288101 0.9620253    0
glm        0.6148909  0.71001041 0.767803348  0.75310834 0.81481075 0.8602794    0
knn       -0.2055641 -0.09025536 0.006059649 -0.02267757 0.05209491 0.1034483    0
svmRadial  0.7448360  0.78978371 0.826301181  0.83812274 0.86541005 0.9808795    0
```

Let’s combine the predictions of the classifiers using a simple logistic regression model. The output shows that the accuracy is 91.7 percent.

```r
# Stack using glm
stackControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, savePredictions = TRUE, classProbs = TRUE)

set.seed(100)

glm_stack <- caretStack(stacked_models, method = "glm", metric = "Accuracy", trControl = stackControl)

print(glm_stack)
```

Output:

```
A glm ensemble of 2 base models: lda, rpart, glm, knn, svmRadial

Ensemble results:
Generalized Linear Model

1200 samples
5 predictor
2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 960, 960, 960, 960, 960, 960, ...
Resampling results:

Accuracy   Kappa
0.9166667  0.8128156
```
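The stacking workflow can also be sketched by hand in base R. The example below is a simplified illustration under stated assumptions, not a description of how `caretEnsemble` works internally: the data frame `toy` is synthetic, and the two base learners are deliberately weak logistic regressions that each see a single predictor.

```r
set.seed(100)

# Hypothetical toy data: two numeric predictors and a binary outcome
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "Yes", "No"))

# Two deliberately simple base learners, each using one predictor
base_formulas <- list(m1 = y ~ x1, m2 = y ~ x2)

# Out-of-fold predictions: each row is predicted by models that never saw it
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
oof <- data.frame(m1 = numeric(n), m2 = numeric(n))
for (f in seq_len(k)) {
  held_out <- fold == f
  for (m in names(base_formulas)) {
    fit <- glm(base_formulas[[m]], family = "binomial", data = toy[!held_out, ])
    oof[held_out, m] <- predict(fit, newdata = toy[held_out, ], type = "response")
  }
}

# Meta-learner: logistic regression on the base models' predictions
oof$y <- toy$y
meta <- glm(y ~ m1 + m2, family = "binomial", data = oof)

stacked_class <- ifelse(predict(meta, type = "response") >= 0.5, "Yes", "No")
mean(stacked_class == toy$y)  # accuracy of the stacked model on the toy data
```

Using out-of-fold predictions as inputs keeps the meta-learner from simply rewarding whichever base model overfits the training rows most.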

## Conclusion

In this guide, you have learned about ensemble modeling with R. The performance of the models implemented in the guide is summarized below:

1. Logistic Regression: Accuracy of 87.8 percent

2. Bagged Decision Trees: Accuracy of 78.9 percent

3. Random Forest: Accuracy of 91.7 percent

4. Stochastic Gradient Boosting: Accuracy of 78.9 percent

5. Stacking: Accuracy of 91.7 percent (cross-validated on the full dataset)