Deepika Singh

Validating Machine Learning Models with R

  • Dec 12, 2019
  • 12 Min read
  • 1,044 Views
Data
R

Introduction

Building machine learning models is an important element of predictive modeling. However, without proper model validation, we cannot be confident that the trained model will generalize well to unseen data. Model validation helps ensure that the model performs well on new data, and it helps in selecting the best model, its parameters, and the accuracy metrics.

In this guide, we will learn the basics and implementation of several model validation techniques:

  1. Holdout Validation

  2. K-fold Cross-Validation

  3. Repeated K-fold Cross-Validation

  4. Leave-One-Out Cross-Validation

Data

In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 9 variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No")

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  3. Income: Annual income of the applicant (in USD)

  4. Loan_amount: Loan amount (in USD) for which the application was submitted

  5. Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not

  6. approval_status: Whether the loan application was approved ("Yes") or not ("No")

  7. Age: The applicant's age in years

  8. Sex: Whether the applicant is male ("M") or female ("F")

  9. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

Let's start by loading the required libraries and the data.

library(plyr)
library(readr)
library(dplyr)
library(caret)
library(klaR)

# Read the data and drop the Purpose column
dat <- read_csv("dataset.csv")
dat$Purpose <- NULL

# Inspect column names, types, and the first few values
glimpse(dat)

Output:

Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...

The output shows that the dataset has four numerical variables (labeled as int) and five character variables (labeled as chr). We will convert the character variables into factors using the code below.

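# Positions of the five character columns (renamed from `names` to avoid masking base::names)
factor_cols <- c(1, 2, 5, 6, 8)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)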

Output:

Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
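
The conversion above selects the character columns by position. As a minimal sketch of an alternative, the columns can be selected by type instead, which keeps working if the column order changes. The `toy` data frame below is a hypothetical stand-in with made-up rows, not the guide's dataset.

```r
# Hypothetical toy data with two character columns and one integer column
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Income = c(30000L, 89900L, 133300L),
  Sex = c("F", "M", "M"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are character
chr_cols <- vapply(toy, is.character, logical(1))

# Convert only those columns to factors; numeric columns are left untouched
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

Selecting by type avoids hard-coding positions such as 1, 2, 5, 6, and 8, at the cost of converting every character column, which may not always be what you want.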

Holdout Validation

The holdout validation approach involves creating a training set and a holdout set. The training data is used to train the model, while the holdout data is used to validate model performance. A common split ratio is 70:30, while for small datasets the ratio can be 90:10.

The first line of code below loads the caTools package, which is used for data partitioning. The second line sets the random seed so that the results are reproducible. The next three lines create the training and test sets, and the last line prints their dimensions. The training set contains 70 percent of the data (420 observations of 9 variables) and the test set contains the remaining 30 percent (180 observations of 9 variables).

library(caTools)
set.seed(100)

# Split on the outcome so both sets keep a similar share of approved loans
spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl==TRUE)
test = subset(dat, spl==FALSE)

print(dim(train)); print(dim(test))

Output:

[1] 420   9

[1] 180   9
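
sample.split() splits on the outcome vector while preserving the relative proportions of its labels. The following is a minimal base-R sketch of that idea, not the caTools implementation. The vector `y` and the helper `stratified_split()` are hypothetical names introduced for illustration, and the class counts are made up.

```r
set.seed(100)

# Hypothetical outcome vector standing in for dat$approval_status
y <- factor(rep(c("No", "Yes"), times = c(190, 410)))

# Stratified split: draw `ratio` of the rows within each class separately
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    n_train <- round(ratio * length(idx))
    # sample.int() on the index positions avoids the sample() pitfall
    # that occurs when idx has length one
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

spl <- stratified_split(y)

table(y[spl])   # class counts in the training part
table(y[!spl])  # class counts in the holdout part
```

Because the sampling is done per class, the training and holdout parts have the same class proportions as the full vector, up to rounding.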

Build, Predict and Evaluate the Model

The first line of code below fits a logistic regression model on the training data using glm(). The second line generates predicted probabilities on the test data. The third line builds the confusion matrix using a probability threshold of 0.5, while the fourth line computes and prints the accuracy.

# Fit a logistic regression model on the training set
model_glm = glm(approval_status ~ . , family="binomial", data = train)

# Predicted probabilities on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on the test set (rows: actual, columns: predicted probability >= 0.5)
table(test$approval_status, predictTest >= 0.5)
158/nrow(test) # Accuracy - 88%

Output:

     
      FALSE TRUE
  No     35   22
  Yes    10  113


[1] 0.8777778

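The reported accuracy of the model on the test data is approximately 87.8 percent, which corresponds to the count of 158 hard-coded in the last line of code. Note, however, that the diagonal of the confusion matrix shown above sums to 35 + 113 = 148 correct predictions, which would give 148/180, or about 82.2 percent. The holdout technique is useful, but it has pitfalls. The performance estimate depends on a single random split, so an unlucky split can make the model look better or worse than it really is on new data. This problem can be mitigated using resampling methods, which repeat a calculation multiple times using randomly selected subsets of the complete data. We discuss the popular cross-validation techniques in the following sections of the guide.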
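
Rather than typing the number of correct predictions by hand, accuracy can be computed from the confusion matrix itself. The sketch below rebuilds the matrix from the four counts printed in the output above. The object names (`conf_mat`, `accuracy`, `sensitivity`, `specificity`) are assumptions made for illustration.

```r
# Counts copied from the confusion matrix output above
# (rows: actual class, columns: predicted probability >= 0.5)
conf_mat <- matrix(
  c(35, 10, 22, 113),
  nrow = 2,
  dimnames = list(actual = c("No", "Yes"), predicted = c("FALSE", "TRUE"))
)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity: share of actual "Yes" cases predicted as TRUE
sensitivity <- conf_mat["Yes", "TRUE"] / sum(conf_mat["Yes", ])

# Specificity: share of actual "No" cases predicted as FALSE
specificity <- conf_mat["No", "FALSE"] / sum(conf_mat["No", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

In the guide's own code, the same calculation could be applied to the object returned by table(), so the accuracy always stays in sync with the matrix that is printed.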

K-fold Cross-Validation

In k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds with one fold held back for testing. This process gets repeated to ensure each fold of the dataset gets the chance to be the held-back set. Once the process is completed, we can summarize the evaluation metric using the mean and/or the standard deviation.

We will use five-fold cross-validation for our problem, as specified in the first line of code below. The second line trains a naive Bayes classifier (method="nb", which relies on the klaR package loaded earlier) on the full dataset, while the third line prints the cross-validated results.

# Five-fold cross-validation
control <- trainControl(method="cv", number=5)

# Naive Bayes classifier evaluated with the resampling scheme above
kfold_model <- train(approval_status ~., data=dat, trControl=control, method="nb")

print(kfold_model)

Output:

Naive Bayes 

600 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 480, 480, 480, 480, 480 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa     
  FALSE      0.7616667  0.39489399
   TRUE      0.6816667  0.05721624

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.

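The mean accuracy for the model using k-fold cross-validation is 76.17 percent, which is lower than the 88 percent accuracy reported for the holdout validation approach. Keep in mind that the holdout example used logistic regression while this one uses naive Bayes, so the gap reflects the change of algorithm as well as the change of validation method.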
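
To make the mechanics of k-fold cross-validation concrete, here is a minimal base-R sketch that does by hand what trainControl(method="cv") arranges automatically. It uses a simulated toy dataset and a logistic regression model purely for illustration. The names `toy`, `fold_id`, and `fold_acc` are hypothetical, and this is not how caret is implemented internally.

```r
set.seed(100)

# Hypothetical toy data: one numeric predictor and a noisy binary outcome
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- factor(ifelse(toy$x + rnorm(n) > 0, "Yes", "No"))

k <- 5

# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_acc <- numeric(k)
for (i in seq_len(k)) {
  train_i <- toy[fold_id != i, ]  # k - 1 folds for training
  test_i  <- toy[fold_id == i, ]  # one fold held back for testing

  fit  <- glm(y ~ x, family = "binomial", data = train_i)
  prob <- predict(fit, newdata = test_i, type = "response")
  pred <- ifelse(prob >= 0.5, "Yes", "No")

  fold_acc[i] <- mean(pred == test_i$y)
}

fold_acc        # accuracy on each held-back fold
mean(fold_acc)  # cross-validated accuracy estimate
sd(fold_acc)    # spread across folds
```

Each observation is used for testing exactly once, and the mean and standard deviation of the per-fold accuracies summarize the result.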

Repeated K-fold Cross-Validation

The process of splitting the data into k folds can be repeated a number of times, each time with a different random assignment of observations to folds. This is called repeated k-fold cross-validation, in which the final model accuracy is taken as the mean accuracy across the repeats.

The following lines of code use 5-fold cross-validation with 3 repeats to evaluate a naive Bayes model on the dataset.

# Five-fold cross-validation, repeated three times
control2 <- trainControl(method="repeatedcv", number=5, repeats=3)

repeated_kfold_model <- train(approval_status ~., data=dat, trControl=control2, method="nb")

print(repeated_kfold_model)

Output:

Naive Bayes 

600 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 480, 480, 480, 480, 480, 480, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.7594444  0.3937285
   TRUE      0.6844444  0.0492689

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.

The mean accuracy for the model using repeated k-fold cross-validation is 75.94 percent.
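
The repetition described in this section can be sketched in base R by wrapping one complete k-fold pass in a function and calling it several times. The example uses simulated toy data and logistic regression for illustration only. The helper `kfold_accuracy()` and the other names are hypothetical.

```r
set.seed(100)

# Hypothetical toy data: one numeric predictor and a noisy binary outcome
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- factor(ifelse(toy$x + rnorm(n) > 0, "Yes", "No"))

# Mean accuracy of one k-fold pass with a fresh random fold assignment
kfold_accuracy <- function(data, k) {
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  acc <- sapply(seq_len(k), function(i) {
    fit  <- glm(y ~ x, family = "binomial", data = data[fold_id != i, ])
    prob <- predict(fit, newdata = data[fold_id == i, ], type = "response")
    mean(ifelse(prob >= 0.5, "Yes", "No") == data$y[fold_id == i])
  })
  mean(acc)
}

# Three repeats of 5-fold cross-validation
repeat_acc <- replicate(3, kfold_accuracy(toy, k = 5))

repeat_acc        # one mean accuracy per repeat
mean(repeat_acc)  # final estimate: mean across the repeats
```

Averaging over several random fold assignments reduces the dependence of the estimate on any single partition of the data.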

Leave-One-Out Cross-Validation (LOOCV)

Leave-one-out cross-validation, or LOOCV, is the cross-validation technique in which each fold contains a single observation, so k equals the number of observations in the data. This variation is useful when the training data is of limited size and the number of parameters to be tested is not high, because the model has to be refit once per observation. The lines of code below repeat the steps above with method="LOOCV".

# Leave-one-out cross-validation: one model fit per observation
control3 <- trainControl(method="LOOCV")

loocv_model <- train(approval_status ~., data=dat, trControl=control3, method="nb")

print(loocv_model)

Output:

Naive Bayes 

600 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 599, 599, 599, 599, 599, 599, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa     
  FALSE      0.7700000  0.41755768
   TRUE      0.6833333  0.01893287

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.

The mean accuracy for the model using leave-one-out cross-validation is 77 percent.
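
As a minimal sketch of what leave-one-out cross-validation does, the loop below holds out one row at a time, fits the model on the remaining rows, and records whether the held-out row was classified correctly. It uses simulated toy data and logistic regression for illustration. The names `toy` and `correct` are hypothetical.

```r
set.seed(100)

# Hypothetical toy data: one numeric predictor and a noisy binary outcome
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- factor(ifelse(toy$x + rnorm(n) > 0, "Yes", "No"))

correct <- logical(n)
for (i in seq_len(n)) {
  # Fit on all rows except row i
  fit <- glm(y ~ x, family = "binomial", data = toy[-i, ])

  # Predict the single held-out row
  prob <- predict(fit, newdata = toy[i, , drop = FALSE], type = "response")
  pred <- ifelse(prob >= 0.5, "Yes", "No")

  correct[i] <- pred == toy$y[i]
}

# LOOCV accuracy: share of held-out rows classified correctly
mean(correct)
```

The loop fits n models, which is why the approach becomes expensive as the dataset or the tuning grid grows.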

Conclusion

In this guide, you have learned about several model validation techniques in R. The accuracy results are summarized below. The holdout figure is for a logistic regression model, while the cross-validation figures are for a naive Bayes model:

  1. Holdout Validation Approach: Accuracy of 88%

  2. K-fold Cross-Validation: Mean Accuracy of 76%

  3. Repeated K-fold Cross-Validation: Mean Accuracy of 76%

  4. Leave-One-Out Cross-Validation: Mean Accuracy of 77%
