Building machine learning models is an important element of predictive modeling. However, without proper model validation, there is little reason to be confident that a trained model will generalize well to unseen data. Model validation helps ensure that the model performs well on new data and helps in selecting the best model, its parameters, and an appropriate accuracy metric.
In this guide, we will learn the basics and implementation of several model validation techniques:
Holdout Validation
K-fold Cross-Validation
Repeated K-fold Cross-Validation
Leave-One-Out Cross-Validation (LOOCV)
In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 9 variables, as described below:
Marital_status: Whether the applicant is married ("Yes") or not ("No")
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
Income: Annual income of the applicant (in USD)
Loan_amount: Loan amount (in USD) for which the application was submitted
Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not
Approval_status: Whether the loan application was approved ("Yes") or not ("No"); this is the dependent variable
Age: The applicant's age in years
Sex: Whether the applicant is male ("M") or female ("F")
Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(klaR)

# Load the data and drop the 'Purpose' column, which is not used in this guide
dat <- read_csv("dataset.csv")
dat$Purpose = NULL

glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
The output shows that the dataset has four numerical variables (labeled int) and five character variables (labeled chr). We will convert the five character variables (columns 1, 2, 5, 6, and 8) into factors using the lines of code below.
# Column positions of the character variables to convert to factors
names <- c(1, 2, 5, 6, 8)
dat[,names] <- lapply(dat[,names], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
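As a side note, the same conversion can be written without relying on column positions. The sketch below is illustrative only; it converts columns by type and assumes a recent version of dplyr (1.0.0 or later, for across()), which is already loaded above.
# Alternative (illustrative): cast every character column to a factor by type rather than by position
dat <- dat %>% mutate(across(where(is.character), as.factor))
glimpse(dat)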
The holdout validation approach involves creating a training set and a holdout set. The training data is used to train the model, while the holdout data is used to validate model performance. The common split ratio is 70:30, while for small datasets, the ratio can be 90:10.
The first line of code below loads the caTools package, which will be used for data partitioning, while the second line sets the random seed for reproducibility of results. The next three lines create the training and test sets. The training set contains 70 percent of the data (420 observations of 9 variables) and the test set contains the remaining 30 percent (180 observations of 9 variables).
library(caTools)
set.seed(100)

# 70:30 split that preserves the ratio of the target classes
spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl == TRUE)
test = subset(dat, spl == FALSE)

print(dim(train)); print(dim(test))
Output:
[1] 420   9

[1] 180   9
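As an aside, the caret package (already loaded) provides createDataPartition() for the same kind of stratified split. The sketch below is illustrative only, and the names train2 and test2 are ours; the rest of the guide continues with the train and test sets created above.
# Illustrative alternative: stratified 70:30 split with caret::createDataPartition()
set.seed(100)
idx <- createDataPartition(dat$approval_status, p = 0.7, list = FALSE)[, 1]
train2 <- dat[idx, ]    # 70 percent of the rows, preserving the class ratio
test2  <- dat[-idx, ]   # remaining 30 percent
print(dim(train2)); print(dim(test2))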
The first line of code below fits the logistic regression model on the training data using glm(). The second line generates predicted probabilities on the test data, the third builds the confusion matrix at a 0.5 probability cutoff, and the fourth computes and prints the accuracy.
# Fit a logistic regression model on the training data
model_glm = glm(approval_status ~ . , family = "binomial", data = train)

# Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test set, using a 0.5 probability cutoff
table(test$approval_status, predictTest >= 0.5)
158/nrow(test) # Accuracy - 88% (correct predictions divided by the test-set size)
Output:
       FALSE  TRUE
  No      35    22
  Yes     10   113

[1] 0.8777778
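Rather than hardcoding the number of correct predictions, the accuracy can also be derived from the confusion matrix itself. A minimal sketch using the objects created above:
# Accuracy from the confusion matrix: correct predictions (the diagonal) over the total
cm <- table(test$approval_status, predictTest >= 0.5)
sum(diag(cm)) / sum(cm)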
We can see that the accuracy of the model on the test data is approximately 87.8 percent. The above technique is useful, but it has pitfalls: the estimate depends heavily on how the data happens to be split, and an unrepresentative split can make the model look much better or worse on new data than it really is. This problem can be rectified using resampling methods, which repeat the train-and-evaluate calculation multiple times on different randomly selected subsets of the complete data. We discuss the popular cross-validation techniques in the following sections of the guide.
In k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds with one fold held back for testing. This process gets repeated to ensure each fold of the dataset gets the chance to be the held-back set. Once the process is completed, we can summarize the evaluation metric using the mean and/or the standard deviation.
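To make the mechanics concrete, the sketch below hand-rolls this loop using caret's createFolds() and the same logistic regression used in the holdout section. It is illustrative only; the names folds and fold_acc are ours, and the next step uses caret's train() to do the same work automatically.
# Illustrative only: a manual 5-fold loop (createFolds() is from caret, loaded above)
set.seed(100)
folds <- createFolds(dat$approval_status, k = 5)   # five held-out index sets
fold_acc <- sapply(folds, function(idx) {
  fit  <- glm(approval_status ~ ., family = "binomial", data = dat[-idx, ])   # train on k-1 folds
  pred <- predict(fit, newdata = dat[idx, ], type = "response") >= 0.5        # score the held-out fold
  mean((dat$approval_status[idx] == "Yes") == pred)                           # fold accuracy
})
mean(fold_acc); sd(fold_acc)   # summarize the metric across the folds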
In practice, the caret package handles this resampling for us. We will use five-fold cross-validation for our problem statement, as specified in the first line of code below. The second line trains a naive Bayes model (method = "nb") with this resampling scheme, while the third line prints the model result.
# Five-fold cross-validation
control <- trainControl(method="cv", number=5)

# Train a naive Bayes model using the resampling scheme defined above
kfold_model <- train(approval_status ~., data=dat, trControl=control, method="nb")

print(kfold_model)
Output:
Naive Bayes

600 samples
  8 predictor
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 480, 480, 480, 480, 480
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.7616667  0.39489399
   TRUE      0.6816667  0.05721624

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
The mean accuracy for the model using k-fold cross-validation is 76.17 percent, which is less than the 88 percent accuracy achieved using the holdout validation approach.
The process of splitting the data into k folds can itself be repeated a number of times. This is called repeated k-fold cross-validation, in which the final model accuracy is taken as the mean across all folds and repeats. The following lines of code use 5-fold cross-validation with 3 repeats (15 resamples in total) to estimate a naive Bayes model on the dataset.
# Five-fold cross-validation, repeated 3 times
control2 <- trainControl(method="repeatedcv", number=5, repeats=3)

repeated_kfold_model <- train(approval_status ~., data=dat, trControl=control2, method="nb")

print(repeated_kfold_model)
Output:
Naive Bayes

600 samples
  8 predictor
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 480, 480, 480, 480, 480, 480, ...
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.7594444  0.3937285
   TRUE      0.6844444  0.0492689

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
The mean accuracy for the model using repeated k-fold cross-validation is 75.94 percent.
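If you want to see the spread behind these means rather than just the averages, the per-resample accuracies are stored on the fitted train objects. A short sketch, assuming the kfold_model and repeated_kfold_model objects created above:
# Per-resample results for the selected tuning parameters (columns include Accuracy and Kappa)
kfold_model$resample                           # 5 rows: one per fold
repeated_kfold_model$resample                  # 15 rows: 5 folds x 3 repeats
sd(repeated_kfold_model$resample$Accuracy)     # spread of accuracy across resamples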
Leave-one-out cross-validation, or LOOCV, is the cross-validation technique in which the size of the fold is “1” with “k” being set to the number of observations in the data. This variation is useful when the training data is of limited size and the number of parameters to be tested is not high. The lines of code below repeat the steps above.
# Leave-one-out cross-validation: each observation is held out once
control3 <- trainControl(method="LOOCV")

loocv_model <- train(approval_status ~., data=dat, trControl=control3, method="nb")

print(loocv_model)
Output:
Naive Bayes

600 samples
  8 predictor
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 599, 599, 599, 599, 599, 599, ...
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.7700000  0.41755768
   TRUE      0.6833333  0.01893287

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
The mean accuracy for the model using leave-one-out cross-validation is 77 percent.
In this guide, you have learned about several model validation techniques in R. The mean accuracy results for the techniques are summarized below:
Holdout Validation Approach: Accuracy of 88%
K-fold Cross-Validation: Mean Accuracy of 76%
Repeated K-fold Cross-Validation: Mean Accuracy of 76%
Leave-One-Out Cross-Validation: Mean Accuracy of 77%