Building machine learning models is an important element of predictive modeling. However, without proper model validation, there is little assurance that a trained model will generalize well to unseen data. Model validation helps ensure that the model performs well on new data, and it guides the selection of the best model, its parameters, and the appropriate accuracy metrics.
In this guide, we will learn the basics and implementation of several model validation techniques:
Holdout Validation
K-fold Cross-Validation
Repeated K-fold Cross-Validation
Leave-One-Out Cross-Validation
In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 9 variables, as described below:
Marital_status: Whether the applicant is married ("Yes") or not ("No")
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
Income: Annual income of the applicant (in USD)
Loan_amount: Loan amount (in USD) for which the application was submitted
Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not
Approval_status: Whether the loan application was approved ("Yes") or not ("No")
Age: The applicant's age in years
Sex: Whether the applicant is male ("M") or female ("F")
Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant
Let's start by loading the required libraries and the data.
# Load the required libraries
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(klaR)

# Read the data and drop the Purpose column, which is not used in this guide
dat <- read_csv("dataset.csv")
dat$Purpose = NULL

glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
The output shows that the dataset has four numeric variables (labeled int) and five character variables (labeled chr). We will convert the character variables into factor variables using the lines of code below.
# Columns 1, 2, 5, 6, and 8 hold the five character variables
names <- c(1, 2, 5, 6, 8)
dat[,names] <- lapply(dat[,names], factor)

glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
The holdout validation approach involves creating a training set and a holdout set. The training data is used to train the model, while the holdout data is used to validate model performance. The common split ratio is 70:30, while for small datasets, the ratio can be 90:10.
The first line of code below loads the caTools package, which will be used for data partitioning. The second line sets the random seed for reproducibility of results, while the third to fifth lines create the training and test sets. The training set contains 70 percent of the data (420 observations of 9 variables) and the test set contains the remaining 30 percent (180 observations of 9 variables).
# Load the caTools package for data partitioning
library(caTools)

# Create a 70:30 train/test split, stratified on the outcome variable
set.seed(100)
spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl == TRUE)
test = subset(dat, spl == FALSE)

print(dim(train)); print(dim(test))
Output:
[1] 420   9
[1] 180   9
The first line of code below fits the logistic regression model on the training data using the glm() function. The second line generates predictions on the test data, the third line generates the confusion matrix, and the fourth line computes and prints the accuracy.
# Fit the logistic regression model on the training set
model_glm = glm(approval_status ~ ., family = "binomial", data = train)

# Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on the test set, using a 0.5 probability cutoff
table(test$approval_status, predictTest >= 0.5)

# Correct predictions divided by the number of test observations
158/nrow(test)  # Accuracy - 88%
Output:
       FALSE TRUE
  No      35   22
  Yes     10  113

[1] 0.8777778
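As a side note, the accuracy in the last line of the code above relies on a hard-coded count of correct predictions. A small alternative sketch, reusing the test and predictTest objects created above, derives the accuracy directly from the confusion matrix:

# Compute accuracy from the confusion matrix: correct predictions / total
cm <- table(test$approval_status, predictTest >= 0.5)
accuracy <- sum(diag(cm)) / sum(cm)   # diagonal cells are the correctly classified applications
print(accuracy)

Here sum(diag(cm)) counts the applications whose predicted class matches the actual class, so the calculation does not need to be updated by hand if the split changes.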
We can see that the accuracy of the model on the test data is approximately 87.8 percent. This technique is useful, but it has pitfalls: the result depends heavily on how the data happens to be split, and an unlucky split can lead to the model over-fitting or under-fitting the new data. This problem can be rectified by resampling methods, which repeat a calculation multiple times using randomly selected subsets of the complete data. We discuss the popular cross-validation techniques in the following sections of the guide.
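Before moving on, it is worth seeing how sensitive the holdout estimate is to the particular split. The following sketch simply repeats the 70:30 split and the logistic regression fit for a few arbitrary seeds (the seed values are illustrative) and prints the resulting test accuracies, reusing dat and the caTools package loaded above:

# Repeat the holdout procedure with different random splits
for (s in c(1, 42, 100)) {
  set.seed(s)
  spl <- sample.split(dat$approval_status, SplitRatio = 0.7)
  tr <- subset(dat, spl == TRUE)
  te <- subset(dat, spl == FALSE)
  fit <- glm(approval_status ~ ., family = "binomial", data = tr)
  pred <- predict(fit, newdata = te, type = "response")
  print(mean((pred >= 0.5) == (te$approval_status == "Yes")))   # test accuracy for this split
}

The spread of these numbers gives a feel for how much a single holdout figure can move with the split, which is exactly the problem cross-validation addresses.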
In k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds, with the remaining fold held back for testing. This process is repeated so that each fold serves as the held-back test set exactly once. Once the process is completed, we can summarize the evaluation metric using the mean and/or the standard deviation of the k results.
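Before handing this over to caret, it may help to see the mechanics spelled out. The sketch below hand-rolls 5-fold cross-validation for the logistic regression model from the previous section, using caret's createFolds() helper to build the folds; the caret code that follows does all of this (plus parameter tuning) automatically:

# Build five folds of row indices, stratified on the outcome variable
set.seed(100)
folds <- createFolds(dat$approval_status, k = 5)

# For each fold: train on the other four folds, test on the held-back fold
accs <- sapply(folds, function(idx) {
  train_fold <- dat[-idx, ]
  test_fold  <- dat[idx, ]
  fit  <- glm(approval_status ~ ., family = "binomial", data = train_fold)
  pred <- predict(fit, newdata = test_fold, type = "response")
  mean((pred >= 0.5) == (test_fold$approval_status == "Yes"))
})

# Summarize the evaluation metric across the five folds
mean(accs); sd(accs)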
We will use five-fold cross-validation for our problem statement, as specified in the first line of code below. The second line trains a naive Bayes model (method = "nb") with this cross-validation scheme, while the third line prints the model result.
# Define the 5-fold cross-validation scheme
control <- trainControl(method = "cv", number = 5)

# Train a naive Bayes model using 5-fold cross-validation
kfold_model <- train(approval_status ~ ., data = dat, trControl = control, method = "nb")

print(kfold_model)
Output:
Naive Bayes

600 samples
  8 predictor
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 480, 480, 480, 480, 480
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.7616667  0.39489399
   TRUE      0.6816667  0.05721624

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
The mean accuracy for the model using k-fold cross-validation is 76.17 percent, which is less than the 88 percent accuracy achieved using the holdout validation approach.
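If you want the fold-level numbers behind this summary, caret stores the per-resample performance of the final model on the fitted object. A quick sketch, assuming the kfold_model object created above:

# Accuracy and kappa for each of the five folds
kfold_model$resample

# Summarize the evaluation metric with the mean and standard deviation
mean(kfold_model$resample$Accuracy)
sd(kfold_model$resample$Accuracy)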
The process of splitting the data into k folds can be repeated a number of times. This is called repeated k-fold cross-validation, in which the final model accuracy is taken as the mean across all the repeats.
The following lines of code use 5-fold cross-validation with 3 repeats to estimate the naive Bayes model on the dataset.
# Define the 5-fold cross-validation scheme with 3 repeats
control2 <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# Train the naive Bayes model using repeated 5-fold cross-validation
repeated_kfold_model <- train(approval_status ~ ., data = dat, trControl = control2, method = "nb")

print(repeated_kfold_model)
Output:
Naive Bayes

600 samples
  8 predictor
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 480, 480, 480, 480, 480, 480, ...
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.7594444  0.3937285
   TRUE      0.6844444  0.0492689

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
The mean accuracy for the model using repeated k-fold cross-validation is 75.94 percent.
Leave-one-out cross-validation, or LOOCV, is a cross-validation technique in which each fold contains a single observation, so k is equal to the number of observations in the data. This variation is useful when the training data is of limited size and the number of parameters to be tested is not high. The lines of code below repeat the earlier steps using this scheme.
# Define the leave-one-out cross-validation scheme
control3 <- trainControl(method = "LOOCV")

# Train the naive Bayes model using leave-one-out cross-validation
loocv_model <- train(approval_status ~ ., data = dat, trControl = control3, method = "nb")

print(loocv_model)
Output:
Naive Bayes

600 samples
  8 predictor
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 599, 599, 599, 599, 599, 599, ...
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.7700000  0.41755768
   TRUE      0.6833333  0.01893287

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
The mean accuracy for the model using leave-one-out cross-validation is 77 percent.
In this guide, you have learned about the various model validation techniques in R. The mean accuracy result for the techniques is summarized below:
Holdout Validation Approach: Accuracy of 88%
K-fold Cross-Validation: Mean Accuracy of 76%
Repeated K-fold Cross-Validation: Mean Accuracy of 76%
Leave-One-Out Cross-Validation: Mean Accuracy of 77%
To learn more about data science using R, please refer to the following guides: