Building machine learning models is an important element of predictive modeling. However, without proper model validation, the confidence that the trained model will generalize well on the unseen data can never be high. Model validation helps in ensuring that the model performs well on new data, and helps in selecting the best model, the parameters, and the accuracy metrics.
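In this guide, we will learn the basics and implementation of several model validation techniques: holdout validation, k-fold cross-validation, stratified k-fold cross-validation, leave-one-out cross-validation (LOOCV), and repeated random train-test splits.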
The aim of this guide is to build a classification model to detect diabetes and learn how to validate it using several techniques. We will be using the diabetes dataset, which contains 768 observations and 9 variables: eight predictors (pregnancies, glucose, diastolic, triceps, insulin, bmi, dpf, and age) and the target variable, diabetes.
In this guide, we will follow these steps:
Step 1 - Loading the required libraries and modules.
Step 2 - Reading the data and performing basic data checks.
Step 3 - Creating arrays for the features and the response variable.
Step 4 - Trying out different model validation techniques.
The following sections will cover these steps.
    # Import required libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn

    # Import necessary modules
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold
    from sklearn.model_selection import LeaveOneOut
    from sklearn.model_selection import LeavePOut
    from sklearn.model_selection import ShuffleSplit
    from sklearn.model_selection import StratifiedKFold
The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 768 observations of 9 variables. The third line gives the transposed summary statistics of the variables.
    dat = pd.read_csv('diabetes.csv')
    print(dat.shape)
    dat.describe().transpose()

(768, 9)

| | count | mean | std | min | 25% | 50% | 75% | max |
|-------------|-------|------------|------------|--------|----------|----------|-----------|--------|
| pregnancies | 768.0 | 3.845052 | 3.369578 | 0.000 | 1.00000 | 3.0000 | 6.00000 | 17.00 |
| glucose | 768.0 | 120.894531 | 31.972618 | 0.000 | 99.00000 | 117.0000 | 140.25000 | 199.00 |
| diastolic | 768.0 | 69.105469 | 19.355807 | 0.000 | 62.00000 | 72.0000 | 80.00000 | 122.00 |
| triceps | 768.0 | 20.536458 | 15.952218 | 0.000 | 0.00000 | 23.0000 | 32.00000 | 99.00 |
| insulin | 768.0 | 79.799479 | 115.244002 | 0.000 | 0.00000 | 30.5000 | 127.25000 | 846.00 |
| bmi | 768.0 | 31.992578 | 7.884160 | 0.000 | 27.30000 | 32.0000 | 36.60000 | 67.10 |
| dpf | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| age | 768.0 | 33.240885 | 11.760232 | 21.000 | 24.00000 | 29.0000 | 41.00000 | 81.00 |
| diabetes | 768.0 | 0.348958 | 0.476951 | 0.000 | 0.00000 | 0.0000 | 1.00000 | 1.00 |
Looking at the summary for the 'diabetes' variable, we observe that the mean value is 0.35, which means that around 35 percent of the observations in the dataset have diabetes. Therefore, a model that always predicts the majority class (no diabetes) would achieve a baseline accuracy of about 65 percent, and the model we build should beat this benchmark.
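The baseline figure can be checked with a few lines of plain Python. This is a minimal sketch, not part of the original guide: the helper name `majority_baseline_accuracy` is hypothetical, and the label counts (268 positives, 500 negatives) are an assumption derived from the mean of 0.348958 over 768 rows in the summary table.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent class."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Assumption: 268 positives out of 768 rows, which matches the mean of
# 0.348958 reported in the summary table (268 / 768 = 0.348958...).
labels = [1] * 268 + [0] * 500
baseline = majority_baseline_accuracy(labels)
print(f"Baseline accuracy: {baseline:.2%}")
```

Any classifier that scores below this value is doing worse than ignoring the features altogether.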
The lines of code below create an array of the features and the dependent variable, respectively.
    x1 = dat.drop('diabetes', axis=1).values
    y1 = dat['diabetes'].values
With the arrays of the features and the response variable created, we will start discussing the various model validation strategies.
The holdout validation approach refers to creating the training and the holdout sets, also referred to as the 'test' or the 'validation' set. The training data is used to train the model while the unseen data is used to validate the model performance. The common split ratio is 70:30, while for small datasets, the ratio can be 90:10.
We will use the 70:30 ratio split for the diabetes dataset. The first line of code splits the data into the training and the test data. The second line instantiates the LogisticRegression() model, while the third line fits the model on the training data. The fourth line uses the trained model to compute the mean accuracy on the test data, while the fifth line prints the result as a percentage.

    # Evaluate using a train and a test set
    X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x1, y1, test_size=0.30, random_state=100)
    model = LogisticRegression()
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: %.2f%%" % (result*100.0))
We can see that the accuracy for the model on the test data is approximately 74 percent. The above technique is useful but it has pitfalls. The estimate depends heavily on which observations happen to land in the test set, so a single unlucky split can make the model look better or worse than it really is. This problem can be reduced using resampling methods, which repeat the calculation multiple times using randomly selected subsets of the complete data. We discuss the popular cross-validation techniques in the following sections of the guide.
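To illustrate how much a single holdout split depends on chance, the sketch below draws several 70:30 splits with different seeds and reports the share of positive cases that ends up in each test set. It is an illustration in plain Python, not scikit-learn's `train_test_split`: the helper `holdout_split`, its rounding rule, and the assumed label counts (268 positives, 500 negatives, consistent with the summary table) are all choices made for this example.

```python
import random


def holdout_split(n_samples, test_fraction, seed):
    """Shuffle the indices 0..n_samples-1 and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)  # rounding rule chosen for this sketch
    return indices[n_test:], indices[:n_test]


# Assumed labels: 268 positives and 500 negatives, consistent with the summary table.
labels = [1] * 268 + [0] * 500

positive_rates = []
for seed in range(5):
    train_idx, test_idx = holdout_split(len(labels), 0.30, seed)
    rate = sum(labels[i] for i in test_idx) / len(test_idx)
    positive_rates.append(rate)
    print(f"seed={seed}: test size={len(test_idx)}, positive share={rate:.3f}")
```

The class mix of the test set typically shifts from one seed to the next, and the accuracy measured on it shifts with it, which is the motivation for averaging over several splits.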
In k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds, with the remaining fold held back for testing. This process is repeated so that each fold gets the chance to be the held-back set. Once the process is completed, we can summarize the evaluation metric using the mean and/or the standard deviation.
We will use 10-fold cross-validation for our problem statement. The first line of code uses the 'model_selection.KFold' class from 'scikit-learn' to create a splitter with 10 folds. The second line instantiates the LogisticRegression() model, while the third line fits the model on each training split and generates the cross-validation scores. The arguments 'x1' and 'y1' represent the predictor and the response arrays, respectively. The 'cv' argument takes the splitter that defines the cross-validation splits. The fourth line prints the mean accuracy result.
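    # random_state is omitted: it only has an effect when shuffle=True, and
    # recent scikit-learn versions raise an error if it is set without shuffling
    kfold = model_selection.KFold(n_splits=10)
    model_kfold = LogisticRegression()
    results_kfold = model_selection.cross_val_score(model_kfold, x1, y1, cv=kfold)
    print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))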
The mean accuracy for the model using k-fold cross-validation is 76.95 percent, which is higher than the 74 percent estimate from the holdout validation approach.
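To make the k-fold partitioning described above concrete, here is a small from-scratch sketch of how 768 observations can be cut into 10 consecutive folds. The generator `kfold_indices` is a hypothetical helper written for this illustration and is not scikit-learn's implementation; it only assumes that leftover observations are spread over the first folds so that fold sizes differ by at most one.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(kfold_indices(768, 10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Every observation is used for testing exactly once and for training in the other nine rounds, which is why the averaged score uses all of the data.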
The stratified k-fold approach is a variation of k-fold cross-validation that returns stratified folds, i.e., each fold contains approximately the same ratio of target labels as the complete data.
The lines of code below repeat the steps as discussed above for k-fold cross-validation, except for a couple of changes. The first line creates a 'StratifiedKFold' splitter, here with three folds, instead of 'KFold', and this splitter is then passed to the 'cv' argument in the third line of code.
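    # random_state is omitted: it only has an effect when shuffle=True, and
    # recent scikit-learn versions raise an error if it is set without shuffling
    skfold = StratifiedKFold(n_splits=3)
    model_skfold = LogisticRegression()
    results_skfold = model_selection.cross_val_score(model_skfold, x1, y1, cv=skfold)
    print("Accuracy: %.2f%%" % (results_skfold.mean()*100.0))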
The mean accuracy for the model using stratified k-fold cross-validation is 76.96 percent.
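The idea of stratification can be sketched without any library: deal the members of each class out to the folds in turn, so every fold inherits roughly the overall class mix. This is an illustrative approach and not the algorithm scikit-learn uses internally; the helper `stratified_fold_assignment` is hypothetical, and the label counts (268 positives, 500 negatives) are assumed from the summary table.

```python
from collections import defaultdict


def stratified_fold_assignment(labels, n_splits):
    """Assign each sample to a fold so every fold keeps roughly the overall class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_of = [None] * len(labels)
    for members in by_class.values():
        # Deal the members of each class out to the folds in turn, like cards.
        for position, index in enumerate(members):
            fold_of[index] = position % n_splits
    return fold_of


# Assumed labels: 268 positives and 500 negatives, consistent with the summary table.
labels = [1] * 268 + [0] * 500
fold_of = stratified_fold_assignment(labels, 3)
for fold in range(3):
    test_idx = [i for i, f in enumerate(fold_of) if f == fold]
    share = sum(labels[i] for i in test_idx) / len(test_idx)
    print(f"fold {fold}: size={len(test_idx)}, positive share={share:.3f}")
```

Each fold ends up with a positive share close to the 35 percent seen in the full data, which matters most when one class is rare.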
Leave-one-out cross-validation (LOOCV) is the cross-validation technique in which each fold contains a single observation, with “k” set to the number of observations in the data. This variation is useful when the training data is of limited size and the number of parameters to be tested is not high, because the model must be fitted once per observation.
The lines of code below repeat the steps as discussed above, except for a couple of changes. The first line creates the leave-one-out splitter instead of the k-fold splitter, and this splitter is then passed to the 'cv' argument in the third line of code.

    loocv = model_selection.LeaveOneOut()
    model_loocv = LogisticRegression()
    results_loocv = model_selection.cross_val_score(model_loocv, x1, y1, cv=loocv)
    print("Accuracy: %.2f%%" % (results_loocv.mean()*100.0))
The mean accuracy for the model using the leave-one-out cross-validation is 76.82 percent.
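The mechanics of leave-one-out can be seen on a toy problem. The sketch below uses made-up one-dimensional data and a simple 1-nearest-neighbour rule in place of logistic regression, purely to keep the example dependency-free; the data values and the helper `predict_1nn` are hypothetical and have nothing to do with the diabetes dataset.

```python
def predict_1nn(train_x, train_y, query):
    """Return the label of the training point closest to `query`."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[nearest]


# Hypothetical toy data: one feature, two classes.
x = [1.0, 1.2, 0.9, 2.2, 3.0, 3.2, 2.9]
y = [0, 0, 0, 0, 1, 1, 1]

scores = []
for held_out in range(len(x)):
    # Train on every observation except the one held out.
    train_x = x[:held_out] + x[held_out + 1:]
    train_y = y[:held_out] + y[held_out + 1:]
    prediction = predict_1nn(train_x, train_y, x[held_out])
    scores.append(1 if prediction == y[held_out] else 0)

loocv_accuracy = sum(scores) / len(scores)
print(f"{len(scores)} fits, LOOCV accuracy: {loocv_accuracy:.2%}")
```

Each individual score is either 0 or 1, so the reported accuracy is simply the fraction of held-out observations predicted correctly. With 768 observations, the same loop would fit the model 768 times.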
The repeated random train-test split technique is a hybrid of traditional train-test splitting and the k-fold cross-validation method. In this technique, we create random splits of the data in the training-test set manner and then repeat the process of splitting and evaluating the algorithm multiple times, just like the cross-validation method.
The lines of code below repeat the steps as discussed above for the LOOCV method, except for a couple of changes: the first line creates a 'ShuffleSplit' splitter with 10 random 70:30 splits, the third line passes it to the 'cv' argument, and the fourth line prints the standard deviation of the accuracy alongside the mean.

    kfold2 = model_selection.ShuffleSplit(n_splits=10, test_size=0.30, random_state=100)
    model_shufflecv = LogisticRegression()
    results_4 = model_selection.cross_val_score(model_shufflecv, x1, y1, cv=kfold2)
    print("Accuracy: %.2f%% (%.2f%%)" % (results_4.mean()*100.0, results_4.std()*100.0))

Accuracy: 74.76% (2.52%)
The mean accuracy for the model using the repeated random train-test split method is 74.76 percent.
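One practical difference from k-fold is that the test sets of repeated random splits are drawn independently, so an observation may be tested several times or not at all. The sketch below counts this for 10 random 70:30 splits of 768 indices. It is a plain-Python illustration rather than scikit-learn's `ShuffleSplit`: the generator `shuffle_splits` and its rounding rule are assumptions made for this example, so the exact splits will differ from those produced by the library.

```python
import random
from collections import Counter


def shuffle_splits(n_samples, n_splits, test_fraction, seed):
    """Yield (train_indices, test_indices) from independent random shuffles."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)  # rounding rule chosen for this sketch
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


splits = list(shuffle_splits(768, 10, 0.30, seed=100))

# Count how often each observation lands in a test set across the 10 repetitions.
times_tested = Counter(i for _, test in splits for i in test)
never_tested = 768 - len(times_tested)
print(f"never tested: {never_tested}, most often tested: {max(times_tested.values())} times")
```

Because coverage is uneven, this method trades the guarantee that every observation is tested exactly once for free control over the test size and the number of repetitions.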
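In this guide, you have learned about the various model validation techniques using scikit-learn. The guide used the diabetes dataset and built a logistic regression classifier to detect diabetes.
The mean accuracy result for the various techniques is summarised below:
Holdout validation - approximately 74 percent
K-fold cross-validation - 76.95 percent
Stratified k-fold cross-validation - 76.96 percent
Leave-one-out cross-validation - 76.82 percent
Repeated random train-test splits - 74.76 percent
The k-fold, stratified k-fold, and leave-one-out estimates are close to one another (around 77 percent), while the two approaches based on random 70:30 splits give lower estimates (around 74 to 75 percent). Cross-validation does not improve the model itself; it averages over many splits and so gives a more reliable estimate of performance than a single holdout split, which makes it the better model validation strategy. The model can be further improved by doing exploratory data analysis, data pre-processing, feature engineering, or trying out other machine learning algorithms instead of the logistic regression algorithm we built in this guide.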