Validating Machine Learning Models with scikit-learn

Jun 6, 2019 • 12 Minute Read

Introduction

Building machine learning models is an important element of predictive modeling. However, without proper model validation, we cannot be confident that a trained model will generalize well to unseen data. Model validation checks that the model performs well on new data, and helps in selecting the best model, its parameters, and the evaluation metrics.

In this guide, we will learn the basics and implementation of several model validation techniques, mentioned below:

Holdout Validation
K-fold Cross-Validation
Stratified K-fold Cross-Validation
Leave One Out Cross-Validation
Repeated Random Test-Train Splits

Problem Statement

The aim of this guide is to build a classification model to detect diabetes and learn how to validate it using several techniques. We will be using the diabetes dataset which contains 768 observations and 9 variables, as described below:

pregnancies - Number of times pregnant.
glucose - Plasma glucose concentration.
diastolic - Diastolic blood pressure (mm Hg).
triceps - Triceps skinfold thickness (mm).
insulin - 2-Hour serum insulin (mu U/ml).
bmi - Body mass index (weight in kg/(height in m)^2).
dpf - Diabetes pedigree function.
age - Age in years.
diabetes - “1” represents the presence of diabetes while “0” represents the absence of it. This is the target variable.

Steps

In this guide, we will follow these steps:

Step 1 - Loading the required libraries and modules.

Step 2 - Reading the data and performing basic data checks.

Step 3 - Creating arrays for the features and the response variable.

Step 4 - Trying out different model validation techniques.

The following sections will cover these steps.

Step 1 - Loading the Required Libraries and Modules

# Import required libraries
import pandas as pd

# Import necessary modules (only those used in this guide)
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
    

Step 2 - Reading the Data and Performing Basic Data Checks

The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 768 observations of 9 variables. The third line gives the transposed summary statistics of the variables.

dat = pd.read_csv('diabetes.csv')
print(dat.shape)
dat.describe().transpose()
    

Output:

      (768, 9)
    
|             | count | mean       | std        | min    | 25%      | 50%      | 75%       | max    |
|-------------|-------|------------|------------|--------|----------|----------|-----------|--------|
| pregnancies | 768.0 | 3.845052   | 3.369578   | 0.000  | 1.00000  | 3.0000   | 6.00000   | 17.00  |
| glucose     | 768.0 | 120.894531 | 31.972618  | 0.000  | 99.00000 | 117.0000 | 140.25000 | 199.00 |
| diastolic   | 768.0 | 69.105469  | 19.355807  | 0.000  | 62.00000 | 72.0000  | 80.00000  | 122.00 |
| triceps     | 768.0 | 20.536458  | 15.952218  | 0.000  | 0.00000  | 23.0000  | 32.00000  | 99.00  |
| insulin     | 768.0 | 79.799479  | 115.244002 | 0.000  | 0.00000  | 30.5000  | 127.25000 | 846.00 |
| bmi         | 768.0 | 31.992578  | 7.884160   | 0.000  | 27.30000 | 32.0000  | 36.60000  | 67.10  |
| dpf         | 768.0 | 0.471876   | 0.331329   | 0.078  | 0.24375  | 0.3725   | 0.62625   | 2.42   |
| age         | 768.0 | 33.240885  | 11.760232  | 21.000 | 24.00000 | 29.0000  | 41.00000  | 81.00  |
| diabetes    | 768.0 | 0.348958   | 0.476951   | 0.000  | 0.00000  | 0.0000   | 1.00000   | 1.00   |
    

Looking at the summary for the 'diabetes' variable, we observe that the mean value is 0.35, which means that around 35 percent of the observations in the dataset have diabetes. A model that always predicts the majority class (no diabetes) would therefore be about 65 percent accurate. This is the baseline accuracy, and any model we build should beat it.
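The baseline figure can be checked with a few lines of plain Python. This is a minimal sketch: the class counts (268 positives, 500 negatives) are an assumption inferred from the summary table, where the 'diabetes' column has a mean of 0.348958 over 768 rows, and the helper name `majority_baseline_accuracy` is made up for illustration.

```python
from collections import Counter

# Assumption: class counts inferred from the summary table
# (mean 0.348958 over 768 rows is about 268 positives).
y = [1] * 268 + [0] * 500


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


baseline = majority_baseline_accuracy(y)
print("Baseline accuracy: %.2f%%" % (baseline * 100.0))  # Baseline accuracy: 65.10%
```

Any classifier whose validated accuracy is not clearly above this number has learned little beyond the class frequencies.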

Step 3 - Creating Arrays for the Features and the Response Variable

The lines of code below create an array of the features and the dependent variable, respectively.

x1 = dat.drop('diabetes', axis=1).values
y1 = dat['diabetes'].values
    

Step 4 - Trying out Different Model Validation Techniques

With the arrays of the features and the response variable created, we will start discussing the various model validation strategies.

Holdout Validation Approach - Train and Test Set Split

The holdout validation approach splits the data into a training set and a holdout set, the latter also referred to as the 'test' or the 'validation' set. The training data is used to train the model, while the unseen holdout data is used to validate the model performance. A common split ratio is 70:30, while for small datasets, the ratio can be 90:10.

We will use the 70:30 ratio split for the diabetes dataset. The first line of code splits the data into the training and the test data. The second line instantiates the LogisticRegression() model, while the third line fits the model on the training data. The fourth line uses the trained model to compute the accuracy on the test data, while the fifth line prints the accuracy result.

# Evaluate using a train and a test set
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x1, y1, test_size=0.30, random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
    

Output:

      Accuracy: 74.46%

We can see that the accuracy of the model on the test data is approximately 74 percent. The above technique is useful, but it has pitfalls. The result depends on a single split: a different random split can give a noticeably different accuracy, so the estimate may be too optimistic or too pessimistic about performance on new data. This problem can be reduced using resampling methods, which repeat the calculation multiple times on different subsets of the complete data. We discuss the popular cross-validation techniques in the following sections of the guide.
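The sensitivity of the holdout estimate to the split can be seen by repeating the split with different seeds. The sketch below is illustrative only: it assumes synthetic data from `make_classification` with the same shape as the diabetes features (768 rows, 8 columns) rather than the actual dataset, and the names `X_demo`, `y_demo`, and `holdout_scores` are made up for this example. `max_iter=1000` is also an assumption, set to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data, not the diabetes dataset from this guide.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

holdout_scores = []
for seed in range(5):
    # Same 70:30 ratio each time; only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.30, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    holdout_scores.append(clf.score(X_te, y_te))

for seed, score in enumerate(holdout_scores):
    print("seed=%d accuracy=%.2f%%" % (seed, score * 100.0))
spread = (max(holdout_scores) - min(holdout_scores)) * 100.0
print("spread between best and worst split: %.2f points" % spread)
```

The spread printed at the end is a rough indication of how much a single holdout number can move purely because of the split.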

K-fold Cross-Validation

In k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds with one fold held back for testing. This process is repeated so that each fold of the dataset gets the chance to be the held-back set. Once the process is completed, we can summarize the evaluation metric using the mean and/or the standard deviation.
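The fold bookkeeping described above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's code: the helper `kfold_indices` is hypothetical, and it is meant to mirror the documented behaviour of an unshuffled `KFold`, where the first `n_samples % n_splits` folds receive one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    # The first (n_samples % n_splits) folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=768, n_splits=10))
test_sizes = [len(test_idx) for _, test_idx in folds]
print(test_sizes)  # [77, 77, 77, 77, 77, 77, 77, 77, 76, 76]
```

Every observation lands in exactly one test fold, which is what distinguishes k-fold cross-validation from repeated random splitting.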

We will use 10-fold cross-validation for our problem statement. The first line of code uses the 'model_selection.KFold' class from 'scikit-learn' and creates 10 folds. The folds are not shuffled, so no 'random_state' is passed; recent scikit-learn versions raise an error if 'random_state' is set without 'shuffle=True'. The second line instantiates the LogisticRegression() model, while the third line fits the model on each training split and generates the cross-validation scores. The arguments 'x1' and 'y1' represent the predictor and the response arrays, respectively. The 'cv' argument takes the splitter that defines the cross-validation splits. The fourth line prints the mean accuracy result.

kfold = model_selection.KFold(n_splits=10)
model_kfold = LogisticRegression()
results_kfold = model_selection.cross_val_score(model_kfold, x1, y1, cv=kfold)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))
    

Output:

      Accuracy: 76.95%

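The mean accuracy for the model using k-fold cross-validation is 76.95 percent, which is higher than the 74 percent we obtained with the holdout validation approach. The model itself is unchanged; the two numbers are different estimates of its accuracy, and the cross-validated one averages over ten test folds instead of relying on a single split.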

Stratified K-fold Cross-Validation

The stratified k-fold approach is a variation of k-fold cross-validation that returns stratified folds, i.e., each fold contains approximately the same ratio of target labels as the complete data.
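The idea of stratification can be illustrated with a short plain-Python sketch. It deals the samples of each class out to the folds in round-robin fashion, which is a simplification chosen for this example and not the algorithm scikit-learn uses. The helper `stratified_folds` is hypothetical, and the class counts (268 positives, 500 negatives) are an assumption inferred from the summary table.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Assumption: class counts inferred from the summary table.
y = [1] * 268 + [0] * 500
folds = stratified_folds(y, n_splits=3)

shares = []
for i, fold in enumerate(folds):
    share = sum(y[idx] for idx in fold) / len(fold)
    shares.append(share)
    print("fold %d: size=%d, positive share=%.3f" % (i, len(fold), share))
```

Each fold ends up with a positive share close to the overall 35 percent, so no fold is evaluated on an unusually easy or hard class mix.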

The lines of code below repeat the steps as discussed above for k-fold cross-validation, except for a couple of changes. The first line creates stratified folds (three of them here) with 'StratifiedKFold' instead of 'KFold', and this splitter is then passed to the 'cv' argument in the third line of code. As before, no 'random_state' is passed because the folds are not shuffled.

skfold = StratifiedKFold(n_splits=3)
model_skfold = LogisticRegression()
results_skfold = model_selection.cross_val_score(model_skfold, x1, y1, cv=skfold)
print("Accuracy: %.2f%%" % (results_skfold.mean()*100.0))
    

Output:

      Accuracy: 76.96%

The mean accuracy for the model using stratified k-fold cross-validation is 76.96 percent.

Leave One Out Cross-Validation (LOOCV)

LOOCV is the cross-validation technique in which the size of each test fold is “1”, with “k” set to the number of observations in the data. Because it requires one model fit per observation, this variation is useful when the training data is of limited size and the number of parameters to be tested is not high.
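A minimal plain-Python sketch of the leave-one-out loop is shown below. The labels in `y_toy` are made up for illustration, the helper `leave_one_out` is hypothetical, and the "model" simply predicts the most frequent label in the training part, so this shows only the mechanics and not a realistic classifier.

```python
from collections import Counter


def leave_one_out(n_samples):
    """Yield (train_idx, test_idx) with exactly one test sample per split."""
    for i in range(n_samples):
        train_idx = [j for j in range(n_samples) if j != i]
        yield train_idx, [i]


# Toy labels, made up for illustration.
y_toy = [0, 0, 0, 0, 1, 1]

scores = []
for train_idx, test_idx in leave_one_out(len(y_toy)):
    # "Model": predict the most frequent label among the training samples.
    prediction = Counter(y_toy[j] for j in train_idx).most_common(1)[0][0]
    scores.append(1.0 if prediction == y_toy[test_idx[0]] else 0.0)

loocv_accuracy = sum(scores) / len(scores)
print(scores)  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print("LOOCV accuracy: %.2f%%" % (loocv_accuracy * 100.0))  # LOOCV accuracy: 66.67%
```

Each individual score is either 0 or 1, so the LOOCV accuracy is simply the share of observations predicted correctly when left out. With 768 observations, the loop would fit 768 models.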

The lines of code below repeat the steps as discussed above, except for a couple of changes. The first line creates the leave-one-out splitter instead of the k-fold one, and this splitter is then passed to the 'cv' argument in the third line of code.

loocv = model_selection.LeaveOneOut()
model_loocv = LogisticRegression()
results_loocv = model_selection.cross_val_score(model_loocv, x1, y1, cv=loocv)
print("Accuracy: %.2f%%" % (results_loocv.mean()*100.0))
    

Output:

      Accuracy: 76.82%

The mean accuracy for the model using the leave-one-out cross-validation is 76.82 percent.

Repeated Random Test-Train Splits

This technique is a hybrid of traditional train-test splitting and the k-fold cross-validation method. In this technique, we create random splits of the data in the training-test set manner and then repeat the process of splitting and evaluating the algorithm multiple times, just like the cross-validation method.
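The difference from k-fold can be sketched in plain Python. The helper `shuffle_split` below is hypothetical and only imitates the idea of repeated random splitting. Rounding the test size up with `math.ceil` is a choice made for this sketch.

```python
import math
import random


def shuffle_split(n_samples, n_splits, test_size, seed):
    """Yield (train_idx, test_idx) from independent random permutations."""
    rng = random.Random(seed)
    n_test = math.ceil(test_size * n_samples)
    for _ in range(n_splits):
        order = list(range(n_samples))
        rng.shuffle(order)
        yield order[n_test:], order[:n_test]


splits = list(shuffle_split(n_samples=768, n_splits=10, test_size=0.30, seed=100))

# Count how often each observation ends up in a test set.
test_counts = [0] * 768
for _, test_idx in splits:
    for idx in test_idx:
        test_counts[idx] += 1

print("test size per split:", len(splits[0][1]))
print("samples never tested:", test_counts.count(0))
print("samples tested more than once:", sum(c > 1 for c in test_counts))
```

Unlike k-fold, the test sets of different repetitions are drawn independently, so some observations are tested several times while others may never be tested.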

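The lines of code below repeat the steps as discussed above for the LOOCV method, except for a couple of changes. The first line creates 10 random 70:30 splits with 'ShuffleSplit', this splitter is passed to the 'cv' argument in the third line, and the fourth line prints the standard deviation of the accuracy in parentheses after the mean.

kfold2 = model_selection.ShuffleSplit(n_splits=10, test_size=0.30, random_state=100)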
model_shufflecv = LogisticRegression()
results_4 = model_selection.cross_val_score(model_shufflecv, x1, y1, cv=kfold2)
print("Accuracy: %.2f%% (%.2f%%)" % (results_4.mean()*100.0, results_4.std()*100.0))
    

Output:

      Accuracy: 74.76% (2.52%)

The mean accuracy for the model using the repeated random train-test split method is 74.76 percent, with a standard deviation of 2.52 percentage points across the 10 splits.

Conclusion

In this guide, you have learned about various model validation techniques using scikit-learn. The guide used the diabetes dataset and built a classification model to predict diabetes.

The mean accuracy result for the various techniques is summarised below:

Holdout Validation Approach: Accuracy of 74.46%
K-fold Cross-Validation: Mean Accuracy of 76.95%
Stratified K-fold Cross-Validation: Mean Accuracy of 76.96%
Leave One Out Cross-Validation: Mean Accuracy of 76.82%
Repeated Random Test-Train Splits: Mean Accuracy of 74.76%

We can conclude that cross-validation is a better model validation strategy than a single holdout split. It does not change the model, but it gives a more reliable estimate of its performance because the estimate is averaged over several test sets. The model itself can be further improved by doing exploratory data analysis, data pre-processing, feature engineering, or trying out other machine learning algorithms instead of the logistic regression algorithm we built in this guide.
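The cross-validation strategies from this guide can also be compared in a single loop. The sketch below is a hedged illustration: it assumes a small synthetic dataset from `make_classification` so that it runs on its own, so its numbers will not match the results reported above. The names `X_demo`, `y_demo`, `splitters`, and `summary` are made up for this example, and `max_iter=1000` is an assumption to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    ShuffleSplit,
    StratifiedKFold,
    cross_val_score,
)

# Assumption: synthetic data, used only so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

splitters = {
    "k-fold (10)": KFold(n_splits=10),
    "stratified k-fold (3)": StratifiedKFold(n_splits=3),
    "leave-one-out": LeaveOneOut(),
    "shuffle split (10 x 70:30)": ShuffleSplit(
        n_splits=10, test_size=0.30, random_state=100
    ),
}

summary = {}
for name, splitter in splitters.items():
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X_demo, y_demo, cv=splitter)
    # Store mean accuracy, standard deviation, and the number of model fits.
    summary[name] = (scores.mean(), scores.std(), len(scores))
    print(
        "%-28s mean=%.2f%% std=%.2f%% fits=%d"
        % (name, scores.mean() * 100.0, scores.std() * 100.0, len(scores))
    )
```

The number of fits makes the cost difference visible: leave-one-out trains one model per observation, while the other strategies train between 3 and 10.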
