Scikit Machine Learning

Feb 19, 2019 • 10 Minute Read

Introduction

Machine Learning is one of the most sought-after disciplines in today’s Artificial Intelligence driven world. But what is Machine Learning? In simple terms, it is the field of teaching machines and computers to learn from existing data and to make predictions on the new unseen data. There are three types of Machine Learning Algorithms: Supervised, Unsupervised, and Reinforcement Learning.

In Supervised Learning, we have a target/outcome variable which is to be predicted from a given set of features/independent variables. The algorithm uses the set of features to generate a function that maps inputs to desired outputs. Training continues until the model achieves the desired level of accuracy on the training data; the trained model is then applied to new, unseen data. There are two types of supervised machine learning algorithms: Classification and Regression.

Classification and Regression

Classification models are models which predict a categorical label. Good examples of this are predicting whether a customer will churn or not, or whether a bank loan will default or not.

On the other hand, Regression models are models which predict a continuous label. The goal is to produce a model that represents the ‘best fit’ to some observed data, according to an evaluation criterion we choose. Good examples of this are predicting the price of a house, the sales of a retail store, or the life expectancy of an individual.

In this guide, we will focus on Classification.

Supervised Learning with Scikit-learn

Scikit-learn, or sklearn, is one of the most popular Python libraries for supervised machine learning. It integrates well with the SciPy stack, making it robust and powerful. Scikit-learn can be used for both classification and regression problems; however, this guide will focus on the classification problem.

Classification with Scikit-learn

The first step in any machine learning process is understanding the Problem Statement and the Data before jumping into predictive modeling.

Problem Statement

Diabetes is a serious health condition that causes elevated blood sugar. Many complications can occur if diabetes remains unidentified and untreated. The aim of this guide is to build a classification model to detect diabetes. We will be using the diabetes dataset, which contains 768 observations and 9 variables, as described below:

pregnancies - Number of times pregnant
glucose - Plasma glucose concentration
diastolic - diastolic blood pressure (mm Hg)
triceps - Skinfold thickness (mm)
insulin - 2-Hour serum insulin (mu U/ml)
bmi - Body mass index (weight in kg/(height in m)^2)
dpf - Diabetes pedigree function
age - Age in years
diabetes - 1 represents the presence of diabetes while 0 represents the absence of it. This is the target variable.

The classification algorithm used in this guide is Logistic Regression, as it is one of the most widely used classification algorithms.

Evaluation Metrics

We will evaluate the performance of the model using four metrics:

Accuracy - The fraction of all cases that are correctly classified. For a binary classifier, it is represented as Accuracy = (TP+TN)/(TP+TN+FP+FN), where:

True Positives (TP) are cases with positive labels which have been correctly classified as positive.
True Negatives (TN) are cases with negative labels which have been correctly classified as negative.
False Positives (FP) are cases with negative labels which have been incorrectly classified as positive.
False Negatives (FN) are cases with positive labels which have been incorrectly classified as negative.

Precision - The fraction of cases classified with a given label that actually have that label. It is represented as Precision = P = TP/(TP+FP)
Recall - The fraction of cases that actually have a given label that are classified with that label. It is represented as Recall = R = TP/(TP+FN)
F1-score - The harmonic mean of precision and recall. It is represented as F1 = 2*(P*R)/(P+R)
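
The four formulas above can be sketched as a small helper. This is a minimal illustration in plain Python; the function name `binary_metrics` and the counts passed to it are hypothetical and are not taken from the diabetes model built later in this guide.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.85 while recall is 0.80, which shows how the metrics can diverge even on a small example.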

Steps

The following steps are commonly followed when implementing classification with Scikit-learn.

Step 1 - Loading the required libraries and modules.

Step 2 - Loading the data and performing basic data checks.

Step 3 - Creating arrays for the features and the response variable.

Step 4 - Creating the Training and Test datasets.

Step 5 - Creating and fitting the classifier.

Step 6 - Predicting on the test data and computing evaluation metrics.

The following sections will cover these steps.

Step 1 - Loading the Required Libraries and Modules

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline

# Import necessary modules
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split  
from sklearn.metrics import confusion_matrix, classification_report
    

Step 2 - Loading the Data and Performing Basic Data Checks

The first line of code reads in the data as a pandas dataframe, while the second line prints its shape: 768 observations of 9 variables. The third line gives summary statistics of the numerical variables. Every variable has a 'count' of 768, which equals the number of records in the dataset, so there are no null values.

      # Load data
df = pd.read_csv("diabetes.csv")
print(df.shape)
df.describe()
    

Output:

      (768, 9)

       pregnancies  glucose  diastolic  triceps  insulin   bmi  dpf   age  diabetes
count          768      768        768      768      768   768  768   768       768
mean           3.8    120.9       69.1     20.5     79.8  32.0  0.5  33.2       0.3
std            3.4     32.0       19.4     16.0    115.2   7.9  0.3  11.8       0.5
min              0        0          0        0        0     0  0.1    21         0
25%              1       99         62        0        0  27.3  0.2    24         0
50%              3      117         72       23     30.5    32  0.4    29         0
75%              6    140.3         80       32    127.3  36.6  0.6    41         1
max             17      199        122       99      846  67.1  2.4    81         1
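
The 'count' row only rules out null values. The summary also shows a minimum of 0 for columns such as glucose and bmi, which might mean that missing measurements were recorded as zero; that is an assumption about the data, not something this guide states. A minimal sketch of both checks, using a hypothetical four-row frame rather than the real diabetes.csv:

```python
import pandas as pd

# Hypothetical miniature frame that mimics a few columns of the diabetes data
sample = pd.DataFrame({
    "glucose": [148, 85, 0, 89],
    "bmi": [33.6, 26.6, 23.3, 0.0],
    "diabetes": [1, 0, 1, 0],
})

# Explicit missing values (NaN) per column; there are none in this sample
null_counts = sample.isnull().sum()

# Zeros in measurement columns, which might be placeholders for missing data
measurement_cols = ["glucose", "bmi"]
zero_counts = (sample[measurement_cols] == 0).sum()

print(null_counts)
print(zero_counts)
```

The target column is left out of the zero check because 0 is a legitimate value there.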

Step 3 - Creating Arrays for the Features and the Response Variable

The first line of code creates an array of the target variable, while the second line of code gives us the array of all the features after excluding the target variable 'diabetes'.

      # Create arrays for the features and the response variable
y = df['diabetes'].values
X = df.drop('diabetes', axis=1).values
    

Step 4 - Creating the Training and Test Datasets

The first line of code splits the data into training and test sets, holding out 40% of the observations for testing, while the second line gives us the shape of the training set (460 observations of 8 variables) and the test set (308 observations of 8 variables).

      # Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42) 
X_train.shape, X_test.shape
    

      ((460, 8), (308, 8))
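
The 460/308 split can be reproduced by hand. The sketch below assumes that a fractional test_size is rounded up to a whole number of test rows and that the remaining rows go to training, which is consistent with the shapes printed above; `split_sizes` is a hypothetical helper, not part of Scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Approximate how a fractional test_size maps to row counts."""
    n_test = math.ceil(test_size * n_samples)  # test rows are rounded up
    n_train = n_samples - n_test               # remaining rows go to training
    return n_train, n_test


print(split_sizes(768, 0.4))  # (460, 308)
```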

Step 5 - Create and Fit the Classifier

The first line of code instantiates a LogisticRegression classifier called logreg, while the second line fits the classifier on the training set.

      # Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)
    

Output:

      LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
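
To give a feel for what the fitted classifier computes, here is a minimal sketch of a logistic regression prediction for one observation: a weighted sum of the features is passed through the sigmoid function and compared with a 0.5 threshold. The weights, intercept, and feature values are hypothetical, and only two features are used for brevity; they are not the coefficients learned by logreg.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, intercept):
    """Probability of the positive class for one observation."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients for two features (not the fitted values)
weights = [0.03, 0.08]
intercept = -6.0

patient = [150, 35.0]  # hypothetical glucose and bmi values
p = predict_proba(patient, weights, intercept)
label = 1 if p > 0.5 else 0  # predict diabetes when the probability exceeds 0.5
print(round(p, 3), label)
```

Fitting the model amounts to choosing the weights and intercept that make these probabilities agree with the training labels as closely as possible, subject to the regularization settings shown in the output above.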
    

Step 6 - Predict on the Test Data and Compute Evaluation Metrics

The first line of code predicts the label on the test data, the second line prints the confusion matrix, while the third line prints the classification report.

      # Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
    

Output:

      [[174  32]
       [ 36  66]]
                 precision    recall  f1-score   support
    
              0       0.83      0.84      0.84       206
              1       0.67      0.65      0.66       102
    
    avg / total       0.78      0.78      0.78       308
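
The layout of the confusion matrix can be illustrated by counting label pairs by hand. In the sketch below, rows are actual classes and columns are predicted classes, so the result reads [[TN, FP], [FN, TP]], the same orientation as the matrix printed above. The labels are hypothetical and `confusion_counts` is an illustrative helper, not a Scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels coded as 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # row = actual, column = predicted
    return matrix


# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # [[3, 1], [1, 3]]
```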
    

Evaluation of the Model Performance

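We will now evaluate the model performance on the basis of the confusion matrix created above, treating class 1 (diabetes present) as the positive class, so that TN = 174, FP = 32, FN = 36, and TP = 66. The results of the evaluation are given below: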

Accuracy = (174+66)/(174+66+32+36) = 78%
Precision = 66/(66+32) = 67%
Recall = 66/ (66+36) = 65%
F1 Score = 2*(0.65*0.67)/(0.65+0.67) = 66%
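
The same arithmetic can be repeated for both classes to reproduce the rows of the classification report. The sketch below starts from the confusion matrix printed in Step 6 and assumes that the "avg / total" row is a support-weighted average of the per-class values; `class_report` is a hypothetical helper written for this illustration.

```python
# Confusion matrix from Step 6: rows are actual classes, columns are predicted
cm = [[174, 32],
      [36, 66]]


def class_report(cm, label):
    """Precision, recall, F1 and support for one class of a 2x2 matrix."""
    other = 1 - label
    tp = cm[label][label]
    fp = cm[other][label]  # other class predicted as `label`
    fn = cm[label][other]  # `label` predicted as the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, tp + fn


rows = {label: class_report(cm, label) for label in (0, 1)}
total = sum(row[3] for row in rows.values())

# Support-weighted averages, assumed to correspond to the "avg / total" row
weighted = [sum(row[i] * row[3] for row in rows.values()) / total
            for i in range(3)]

for label, (p, r, f1, support) in rows.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)
print("avg / total", [round(v, 2) for v in weighted], total)
```

Rounded to two decimals, the per-class values and the weighted averages agree with the report shown in Step 6.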

Conclusion

In this guide, we have given you a brief introduction to supervised machine learning and to implementing one of the most popular classification algorithms, Logistic Regression, in Python using Scikit-learn. The guide used the diabetes dataset to build a classifier that predicts the presence of diabetes.

Our model achieves a decent accuracy of 78%. However, because of the imbalance in the data, the Precision, Recall, and F1 Score values for the positive class are in the 65% to 67% range. The model can be further improved by doing cross-validation, feature analysis, and feature engineering and, of course, by trying out more advanced machine learning algorithms such as the Tree Family of Algorithms (Decision Tree and Random Forest) or Optimization Algorithms (Support Vector Machines and Neural Networks). However, that is beyond the scope of this guide, which aims to be a good starting point for individuals who want to start using Python’s Scikit-learn machine learning library to build classification models.
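
As a pointer for the cross-validation suggestion, the following is a minimal sketch using Scikit-learn's cross_val_score. It runs on synthetic data generated with make_classification as a stand-in for the diabetes dataset, so the scores it prints say nothing about the model in this guide; the class weights, max_iter value, and choice of F1 scoring are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the diabetes data: 768 rows, 8 features,
# with the positive class in the minority
X, y = make_classification(n_samples=768, n_features=8,
                           weights=[0.65, 0.35], random_state=42)

# max_iter is raised as a precaution against convergence warnings (an assumption)
logreg = LogisticRegression(max_iter=1000)

# 5-fold cross-validation, scoring each fold by F1 on the positive class
scores = cross_val_score(logreg, X, y, cv=5, scoring="f1")
print(scores)
print(scores.mean())
```

Averaging over several folds gives a steadier estimate than a single train/test split, which is useful when comparing logistic regression against the other algorithms mentioned above.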