Machine Learning is one of the most sought-after disciplines in today’s Artificial Intelligence-driven world. But what is Machine Learning? In simple terms, it is the field of teaching machines and computers to learn from existing data and to make predictions on new, unseen data. There are three types of Machine Learning algorithms: Supervised, Unsupervised, and Reinforcement Learning.
In Supervised Learning, we have a target/outcome variable which is to be predicted from a given set of features/independent variables. The algorithm uses the set of features to generate a function that maps inputs to desired outputs. Training continues until the model achieves the desired level of accuracy on the training data; the trained model is then applied to new, unseen data. There are two types of supervised machine learning algorithms: Classification and Regression.
Classification models are models which predict a categorical label. Good examples of this are predicting whether a customer will churn or not, or whether a bank loan will default or not.
On the other hand, Regression models are models which predict a continuous label. The goal is to produce a model that represents the ‘best fit’ to some observed data, according to an evaluation criterion we choose. Good examples of this are predicting the price of the house, sales of a retail store, or life expectancy of an individual.
In this guide, we will focus on Classification.
Scikit-learn, or sklearn, is one of the most popular Python libraries for supervised machine learning. It integrates well with the SciPy stack, making it robust and powerful. Scikit-learn can be used for both classification and regression problems; this guide focuses on classification.
The first step in any machine learning process is understanding the Problem Statement and the Data before jumping into predictive modeling.
Diabetes is a serious health condition that causes elevated blood sugar. Many complications can occur if it remains undiagnosed and untreated. The aim of this guide is to build a classification model to detect diabetes. We will be using the diabetes dataset, which contains 768 observations and 9 variables, as described below:
age - Age in years
The classification algorithm used in this guide is Logistic Regression, as it is one of the most widely used classification algorithms.
We will evaluate the performance of the model using four metrics: accuracy, precision, recall, and F1 score. All four are built from the four possible outcomes of a binary prediction:
True Positive (TP) - cases with positive labels which have been correctly classified as positive.
True Negative (TN) - cases with negative labels which have been correctly classified as negative.
False Positive (FP) - cases with negative labels which have been incorrectly classified as positive.
False Negative (FN) - cases with positive labels which have been incorrectly classified as negative.
Accuracy - the fraction of all cases that are classified correctly. Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision - the fraction of correctly classified cases out of all cases classified with that label value. Precision = P = TP / (TP + FP)
Recall - the fraction of cases of a label value correctly classified out of all cases that actually have that label value. Recall = R = TP / (TP + FN)
F1 score - the harmonic mean of precision and recall. F1 = 2 * P * R / (P + R)
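As a quick check of the precision and recall formulas, the sketch below counts TP, FP, and FN by hand on a small set of made-up labels and compares the results with Scikit-learn's `precision_score` and `recall_score`. The label lists are hypothetical and exist only for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the outcomes that the two formulas need
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, actually 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, actually 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, actually 1

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

# The library functions should agree with the manual computation
precision_sklearn = precision_score(y_true, y_pred)
recall_sklearn = recall_score(y_true, y_pred)

print(precision_manual, precision_sklearn)
print(recall_manual, recall_sklearn)
```

By default both functions report the metric for the label `1`, which matches the "positive" class used in the definitions above.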
Following are the steps which are commonly followed while implementing classification with Scikit-learn.
Step 1 - Loading the required libraries and modules.
Step 2 - Loading the data and performing basic data checks.
Step 3 - Creating arrays for the features and the response variable.
Step 4 - Creating the Training and Test datasets.
Step 5 - Create and fit the classifier.
Step 6 - Predict on the test data and compute evaluation metrics.
The following sections will cover these steps.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline  # Jupyter/IPython magic; omit when running as a plain script

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
The first line of code reads in the data as a pandas DataFrame, while the second line prints its shape: 768 observations of 9 variables. The third line gives summary statistics of the numerical variables. Every variable has a 'count' of 768, equal to the number of records in the dataset, which means no values are recorded as missing. Note, however, that several variables (glucose, diastolic, triceps, insulin, and bmi) have a minimum of 0, which is not a plausible measurement and may indicate missing values encoded as zeros.
# Load data
df = pd.read_csv("diabetes.csv")
print(df.shape)
df.describe()
Output:
(768, 9)
| | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| count | 768 | 768 | 768 | 768 | 768 | 768 | 768 | 768 | 768 |
| mean | 3.8 | 120.9 | 69.1 | 20.5 | 79.8 | 32.0 | 0.5 | 33.2 | 0.3 |
| std | 3.4 | 32.0 | 19.4 | 16.0 | 115.2 | 7.9 | 0.3 | 11.8 | 0.5 |
| min | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 21 | 0 |
| 25% | 1 | 99 | 62 | 0 | 0 | 27.3 | 0.2 | 24 | 0 |
| 50% | 3 | 117 | 72 | 23 | 30.5 | 32 | 0.4 | 29 | 0 |
| 75% | 6 | 140.3 | 80 | 32 | 127.3 | 36.6 | 0.6 | 41 | 1 |
| max | 17 | 199 | 122 | 99 | 846 | 67.1 | 2.4 | 81 | 1 |
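The 'count' row only tells us about values stored as missing (NaN). Since the summary shows minimums of 0 for measurements such as glucose and bmi, it can also be worth counting zeros per column. The sketch below shows both checks on a tiny hypothetical frame; the rows are invented, and only the column names are borrowed from the table above.

```python
import pandas as pd

# Hypothetical miniature frame (invented rows, column names from the guide)
sample = pd.DataFrame({
    "glucose": [148, 85, 0, 89],
    "insulin": [0, 0, 0, 94],
    "bmi": [33.6, 26.6, 23.3, 0.0],
    "diabetes": [1, 0, 1, 0],
})

# Values stored as NaN, per column: this is what 'count' in describe() reflects
nan_counts = sample.isnull().sum()

# Zeros per measurement column: candidates for "missing recorded as 0".
# The target column is excluded because 0 is a legitimate label there.
measure_cols = ["glucose", "insulin", "bmi"]
zero_counts = (sample[measure_cols] == 0).sum()

print(nan_counts)
print(zero_counts)
```

On the real data, the same two expressions could be applied to `df` directly. Whether a zero is a true measurement or a placeholder is a judgment call that depends on the variable.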
The first line of code creates an array of the target variable, while the second line of code gives us the array of all the features after excluding the target variable 'diabetes'.
# Create arrays for the features and the response variable
y = df['diabetes'].values
X = df.drop('diabetes', axis=1).values
The first line of code splits the data into training and test sets, holding out 40% of the rows for testing (test_size = 0.4) and fixing random_state so the split is reproducible. The second line gives us the shape of the training set (460 observations of 8 variables) and the test set (308 observations of 8 variables).
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
X_train.shape, X_test.shape
Output:
((460, 8), (308, 8))
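The split sizes follow from the row count: 40% of 768 is 307.2, which is rounded up to 308 test rows, leaving 460 for training. The sketch below reproduces those shapes on placeholder data and also passes `stratify`, an optional `train_test_split` argument that keeps the class proportions similar in both subsets. The placeholder arrays and the class counts are assumptions for illustration, not the guide's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features with the same number of rows and columns as the guide's data
X_demo = np.zeros((768, 8))
# Hypothetical labels: one third positive (not the real class counts)
y_demo = np.array([0] * 512 + [1] * 256)

# Same test_size and random_state as the guide, plus a stratified split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (460, 8) (308, 8)
# Share of positive labels in each subset; stratification keeps them close
print(y_tr.mean(), y_te.mean())
```

Stratification is not used in the guide's own split; it is shown here only as an option worth considering when one class is much rarer than the other.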
The first line of code instantiates a LogisticRegression classifier called logreg, while the second line of code fits the classifier on the training set.
# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)
Output:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
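The printed parameters reflect the defaults of the Scikit-learn release used when this output was produced, including solver='liblinear'. More recent releases use a different default solver ('lbfgs'), so results may differ slightly from those shown here. One way to reduce that dependence is to set the relevant parameters explicitly and inspect them with `get_params()`. The sketch below does this on synthetic data from `make_classification`, which is a stand-in for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 768 rows and 8 features, like the guide's dataset
X_syn, y_syn = make_classification(n_samples=768, n_features=8, random_state=0)

# Pin the settings shown in the printed output instead of relying on defaults
logreg_demo = LogisticRegression(solver="liblinear", C=1.0, max_iter=100)
logreg_demo.fit(X_syn, y_syn)

# get_params() returns the configuration as a dictionary
params = logreg_demo.get_params()
print(params["solver"], params["C"], params["max_iter"])

# One coefficient per feature for a binary problem
print(logreg_demo.coef_.shape)
```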
The first line of code predicts the label on the test data, the second line prints the confusion matrix, while the third line prints the classification report.
# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
[[174  32]
 [ 36  66]]
             precision    recall  f1-score   support

          0       0.83      0.84      0.84       206
          1       0.67      0.65      0.66       102

avg / total       0.78      0.78      0.78       308
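We will now evaluate the model performance on the basis of the confusion matrix created above. In this matrix, rows are actual labels and columns are predicted labels, so TN = 174, FP = 32, FN = 36, and TP = 66. The results of the evaluation for the positive (diabetes) class are given below:
Accuracy - (TP + TN) / total = (66 + 174) / 308 ≈ 78%
Precision - TP / (TP + FP) = 66 / 98 ≈ 67%
Recall - TP / (TP + FN) = 66 / 102 ≈ 65%
F1 score - 2 * P * R / (P + R) ≈ 66%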
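The same figures can be recomputed directly from the four cells of the confusion matrix. The sketch below hard-codes the matrix printed above, so it runs without the dataset; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[174, 32],
               [36, 66]])

# For a 2x2 matrix, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.78 precision=0.67 recall=0.65 f1=0.66
```

The precision, recall, and F1 values match the row for label 1 in the classification report.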
In this guide, we have given a brief introduction to supervised machine learning and implemented one of the most popular classification algorithms, Logistic Regression, in Python using Scikit-learn. The guide used the diabetes dataset to build a classifier that predicts diabetes.
Our model achieves a decent accuracy of 78%. However, because of the imbalance in the data, the precision, recall, and F1 score for the positive (diabetes) class are only in the 65% to 67% range. The model can be further improved through cross-validation, feature analysis, and feature engineering and, of course, by trying out more advanced machine learning algorithms such as the Tree Family of Algorithms (Decision Tree and Random Forest) or Optimization Algorithms (Support Vector Machines and Neural Networks). However, that is beyond the scope of this guide, which is aimed at being a good starting point for individuals aspiring to start using Python’s Scikit-learn machine learning library for building classification algorithms.
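As a pointer for the cross-validation suggestion, a minimal sketch using Scikit-learn's `cross_val_score` is shown below. It runs on synthetic, mildly imbalanced data from `make_classification` rather than the diabetes dataset, and the choices of 5 folds, F1 scoring, and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with roughly 35% positive labels
X_syn, y_syn = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)

# Score the classifier on 5 different train/validation partitions
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_syn, y_syn, cv=5, scoring="f1"
)

# The spread across folds indicates how stable the estimate is
print(scores)
print(scores.mean(), scores.std())
```

Averaging over several folds gives a less split-dependent estimate than the single 60/40 split used in this guide.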