Deepika Singh

Machine Learning with Neural Networks Using scikit-learn

  • Jun 6, 2019
  • 12 Min read
  • 4,114 Views
Data
scikit-learn

Introduction

Neural networks are used to solve many challenging artificial intelligence problems. They often outperform traditional machine learning models on such problems because they can model non-linear relationships and interactions between variables, and because their architecture is highly customizable. In this guide, we will learn how to build a neural network model using scikit-learn. Before we start, it helps to have a basic understanding of how a neural network works.

Neural Network

The building block of a neural network is the perceptron. In simple terms, the perceptron receives inputs, multiplies each one by a weight, sums the results together with a bias term, and passes that sum through an activation function (such as logistic, relu, tanh, or identity) to produce an output.
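The perceptron computation can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how scikit-learn implements it internally; the function name `perceptron` and the input, weight, and bias values are made up for the example.

```python
import numpy as np

def perceptron(inputs, weights, bias, activation="logistic"):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = np.dot(inputs, weights) + bias
    if activation == "logistic":
        return 1.0 / (1.0 + np.exp(-z))
    if activation == "relu":
        return np.maximum(0.0, z)
    if activation == "tanh":
        return np.tanh(z)
    if activation == "identity":
        return z
    raise ValueError(f"unknown activation: {activation}")

# Illustrative values only (not learned from any data)
x = np.array([0.5, 0.2, 0.1])
w = np.array([0.4, -0.6, 0.9])
b = 0.1

output = perceptron(x, w, b, activation="logistic")
print(output)
```

Here the weighted sum is 0.27, so the logistic activation returns a value slightly above 0.5.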

A neural network is built by stacking layers of these perceptrons, which gives a multi-layer perceptron model. Such a network has three types of layers: the input, hidden, and output layers. The input layer receives the data directly, whereas the output layer produces the required output. The layers in between are known as hidden layers, and this is where the intermediate computation takes place.
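To make the layer structure concrete, the following sketch pushes a small batch of inputs through three hidden layers and an output layer. The weights are random rather than trained, and the helper names (`relu`, `logistic`, `forward`) and the layer sizes are assumptions chosen for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass x through the hidden layers (relu) and the output layer (logistic)."""
    activation = x
    for weights, bias in layers[:-1]:
        activation = relu(activation @ weights + bias)
    out_weights, out_bias = layers[-1]
    return logistic(activation @ out_weights + out_bias)

rng = np.random.default_rng(0)
layer_sizes = [8, 8, 8, 8, 1]  # input, three hidden layers, output
layers = [
    (rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

batch = rng.random((4, 8))  # four observations with eight features each
probabilities = forward(batch, layers)
print(probabilities.shape)  # (4, 1)
```

Training a network means adjusting these weights and biases so that the outputs match the known labels, which is the part scikit-learn handles for us.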

A neural network algorithm can be used for both classification and regression problems. Before we start building the model, we will gain an understanding of the problem statement and the data.

Problem Statement

The aim of this guide is to build a classification model to detect diabetes. We will use the diabetes dataset, which contains 768 observations and 9 variables, as described below:

  1. pregnancies - Number of times pregnant.
  2. glucose - Plasma glucose concentration.
  3. diastolic - Diastolic blood pressure (mm Hg).
  4. triceps - Triceps skinfold thickness (mm).
  5. insulin - 2-hour serum insulin (mu U/ml).
  6. bmi - Body mass index (weight in kg/(height in m)^2).
  7. dpf - Diabetes pedigree function.
  8. age - Age in years.
  9. diabetes - “1” represents the presence of diabetes while “0” represents the absence of it. This is the target variable.

Evaluation Metric

We will evaluate the performance of the model using accuracy, which represents the percentage of cases correctly classified.

Mathematically, for a binary classifier, it's represented as accuracy = (TP+TN)/(TP+TN+FP+FN), where:

  • True Positives, or TP, are cases with positive labels which have been correctly classified as positive.
  • True Negatives, or TN, are cases with negative labels which have been correctly classified as negative.
  • False Positives, or FP, are cases with negative labels which have been incorrectly classified as positive.
  • False Negatives, or FN, are cases with positive labels which have been incorrectly classified as negative.
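As a quick worked example of the formula, the snippet below plugs in the four counts from the test-set confusion matrix reported later in this guide. The helper name `accuracy` is our own and is not part of scikit-learn.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts taken from the test-set confusion matrix shown in Step 5
test_accuracy = accuracy(tp=51, tn=123, fp=19, fn=38)
print(round(test_accuracy, 2))  # 0.75
```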

Steps

In this guide, we will follow these steps:

Step 1 - Loading the required libraries and modules.

Step 2 - Reading the data and performing basic data checks.

Step 3 - Creating arrays for the features and the response variable.

Step 4 - Creating the training and test datasets.

Step 5 - Building, predicting, and evaluating the neural network model.

The following sections will cover these steps.

Step 1 - Loading the Required Libraries and Modules

```python
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score
```

Step 2 - Reading the Data and Performing Basic Data Checks

The first line of code reads in the data as a pandas DataFrame, while the second line prints its shape: 768 observations of 9 variables. The third line gives the transposed summary statistics of the variables.

Looking at the summary for the 'diabetes' variable, we observe that the mean value is 0.35, which means that around 35 percent of the observations in the dataset have diabetes. A model that always predicts the majority class (no diabetes) would therefore be right about 65 percent of the time. This is the baseline accuracy, and our neural network model should beat it.

```python
df = pd.read_csv('diabetes.csv')
print(df.shape)
df.describe().transpose()
```

Output:

(768, 9)
    
|             | count | mean       | std        | min    | 25%      | 50%      | 75%       | max    |
|-------------|-------|------------|------------|--------|----------|----------|-----------|--------|
| pregnancies | 768.0 | 3.845052   | 3.369578   | 0.000  | 1.00000  | 3.0000   | 6.00000   | 17.00  |
| glucose     | 768.0 | 120.894531 | 31.972618  | 0.000  | 99.00000 | 117.0000 | 140.25000 | 199.00 |
| diastolic   | 768.0 | 69.105469  | 19.355807  | 0.000  | 62.00000 | 72.0000  | 80.00000  | 122.00 |
| triceps     | 768.0 | 20.536458  | 15.952218  | 0.000  | 0.00000  | 23.0000  | 32.00000  | 99.00  |
| insulin     | 768.0 | 79.799479  | 115.244002 | 0.000  | 0.00000  | 30.5000  | 127.25000 | 846.00 |
| bmi         | 768.0 | 31.992578  | 7.884160   | 0.000  | 27.30000 | 32.0000  | 36.60000  | 67.10  |
| dpf         | 768.0 | 0.471876   | 0.331329   | 0.078  | 0.24375  | 0.3725   | 0.62625   | 2.42   |
| age         | 768.0 | 33.240885  | 11.760232  | 21.000 | 24.00000 | 29.0000  | 41.00000  | 81.00  |
| diabetes    | 768.0 | 0.348958   | 0.476951   | 0.000  | 0.00000  | 0.0000   | 1.00000   | 1.00   |
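The baseline accuracy discussed above can also be checked with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The sketch below does not read the CSV file. Instead, it rebuilds a label vector with 268 positive and 500 negative cases, the counts implied by the mean of 0.348958 over 768 rows. The placeholder feature matrix is an assumption made only so that the estimator has something to fit on.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# 268 positive and 500 negative labels reproduce the mean of 0.348958 shown above
y = np.array([1] * 268 + [0] * 500)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(round(baseline_accuracy, 2))  # 0.65
```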

Step 3 - Creating Arrays for the Features and the Response Variable

The first line of code creates a list called 'target_column' that holds the name of the target variable. The second line builds the list of all the features, excluding the target variable 'diabetes', while the third line normalizes the predictors by dividing each one by its maximum value.

The fourth line displays the summary of the normalized data. We can see that all the independent variables have now been scaled between 0 and 1. The target variable remains unchanged.

```python
target_column = ['diabetes']
predictors = [col for col in df.columns if col not in target_column]  # keeps the original column order
df[predictors] = df[predictors] / df[predictors].max()
df.describe().transpose()
```

Output:

|             | count | mean     | std      | min      | 25%      | 50%      | 75%      | max |
|-------------|-------|----------|----------|----------|----------|----------|----------|-----|
| pregnancies | 768.0 | 0.226180 | 0.198210 | 0.000000 | 0.058824 | 0.176471 | 0.352941 | 1.0 |
| glucose     | 768.0 | 0.607510 | 0.160666 | 0.000000 | 0.497487 | 0.587940 | 0.704774 | 1.0 |
| diastolic   | 768.0 | 0.566438 | 0.158654 | 0.000000 | 0.508197 | 0.590164 | 0.655738 | 1.0 |
| triceps     | 768.0 | 0.207439 | 0.161134 | 0.000000 | 0.000000 | 0.232323 | 0.323232 | 1.0 |
| insulin     | 768.0 | 0.094326 | 0.136222 | 0.000000 | 0.000000 | 0.036052 | 0.150414 | 1.0 |
| bmi         | 768.0 | 0.476790 | 0.117499 | 0.000000 | 0.406855 | 0.476900 | 0.545455 | 1.0 |
| dpf         | 768.0 | 0.194990 | 0.136913 | 0.032231 | 0.100723 | 0.153926 | 0.258781 | 1.0 |
| age         | 768.0 | 0.410381 | 0.145188 | 0.259259 | 0.296296 | 0.358025 | 0.506173 | 1.0 |
| diabetes    | 768.0 | 0.348958 | 0.476951 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.0 |
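To see what the division by the column maximum does, here is a small sketch on a toy DataFrame (the values are invented for illustration and are not rows from the diabetes data). For non-negative data, the result matches scikit-learn's `MaxAbsScaler`, which is shown only as a possible alternative and is not what this guide uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# Toy data for illustration; not taken from the diabetes dataset
toy = pd.DataFrame({"glucose": [85.0, 120.0, 199.0], "age": [21.0, 33.0, 81.0]})

scaled_by_max = toy / toy.max()                        # approach used in this guide
scaled_by_sklearn = MaxAbsScaler().fit_transform(toy)  # equivalent for non-negative data

print(scaled_by_max.max().tolist())  # [1.0, 1.0]
print(np.allclose(scaled_by_max.values, scaled_by_sklearn))  # True
```

One practical difference is that a scaler object can be fitted on the training rows and then reused on new data, which may be preferable if the model is applied to observations it has not seen.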

Step 4 - Creating the Training and Test Datasets

The first two lines of code below create arrays of the independent (X) and dependent (y) variables, respectively. The third line splits the data into training and test datasets, and the fourth line prints the shapes of the training and test data.

```python
X = df[predictors].values
y = df[target_column].values.ravel()  # 1-D label array, as scikit-learn expects

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
```

Output:

(537, 8)
(231, 8)
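The shapes above follow from the 30 percent test size: 231 of the 768 rows go to the test set and the remaining 537 to the training set. The sketch below reproduces this on placeholder arrays with the same dimensions. It also passes `stratify=y`, which is an optional addition not used in this guide, to keep the share of positive cases similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape and class counts as the diabetes data
X = np.zeros((768, 8))
y = np.array([1] * 268 + [0] * 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40, stratify=y
)

print(X_train.shape, X_test.shape)  # (537, 8) (231, 8)
print(y_train.mean(), y_test.mean())  # both close to the overall share of positives
```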

Step 5 - Building, Predicting, and Evaluating the Neural Network Model

In this step, we will build the neural network model using the scikit-learn estimator 'MLPClassifier' (multi-layer perceptron classifier). The first line of code (shown below) imports it.

The second line instantiates the model with the 'hidden_layer_sizes' argument set to three hidden layers of eight neurons each, which matches the number of features in the dataset. We also select 'relu' as the activation function and 'adam' as the solver for weight optimization, and allow up to 500 training iterations through 'max_iter'. To learn more about 'relu' and 'adam', please refer to the Deep Learning with Keras guides, the links of which are given at the end of this guide.

The third line of code fits the model to the training data, while the fourth and fifth lines use the trained model to generate predictions on the training and test dataset, respectively.

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train,y_train)

predict_train = mlp.predict(X_train)
predict_test = mlp.predict(X_test)
```
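After fitting, the estimator exposes its learned structure, which is a useful way to confirm that 'hidden_layer_sizes' was interpreted as intended. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes features, and the names `X_demo`, `y_demo`, and `demo_mlp` are our own. The 'random_state' argument is added only to make this illustration repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the diabetes features: 8 inputs, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_mlp = MLPClassifier(hidden_layer_sizes=(8, 8, 8), activation='relu',
                         solver='adam', max_iter=500, random_state=0)
demo_mlp.fit(X_demo, y_demo)

# One weight matrix per connection between consecutive layers
weight_shapes = [w.shape for w in demo_mlp.coefs_]
print(weight_shapes)             # [(8, 8), (8, 8), (8, 8), (8, 1)]
print(demo_mlp.n_layers_)        # 5: input + 3 hidden + output
print(demo_mlp.out_activation_)  # logistic, since the target is binary
```

Depending on the data, the fit may emit a convergence warning when 'max_iter' is reached. This does not stop the code from running.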

Once the predictions are generated, we can evaluate the performance of the model. Since this is a classification problem, we first import the required functions, which is done in the first line of code below. The second and third lines print the confusion matrix and the classification report for the training data.

```python
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_train,predict_train))
print(classification_report(y_train,predict_train))
```

Output:

[[319  39]
 [ 78 101]]
                 precision    recall  f1-score   support
    
              0       0.80      0.89      0.85       358
              1       0.72      0.56      0.63       179
    
    avg / total       0.78      0.78      0.77       537

The above output shows the performance of the model on the training data. The accuracy and the F1 score are around 0.78 and 0.77, respectively. A perfect model would score 1 on both metrics, but that is next to impossible in real-world scenarios.
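The figures in the classification report can be traced back to the confusion matrix. As a worked check, the sketch below recomputes the accuracy and the precision, recall, and F1 score of class 1 from the four training counts printed above. In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels.

```python
import numpy as np

# Training confusion matrix from the output above
cm = np.array([[319, 39],
               [78, 101]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the cases predicted positive, how many were positive
recall = tp / (tp + fn)      # of the truly positive cases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.78 0.72 0.56 0.63
```

The low recall for class 1 shows that the model misses a sizeable share of the diabetic cases, which accuracy alone does not reveal.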

The next step is to evaluate the performance of the model on the test data, which is done with the lines of code below.

```python
print(confusion_matrix(y_test,predict_test))
print(classification_report(y_test,predict_test))
```

Output:

[[123  19]
 [ 38  51]]
                 precision    recall  f1-score   support
    
              0       0.76      0.87      0.81       142
              1       0.73      0.57      0.64        89
    
    avg / total       0.75      0.75      0.75       231

The above output shows the performance of the model on the test data. The accuracy and F1 score are both around 0.75.

Conclusion

In this guide, you have learned how to build a neural network model using scikit-learn. The guide used the diabetes dataset and built a classifier to predict the presence of diabetes.

Our model achieves a decent accuracy of 78 percent and 75 percent on the training and test data, respectively. This is higher than the baseline accuracy of 65 percent. The model can be further improved through cross-validation, feature engineering, or tuning the arguments of the neural network estimator.
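As a rough sketch of the cross-validation idea mentioned above, `cross_val_score` can estimate accuracy over several train/validation splits instead of a single one. The example runs on synthetic data from `make_classification` rather than the diabetes file, and the fold count of five is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 8 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(8, 8, 8), activation='relu',
                      solver='adam', max_iter=500, random_state=0)

# Five-fold cross-validation; each score is the accuracy on one held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a sense of how much the single train/test estimate might vary with a different split.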

Note that we have built a classification model in this guide. Building a regression model follows the same structure, with two adjustments. First, instead of the estimator 'MLPClassifier', we instantiate the estimator 'MLPRegressor'. Second, instead of using accuracy as the evaluation metric, we use the RMSE or R-squared value for model evaluation.
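A minimal sketch of the regression variant is shown below. It uses synthetic data from `make_regression`, since this guide does not include a regression dataset, and the variable names are our own. The network settings simply mirror the classifier above and are not tuned.

```python
from math import sqrt
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression data with 8 features (illustrative only)
X_reg, y_reg = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.30, random_state=40)

reg = MLPRegressor(hidden_layer_sizes=(8, 8, 8), activation='relu',
                   solver='adam', max_iter=500, random_state=0)
reg.fit(X_tr, y_tr)
pred_te = reg.predict(X_te)

# RMSE is the square root of the mean squared error; R-squared is at most 1
rmse = sqrt(mean_squared_error(y_te, pred_te))
r2 = r2_score(y_te, pred_te)
print(rmse, r2)
```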

To learn more about building machine learning models using scikit-learn, please refer to the following guides:

To learn more about building deep learning models using keras, please refer to the following guides:
