Deepika Singh

Predictive Analytics with PyTorch

  • Apr 1, 2020
  • 13 Min read

Introduction

PyTorch is an open-source machine learning library that is widely used for developing predictive models. Predictive modeling is the phase of analytics that uses statistical algorithms to predict outcomes. A predictive model takes data containing independent variables as inputs and uses machine learning algorithms to make predictions for the target variable. It is widely used by statisticians, data scientists, and machine learning professionals.

In this guide, you will learn the basics of building predictive models using PyTorch.

Data

In this guide, you'll use a fictitious dataset of loan applicants containing 600 observations and 8 variables, as described below:

  1. Is_graduate: Whether the applicant is a graduate ("1") or not ("0").

  2. Income: Annual income of the applicant (in USD).

  3. Loan_amount: Loan amount (in USD) for which the application was submitted.

  4. Credit_score: Whether the applicant's credit score is satisfactory ("1") or not ("0").

  5. approval_status: Whether the loan application was approved ("1") or not ("0").

  6. Age: The applicant's age in years.

  7. Sex: Whether the applicant is female ("1") or male ("0").

  8. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant.
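
The dataset file itself is not distributed with this guide. If you want to follow along without it, the sketch below generates a stand-in with the same eight columns. Everything about it is an assumption made for illustration: the function names, the value ranges, and the fact that every value (including approval_status) is drawn at random.

```python
import csv
import random

# Column names taken from the list above.
COLUMNS = [
    "Is_graduate", "Income", "Loan_amount", "Credit_score",
    "Age", "Sex", "Investment", "approval_status",
]
BINARY_COLUMNS = {"Is_graduate", "Credit_score", "Sex", "approval_status"}


def make_synthetic_loans(n_rows=600, seed=0):
    """Return a list of row dicts with random, purely illustrative values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        rows.append({
            "Is_graduate": rng.randint(0, 1),
            "Income": rng.randint(3000, 320000),       # USD, assumed range
            "Loan_amount": rng.randint(6000, 470000),  # USD, assumed range
            "Credit_score": rng.randint(0, 1),
            "Age": rng.randint(22, 76),                # years, assumed range
            "Sex": rng.randint(0, 1),
            "Investment": rng.randint(2000, 190000),   # USD, assumed range
            "approval_status": rng.randint(0, 1),
        })
    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file that pd.read_csv can load."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


rows = make_synthetic_loans()
```

Calling write_csv(rows, "data.csv") would create a file the rest of the guide can read. Because the target here is random, a model trained on this stand-in cannot learn a real relationship, so the accuracy figures reported later will not be reproduced.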

Let's start by loading the baseline libraries.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
```

After importing the libraries, the next step is to load the data and look at the basic statistics of the variables.

```python
df = pd.read_csv('data.csv')
print(df.shape)
df.describe()
```

Output:

(600, 8)


|       	| Is_graduate 	| Income        	| Loan_amount   	| Credit_score 	| Age        	| Sex        	| Investment    	| approval_status 	|
|-------	|-------------	|---------------	|---------------	|--------------	|------------	|------------	|---------------	|-----------------	|
| count 	| 600.000000  	| 600.000000    	| 600.000000    	| 600.000000   	| 600.000000 	| 600.000000 	| 600.000000    	| 600.000000      	|
| mean  	| 0.690000    	| 65861.466667  	| 145511.975833 	| 0.696667     	| 48.701667  	| 0.185000   	| 34417.668333  	| 0.683333        	|
| std   	| 0.462879    	| 48628.106723  	| 86728.364583  	| 0.460082     	| 14.778362  	| 0.388622   	| 29742.580389  	| 0.465564        	|
| min   	| 0.000000    	| 3000.000000   	| 6000.000000   	| 0.000000     	| 22.000000  	| 0.000000   	| 2100.000000   	| 0.000000        	|
| 25%   	| 0.000000    	| 38175.000000  	| 111232.500000 	| 0.000000     	| 35.000000  	| 0.000000   	| 16678.000000  	| 0.000000        	|
| 50%   	| 1.000000    	| 50080.000000  	| 134295.000000 	| 1.000000     	| 49.000000  	| 0.000000   	| 26439.000000  	| 1.000000        	|
| 75%   	| 1.000000    	| 76040.000000  	| 168715.000000 	| 1.000000     	| 61.000000  	| 0.000000   	| 35000.000000  	| 1.000000        	|
| max   	| 1.000000    	| 317370.000000 	| 466660.000000 	| 1.000000     	| 76.000000  	| 1.000000   	| 190422.000000 	| 1.000000        	|

Data Preparation

Before building the model, it is important to prepare the data. The lines of code below create lists of column names for the response variable and the features.

```python
target_column = ['approval_status']
# Every other column is a predictor. A set difference does not preserve column order.
predictors = list(set(df.columns) - set(target_column))

print(target_column)
print(predictors)
```

Output:

 ['approval_status']
 
 ['Sex', 'Credit_score', 'Age', 'Investment', 'Income', 'Loan_amount', 'Is_graduate']
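
Note that the predictors in the output above are not in the same order as the columns of the dataset. A set has no defined order, and for strings the order can change from one Python session to the next. If a stable column order matters, one alternative is a list comprehension, sketched below. The column list is hard-coded here only to keep the example self-contained; with the real data it would come from df.columns.

```python
# Column names as they appear in the summary table above; with the real
# data this list would come from df.columns instead.
columns = [
    "Is_graduate", "Income", "Loan_amount", "Credit_score",
    "Age", "Sex", "Investment", "approval_status",
]
target_column = ["approval_status"]

# A list comprehension keeps the original column order, unlike a set difference.
predictors = [name for name in columns if name not in target_column]
print(predictors)
```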

The next step is to create the train and test datasets. This is done using the code below. The last line of code prints the shape of the training set (420 observations of 7 variables) and test set (180 observations of 7 variables).

```python
X = df[predictors].values
y = df[target_column].values

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30)
print(X_train.shape); print(X_test.shape)
```

Output:

 (420, 7)
 
 (180, 7)
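
To see where the 420 and 180 come from, here is a rough plain-Python sketch of a holdout split: shuffle the row indices, then cut off 30 percent for testing. The holdout_split function is a hypothetical helper for illustration, not scikit-learn's implementation, and it will not select the same rows as train_test_split.

```python
import random


def holdout_split(n_rows, test_size=0.3, seed=30):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(600)
print(len(train_idx), len(test_idx))  # 420 180
```

Every row lands in exactly one of the two sets, so the model is never tested on observations it was trained on.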

Model Building

You have created the train and test sets and are ready to train the model. You'll start by importing the modules required to work with the PyTorch library.

```python
import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
```

You are all set to build the predictive model using the Artificial Neural Network (or ANN) algorithm. The basic architecture of an ANN consists of three main components.

  1. Input Layer: This is where the training observations are fed.

  2. Hidden Layers: These are the intermediate layers between the input and output layers. The model learns about the relationships involved in data in these layers.

  3. Output Layer: This is the layer where the final output is extracted from what’s happening in the previous layers.

The first step for creating the ANN model is to create a class, ANN, that inherits from the nn.Module class. The next step is to define the layers of the network using the __init__() method.

In this case, the model has four fully connected hidden layers followed by an output layer. The first argument to each nn.Linear layer is its input size, which is seven for the first layer (one input per predictor); for every later layer, it must match the output size of the layer before it. The output layer produces a single value, representing the target column. You'll also add a dropout layer to reduce overfitting.

Once you have defined the layers, define how they interact with each other in the forward(self, x) method, as shown below. This means you're building a fully connected, feed-forward neural network that goes from input to output in a forward manner. Each hidden layer is followed by the relu activation function, or Rectified Linear Unit.

For the output layer, you'll use the sigmoid function to convert the raw output into a probability between zero and one. Rounding that probability then gives the predicted class, one or zero.

```python
class ANN(nn.Module):
    def __init__(self, input_dim=7, output_dim=1):
        super(ANN, self).__init__()
        # Four fully connected hidden layers
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 32)
        # Output layer: one value per observation
        self.output_layer = nn.Linear(32, output_dim)
        self.dropout = nn.Dropout(0.15)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.dropout(x)  # active only in training mode
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.output_layer(x)

        # Squash the raw output into a probability between 0 and 1
        return torch.sigmoid(x)
```
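
The two activation functions used in the network are simple enough to write out by hand. The sketch below is plain Python for single numbers rather than PyTorch code, and the raw output values are made up. It shows how relu clips negatives to zero and how a sigmoid output is turned into a class.

```python
import math


def relu(z):
    """Rectified Linear Unit: negative inputs become zero."""
    return max(0.0, z)


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical raw values coming out of the output layer
for raw_output in (-2.0, 0.0, 3.0):
    probability = sigmoid(raw_output)
    predicted_class = 1 if probability > 0.5 else 0
    print(raw_output, round(probability, 3), predicted_class)
```

A raw output of zero maps to a probability of exactly 0.5, which is the boundary between the two classes.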

Now that you have defined the architecture of the model above, instantiate the model using the code below.

```python
model = ANN(input_dim=7, output_dim=1)

print(model)
```

Output:

ANN(
  (fc1): Linear(in_features=7, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=32, bias=True)
  (fc4): Linear(in_features=32, out_features=32, bias=True)
  (output_layer): Linear(in_features=32, out_features=1, bias=True)
  (dropout): Dropout(p=0.15, inplace=False)
)
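
The printed layer sizes are enough to work out how many trainable parameters the network has. Each linear layer holds one weight per input-output pair plus one bias per output, and dropout adds none. The arithmetic below is a plain-Python sketch, not a PyTorch call.

```python
# (in_features, out_features) for each Linear layer in the printed model
layer_sizes = [(7, 64), (64, 64), (64, 32), (32, 32), (32, 1)]

total = 0
for in_features, out_features in layer_sizes:
    # Weight matrix entries plus one bias per output unit
    n_params = in_features * out_features + out_features
    total += n_params
    print(f"{in_features:>2} -> {out_features:<2}: {n_params} parameters")

print("Total trainable parameters:", total)  # 7841
```

On the instantiated model, sum(p.numel() for p in model.parameters()) should report the same total.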

You've created the model, and now you need to make the data ready for the PyTorch library. The lines of code below convert the train and test NumPy arrays to tensors.

```python
# Convert the NumPy arrays to tensors; the targets are shaped as a single column.
X_train = torch.from_numpy(X_train)
y_train = torch.from_numpy(y_train).view(-1, 1)

X_test = torch.from_numpy(X_test)
y_test = torch.from_numpy(y_test).view(-1, 1)
```

The next step is to make this data iterable. In simple terms, this means that the data can be fed to the model in mini-batches (64 observations at a time here) rather than all at once. You'll use the torch.utils.data API provided by PyTorch to perform this task, as shown below.

```python
train = torch.utils.data.TensorDataset(X_train, y_train)
test = torch.utils.data.TensorDataset(X_test, y_test)

# Shuffle only the training batches; evaluation does not depend on batch order.
train_loader = torch.utils.data.DataLoader(train, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=64, shuffle=False)
```
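
Conceptually, a data loader with shuffling does little more than shuffle the row indices and hand them out in chunks. The sketch below shows that idea in plain Python; iterate_minibatches is a hypothetical helper, not part of PyTorch. With 420 training rows and a batch size of 64, the final batch is shorter than the rest, which is also what DataLoader produces with its default settings.

```python
import random


def iterate_minibatches(n_rows, batch_size=64, shuffle=True, seed=0):
    """Yield lists of row indices, one list per mini-batch."""
    indices = list(range(n_rows))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_rows, batch_size):
        yield indices[start:start + batch_size]


batch_sizes = [len(batch) for batch in iterate_minibatches(420)]
print(batch_sizes)  # [64, 64, 64, 64, 64, 64, 36]
```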

Model Evaluation

The fully connected ANN is ready for predictive modeling, and you've transformed the train and test arrays into the format required by PyTorch. The next step is to define how the model will be trained and evaluated. This is done by computing the loss, which measures the distance between the predicted probabilities and the actual labels. In this case, use binary cross-entropy loss through the nn.BCELoss() function. You also need to optimize the network, here with the stochastic gradient descent (SGD) optimizer. This is done using the lines of code below. The lr argument specifies the learning rate of the optimizer.

```python
import torch.optim as optim

loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, weight_decay=1e-6, momentum=0.8)
```
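
Binary cross-entropy penalizes a prediction more heavily the more confidently wrong it is. The sketch below writes out the formula in plain Python for a small batch, averaging over observations as nn.BCELoss does by default. The probabilities are made up, and the sketch assumes they lie strictly between zero and one.

```python
import math


def binary_cross_entropy(probabilities, labels):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    losses = []
    for p, y in zip(probabilities, labels):
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


# Hypothetical predictions for one approved (1) and one rejected (0) application
confident_and_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_and_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(round(confident_and_right, 4), round(confident_and_wrong, 4))
```

The second loss is many times larger than the first, even though the predictions are equally confident in both cases.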

After defining the loss function and optimizer, the next step is to train the model on the training data using the code below. Start by defining the number of epochs in the first line of code, while lines two to six create lists that'll keep track of loss and accuracy during each epoch. The code from line seven onwards trains the model, calculates loss and accuracy for each epoch, and prints the output.

```python
# lines 1 to 6
epochs = 2000
epoch_list = []
train_loss_list = []
val_loss_list = []
train_acc_list = []
val_acc_list = []

# lines 7 onwards
model.train()  # prepare model for training

for epoch in range(epochs):
    trainloss = 0.0
    correct = 0
    total = 0

    for data, target in train_loader:
        data = data.float()
        target = target.float()

        optimizer.zero_grad()
        output = model(data)

        # Round each probability to get one class prediction per observation
        predicted = torch.round(output.detach())
        total += len(target)
        correct += (predicted == target).sum().item()

        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        # loss is a batch mean, so weight it by the batch size
        trainloss += loss.item() * data.size(0)

    trainloss = trainloss / len(train_loader.dataset)
    accuracy = 100 * correct / float(total)
    train_acc_list.append(accuracy)
    train_loss_list.append(trainloss)
    print('Epoch: {} \tTraining Loss: {:.4f}\t Acc: {:.2f}%'.format(
        epoch + 1,
        trainloss,
        accuracy
        ))
    epoch_list.append(epoch + 1)
```

Output:

#Truncated output for sake of brevity

    Epoch: 1 	Training Loss: 10.3845	 Acc: 62.86%
    Epoch: 2 	Training Loss: 9.0788	 Acc: 67.14%
    Epoch: 3 	Training Loss: 9.0788	 Acc: 67.14%
    Epoch: 4 	Training Loss: 9.0788	 Acc: 67.14%
    Epoch: 5 	Training Loss: 9.0788	 Acc: 67.14%
    
    Epoch: 1996 	Training Loss: 9.0788	 Acc: 67.14%
    Epoch: 1997 	Training Loss: 9.0788	 Acc: 67.14%
    Epoch: 1998 	Training Loss: 9.0788	 Acc: 67.14%
    Epoch: 1999 	Training Loss: 9.0788	 Acc: 67.14%
    Epoch: 2000 	Training Loss: 9.0788	 Acc: 67.14%
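
The training loss reported for each epoch is an average over all training observations. Because the loss function returns a mean per batch and the last batch is smaller than the others, each batch loss has to be weighted by its batch size before averaging. The sketch below illustrates the difference with made-up batch losses; epoch_average is a hypothetical helper.

```python
def epoch_average(batch_losses, batch_sizes):
    """Average batch-mean losses, weighting each by its batch size."""
    weighted_sum = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return weighted_sum / sum(batch_sizes)


# Hypothetical batch losses for six full batches and one short batch of 36 rows
batch_sizes = [64, 64, 64, 64, 64, 64, 36]
batch_losses = [0.70, 0.68, 0.66, 0.65, 0.64, 0.63, 0.90]

weighted = epoch_average(batch_losses, batch_sizes)
unweighted = sum(batch_losses) / len(batch_losses)
print(round(weighted, 4), round(unweighted, 4))
```

The unweighted mean gives the short batch the same influence as a full one, so it overstates the loss in this example.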

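The output shows that the training accuracy is around 67 percent. The loss and accuracy do not change after the second epoch, which suggests that the model has stopped learning. You'll now evaluate the model's performance on the test data using the lines of code below.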

```python
correct = 0
total = 0
valloss = 0
model.eval()  # switch off dropout for evaluation

with torch.no_grad():
    for data, target in test_loader:
        data = data.float()
        target = target.float()

        output = model(data)
        loss = loss_fn(output, target)
        valloss += loss.item() * data.size(0)

        # Round each probability to get one class prediction per observation
        predicted = torch.round(output)
        total += len(target)
        correct += (predicted == target).sum().item()

valloss = valloss / len(test_loader.dataset)
accuracy = 100 * correct / float(total)
print('Test Accuracy: {:.2f}%'.format(accuracy))
```

Output:

 Test Accuracy: 71.11%
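
Accuracy is the share of observations whose rounded probability matches the actual label, with one prediction per observation. The sketch below shows the calculation in plain Python on made-up probabilities and labels; accuracy_percent is a hypothetical helper.

```python
def accuracy_percent(probabilities, labels, threshold=0.5):
    """Percentage of observations whose thresholded probability matches the label."""
    correct = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p > threshold else 0
        correct += int(predicted == y)
    return 100 * correct / len(labels)


# Hypothetical probabilities and labels for five applicants
probabilities = [0.91, 0.35, 0.78, 0.52, 0.08]
labels = [1, 0, 0, 1, 0]
print(accuracy_percent(probabilities, labels))  # 80.0
```

A useful sanity check is to compare the result with the share of the majority class. In this dataset, the mean of approval_status is about 0.68, so a model that approved every application would already score close to that figure.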

The above output shows that the test set accuracy is about 71 percent. You can further fine-tune the model, for example by adjusting the learning rate or the network architecture, to improve its performance.
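
One common adjustment that this guide does not apply is feature scaling. Income, Loan_amount, and Investment take values in the tens or hundreds of thousands, while most other predictors are zero or one, and neural networks often train more easily when inputs share a similar scale. The sketch below standardizes a single column in plain Python, using statistics computed on the training data only. The helper names and income values are hypothetical, and whether this improves the model above has not been tested here.

```python
import statistics


def fit_scaler(column):
    """Return the mean and standard deviation of a training column."""
    return statistics.mean(column), statistics.pstdev(column)


def standardize(column, mean, std):
    """Rescale values to zero mean and unit variance using training statistics."""
    return [(value - mean) / std for value in column]


# Hypothetical incomes in USD
train_income = [30000, 50000, 70000, 90000]
test_income = [60000, 120000]

mean, std = fit_scaler(train_income)
print(standardize(train_income, mean, std))
print(standardize(test_income, mean, std))
```

The test values are scaled with the training mean and standard deviation, so no information from the test set leaks into preprocessing.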
