Deepika Singh

Machine Learning with Neural Networks Using R

  • Nov 21, 2019
  • 9 Min read
  • 24 Views
Data
R

Introduction

Neural networks are used to solve many challenging artificial intelligence problems. They often outperform traditional machine learning models because they have the advantages of non-linearity, variable interactions, and customization. In this guide, you will learn the steps to build a neural network machine learning model using R.

Data

The aim of this guide is to build a neural network model to predict approval status of loan applicants. We will use a fictitious dataset containing 600 observations and 10 variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No")

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  3. Income: Annual income of the applicant (in USD)

  4. Loan_amount: Loan amount (in USD) for which the application was submitted

  5. Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not

  6. approval_status: Whether the loan application was approved ("Yes") or not ("No")

  7. Age: The applicant's age in years

  8. Sex: Whether the applicant is a male ("M") or a female ("F")

  9. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

  10. Purpose: Purpose of applying for the loan

Evaluation Metric

We will evaluate the performance of the model using accuracy, which represents the percentage of cases correctly classified. Mathematically, for a binary classifier, it's represented as accuracy = (TP+TN)/(TP+TN+FP+FN), where:

  1. True Positive, or TP: cases with positive labels which have been correctly classified as positive.

  2. True Negative, or TN: cases with negative labels which have been correctly classified as negative.

  3. False Positive, or FP: cases with negative labels which have been incorrectly classified as positive.

  4. False Negative, or FN: cases with positive labels which have been incorrectly classified as negative.
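As a quick illustration of the formula above, here is a minimal sketch in R. The function name `accuracy` and the four counts are hypothetical and chosen only for illustration; they are not taken from the loan dataset.

```r
# Accuracy from the four confusion-matrix counts
accuracy <- function(tp, tn, fp, fn) {
  (tp + tn) / (tp + tn + fp + fn)
}

# Hypothetical counts, for illustration only
acc <- accuracy(tp = 50, tn = 30, fp = 10, fn = 10)
print(acc)  # (50 + 30) / 100 = 0.8
```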

Let's start by loading the required libraries and the data.

```r
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(neuralnet)
library(nnet)

dat <- read_csv("data_2.csv")

glimpse(dat)
```

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "...
```

The output shows that the dataset has four numerical variables (labeled as int) and six character variables (labeled as chr).
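If you prefer to count the column types programmatically rather than read them off the printout, a base R sketch along these lines could work. The `toy` data frame is a hypothetical stand-in for `dat` with only four columns and made-up rows.

```r
# Toy data frame (hypothetical rows) mimicking a few columns of the dataset
toy <- data.frame(
  Income  = c(30000L, 89900L, 133300L),
  Age     = c(25L, 29L, 27L),
  Sex     = c("F", "M", "M"),
  Purpose = c("Education", "Travel", "Others"),
  stringsAsFactors = FALSE
)

# Class of each column, then a count per class
col_types <- vapply(toy, function(col) class(col)[1], character(1))
type_counts <- table(col_types)
print(type_counts)
```

Applied to the full dataset, the same two lines would be expected to report four integer and six character columns.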

Data Partitioning

We will build our model on the training set and evaluate its performance on the test set, an approach known as holdout validation. The first line of code below loads the caTools package, which is used for data partitioning, and the second line sets the random seed so that the results are reproducible. The next three lines create the training and test datasets, and the last line prints their dimensions. The training set contains 70 percent of the data (420 observations of 10 variables) and the test set contains the remaining 30 percent (180 observations of 10 variables).

```r
library(caTools)
set.seed(100)

spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl==TRUE)
test = subset(dat, spl==FALSE)

print(dim(train)); print(dim(test))
```

Output:

```
[1] 420  10
[1] 180  10
```
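Because the split is driven by the target column, the class proportions stay roughly the same in both partitions. The base R sketch below shows the general idea of such a stratified 70/30 split; it is not the caTools implementation, and the `labels` vector is a hypothetical stand-in for `dat$approval_status`.

```r
set.seed(100)

# Hypothetical labels: 68 "Yes" and 32 "No"
labels <- c(rep("Yes", 68), rep("No", 32))

# Sample 70 percent of the row indices within each class separately
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class proportions are close to the original 0.68 / 0.32 in both parts
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```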

Build, Predict, and Evaluate the Model

The first line of code below uses the trainControl() function to control the parameters used for training the algorithm. The method argument specifies the resampling technique; in this case, we use repeated cross-validation. The number argument specifies the number of folds, which is ten in this case. The repeats argument specifies the number of complete sets of folds to compute, which is five in this case. We can have a look at the complete set of arguments with the ?trainControl command.

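The second line of code below trains the neural network using the nnet method. The first two arguments are the predictors (every column except the sixth, approval_status) and the target variable. The preProcess argument standardizes the numerical variables by centering and scaling them.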

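```r
# Repeated cross-validation: 10 folds, repeated 5 times
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5)

# Predictors: all columns except the 6th (approval_status); target: approval_status
nnet_model <- train(train[,-6], train$approval_status,
                 method = "nnet",
                 trControl= train_params,
                 preProcess=c("scale","center"),
                 na.action = na.omit
)
```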
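To make the resampling settings concrete: ten folds repeated five times yields 50 resamples, each holding out one tenth of the training rows. The base R sketch below builds such index sets by hand. It is a simplification for illustration, not caret's internal code, and the resample names are made up.

```r
set.seed(100)

n       <- 420  # number of training rows
k       <- 10   # folds per repeat
repeats <- 5    # complete sets of folds

resamples <- list()
for (r in seq_len(repeats)) {
  # Randomly assign each row to one of k folds of equal size
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    # Rows used for fitting when fold f is held out
    resamples[[paste0("Fold", f, ".Rep", r)]] <- which(fold_id != f)
  }
}

print(length(resamples))           # 10 folds x 5 repeats = 50
print(unique(lengths(resamples)))  # 420 - 42 = 378 rows per fit
```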

The algorithm training is complete, and the next step is model evaluation. Let's start by computing the baseline accuracy, which is the accuracy obtained by always predicting the majority class, using the code below. Since the majority class of the target variable has a proportion of 0.68, the baseline accuracy is 68 percent.

```r
prop.table(table(train$approval_status))   #Baseline Accuracy
```

Output:

```
       No       Yes 
0.3166667 0.6833333
```
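The baseline can also be extracted as a single number by taking the largest class proportion. The sketch below rebuilds a label vector with the same class counts as the training set (133 "No" and 287 "Yes", which follow from the proportions above and the 420 training rows); the variable names are hypothetical.

```r
# Stand-in for train$approval_status with the same class counts
y_train <- c(rep("No", 133), rep("Yes", 287))

class_props <- prop.table(table(y_train))

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy <- max(class_props)
majority_class    <- names(which.max(class_props))

print(majority_class)     # "Yes"
print(baseline_accuracy)  # 287 / 420
```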

Let's now evaluate the model performance, which should be better than the baseline accuracy. We start with the training data, where the first line of code generates predictions on the train set. The second line of code creates the confusion matrix, and the third line prints the accuracy of the model on the training data using the confusion matrix. The training data set accuracy comes out to 96 percent.

We'll repeat this process on the test data, and the accuracy will come out to 87.2 percent.

```r
# Predictions on the training set
nnet_predictions_train <- predict(nnet_model, train)

# Confusion matrix on training data
table(train$approval_status, nnet_predictions_train)
# Accuracy: correct predictions (diagonal of the table) / number of rows
(278+125)/nrow(train)

# Predictions on the test set
nnet_predictions_test <- predict(nnet_model, test)

# Confusion matrix on test set
table(test$approval_status, nnet_predictions_test)
# Accuracy: 42 + 115 = 157 correct predictions / number of rows
157/nrow(test)
```

Output:

```
# Confusion matrix and accuracy on train data
    nnet_predictions_train
       No Yes
  No  125   8
  Yes   9 278

[1] 0.9595238

# Confusion matrix and accuracy on test data
    nnet_predictions_test
       No Yes
  No   42  15
  Yes   8 115

[1] 0.8722222
```
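The counts typed by hand in the accuracy calculations depend on the particular seed and split. A more general option is to compute accuracy directly from the confusion matrix, as in the sketch below. It assumes the matrix is square with actual and predicted classes in the same order, and the helper name `accuracy_from_cm` is hypothetical. The matrix reproduces the test-set counts shown above.

```r
# Test-set confusion matrix from the output above (rows: actual, columns: predicted)
cm_test <- matrix(c(42, 8, 15, 115), nrow = 2,
                  dimnames = list(actual = c("No", "Yes"),
                                  predicted = c("No", "Yes")))

# Correct predictions sit on the diagonal
accuracy_from_cm <- function(cm) sum(diag(cm)) / sum(cm)

acc_test <- accuracy_from_cm(cm_test)
print(acc_test)  # 157 / 180
```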

Conclusion

In this guide, you have learned how to build a neural network machine learning model in R using the nnet method. The baseline accuracy for the data was 68 percent, while the accuracy on the training and test datasets was 96 percent and 87 percent, respectively. Overall, the neural network model performs well, beating the baseline accuracy by 28 percentage points on the training set and 19 percentage points on the test set.
