 Deepika Singh

# Machine Learning with Neural Networks Using R

• Nov 21, 2019
• 4,957 Views

## Introduction

Neural networks are used to solve many challenging artificial intelligence problems. They often outperform traditional machine learning models because they can capture non-linear relationships and interactions between variables, and because their architecture can be customized. In this guide, you will learn the steps to build a neural network machine learning model using R.

## Data

The aim of this guide is to build a neural network model that predicts the approval status of loan applicants. We will use a fictitious dataset containing 600 observations and 10 variables, as described below:

1. `Marital_status`: Whether the applicant is married ("Yes") or not ("No")

2. `Is_graduate`: Whether the applicant is a graduate ("Yes") or not ("No")

3. `Income`: Annual income of the applicant (in USD)

4. `Loan_amount`: Loan amount (in USD) for which the application was submitted

5. `Credit_score`: Whether the applicant's credit score is satisfactory ("Satisfactory") or not

6. `approval_status`: Whether the loan application was approved ("Yes") or not ("No"). This is the target variable.

7. `Age`: The applicant's age in years

8. `Sex`: Whether the applicant is male ("M") or female ("F")

9. `Investment`: Total investment in stocks and mutual funds (in USD) as declared by the applicant

10. `Purpose`: Purpose of applying for the loan

### Evaluation Metric

We will evaluate the performance of the model using accuracy, which represents the percentage of cases correctly classified. Mathematically, for a binary classifier, it's represented as `accuracy = (TP+TN)/(TP+TN+FP+FN)`, where:

1. `True Positive, or TP`: cases with positive labels which have been correctly classified as positive.

2. `True Negative, or TN`: cases with negative labels which have been correctly classified as negative.

3. `False Positive, or FP`: cases with negative labels which have been incorrectly classified as positive.

4. `False Negative, or FN`: cases with positive labels which have been incorrectly classified as negative.
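As a quick check of the formula, the sketch below computes accuracy from the four counts. The helper name `accuracy_from_counts` is hypothetical, and the counts are taken from the test-set confusion matrix reported later in this guide, under the assumption that "Yes" (approved) is treated as the positive class.

```r
# Hypothetical helper: accuracy from the four confusion-matrix counts
accuracy_from_counts <- function(tp, tn, fp, fn) {
  (tp + tn) / (tp + tn + fp + fn)
}

# Counts from the test-set confusion matrix in this guide,
# assuming "Yes" (approved) is the positive class
acc <- accuracy_from_counts(tp = 115, tn = 42, fp = 15, fn = 8)
print(round(acc, 4))
```

Swapping which class counts as positive exchanges TP with TN and FP with FN, so the accuracy itself does not change.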

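We start by loading the required libraries and taking a first look at the data.

```r
library(plyr)
library(dplyr)
library(caret)
library(neuralnet)
library(nnet)

# The step that reads the dataset is not shown here; load the loan data
# into a data frame named `dat` before running the rest of the code.

glimpse(dat)
```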

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "...
```

The output shows that the dataset has four numerical variables (labeled as `int`) and six character variables (labeled as `chr`).

## Data Partitioning

We will build our model on the training set and evaluate its performance on the test set, an approach known as holdout validation. The first line of code below loads the `caTools` package, which is used for data partitioning, and the second line sets the random seed so that the results are reproducible. The next three lines create the training and test datasets, with `sample.split()` preserving the ratio of the `approval_status` classes in both. The training set contains 70 percent of the data (420 observations of 10 variables) and the test set contains the remaining 30 percent (180 observations of 10 variables).

```r
library(caTools)
set.seed(100)

# Split on the target so both sets keep the same class ratio
spl <- sample.split(dat$approval_status, SplitRatio = 0.7)
train <- subset(dat, spl == TRUE)
test <- subset(dat, spl == FALSE)

print(dim(train)); print(dim(test))
```

Output:

```
[1] 420  10
[1] 180  10
```
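To make the idea of a stratified 70/30 split concrete, here is a minimal base-R sketch that does not use `caTools`. The helper `stratified_split` is hypothetical and is not the implementation behind `sample.split()`. The labels are synthetic: 410 "Yes" and 190 "No", which are the class totals implied by the confusion matrices reported later in this guide.

```r
# Hypothetical helper: returns a logical vector marking training rows,
# sampling the same fraction of rows within each class of y
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(100)
# Synthetic labels with the class totals implied by this guide's results
y <- rep(c("Yes", "No"), times = c(410, 190))
spl <- stratified_split(y, ratio = 0.7)

print(table(spl))                   # test (FALSE) vs. training (TRUE) sizes
print(prop.table(table(y[spl])))    # class proportions in the training set
print(prop.table(table(y[!spl])))   # class proportions in the test set
```

Because the sampling is done within each class, the training and test sets end up with the same class proportions as the full data, which keeps the baseline comparable across both sets.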

## Build, Predict, and Evaluate the Model

The first line of code below uses the `trainControl()` function to control the parameters used for training the algorithm. The `method` argument specifies the resampling technique, which is repeated cross-validation in this case. The `number` argument specifies the number of folds or resampling iterations, which is ten here. The `repeats` argument specifies the number of complete sets of folds to compute, which is five here. You can look at the complete set of arguments with the `?trainControl` command.

The second line of code below trains the neural network using the `nnet` method. The predictors are passed as `train[,-6]`, which drops the sixth column (the target, `approval_status`). The `preProcess` argument standardizes the numerical variables by centering and scaling them.

```r
# 10-fold cross-validation, repeated 5 times
train_params <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Column 6 is the target (approval_status), so it is dropped from the predictors
nnet_model <- train(train[, -6], train$approval_status,
                    method = "nnet",
                    trControl = train_params,
                    preProcess = c("scale", "center"),
                    na.action = na.omit
)
```
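The effect of `preProcess = c("scale", "center")` can be illustrated in base R. The sketch below standardizes one numeric column by hand, using the first seven `Income` values visible in the `glimpse()` output. `caret` estimates the centering and scaling parameters on the training data and reuses them at prediction time, which the sketch mimics with a hypothetical new value.

```r
# First seven Income values shown in the glimpse() output
income_train <- c(30000, 30000, 30000, 30000, 89900, 133300, 136700)

# Estimate centering and scaling parameters on the training values only
mu <- mean(income_train)
sigma <- sd(income_train)

# Standardize: subtract the mean, then divide by the standard deviation
income_scaled <- (income_train - mu) / sigma
print(round(mean(income_scaled), 10))  # approximately 0
print(sd(income_scaled))               # 1, up to floating-point error

# A new observation is transformed with the *training* parameters
new_income <- 50000  # hypothetical value, for illustration only
print((new_income - mu) / sigma)
```

Standardizing matters for neural networks because inputs on very different scales (for example, `Age` in tens and `Income` in tens of thousands) can make the weight optimization slow or unstable.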

The algorithm training is complete, and the next step is model evaluation. Let's start by computing the baseline accuracy, which is the accuracy obtained by always predicting the majority class, using the code below. Since the majority class of the target variable ("Yes") has a proportion of 0.68, the baseline accuracy is 68 percent.

```r
prop.table(table(train$approval_status))  # Baseline accuracy
```

Output:

```
       No       Yes
0.3166667 0.6833333
```
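The baseline can also be extracted programmatically rather than read off the table. The sketch below is a minimal illustration that uses labels reconstructed from the proportions above; the counts of 133 "No" and 287 "Yes" are an assumption that reproduces the reported proportions for 420 training rows.

```r
# Training labels reconstructed from the reported proportions (assumption:
# 133 "No" and 287 "Yes", which gives 0.3167 and 0.6833 out of 420 rows)
y_train <- rep(c("No", "Yes"), times = c(133, 287))

class_props <- prop.table(table(y_train))
majority_class <- names(which.max(class_props))
baseline_accuracy <- max(class_props)

print(majority_class)
print(round(baseline_accuracy, 4))
```

With the real data, replacing `y_train` with `train$approval_status` should give the same two values.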

Let's now evaluate the model performance, which should be better than the baseline accuracy. We start with the training data, where the first line of code generates predictions on the training set. The second line of code creates the confusion matrix, and the third line prints the accuracy of the model on the training data using the confusion matrix. The training set accuracy comes out to 96 percent.

We then repeat this process on the test data, where the accuracy comes out to 87.2 percent.

```r
# Predictions on the training set
nnet_predictions_train <- predict(nnet_model, train)

# Confusion matrix on training data (outer parentheses print the table)
(cm_train <- table(train$approval_status, nnet_predictions_train))
# Accuracy: correctly classified cases (the diagonal) over all cases
sum(diag(cm_train)) / nrow(train)

# Predictions on the test set
nnet_predictions_test <- predict(nnet_model, test)

# Confusion matrix on test set
(cm_test <- table(test$approval_status, nnet_predictions_test))
sum(diag(cm_test)) / nrow(test)
```

Output:

```
# Confusion matrix and accuracy on train data

     nnet_predictions_train
       No Yes
  No  125   8
  Yes   9 278

[1] 0.9595238

# Confusion matrix and accuracy on test data

     nnet_predictions_test
       No Yes
  No   42  15
  Yes   8 115

[1] 0.8722222
```

## Conclusion

In this guide, you have learned how to build a neural network machine learning model in R using the `nnet` method through `caret`. The baseline accuracy for the data was 68 percent, while the accuracy on the training and test datasets was 96 percent and 87 percent, respectively. Overall, the neural network model performs well, beating the baseline accuracy by about 28 percentage points on the training set and 19 percentage points on the test set.