 Deepika Singh

# Building Classification Models in R

• Nov 18, 2019
• 2,026 Views

## Introduction

Building classification models is one of the most important data science use cases. Classification models are models that predict a categorical label. A few examples of this include predicting whether a customer will churn or whether a bank loan will default. In this guide, you will learn how to build and evaluate a classification model in R. We will train the logistic regression algorithm, which is one of the oldest yet most powerful classification algorithms.

## Data

In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:

1. `Marital_status`: Whether the applicant is married ("Yes") or not ("No")

2. `Is_graduate`: Whether the applicant is a graduate ("Yes") or not ("No")

3. `Income`: Annual income of the applicant (in USD)

4. `Loan_amount`: Loan amount (in USD) for which the application was submitted

5. `Credit_score`: Whether the applicant's credit score is satisfactory ("Satisfactory") or not

6. `approval_status`: Whether the loan application was approved ("Yes") or not ("No")

7. `Age`: The applicant's age in years

8. `Sex`: Whether the applicant is a male ("M") or a female ("F")

9. `Investment`: Total investment in stocks and mutual funds (in USD) as declared by the applicant

10. `Purpose`: Purpose of applying for the loan
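The guide does not show how the data is read into R. As a minimal sketch, assuming the data sits in a CSV file, the snippet below writes three rows to a temporary file and reads them back with `read.csv()`. The file and the name `dat_demo` are hypothetical; the three rows are the first observations printed by `glimpse()` later in this section.

```r
# Sketch only: the CSV file is a temporary stand-in, and the three rows are
# the first observations shown in the glimpse() output of this guide.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Marital_status,Is_graduate,Income,Loan_amount,Credit_score,approval_status,Age,Sex,Investment,Purpose",
  "Yes,No,30000,60000,Satisfactory,Yes,25,F,21000,Education",
  "No,Yes,30000,90000,Satisfactory,Yes,29,F,21000,Travel",
  "Yes,Yes,30000,90000,Satisfactory,No,27,M,21000,Others"
), csv_path)

# Keep text columns as character so they can be converted to factors later
dat_demo <- read.csv(csv_path, stringsAsFactors = FALSE)
str(dat_demo)
```

Setting `stringsAsFactors = FALSE` explicitly keeps the behavior the same across R versions, since the default changed in R 4.0.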

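The lines of code below load the required libraries and print the structure of the data.

```r
library(plyr)
library(dplyr)
library(caret)

# `dat` is assumed to already hold the loan dataset; loading it is not shown
glimpse(dat)
```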

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satisfactory", ...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No"...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M", "F", "F",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "Travel", "...
```

The output shows that the dataset has four numerical variables (labeled as `int`) and six character variables (labeled as `chr`). We will convert the six character variables into factor variables using the lines of code below, which select them by column position.

```r
# Column positions of the six character variables
factor_cols <- c(1, 2, 5, 6, 8, 10)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory, Satisfac...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, Yes, No, Y...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, M, M, M, M...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Travel, Educa...
```
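Selecting columns by position breaks silently if the column order changes. As a hedged alternative, the sketch below picks the character columns by type instead. The data frame `dat_demo` is a small made-up stand-in, not the guide's dataset.

```r
# Toy data frame standing in for `dat` (values are illustrative only)
dat_demo <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Income         = c(30000L, 30000L, 89900L),
  Sex            = c("F", "F", "M"),
  stringsAsFactors = FALSE
)

# Find character columns by type rather than by position
is_chr <- vapply(dat_demo, is.character, logical(1))

# Convert only those columns to factors; numeric columns are left untouched
dat_demo[is_chr] <- lapply(dat_demo[is_chr], factor)

str(dat_demo)
```

The same two conversion lines should work on the full dataset, since they rely only on the column types.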

## Data Partitioning

We will build our model on the training dataset and evaluate its performance on the test dataset. This is called the holdout-validation approach to evaluating model performance.

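The first line of code below sets the random seed for reproducibility of results. The second line loads the `caTools` package that will be used for data partitioning, while the next three lines create the training and test datasets. Because `sample.split()` is given the `approval_status` labels, it keeps the ratio of approved to rejected applications roughly the same in both partitions. The training dataset contains 70 percent of the data (420 observations of 10 variables) while the test dataset contains the remaining 30 percent (180 observations of 10 variables). The last line prints the dimensions of both datasets.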

```r
set.seed(100)
library(caTools)

spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl==TRUE)
test = subset(dat, spl==FALSE)

print(dim(train)); print(dim(test))
```

Output:

```
[1] 420  10

[1] 180  10
```
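To make the idea of a label-preserving split concrete, here is a minimal base R sketch that samples 70 percent of the rows within each class. It is an illustration of the technique, not the internals of `sample.split()`. The data frame `toy` is made up; its class counts (190 "No", 410 "Yes") are the totals implied by the confusion matrices in this guide.

```r
set.seed(100)

# Made-up stand-in with the same class counts as the guide's data
toy <- data.frame(
  id = 1:600,
  approval_status = factor(rep(c("No", "Yes"), times = c(190, 410)))
)

# Sample 70 percent of the row indices separately within each class
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$approval_status),
  function(idx) idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
), use.names = FALSE)

train_demo <- toy[train_idx, ]
test_demo  <- toy[-train_idx, ]

# Class proportions should be nearly identical in both partitions
prop.table(table(train_demo$approval_status))
prop.table(table(test_demo$approval_status))
```

Sampling within each class means neither partition ends up with an unusually high or low share of approvals, which matters most for small or imbalanced datasets.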

## Build, Predict and Evaluate the Model

To fit the logistic regression model, we call the `glm()` function with `family="binomial"`, as shown in the first line of code below. The formula `approval_status ~ .` uses `approval_status` as the target and all remaining variables as predictors. The second line prints the summary of the trained model.

```r
model_glm = glm(approval_status ~ . , family="binomial", data = train)
summary(model_glm)
```

Output:

```
Call:
glm(formula = approval_status ~ ., family = "binomial", data = train)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.19539  -0.00004   0.00004   0.00008   2.47763

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)               6.238e-02  9.052e+03   0.000   1.0000
Marital_statusYes         4.757e-01  4.682e-01   1.016   0.3096
Income                    2.244e-06  1.018e-06   2.204   0.0275 *
Loan_amount              -3.081e-07  3.550e-07  -0.868   0.3854
Credit_scoreSatisfactory  2.364e+01  8.839e+03   0.003   0.9979
Age                      -7.985e-02  1.360e-02  -5.870 4.35e-09 ***
SexM                     -5.879e-01  6.482e-01  -0.907   0.3644
Investment               -2.595e-06  1.476e-06  -1.758   0.0787 .
PurposeHome               2.599e+00  9.052e+03   0.000   0.9998
PurposeOthers            -4.172e+01  3.039e+03  -0.014   0.9890
PurposePersonal           1.577e+00  2.503e+03   0.001   0.9995
PurposeTravel            -1.986e+01  1.954e+03  -0.010   0.9919
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 524.44  on 419  degrees of freedom
Residual deviance: 166.96  on 407  degrees of freedom
AIC: 192.96

Number of Fisher Scoring iterations: 19
```
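The estimates in the summary are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate each coefficient into an odds ratio. The sketch below shows this on R's built-in `mtcars` data, used here only as a stand-in because the loan dataset is not available; `fit_demo` and `odds_ratios` are illustrative names.

```r
# Built-in mtcars data used as a stand-in; `am` (0/1) is the binary outcome
fit_demo <- glm(am ~ wt, family = "binomial", data = mtcars)

# Coefficients are log-odds; exponentiating gives odds ratios
odds_ratios <- exp(coef(fit_demo))
print(odds_ratios)
```

Applied to the summary above, the `Age` coefficient of -7.985e-02 corresponds to an odds ratio of about 0.92, meaning each additional year of age multiplies the odds of approval by roughly 0.92, holding the other variables fixed.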

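The significance codes in the above output (for example, `‘***’`) flag coefficients with small p-values, that is, predictors whose association with the target is statistically significant. They do not by themselves measure the relative importance of the feature variables. Here, `Age` and `Income` are significant at the 5 percent level. Several coefficients, such as `Credit_scoreSatisfactory` and the `Purpose` levels, have very large standard errors, which can indicate that a category almost perfectly separates the two classes, so those estimates should be read with caution. Let's evaluate the model further, starting by setting the baseline accuracy using the code below. Since the majority class of the target variable has a proportion of 0.68, the baseline accuracy is 68 percent.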

```r
# Baseline accuracy
prop.table(table(train$approval_status))
```

Output:

```
       No       Yes
0.3166667 0.6833333
```
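The baseline is simply the accuracy of always predicting the majority class. As a small sketch, the code below rebuilds the training labels from the class counts implied by the training confusion matrix in this guide (133 "No" and 287 "Yes") and extracts the baseline programmatically. The names `status`, `class_share`, and `baseline_accuracy` are illustrative.

```r
# Training-set class counts implied by the guide's confusion matrix
# (133 "No" and 287 "Yes" out of 420 rows)
status <- factor(rep(c("No", "Yes"), times = c(133, 287)))

# Share of each class, then the largest share and its label
class_share <- prop.table(table(status))
baseline_accuracy <- max(class_share)
majority_class <- names(which.max(class_share))

baseline_accuracy  # about 0.683
majority_class     # "Yes"
```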

Let's now evaluate the model performance on the training and test data, which should ideally be better than the baseline accuracy. We start by generating predicted probabilities on the training data, using the first line of code below. The second line creates the confusion matrix with a threshold of 0.5, which means that for predicted probabilities equal to or greater than 0.5, the model predicts the `Yes` response for the `approval_status` variable. The third line computes the accuracy as the number of correct predictions (the diagonal of the confusion matrix) divided by the number of observations, and the accuracy comes out to be 91 percent.

We then repeat this process on the test data, and the accuracy comes out to be 88 percent.

```r
# Predictions on the training set
# (predict() takes `newdata`, not `data`)
predictTrain = predict(model_glm, newdata = train, type = "response")

# Confusion matrix on training data
table(train$approval_status, predictTrain >= 0.5)
(114+268)/nrow(train) # Accuracy - 91%

# Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test set
table(test$approval_status, predictTest >= 0.5)
(44+114)/nrow(test) # Accuracy - 88%
```

Output:

```
# Confusion matrix and accuracy on training data

      FALSE TRUE
  No    114   19
  Yes    19  268

[1] 0.9095238

# Confusion matrix and accuracy on testing data

      FALSE TRUE
  No     44   13
  Yes     9  114

[1] 0.8777778
```
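Accuracy alone hides how the errors are split between the two classes. As a hedged sketch, the code below rebuilds the test-set confusion matrix from the output above and derives accuracy, sensitivity, specificity, and precision from it, treating `Yes` as the positive class. The object name `cm` and the metric variable names are illustrative.

```r
# Test-set confusion matrix copied from the output above
# rows = actual class, columns = (predicted probability >= 0.5)
cm <- matrix(c(44, 9, 13, 114), nrow = 2,
             dimnames = list(actual = c("No", "Yes"),
                             predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)                 # correct / all
sensitivity <- cm["Yes", "TRUE"] / sum(cm["Yes", ])    # actual "Yes" predicted "Yes"
specificity <- cm["No", "FALSE"] / sum(cm["No", ])     # actual "No" predicted "No"
precision   <- cm["Yes", "TRUE"] / sum(cm[, "TRUE"])   # predicted "Yes" that are correct

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
```

On these numbers the model identifies approved applications (sensitivity of about 0.93) more reliably than rejected ones (specificity of about 0.77), which the single accuracy figure does not show.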

## Conclusion

In this guide, you have learned how to build and evaluate a classification model in R using the logistic regression algorithm. The baseline accuracy for the data was 68 percent, while the accuracy on the training and test data was 91 percent and 88 percent, respectively. The model beats the baseline by about 20 percentage points on the test data, and the small gap between the training and test accuracy suggests that it generalizes reasonably well to unseen data.