Deepika Singh

Building Classification Models in R

  • Nov 18, 2019
  • 10 Min read
  • 26 Views

Introduction

Building classification models is one of the most important data science use cases. Classification models predict a categorical label; examples include predicting whether a customer will churn or whether a bank loan will default. In this guide, you will learn how to build and evaluate a classification model in R. We will train a logistic regression model, which is one of the oldest yet most powerful classification algorithms.

Data

In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No")

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  3. Income: Annual Income of the applicant (in USD)

  4. Loan_amount: Loan amount (in USD) for which the application was submitted

  5. Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not

  6. approval_status: Whether the loan application was approved ("Yes") or not ("No")

  7. Age: The applicant's age in years

  8. Sex: Whether the applicant is a male ("M") or a female ("F")

  9. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

  10. Purpose: Purpose of applying for the loan

Let's start by loading the required libraries and the data.

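# Load the required packages (plyr before dplyr, to avoid masking dplyr functions)
library(plyr)
library(readr)
library(dplyr)
library(caret)

# Read the loan data and inspect its structure
dat <- read_csv("data.csv")
glimpse(dat)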

Output:

Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satisfactory", ...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No"...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M", "F", "F",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "Travel", "...

The output shows that the dataset has four numerical variables (labeled int) and six character variables (labeled chr). We will convert the character variables into factors using the code below.

# Column positions of the six character variables
factor_cols <- c(1, 2, 5, 6, 8, 10)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)

Output:

Observations: 600
Variables: 10
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory, Satisfac...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, Yes, No, Y...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, M, M, M, M...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Travel, Educa...
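
The conversion above selects columns by position, which breaks silently if the column order changes. A minimal base R sketch of the same idea, selecting columns by type instead, is shown below. The `toy` data frame and its values are hypothetical stand-ins for the loan dataset, not the real data.

```r
# Hypothetical toy data standing in for the loan dataset
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Income         = c(30000L, 89900L, 133300L),
  Sex            = c("F", "M", "M"),
  stringsAsFactors = FALSE
)

# Find the character columns by type rather than by position
chr_cols <- vapply(toy, is.character, logical(1))

# Convert only those columns to factors; numeric columns are left untouched
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

Selecting by type means the same two lines keep working if a new character column is added to the data later.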
 

Data Partitioning

We will build our model on the training dataset and evaluate its performance on the test dataset. This is called the holdout-validation approach to evaluating model performance.

The first line of code below sets the random seed so that the results are reproducible. The second line loads the caTools package, which is used for data partitioning, and the third to fifth lines create the training and test datasets. Because sample.split() preserves the relative proportions of the labels in approval_status, both subsets keep roughly the same share of approved loans. The training dataset contains 70 percent of the data (420 observations of 10 variables), while the test dataset contains the remaining 30 percent (180 observations of 10 variables).

set.seed(100)
library(caTools)

spl <- sample.split(dat$approval_status, SplitRatio = 0.7)
train <- subset(dat, spl == TRUE)
test <- subset(dat, spl == FALSE)

print(dim(train)); print(dim(test))

Output:

[1] 420  10

[1] 180  10
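
To make the idea of a class-preserving 70/30 split concrete, here is a rough base R sketch of a stratified split. It is not the caTools implementation, only an illustration of the technique. The `toy` data frame is hypothetical; its class counts (190 "No", 410 "Yes") are chosen to match the totals implied by the confusion matrices later in this guide.

```r
set.seed(100)

# Hypothetical toy target with the same class balance as the guide's data
toy <- data.frame(
  id = 1:600,
  approval_status = factor(rep(c("No", "Yes"), times = c(190, 410)))
)

# Group the row indices by class, then sample 70 percent within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$approval_status)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions should be nearly identical in the two subsets
prop.table(table(toy_train$approval_status))
prop.table(table(toy_test$approval_status))
```

Sampling within each class keeps the share of approved loans the same in both subsets, so the baseline accuracy is comparable between training and test data.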

Build, Predict and Evaluate the Model

To fit the logistic regression model, we call the glm() function with family = "binomial", which is done in the first line of code below. The formula approval_status ~ . uses all the other variables as predictors. The second line prints the summary of the trained model.

model_glm <- glm(approval_status ~ ., family = "binomial", data = train)
summary(model_glm)

Output:

Call:
glm(formula = approval_status ~ ., family = "binomial", data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.19539  -0.00004   0.00004   0.00008   2.47763  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               6.238e-02  9.052e+03   0.000   1.0000    
Marital_statusYes         4.757e-01  4.682e-01   1.016   0.3096    
Is_graduateYes            5.647e-01  4.548e-01   1.242   0.2144    
Income                    2.244e-06  1.018e-06   2.204   0.0275 *  
Loan_amount              -3.081e-07  3.550e-07  -0.868   0.3854    
Credit_scoreSatisfactory  2.364e+01  8.839e+03   0.003   0.9979    
Age                      -7.985e-02  1.360e-02  -5.870 4.35e-09 ***
SexM                     -5.879e-01  6.482e-01  -0.907   0.3644    
Investment               -2.595e-06  1.476e-06  -1.758   0.0787 .  
PurposeHome               2.599e+00  9.052e+03   0.000   0.9998    
PurposeOthers            -4.172e+01  3.039e+03  -0.014   0.9890    
PurposePersonal           1.577e+00  2.503e+03   0.001   0.9995    
PurposeTravel            -1.986e+01  1.954e+03  -0.010   0.9919    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 524.44  on 419  degrees of freedom
Residual deviance: 166.96  on 407  degrees of freedom
AIC: 192.96

Number of Fisher Scoring iterations: 19
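
The estimates in the summary are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio: for example, the Age estimate of -7.985e-02 corresponds to exp(-0.07985), or about 0.92, meaning each additional year of age multiplies the odds of approval by roughly 0.92 when the other variables are held fixed. The sketch below illustrates the calculation on simulated data; the `toy` data frame, its variable names, and the simulation parameters are all assumptions made for illustration.

```r
set.seed(1)

# Simulated data (hypothetical): approval becomes less likely with age
n   <- 500
age <- round(runif(n, 21, 70))
p   <- plogis(4 - 0.08 * age)  # true approval probability for each applicant
approved <- factor(ifelse(rbinom(n, 1, p) == 1, "Yes", "No"))
toy <- data.frame(age = age, approved = approved)

# With a factor response, glm() models the probability of the second level ("Yes")
toy_fit <- glm(approved ~ age, family = "binomial", data = toy)

# Exponentiate the log-odds coefficients to obtain odds ratios
odds_ratios <- exp(coef(toy_fit))
print(odds_ratios)
```

An odds ratio below 1 means the predictor lowers the odds of approval, and a value above 1 means it raises them.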

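The significance codes in the above output flag the coefficients that are statistically significant: Age (‘***’, p < 0.001) and Income (‘*’, p < 0.05), with Investment marginal at the 0.1 level. They indicate how strong the evidence is that a coefficient differs from zero, not how important a variable is. The very large standard errors for the Credit_score and Purpose coefficients suggest that these categories separate the classes almost perfectly in the training data, so those estimates should be read with caution. Let's evaluate the model further, starting by setting the baseline accuracy using the code below. Since the majority class of the target variable has a proportion of 0.68, the baseline accuracy is 68 percent.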

# Baseline accuracy
prop.table(table(train$approval_status))

Output:

       No       Yes 
0.3166667 0.6833333
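
The baseline is simply the accuracy of a model that always predicts the most frequent class. As a small worked sketch, the counts below are the ones implied by the proportions above and the 420 training rows (0.3166667 × 420 = 133 and 0.6833333 × 420 = 287); the variable names are illustrative.

```r
# Class counts in the training data, implied by the proportions above
train_counts <- c(No = 133, Yes = 287)

# Baseline: always predict the most frequent class
majority_class    <- names(which.max(train_counts))
baseline_accuracy <- max(train_counts) / sum(train_counts)

majority_class
round(baseline_accuracy, 4)
```

Any useful model should score clearly above this value on both the training and the test data.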

Let's now evaluate the model performance on the training and test data, which should ideally be better than the baseline accuracy. We start by generating predictions on the training data, using the first line of code below. The second line creates the confusion matrix with a threshold of 0.5, which means that for probability predictions equal to or greater than 0.5, the algorithm will predict the Yes response for the approval_status variable. The third line prints the accuracy of the model on the training data, using the confusion matrix, and the accuracy comes out to be 91 percent.

We then repeat this process on the test data, and the accuracy comes out to be 88 percent.

# Predictions on the training set
predictTrain <- predict(model_glm, newdata = train, type = "response")

# Confusion matrix on training data
table(train$approval_status, predictTrain >= 0.5)
(114 + 268) / nrow(train)  # Accuracy: 91%

# Predictions on the test set
predictTest <- predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test set
table(test$approval_status, predictTest >= 0.5)
(44 + 114) / nrow(test)  # Accuracy: 88%

Output:

# Confusion matrix and accuracy on training data

     FALSE TRUE
  No    114   19
  Yes    19  268


[1] 0.9095238


# Confusion matrix and accuracy on testing data
     FALSE TRUE
  No     44   13
  Yes     9  114

[1] 0.8777778
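
Accuracy is only one summary of a confusion matrix. The sketch below re-enters the test-set matrix from the output above by hand and derives accuracy together with sensitivity, specificity, and precision, treating "Yes" (approved) as the positive class. The object name `cm` and the choice of positive class are assumptions made for illustration.

```r
# Test-set confusion matrix copied from the output above
# (rows: actual class, columns: predicted class at the 0.5 threshold)
cm <- matrix(
  c(44,  13,
     9, 114),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes"))
)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # actual approvals predicted as approved
specificity <- cm["No", "No"] / sum(cm["No", ])     # actual rejections predicted as rejected
precision   <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # predicted approvals that were correct

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
```

Looking at sensitivity and specificity separately shows whether the errors fall mostly on approved or on rejected applications, which a single accuracy figure hides.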

Conclusion

In this guide, you have learned how to build and evaluate a classification model in R using logistic regression. The baseline accuracy for the data was 68 percent, while the accuracy on the training and test data was 91 percent and 88 percent, respectively. The model beats the baseline by roughly 20 percentage points or more on both datasets, and the small gap between training and test accuracy suggests that it generalizes reasonably well to unseen data.
