Building classification models is one of the most important data science use cases. Classification models are models that predict a categorical label. A few examples of this include predicting whether a customer will churn or whether a bank loan will default. In this guide, you will learn how to build and evaluate a classification model in R. We will train the logistic regression algorithm, which is one of the oldest yet most powerful classification algorithms.
In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:
Marital_status: Whether the applicant is married ("Yes") or not ("No")
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
Income: Annual income of the applicant (in USD)
Loan_amount: Loan amount (in USD) for which the application was submitted
Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not
approval_status: Whether the loan application was approved ("Yes") or not ("No")
Age: The applicant's age in years
Sex: Whether the applicant is male ("M") or female ("F")
Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant
Purpose: Purpose of applying for the loan

Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)

dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ Is_graduate <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y...
$ Income <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satisfactory", ...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No"...
$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M", "F", "F",...
$ Investment <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose <chr> "Education", "Travel", "Others", "Others", "Travel", "Travel", "...
The output shows that the dataset has four numerical variables (labeled as int) and six character variables (labeled as chr). We will convert the character variables into factors using the lines of code below.
# Column positions of the six character variables
factor_cols <- c(1, 2, 5, 6, 8, 10)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes...
$ Is_graduate <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Y...
$ Income <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory, Satisfac...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, Yes, No, Y...
$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, M, M, M, M...
$ Investment <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose <fct> Education, Travel, Others, Others, Travel, Travel, Travel, Educa...
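The conversion can also be written without hard-coded column positions. The sketch below is one possible alternative, not the approach used in this guide: it builds a small made-up data frame (toy is a hypothetical name, not the loan data) and uses across() with where(is.character) to convert every character column to a factor. It assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for the loan dataset
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Income         = c(30000L, 89900L, 133300L),
  Sex            = c("F", "M", "M"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are left untouched
toy_fct <- toy %>%
  mutate(across(where(is.character), as.factor))

glimpse(toy_fct)
```

Selecting by type avoids having to update the index vector if columns are added or reordered.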
We will build our model on the training dataset and evaluate its performance on the test dataset. This is called the holdout-validation approach to evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line loads the caTools package that will be used for data partitioning, while the next three lines create the training and test datasets. The training dataset contains 70 percent of the data (420 observations of 10 variables) while the test dataset contains the remaining 30 percent (180 observations of 10 variables).
set.seed(100)
library(caTools)

spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl==TRUE)
test = subset(dat, spl==FALSE)

print(dim(train)); print(dim(test))
Output:
[1] 420 10
[1] 180 10
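The sample.split() function keeps the relative class proportions of the target similar in both partitions. As a rough illustration of that idea (a base R sketch, not the caTools implementation), the code below samples 70 percent of the row indices within each class of a made-up target. The names toy, idx_by_class, and train_idx are hypothetical, and the class counts are chosen only to mimic a 68/32 balance.

```r
set.seed(100)

# Hypothetical target with roughly the same class balance as the guide's data
toy <- data.frame(
  id = 1:600,
  approval_status = factor(rep(c("Yes", "No"), times = c(410, 190)))
)

# Sample 70 percent of the row indices within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$approval_status)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Class proportions should be close to identical in the two partitions
prop.table(table(train_toy$approval_status))
prop.table(table(test_toy$approval_status))
```

Sampling within each class, rather than over all rows at once, is what keeps the target distribution stable between the training and test sets.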
To fit the logistic regression model, we call the glm() function with family = "binomial", as shown in the first line of code below. This single call both specifies and fits the model on the training data. The second line prints the summary of the trained model.
model_glm = glm(approval_status ~ . , family="binomial", data = train)
summary(model_glm)
Output:
Call:
glm(formula = approval_status ~ ., family = "binomial", data = train)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.19539  -0.00004   0.00004   0.00008   2.47763

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)               6.238e-02  9.052e+03   0.000   1.0000
Marital_statusYes         4.757e-01  4.682e-01   1.016   0.3096
Is_graduateYes            5.647e-01  4.548e-01   1.242   0.2144
Income                    2.244e-06  1.018e-06   2.204   0.0275 *
Loan_amount              -3.081e-07  3.550e-07  -0.868   0.3854
Credit_scoreSatisfactory  2.364e+01  8.839e+03   0.003   0.9979
Age                      -7.985e-02  1.360e-02  -5.870 4.35e-09 ***
SexM                     -5.879e-01  6.482e-01  -0.907   0.3644
Investment               -2.595e-06  1.476e-06  -1.758   0.0787 .
PurposeHome               2.599e+00  9.052e+03   0.000   0.9998
PurposeOthers            -4.172e+01  3.039e+03  -0.014   0.9890
PurposePersonal           1.577e+00  2.503e+03   0.001   0.9995
PurposeTravel            -1.986e+01  1.954e+03  -0.010   0.9919
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 524.44 on 419 degrees of freedom
Residual deviance: 166.96 on 407 degrees of freedom
AIC: 192.96

Number of Fisher Scoring iterations: 19
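The estimates in the summary are on the log-odds scale, so exponentiating a coefficient gives an odds ratio. The sketch below copies the Age and Income estimates from the output above and converts them by hand. The 10-year and 10,000 USD increments are illustrative choices, not part of the original analysis, and the variable names are hypothetical.

```r
# Estimates copied from the summary output above (log-odds per unit)
coef_age    <- -7.985e-02
coef_income <- 2.244e-06

# Odds ratio for one additional year of age
or_age_1yr <- exp(coef_age)

# Odds ratios for larger, illustrative increments
or_age_10yr   <- exp(10 * coef_age)        # ten additional years
or_income_10k <- exp(10000 * coef_income)  # an additional 10,000 USD of income

round(c(age_1yr = or_age_1yr, age_10yr = or_age_10yr, income_10k = or_income_10k), 3)
```

An odds ratio below 1 for Age means that, holding the other variables fixed, higher age is associated with lower odds of approval. With the fitted model in memory, exp(coef(model_glm)) would apply the same conversion to every term.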
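The significance codes in the model summary flag the coefficients that are statistically significant, not their relative importance: Age (‘***’) and Income (‘*’) are significant at the 5 percent level, while Investment (‘.’) is significant only at the 10 percent level. The very large standard errors on the Credit_score and Purpose terms mean those estimates are unreliable and should not be interpreted individually. Let's evaluate the model further, starting by setting the baseline accuracy using the code below. The baseline accuracy is the accuracy obtained by always predicting the majority class. Since the majority class of the target variable has a proportion of 0.68, the baseline accuracy is 68 percent.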
# Baseline accuracy
prop.table(table(train$approval_status))
Output:
       No       Yes
0.3166667 0.6833333
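The baseline can also be extracted programmatically instead of being read off the table. A minimal sketch, assuming a made-up label vector with the class counts implied by the proportions above (133 "No" and 287 "Yes" out of 420); labels, class_props, and the other names are hypothetical.

```r
# Hypothetical label vector with the training set's class counts
labels <- factor(rep(c("No", "Yes"), times = c(133, 287)))

# Proportion of each class
class_props <- prop.table(table(labels))

# Baseline: accuracy of always predicting the most frequent class
majority_class    <- names(which.max(class_props))
baseline_accuracy <- max(class_props)

majority_class
round(baseline_accuracy, 4)
```

Any model worth keeping should beat this number, since it is what a classifier that ignores every feature would achieve.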
Let's now evaluate the model performance on the training and test data, which should ideally be better than the baseline accuracy. We start by generating predictions on the training data, using the first line of code below. The second line creates the confusion matrix with a threshold of 0.5, which means that for predicted probabilities equal to or greater than 0.5, the model predicts the Yes response for the approval_status variable. The third line prints the accuracy of the model on the training data, using the counts on the diagonal of the confusion matrix, and the accuracy comes out to be 91 percent.
We then repeat this process on the test data, and the accuracy comes out to be 88 percent.
# Predictions on the training set
# (predict() takes newdata, not data; an unknown data argument is silently ignored)
predictTrain = predict(model_glm, newdata = train, type = "response")

# Confusion matrix on training data
table(train$approval_status, predictTrain >= 0.5)
(114+268)/nrow(train) # Accuracy - 91%

# Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test set
table(test$approval_status, predictTest >= 0.5)
158/nrow(test) # Accuracy - 88%
Output:
# Confusion matrix and accuracy on training data

      FALSE TRUE
  No    114   19
  Yes    19  268

[1] 0.9095238

# Confusion matrix and accuracy on testing data

      FALSE TRUE
  No     44   13
  Yes     9  114

[1] 0.8777778
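Rather than typing the cell counts by hand, accuracy can be computed from the confusion matrix itself. The sketch below rebuilds the two matrices from the counts printed above and derives accuracy, sensitivity, and specificity. The helper name cm_metrics is hypothetical, and it assumes rows are actual classes and columns are predictions, in the order shown in the output.

```r
# Confusion matrices rebuilt from the counts printed above
# (rows = actual class, columns = predicted probability >= 0.5)
cm_train <- matrix(c(114, 19, 19, 268), nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("No", "Yes"),
                                   predicted = c("FALSE", "TRUE")))
cm_test <- matrix(c(44, 13, 9, 114), nrow = 2, byrow = TRUE,
                  dimnames = list(actual = c("No", "Yes"),
                                  predicted = c("FALSE", "TRUE")))

# Hypothetical helper: summary metrics from a 2x2 confusion matrix
cm_metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["Yes", "TRUE"] / sum(cm["Yes", ]),  # true positive rate
    specificity = cm["No", "FALSE"] / sum(cm["No", ]))   # true negative rate
}

round(cm_metrics(cm_train), 4)
round(cm_metrics(cm_test), 4)
```

Sensitivity and specificity are worth checking alongside accuracy here because the classes are imbalanced, so a model can score well overall while doing noticeably worse on the minority "No" class.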
In this guide, you learned how to build and evaluate a classification model in R using logistic regression. The baseline accuracy for the data was 68 percent, while the accuracy on the training and test data was 91 percent and 88 percent, respectively. The model therefore beats the baseline by roughly 20 percentage points on both datasets, and the small gap between training and test accuracy suggests that it generalizes reasonably well to unseen data.