Building classification models is one of the most important data science use cases. A classification model predicts a categorical label. Examples include predicting whether a customer will churn or whether a bank loan will default. In this guide, you will learn how to build and evaluate a classification model in R. We will train a logistic regression model, which is one of the oldest yet most powerful classification algorithms.
In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:
Marital_status: Whether the applicant is married ("Yes") or not ("No")
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
Income: Annual income of the applicant (in USD)
Loan_amount: Loan amount (in USD) for which the application was submitted
Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not
approval_status: Whether the loan application was approved ("Yes") or not ("No")
Age: The applicant's age in years
Sex: Whether the applicant is male ("M") or female ("F")
Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant
Purpose: Purpose of applying for the loan
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)

dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satisfactory", ...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No"...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M", "F", "F",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "Travel", "...
The output shows that the dataset has four numerical variables (labeled as int) and six character variables (labeled as chr). We will convert the six character variables into factors using the code below, which selects them by column position.
names <- c(1,2,5,6,8,10)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory, Satisfac...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, Yes, No, Y...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, M, M, M, M...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Travel, Educa...
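The conversion above relies on hard-coded column positions. As an alternative sketch (not the approach used in this guide), the character columns can be picked out by type, which keeps working if the column order changes. The data frame toy and its values are made up purely for illustration.

```r
# Toy data frame standing in for `dat`; the values are invented for illustration.
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Income = c(30000L, 89900L, 133300L),
  Purpose = c("Education", "Travel", "Others"),
  stringsAsFactors = FALSE
)

# Find the character columns by type instead of hard-coding their positions.
chr_cols <- vapply(toy, is.character, logical(1))

# Convert only those columns to factors; numeric columns are left untouched.
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

The same idea should carry over to dat, since vapply() and lapply() operate column by column on any data frame.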
We will build our model on the training dataset and evaluate its performance on the test dataset. This is called the holdout-validation approach to evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line loads the caTools package that will be used for data partitioning, while the third to fifth lines create the training and test datasets. The last line prints their dimensions. The training dataset contains 70 percent of the data (420 observations of 10 variables), while the test dataset contains the remaining 30 percent (180 observations of 10 variables).
set.seed(100)
library(caTools)
spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl==TRUE)
test = subset(dat, spl==FALSE)
print(dim(train)); print(dim(test))
Output:
[1] 420  10
[1] 180  10
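To illustrate what a class-preserving 70/30 split does, here is a minimal base R sketch. It is not the internal implementation of sample.split(); it only mimics the idea of sampling 70 percent of the rows within each class. The outcome vector y is an assumption built from the class counts implied by the confusion matrices reported later in this guide (190 "No" and 410 "Yes").

```r
set.seed(100)

# Outcome vector with the class counts implied by this guide's results
# (190 "No", 410 "Yes"); the row ordering is arbitrary.
y <- factor(rep(c("No", "Yes"), times = c(190, 410)))

# For each class, draw 70 percent of its row indices for training.
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

# Logical mask: TRUE for training rows, FALSE for test rows.
in_train <- seq_along(y) %in% train_idx

# Class proportions are (almost) identical in both partitions.
prop.table(table(y[in_train]))
prop.table(table(y[!in_train]))
```

Sampling within each class keeps the share of approved loans the same in the training and test sets, so accuracy on the two partitions is measured against the same baseline.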
The first line of code below fits the logistic regression model with the glm() function. The formula approval_status ~ . uses all other variables as predictors, and family="binomial" tells glm() to fit a logistic regression. The second line prints the summary of the trained model.
model_glm = glm(approval_status ~ . , family="binomial", data = train)
summary(model_glm)
Output:
Call:
glm(formula = approval_status ~ ., family = "binomial", data = train)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.19539  -0.00004   0.00004   0.00008   2.47763

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)               6.238e-02  9.052e+03   0.000   1.0000
Marital_statusYes         4.757e-01  4.682e-01   1.016   0.3096
Is_graduateYes            5.647e-01  4.548e-01   1.242   0.2144
Income                    2.244e-06  1.018e-06   2.204   0.0275 *
Loan_amount              -3.081e-07  3.550e-07  -0.868   0.3854
Credit_scoreSatisfactory  2.364e+01  8.839e+03   0.003   0.9979
Age                      -7.985e-02  1.360e-02  -5.870 4.35e-09 ***
SexM                     -5.879e-01  6.482e-01  -0.907   0.3644
Investment               -2.595e-06  1.476e-06  -1.758   0.0787 .
PurposeHome               2.599e+00  9.052e+03   0.000   0.9998
PurposeOthers            -4.172e+01  3.039e+03  -0.014   0.9890
PurposePersonal           1.577e+00  2.503e+03   0.001   0.9995
PurposeTravel            -1.986e+01  1.954e+03  -0.010   0.9919
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 524.44  on 419  degrees of freedom
Residual deviance: 166.96  on 407  degrees of freedom
AIC: 192.96

Number of Fisher Scoring iterations: 19
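The estimates in the model summary are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate a coefficient into an odds ratio. The sketch below does this for the Age estimate copied from the summary; the variable names are illustrative only.

```r
# Age coefficient (change in log-odds per year) copied from the model summary.
beta_age <- -7.985e-02

# Exponentiating a logistic regression coefficient gives an odds ratio.
odds_ratio_age <- exp(beta_age)

# Percentage change in the odds of approval per additional year of age.
pct_change <- (odds_ratio_age - 1) * 100

round(odds_ratio_age, 3)
round(pct_change, 1)
```

An odds ratio below 1 means that, holding the other predictors fixed, each additional year of age is associated with lower odds of approval in this fitted model. With a fitted model object, exp(coef(model_glm)) would give the odds ratios for all terms at once.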
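The significance codes in the model summary flag the coefficients that are statistically significant: Age is significant at the 0.001 level (‘***’) and Income at the 0.05 level (‘*’). Note that these codes indicate statistical significance, not the relative importance of the feature variables. Let's evaluate the model further, starting by setting the baseline accuracy using the code below. Since the majority class of the target variable has a proportion of 0.68, the baseline accuracy is 68 percent.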
#Baseline Accuracy
prop.table(table(train$approval_status))
Output:
       No       Yes
0.3166667 0.6833333
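The baseline is simply the accuracy of a rule that always predicts the majority class. A minimal sketch of that calculation follows, using the training-set class counts implied by this guide's results (133 "No" and 287 "Yes"); the variable names are illustrative.

```r
# Training-set class counts implied by the proportions reported in this guide.
class_counts <- c(No = 133, Yes = 287)

# A baseline classifier always predicts the most frequent class.
majority_class <- names(which.max(class_counts))

# Its accuracy equals the share of that class in the data.
baseline_accuracy <- max(class_counts) / sum(class_counts)

majority_class
round(baseline_accuracy, 4)
```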
Let's now evaluate the model performance on the training and test data, which should ideally be better than the baseline accuracy. We start by generating predictions on the training data, using the first line of code below. The second line creates the confusion matrix with a threshold of 0.5, which means that for probability predictions equal to or greater than 0.5, the algorithm will predict the Yes response for the approval_status variable. The third line prints the accuracy of the model on the training data, using the confusion matrix, and the accuracy comes out to be 91 percent.
We then repeat this process on the test data, and the accuracy comes out to be 88 percent.
# Predictions on the training set
predictTrain = predict(model_glm, newdata = train, type = "response")

# Confusion matrix on training data
table(train$approval_status, predictTrain >= 0.5)
(114+268)/nrow(train) # Accuracy - 91%

# Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test set
table(test$approval_status, predictTest >= 0.5)
158/nrow(test) # Accuracy - 88%
Output:
# Confusion matrix and accuracy on training data
      FALSE TRUE
  No    114   19
  Yes    19  268
[1] 0.9095238

# Confusion matrix and accuracy on testing data
      FALSE TRUE
  No     44   13
  Yes     9  114
[1] 0.8777778
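Accuracy is only one summary of a confusion matrix. As a sketch, the test-set counts reported in this guide can be re-entered by hand to compute accuracy without hard-coding the sum, along with sensitivity and specificity. The object name cm and the metric names are illustrative, and "Yes" is treated as the positive class by assumption.

```r
# Test-set confusion matrix copied from the reported output
# (rows = actual class, columns = predicted probability >= 0.5).
cm <- matrix(c(44, 9, 13, 114), nrow = 2,
             dimnames = list(actual = c("No", "Yes"),
                             predicted = c("FALSE", "TRUE")))

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual "Yes" cases predicted as "Yes".
sensitivity <- cm["Yes", "TRUE"] / sum(cm["Yes", ])

# Specificity: share of actual "No" cases predicted as "No".
specificity <- cm["No", "FALSE"] / sum(cm["No", ])

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

On these counts the model identifies approved applications more reliably than rejected ones, a difference that the single accuracy figure does not reveal.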
In this guide, you have learned how to build a classification model in R using the logistic regression algorithm. The baseline accuracy for the data was 68 percent, while the accuracy on the training and test data was 91 percent and 88 percent, respectively. The logistic regression model beats the baseline by about 20 percentage points or more on both the training and test datasets, and the small gap between training and test accuracy suggests that the model generalizes reasonably well.