Explore R Libraries: CARET

Deepika Singh

  • Jun 26, 2020
  • 11 Min read
  • 35 Views
Data
Data Analytics
Machine Learning

Introduction

R is a powerful programming language for data science that provides a wide number of libraries for machine learning. One of the most powerful and popular packages is the caret library, which follows a consistent syntax for data preparation, model building, and model evaluation, making it easy for data science practitioners.

Caret stands for Classification And REgression Training and is one of the most comprehensive machine learning packages in R. It can address almost any classification or regression problem: it provides a unified interface to roughly 200 machine learning algorithms and streamlines critical tasks such as data preparation, data cleaning, feature selection, and model validation.

In this guide, you will learn how to work with the caret library in R.

Data

In this guide, you will use a fictitious dataset of loan applicants containing 600 observations and 8 variables, as described below:

  1. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  2. Income: Annual Income of the applicant (in USD)

  3. Loan_amount: Loan amount (in USD) for which the application was submitted

  4. Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not ("Not_Satisfactory")

  5. approval_status: Whether the loan application was approved ("Yes") or not ("No")

  6. Age: The applicant's age in years

  7. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

  8. Purpose: Purpose of applying for the loan

The first step is to load the required libraries and the data.

library(caret)
library(plyr)
library(readr)
library(dplyr)
library(ROSE)

dat <- read_csv("data.csv")

glimpse(dat)

Output:

Observations: 600
Variables: 8
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 3000, 3000, 3000, 3000, 8990, 13330, 13670, 13670, 173...
$ Loan_amount     <dbl> 6000, 9000, 9000, 9000, 8091, 11997, 12303, 12303, 155...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Not _...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 27, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Investment      <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 12124,...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "...

The output shows that the dataset has four numeric and four character variables. Convert the character variables into factors with the lines of code below.

# columns 1, 4, 5, and 8 hold the character variables; convert them to factors
names <- c(1,4,5,8)
dat[,names] <- lapply(dat[,names], factor)
glimpse(dat)

Output:

Observations: 600
Variables: 8
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 3000, 3000, 3000, 3000, 8990, 13330, 13670, 13670, 173...
$ Loan_amount     <dbl> 6000, 9000, 9000, 9000, 8091, 11997, 12303, 12303, 155...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Not _satisfa...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 27, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Investment      <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 12124,...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Tra...

Data Partition

The createDataPartition function is extremely useful for splitting the data into training and test datasets. This data partition is required because you will build the model on the training set and evaluate its performance on the test set. This is called the holdout-validation method for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line performs the data partition, while the third and fourth lines create the training and test sets. The training set contains 70 percent of the data (420 observations of 8 variables) and the test set contains the remaining 30 percent (180 observations of 8 variables).

set.seed(100)
trainRowNumbers <- createDataPartition(dat$approval_status, p=0.7, list=FALSE)
train <- dat[trainRowNumbers,]
test <- dat[-trainRowNumbers,]
dim(train); dim(test) 

Output:

[1] 420   8

[1] 180   8
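A useful property of createDataPartition is that it samples within each class of the outcome, so the class proportions of approval_status are preserved in both partitions. A minimal self-contained sketch on a toy outcome vector (the y variable below is illustrative, not part of the loan data):

```r
library(caret)

set.seed(100)
# Toy outcome vector with a 70/30 class split, standing in for approval_status
y <- factor(rep(c("Yes", "No"), times = c(70, 30)))

idx <- createDataPartition(y, p = 0.7, list = FALSE)

# Both partitions keep (approximately) the original 70/30 class balance
prop.table(table(y[idx]))
prop.table(table(y[-idx]))
```

This stratified behavior is why createDataPartition is preferred over a plain random sample when the outcome classes are imbalanced.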

Feature Scaling

The numeric features need to be scaled because the units of the variables differ significantly and may influence the modeling process. The first line of code below creates a list that contains the names of numeric variables. The second line uses the preProcess function from the caret library to complete the task. The method is to center and scale the numeric features, and the pre-processing object is fit only to the training data.

The third and fourth lines of code apply the scaling to both the train and test data partitions. The fifth line prints the summary of the preprocessed train set. The output shows that all the numeric features in the training set now have a mean of zero.

cols = c('Income', 'Loan_amount', 'Age', 'Investment')

pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))

train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])

summary(train)

Output:

Is_graduate     Income         Loan_amount                 Credit_score
 No : 90     Min.   :-1.3309   Min.   :-1.6568   Not _satisfactory: 97  
 Yes:330     1st Qu.:-0.5840   1st Qu.:-0.3821   Satisfactory     :323  
             Median :-0.3190   Median :-0.1459                          
             Mean   : 0.0000   Mean   : 0.0000                          
             3rd Qu.: 0.2341   3rd Qu.: 0.2778                          
             Max.   : 5.2695   Max.   : 3.7541                          
 approval_status      Age               Investment            Purpose   
 No :133         Min.   :-1.7607181   Min.   :-1.09348   Education: 76  
 Yes:287         1st Qu.:-0.8807620   1st Qu.:-0.60103   Home     :100  
                 Median :-0.0008058   Median :-0.28779   Others   : 45  
                 Mean   : 0.0000000   Mean   : 0.00000   Personal :113  
                 3rd Qu.: 0.8114614   3rd Qu.: 0.02928   Travel   : 86  
                 Max.   : 1.8944843   Max.   : 4.54891                  
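Note that because pre_proc_val was fit on the training partition only, the scaled test-set columns will have means close to, but generally not exactly, zero. A self-contained sketch on simulated data (the train_toy and test_toy objects are illustrative):

```r
library(caret)

set.seed(100)
# Simulated train/test columns, standing in for the loan data's numeric features
train_toy <- data.frame(x = rnorm(100, mean = 50, sd = 10))
test_toy  <- data.frame(x = rnorm(50,  mean = 50, sd = 10))

# Fit the centering/scaling parameters on the training data only
pp <- preProcess(train_toy, method = c("center", "scale"))

train_scaled <- predict(pp, train_toy)
test_scaled  <- predict(pp, test_toy)

mean(train_scaled$x)  # zero, up to floating-point error
mean(test_scaled$x)   # close to zero, but not exactly zero
```

Fitting the pre-processing object on the training data alone prevents information from the test set leaking into the model-building process.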

Model Building

There are several machine learning models available in caret. You can have a look at these models with the code below.

available_models <- paste(names(getModelInfo()), collapse=',  ')
available_models

Output:

[1] "ada,  AdaBag,  AdaBoost.M1,  adaboost,  amdai,  ANFIS,  avNNet,  awnb,  awtan,  bag,  bagEarth,  bagEarthGCV,  bagFDA,  bagFDAGCV,  bam,  bartMachine,  bayesglm,  binda,  blackboost,  blasso,  blassoAveraged,  bridge,  brnn,  BstLm,  bstSm,  bstTree,  C5.0,  C5.0Cost,  C5.0Rules,  C5.0Tree,  cforest,  chaid,  CSimca,  ctree,  ctree2,  cubist,  dda,  deepboost,  DENFIS,  dnn,  dwdLinear,  dwdPoly,  dwdRadial,  earth,  elm,  enet,  evtree,  extraTrees,  fda,  FH.GBML,  FIR.DM,  foba,  FRBCS.CHI,  FRBCS.W,  FS.HGD,  gam,  gamboost,  gamLoess,  gamSpline,  gaussprLinear,  gaussprPoly,  gaussprRadial,  gbm_h2o,  gbm,  gcvEarth,  GFS.FR.MOGUL,  GFS.GCCL,  GFS.LT.RS,  GFS.THRIFT,  glm.nb,  glm,  glmboost,  glmnet_h2o,  glmnet,  glmStepAIC,  gpls,  hda,  hdda,  hdrda,  HYFIS,  icr,  J48,  JRip,  kernelpls,  kknn,  knn,  krlsPoly,  krlsRadial,  lars,  lars2,  lasso,  lda,  lda2,  leapBackward,  leapForward,  leapSeq,  Linda,  lm,  lmStepAIC,  LMT,  loclda,  logicBag,  LogitBoost,  logreg, ... <truncated>
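To inspect a single entry rather than the full list, getModelInfo can also be queried by name. For example, the metadata for the random forest method used later in this guide (a sketch, assuming the caret library is loaded):

```r
library(caret)

# regex = FALSE returns only the exact "rf" entry
rf_info <- getModelInfo("rf", regex = FALSE)$rf

rf_info$label       # the model's descriptive name
rf_info$parameters  # the tuning parameters, here mtry
```

The parameters element tells you which hyperparameters caret will tune for a given method.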

The next step is to build the random forest model. Start by setting the seed in the first line of code below. The second line specifies the parameters that control the model training process, using the trainControl function. The sampling="rose" argument applies the ROSE technique to re-balance the outcome classes during resampling, and repeated five-fold cross-validation is used for evaluation.

The third line trains the random forest algorithm, specified by the argument method="rf". Accuracy is selected as the evaluation criterion.

set.seed(100)

control1 <- trainControl(sampling="rose",method="repeatedcv", number=5, repeats=5)

rf_model <- train(approval_status ~., data=train, method="rf", metric="Accuracy", trControl=control1)

You can examine the model with the command below.

rf_model 

Output:

Random Forest 

420 samples
  7 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times) 
Summary of sample sizes: 336, 336, 336, 336, 336, 336, ... 
Addtional sampling using ROSE

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.8799087  0.7300565
   6    0.7163380  0.4289620
  10    0.6675567  0.3352061

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
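Beyond accuracy, caret's varImp function reports which predictors the fitted model relies on most. A minimal sketch on the built-in iris data (illustrative only; the same varImp call applies to the model trained above, and this fit assumes the randomForest package is installed):

```r
library(caret)

set.seed(100)
# Illustrative random forest fit on iris, standing in for the loan model
m <- train(Species ~ ., data = iris, method = "rf",
           trControl = trainControl(method = "cv", number = 3))

# Importance scores, scaled 0-100 by default
varImp(m)
```

Inspecting variable importance is a quick sanity check that the model is driven by plausible predictors rather than noise.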

Model Evaluation

After building the algorithm on the training data, the next step is to evaluate its performance on the test dataset. The lines of code below generate predictions on the test set and print the confusion matrix.

predictTest = predict(rf_model, newdata = test, type = "raw")

table(test$approval_status, predictTest) 

Output:

    predictTest
       No Yes
  No   56   1
  Yes  10 113
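Instead of tabulating by hand, caret's confusionMatrix function computes the accuracy along with sensitivity, specificity, and other statistics in one call. A small self-contained sketch with toy factor vectors (pred and ref below are illustrative, not the guide's predictions):

```r
library(caret)

# Toy predictions and reference labels with the same "No"/"Yes" levels as above
pred <- factor(c("No", "No", "Yes", "Yes", "Yes", "No"), levels = c("No", "Yes"))
ref  <- factor(c("No", "Yes", "Yes", "Yes", "No", "No"), levels = c("No", "Yes"))

cm <- confusionMatrix(data = pred, reference = ref, positive = "Yes")
cm$overall["Accuracy"]  # 4 of 6 correct
```

The positive argument tells caret which class to treat as the event of interest when computing sensitivity and specificity.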

The accuracy can be calculated from the confusion matrix with the code below.

(113+56)/nrow(test)

Output:

[1] 0.9388889

The output shows that the accuracy is approximately 94 percent, which indicates that the model performed well.

Conclusion

In this guide, you learned how to use the caret library to load and prepare data, partition it into training and test sets, scale numeric features, train a random forest model with repeated cross-validation, and evaluate its performance on unseen data. The same consistent workflow applies to the roughly 200 other algorithms that caret supports, which makes it a versatile starting point for classification and regression problems in R.