R is a powerful programming language for data science that provides a wide range of libraries for machine learning. One of the most popular of these is caret, which offers a consistent syntax for data preparation, model building, and model evaluation, making the end-to-end workflow straightforward for data science practitioners.
Caret stands for classification and regression training and is one of the most comprehensive machine learning packages in R. It is sufficient to solve almost any classification or regression problem: it supports approximately 200 machine learning algorithms and makes it easy to perform critical tasks such as data preparation, data cleaning, feature selection, and model validation.
In this guide, you will learn how to work with the caret library in R.
In this guide, you will use a fictitious dataset of loan applicants containing 600 observations and 8 variables, as described below:
Is_graduate
: Whether the applicant is a graduate ("Yes") or not ("No")
Income
: Annual Income of the applicant (in USD)
Loan_amount
: Loan amount (in USD) for which the application was submitted
Credit_score
: Whether the applicant's credit score is satisfactory ("Satisfactory") or not ("Not_Satisfactory")
approval_status
: Whether the loan application was approved ("Yes") or not ("No")
Age
: The applicant's age in years
Investment
: Total investment in stocks and mutual funds (in USD) as declared by the applicant
Purpose
: Purpose of applying for the loan

The first step is to load the required libraries and the data.
library(caret)
library(plyr)
library(readr)
library(dplyr)
library(ROSE)

dat <- read_csv("data.csv")

glimpse(dat)
Output:
Observations: 600
Variables: 8
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 3000, 3000, 3000, 3000, 8990, 13330, 13670, 13670, 173...
$ Loan_amount     <dbl> 6000, 9000, 9000, 9000, 8091, 11997, 12303, 12303, 155...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Not _...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 27, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Investment      <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 12124,...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "...
The output shows that the dataset has four numeric and four character variables. Convert the character variables into factors using the lines of code below.
# columns 1, 4, 5, and 8 hold the character variables
factor_cols <- c(1, 4, 5, 8)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 8
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 3000, 3000, 3000, 3000, 8990, 13330, 13670, 13670, 173...
$ Loan_amount     <dbl> 6000, 9000, 9000, 9000, 8091, 11997, 12303, 12303, 155...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Not _satisfa...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 27, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Investment      <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 12124,...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
The createDataPartition function is extremely useful for splitting the data into training and test datasets. This data partition is required because you will build the model on the training set and evaluate its performance on the test set. This is called the holdout-validation method for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line performs the data partition, while the third and fourth lines create the training and test sets. The training set contains 70 percent of the data (420 observations of 8 variables) and the test set contains the remaining 30 percent (180 observations of 8 variables).
set.seed(100)
trainRowNumbers <- createDataPartition(dat$approval_status, p = 0.7, list = FALSE)
train <- dat[trainRowNumbers,]
test <- dat[-trainRowNumbers,]
dim(train); dim(test)
Output:
[1] 420   8

[1] 180   8
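Because createDataPartition performs a stratified split on the outcome variable, the class proportions of approval_status should be roughly the same in both partitions. A quick check using base R:

# compare class proportions across the two partitions
prop.table(table(train$approval_status))
prop.table(table(test$approval_status))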
The numeric features need to be scaled because the units of the variables differ significantly and may influence the modeling process. The first line of code below creates a character vector containing the names of the numeric variables. The second line uses the preProcess function from the caret library to complete the task. The method is to center and scale the numeric features, and the pre-processing object is fit only on the training data.

The third and fourth lines of code apply the scaling to both the train and test data partitions. The fifth line prints the summary of the preprocessed training set. The output shows that all the numeric features now have a mean value of zero.
cols <- c('Income', 'Loan_amount', 'Age', 'Investment')

pre_proc_val <- preProcess(train[, cols], method = c("center", "scale"))

train[, cols] <- predict(pre_proc_val, train[, cols])
test[, cols] <- predict(pre_proc_val, test[, cols])

summary(train)
Output:
 Is_graduate     Income           Loan_amount              Credit_score
 No : 90     Min.   :-1.3309   Min.   :-1.6568   Not _satisfactory: 97
 Yes:330     1st Qu.:-0.5840   1st Qu.:-0.3821   Satisfactory     :323
             Median :-0.3190   Median :-0.1459
             Mean   : 0.0000   Mean   : 0.0000
             3rd Qu.: 0.2341   3rd Qu.: 0.2778
             Max.   : 5.2695   Max.   : 3.7541
 approval_status      Age                Investment           Purpose
 No :133         Min.   :-1.7607181   Min.   :-1.09348   Education: 76
 Yes:287         1st Qu.:-0.8807620   1st Qu.:-0.60103   Home     :100
                 Median :-0.0008058   Median :-0.28779   Others   : 45
                 Mean   : 0.0000000   Mean   : 0.00000   Personal :113
                 3rd Qu.: 0.8114614   3rd Qu.: 0.02928   Travel   : 86
                 Max.   : 1.8944843   Max.   : 4.54891
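Centering and scaling also gives each numeric feature a standard deviation of one. If you want to verify this on the training partition, a one-line check:

# standard deviation of each scaled feature (each should be 1)
sapply(train[, cols], sd)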
Caret makes a large number of machine learning models available. You can list their names with the code below.
available_models <- paste(names(getModelInfo()), collapse = ', ')
available_models
Output:
1[1] "ada, AdaBag, AdaBoost.M1, adaboost, amdai, ANFIS, avNNet, awnb, awtan, bag, bagEarth, bagEarthGCV, bagFDA, bagFDAGCV, bam, bartMachine, bayesglm, binda, blackboost, blasso, blassoAveraged, bridge, brnn, BstLm, bstSm, bstTree, C5.0, C5.0Cost, C5.0Rules, C5.0Tree, cforest, chaid, CSimca, ctree, ctree2, cubist, dda, deepboost, DENFIS, dnn, dwdLinear, dwdPoly, dwdRadial, earth, elm, enet, evtree, extraTrees, fda, FH.GBML, FIR.DM, foba, FRBCS.CHI, FRBCS.W, FS.HGD, gam, gamboost, gamLoess, gamSpline, gaussprLinear, gaussprPoly, gaussprRadial, gbm_h2o, gbm, gcvEarth, GFS.FR.MOGUL, GFS.GCCL, GFS.LT.RS, GFS.THRIFT, glm.nb, glm, glmboost, glmnet_h2o, glmnet, glmStepAIC, gpls, hda, hdda, hdrda, HYFIS, icr, J48, JRip, kernelpls, kknn, knn, krlsPoly, krlsRadial, lars, lars2, lasso, lda, lda2, leapBackward, leapForward, leapSeq, Linda, lm, lmStepAIC, LMT, loclda, logicBag, LogitBoost, logreg, ... <truncated>
The next step is to build the random forest algorithm. Start by setting the seed in the first line of code below. The second line specifies the parameters that control the model training process, using the trainControl function. The method="repeatedcv" argument requests five-fold cross-validation repeated five times, and the sampling="rose" argument applies ROSE sampling during resampling to compensate for the imbalance between approved and rejected applications.

The third line trains the random forest algorithm, specified by the argument method="rf", with accuracy as the evaluation criterion.
set.seed(100)

control1 <- trainControl(sampling = "rose", method = "repeatedcv", number = 5, repeats = 5)

rf_model <- train(approval_status ~ ., data = train, method = "rf", metric = "Accuracy", trControl = control1)
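By default, train evaluates a small, automatically chosen set of mtry values. If you prefer to control the search yourself, you can supply an explicit grid through the tuneGrid argument. The sketch below is illustrative; the object names and candidate values are arbitrary choices, not part of the original analysis.

# optional: tune mtry over a custom grid instead of the default
rf_grid <- expand.grid(mtry = c(2, 4, 6))
rf_model_tuned <- train(approval_status ~ ., data = train, method = "rf",
                        metric = "Accuracy", trControl = control1,
                        tuneGrid = rf_grid)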
You can examine the model with the command below.
rf_model
Output:
Random Forest

420 samples
  7 predictor
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 336, 336, 336, 336, 336, 336, ...
Addtional sampling using ROSE

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
   2    0.8799087  0.7300565
   6    0.7163380  0.4289620
  10    0.6675567  0.3352061

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
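You can also inspect which predictors contributed most to the fitted model with caret's varImp function:

# variable importance scores for the fitted random forest
varImp(rf_model)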
After building the algorithm on the training data, the next step is to evaluate its performance on the test dataset. The lines of code below generate predictions on the test set and print the confusion matrix.
predictTest <- predict(rf_model, newdata = test, type = "raw")

table(test$approval_status, predictTest)
Output:
     predictTest
       No Yes
  No   56   1
  Yes  10 113
The accuracy can be calculated from the confusion matrix with the code below.
# correctly classified observations divided by the total number of test cases
(113 + 56)/nrow(test)
Output:
[1] 0.9388889
The output shows an accuracy of approximately 94 percent, which indicates that the model performed well on the test data.
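Instead of computing the accuracy by hand, you can also use caret's confusionMatrix function, which reports accuracy along with statistics such as sensitivity, specificity, and Kappa in a single call. Treating "Yes" (loan approved) as the positive class is a choice made here for illustration:

# full set of classification metrics; "Yes" is treated as the positive class
confusionMatrix(data = predictTest, reference = test$approval_status, positive = "Yes")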
In this guide, you learned about the caret library, which is one of the most powerful packages in R. You also learned how to scale features, create data partitions, and train and evaluate machine learning algorithms.
To learn more about data science and machine learning with R, explore other guides on these topics.