Deepika Singh

Explore R Libraries: MICE

  • Jul 22, 2020
  • 16 Min read
  • 362 Views

Introduction

Dealing with missing values is a common task for data scientists when building machine learning models. There are several methods of dealing with missing values, and if you want to use advanced techniques, the mice library in R is a great option.

MICE stands for Multivariate Imputation by Chained Equations, and it works by creating multiple imputations (replacement values) for multivariate missing data. The MICE algorithm can be used with different data types such as continuous, binary, unordered categorical, and ordered categorical data.

In this guide, you will learn how to work with the mice library in R.

Data

In this guide, you will use a fictitious dataset of loan applicants containing 600 observations and eight variables, as described below:

  1. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  2. Income: Annual Income of the applicant (in USD)

  3. Loan_amount: Loan amount (in USD) for which the application was submitted

  4. Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not ("Not_Satisfactory")

  5. approval_status: Whether the loan application was approved ("Yes") or not ("No")

  6. Age: The applicant's age in years

  7. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

  8. Purpose: Purpose of applying for the loan

The first step is to load the required libraries and the data.

```r
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(mice)
library(VIM)

# Path on the author's machine; point this at your own copy of data_mice.csv
dat <- read_csv("C:/Notes_Old/A_Resources/data_qna/Content writing/R guides/caret package/data_mice.csv")

glimpse(dat)
```

Output:

```
Observations: 600
Variables: 8
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ Income          <int> 3000, 3000, 3000, 3000, 8990, NA, NA, NA, NA, NA, N...
$ Loan_amount     <dbl> 6000, NA, NA, NA, 8091, NA, NA, NA, NA, NA, NA, NA,...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", NA,...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes"...
$ Age             <int> 27, 29, 27, 33, 29, NA, 29, 27, 33, 29, NA, 29, 27,...
$ Investment      <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 121...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel"...
```

The output shows that the dataset has four numerical and four character variables. You will convert the character variables into factors with the code below.

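```r
# Column positions of the character variables
# (named factor_cols so it does not shadow base R's names() function)
factor_cols <- c(1, 4, 5, 8)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)
```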

Output:

```
Observations: 600
Variables: 8
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No...
$ Income          <int> 3000, 3000, 3000, 3000, 8990, NA, NA, NA, NA, NA, N...
$ Loan_amount     <dbl> 6000, NA, NA, NA, 8091, NA, NA, NA, NA, NA, NA, NA,...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, NA, NA, S...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, N...
$ Age             <int> 27, 29, 27, 33, 29, NA, 29, 27, 33, 29, NA, 29, 27,...
$ Investment      <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 121...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, ...
```
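The conversion above selects columns by position. As a minimal sketch of an alternative, the base R snippet below selects columns by type instead, so it keeps working if the column order changes. The `toy` data frame and its values are made up for illustration and are not part of the loan dataset.

```r
# Toy data frame standing in for `dat` (values are invented for illustration)
toy <- data.frame(
  Is_graduate = c("No", "Yes", "Yes"),
  Income      = c(3000L, 8990L, NA),
  Purpose     = c("Education", "Travel", NA),
  stringsAsFactors = FALSE
)

# Find the character columns by type rather than by position
char_cols <- vapply(toy, is.character, logical(1))

# Convert only those columns to factors; NA values stay NA
toy[char_cols] <- lapply(toy[char_cols], factor)

str(toy)
```

Numeric columns such as `Income` are left untouched, and missing entries remain `NA` after the conversion.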

Missing Data Pattern Analysis

The summary() function provides a quick overview of the variables and missing values, if any.

```r
summary(dat)
```

Output:

```
Is_graduate     Income        Loan_amount                Credit_score
 No :130     Min.   :  3000   Min.   :  6000   Not _satisfactory:123
 Yes:470     1st Qu.: 39045   1st Qu.:115665   Satisfactory     :458
             Median : 50995   Median :135990   NA's             : 19
             Mean   : 65901   Mean   :149313
             3rd Qu.: 76170   3rd Qu.:170740
             Max.   :277770   Max.   :466660
             NA's   :20       NA's   :17
 approval_status      Age          Investment          Purpose
 No :190         Min.   :22.00   Min.   :  2100   Education: 94
 Yes:410         1st Qu.:35.00   1st Qu.: 16678   Home     :132
                 Median :50.00   Median : 26439   Others   : 64
                 Mean   :48.82   Mean   : 34442   Personal :174
                 3rd Qu.:61.00   3rd Qu.: 35000   Travel   :118
                 Max.   :76.00   Max.   :190422   NA's     : 18
                 NA's   :19
```

The output above shows that some of the variables have missing values, represented by NA's. To understand the pattern of missing values better, you can use the md.pattern() function.

```r
md.pattern(dat)
```

Output:

```
     Is_graduate approval_status Investment Loan_amount Purpose
 559           1               1          1           1       1
 4             1               1          1           1       1
 4             1               1          1           1       1
 3             1               1          1           1       1
 10            1               1          1           1       0
 3             1               1          1           1       0
 2             1               1          1           0       1
 2             1               1          1           0       1
 1             1               1          1           0       1
 1             1               1          1           0       1
 6             1               1          1           0       1
 4             1               1          1           0       0
 1             1               1          1           0       0
               0               0          0          17      18

     Credit_score Age Income
 559            1   1      1  0
 4              1   0      1  1
 4              0   1      1  1
 3              0   1      0  2
 10             1   0      1  2
 3              1   0      0  3
 2              1   1      1  1
 2              1   1      0  2
 1              1   0      0  3
 1              0   1      1  2
 6              0   1      0  3
 4              0   1      0  4
 1              0   0      0  5
               19  19     20 93
```

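Each row of the output is one missing data pattern, where 1 means the variable is observed and 0 means it is missing. The first column gives the number of records with that pattern, and the last column gives the number of missing variables in it. The topmost row indicates that there are 559 records with no missing values. The largest incomplete pattern covers 10 records, which are missing both Purpose and Age. The bottom row gives the number of missing values per variable: Income has the most, with 20, and 93 values are missing in total.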
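To make the reading of this table concrete, the sketch below rebuilds the same three pieces of information (per-variable counts, number of complete records, and pattern counts) with base R. It is not how md.pattern() is implemented; the `toy` data frame is invented for illustration.

```r
# Toy data with a few missing values (invented for illustration)
toy <- data.frame(
  Age     = c(27, NA, 33, 29, NA),
  Income  = c(3000, 8990, NA, 5000, 4000),
  Purpose = c("Travel", NA, "Home", "Others", NA),
  stringsAsFactors = FALSE
)

# Number of missing values per variable (the bottom row of md.pattern())
missing_per_column <- colSums(is.na(toy))

# Number of fully observed records (the all-ones row of md.pattern())
n_complete <- sum(complete.cases(toy))

# One string per record: 1 = observed, 0 = missing, in column order
pattern <- apply(!is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- table(pattern)

print(missing_per_column)
print(n_complete)
print(pattern_counts)
```

Here the pattern `010` (only Income observed) and the pattern `111` (everything observed) each occur twice, which mirrors how the counts in the first column of the md.pattern() output are formed.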

The missing value pattern can also be analyzed with the code below.

```r
plot1 <- aggr(dat, col = c("blue", "red"), numbers = TRUE, sortVars = TRUE,
              labels = names(dat), cex.axis = 0.7, gap = 3,
              ylab = c("Histogram of missing data", "Pattern"))
```

Output:

```
Variables sorted by number of missings:
        Variable      Count
          Income 0.03333333
    Credit_score 0.03166667
             Age 0.03166667
         Purpose 0.03000000
     Loan_amount 0.02833333
     Is_graduate 0.00000000
 approval_status 0.00000000
      Investment 0.00000000
```

The output above prints the proportion of missing values in each variable; Income has the most, at about 3.3%. Overall, about 93% of the records (559 of 600) have no missing values, which can be seen from the right-hand side plot below.

[Plot: histogram of missing data by variable (left) and missing data pattern (right)]
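The proportions reported by aggr() follow directly from the counts in the md.pattern() output. As a quick check, the sketch below redoes that arithmetic in base R using only the numbers reported above; the commented lines at the end show how the same quantities could be computed on any data frame, where `df` is a placeholder name.

```r
# Missing-value counts per variable and number of records, as reported above
n_records <- 600
missing_counts <- c(Income = 20, Credit_score = 19, Age = 19,
                    Purpose = 18, Loan_amount = 17)

# Proportion missing per variable
prop_missing <- missing_counts / n_records
print(round(prop_missing, 4))

# Share of records with no missing value at all (559 complete records)
prop_complete <- 559 / n_records
print(round(prop_complete, 3))

# On a data frame `df` (placeholder name), the same quantities would be:
#   colMeans(is.na(df))        # per-variable proportion missing
#   mean(complete.cases(df))   # share of complete records
```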

Only a small share of the values is missing, so you could simply remove the affected observations. The objective of this guide, however, is to use the mice library to impute them.

MICE Imputation

The mice() function is used to impute missing values. Some of the important arguments used in the code are explained below.

  1. data: A data frame or a matrix containing the incomplete data. Missing values are coded as NA.
  2. m: Number of multiple imputations. The default value is five.
  3. method: Specifies the imputation method to be used for each column in data. In this case, you are using predictive mean matching (PMM) as the imputation method.
  4. maxit: A scalar giving the number of iterations. The default value is five; the code below uses 50.
  5. seed: An integer used to initialize the random number generator, so that the imputations are reproducible.

The above arguments are passed to the imputation function.

```r
imputed_data <- mice(dat, m = 5, maxit = 50, method = "pmm", seed = 500)
summary(imputed_data)
```

Output:

```
Class: mids
Number of multiple imputations:  5
Imputation methods:
    Is_graduate          Income     Loan_amount    Credit_score approval_status
             ""           "pmm"           "pmm"           "pmm"              ""
            Age      Investment         Purpose
          "pmm"              ""           "pmm"
PredictorMatrix:
                Is_graduate Income Loan_amount Credit_score approval_status Age
Is_graduate               0      1           1            1               1   1
Income                    1      0           1            1               1   1
Loan_amount               1      1           0            1               1   1
Credit_score              1      1           1            0               1   1
approval_status           1      1           1            1               0   1
Age                       1      1           1            1               1   0
                Investment Purpose
Is_graduate              1       1
Income                   1       1
Loan_amount              1       1
Credit_score             1       1
approval_status          1       1
Age                      1       1
```
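The summary shows "pmm" for every incomplete column, including the factors Credit_score and Purpose, because a single method string is applied to all of them. As a hedged sketch of the alternatives, the snippet below uses the `nhanes2` example data that ships with mice (not the loan data) to show the per-type defaults that mice() chooses when `method` is left unset, and how a per-column method vector can be passed. The object names `imp_default`, `imp_custom`, and `my_methods` are illustrative.

```r
library(mice)

# nhanes2 ships with mice: age (factor), bmi (numeric), hyp (factor), chl (numeric)
# With `method` left unset, mice() picks a default method per column type
imp_default <- mice(nhanes2, m = 5, maxit = 5, seed = 500, printFlag = FALSE)
print(imp_default$method)

# An explicit method per column, in column order (age, bmi, hyp, chl);
# "" means the column is not imputed
my_methods <- c("", "pmm", "logreg", "pmm")
imp_custom <- mice(nhanes2, m = 5, maxit = 5, method = my_methods,
                   seed = 500, printFlag = FALSE)
print(imp_custom$method)
```

With the defaults, numeric columns get "pmm" and the two-level factor `hyp` gets "logreg"; complete columns such as `age` get an empty method. Whether type-specific methods improve on PMM for the loan data is not something this guide evaluates.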

If you want to look at a specific variable's imputed data—for instance, the variable Purpose—you can do that with the code below.

```r
imputed_data$imp$Purpose
```

Output:

```
             1         2        3        4         5
 9      Travel Education   Travel Personal    Travel
 10  Education      Home     Home     Home      Home
 11       Home  Personal     Home     Home    Travel
 12       Home      Home   Travel     Home      Home
 13  Education Education   Travel   Travel      Home
 588    Travel    Others   Travel   Travel  Personal
 589    Travel    Travel Personal Personal  Personal
 590    Travel    Travel   Travel   Travel  Personal
 591    Travel  Personal   Travel   Travel    Others
 592    Travel  Personal   Travel   Travel Education
 593      Home Education Personal   Travel Education
 594      Home      Home     Home     Home      Home
 595  Personal Education   Travel   Travel Education
 596    Travel    Travel   Travel   Travel      Home
 597  Personal    Travel   Travel   Travel      Home
 598      Home Education   Travel   Travel Education
 599    Others  Personal Personal   Travel  Personal
 600    Others    Travel   Travel   Travel    Travel
```

The above output shows that for the 18 missing values in the Purpose variable, there are five sets of imputations available.
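The structure of this table can be checked on a small example. The sketch below uses the `nhanes` data bundled with mice rather than the loan data, and confirms that the imputation table for a variable has one row per originally missing value and one column per imputation. The names `imp`, `bmi_imputations`, and `missing_rows` are illustrative.

```r
library(mice)

# nhanes is a small example dataset bundled with mice
imp <- mice(nhanes, m = 5, maxit = 5, method = "pmm", seed = 500, printFlag = FALSE)

# One row per originally missing bmi value, one column per imputation
bmi_imputations <- imp$imp$bmi
print(dim(bmi_imputations))
print(head(bmi_imputations))

# The row names identify the records in which bmi was missing
missing_rows <- which(is.na(nhanes$bmi))
print(missing_rows)
```

Differences between the columns of such a table reflect the uncertainty about the missing values, which is the reason mice produces several imputations rather than one.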

The next step is to complete the missing value imputation on the entire data with the code below. The missing values will be replaced with the values in the first of the five imputed datasets, indicated by the value of one in the second argument.

```r
completeddata1 <- complete(imputed_data, 1)
summary(completeddata1)
```

Output:

```
Is_graduate     Income        Loan_amount                Credit_score
 No :130     Min.   :  3000   Min.   :  6000   Not _satisfactory:129
 Yes:470     1st Qu.: 38498   1st Qu.:112973   Satisfactory     :471
             Median : 50835   Median :134385
             Mean   : 65819   Mean   :146552
             3rd Qu.: 76040   3rd Qu.:168715
             Max.   :277770   Max.   :466660
 approval_status      Age          Investment          Purpose
 No :190         Min.   :22.00   Min.   :  2100   Education: 96
 Yes:410         1st Qu.:35.00   1st Qu.: 16678   Home     :137
                 Median :50.00   Median : 26439   Others   : 66
                 Mean   :49.18   Mean   : 34442   Personal :176
                 3rd Qu.:61.25   3rd Qu.: 35000   Travel   :125
                 Max.   :76.00   Max.   :190422
```

The summary of the new data shows the absence of any missing values, indicating that the missing value imputation is complete. You can go ahead and use the new data for model building to check model performance on the imputed data.
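The second argument of complete() selects which of the imputed datasets to return. As a small sketch on the bundled `nhanes` data (not the loan data), the code below extracts the first imputed dataset, checks that the observed values are unchanged, and stacks all five datasets in long format. The object names are illustrative.

```r
library(mice)

imp <- mice(nhanes, m = 5, maxit = 5, method = "pmm", seed = 500, printFlag = FALSE)

# First imputed dataset: same shape as the input, with no NA left
first <- complete(imp, 1)
print(anyNA(first))

# Observed values are kept as they were; only the missing cells are filled in
observed <- !is.na(nhanes$bmi)
print(all(first$bmi[observed] == nhanes$bmi[observed]))

# All five imputed datasets stacked, with .imp and .id identifying each row
long <- complete(imp, action = "long")
print(dim(long))
```

The long format is convenient for comparing the imputations side by side before settling on a single completed dataset, as this guide does.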

Model Building with Imputed Data

The lines of code below create a data partition, train a random forest model on the training data set, and evaluate the model on the test data set.

```r
# Create data partition (70% training, 30% test)
set.seed(100)
trainRowNumbers <- createDataPartition(completeddata1$approval_status, p = 0.7, list = FALSE)
train <- completeddata1[trainRowNumbers, ]
test <- completeddata1[-trainRowNumbers, ]

# Build random forest model with 5-fold cross-validation, repeated 5 times
# sampling = "rose" requires the ROSE package to be installed
control1 <- trainControl(sampling = "rose", method = "repeatedcv", number = 5, repeats = 5)
rf_model <- train(approval_status ~ ., data = train, method = "rf", metric = "Accuracy", trControl = control1)

# Model evaluation on the test set
predictTest <- predict(rf_model, newdata = test, type = "raw")
table(test$approval_status, predictTest)
```

Output:

```
     predictTest
       No Yes
  No   45  12
  Yes  11 112
```

The accuracy can be calculated from the above confusion matrix with the code below.

```r
(112 + 45) / nrow(test)
```

Output:

```
[1] 0.8722222
```

The output shows that the accuracy on the test data is about 87%. For comparison, always predicting the majority class ("Yes", 123 of the 180 test records) would give about 68%, so the model performs well.
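The same calculation can be written so that it does not depend on typing the cell values by hand. The sketch below rebuilds the confusion matrix reported above in base R and derives accuracy along with two per-class rates. Treating "Yes" as the positive class is an assumption made here for illustration, and the names `cm`, `sensitivity`, and `specificity` are not from the original code.

```r
# Confusion matrix from the output above: rows = actual, columns = predicted
cm <- matrix(c(45, 12,
               11, 112),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

# Accuracy: correctly classified records (the diagonal) over all records
accuracy <- sum(diag(cm)) / sum(cm)

# Assuming "Yes" is the positive class:
sensitivity <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # actual "Yes" predicted as "Yes"
specificity <- cm["No", "No"] / sum(cm["No", ])     # actual "No" predicted as "No"

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

The per-class rates show that approved applications are recognized more reliably (about 91%) than rejected ones (about 79%), a detail the overall accuracy hides.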

Conclusion

In this guide, you learned about the mice library, which is one of the advanced packages in R for missing value imputation. You learned how to identify and visualize the patterns of missing values in data, and how to impute them with the mice library. This will help you in data preprocessing and preparation for machine learning.

To learn more about data science and machine learning with R, please refer to the following guides: