
Preparing Data for Modeling with R

Nov 21, 2019 • 17 Minute Read

Introduction

Data preparation is often said to take up to eighty percent of a data scientist's time on a project, which underlines its importance in the machine learning life cycle. In this guide, you will learn the basics and implementation of several data preparation techniques in R:

  1. Dealing with Incorrect Entries

  2. Missing Value Treatment

  3. Encoding Categorical Labels

  4. Handling Outliers

  5. Logarithmic Transformation

  6. Standardization

  7. Converting Column Types

Data

In this guide, we will use a fictitious dataset of loan applicants that contains 600 observations and 10 variables, as described below:

  1. Marital_status: Whether the applicant is married ("1") or not ("0")

  2. Dependents: Number of dependents claimed by the applicant

  3. Is_graduate: Whether the applicant is a graduate ("1") or not ("0")

  4. Income: Annual income of the applicant (in hundreds of dollars)

  5. Loan_amount: Loan amount (in hundreds of dollars) for which the application was submitted

  6. Term_months: Tenure of the loan (in months)

  7. Credit_score: Whether the applicant's credit score was good ("1") or not ("0")

  8. Age: The applicant’s age in years

  9. Sex: Whether the applicant is female ("F") or male ("M")

  10. Approval_status: Whether the loan application was approved ("1") or not ("0")

Let's start by loading the required libraries and the data.

# Load the required libraries
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)

# Read the data and inspect its structure
dat <- read_csv("data_prep.csv")
glimpse(dat)

Output:

Observations: 600
Variables: 10
$ Marital_status  <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents      <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate     <int> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income          <int> 5635, 2708, 3333, 5324, 12677, 50292, 9523, 3685, 3107...
$ Loan_amount     <int> 122, 126, 107, 230, 208, 169, 153, 122, 111, 107, 98, ...
$ Term_months     <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age             <int> 0, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23,...
$ Sex             <chr> "M", "F", "M", "M", "M", "M", "F", "F", "F", "M", "F",...
    

The output shows that the dataset has nine numeric variables (labeled int) and one character variable (labeled chr). We will fix the data types in the Converting Column Types section later in this guide. For now, let's look at the summary of the data.

summary(dat)
    

Output:

 Marital_status     Dependents      Is_graduate          Income      
 Min.   :0.0000   Min.   :0.0000   Min.   :  0.000   Min.   :   200  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  1.000   1st Qu.:  3832  
 Median :1.0000   Median :0.0000   Median :  1.000   Median :  5075  
 Mean   :0.6517   Mean   :0.7308   Mean   :  2.449   Mean   :  7211  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:  1.000   3rd Qu.:  7642  
 Max.   :1.0000   Max.   :3.0000   Max.   :999.000   Max.   :108000  
                  NA's   :2        NA's   :1                         
  Loan_amount     Term_months     Credit_score    approval_status 
 Min.   : 10.0   Min.   : 36.0   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:111.0   1st Qu.:384.0   1st Qu.:1.0000   1st Qu.:0.0000  
 Median :140.0   Median :384.0   Median :1.0000   Median :1.0000  
 Mean   :161.6   Mean   :367.1   Mean   :0.7883   Mean   :0.6867  
 3rd Qu.:180.5   3rd Qu.:384.0   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :778.0   Max.   :504.0   Max.   :1.0000   Max.   :1.0000  
                                                                  
      Age             Sex           
 Min.   :  0.00   Length:600        
 1st Qu.: 36.00   Class :character  
 Median : 51.00   Mode  :character  
 Mean   : 51.77                     
 3rd Qu.: 64.00                     
 Max.   :200.00
    

Dealing with Incorrect Entries

The above output shows that the variable Age has a minimum value of 0 and a maximum of 200, and that the variable Is_graduate has a maximum value of 999 instead of the binary values 0 and 1. These entries are incorrect and need correction. One approach would be to delete these records, but instead we will treat them as missing values and replace them with a measure of central tendency, such as the mean, median, or mode.

Starting with the Age variable, the code below converts the incorrect values 0 and 200 to missing values (NA). We repeat the same process for the 999 entries in the variable Is_graduate, and then print summaries of the two variables.

# Tag the impossible Age entries (0 and 200) as missing
dat$Age[dat$Age == 0] <- NA
dat$Age[dat$Age == 200] <- NA

# Tag the out-of-range Is_graduate entries (999) as missing
dat$Is_graduate[dat$Is_graduate == 999] <- NA

summary(dat$Age)
summary(dat$Is_graduate)

Output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  22.00   36.00   50.50   50.61   64.00   80.00       6 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  1.0000  1.0000  0.7826  1.0000  1.0000       2
    

The variables Age and Is_graduate now have 594 and 598 valid records, respectively. The incorrect entries have been tagged as missing, which we'll handle in the next section.

Missing Value Treatment

Missing value imputation is one of the most common data preparation steps. It's easy to count the number of missing values in each variable using the sapply() function.

sapply(dat, function(x) sum(is.na(x)))
    

Output:

 Marital_status      Dependents     Is_graduate          Income     Loan_amount 
              0               2               2               0               0 
    Term_months    Credit_score approval_status             Age             Sex 
              0               0               0               6               5
    

There are several techniques for handling missing values, and the most widely used is replacing them with a measure of central tendency. The code below replaces the missing values of the Age variable with the median of the remaining values, and the missing values of the Is_graduate variable with '1', the most frequent label, indicating that the applicant is a graduate. It then prints the summary statistics of the two variables.

# Impute missing Age values with the median of the observed values
dat$Age[is.na(dat$Age)] <- median(dat$Age, na.rm = TRUE)

# Impute missing Is_graduate values with the most frequent label, '1'
dat$Is_graduate[is.na(dat$Is_graduate)] <- 1

summary(dat$Age)
table(dat$Is_graduate)

Output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.00   36.00   50.50   50.61   64.00   80.00  


  0   1 
130 470
    

The missing value treatment is now complete for both variables. The data also has a variable, Sex, with five missing values. Since this is a categorical variable, we will check the distribution of its labels with the line of code below.

table(dat$Sex)
    

Output:

  F   M 
111 484
    

The output shows that 484 of the 595 non-missing values are male, so we will replace the missing values with the label M, the mode. The code below performs this task and then prints a summary of the variable. The output shows 600 records for the Sex variable and no NAs, indicating that the missing values have been imputed.

# Impute missing Sex values with the most frequent label, "M"
dat$Sex[is.na(dat$Sex)] <- "M"
summary(dat$Sex)

Output:

   Length     Class      Mode 
      600 character character
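
Hard-coding the label works here, but the mode can also be computed programmatically. A minimal sketch (not part of the original walkthrough) would be:

# Find the most frequent label and use it for imputation
mode_label <- names(which.max(table(dat$Sex)))
dat$Sex[is.na(dat$Sex)] <- mode_label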
    

We will now check if there are any more missing values with the line of code below. The output shows that we still have two missing values in the variable Dependents.

sapply(dat, function(x) sum(is.na(x)))
    

Output:

 Marital_status      Dependents     Is_graduate          Income     Loan_amount 
              0               2               0               0               0 
    Term_months    Credit_score approval_status             Age             Sex 
              0               0               0               0               0
    

Two missing values remain in the Dependents variable, and we'll use another approach to treat them: dropping the records. The code below uses the complete.cases() function to keep only the rows without missing values, re-checks the missing-value counts, and prints the resulting dimensions of the data.

# Keep only the rows that have no missing values
dat <- dat[complete.cases(dat), ]
sapply(dat, function(x) sum(is.na(x)))
dim(dat)

Output:

 Marital_status      Dependents     Is_graduate          Income     Loan_amount 
              0               0               0               0               0 
    Term_months    Credit_score approval_status             Age             Sex 
              0               0               0               0               0 


[1] 598  10
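
An equivalent one-liner, assuming the tidyr package is available, is its drop_na() helper:

# Same result with tidyr (assumes tidyr is installed)
library(tidyr)
dat <- drop_na(dat)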
    

Encoding Categorical Labels

Certain machine learning algorithms, such as xgboost, require all variables to be numeric. In this data, the variable Sex has the labels M and F, which we need to convert to a numeric format. Since there are only two labels, we can use binary encoding, which is done in the first line of code below. The output from the second line shows that the encoding was successful.

# Encode Sex as 1 for male, 0 for female
dat$Sex <- ifelse(dat$Sex == "M", 1, 0)
table(dat$Sex)

Output:

  0   1 
111 487
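
Binary encoding only works for two labels. For a categorical variable with more levels, one common option is one-hot encoding; below is a minimal sketch using base R's model.matrix() on a hypothetical three-level column (this dataset has no such variable):

# One-hot encode a hypothetical three-level factor
city <- factor(c("Delhi", "Mumbai", "Pune", "Delhi"))
# Removing the intercept (- 1) keeps one indicator column per level
model.matrix(~ city - 1)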
    

Handling Outliers

One of the biggest obstacles in predictive modeling can be the presence of outliers, which are extreme values that are different from the other data points. Outliers are often a problem because they mislead the training process and lead to inaccurate models.

For numeric variables, a quick numerical check for outliers is the skewness value, since extreme values on one side of the distribution pull the skewness away from zero. The lines of code below compute and print the skewness value of the Income variable.

# The e1071 package provides the skewness() function
library(e1071)
skewness(dat$Income)

Output:

[1] 6.455884
    

The output above shows that the variable Income has a heavily right-skewed distribution, with a skewness value of about 6.5; ideally, this value should be between -1 and 1. Apart from Income, other variables, such as Loan_amount and Age, also differ in scale and require rescaling. The next couple of sections cover techniques to deal with these problems.
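
Skewness flags a distorted distribution but does not point at individual observations. A common complement, not part of the original walkthrough, is the 1.5 * IQR rule, sketched below for Income:

# Count Income values outside the 1.5 * IQR fences
q <- quantile(dat$Income, probs = c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
sum(dat$Income < q[1] - fence | dat$Income > q[2] + fence)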

Logarithmic Transformation of Numerical Variables

One technique to make a skewed distribution more symmetric is the logarithmic transformation. The code below creates a new variable, LogIncome, and then computes and prints its skewness value.

# Create a log-transformed copy of Income and check its skewness
dat$LogIncome <- log(dat$Income)
skewness(dat$LogIncome)

Output:

[1] 0.4860717
    

Taking the log of the Income variable brings the skewness value down to roughly 0.49, well within the acceptable range, so the distribution is now roughly normal. We could apply the same transformation to other numerical variables, but instead we'll learn another transformation technique called standardization.
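
To see the effect visually, you can plot the two distributions side by side. Here is a minimal sketch, assuming the gridExtra package is available for arranging the two ggplot2 plots (ggplot2 was loaded earlier):

# Compare the raw and log-transformed Income distributions
library(gridExtra)  # assumed available; used only to arrange the plots

p1 <- ggplot(dat, aes(x = Income)) + geom_histogram(bins = 30) + ggtitle("Income")
p2 <- ggplot(dat, aes(x = LogIncome)) + geom_histogram(bins = 30) + ggtitle("Log of Income")
grid.arrange(p1, p2, ncol = 2)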

Standardization

Several machine learning algorithms use some form of distance metric to learn from the data. When features are on very different scales, such as Age in years and Income in hundreds of dollars, the features on larger scales can unduly influence the model. We therefore want the features on a similar scale, which can be achieved through scaling techniques.

One such technique is standardization, which rescales each feature so that it is centered at zero with unit variance. The code below uses the mutate_at() function from the dplyr package to perform the scaling and then prints the standard deviation of each scaled variable, confirming the unit variance.

# Standardize the three continuous variables to zero mean and unit variance;
# as.numeric() drops the matrix attributes that scale() returns
dat_2 <- dat %>% mutate_at(vars(Income, Loan_amount, Age), ~ as.numeric(scale(.)))

sd(dat_2$Income)
sd(dat_2$Loan_amount)
sd(dat_2$Age)

Output:

[1] 1

[1] 1

[1] 1
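
Standardization is not the only scaling option; min-max scaling, which squeezes each feature into the [0, 1] range, is a common alternative. A minimal sketch, not part of the original walkthrough:

# Min-max scale the same variables to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
dat_mm <- dat %>% mutate_at(vars(Income, Loan_amount, Age), min_max)
summary(dat_mm$Income)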
    

Let's now look at the variables after all the steps we've implemented.

glimpse(dat_2)
    

Output:

Observations: 598
Variables: 11
$ Marital_status  <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents      <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate     <dbl> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income          <dbl> -0.192200887, -0.547541541, -0.471665931, -0.229956590...
$ Loan_amount     <dbl> -0.42074229, -0.37791245, -0.58135418, 0.73566336, 0.5...
$ Term_months     <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age             <dbl> -0.003567868, -1.766094863, -1.766094863, -1.766094863...
$ Sex             <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, ...
$ LogIncome       <dbl> 8.636752, 7.903966, 8.111628, 8.579980, 9.447545, 10.8...
    

Converting Column Types

The output above indicates that several variables still have an incorrect data type: Marital_status, Is_graduate, Credit_score, approval_status, and Sex. These are categorical variables currently stored as numeric, and they need to be converted to the factor type for modeling purposes. The lines of code below perform this task.

# Column positions of the categorical variables:
# 1 Marital_status, 3 Is_graduate, 7 Credit_score, 8 approval_status, 10 Sex
names <- c(1, 3, 7, 8, 10)
dat_2[, names] <- lapply(dat_2[, names], factor)
glimpse(dat_2)

Output:

Observations: 598
Variables: 11
$ Marital_status  <fct> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents      <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate     <fct> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income          <dbl> -0.192200887, -0.547541541, -0.471665931, -0.229956590...
$ Loan_amount     <dbl> -0.42074229, -0.37791245, -0.58135418, 0.73566336, 0.5...
$ Term_months     <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age             <dbl> -0.003567868, -1.766094863, -1.766094863, -1.766094863...
$ Sex             <fct> 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, ...
$ LogIncome       <dbl> 8.636752, 7.903966, 8.111628, 8.579980, 9.447545, 10.8...
    

Conclusion