Deepika Singh

# Preparing Data for Modeling with R

• Nov 21, 2019
• 858 Views
Tags: Data, R

## Introduction

Data preparation is often said to consume up to eighty percent of a data scientist's time on a project, which underlines its importance in the machine learning life cycle. In this guide, you will learn the basics and implementation of several data preparation techniques in R:

1. Dealing with Incorrect Entries

2. Missing Value Treatment

3. Encoding Categorical Labels

4. Handling Outliers

5. Logarithmic Transformation

6. Standardization

7. Converting Column Types

## Data

In this guide, we will use a fictitious dataset of loan applicants that contains 600 observations and 10 variables, as described below:

1. `Marital_status`: Whether the applicant is married ("1") or not ("0")

2. `Dependents`: Number of dependents claimed by the applicant

3. `Is_graduate`: Whether the applicant is a graduate ("1") or not ("0")

4. `Income`: Annual Income of the applicant (in hundreds of dollars)

5. `Loan_amount`: Loan amount (in hundreds of dollars) for which the application was submitted

6. `Term_months`: Tenure of the loan (in months)

7. `Credit_score`: Whether the applicant's credit score was good ("1") or not ("0")

8. `Age`: The applicant’s age in years

9. `Sex`: Whether the applicant is female ("F") or male ("M")

10. `Approval_status`: Whether the loan application was approved ("1") or not ("0")

```{r}
library(plyr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)

# 'dat' holds the loan data, assumed to be loaded already (e.g., via read.csv())
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents      <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate     <int> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income          <int> 5635, 2708, 3333, 5324, 12677, 50292, 9523, 3685, 3107...
$ Loan_amount     <int> 122, 126, 107, 230, 208, 169, 153, 122, 111, 107, 98, ...
$ Term_months     <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age             <int> 0, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23,...
$ Sex             <chr> "M", "F", "M", "M", "M", "M", "F", "F", "F", "M", "F",...
```

The output shows that the dataset has nine numerical variables (labeled as `int`) and one character variable (labeled as `chr`). We will analyze the data types in the subsequent section. For now, let's look at the summary of the data.

```{r}
summary(dat)
```

Output:

```
 Marital_status     Dependents      Is_graduate         Income
 Min.   :0.0000   Min.   :0.0000   Min.   :  0.000   Min.   :   200
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  1.000   1st Qu.:  3832
 Median :1.0000   Median :0.0000   Median :  1.000   Median :  5075
 Mean   :0.6517   Mean   :0.7308   Mean   :  2.449   Mean   :  7211
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:  1.000   3rd Qu.:  7642
 Max.   :1.0000   Max.   :3.0000   Max.   :999.000   Max.   :108000
                  NA's   :2        NA's   :1

  Loan_amount     Term_months     Credit_score    approval_status
 Min.   : 10.0   Min.   : 36.0   Min.   :0.0000   Min.   :0.0000
 1st Qu.:111.0   1st Qu.:384.0   1st Qu.:1.0000   1st Qu.:0.0000
 Median :140.0   Median :384.0   Median :1.0000   Median :1.0000
 Mean   :161.6   Mean   :367.1   Mean   :0.7883   Mean   :0.6867
 3rd Qu.:180.5   3rd Qu.:384.0   3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :778.0   Max.   :504.0   Max.   :1.0000   Max.   :1.0000

      Age             Sex
 Min.   :  0.00   Length:600
 1st Qu.: 36.00   Class :character
 Median : 51.00   Mode  :character
 Mean   : 51.77
 3rd Qu.: 64.00
 Max.   :200.00
```

## Dealing with Incorrect Entries

The above output shows that the variable `Age` has minimum and maximum values of `0` and `200`, respectively. Also, the variable `Is_graduate` has a maximum value of `999` instead of the binary values `0` and `1`. These entries are incorrect and need correction. One approach would be to delete these records, but instead we will treat them as missing values and replace them with a measure of central tendency: the mean, median, or mode.
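As a minimal sketch of this recode-to-missing idea, applied to a hypothetical toy vector rather than the loan data itself:

```{r}
age <- c(0, 25, 43, 200, 61)     # toy ages; 0 and 200 are impossible values
age[age %in% c(0, 200)] <- NA    # flag them as missing instead of deleting rows
age
```

This keeps the rest of each record intact, so the other variables in those rows can still be used.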

Starting with the `Age` variable, the first two lines of code below convert the incorrect values `0` and `200` to missing records. The third line does the same for the incorrect value `999` in `Is_graduate`. The fourth and fifth lines print summaries of the two variables.

```{r}
dat[, 9][dat[, 9] == 0] <- NA
dat[, 9][dat[, 9] == 200] <- NA
dat[, 3][dat[, 3] == 999] <- NA
summary(dat$Age)
summary(dat$Is_graduate)
```

Output:

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  22.00   36.00   50.50   50.61   64.00   80.00       6

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0.0000  1.0000  1.0000  0.7826  1.0000  1.0000       2
```

Now, the variables `Age` and `Is_graduate` have 594 and 598 valid records, respectively. The incorrect entries have been tagged as missing, which we'll handle in the next section.

## Missing Value Treatment

Missing value imputation is one of the most common data preparation steps. It's easy to count the number of missing values in each variable using the `sapply()` function.

```{r}
sapply(dat, function(x) sum(is.na(x)))
```

Output:

```
 Marital_status      Dependents     Is_graduate          Income     Loan_amount
              0               2               2               0               0
    Term_months    Credit_score approval_status             Age             Sex
              0               0               0               6               5
```

There are several techniques for handling missing values; the most widely used is replacing them with a measure of central tendency. The first line of code below replaces the missing values of the `Age` variable with the median of the remaining values. The second line replaces the missing values of the `Is_graduate` variable with the value `1`, which indicates that the applicant is a graduate. The last two lines print the summary statistics of `Age` and the label counts of `Is_graduate`.

```{r}
dat$Age[is.na(dat$Age)] <- median(dat$Age, na.rm = TRUE)
dat$Is_graduate[is.na(dat$Is_graduate)] <- 1

summary(dat$Age)
table(dat$Is_graduate)
```

Output:

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.00   36.00   50.50   50.61   64.00   80.00

  0   1
130 470
```

The missing value treatment is complete for both the variables. The data also has a variable, `Sex`, with five missing values. Since this is a categorical variable, we will check the distribution of labels, which is done in the line of code below.

```{r}
table(dat$Sex)
```

Output:

```
  F   M
111 484
```

The output shows that 484 of the 595 non-missing entries are male, so we will replace the missing values with the label `M`. The first line of code below performs this task, while the second line prints a summary of the variable. The output shows 600 records for the `Sex` variable, indicating that the missing values have been imputed.

```{r}
dat$Sex[is.na(dat$Sex)] <- "M"
summary(dat$Sex)
```

Output:

```
   Length     Class      Mode
      600 character character
```

We will now check if there are any more missing values with the line of code below. The output shows that we still have two missing values in the variable `Dependents`.

```{r}
sapply(dat, function(x) sum(is.na(x)))
```

Output:

```
 Marital_status      Dependents     Is_graduate          Income     Loan_amount
              0               2               0               0               0
    Term_months    Credit_score approval_status             Age             Sex
              0               0               0               0               0
```

There are two missing values left in the data set, and we'll use another approach to treating missing values: dropping the affected records. The first line of code below uses the `complete.cases()` function to drop rows with any missing values, the second line re-checks the missing-value counts, and the third line prints the resulting dimensions of the data.

```{r}
dat <- dat[complete.cases(dat), ]
sapply(dat, function(x) sum(is.na(x)))
dim(dat)
```

Output:

```
 Marital_status      Dependents     Is_graduate          Income     Loan_amount
              0               0               0               0               0
    Term_months    Credit_score approval_status             Age             Sex
              0               0               0               0               0

[1] 598  10
```

## Encoding Categorical Labels

Certain machine learning algorithms like `xgboost` require all variables to be numeric. In this data, the variable `Sex` has labels (`M` and `F`), so we will have to change them to a numeric format. Since there are two labels, we can do binary encoding, which is done in the first line of code below. The output from the second line shows that we have successfully performed the encoding.

```{r}
dat$Sex <- ifelse(dat$Sex == "M", 1, 0)
table(dat$Sex)
```

Output:

```
  0   1
111 487
```

## Handling Outliers

One of the biggest obstacles in predictive modeling can be the presence of outliers, which are extreme values that are different from the other data points. Outliers are often a problem because they mislead the training process and lead to inaccurate models.

For numeric variables, we can identify outliers numerically through the skewness value. The lines of code below print the skewness value for the `Income` variable.

```{r}
library(e1071)
skewness(dat$Income)
```

Output:

```
[1] 6.455884
```

The output above shows that the variable `Income` has a right-skewed distribution with a skewness value of 6.5. Ideally, this value should be between -1 and 1. Apart from the income variable, we also have other variables, such as `Loan_amount` and `Age`, that have differences in scale and require normalization. We'll learn techniques in the next couple of sections to deal with this problem.

## Logarithmic Transformation of Numerical Variables

One of the techniques to make a skewed distribution normal is logarithmic transformation. The first line of code below creates a new variable, `LogIncome`, while the second line computes and prints the skewness value of this new variable.

```{r}
dat$LogIncome <- log(dat$Income)
skewness(dat$LogIncome)
```

Output:

```
[1] 0.4860717
```

The output shows that taking the log of the `Income` variable makes the distribution roughly normal, bringing the skewness value down to about 0.49, well within the ideal range. We could apply the same transformation to other numerical variables, but instead we'll learn another transformation technique called standardization.
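The effect is easy to reproduce on a toy right-skewed vector. The helper below is a simple moment-based skewness, which differs slightly from `e1071`'s default estimator but moves in the same direction:

```{r}
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3  # rough moment-based skewness

x <- exp(c(0, 0.5, 1, 1.5, 2, 2.5, 3))  # exponentiating evenly spaced values gives a right skew
skew(x)       # positive: the long right tail pulls the third moment up
skew(log(x))  # zero: log(x) is perfectly symmetric by construction
```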

## Standardization

Several machine learning algorithms use some form of distance matrix to learn from the data. However, when features are on different scales, such as `Age` in years and `Income` in hundreds of dollars, the features on larger scales can unduly influence the model. We therefore want the features on a similar scale, which can be achieved through scaling techniques.

One such technique is standardization, in which all the features are centered around zero and have, roughly, unit variance. The first line of code below uses the `mutate_at()` function from the `dplyr` library to perform the scaling, and the remaining lines confirm the result: the selected variables are standardized with zero mean and unit variance.

```{r}
dat_2 <- dat %>% mutate_at(vars(Income, Loan_amount, Age), scale)

sd(dat_2$Income)
sd(dat_2$Loan_amount)
sd(dat_2$Age)
```

Output:

```
[1] 1

[1] 1

[1] 1
```

Let's now look at the variables after all the steps we've implemented.

```{r}
glimpse(dat_2)
```

Output:

```
Observations: 598
Variables: 11
$ Marital_status  <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents      <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate     <dbl> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income          <dbl> -0.192200887, -0.547541541, -0.471665931, -0.229956590...
$ Loan_amount     <dbl> -0.42074229, -0.37791245, -0.58135418, 0.73566336, 0.5...
$ Term_months     <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age             <dbl> -0.003567868, -1.766094863, -1.766094863, -1.766094863...
$ Sex             <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, ...
$ LogIncome       <dbl> 8.636752, 7.903966, 8.111628, 8.579980, 9.447545, 10.8...
```

## Converting Column Types

The output above indicates that several variables have an incorrect data type: `Marital_status`, `Is_graduate`, `Credit_score`, `approval_status`, and `Sex`. These are categorical variables currently stored as numeric (`int` or `dbl`) columns, and they need to be converted to the factor type for modeling purposes. The lines of code below perform this task using the column positions of these variables.

```{r}
names <- c(1, 3, 7, 8, 10)
dat_2[, names] <- lapply(dat_2[, names], factor)
glimpse(dat_2)
```

Output:

```
Observations: 598
Variables: 11
$ Marital_status  <fct> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents      <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate     <fct> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income          <dbl> -0.192200887, -0.547541541, -0.471665931, -0.229956590...
$ Loan_amount     <dbl> -0.42074229, -0.37791245, -0.58135418, 0.73566336, 0.5...
$ Term_months     <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age             <dbl> -0.003567868, -1.766094863, -1.766094863, -1.766094863...
$ Sex             <fct> 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, ...
$ LogIncome       <dbl> 8.636752, 7.903966, 8.111628, 8.579980, 9.447545, 10.8...
```

## Conclusion

In this guide, you have learned fundamental techniques of data preparation for machine learning: correcting invalid entries, treating missing values, encoding categorical labels, handling outliers through logarithmic transformation and standardization, and converting column types.