Data preparation often takes up to eighty percent of a data scientist's time in a data science project. This demonstrates its importance in the machine learning life cycle. In this guide, you will learn the basics and implementation of several data preparation techniques in R:
Dealing with Incorrect Entries
Missing Value Treatment
Encoding Categorical Labels
Handling Outliers
Logarithmic Transformation
Standardization
Converting Column Types
In this guide, we will use a fictitious dataset of loan applicants that contains 600 observations and 10 variables, as described below:
Marital_status
: Whether the applicant is married ("1") or not ("0")
Dependents
: Number of dependents claimed by the applicant
Is_graduate
: Whether the applicant is a graduate ("1") or not ("0")
Income
: Annual Income of the applicant (in hundreds of dollars)
Loan_amount
: Loan amount (in hundreds of dollars) for which the application was submitted
Term_months
: Tenure of the loan (in months)
Credit_score
: Whether the applicant's credit score was good ("1") or not ("0")
Age
: The applicant’s age in years
Sex
: Whether the applicant is female ("F") or male ("M")
Approval_status
: Whether the loan application was approved ("1") or not ("0")Let's start by loading the required libraries and the data.
1library(plyr)
2library(readr)
3library(dplyr)
4library(caret)
5library(ggplot2)
6library(repr)
7
8dat <- read_csv("data_prep.csv")
9glimpse(dat)
Output:
1Observations: 600
2Variables: 10
3$ Marital_status <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
4$ Dependents <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
5$ Is_graduate <int> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
6$ Income <int> 5635, 2708, 3333, 5324, 12677, 50292, 9523, 3685, 3107...
7$ Loan_amount <int> 122, 126, 107, 230, 208, 169, 153, 122, 111, 107, 98, ...
8$ Term_months <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
9$ Credit_score <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
10$ approval_status <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
11$ Age <int> 0, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23,...
12$ Sex <chr> "M", "F", "M", "M", "M", "M", "F", "F", "F", "M", "F",...
The output shows that the dataset has nine numerical variables (labeled as int
) and one character variable (labeled as chr
). We will analyze the data types in the subsequent section. For now, let's look at the summary of the data.
1summary(dat)
Output:
1Marital_status Dependents Is_graduate Income
2 Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 200
3 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: 3832
4 Median :1.0000 Median :0.0000 Median : 1.000 Median : 5075
5 Mean :0.6517 Mean :0.7308 Mean : 2.449 Mean : 7211
6 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 1.000 3rd Qu.: 7642
7 Max. :1.0000 Max. :3.0000 Max. :999.000 Max. :108000
8 NA's :2 NA's :1
9 Loan_amount Term_months Credit_score approval_status
10 Min. : 10.0 Min. : 36.0 Min. :0.0000 Min. :0.0000
11 1st Qu.:111.0 1st Qu.:384.0 1st Qu.:1.0000 1st Qu.:0.0000
12 Median :140.0 Median :384.0 Median :1.0000 Median :1.0000
13 Mean :161.6 Mean :367.1 Mean :0.7883 Mean :0.6867
14 3rd Qu.:180.5 3rd Qu.:384.0 3rd Qu.:1.0000 3rd Qu.:1.0000
15 Max. :778.0 Max. :504.0 Max. :1.0000 Max. :1.0000
16
17 Age Sex
18 Min. : 0.00 Length:600
19 1st Qu.: 36.00 Class :character
20 Median : 51.00 Mode :character
21 Mean : 51.77
22 3rd Qu.: 64.00
23 Max. :200.00
The above output shows that the variable Age
has minimum and maximum value of 0
and 200
, respectively. Also, the variable Is_graduate
has a maximum value of 999
, instead of the binary values of 0
and 1
. These entries are incorrect and need correction. One approach would be to delete these records, but instead, we will treat these records as missing values and replace them with a measure of central tendency—mean, median, or mode.
Starting with the Age
variable, the first two lines of code below convert the incorrect values 0
and 200
to missing records. We repeat the same process for the variable Is_graduate
in the third line of code. The fourth and fifth lines print the information about the variables.
1dat[, 9][dat[, 9] == 0] <- NA
2dat[, 9][dat[, 9] == 200] <- NA
3dat[, 3][dat[, 3] == 999] <- NA
4summary(dat$Age)
5summary(dat$Is_graduate)
Output:
1Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2 22.00 36.00 50.50 50.61 64.00 80.00 6
3
4 Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5 0.0000 1.0000 1.0000 0.7826 1.0000 1.0000 2
Now, the variables Age
and Is_graduate
have 594 and 598 records, respectively. The left-out entries have been tagged as missing, which we’ll handle in the next section.
Missing value imputation is one of the most common data preparation steps. It's easy to count the number of missing values in each variable using the sapply()
function.
1sapply(dat, function(x) sum(is.na(x)))
2
Output:
1Marital_status Dependents Is_graduate Income Loan_amount
2 0 2 2 0 0
3 Term_months Credit_score approval_status Age Sex
4 0 0 0 6 5
There are several techniques for handling missing values. The most widely used is replacing the values with the measure of central tendency. The first line of code below replaces the missing values of the Age
variable with the median of the remaining values. The second line replaces the missing values of the Is_graduate
variable with the value of '1', which indicates that the applicant's education status is graduate. The last two lines print the summary statistics of the variables.
1dat$Age[is.na(dat$Age)] <- median(dat$Age, na.rm = TRUE)
2dat$Is_graduate[is.na(dat$Is_graduate)] <- 1
3
4summary(dat$Age)
5table(dat$Is_graduate)
Output:
1Min. 1st Qu. Median Mean 3rd Qu. Max.
2 22.00 36.00 50.50 50.61 64.00 80.00
3
4
5 0 1
6130 470
The missing value treatment is complete for both the variables. The data also has a variable, Sex
, with five missing values. Since this is a categorical variable, we will check the distribution of labels, which is done in the line of code below.
1table(dat$Sex)
Output:
1 F M
2111 484
The output shows that 484 out of 595 applicants are male, so we will replace the missing values with the label M
. The first line of code below performs this task, while the second line prints the distribution of the variable. The output shows 600 records for the Sex
variable, indicating that the missing values have been imputed.
1dat$Sex[is.na(dat$Sex)] <- "M"
2table(dat$Sex)
Output:
1Length Class Mode
2 600 character character
We will now check if there are any more missing values with the line of code below. The output shows that we still have two missing values in the variable Dependents
.
1sapply(dat, function(x) sum(is.na(x)))
Output:
1Marital_status Dependents Is_graduate Income Loan_amount
2 0 2 0 0 0
3 Term_months Credit_score approval_status Age Sex
4 0 0 0 0 0
There are two missing values left in the data set, and we'll use another approach of treating missing values by dropping the records. The first line of code below uses the complete.cases()
function to drop rows with any missing values in them, while the second line checks the information about the missing values in the data set. The third line prints the resulting dimension of the data.
1dat = dat[complete.cases(dat), ]
2sapply(dat, function(x) sum(is.na(x)))
3dim(dat)
Output:
1Marital_status Dependents Is_graduate Income Loan_amount
2 0 0 0 0 0
3 Term_months Credit_score approval_status Age Sex
4 0 0 0 0 0
5 LogIncome
6 0
7
8
9[1] 598 10
Certain machine learning algorithms like xgboost
require all variables to be numeric. In this data, the variable Sex
has labels (M
and F
), so we will have to change them to a numeric format. Since there are two labels, we can do binary encoding, which is done in the first line of code below. The output from the second line shows that we have successfully performed the encoding.
1dat$Sex <- ifelse(dat$Sex == "M",1,0)
2table(dat$Sex)
Output:
1 0 1
2111 487
One of the biggest obstacles in predictive modeling can be the presence of outliers, which are extreme values that are different from the other data points. Outliers are often a problem because they mislead the training process and lead to inaccurate models.
For numeric variables, we can identify outliers numerically through the skewness value. The lines of code below print the skewness value for the Income
variable.
1library(e1071)
2skewness(dat$Income)
Output:
1[1] 6.455884
The output above shows that the variable Income
has a right-skewed distribution with a skewness value of 6.5. Ideally, this value should be between -1 and 1. Apart from the income variable, we also have other variables, such as Loan_amount
and Age
, that have differences in scale and require normalization. We'll learn techniques in the next couple of sections to deal with this problem.
One of the techniques to make a skewed distribution normal is logarithmic transformation. The first line of code below creates a new variable, LogIncome
, while the second line computes and prints the skewness value of this new variable.
1dat$LogIncome = log(dat$Income)
2skewness(dat$LogIncome)
Output:
1[1] 0.4860717
The above chart shows that taking the log of the Income
variable makes the distribution roughly normal and reduces the skewness value. We can use the same transformation for other numerical variables, but instead, we'll learn another transformation technique called standardization.
Several machine learning algorithms use some form of a distance matrix to learn from the data. However, when the features are using different scales, such as Age
in years and Income
in hundreds of dollars, the features using larger scales can unduly influence the model. As a result, we want the features to use a similar scale that can be achieved through scaling techniques.
One such technique is standardization, in which all the features are centered around zero and have, roughly, unit variance. The first line of code below uses the mutate-at
function from the dplyr
library to perform the scaling. The result is that these variables are standardized with zero mean and unit variance.
1dat_2 = dat %>% mutate_at(scale, .vars = vars("Income","Loan_amount", "Age"))
2
3sd(dat_2$Income)
4sd(dat_2$Loan_amount)
5sd(dat_2$Age)
Output:
1[1] 1
2
3[1] 1
4
5[1] 1
Let's now look at the variables after all the steps we've implemented.
1glimpse(dat_2)
Output:
1Observations: 598
2Variables: 11
3$ Marital_status <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
4$ Dependents <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
5$ Is_graduate <dbl> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
6$ Income <dbl> -0.192200887, -0.547541541, -0.471665931, -0.229956590...
7$ Loan_amount <dbl> -0.42074229, -0.37791245, -0.58135418, 0.73566336, 0.5...
8$ Term_months <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
9$ Credit_score <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
10$ approval_status <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
11$ Age <dbl> -0.003567868, -1.766094863, -1.766094863, -1.766094863...
12$ Sex <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, ...
13$ LogIncome <dbl> 8.636752, 7.903966, 8.111628, 8.579980, 9.447545, 10.8...
The output above indicates that there are many variables for which the data type is incorrect. These variables are Marital_status
, Is_graduate
, Credit_score
, approval_status
, and Sex
. These are categorical variables labeled as character variables, and they need to be converted to the factor type for modeling purposes. The lines of code below perform this task.
1names <- c(1,3,7,8,10)
2dat_2[,names] <- lapply(dat_2[,names] , factor)
3glimpse(dat_2)
Output:
1Observations: 598
2Variables: 11
3$ Marital_status <fct> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
4$ Dependents <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
5$ Is_graduate <fct> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
6$ Income <dbl> -0.192200887, -0.547541541, -0.471665931, -0.229956590...
7$ Loan_amount <dbl> -0.42074229, -0.37791245, -0.58135418, 0.73566336, 0.5...
8$ Term_months <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
9$ Credit_score <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
10$ approval_status <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
11$ Age <dbl> -0.003567868, -1.766094863, -1.766094863, -1.766094863...
12$ Sex <fct> 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, ...
13$ LogIncome <dbl> 8.636752, 7.903966, 8.111628, 8.579980, 9.447545, 10.8...
In this guide, you have learned about the fundamental techniques of data preparation for machine learning. You learned about dealing with missing values, identifying and treating outliers, normalizing and transforming data, and converting data types.
To learn more about data science using R, please refer to the following guides: