Deepika Singh

Encoding Data with R

  • Nov 12, 2019
  • 10 Min read
  • 33 Views
Data
R

Introduction

R offers several powerful machine learning algorithms. To make the best use of them, we must first transform the data into the format they expect. A common step is encoding the data, that is, converting variables into a representation the algorithms can process efficiently. In this guide, you will learn different techniques for encoding data with R.

Data

In this guide, we will use a fictitious dataset of loan applications containing 600 observations and 10 variables:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No")

  2. Dependents: Number of dependents of the applicant

  3. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  4. Income: Annual Income of the applicant (in USD)

  5. Loan_amount: Loan amount (in USD) for which the application was submitted

  6. Credit_score: Whether the applicant’s credit score is good ("Satisfactory") or not ("Not Satisfactory")

  7. approval_status: Whether the loan application was approved (1) or not (0)

  8. Age: The applicant's age in years

  9. Sex: Whether the applicant is a male ("M") or a female ("F")

  10. Purpose: Purpose of applying for the loan

Let's start by loading the required libraries and the data.

library(plyr)   # load plyr before dplyr to avoid function masking issues
library(readr)
library(dplyr)
library(caret)

dat <- read_csv("data_eng.csv")

glimpse(dat)

Output:

Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Dependents      <int> 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 298500, 315500, 295100, 319300, 333300, 277700, 332100...
$ Loan_amount     <int> 71000, 75500, 70000, 70000, 98000, 71000, 58000, 64000...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Age             <int> 74, 71, 71, 68, 64, 64, 63, 61, 60, 59, 56, 55, 54, 54...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",...
$ Purpose         <chr> "Wedding", "Wedding", "Wedding", "Wedding", "Wedding",...

The output shows that the dataset has five numerical variables (labeled int) and five categorical variables (labeled chr). We are now ready to carry out the encoding steps.

There are different methods for encoding categorical variables, and selection depends on the distribution of labels in the variable and the end objective. In the subsequent sections, we will cover the most widely used techniques of encoding categorical variables.
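Since the choice of encoding depends on how many levels a variable has, it can help to count the distinct values in each character column first. The following is a minimal sketch in base R; the data frame toy and its values are made up for illustration and are not part of the loan dataset.

```r
# Hypothetical toy data, not the loan dataset used in this guide
toy <- data.frame(
  Sex     = c("M", "F", "M", "F"),
  Purpose = c("Travel", "Wedding", "Business", "Travel"),
  Age     = c(34, 51, 29, 45),
  stringsAsFactors = FALSE
)

# Identify the character (categorical) columns
is_chr <- sapply(toy, is.character)

# Count the distinct levels in each categorical column
n_levels <- sapply(toy[is_chr], function(x) length(unique(x)))

print(n_levels)
```

Columns with two levels are natural candidates for a simple numeric recoding, while columns with more levels usually call for a different treatment, as the following sections discuss.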

Label Encoding

In simple terms, label encoding replaces the levels of a categorical variable with numeric codes. For instance, the variable Credit_score has two levels, "Satisfactory" and "Not Satisfactory". These can be encoded as 1 and 0, respectively. The first line of code below performs this task, while the second line prints a table of the levels after encoding.

dat$Credit_score <- ifelse(dat$Credit_score == "Satisfactory", 1, 0)

table(dat$Credit_score)

Output:

  0   1 
128 472

The above output shows that the label encoding is done. This is straightforward when the categorical variable has two levels, as with Credit_score. With more than two levels, integer codes are less intuitive because they suggest an ordering that may not exist. For example, the Purpose variable has six levels, as can be seen from the output below.

table(dat$Purpose)

Output:

Business Education Furniture  Personal    Travel   Wedding 
       43       191        38       166       123        39 

In such cases, one-hot encoding is preferred.
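To illustrate why integer codes are less suitable here, the sketch below label-encodes a multi-level variable with factor() and as.integer() in base R. The vector purpose is a made-up example, not the loan data.

```r
# Hypothetical example vector, not taken from the loan dataset
purpose <- c("Wedding", "Travel", "Business", "Travel")

# factor() orders levels alphabetically by default;
# as.integer() then returns each value's position among those levels
purpose_factor <- factor(purpose)
purpose_codes <- as.integer(purpose_factor)

levels(purpose_factor)
purpose_codes
```

The codes follow the alphabetical order of the levels, so a model that treats them as numbers would read "Wedding" as greater than "Business", an ordering with no real meaning.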

One-Hot Encoding

In this technique, one-hot (dummy) encoding is applied to the features, creating a binary column for each category level. In each dummy variable, the value 1 represents the presence of the level in the observation, while 0 represents its absence.

We will apply this technique to all the remaining categorical variables. The first line of code below loads the caret package, while the second line uses the dummyVars() function to create the dummy variables. The dummyVars() function works on the categorical variables. Note that the second line sets the fullRank argument to TRUE, which creates n-1 columns for a categorical variable with n unique levels.

The third line uses the output of the dummyVars() function to transform the dataset, dat, so that all the categorical variables are encoded as numerical variables. The fourth line prints the structure of the resulting data, dat_transformed, which confirms that one-hot encoding is complete.

library(caret)

dmy <- dummyVars(" ~ .", data = dat, fullRank = TRUE)  # fullRank = TRUE drops one level per variable
dat_transformed <- data.frame(predict(dmy, newdata = dat))

glimpse(dat_transformed)

Output:

Observations: 600
Variables: 14
$ Marital_status.Yes <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...
$ Dependents         <dbl> 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, ...
$ Is_graduate.Yes    <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, ...
$ Income             <dbl> 298500, 315500, 295100, 319300, 333300, 277700, 332...
$ Loan_amount        <dbl> 71000, 75500, 70000, 70000, 98000, 71000, 58000, 64...
$ Credit_score       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ...
$ approval_status.1  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Age                <dbl> 74, 71, 71, 68, 64, 64, 63, 61, 60, 59, 56, 55, 54,...
$ Sex.M              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Purpose.Education  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Purpose.Furniture  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Purpose.Personal   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Purpose.Travel     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Purpose.Wedding    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
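
The effect of dropping one level per variable can also be seen without caret. The sketch below uses model.matrix() from base R on a made-up factor; it is an illustration under these assumptions rather than a substitute for the dummyVars() workflow above.

```r
# Hypothetical toy data, not the loan dataset
toy <- data.frame(
  Purpose = factor(c("Business", "Education", "Travel", "Education"))
)

# Removing the intercept (- 1) yields one column per level: n columns
full <- model.matrix(~ Purpose - 1, data = toy)

# Keeping the intercept and then dropping it leaves n - 1 columns;
# the first level ("Business") becomes the all-zero baseline
reduced <- model.matrix(~ Purpose, data = toy)[, -1, drop = FALSE]

colnames(full)
colnames(reduced)
```

Dropping one level loses no information, because a row of zeros identifies the baseline level. It also avoids perfectly collinear columns in models that include an intercept.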

Encoding Continuous (or Numeric) Variables

In the previous sections, we learned how to encode categorical variables. However, it is sometimes useful to encode numerical variables as well. For example, some algorithms, such as the categorical form of Naive Bayes, expect categorical inputs, so numerical variables must be encoded first. This is also called binning.

We will consider the Income variable as an example. Let’s look at the summary statistics of this variable.

summary(dat$Income)

Output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 133300  384975  508350  706302  766100 8444900 

The values of Income range from $133,300 to about $8.44 million, and the mean ($706,302) is well above the median ($508,350), which indicates that the distribution is right skewed. An additional benefit of binning is that it limits the influence of outliers. Let’s create three levels of the variable Income: “Low” for incomes at or below the first quartile ($384,975), “High” for incomes above the third quartile ($766,100), and “Mid50” for the middle 50 percent of the income distribution.

The first step is to create a vector of these cut-off points, which is done in the first line of code below. The second line gives names to the resulting bins. The third line uses the cut() function to break the vector using the cut-off points. Finally, we compare the original Income variable with the binned Income_new variable using the summary() function.

bins <- c(-Inf, 384975, 766100, Inf)  # cut-offs at the first and third quartiles

bin_names <- c("Low", "Mid50", "High")

dat$Income_new <- cut(dat$Income, breaks = bins, labels = bin_names)  # intervals are right-closed by default

summary(dat$Income)

summary(dat$Income_new)

Output:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 133300  384975  508350  706302  766100 8444900 
 
  Low Mid50  High 
  150   301   149 

The above output shows that the variable has been binned. It is also possible to create bin cut-offs automatically, as shown in the code below. In this case, we create 5 bins of approximately equal width for the variable Age.

dat$Age_new <- cut(dat$Age, breaks = 5, labels = c("Bin1", "Bin2", "Bin3", "Bin4", "Bin5"))

summary(dat$Age)

summary(dat$Age_new)

Output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.00   36.00   50.00   49.31   61.00   76.00 
 
Bin1 Bin2 Bin3 Bin4 Bin5 
 108  117  114  162   99 
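
Rather than typing quartile values by hand, the cut-offs can be computed with quantile(). The following sketch uses a made-up income vector to show this. It also shows that cut() assigns a value lying exactly on a cut-off to the lower bin, because its intervals are right-closed by default.

```r
# Hypothetical incomes, not the loan dataset
income <- c(100, 200, 300, 400, 500, 600, 700, 800)

# First and third quartiles as cut-off points
q <- unname(quantile(income, probs = c(0.25, 0.75)))
bins <- c(-Inf, q, Inf)
bin_names <- c("Low", "Mid50", "High")

income_new <- cut(income, breaks = bins, labels = bin_names)
table(income_new)

# A value exactly on the lower cut-off falls in "Low" (right = TRUE is the default)
boundary_bin <- cut(q[1], breaks = bins, labels = bin_names)
boundary_bin
```

Computing the cut-offs from the data keeps the bins in step with the distribution if the dataset changes, at the cost of bin boundaries that differ from one dataset to the next.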

Conclusion

In this guide, you have learned methods of encoding data with R. You have applied these techniques to both quantitative and qualitative variables. Depending on the objective of your project, you can apply any or all of these encoding techniques.
