Normalizing Data with R
Nov 6, 2019 • 8 Minute Read
Introduction
One way to turn an average machine learning model into a good one is through the statistical technique of normalizing data. If we don't normalize the data, the machine learning algorithm will be dominated by the variables measured on larger scales, adversely affecting model performance. This makes it imperative to normalize the data.
In this guide, you will learn various ways to perform this task in the popular statistical programming language R.
Let’s start by looking at the data we’ll use in this guide.
Data
In this guide, we’ll use a fictitious dataset of loan applicants containing 578 observations and 6 variables, as described below:

Dependents: Number of dependents of the applicant

Income: Annual Income of the applicant (in USD)

Loan_amount: Loan amount (in USD) for which the application was submitted

Term_months: Tenure of the loan (in months)

Approval_status: Whether the loan application was approved ("1") or not ("0")

Age: The applicant's age in years
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)
dat <- read_csv("data_n.csv")
glimpse(dat)
Output:
Observations: 578
Variables: 6
$ Dependents <int> 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, ...
$ Income <int> 183700, 192300, 222400, 240000, 213300, 263600, 256800...
$ Loan_amount <int> 18600, 19500, 22300, 26000, 26600, 28000, 28000, 30000...
$ Term_months <int> 384, 384, 384, 384, 384, 384, 384, 384, 384, 384, 204,...
$ approval_status <int> 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, ...
$ Age <int> 40, 63, 42, 30, 43, 46, 46, 68, 48, 72, 54, 54, 29, 70...
The output above shows that the dataset has six integer variables (labelled 'int'). However, the variable 'approval_status' is a categorical target variable and will not be normalized.
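Since 'approval_status' is the class label, it is common in R to convert it from an integer to a factor so that modeling functions treat it as categorical. A minimal sketch, shown here on a small made-up vector rather than the full dataset:

```r
# The target is categorical, so convert it from integer to factor before modeling
approval_status <- c(0, 0, 0, 1, 0, 1)
approval_status <- as.factor(approval_status)
levels(approval_status)  # "0" "1"
```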
We’re ready to carry out the most common data normalization steps. Let's begin by looking at the summary of the variables, using the summary() command.
summary(dat)
Output:
Dependents Income Loan_amount Term_months
Min. :0.0000 Min. : 173200 Min. : 18600 Min. : 36.0
1st Qu.:0.0000 1st Qu.: 389550 1st Qu.: 61500 1st Qu.:384.0
Median :0.0000 Median : 513050 Median : 76500 Median :384.0
Mean :0.7561 Mean : 715589 Mean : 333702 Mean :365.5
3rd Qu.:1.0000 3rd Qu.: 774800 3rd Qu.: 136250 3rd Qu.:384.0
Max. :6.0000 Max. :8444900 Max. :7780000 Max. :504.0
approval_status Age
Min. :0.0000 Min. :22.00
1st Qu.:0.0000 1st Qu.:37.00
Median :1.0000 Median :51.00
Mean :0.6955 Mean :49.71
3rd Qu.:1.0000 3rd Qu.:61.75
Max. :1.0000 Max. :76.00
The output above confirms that the numerical variables have different units and scales, for example, 'Age' in years and 'Income' in dollars. These differences can unduly influence the model and, therefore, we need to scale or transform them.
We will be using the caret package in R, a powerful package whose preProcess() function carries out the different types of data normalization discussed in the subsequent sections.
Standardization
Standardization is a technique in which all the features are centered around zero and have unit variance.
The first line of code below loads the 'caret' package, the second line estimates the preprocessing parameters (the mean and standard deviation of each variable), the third line applies the transformation, and the fourth prints a summary of the standardized variables.
library(caret)
preproc1 <- preProcess(dat[,c(1:4,6)], method=c("center", "scale"))
norm1 <- predict(preproc1, dat[,c(1:4,6)])
summary(norm1)
Output:
  Dependents          Income            Loan_amount        Term_months
 Min.   :-0.7338   Min.   :-0.75854   Min.   :-0.4281   Min.   :-5.3744
 1st Qu.:-0.7338   1st Qu.:-0.45597   1st Qu.:-0.3698   1st Qu.: 0.3021
 Median :-0.7338   Median :-0.28325   Median :-0.3494   Median : 0.3021
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000
 3rd Qu.: 0.2367   3rd Qu.: 0.08281   3rd Qu.:-0.2683   3rd Qu.: 0.3021
 Max.   : 5.0893   Max.   :10.80959   Max.   :10.1169   Max.   : 2.2595
      Age
 Min.   :-1.90719
 1st Qu.:-0.87496
 Median : 0.08846
 Mean   : 0.00000
 3rd Qu.: 0.82823
 Max.   : 1.80885
The output shows that all the numerical variables have been standardized with a mean value of zero.
The same result can be obtained using the scale function, as shown below.
dat_scaled <- as.data.frame(scale(dat[,c(1:4,6)]))
summary(dat_scaled$Income)
Output:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.75854 -0.45597 -0.28325  0.00000  0.08281 10.80959
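The arithmetic behind preProcess() and scale() can also be written out by hand, which makes the transformation explicit: subtract the mean, then divide by the standard deviation. A minimal sketch on a small made-up income vector (the full dataset is not needed to see the effect):

```r
# Standardization by hand: subtract the mean, divide by the standard deviation
income <- c(183700, 192300, 222400, 240000, 213300)
income_std <- (income - mean(income)) / sd(income)

mean(income_std)  # effectively 0 (up to floating-point error)
sd(income_std)    # 1
```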
Min-Max Scaling
In this approach, the data is scaled to a fixed range, usually 0 to 1. Because the transformed values are bounded, the resulting standard deviations are smaller, which can suppress the effect of outliers. We follow the same steps as above; the only change is in the 'method' argument, where the normalization method is now set to "range".
preproc2 <- preProcess(dat[,c(1:4,6)], method=c("range"))
norm2 <- predict(preproc2, dat[,c(1:4,6)])
summary(norm2)
Output:
Dependents Income Loan_amount Term_months
Min. :0.0000 Min. :0.00000 Min. :0.000000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.02616 1st Qu.:0.005527 1st Qu.:0.7436
Median :0.0000 Median :0.04109 Median :0.007460 Median :0.7436
Mean :0.1260 Mean :0.06557 Mean :0.040599 Mean :0.7040
3rd Qu.:0.1667 3rd Qu.:0.07273 3rd Qu.:0.015158 3rd Qu.:0.7436
Max. :1.0000 Max. :1.00000 Max. :1.000000 Max. :1.0000
Age
Min. :0.0000
1st Qu.:0.2778
Median :0.5370
Mean :0.5132
3rd Qu.:0.7361
Max. :1.0000
The output above shows that all the values have been scaled between 0 and 1.
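The same range scaling can be expressed directly with the min-max formula, (x - min) / (max - min); a hand-rolled sketch on a small made-up age vector:

```r
# Min-max scaling by hand: (x - min) / (max - min) maps values onto [0, 1]
age <- c(40, 63, 42, 30, 43)
age_mm <- (age - min(age)) / (max(age) - min(age))

range(age_mm)  # 0 1
```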
Log Transformation
Many machine learning algorithms require variables to be normally distributed. However, in a real-world scenario, the data is often skewed and does not follow a normal distribution. One technique to counter this is to apply a logarithmic transformation to the variables.
The first line of code below prints a summary of the 'Income' variable. The output shows that the mean and median incomes of the applicants are $715,589 and $513,050, respectively.
The second command performs the logarithmic transformation, while the third line prints the summary of the transformed variable. The output shows that the mean and median of the new transformed variable are similar.
summary(dat$Income)
logincome = log(dat$Income)
summary(logincome)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
173200 389550 513050 715589 774800 8444900
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.06 12.87 13.15 13.26 13.56 15.95
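A log-transformed value can be mapped back to the original scale with exp(), which inverts the transformation exactly. Note that log() is undefined at zero, so for variables that may contain zeros, log1p() (which computes log(1 + x)) and its inverse expm1() are safer choices. A brief self-contained sketch:

```r
# log() compresses large values; exp() inverts the transformation exactly
income <- c(173200, 513050, 8444900)
logincome <- log(income)
exp(logincome)   # recovers 173200, 513050, 8444900

# log() is undefined at zero, so for variables that may contain zeros
# use log1p() (log(1 + x)) and its inverse expm1()
expm1(log1p(0))  # 0
```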
Conclusion
In this guide, you have learned the most commonly used data normalization techniques using the powerful 'caret' package in R. These techniques will help you handle numerical variables of varying units and scales, improving the performance of your machine learning algorithms.