Normalizing Data with R
Nov 6, 2019 • 8 Minute Read
Introduction
One way to turn an average machine learning model into a good one is through the statistical technique of normalizing data. If we don't normalize the data, the machine learning algorithm will be dominated by the variables measured on larger scales, adversely affecting model performance. This makes it imperative to normalize the data.
In this guide, you will learn various ways to perform this task in the popular statistical programming language R.
Let’s start by looking at the data we’ll use in this guide.
Data
In this guide, we’ll use a fictitious dataset of loan applicants containing 578 observations and 6 variables, as described below:

Dependents: Number of dependents of the applicant

Income: Annual Income of the applicant (in USD)

Loan_amount: Loan amount (in USD) for which the application was submitted

Term_months: Tenure of the loan (in months)

Approval_status: Whether the loan application was approved ("1") or not ("0")

Age: The applicant's age in years
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)
dat <- read_csv("data_n.csv")
glimpse(dat)
Output:
Observations: 578
Variables: 6
$ Dependents <int> 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, ...
$ Income <int> 183700, 192300, 222400, 240000, 213300, 263600, 256800...
$ Loan_amount <int> 18600, 19500, 22300, 26000, 26600, 28000, 28000, 30000...
$ Term_months <int> 384, 384, 384, 384, 384, 384, 384, 384, 384, 384, 204,...
$ approval_status <int> 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, ...
$ Age <int> 40, 63, 42, 30, 43, 46, 46, 68, 48, 72, 54, 54, 29, 70...
The output above shows that the dataset has six integer variables (labelled 'int'). However, the variable 'approval_status' is a categorical target variable and will not be normalized.
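Since 'approval_status' is the class label, it is common in R to convert it from an integer to a factor so that modeling functions treat it as categorical. A minimal sketch, shown here on a small made-up vector rather than the full dataset:

```r
# The target is categorical, so convert it from integer to factor before modeling
approval_status <- c(0, 0, 0, 1, 0, 1)
approval_status <- as.factor(approval_status)
levels(approval_status)  # "0" "1"
```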
We’re ready to carry out the most common data normalization steps. Let's begin by looking at the summary of the variables, using the summary() command.
summary(dat)
Output:
Dependents Income Loan_amount Term_months
Min. :0.0000 Min. : 173200 Min. : 18600 Min. : 36.0
1st Qu.:0.0000 1st Qu.: 389550 1st Qu.: 61500 1st Qu.:384.0
Median :0.0000 Median : 513050 Median : 76500 Median :384.0
Mean :0.7561 Mean : 715589 Mean : 333702 Mean :365.5
3rd Qu.:1.0000 3rd Qu.: 774800 3rd Qu.: 136250 3rd Qu.:384.0
Max. :6.0000 Max. :8444900 Max. :7780000 Max. :504.0
approval_status Age
Min. :0.0000 Min. :22.00
1st Qu.:0.0000 1st Qu.:37.00
Median :1.0000 Median :51.00
Mean :0.6955 Mean :49.71
3rd Qu.:1.0000 3rd Qu.:61.75
Max. :1.0000 Max. :76.00
The output above confirms that the numerical variables have different units and scales, for example, 'Age' in years and 'Income' in dollars. These differences can unduly influence the model and, therefore, we need to scale or transform them.
We will be using the caret package in R, a powerful package whose preProcess() function carries out the different types of data normalization discussed in the subsequent sections.
Standardization
Standardization is a technique in which all the features are centered around zero and have unit variance.
The first line of code below loads the 'caret' package, the second line estimates the preprocessing parameters (the mean and standard deviation of each variable), the third line applies the transformation, and the fourth prints a summary of the standardized variables.
library(caret)
preproc1 <- preProcess(dat[,c(1:4,6)], method=c("center", "scale"))
norm1 <- predict(preproc1, dat[,c(1:4,6)])
summary(norm1)
Output:
  Dependents          Income            Loan_amount        Term_months
 Min.   :-0.7338   Min.   :-0.75854   Min.   :-0.4281   Min.   :-5.3744
 1st Qu.:-0.7338   1st Qu.:-0.45597   1st Qu.:-0.3698   1st Qu.: 0.3021
 Median :-0.7338   Median :-0.28325   Median :-0.3494   Median : 0.3021
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000
 3rd Qu.: 0.2367   3rd Qu.: 0.08281   3rd Qu.:-0.2683   3rd Qu.: 0.3021
 Max.   : 5.0893   Max.   :10.80959   Max.   :10.1169   Max.   : 2.2595
      Age
 Min.   :-1.90719
 1st Qu.:-0.87496
 Median : 0.08846
 Mean   : 0.00000
 3rd Qu.: 0.82823
 Max.   : 1.80885
The output shows that all the numerical variables have been standardized with a mean value of zero.
The same result can be obtained using the scale function, as shown below.
dat_scaled <- as.data.frame(scale(dat[,c(1:4,6)]))
summary(dat_scaled$Income)
Output:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.75854 -0.45597 -0.28325  0.00000  0.08281 10.80959
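The arithmetic behind preProcess() and scale() can also be written out by hand, which makes the transformation explicit: subtract the mean, then divide by the standard deviation. A minimal sketch on a small made-up income vector (the full dataset is not needed to see the effect):

```r
# Standardization by hand: subtract the mean, divide by the standard deviation
income <- c(183700, 192300, 222400, 240000, 213300)
income_std <- (income - mean(income)) / sd(income)

mean(income_std)  # effectively 0 (up to floating-point error)
sd(income_std)    # 1
```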
Min-Max Scaling
In this approach, the data is scaled to a fixed range, usually 0 to 1. Because the transformed values are bounded, the resulting standard deviations are smaller, which can suppress the effect of outliers. We follow the same steps as above; the only change is in the 'method' argument, where the normalization method is now set to "range".
preproc2 <- preProcess(dat[,c(1:4,6)], method=c("range"))
norm2 <- predict(preproc2, dat[,c(1:4,6)])
summary(norm2)
Output:
Dependents Income Loan_amount Term_months
Min. :0.0000 Min. :0.00000 Min. :0.000000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.02616 1st Qu.:0.005527 1st Qu.:0.7436
Median :0.0000 Median :0.04109 Median :0.007460 Median :0.7436
Mean :0.1260 Mean :0.06557 Mean :0.040599 Mean :0.7040
3rd Qu.:0.1667 3rd Qu.:0.07273 3rd Qu.:0.015158 3rd Qu.:0.7436
Max. :1.0000 Max. :1.00000 Max. :1.000000 Max. :1.0000
Age
Min. :0.0000
1st Qu.:0.2778
Median :0.5370
Mean :0.5132
3rd Qu.:0.7361
Max. :1.0000
The output above shows that all the values have been scaled between 0 and 1.
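The same range scaling can be expressed directly with the min-max formula, (x - min) / (max - min); a hand-rolled sketch on a small made-up age vector:

```r
# Min-max scaling by hand: (x - min) / (max - min) maps values onto [0, 1]
age <- c(40, 63, 42, 30, 43)
age_mm <- (age - min(age)) / (max(age) - min(age))

range(age_mm)  # 0 1
```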
Log Transformation
Many machine learning algorithms require variables to be normally distributed. However, in a real-world scenario, the data is often skewed and does not follow a normal distribution. One technique to counter this is to apply a logarithmic transformation to the variables.
The first line of code below prints a summary of the 'Income' variable. The output shows that the mean and median incomes of the applicants are $715,589 and $513,050, respectively.
The second command performs the logarithmic transformation, while the third line prints the summary of the transformed variable. The output shows that the mean and median of the new transformed variable are similar.
summary(dat$Income)
logincome = log(dat$Income)
summary(logincome)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
173200 389550 513050 715589 774800 8444900
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.06 12.87 13.15 13.26 13.56 15.95
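A log-transformed value can be mapped back to the original scale with exp(), which inverts the transformation exactly. Note that log() is undefined at zero, so for variables that may contain zeros, log1p() (which computes log(1 + x)) and its inverse expm1() are safer choices. A brief self-contained sketch:

```r
# log() compresses large values; exp() inverts the transformation exactly
income <- c(173200, 513050, 8444900)
logincome <- log(income)
exp(logincome)   # recovers 173200, 513050, 8444900

# log() is undefined at zero, so for variables that may contain zeros
# use log1p() (log(1 + x)) and its inverse expm1()
expm1(log1p(0))  # 0
```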
Conclusion
In this guide, you have learned the most commonly used data normalization techniques using the powerful 'caret' package in R. These techniques will help you handle numerical variables of varying units and scales, improving the performance of your machine learning algorithms.