Introduction


One way to turn an average machine learning model into a good one is to normalize the data. Without normalization, variables measured on a larger scale can dominate scale-sensitive algorithms, such as those based on distances or gradient descent, and this hurts model performance. That makes normalization an important preprocessing step.

In this guide, you will learn various ways to perform this task in the popular statistical programming language R.

We’ll use a fictitious dataset of loan applicants containing 578 observations and 6 variables, as described below:

Dependents: Number of dependents of the applicant

Income: Annual income of the applicant (in USD)

Loan_amount: Loan amount (in USD) for which the application was submitted

Term_months: Tenure of the loan (in months)

approval_status: Whether the loan application was approved ("1") or not ("0")

Age: The applicant's age in years

Let's start by loading the required libraries and the data.

```r
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)

dat <- read_csv("data_n.csv")
glimpse(dat)
```

Output:

```
Observations: 578
Variables: 6
$ Dependents      <int> 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 1, 0, ...
$ Income          <int> 183700, 192300, 222400, 240000, 213300, 263600, 256800...
$ Loan_amount     <int> 18600, 19500, 22300, 26000, 26600, 28000, 28000, 30000...
$ Term_months     <int> 384, 384, 384, 384, 384, 384, 384, 384, 384, 384, 204,...
$ approval_status <int> 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, ...
$ Age             <int> 40, 63, 42, 30, 43, 46, 46, 68, 48, 72, 54, 54, 29, 70...
```

The output above shows that the dataset has six integer variables (labelled as 'int'). However, the variable 'approval_status' is a categorical target variable and will not be normalized.
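The later steps pick the predictors by column position. As a minimal sketch, the same selection can be made by name, which keeps working if the column order changes. The `toy` data frame is a hypothetical stand-in for `dat` that reuses the first three rows shown in the `glimpse()` output; `predictors` is an illustrative name.

```r
# Hypothetical stand-in for `dat`, built from the first three rows printed above
toy <- data.frame(
  Dependents      = c(2L, 0L, 0L),
  Income          = c(183700L, 192300L, 222400L),
  Loan_amount     = c(18600L, 19500L, 22300L),
  Term_months     = c(384L, 384L, 384L),
  approval_status = c(0L, 0L, 0L),
  Age             = c(40L, 63L, 42L)
)

# Every column except the target is a candidate for normalization
predictors <- setdiff(names(toy), "approval_status")

# Check that this matches the positional selection c(1:4, 6)
identical(predictors, names(toy)[c(1:4, 6)])
```

With the real data, `dat[, predictors]` would then play the role of `dat[,c(1:4,6)]`.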

We’re ready to carry out the most common data normalization steps. Let's begin by looking at the summary of the variables, using the *summary()* command.

```r
summary(dat)
```

Output:

```
   Dependents         Income         Loan_amount       Term_months   
 Min.   :0.0000   Min.   : 173200   Min.   :  18600   Min.   : 36.0  
 1st Qu.:0.0000   1st Qu.: 389550   1st Qu.:  61500   1st Qu.:384.0  
 Median :0.0000   Median : 513050   Median :  76500   Median :384.0  
 Mean   :0.7561   Mean   : 715589   Mean   : 333702   Mean   :365.5  
 3rd Qu.:1.0000   3rd Qu.: 774800   3rd Qu.: 136250   3rd Qu.:384.0  
 Max.   :6.0000   Max.   :8444900   Max.   :7780000   Max.   :504.0  
 approval_status       Age       
 Min.   :0.0000   Min.   :22.00  
 1st Qu.:0.0000   1st Qu.:37.00  
 Median :1.0000   Median :51.00  
 Mean   :0.6955   Mean   :49.71  
 3rd Qu.:1.0000   3rd Qu.:61.75  
 Max.   :1.0000   Max.   :76.00  
```

The output above confirms that the numerical variables have different units and scales, for example, 'Age' in years and 'Income' in dollars. These differences can unduly influence the model and, therefore, we need to scale or transform them.
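To make the scale problem concrete, here is a small sketch using two hypothetical applicants (the values are made up for illustration and are not rows from the dataset). It computes how much each variable contributes to the squared Euclidean distance between them, the kind of quantity a distance-based method relies on.

```r
# Two hypothetical applicants: very different ages, nearly identical incomes
a <- c(Age = 25, Income = 500000)
b <- c(Age = 70, Income = 501000)

# Raw Euclidean distance between the two applicants
raw_dist <- sqrt(sum((a - b)^2))

# Share of the squared distance contributed by each variable
contribution <- (a - b)^2 / sum((a - b)^2)
round(contribution, 4)
```

Even though the ages differ by 45 years and the incomes by only 1,000 USD, almost all of the distance comes from Income, simply because it is measured in larger units.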

We will be using the **caret** package in R, whose `preProcess()` function carries out the normalization steps used in the rest of this guide.

Standardization centers each feature at zero and rescales it to unit variance by subtracting the feature's mean and dividing by its standard deviation.

The *first line of code* below loads the 'caret' package, while the *second line* estimates the centering and scaling parameters from the predictor columns. The *third line* applies them to perform the normalization, while the *fourth command* prints the summary of the standardized variables.

```r
library(caret)

preproc1 <- preProcess(dat[,c(1:4,6)], method=c("center", "scale"))

norm1 <- predict(preproc1, dat[,c(1:4,6)])

summary(norm1)
```

Output:

```
   Dependents          Income          Loan_amount       Term_months     
 Min.   :-0.7338   Min.   :-0.75854   Min.   :-0.4281   Min.   :-5.3744  
 1st Qu.:-0.7338   1st Qu.:-0.45597   1st Qu.:-0.3698   1st Qu.: 0.3021  
 Median :-0.7338   Median :-0.28325   Median :-0.3494   Median : 0.3021  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.2367   3rd Qu.: 0.08281   3rd Qu.:-0.2683   3rd Qu.: 0.3021  
 Max.   : 5.0893   Max.   :10.80959   Max.   :10.1169   Max.   : 2.2595  
      Age          
 Min.   :-1.90719  
 1st Qu.:-0.87496  
 Median : 0.08846  
 Mean   : 0.00000  
 3rd Qu.: 0.82823  
 Max.   : 1.80885  
```

The output shows that all the numerical variables have been standardized with a mean value of zero.
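The two-step pattern above (estimate the parameters, then apply them) is useful when data is split for training and testing, because parameters estimated on the training rows can be reused on new rows. The sketch below illustrates that idea in base R rather than with caret; the `train` and `test` data frames and their values are hypothetical and not part of the guide's dataset.

```r
# Hypothetical training and test incomes (illustrative values only)
train <- data.frame(Income = c(200000, 400000, 600000, 800000))
test  <- data.frame(Income = c(300000, 900000))

# Estimate the centering and scaling parameters from the training data only
mu    <- sapply(train, mean)
sigma <- sapply(train, sd)

# Apply the training parameters to both sets
train_std <- as.data.frame(scale(train, center = mu, scale = sigma))
test_std  <- as.data.frame(scale(test,  center = mu, scale = sigma))

train_std$Income
test_std$Income
```

The training column ends up with mean zero and standard deviation one, while the test column generally does not, which is expected because it was transformed with the training parameters.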

The same result can be obtained using the base R **scale** function, as shown below.

```r
dat_scaled <- as.data.frame(scale(dat[,c(1:4,6)]))

summary(dat_scaled$Income)
```

Output:

```
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.75854 -0.45597 -0.28325  0.00000  0.08281 10.80959 
```
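As a sketch of the arithmetic behind both approaches, each value has the column mean subtracted and is then divided by the column's sample standard deviation. The vector below reuses the first six Income values from the `glimpse()` output; the names `income`, `z_manual`, and `z_scale` are illustrative.

```r
# First six Income values printed by glimpse()
income <- c(183700, 192300, 222400, 240000, 213300, 263600)

# z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (income - mean(income)) / sd(income)

# scale() returns a one-column matrix; convert it to a plain vector to compare
z_scale <- as.numeric(scale(income))

all.equal(z_manual, z_scale)
```

Because only six values are used here, the resulting z-scores differ from those in the full-data output above.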

Min-max scaling rescales the data to a fixed range, usually 0 to 1. The rescaled variables have smaller standard deviations than the originals. Because the minimum and maximum define the range, however, a few extreme values can compress most observations into a narrow band, so this method does not by itself make the data robust to outliers. The steps are the same as for standardization; the only change is the 'method' argument, which is now set to "range".

```r
preproc2 <- preProcess(dat[,c(1:4,6)], method=c("range"))

norm2 <- predict(preproc2, dat[,c(1:4,6)])

summary(norm2)
```

Output:

```
   Dependents         Income         Loan_amount        Term_months    
 Min.   :0.0000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.02616   1st Qu.:0.005527   1st Qu.:0.7436  
 Median :0.0000   Median :0.04109   Median :0.007460   Median :0.7436  
 Mean   :0.1260   Mean   :0.06557   Mean   :0.040599   Mean   :0.7040  
 3rd Qu.:0.1667   3rd Qu.:0.07273   3rd Qu.:0.015158   3rd Qu.:0.7436  
 Max.   :1.0000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
      Age        
 Min.   :0.0000  
 1st Qu.:0.2778  
 Median :0.5370  
 Mean   :0.5132  
 3rd Qu.:0.7361  
 Max.   :1.0000  
```

The output above shows that all the values have been scaled to lie between 0 and 1.
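The "range" method corresponds to the usual min-max formula, (x - min) / (max - min). A minimal base R sketch of that formula follows, using the first six Age values from the `glimpse()` output; the names `age`, `rng`, and `age_scaled` are illustrative.

```r
# First six Age values printed by glimpse()
age <- c(40, 63, 42, 30, 43, 46)

# Min-max scaling: (x - min) / (max - min)
rng <- range(age)
age_scaled <- (age - rng[1]) / (rng[2] - rng[1])

# The smallest value maps to 0 and the largest to 1
range(age_scaled)
```

Since the minimum and maximum here come from six values only, the scaled numbers will not match the full-data summary above.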

Some statistical and machine learning methods assume, or work better with, variables that are approximately normally distributed. In real-world scenarios, however, data is often skewed. One technique to counter this, for right-skewed variables with strictly positive values, is to apply a logarithmic transformation.

The *first line of code* below prints a summary of the 'Income' variable. The output shows that the mean and median incomes of the applicants are $715,589 and $513,050, respectively.

The *second command* performs the logarithmic transformation, while the *third line* prints the summary of the transformed variable. The output shows that the mean and median of the new transformed variable are similar.

```r
summary(dat$Income)

logincome <- log(dat$Income)

summary(logincome)
```

Output:

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 173200  389550  513050  715589  774800 8444900 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.06   12.87   13.15   13.26   13.56   15.95 
```
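The effect can be reproduced on simulated data. The sketch below draws right-skewed values from a log-normal distribution (an assumption made purely for illustration; the guide does not state how Income is distributed) and compares the mean and median before and after the transformation.

```r
set.seed(42)

# Simulated right-skewed, strictly positive values standing in for an income-like variable
x <- rlnorm(1000, meanlog = 13, sdlog = 0.7)

# On the raw scale the long right tail pulls the mean above the median
mean(x) / median(x)

# After the log transform the mean and median nearly coincide
log_x <- log(x)
mean(log_x) - median(log_x)
```

Note that `log()` returns `-Inf` for zero and `NaN` for negative inputs, so this transformation only suits strictly positive variables.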

In this guide, you have learned the most commonly used data normalization techniques using the 'caret' package in R. These techniques will help you handle numerical variables of varying units and scales, which can improve the performance of your machine learning algorithm.
