Introduction

139

Machine learning is used by many organizations to identify and solve business problems. The two types of supervised machine learning algorithms are *classification* and *regression*. This guide will focus on regression models that predict a continuous outcome. You'll learn how to implement linear and regularized regression models using R.

The topics we'll cover include:

Linear Regression

Ridge Regression

Lasso Regression

Elastic Net Regression

Let’s start by looking at a real-life situation and data.

Unemployment is a critical socio-economic and political concern for any country, and hence, managing it is a chief task for any government. In this guide, we will build regression algorithms for predicting unemployment within an economy.

The data comes from US economic time series data available from http://research.stlouisfed.org/fred2. The data contains 574 rows and 5 variables, as described below:

`psavert`

: personal savings rate`pce`

: personal consumption expenditures, in billions of dollars`uempmed`

: median duration of unemployment, in weeks`pop`

: total population, in million s`unemploy`

: number unemployed, in millions. This is the dependent variable.

We will evaluate the performance of the model using two metrics: R-squared value and Root Mean Squared Error (RMSE). Ideally, lower RMSE and higher R-squared values are indicative of a good model.

Let's start by loading the required libraries and the data.

`1 2 3 4 5 6 7 8 9`

`library(plyr) library(readr) library(dplyr) library(caret) library(ggplot2) library(repr) dat <- read_csv("reg_data.csv") glimpse(dat)`

{r}

Output:

`1 2 3 4 5 6 7`

`Observations: 574 Variables: 5 $ pce <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9, 544.6, 5... $ pop <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1, 200.2, 2... $ psavert <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 12.0, 11.6,... $ uempmed <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4.5, 4.2, 4... $ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2.9, 2.8, 2...`

The output shows that all the variables in the dataset are numerical variables (labeled as 'dbl').

We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test set, while the last two lines print the dimensions of the training and test set. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.

`1 2 3 4 5 6 7 8 9`

`set.seed(100) index = sample(1:nrow(dat), 0.7*nrow(dat)) train = dat[index,] # Create the training data test = dat[-index,] # Create the test data dim(train) dim(test)`

{r}

Output:

`1 2 3 4`

`[1] 401 5 [1] 173 5`

The numeric features need to be scaled; otherwise, they may adversely influence the modeling process.
The first line of code below creates a list that contains the names of independent numeric variables. The second line uses the `preProcess`

function from the caret package to complete the scaling task. The pre-processing object is fit only to the training data, while the scaling is applied on both the train and test sets. This is done in the third and fourth lines of code below. The fifth line prints the summary of the scaled train dataset.

The output shows that now all the numeric features have a mean value of zero except the target variable, `unemploy`

, which was not scaled.

`1 2 3 4 5 6 7 8`

`cols = c('pce', 'pop', 'psavert', 'uempmed') pre_proc_val <- preProcess(train[,cols], method = c("center", "scale")) train[,cols] = predict(pre_proc_val, train[,cols]) test[,cols] = predict(pre_proc_val, test[,cols]) summary(train)`

{r}

Output:

`1 2 3 4 5 6 7`

`pce pop psavert uempmed unemploy Min. :-1.2135 Min. :-1.6049 Min. :-1.89481 Min. :-1.1200 Min. : 2.700 1st Qu.:-0.8856 1st Qu.:-0.8606 1st Qu.:-0.76518 1st Qu.:-0.6198 1st Qu.: 6.500 Median :-0.2501 Median :-0.1416 Median :-0.05513 Median :-0.2447 Median : 7.500 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 7.791 3rd Qu.: 0.7797 3rd Qu.: 0.9060 3rd Qu.: 0.81629 3rd Qu.: 0.1055 3rd Qu.: 8.600 Max. : 2.1244 Max. : 1.8047 Max. : 2.91417 Max. : 4.1571 Max. :15.300`

The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The input variables are assumed to have a Gaussian distribution and are not correlated with each other (a problem called multi-collinearity).

The linear regression equation can be expressed in the following form:
`y = a1x1 + a2x2 + a3x3 + ..... + anxn + b`

In the above equation:

`y`

is the target variable.`x1, x2, x3, ... xn`

are the features.`a1, a2, a3, ... an`

are the coefficients.`b`

is the parameter of the model.

The parameters `a`

and `b`

in the model are selected through the ordinary least squares (OLS) method. This method works by minimizing the sum of squares of residuals (actual value - predicted value).

In order to fit the linear regression model, the first step is to instantiate the algorithm in the first line of code below using the `lm()`

function. The second line prints the summary of the trained model.

`1 2`

`lr = lm(unemploy ~ uempmed + psavert + pop + pce, data = train) summary(lr)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20`

`Call: lm(formula = unemploy ~ uempmed + psavert + pop + pce, data = train) Residuals: Min 1Q Median 3Q Max -2.4262 -0.7253 0.0278 0.6697 3.2753 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.79077 0.04712 165.352 < 2e-16 *** uempmed 2.18021 0.08588 25.386 < 2e-16 *** psavert 0.79126 0.13244 5.975 5.14e-09 *** pop 5.95419 0.37405 15.918 < 2e-16 *** pce -5.31578 0.32753 -16.230 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.9435 on 396 degrees of freedom Multiple R-squared: 0.8542, Adjusted R-squared: 0.8527 F-statistic: 579.9 on 4 and 396 DF, p-value: < 2.2e-16`

The significance code `‘***’`

in above output shows that all the features are important predictors. The Adjusted R-squared value of 0.8527 is also a good result. Let's evaluate the model further.

The first step is to create a function for calculating the evaluation metrics R-squared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.

`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19`

`#Step 1 - create the evaluation metrics function eval_metrics = function(model, df, predictions, target){ resids = df[,target] - predictions resids2 = resids**2 N = length(predictions) r2 = as.character(round(summary(model)$r.squared, 2)) adj_r2 = as.character(round(summary(model)$adj.r.squared, 2)) print(adj_r2) #Adjusted R-squared print(as.character(round(sqrt(sum(resids2)/N), 2))) #RMSE } # Step 2 - predicting and evaluating the model on train data predictions = predict(lr, newdata = train) eval_metrics(lr, train, predictions, target = 'unemploy') # Step 3 - predicting and evaluating the model on test data predictions = predict(lr, newdata = test) eval_metrics(lr, test, predictions, target = 'unemploy')`

{r}

Output:

`1 2 3 4 5 6 7 8 9`

`[1] "0.85" [1] "0.94" [1] "0.85" [1] "1.1"`

The above output shows that RMSE, one of the two evaluation metrics, is 0.94 million for train data and 1.1 million for test data. On the other hand, R-squared value is around 85 percent for both train and test data, which indicates good performance.

Linear regression algorithm works by selecting coefficients for each independent variable that minimizes a loss function. However, if the coefficients are large, they can lead to over-fitting on the training dataset, and such a model will not generalize well on the unseen test data. To overcome this shortcoming, we'll do regularization, which penalizes large coefficients. The following sections of the guide will discuss various regularization algorithms.

We will be using the `glmnet()`

package to build the regularized regression models. The `glmnet`

function does not work with dataframes, so we need to create a numeric matrix for the training features and a vector of target values.

The lines of code below perform the task of creating model matrix using the `dummyVars`

function from the caret package. The `predict`

function is then applied to create numeric model matrices for training and test.

`1 2 3 4 5 6 7 8 9`

`cols_reg = c('pce', 'pop', 'psavert', 'uempmed', 'unemploy') dummies <- dummyVars(unemploy ~ ., data = dat[,cols_reg]) train_dummies = predict(dummies, newdata = train[,cols_reg]) test_dummies = predict(dummies, newdata = test[,cols_reg]) print(dim(train_dummies)); print(dim(test_dummies))`

{r}

Output:

`1 2`

`[1] 401 4 [1] 173 4`

Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model. This modification is done by adding a penalty parameter that is equivalent to the square of the magnitude of the coefficients.

`Loss function = OLS + alpha * summation (squared coefficient values)`

Ridge regression is also referred to as *l2 regularization*. The lines of code below construct a ridge regression model. The first line loads the library, while the next two lines create the training data matrices for the independent (x) and dependent variables (y).

The same step is repeated for the test dataset in the fourth and fifth lines of code. The sixth line creates a list of lambda values for the model to try, while the seventh line builds the ridge regression model.

The arguments used in the model are:

`nlambda`

: determines the number of regularization parameters to be tested.`alpha`

: determines the weighting to be used. In case of ridge regression, the value of alpha is zero.`family`

: determines the distribution family to be used. Since this is a regression model, we will use the Gaussian distribution.`lambda`

: determines the lambda values to be tried.

The last line of code prints the model information.

`1 2 3 4 5 6 7 8 9 10 11 12`

`library(glmnet) x = as.matrix(train_dummies) y_train = train$unemploy x_test = as.matrix(test_dummies) y_test = test$unemploy lambdas <- 10^seq(2, -3, by = -.1) ridge_reg = glmnet(x, y_train, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas) summary(ridge_reg)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12 13`

`Length Class Mode a0 51 -none- numeric beta 204 dgCMatrix S4 df 51 -none- numeric dim 2 -none- numeric lambda 51 -none- numeric dev.ratio 51 -none- numeric nulldev 1 -none- numeric npasses 1 -none- numeric jerr 1 -none- numeric offset 1 -none- logical call 7 -none- call nobs 1 -none- numeric`

One of the major differences between linear and regularized regression models is that the latter involves tuning a hyperparameter, lambda. The code above runs the `glmnet()`

model several times for different values of lambda. We can automate this task of finding the optimal lambda value using the `cv.glmnet()`

function. This is performed using the lines of code below.

`1 2 3`

`cv_ridge <- cv.glmnet(x, y_train, alpha = 0, lambda = lambdas) optimal_lambda <- cv_ridge$lambda.min optimal_lambda`

{r}

Output:

`1`

`[1] 0.001`

The optimal lambda value comes out to be 0.001 and will be used to build the ridge regression model. We will also create a function for calculating and printing the results, which is done with the `eval_results()`

function in the code below. The next step is to use the `predict`

function to generate predictions on the train and test data. Finally, we use the `eval_results`

function to calculate and print the evaluation metrics.

`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23`

`# Compute R^2 from true and predicted values eval_results <- function(true, predicted, df) { SSE <- sum((predicted - true)^2) SST <- sum((true - mean(true))^2) R_square <- 1 - SSE / SST RMSE = sqrt(SSE/nrow(df)) # Model performance metrics data.frame( RMSE = RMSE, Rsquare = R_square ) } # Prediction and evaluation on train data predictions_train <- predict(ridge_reg, s = optimal_lambda, newx = x) eval_results(y_train, predictions_train, train) # Prediction and evaluation on test data predictions_test <- predict(ridge_reg, s = optimal_lambda, newx = x_test) eval_results(y_test, predictions_test, test)`

{r}

Output:

`1 2 3 4`

`RMSE Rsquare 1 0.9383428 0.8539379 RMSE Rsquare 1 1.102452 0.8669624`

The above output shows that the RMSE and R-squared values for the ridge regression model on the training data are 0.93 million and 85.4 percent, respectively. For the test data, the results for these metrics are 1.1 million and 86.7 percent, respectively. There is an improvement in the performance compared with linear regression model.

Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In lasso, the loss function is modified to minimize the complexity of the model by limiting the sum of the absolute values of the model coefficients (also called the `l1-norm`

).

The loss function for lasso regression can be expressed as below:

`Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)`

In the above function, `alpha`

is the penalty parameter we need to select. Using an `l1-norm`

constraint forces some weight values to zero to allow other coefficients to take non-zero values.

The first step to build a lasso model is to find the optimal lambda value using the code below. For lasso regression, the alpha value is 1. The output is the best cross-validated lambda, which comes out to be 0.001.

`1 2 3 4 5 6 7 8`

`lambdas <- 10^seq(2, -3, by = -.1) # Setting alpha = 1 implements lasso regression lasso_reg <- cv.glmnet(x, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5) # Best lambda_best <- lasso_reg$lambda.min lambda_best`

{r}

Output:

`1`

`[1] 0.001`

Once we have the optimal lambda value, we train the lasso model in the first line of code below. The second through fifth lines of code generate the predictions and print the evaluation metrics for both the training and test datasets.

`1 2 3 4 5 6 7`

`lasso_model <- glmnet(x, y_train, alpha = 1, lambda = lambda_best, standardize = TRUE) predictions_train <- predict(lasso_model, s = lambda_best, newx = x) eval_results(y_train, predictions_train, train) predictions_test <- predict(lasso_model, s = lambda_best, newx = x_test) eval_results(y_test, predictions_test, test)`

{r}

Output:

`1 2 3 4`

`RMSE Rsquare 1 0.9378347 0.854096 RMSE Rsquare 1 1.099764 0.8676104`

The above output shows that the RMSE and R-squared values on the training data are 0.93 million and 85.4 percent, respectively. The results on the test data are 1.1 million and 86.7 percent, respectively. Lasso regression can also be used for feature selection because the coeﬃcients of less important features are reduced to zero.

Elastic net regression combines the properties of ridge and lasso regression. It works by penalizing the model using both the 1l2-norm1 and the 1l1-norm1. The model can be easily built using the caret package, which automatically selects the optimal value of parameters alpha and lambda.

The first line of code creates the training control object `train_cont`

which specifies how the repeated cross validation will take place. The second line builds the elastic regression model in which a range of possible alpha and lambda values are tested and their optimum value is selected. The argument `tuneLength`

specifies that 10 different combinations of values for alpha and lambda are to be tested.

The last line of code prints the optimum values, which come out to be 0.4322 for alpha and 0.0065 for lambda.

`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18`

`# Set training control train_cont <- trainControl(method = "repeatedcv", number = 10, repeats = 5, search = "random", verboseIter = TRUE) # Train the model elastic_reg <- train(unemploy ~ ., data = train, method = "glmnet", preProcess = c("center", "scale"), tuneLength = 10, trControl = train_cont) # Best tuning parameter elastic_reg$bestTune`

{r}

Output:

`1 2`

`alpha lambda 2 0.1996298 0.001183022`

Once we have trained the model, we use it to generate the predictions and print the evaluation results for both the training and test datasets, using the lines of code below.

`1 2 3 4 5 6 7`

`# Make predictions on training set predictions_train <- predict(elastic_reg, x) eval_results(y_train, predictions_train, train) # Make predictions on test set predictions_test <- predict(elastic_reg, x_test) eval_results(y_test, predictions_test, test)`

{r}

Output:

`1 2 3 4`

`RMSE Rsquare 1 0.9495596 0.850425 RMSE Rsquare 1 1.128873 0.8605094`

The above output shows that the RMSE and R-squared values for the elastic net regression model on the training data are 0.95 million and 85 percent, respectively. The results for these metrics on the test data are 1.12 million and 86 percent, respectively.

In this guide, you have learned about linear regression models using the powerful R language. You also learned about regularization techniques to avoid the shortcomings of the linear regression models.

The performance of the models is summarized below:

Linear Regression Model: Test set RMSE of 1.1 million and R-square of 85 percent.

Ridge Regression Model: Test set RMSE of 1.1 million and R-square of 86.7 percent.

Lasso Regression Model: Test set RMSE of 1.09 million and R-square of 86.7 percent.

ElasticNet Regression Model: Test set RMSE of 1.12 million and R-square of 86 percent.

The regularized regression models are performing better than the linear regression model. Overall, all the models are performing well with decent R-squared and stable RMSE values. The most ideal result would be an RMSE value of zero and R-squared value of 1, but that's almost impossible in real economic datasets.

To learn more about data science using R, please refer to the following guides:

139