Author avatar

Deepika Singh

Linear, Lasso, and Ridge Regression with R

Deepika Singh

  • Nov 12, 2019
  • 20 Min read
  • 220 Views
  • Nov 12, 2019
  • 20 Min read
  • 220 Views
Data
R

Introduction

Machine learning is used by many organizations to identify and solve business problems. The two types of supervised machine learning algorithms are classification and regression. This guide will focus on regression models that predict a continuous outcome. You'll learn how to implement linear and regularized regression models using R.

The topics we'll cover include:

  1. Linear Regression

  2. Ridge Regression

  3. Lasso Regression

  4. Elastic Net Regression

Let’s start by looking at a real-life situation and data.

Data

Unemployment is a critical socio-economic and political concern for any country, and hence, managing it is a chief task for any government. In this guide, we will build regression algorithms for predicting unemployment within an economy.

The data comes from US economic time series data available from http://research.stlouisfed.org/fred2. The data contains 574 rows and 5 variables, as described below:

  1. psavert: personal savings rate

  2. pce: personal consumption expenditures, in billions of dollars

  3. uempmed: median duration of unemployment, in weeks

  4. pop: total population, in million s

  5. unemploy: number unemployed, in millions. This is the dependent variable.

Evaluation Metrics

We will evaluate the performance of the model using two metrics: R-squared value and Root Mean Squared Error (RMSE). Ideally, lower RMSE and higher R-squared values are indicative of a good model.

Let's start by loading the required libraries and the data.

1
2
3
4
5
6
7
8
9
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)

dat <- read_csv("reg_data.csv")
glimpse(dat)
{r}

Output:

1
2
3
4
5
6
7
Observations: 574
Variables: 5
$ pce      <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9, 544.6, 5...
$ pop      <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1, 200.2, 2...
$ psavert  <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 12.0, 11.6,...
$ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4.5, 4.2, 4...
$ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2.9, 2.8, 2...

The output shows that all the variables in the dataset are numerical variables (labeled as 'dbl').

Data Partitioning

We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test set, while the last two lines print the dimensions of the training and test set. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.

1
2
3
4
5
6
7
8
9
set.seed(100) 

index = sample(1:nrow(dat), 0.7*nrow(dat)) 

train = dat[index,] # Create the training data 
test = dat[-index,] # Create the test data

dim(train)
dim(test)
{r}

Output:

1
2
3
4
[1] 401   5

[1] 173   5
 

Scaling the Numeric Features

The numeric features need to be scaled; otherwise, they may adversely influence the modeling process. The first line of code below creates a list that contains the names of independent numeric variables. The second line uses the preProcess function from the caret package to complete the scaling task. The pre-processing object is fit only to the training data, while the scaling is applied on both the train and test sets. This is done in the third and fourth lines of code below. The fifth line prints the summary of the scaled train dataset.

The output shows that now all the numeric features have a mean value of zero except the target variable, unemploy, which was not scaled.

1
2
3
4
5
6
7
8
cols = c('pce', 'pop', 'psavert', 'uempmed')

pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))

train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])

summary(train)
{r}

Output:

1
2
3
4
5
6
7
      pce               pop             psavert            uempmed           unemploy     
 Min.   :-1.2135   Min.   :-1.6049   Min.   :-1.89481   Min.   :-1.1200   Min.   : 2.700  
 1st Qu.:-0.8856   1st Qu.:-0.8606   1st Qu.:-0.76518   1st Qu.:-0.6198   1st Qu.: 6.500  
 Median :-0.2501   Median :-0.1416   Median :-0.05513   Median :-0.2447   Median : 7.500  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 7.791  
 3rd Qu.: 0.7797   3rd Qu.: 0.9060   3rd Qu.: 0.81629   3rd Qu.: 0.1055   3rd Qu.: 8.600  
 Max.   : 2.1244   Max.   : 1.8047   Max.   : 2.91417   Max.   : 4.1571   Max.   :15.300  

Linear Regression

The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The input variables are assumed to have a Gaussian distribution and are not correlated with each other (a problem called multi-collinearity).

The linear regression equation can be expressed in the following form: y = a1x1 + a2x2 + a3x3 + ..... + anxn + b

In the above equation:

  • y is the target variable.
  • x1, x2, x3, ... xn are the features.
  • a1, a2, a3, ... an are the coefficients.
  • b is the parameter of the model.

The parameters a and b in the model are selected through the ordinary least squares (OLS) method. This method works by minimizing the sum of squares of residuals (actual value - predicted value).

In order to fit the linear regression model, the first step is to instantiate the algorithm in the first line of code below using the lm() function. The second line prints the summary of the trained model.

1
2
lr = lm(unemploy ~ uempmed + psavert + pop + pce, data = train)
summary(lr)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Call:
lm(formula = unemploy ~ uempmed + psavert + pop + pce, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4262 -0.7253  0.0278  0.6697  3.2753 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.79077    0.04712 165.352  < 2e-16 ***
uempmed      2.18021    0.08588  25.386  < 2e-16 ***
psavert      0.79126    0.13244   5.975 5.14e-09 ***
pop          5.95419    0.37405  15.918  < 2e-16 ***
pce         -5.31578    0.32753 -16.230  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9435 on 396 degrees of freedom
Multiple R-squared:  0.8542,	Adjusted R-squared:  0.8527 
F-statistic: 579.9 on 4 and 396 DF,  p-value: < 2.2e-16 

The significance code ‘***’ in above output shows that all the features are important predictors. The Adjusted R-squared value of 0.8527 is also a good result. Let's evaluate the model further.

Model Evaluation Metrics

The first step is to create a function for calculating the evaluation metrics R-squared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#Step 1 - create the evaluation metrics function

eval_metrics = function(model, df, predictions, target){
    resids = df[,target] - predictions
    resids2 = resids**2
    N = length(predictions)
    r2 = as.character(round(summary(model)$r.squared, 2))
    adj_r2 = as.character(round(summary(model)$adj.r.squared, 2))
    print(adj_r2) #Adjusted R-squared
    print(as.character(round(sqrt(sum(resids2)/N), 2))) #RMSE
}

# Step 2 - predicting and evaluating the model on train data
predictions = predict(lr, newdata = train)
eval_metrics(lr, train, predictions, target = 'unemploy')

# Step 3 - predicting and evaluating the model on test data
predictions = predict(lr, newdata = test)
eval_metrics(lr, test, predictions, target = 'unemploy')
{r}

Output:

1
2
3
4
5
6
7
8
9
[1] "0.85"
      
[1] "0.94"
     

[1] "0.85"
     
[1] "1.1"
         

The above output shows that RMSE, one of the two evaluation metrics, is 0.94 million for train data and 1.1 million for test data. On the other hand, R-squared value is around 85 percent for both train and test data, which indicates good performance.

Regularization

Linear regression algorithm works by selecting coefficients for each independent variable that minimizes a loss function. However, if the coefficients are large, they can lead to over-fitting on the training dataset, and such a model will not generalize well on the unseen test data. To overcome this shortcoming, we'll do regularization, which penalizes large coefficients. The following sections of the guide will discuss various regularization algorithms.

We will be using the glmnet() package to build the regularized regression models. The glmnet function does not work with dataframes, so we need to create a numeric matrix for the training features and a vector of target values.

The lines of code below perform the task of creating model matrix using the dummyVars function from the caret package. The predict function is then applied to create numeric model matrices for training and test.

1
2
3
4
5
6
7
8
9
cols_reg = c('pce', 'pop', 'psavert', 'uempmed', 'unemploy')

dummies <- dummyVars(unemploy ~ ., data = dat[,cols_reg])

train_dummies = predict(dummies, newdata = train[,cols_reg])

test_dummies = predict(dummies, newdata = test[,cols_reg])

print(dim(train_dummies)); print(dim(test_dummies))
{r}

Output:

1
2
[1] 401   4
[1] 173   4

Ridge Regression

Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model. This modification is done by adding a penalty parameter that is equivalent to the square of the magnitude of the coefficients.

Loss function = OLS + alpha * summation (squared coefficient values)

Ridge regression is also referred to as l2 regularization. The lines of code below construct a ridge regression model. The first line loads the library, while the next two lines create the training data matrices for the independent (x) and dependent variables (y).

The same step is repeated for the test dataset in the fourth and fifth lines of code. The sixth line creates a list of lambda values for the model to try, while the seventh line builds the ridge regression model.

The arguments used in the model are:

  1. nlambda: determines the number of regularization parameters to be tested.

  2. alpha: determines the weighting to be used. In case of ridge regression, the value of alpha is zero.

  3. family: determines the distribution family to be used. Since this is a regression model, we will use the Gaussian distribution.

  4. lambda: determines the lambda values to be tried.

The last line of code prints the model information.

1
2
3
4
5
6
7
8
9
10
11
12
library(glmnet)

x = as.matrix(train_dummies)
y_train = train$unemploy

x_test = as.matrix(test_dummies)
y_test = test$unemploy

lambdas <- 10^seq(2, -3, by = -.1)
ridge_reg = glmnet(x, y_train, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)

summary(ridge_reg)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
          Length Class     Mode   
a0         51    -none-    numeric
beta      204    dgCMatrix S4     
df         51    -none-    numeric
dim         2    -none-    numeric
lambda     51    -none-    numeric
dev.ratio  51    -none-    numeric
nulldev     1    -none-    numeric
npasses     1    -none-    numeric
jerr        1    -none-    numeric
offset      1    -none-    logical
call        7    -none-    call   
nobs        1    -none-    numeric

One of the major differences between linear and regularized regression models is that the latter involves tuning a hyperparameter, lambda. The code above runs the glmnet() model several times for different values of lambda. We can automate this task of finding the optimal lambda value using the cv.glmnet() function. This is performed using the lines of code below.

1
2
3
cv_ridge <- cv.glmnet(x, y_train, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
optimal_lambda
{r}

Output:

1
[1] 0.001

The optimal lambda value comes out to be 0.001 and will be used to build the ridge regression model. We will also create a function for calculating and printing the results, which is done with the eval_results() function in the code below. The next step is to use the predict function to generate predictions on the train and test data. Finally, we use the eval_results function to calculate and print the evaluation metrics.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# Compute R^2 from true and predicted values
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE = sqrt(SSE/nrow(df))

  
  # Model performance metrics
data.frame(
  RMSE = RMSE,
  Rsquare = R_square
)
  
}

# Prediction and evaluation on train data
predictions_train <- predict(ridge_reg, s = optimal_lambda, newx = x)
eval_results(y_train, predictions_train, train)

# Prediction and evaluation on test data
predictions_test <- predict(ridge_reg, s = optimal_lambda, newx = x_test)
eval_results(y_test, predictions_test, test)
{r}

Output:

1
2
3
4
       RMSE   Rsquare
1 0.9383428 0.8539379
      RMSE   Rsquare
1 1.102452 0.8669624

The above output shows that the RMSE and R-squared values for the ridge regression model on the training data are 0.93 million and 85.4 percent, respectively. For the test data, the results for these metrics are 1.1 million and 86.7 percent, respectively. There is an improvement in the performance compared with linear regression model.

Lasso Regression

Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In lasso, the loss function is modified to minimize the complexity of the model by limiting the sum of the absolute values of the model coefficients (also called the l1-norm).

The loss function for lasso regression can be expressed as below:

Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)

In the above function, alpha is the penalty parameter we need to select. Using an l1-norm constraint forces some weight values to zero to allow other coefficients to take non-zero values.

The first step to build a lasso model is to find the optimal lambda value using the code below. For lasso regression, the alpha value is 1. The output is the best cross-validated lambda, which comes out to be 0.001.

1
2
3
4
5
6
7
8
lambdas <- 10^seq(2, -3, by = -.1)

# Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(x, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)

# Best 
lambda_best <- lasso_reg$lambda.min 
lambda_best
{r}

Output:

1
[1] 0.001

Once we have the optimal lambda value, we train the lasso model in the first line of code below. The second through fifth lines of code generate the predictions and print the evaluation metrics for both the training and test datasets.

1
2
3
4
5
6
7
lasso_model <- glmnet(x, y_train, alpha = 1, lambda = lambda_best, standardize = TRUE)

predictions_train <- predict(lasso_model, s = lambda_best, newx = x)
eval_results(y_train, predictions_train, train)

predictions_test <- predict(lasso_model, s = lambda_best, newx = x_test)
eval_results(y_test, predictions_test, test)
{r}

Output:

1
2
3
4
       RMSE  Rsquare
1 0.9378347 0.854096
      RMSE   Rsquare
1 1.099764 0.8676104

The above output shows that the RMSE and R-squared values on the training data are 0.93 million and 85.4 percent, respectively. The results on the test data are 1.1 million and 86.7 percent, respectively. Lasso regression can also be used for feature selection because the coefficients of less important features are reduced to zero.

Elastic Net Regression

Elastic net regression combines the properties of ridge and lasso regression. It works by penalizing the model using both the 1l2-norm1 and the 1l1-norm1. The model can be easily built using the caret package, which automatically selects the optimal value of parameters alpha and lambda.

The first line of code creates the training control object train_cont which specifies how the repeated cross validation will take place. The second line builds the elastic regression model in which a range of possible alpha and lambda values are tested and their optimum value is selected. The argument tuneLength specifies that 10 different combinations of values for alpha and lambda are to be tested.

The last line of code prints the optimum values, which come out to be 0.4322 for alpha and 0.0065 for lambda.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Set training control
train_cont <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 5,
                              search = "random",
                              verboseIter = TRUE)

# Train the model
elastic_reg <- train(unemploy ~ .,
                           data = train,
                           method = "glmnet",
                           preProcess = c("center", "scale"),
                           tuneLength = 10,
                           trControl = train_cont)


# Best tuning parameter
elastic_reg$bestTune
{r}

Output:

1
2
      alpha      lambda
2 0.1996298 0.001183022

Once we have trained the model, we use it to generate the predictions and print the evaluation results for both the training and test datasets, using the lines of code below.

1
2
3
4
5
6
7
# Make predictions on training set
predictions_train <- predict(elastic_reg, x)
eval_results(y_train, predictions_train, train) 

# Make predictions on test set
predictions_test <- predict(elastic_reg, x_test)
eval_results(y_test, predictions_test, test)
{r}

Output:

1
2
3
4
       RMSE  Rsquare
1 0.9495596 0.850425
      RMSE   Rsquare
1 1.128873 0.8605094

The above output shows that the RMSE and R-squared values for the elastic net regression model on the training data are 0.95 million and 85 percent, respectively. The results for these metrics on the test data are 1.12 million and 86 percent, respectively.

Conclusion

In this guide, you have learned about linear regression models using the powerful R language. You also learned about regularization techniques to avoid the shortcomings of the linear regression models.

The performance of the models is summarized below:

  1. Linear Regression Model: Test set RMSE of 1.1 million and R-square of 85 percent.

  2. Ridge Regression Model: Test set RMSE of 1.1 million and R-square of 86.7 percent.

  3. Lasso Regression Model: Test set RMSE of 1.09 million and R-square of 86.7 percent.

  4. ElasticNet Regression Model: Test set RMSE of 1.12 million and R-square of 86 percent.

The regularized regression models are performing better than the linear regression model. Overall, all the models are performing well with decent R-squared and stable RMSE values. The most ideal result would be an RMSE value of zero and R-squared value of 1, but that's almost impossible in real economic datasets.

To learn more about data science using R, please refer to the following guides:

0