Linear, Lasso, and Ridge Regression with R
Nov 12, 2019 • 20 Minute Read
Introduction
Machine learning is used by many organizations to identify and solve business problems. The two types of supervised machine learning algorithms are classification and regression. This guide will focus on regression models that predict a continuous outcome. You'll learn how to implement linear and regularized regression models using R.
The topics we'll cover include:

Linear Regression

Ridge Regression

Lasso Regression

Elastic Net Regression
Let’s start by looking at a reallife situation and data.
Data
Unemployment is a critical socioeconomic and political concern for any country, and hence, managing it is a chief task for any government. In this guide, we will build regression algorithms for predicting unemployment within an economy.
The data comes from US economic time series data available from https://research.stlouisfed.org/fred2. The data contains 574 rows and 5 variables, as described below:

psavert: personal savings rate

pce: personal consumption expenditures, in billions of dollars

uempmed: median duration of unemployment, in weeks

pop: total population, in million s

unemploy: number unemployed, in millions. This is the dependent variable.
Evaluation Metrics
We will evaluate the performance of the model using two metrics: Rsquared value and Root Mean Squared Error (RMSE). Ideally, lower RMSE and higher Rsquared values are indicative of a good model.
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
dat < read_csv("reg_data.csv")
glimpse(dat)
Output:
Observations: 574
Variables: 5
$ pce <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9, 544.6, 5...
$ pop <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1, 200.2, 2...
$ psavert <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 12.0, 11.6,...
$ uempmed <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4.5, 4.2, 4...
$ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2.9, 2.8, 2...
The output shows that all the variables in the dataset are numerical variables (labeled as 'dbl').
Data Partitioning
We will build our model on the training set and evaluate its performance on the test set. This is called the holdoutvalidation approach for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test set, while the last two lines print the dimensions of the training and test set. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.
set.seed(100)
index = sample(1:nrow(dat), 0.7*nrow(dat))
train = dat[index,] # Create the training data
test = dat[index,] # Create the test data
dim(train)
dim(test)
Output:
1] 401 5
[1] 173 5
Scaling the Numeric Features
The numeric features need to be scaled; otherwise, they may adversely influence the modeling process. The first line of code below creates a list that contains the names of independent numeric variables. The second line uses the preProcess function from the caret package to complete the scaling task. The preprocessing object is fit only to the training data, while the scaling is applied on both the train and test sets. This is done in the third and fourth lines of code below. The fifth line prints the summary of the scaled train dataset.
The output shows that now all the numeric features have a mean value of zero except the target variable, unemploy, which was not scaled.
cols = c('pce', 'pop', 'psavert', 'uempmed')
pre_proc_val < preProcess(train[,cols], method = c("center", "scale"))
train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])
summary(train)
Output:
pce pop psavert uempmed unemploy
Min. :1.2135 Min. :1.6049 Min. :1.89481 Min. :1.1200 Min. : 2.700
1st Qu.:0.8856 1st Qu.:0.8606 1st Qu.:0.76518 1st Qu.:0.6198 1st Qu.: 6.500
Median :0.2501 Median :0.1416 Median :0.05513 Median :0.2447 Median : 7.500
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 7.791
3rd Qu.: 0.7797 3rd Qu.: 0.9060 3rd Qu.: 0.81629 3rd Qu.: 0.1055 3rd Qu.: 8.600
Max. : 2.1244 Max. : 1.8047 Max. : 2.91417 Max. : 4.1571 Max. :15.300
Linear Regression
The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The input variables are assumed to have a Gaussian distribution and are not correlated with each other (a problem called multicollinearity).
The linear regression equation can be expressed in the following form: y = a1x1 + a2x2 + a3x3 + ..... + anxn + b
In the above equation:
 y is the target variable.
 x1, x2, x3, ... xn are the features.
 a1, a2, a3, ... an are the coefficients.
 b is the parameter of the model.
The parameters a and b in the model are selected through the ordinary least squares (OLS) method. This method works by minimizing the sum of squares of residuals (actual value  predicted value).
In order to fit the linear regression model, the first step is to instantiate the algorithm in the first line of code below using the lm() function. The second line prints the summary of the trained model.
lr = lm(unemploy ~ uempmed + psavert + pop + pce, data = train)
summary(lr)
Output:
Call:
lm(formula = unemploy ~ uempmed + psavert + pop + pce, data = train)
Residuals:
Min 1Q Median 3Q Max
2.4262 0.7253 0.0278 0.6697 3.2753
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 7.79077 0.04712 165.352 < 2e16 ***
uempmed 2.18021 0.08588 25.386 < 2e16 ***
psavert 0.79126 0.13244 5.975 5.14e09 ***
pop 5.95419 0.37405 15.918 < 2e16 ***
pce 5.31578 0.32753 16.230 < 2e16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9435 on 396 degrees of freedom
Multiple Rsquared: 0.8542, Adjusted Rsquared: 0.8527
Fstatistic: 579.9 on 4 and 396 DF, pvalue: < 2.2e16
The significance code ‘***’ in above output shows that all the features are important predictors. The Adjusted Rsquared value of 0.8527 is also a good result. Let's evaluate the model further.
Model Evaluation Metrics
The first step is to create a function for calculating the evaluation metrics Rsquared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.
#Step 1  create the evaluation metrics function
eval_metrics = function(model, df, predictions, target){
resids = df[,target]  predictions
resids2 = resids**2
N = length(predictions)
r2 = as.character(round(summary(model)$r.squared, 2))
adj_r2 = as.character(round(summary(model)$adj.r.squared, 2))
print(adj_r2) #Adjusted Rsquared
print(as.character(round(sqrt(sum(resids2)/N), 2))) #RMSE
}
# Step 2  predicting and evaluating the model on train data
predictions = predict(lr, newdata = train)
eval_metrics(lr, train, predictions, target = 'unemploy')
# Step 3  predicting and evaluating the model on test data
predictions = predict(lr, newdata = test)
eval_metrics(lr, test, predictions, target = 'unemploy')
Output:
1] "0.85"
[1] "0.94"
[1] "0.85"
[1] "1.1"
The above output shows that RMSE, one of the two evaluation metrics, is 0.94 million for train data and 1.1 million for test data. On the other hand, Rsquared value is around 85 percent for both train and test data, which indicates good performance.
Regularization
Linear regression algorithm works by selecting coefficients for each independent variable that minimizes a loss function. However, if the coefficients are large, they can lead to overfitting on the training dataset, and such a model will not generalize well on the unseen test data. To overcome this shortcoming, we'll do regularization, which penalizes large coefficients. The following sections of the guide will discuss various regularization algorithms.
We will be using the glmnet() package to build the regularized regression models. The glmnet function does not work with dataframes, so we need to create a numeric matrix for the training features and a vector of target values.
The lines of code below perform the task of creating model matrix using the dummyVars function from the caret package. The predict function is then applied to create numeric model matrices for training and test.
cols_reg = c('pce', 'pop', 'psavert', 'uempmed', 'unemploy')
dummies < dummyVars(unemploy ~ ., data = dat[,cols_reg])
train_dummies = predict(dummies, newdata = train[,cols_reg])
test_dummies = predict(dummies, newdata = test[,cols_reg])
print(dim(train_dummies)); print(dim(test_dummies))
Output:
1] 401 4
[1] 173 4
Ridge Regression
Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model. This modification is done by adding a penalty parameter that is equivalent to the square of the magnitude of the coefficients.
Loss function = OLS + alpha * summation (squared coefficient values)
Ridge regression is also referred to as l2 regularization. The lines of code below construct a ridge regression model. The first line loads the library, while the next two lines create the training data matrices for the independent (x) and dependent variables (y).
The same step is repeated for the test dataset in the fourth and fifth lines of code. The sixth line creates a list of lambda values for the model to try, while the seventh line builds the ridge regression model.
The arguments used in the model are:

nlambda: determines the number of regularization parameters to be tested.

alpha: determines the weighting to be used. In case of ridge regression, the value of alpha is zero.

family: determines the distribution family to be used. Since this is a regression model, we will use the Gaussian distribution.

lambda: determines the lambda values to be tried.
The last line of code prints the model information.
library(glmnet)
x = as.matrix(train_dummies)
y_train = train$unemploy
x_test = as.matrix(test_dummies)
y_test = test$unemploy
lambdas < 10^seq(2, 3, by = .1)
ridge_reg = glmnet(x, y_train, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)
summary(ridge_reg)
Output:
Length Class Mode
a0 51 none numeric
beta 204 dgCMatrix S4
df 51 none numeric
dim 2 none numeric
lambda 51 none numeric
dev.ratio 51 none numeric
nulldev 1 none numeric
npasses 1 none numeric
jerr 1 none numeric
offset 1 none logical
call 7 none call
nobs 1 none numeric
One of the major differences between linear and regularized regression models is that the latter involves tuning a hyperparameter, lambda. The code above runs the glmnet() model several times for different values of lambda. We can automate this task of finding the optimal lambda value using the cv.glmnet() function. This is performed using the lines of code below.
cv_ridge < cv.glmnet(x, y_train, alpha = 0, lambda = lambdas)
optimal_lambda < cv_ridge$lambda.min
optimal_lambda
Output:
1] 0.001
The optimal lambda value comes out to be 0.001 and will be used to build the ridge regression model. We will also create a function for calculating and printing the results, which is done with the eval_results() function in the code below. The next step is to use the predict function to generate predictions on the train and test data. Finally, we use the eval_results function to calculate and print the evaluation metrics.
# Compute R^2 from true and predicted values
eval_results < function(true, predicted, df) {
SSE < sum((predicted  true)^2)
SST < sum((true  mean(true))^2)
R_square < 1  SSE / SST
RMSE = sqrt(SSE/nrow(df))
# Model performance metrics
data.frame(
RMSE = RMSE,
Rsquare = R_square
)
}
# Prediction and evaluation on train data
predictions_train < predict(ridge_reg, s = optimal_lambda, newx = x)
eval_results(y_train, predictions_train, train)
# Prediction and evaluation on test data
predictions_test < predict(ridge_reg, s = optimal_lambda, newx = x_test)
eval_results(y_test, predictions_test, test)
Output:
RMSE Rsquare
1 0.9383428 0.8539379
RMSE Rsquare
1 1.102452 0.8669624
The above output shows that the RMSE and Rsquared values for the ridge regression model on the training data are 0.93 million and 85.4 percent, respectively. For the test data, the results for these metrics are 1.1 million and 86.7 percent, respectively. There is an improvement in the performance compared with linear regression model.
Lasso Regression
Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In lasso, the loss function is modified to minimize the complexity of the model by limiting the sum of the absolute values of the model coefficients (also called the l1norm).
The loss function for lasso regression can be expressed as below:
Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)
In the above function, alpha is the penalty parameter we need to select. Using an l1norm constraint forces some weight values to zero to allow other coefficients to take nonzero values.
The first step to build a lasso model is to find the optimal lambda value using the code below. For lasso regression, the alpha value is 1. The output is the best crossvalidated lambda, which comes out to be 0.001.
lambdas < 10^seq(2, 3, by = .1)
# Setting alpha = 1 implements lasso regression
lasso_reg < cv.glmnet(x, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)
# Best
lambda_best < lasso_reg$lambda.min
lambda_best
Output:
1] 0.001
Once we have the optimal lambda value, we train the lasso model in the first line of code below. The second through fifth lines of code generate the predictions and print the evaluation metrics for both the training and test datasets.
lasso_model < glmnet(x, y_train, alpha = 1, lambda = lambda_best, standardize = TRUE)
predictions_train < predict(lasso_model, s = lambda_best, newx = x)
eval_results(y_train, predictions_train, train)
predictions_test < predict(lasso_model, s = lambda_best, newx = x_test)
eval_results(y_test, predictions_test, test)
Output:
RMSE Rsquare
1 0.9378347 0.854096
RMSE Rsquare
1 1.099764 0.8676104
The above output shows that the RMSE and Rsquared values on the training data are 0.93 million and 85.4 percent, respectively. The results on the test data are 1.1 million and 86.7 percent, respectively. Lasso regression can also be used for feature selection because the coeﬃcients of less important features are reduced to zero.
Elastic Net Regression
Elastic net regression combines the properties of ridge and lasso regression. It works by penalizing the model using both the 1l2norm1 and the 1l1norm1. The model can be easily built using the caret package, which automatically selects the optimal value of parameters alpha and lambda.
The first line of code creates the training control object train_cont which specifies how the repeated cross validation will take place. The second line builds the elastic regression model in which a range of possible alpha and lambda values are tested and their optimum value is selected. The argument tuneLength specifies that 10 different combinations of values for alpha and lambda are to be tested.
The last line of code prints the optimum values, which come out to be 0.4322 for alpha and 0.0065 for lambda.
# Set training control
train_cont < trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
search = "random",
verboseIter = TRUE)
# Train the model
elastic_reg < train(unemploy ~ .,
data = train,
method = "glmnet",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = train_cont)
# Best tuning parameter
elastic_reg$bestTune
Output:
alpha lambda
2 0.1996298 0.001183022
Once we have trained the model, we use it to generate the predictions and print the evaluation results for both the training and test datasets, using the lines of code below.
# Make predictions on training set
predictions_train < predict(elastic_reg, x)
eval_results(y_train, predictions_train, train)
# Make predictions on test set
predictions_test < predict(elastic_reg, x_test)
eval_results(y_test, predictions_test, test)
Output:
RMSE Rsquare
1 0.9495596 0.850425
RMSE Rsquare
1 1.128873 0.8605094
The above output shows that the RMSE and Rsquared values for the elastic net regression model on the training data are 0.95 million and 85 percent, respectively. The results for these metrics on the test data are 1.12 million and 86 percent, respectively.
Conclusion
In this guide, you have learned about linear regression models using the powerful R language. You also learned about regularization techniques to avoid the shortcomings of the linear regression models.
The performance of the models is summarized below:

Linear Regression Model: Test set RMSE of 1.1 million and Rsquare of 85 percent.

Ridge Regression Model: Test set RMSE of 1.1 million and Rsquare of 86.7 percent.

Lasso Regression Model: Test set RMSE of 1.09 million and Rsquare of 86.7 percent.

ElasticNet Regression Model: Test set RMSE of 1.12 million and Rsquare of 86 percent.
The regularized regression models are performing better than the linear regression model. Overall, all the models are performing well with decent Rsquared and stable RMSE values. The most ideal result would be an RMSE value of zero and Rsquared value of 1, but that's almost impossible in real economic datasets.
To learn more about data science using R, please refer to the following guides:
 Interpreting Data Using Descriptive Statistics with R
 Interpreting Data Using Statistical Models with R
 Time Series Forecasting Using R
 Hypothesis Testing  Interpreting Data with Statistical Models
 Machine Learning with Text Data Using R
 Visualization of Text Data Using Word Cloud in R
 Exploring Data Visually with R