Linear, Lasso, and Ridge Regression with scikit-learn
May 17, 2019 • 17 Minute Read
Introduction
Supervised Machine Learning is being used by many organizations to identify and solve business problems. The two types of algorithms commonly used are Classification and Regression. In the previous guide, Scikit Machine Learning, we learned how to build a classification algorithm with scikit-learn.
In this guide, the focus will be on Regression. Regression models predict a continuous outcome. A few examples include predicting the unemployment level in a country, the sales of a retail store, the number of matches a team will win in a baseball league, or the number of seats a party will win in an election.
In this guide, you will learn how to implement the following linear regression models using scikit-learn:
 Linear Regression
 Ridge Regression
 Lasso Regression
 Elastic Net Regression
As always, the first step is to understand the Problem Statement.
Problem Statement
Unemployment is a big socioeconomic and political concern for any country and, hence, managing it is a chief task for any government. In this guide, we will try to build regression algorithms for predicting unemployment within an economy.
The data used in this project was produced from US economic time series data available from [https://research.stlouisfed.org/fred2]. The data contains 574 rows and 5 variables, as described below:
 psavert - personal savings rate.
 pce - personal consumption expenditures, in billions of dollars.
 uempmed - median duration of unemployment, in weeks.
 pop - total population, in thousands.
 unemploy - number of unemployed, in thousands (dependent variable).
Evaluation Metrics
We will evaluate the performance of the model using two metrics: the R-squared value and the Root Mean Squared Error (RMSE).
R-squared is a statistical measure that represents the proportion of the variance in the target variable that is explained by the independent variables. Its values typically range from 0 to 1 and are commonly stated as percentages, although it can be negative on unseen data when a model fits worse than simply predicting the mean. The other commonly used metric for regression problems is RMSE, which measures the typical magnitude of the residuals (errors) in the same units as the target variable. We will use both metrics to evaluate model performance.
Ideally, lower RMSE and higher R-squared values are indicative of a good model.
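To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The helper names `rmse` and `r_squared` and the toy numbers are assumptions made for illustration; the rest of this guide uses the scikit-learn functions `mean_squared_error` and `r2_score` instead.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy values, not taken from the unemployment data.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]

toy_rmse = rmse(actual, predicted)
toy_r2 = r_squared(actual, predicted)
print(toy_rmse, toy_r2)
```

RMSE is expressed in the units of the target, while R-squared is unitless, which is why the two are usually reported together.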
Steps
In this guide, we will follow these steps:
Step 1 - Loading the required libraries and modules.
Step 2 - Loading the data and performing basic data checks.
Step 3 - Creating arrays for the features and the response variable.
Step 4 - Creating the training and test datasets.
Step 5 - Building the regression model, predicting with it, and evaluating it. We will repeat Step 5 for each of the regression models.
The following sections will cover these steps.
Step 1 - Loading the Required Libraries and Modules
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
Step 2 - Reading the Data and Performing Basic Data Checks
The first line of code reads in the data as a pandas DataFrame, while the second line prints the shape: 574 observations of 5 variables. The third line gives summary statistics of the numerical variables. The average unemployment stands at 7771 thousand for the data. Also, there are no missing values, because every variable has a 'count' of 574, which is equal to the number of records in the dataset.
df = pd.read_csv('regressionexample.csv')
print(df.shape)
df.describe()
Output:
(574, 5)
                pce            pop     psavert     uempmed      unemploy
count    574.000000     574.000000  574.000000  574.000000    574.000000
mean    4843.510453  257189.381533    7.936585    8.610105   7771.557491
std     3579.287206   36730.801593    3.124394    4.108112   2641.960571
min      507.400000  198712.000000    1.900000    4.000000   2685.000000
25%     1582.225000  224896.000000    5.500000    6.000000   6284.000000
50%     3953.550000  253060.000000    7.700000    7.500000   7494.000000
75%     7667.325000  290290.750000   10.500000    9.100000   8691.000000
max    12161.500000  320887.000000   17.000000   25.200000  15352.000000
Step 3 - Creating Arrays for the Features and the Response Variable
The first line of code stores the name of the target variable in a list called 'target_column'. The second line gives us the list of all the features, excluding the target variable 'unemploy'.
The third line normalizes the predictors. This is done because the units of the variables differ significantly and may influence the modeling process. To prevent this, we divide each predictor by its maximum value, so that all values lie between 0 and 1 and every predictor has a maximum of 1.
The fourth line displays the summary of the normalized data. We can see that all the independent variables have now been scaled to lie between 0 and 1. The target variable remains unchanged.
target_column = ['unemploy']
predictors = list(set(list(df.columns)) - set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
df.describe()
Output:
              pce         pop     psavert     uempmed      unemploy
count  574.000000  574.000000  574.000000  574.000000    574.000000
mean     0.398266    0.801495    0.466858    0.341671   7771.557491
std      0.294313    0.114466    0.183788    0.163020   2641.960571
min      0.041722    0.619258    0.111765    0.158730   2685.000000
25%      0.130101    0.700857    0.323529    0.238095   6284.000000
50%      0.325087    0.788627    0.452941    0.297619   7494.000000
75%      0.630459    0.904651    0.617647    0.361111   8691.000000
max      1.000000    1.000000    1.000000    1.000000  15352.000000
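As a small illustration of the scaling step, the sketch below applies the same divide-by-maximum operation to a hypothetical toy DataFrame (the values are made up, not taken from the dataset). For comparison it also computes min-max scaling, which is shown only as an alternative and is not the method used in this guide.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror two of the predictors.
toy = pd.DataFrame({"pce": [500.0, 1000.0, 2000.0],
                    "psavert": [2.0, 8.0, 16.0]})

# Approach used in this guide: divide each column by its maximum.
max_scaled = toy / toy.max()

# Alternative (an assumption, not the guide's method): min-max scaling,
# which maps each column's minimum to 0 and its maximum to 1.
minmax_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(max_scaled)
print(minmax_scaled)
```

With division by the maximum, each column's largest value becomes 1 while its smallest value stays above 0, which matches the 'min' row in the summary above.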
Step 4 - Creating the Training and Test Datasets
We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation method.
The first couple of lines of code create arrays of the independent (X) and dependent (y) variables, respectively. The third line splits the data into training and test datasets, with the 'test_size' argument specifying the proportion of the data to be kept in the test set and 'random_state' making the split reproducible. The fourth line prints the shape of the training set (401 observations of 4 variables) and the test set (173 observations of 4 variables).
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
Output:
(401, 4)
(173, 4)
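The split sizes can be checked without the CSV file. The sketch below uses a synthetic array with the same shape as the guide's data (the random values and the `_demo` names are assumptions for illustration) and confirms that fixing 'random_state' reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the guide's data
# (574 rows, 4 predictors); the values themselves are random.
rng = np.random.default_rng(0)
X_demo = rng.random((574, 4))
y_demo = rng.random((574, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)
print(X_tr.shape, X_te.shape)

# Calling the function again with the same random_state gives the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)
print(np.array_equal(X_tr, X_tr2))
```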
Step 5 - Build, Predict and Evaluate the Regression Model
In this step, we will be implementing the various linear regression models using the scikit-learn library.
Linear Regression
The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The residuals (errors) are assumed to be normally distributed with constant variance; the input variables themselves do not need to follow a Gaussian distribution. Another assumption is that the predictors are not highly correlated with each other; when they are, the problem is called multicollinearity.
The linear regression equation can be expressed in the following form:
y = a_1 x_1 + a_2 x_2 + a_3 x_3 + ... + a_n x_n + b
Where the following is true:
 y is the target variable.
 x_1, x_2, x_3, ..., x_n are the features.
 a_1, a_2, a_3, ..., a_n are the coefficients.
 b is the intercept of the model.
The coefficients a_1, ..., a_n and the intercept b are selected through the Ordinary Least Squares (OLS) method. It works by minimizing the sum of the squared residuals (actual value - predicted value).
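As a rough illustration of what OLS does, the sketch below solves the least-squares problem directly with NumPy on synthetic data and compares the result with scikit-learn's LinearRegression. The data, the "true" coefficients, and the `_demo` variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from known coefficients plus a little noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y_demo = X_demo @ true_coefs + true_intercept + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept b is estimated with the coefficients.
X_design = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
ols_coefs, ols_intercept = solution[:3], solution[3]

# scikit-learn solves the same least-squares problem.
sk_model = LinearRegression().fit(X_demo, y_demo)
print(ols_coefs, ols_intercept)
print(sk_model.coef_, sk_model.intercept_)
```

Both approaches minimize the same sum of squared residuals, so the two sets of estimates should agree up to floating-point precision.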
To fit the linear regression model, the first step is to instantiate the algorithm, which is done in the first line of code below. The second line fits the model on the training set.
lr = LinearRegression()
lr.fit(X_train, y_train)
Output:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Once the model is built on the training set, we can make predictions. The first line of code below predicts on the training set. The second and third lines of code print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test dataset in the fourth to sixth lines.
pred_train_lr= lr.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train_lr)))
print(r2_score(y_train, pred_train_lr))
pred_test_lr= lr.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_lr)))
print(r2_score(y_test, pred_test_lr))
Output:
971.0295047627518
0.8685609551239368
1019.3232585671161
0.8396633322870104
The above output shows that the RMSE, one of the two evaluation metrics, is 971 thousand for the training data and 1019 thousand for the test data. The R-squared value, on the other hand, is 87 percent for the training data and 84 percent for the test data, which indicates good performance.
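Because the same predict-and-score steps are repeated for every model in this guide, they could be wrapped in a small helper. The function name `report_metrics` and the toy data are hypothetical; this is only a sketch of one way to reduce the repetition, not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def report_metrics(model, X, y):
    """Return (RMSE, R-squared) for an already fitted model on the given data."""
    predictions = model.predict(X)
    rmse = np.sqrt(mean_squared_error(y, predictions))
    return float(rmse), float(r2_score(y, predictions))

# Hypothetical, perfectly linear toy data: y = 3x + 1.
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 3.0 * X_toy.ravel() + 1.0

toy_model = LinearRegression().fit(X_toy, y_toy)
toy_rmse, toy_r2 = report_metrics(toy_model, X_toy, y_toy)
print(toy_rmse, toy_r2)
```

With the guide's variables, the same helper could be called once with the training arrays and once with the test arrays for each fitted model.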
Regularized Regression
As discussed above, linear regression works by selecting coefficients for each independent variable that minimize a loss function. However, if the coefficients are too large, the model can overfit the training dataset and will not generalize well to unseen data. To overcome this shortcoming, we apply regularization, which penalizes large coefficients. The following sections of the guide discuss the various regularization algorithms.
Ridge Regression
Ridge regression is an extension of linear regression in which the loss function is modified to limit the complexity of the model. This modification is done by adding a penalty term that is proportional to the sum of the squared coefficients (the squared L2-norm).
Loss function = OLS + alpha * summation (squared coefficient values)
In the above loss function, alpha is the parameter we need to select. A low alpha value can lead to overfitting, whereas a high alpha value can lead to underfitting.
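The effect of alpha can be seen on a small synthetic example. The sketch below (data and `_demo` names are assumptions for illustration) fits Ridge with increasing alpha values and records the size of the coefficient vector, which shrinks toward zero as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients; not the unemployment dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo @ np.array([5.0, -3.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=60)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # The L2 norm summarizes the overall magnitude of the coefficients.
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>8}: L2 norm of coefficients = {coef_norms[-1]:.4f}")
```

A very small alpha leaves the coefficients close to the OLS solution (risking overfitting), while a very large alpha pushes them toward zero (risking underfitting).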
In scikit-learn, a ridge regression model is constructed by using the Ridge class. The first line of code below instantiates the Ridge Regression model with an alpha value of 0.01. The second line fits the model to the training data.
The third line of code predicts, while the fourth and fifth lines print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.
rr = Ridge(alpha=0.01)
rr.fit(X_train, y_train)
pred_train_rr= rr.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train_rr)))
print(r2_score(y_train, pred_train_rr))
pred_test_rr= rr.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_rr)))
print(r2_score(y_test, pred_test_rr))
Output:
975.8314265299163
0.8672577596814723
1017.3110662731054
0.8402957317988335
The above output shows that the RMSE and R-squared values for the Ridge Regression model on the training data are 975 thousand and 86.7 percent, respectively. For the test data, the results for these metrics are 1017 thousand and 84 percent, respectively.
Lasso Regression
Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In Lasso, the loss function is modified to limit the complexity of the model by penalizing the sum of the absolute values of the model coefficients (also called the L1-norm).
The loss function for Lasso Regression can be expressed as below:
Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)
In the above loss function, alpha is the penalty parameter we need to select. Using an L1-norm penalty forces the coefficients of some features to exactly zero, leaving only the remaining features with non-zero coefficients.
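The sparsity produced by the L1 penalty can be illustrated on synthetic data where only some features matter. In the sketch below, the data, the alpha value of 0.5, and the `_demo` names are assumptions chosen for illustration; Ridge is fitted alongside Lasso purely for contrast.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 6))
# Only the first two features influence the target; the other four are noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print("Lasso coefficients:", lasso_demo.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
print(n_zero_lasso, n_zero_ridge)
```

Lasso sets the coefficients of the irrelevant features to exactly zero and shrinks the relevant ones, whereas Ridge shrinks all coefficients without eliminating any of them.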
In scikit-learn, a lasso regression model is constructed by using the Lasso class. The first line of code below instantiates the Lasso Regression model with an alpha value of 0.01. The second line fits the model to the training data.
The third line of code predicts, while the fourth and fifth lines print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.
model_lasso = Lasso(alpha=0.01)
model_lasso.fit(X_train, y_train)
pred_train_lasso= model_lasso.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train_lasso)))
print(r2_score(y_train, pred_train_lasso))
pred_test_lasso= model_lasso.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_lasso)))
print(r2_score(y_test, pred_test_lasso))
Output:
971.0300033264347
0.8685608201522376
1019.2575575977107
0.8396840007744909
The above output shows that the RMSE and R-squared values for the Lasso Regression model on the training data are 971 thousand and 86.9 percent, respectively.
The results for these metrics on the test data are 1019 thousand and 84 percent, respectively. Lasso Regression can also be used for feature selection because the coefficients of less important features are reduced to zero.
ElasticNet Regression
ElasticNet combines the properties of both Ridge and Lasso regression. It works by penalizing the model using both the L2-norm and the L1-norm.
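In scikit-learn's ElasticNet class, the balance between the two penalties is controlled by the 'l1_ratio' parameter, which defaults to 0.5; the documented objective is (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2. The sketch below, on synthetic data with assumed `_demo` names, checks that setting l1_ratio to 1.0 reproduces a Lasso fit at the same alpha.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data with known coefficients; not the unemployment dataset.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([4.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.2, size=150)

# Default mix: l1_ratio=0.5 combines the L1 and L2 penalties.
enet_default = ElasticNet(alpha=0.1).fit(X_demo, y_demo)

# l1_ratio=1.0 drops the L2 part, so the fit should match Lasso at the same alpha.
enet_l1_only = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
lasso_same_alpha = Lasso(alpha=0.1).fit(X_demo, y_demo)

print(enet_default.coef_)
print(enet_l1_only.coef_)
print(lasso_same_alpha.coef_)
```

Note that the squared-error term here is divided by the number of samples, whereas the Ridge class does not rescale it, so the same numeric alpha does not imply the same strength of regularization across the two classes.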
In scikit-learn, an ElasticNet regression model is constructed by using the ElasticNet class. The first line of code below instantiates the ElasticNet Regression model with an alpha value of 0.01. The second line fits the model to the training data.
The third line of code predicts, while the fourth and fifth lines print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.
#Elastic Net
model_enet = ElasticNet(alpha = 0.01)
model_enet.fit(X_train, y_train)
pred_train_enet= model_enet.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train_enet)))
print(r2_score(y_train, pred_train_enet))
pred_test_enet= model_enet.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_enet)))
print(r2_score(y_test, pred_test_enet))
Output:
1352.6359049952857
0.744952327304685
1379.7820437888938
0.7062147664176855
The above output shows that the RMSE and R-squared values for the ElasticNet Regression model on the training data are 1352 thousand and 74 percent, respectively. The results for these metrics on the test data are 1379 thousand and 71 percent, respectively.
Conclusion
In this guide, you have learned about Linear Regression models using the powerful Python library, scikit-learn. You have also learned about Regularization techniques to avoid the shortcomings of the linear regression models. The performance of the models is summarized below:

Linear Regression Model: Test set RMSE of 1019 thousand and R-squared of 83.96 percent.

Ridge Regression Model: Test set RMSE of 1017 thousand and R-squared of 84.02 percent.

Lasso Regression Model: Test set RMSE of 1019 thousand and R-squared of 83.96 percent.

ElasticNet Regression Model: Test set RMSE of 1379 thousand and R-squared of 70.62 percent.
The ElasticNet Regression model performs the worst. All the other regression models perform better, with decent R-squared values and stable RMSE values. The ideal result would be an RMSE value of zero and an R-squared value of 1, but that is almost impossible with real economic datasets.
There are other iterations that can be done to improve model performance. We have assigned the value of alpha to be 0.01, but this can be altered by hyperparameter tuning to arrive at the optimal alpha value. Cross-validation can also be tried, along with feature selection techniques. However, that is not covered in this guide, which is aimed at enabling individuals to understand and implement the various Linear Regression models using the scikit-learn library.
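For readers who want a starting point, the following is a minimal sketch of tuning alpha with cross-validation using GridSearchCV. The synthetic data, the candidate alpha grid, and the choice of 5 folds are all assumptions made for illustration, and the "neg_root_mean_squared_error" scorer requires scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data; in practice X_train and y_train from Step 4 would be used.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=120)

# Hypothetical grid of candidate alpha values.
alpha_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    Ridge(),
    param_grid=alpha_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_cv_rmse = -search.best_score_  # flip the sign to get a positive RMSE
print(best_alpha, best_cv_rmse)
```

The same pattern could be applied to Lasso or ElasticNet by swapping the estimator and, for ElasticNet, adding 'l1_ratio' to the grid.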