Linear, Lasso, and Ridge Regression with scikit-learn

May 17, 2019 • 17 Minute Read

Introduction

Supervised Machine Learning is being used by many organizations to identify and solve business problems. The two types of algorithms commonly used are Classification and Regression. In the previous guide, Scikit Machine Learning, we learned how to build a classification algorithm with scikit-learn.

In this guide, the focus will be on Regression. Regression models are models which predict a continuous outcome. A few examples include predicting the unemployment levels in a country, sales of a retail store, number of matches a team will win in the baseball league, or number of seats a party will win in an election.

In this guide, you will learn how to implement the following linear regression models using scikit-learn:

Linear Regression
Ridge Regression
Lasso Regression
Elastic Net Regression

As always, the first step is to understand the Problem Statement.

Problem Statement

Unemployment is a big socio-economic and political concern for any country and, hence, managing it is a chief task for any government. In this guide, we will try to build regression algorithms for predicting unemployment within an economy.

The data used in this project was produced from US economic time series data available from [https://research.stlouisfed.org/fred2]. The data contains 574 rows and 5 variables, as described below:

psavert - personal savings rate.
pce - personal consumption expenditures, in billions of dollars.
uempmed - median duration of unemployment, in weeks.
pop - total population, in thousands.
unemploy - number of unemployed, in thousands (dependent variable).

Evaluation Metrics

We will evaluate the performance of the model using two metrics - R-squared value and Root Mean Squared Error (RMSE).

R-squared is a statistical measure of the proportion of the variance in the target variable that is explained by the independent variables. It typically ranges from 0 to 1 and is commonly stated as a percentage, although it can be negative on held-out data when a model fits worse than simply predicting the mean. The other commonly used metric for regression problems is RMSE, which measures the typical magnitude of the residuals, or errors, in the units of the target variable.

Ideally, lower RMSE and higher R-squared values are indicative of a good model.
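To make the two metrics concrete, here is a minimal sketch that computes RMSE and R-squared by hand with NumPy and checks them against scikit-learn's `mean_squared_error` and `r2_score`. The arrays `y_true` and `y_pred` are toy values invented for illustration; they are not taken from the unemployment data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values, invented purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The same quantities computed with scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

Because RMSE is expressed in the units of the target, it is only comparable between models fitted to the same target variable, whereas R-squared is unitless.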

Steps

We will work through the following steps:

Step 1 - Loading the required libraries and modules.

Step 2 - Loading the data and performing basic data checks.

Step 3 - Creating arrays for the features and the response variable.

Step 4 - Creating the training and test datasets.

Step 5 - Build, Predict and Evaluate the regression model. We will be repeating Step 5 for the various regression models.

The following sections will cover these steps.

Step 1 - Loading the Required Libraries and Modules

    # Data handling
    import pandas as pd
    import numpy as np
    # Regression models covered in this guide
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
    # Evaluation metrics and the train/test split helper
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split
    

Step 2 - Reading the Data and Performing Basic Data Checks

The first line of code reads in the data as a pandas DataFrame, while the second line prints the shape - 574 observations of 5 variables. The third line gives summary statistics of the numerical variables. The average unemployment stands at roughly 7,772 thousand for the data. Also, there are no missing values, because every variable has a 'count' of 574, which equals the number of records in the dataset.

    df = pd.read_csv('regressionexample.csv')
    print(df.shape)
    df.describe()
    

Output:

      (574, 5)


|       | pce          | pop           | psavert    | uempmed    | unemploy     |
|-------|--------------|---------------|------------|------------|--------------|
| count | 574.000000   | 574.000000    | 574.000000 | 574.000000 | 574.000000   |
| mean  | 4843.510453  | 257189.381533 | 7.936585   | 8.610105   | 7771.557491  |
| std   | 3579.287206  | 36730.801593  | 3.124394   | 4.108112   | 2641.960571  |
| min   | 507.400000   | 198712.000000 | 1.900000   | 4.000000   | 2685.000000  |
| 25%   | 1582.225000  | 224896.000000 | 5.500000   | 6.000000   | 6284.000000  |
| 50%   | 3953.550000  | 253060.000000 | 7.700000   | 7.500000   | 7494.000000  |
| 75%   | 7667.325000  | 290290.750000 | 10.500000  | 9.100000   | 8691.000000  |
| max   | 12161.500000 | 320887.000000 | 17.000000  | 25.200000  | 15352.000000 |
    

Step 3 - Creating Arrays for the Features and the Response Variable

The first line of code creates an object of the target variable called 'target_column'. The second line gives us the list of all the features, excluding the target variable 'unemploy'.

The third line rescales the predictors. This is done because the units of the variables differ significantly and may influence the modeling process. To prevent this, each predictor is divided by its maximum value, so that every predictor has a maximum of 1.

The fourth line displays the summary of the rescaled data. We can see that all the independent variables now lie between 0 and 1, with a maximum of exactly 1 (the minimum is not necessarily 0). The target variable remains unchanged.

    target_column = ['unemploy']
    # A list comprehension keeps the original column order (a set difference does not)
    predictors = [col for col in df.columns if col not in target_column]
    df[predictors] = df[predictors]/df[predictors].max()
    df.describe()
    

Output:

|       | pce        | pop        | psavert    | uempmed    | unemploy     |
|-------|------------|------------|------------|------------|--------------|
| count | 574.000000 | 574.000000 | 574.000000 | 574.000000 | 574.000000   |
| mean  | 0.398266   | 0.801495   | 0.466858   | 0.341671   | 7771.557491  |
| std   | 0.294313   | 0.114466   | 0.183788   | 0.163020   | 2641.960571  |
| min   | 0.041722   | 0.619258   | 0.111765   | 0.158730   | 2685.000000  |
| 25%   | 0.130101   | 0.700857   | 0.323529   | 0.238095   | 6284.000000  |
| 50%   | 0.325087   | 0.788627   | 0.452941   | 0.297619   | 7494.000000  |
| 75%   | 0.630459   | 0.904651   | 0.617647   | 0.361111   | 8691.000000  |
| max   | 1.000000   | 1.000000   | 1.000000   | 1.000000   | 15352.000000 |
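As a side note, dividing by the column maximum is not the same as min-max scaling. The sketch below contrasts the two on a small made-up DataFrame (`toy` and its columns `a` and `b` are hypothetical, not part of the dataset). It uses scikit-learn's `MinMaxScaler`, which the guide itself does not use, only to show the difference.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({"a": [10.0, 20.0, 40.0], "b": [1.0, 3.0, 5.0]})

# Approach used in this guide: divide by the column maximum.
# The maximum becomes 1, but the minimum stays above 0.
max_scaled = toy / toy.max()

# Min-max scaling: (x - min) / (max - min), so the range is exactly [0, 1].
minmax_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns
)

print(max_scaled)
print(minmax_scaled)
```

Either approach puts the predictors on a comparable scale, which matters for the regularized models because their penalties act directly on the size of the coefficients.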
    

Step 4 - Creating the Training and Test Datasets

We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation method.

The first two lines of code create arrays of the independent (X) and dependent (y) variables, respectively. The third line splits the data into training and test datasets, with the 'test_size' argument specifying the proportion of the data to be kept in the test set. The fourth line prints the shape of the training set (401 observations of 4 variables) and test set (173 observations of 4 variables).

    X = df[predictors].values
    y = df[target_column].values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
    print(X_train.shape); print(X_test.shape)
    

Output:

    (401, 4)
    (173, 4)
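The split sizes can be reproduced without the CSV file. The following sketch runs `train_test_split` on random placeholder arrays of the same shape as the data (574 rows, 4 predictors); the names `X_demo` and `y_demo` are hypothetical, and the values themselves are meaningless.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random placeholder data with the same shape as the guide's dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(574, 4))
y_demo = rng.normal(size=574)

# Same arguments as in the guide: 30 percent held out, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)
print(X_tr.shape, X_te.shape)  # (401, 4) (173, 4)

# A fixed random_state makes the split reproducible across runs
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)
print(np.array_equal(X_tr, X_tr_again))  # True
```

The test set gets 173 rows rather than 172 because the requested 30 percent of 574 (172.2) is rounded up.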
    

Step 5 - Build, Predict and Evaluate the Regression Model

In this step, we will be implementing the various linear regression models using the scikit-learn library.

Linear Regression

The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. For statistical inference, the residuals (rather than the input variables themselves) are also assumed to be approximately normally distributed. Another assumption is that the predictors are not highly correlated with each other; when they are, the problem is called multicollinearity.

The linear regression equation can be expressed in the following form:

y = a1x1 + a2x2 + a3x3 + ..... + anxn + b

Where the following is true:

y is the target variable.
x1, x2, x3,...xn are the features.
a1, a2, a3,..., an are the coefficients.
b is the intercept.

The coefficients a1, ..., an and the intercept b are estimated with the ordinary least squares (OLS) method, which works by minimizing the sum of the squared residuals (actual value - predicted value).
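As an illustration of what OLS computes, the sketch below solves the least squares problem directly with `np.linalg.lstsq` and compares the result with scikit-learn's `LinearRegression`. The data is synthetic and the names `X_demo`, `y_demo`, and `true_coef` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 - 1*x2 + 0.5*x3 + 4 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept b is estimated with the coefficients
X_design = np.column_stack([X_demo, np.ones(len(X_demo))])

# Least squares solution: minimizes the sum of squared residuals
solution, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
coef_lstsq, intercept_lstsq = solution[:-1], solution[-1]

# scikit-learn should arrive at the same estimates
lr_demo = LinearRegression().fit(X_demo, y_demo)

print(coef_lstsq, intercept_lstsq)
print(lr_demo.coef_, lr_demo.intercept_)
```

Both approaches should recover values close to the coefficients used to generate the data, since the noise level is small.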

To fit the linear regression model, the first step is to instantiate the algorithm, which is done in the first line of code below. The second line fits the model on the training set.

    lr = LinearRegression()
    lr.fit(X_train, y_train)
    

Output:

      LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Once the model is built on the training set, we can make the predictions. The first line of code below predicts on the training set. The second and third lines of code print the evaluation metrics - RMSE and R-squared - on the training set. The same steps are repeated on the test dataset in the fourth to sixth lines.

    pred_train_lr = lr.predict(X_train)
    print(np.sqrt(mean_squared_error(y_train, pred_train_lr)))
    print(r2_score(y_train, pred_train_lr))

    pred_test_lr = lr.predict(X_test)
    print(np.sqrt(mean_squared_error(y_test, pred_test_lr)))
    print(r2_score(y_test, pred_test_lr))
    

Output:

    971.0295047627518
    0.8685609551239368
    1019.3232585671161
    0.8396633322870104
    

The above output shows that the RMSE, one of the two evaluation metrics, is 971 thousand for the training data and 1019 thousand for the test data. The R-squared value is 87 percent for the training data and 84 percent for the test data. The small gap between the training and test results suggests that the model generalizes reasonably well.
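Since the same fit, predict, and evaluate pattern recurs for every model, it can be wrapped in a small helper. The function `evaluate` below is a hypothetical convenience, not part of scikit-learn or of the original guide, and it is demonstrated on synthetic data rather than on the unemployment dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return {"train": (rmse, r2), "test": (rmse, r2)}.

    Hypothetical helper written for this sketch.
    """
    model.fit(X_train, y_train)
    results = {}
    for name, X_part, y_part in [("train", X_train, y_train),
                                 ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
        results[name] = (rmse, float(r2_score(y_part, pred)))
    return results


# Synthetic data standing in for the real predictors and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)

results = evaluate(LinearRegression(), X_tr, y_tr, X_te, y_te)
for split, (rmse, r2) in results.items():
    print(split, round(rmse, 4), round(r2, 4))
```

Any estimator with `fit` and `predict` methods can be passed in, so the same helper would work for the regularized models discussed in the rest of the guide.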

Regularized Regression

As discussed above, linear regression works by selecting the coefficients for the independent variables that minimize a loss function. However, if the coefficients are too large, the model can over-fit the training dataset, and such a model will not generalize well to unseen data. To overcome this shortcoming, we apply regularization, which penalizes large coefficients. The following sections of the guide will discuss the various regularization algorithms.

Ridge Regression

Ridge regression is an extension of linear regression where the loss function is modified to limit the complexity of the model. This modification is done by adding a penalty term that is proportional to the sum of the squared coefficients (the squared l2-norm).

Loss function = OLS + alpha * summation (squared coefficient values)

In the above loss function, alpha is the parameter we need to select. A low alpha value can lead to over-fitting, whereas a high alpha value can lead to under-fitting.
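To see how alpha controls the shrinkage, the sketch below evaluates the closed-form ridge solution w = (X'X + alpha * I)^-1 X'y with NumPy for a few alpha values. It deliberately omits the intercept to keep the formula short, uses synthetic data, and introduces a hypothetical helper `ridge_closed_form`. scikit-learn's `Ridge` with `fit_intercept=False` is used only to check the result.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data without an intercept, invented for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)


def ridge_closed_form(X, y, alpha):
    """Solve (X'X + alpha * I) w = X'y. Hypothetical helper for this sketch."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


alphas = [0.01, 1.0, 100.0]
coef_norms = []
for alpha in alphas:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    coef_norms.append(float(np.linalg.norm(w)))
    print(alpha, w.round(3))

# Cross-check one alpha against scikit-learn (no intercept, same objective)
w_check = ridge_closed_form(X_demo, y_demo, 1.0)
ridge_check = Ridge(alpha=1.0, fit_intercept=False).fit(X_demo, y_demo)
print(np.allclose(w_check, ridge_check.coef_, atol=1e-6))
```

The coefficient norm decreases as alpha grows: a very small alpha leaves the fit close to plain OLS, while a very large alpha pushes all coefficients toward zero.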

In scikit-learn, a ridge regression model is constructed by using the Ridge class. The first line of code below instantiates the Ridge Regression model with an alpha value of 0.01. The second line fits the model to the training data.

The third line of code predicts on the training set, while the fourth and fifth lines print the evaluation metrics - RMSE and R-squared - on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.

    rr = Ridge(alpha=0.01)
    rr.fit(X_train, y_train)
    pred_train_rr = rr.predict(X_train)
    print(np.sqrt(mean_squared_error(y_train, pred_train_rr)))
    print(r2_score(y_train, pred_train_rr))

    pred_test_rr = rr.predict(X_test)
    print(np.sqrt(mean_squared_error(y_test, pred_test_rr)))
    print(r2_score(y_test, pred_test_rr))
    

Output:

    975.8314265299163
    0.8672577596814723
    1017.3110662731054
    0.8402957317988335
    

The above output shows that the RMSE and R-squared values for the Ridge Regression model on the training data are 975 thousand and 86.7 percent, respectively. For the test data, the results for these metrics are 1017 thousand and 84 percent, respectively.

Lasso Regression

Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In Lasso, the loss function is modified to minimize the complexity of the model by limiting the sum of the absolute values of the model coefficients (also called the l1-norm).

The loss function for Lasso Regression can be expressed as below:

Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)

In the above loss function, alpha is the penalty parameter we need to select. The l1-norm penalty can shrink some coefficients exactly to zero, so that only the remaining features contribute to the prediction.
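The zeroing effect is easy to observe on synthetic data. In the sketch below, only the first two of five features influence the target; the data, the names `X_demo` and `y_demo`, and the choice of `alpha=0.5` are assumptions made for illustration and are unrelated to the unemployment dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=200))

# A fairly strong penalty, chosen to make the effect visible
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
print(lasso_demo.coef_)

# Count the coefficients that were driven exactly to zero
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print("zeroed coefficients:", n_zero)
```

The coefficients of the informative features are shrunk toward zero but stay non-zero, while the noise features are expected to drop out entirely, which is the basis for using Lasso as a feature selection tool.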

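In scikit-learn, a lasso regression model is constructed by using the Lasso class. The first line of code below instantiates the Lasso Regression model with an alpha value of 0.01. The second line fits the model to the training data. The remaining lines predict and print the evaluation metrics on the training and test sets, as in the previous sections.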

    model_lasso = Lasso(alpha=0.01)
    model_lasso.fit(X_train, y_train)
    pred_train_lasso = model_lasso.predict(X_train)
    print(np.sqrt(mean_squared_error(y_train, pred_train_lasso)))
    print(r2_score(y_train, pred_train_lasso))

    pred_test_lasso = model_lasso.predict(X_test)
    print(np.sqrt(mean_squared_error(y_test, pred_test_lasso)))
    print(r2_score(y_test, pred_test_lasso))
    

Output:

    971.0300033264347
    0.8685608201522376
    1019.2575575977107
    0.8396840007744909
    

The above output shows that the RMSE and R-squared values for the Lasso Regression model on the training data are 971 thousand and 86.9 percent, respectively.

The results for these metrics on the test data are 1019 thousand and 84 percent, respectively. Lasso Regression can also be used for feature selection because the coefficients of less important features are reduced to zero.

ElasticNet Regression

ElasticNet combines the properties of both Ridge and Lasso regression. It works by penalizing the model using both the l2-norm and the l1-norm.
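In scikit-learn, the balance between the two penalties is set by the `l1_ratio` parameter of `ElasticNet` (default 0.5). The sketch below, on synthetic data with hypothetical names, compares a mixed penalty with the pure l1 case, which should coincide with `Lasso`. Note also that, according to the scikit-learn documentation, `ElasticNet` and `Lasso` divide the squared-error term by 2 * n_samples whereas `Ridge` does not, so the same alpha value can imply quite different penalty strengths across these classes.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data: only the first two of five features matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=200))

# Equal mix of l1 and l2 penalties (the default l1_ratio)
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)

# l1_ratio=1.0 keeps only the l1 penalty, i.e. the Lasso objective
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
lasso_ref = Lasso(alpha=0.1).fit(X_demo, y_demo)

print("mixed   :", enet_mixed.coef_.round(3))
print("pure l1 :", enet_l1.coef_.round(3))
print("lasso   :", lasso_ref.coef_.round(3))
```

Comparing the printed coefficients shows how moving `l1_ratio` between 0 and 1 shifts the model between ridge-like and lasso-like behavior.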

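In scikit-learn, an ElasticNet regression model is constructed by using the ElasticNet class. The first line of code below instantiates the ElasticNet Regression model with an alpha value of 0.01 and the default l1_ratio of 0.5. The second line fits the model to the training data. The remaining lines predict and print the evaluation metrics on the training and test sets.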

    # Elastic Net
    model_enet = ElasticNet(alpha=0.01)
    model_enet.fit(X_train, y_train)
    pred_train_enet = model_enet.predict(X_train)
    print(np.sqrt(mean_squared_error(y_train, pred_train_enet)))
    print(r2_score(y_train, pred_train_enet))

    pred_test_enet = model_enet.predict(X_test)
    print(np.sqrt(mean_squared_error(y_test, pred_test_enet)))
    print(r2_score(y_test, pred_test_enet))
    

Output:

    1352.6359049952857
    0.744952327304685
    1379.7820437888938
    0.7062147664176855
    

The above output shows that the RMSE and R-squared values for the ElasticNet Regression model on the training data are 1352 thousand and 74 percent, respectively. The results for these metrics on the test data are 1379 thousand and 71 percent, respectively.

Conclusion

In this guide, you have learned about Linear Regression models using the powerful Python library, scikit-learn. You have also learned about Regularization techniques to avoid the shortcomings of the linear regression models. The performance of the models is summarized below:

Linear Regression Model: Test set RMSE of 1019 thousand and R-squared of 83.96 percent.
Ridge Regression Model: Test set RMSE of 1017 thousand and R-squared of 84.02 percent.
Lasso Regression Model: Test set RMSE of 1019 thousand and R-squared of 83.96 percent.
ElasticNet Regression Model: Test set RMSE of 1379 thousand and R-squared of 70.62 percent.

The ElasticNet Regression model performs the worst at this untuned alpha value. All the other regression models perform better, with a decent R-squared and similar RMSE values on the training and test sets. The ideal result would be an RMSE value of zero and an R-squared value of 1, but that is almost impossible in real economic datasets.

There are other iterations that can be done to improve model performance. We have assigned alpha a value of 0.01, but this can be altered through hyperparameter tuning to arrive at the optimal alpha value. Cross-validation can also be tried, along with feature selection techniques. However, these are not covered in this guide, which was aimed at enabling individuals to understand and implement the various Linear Regression models using the scikit-learn library.
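As a starting point for such tuning, scikit-learn provides `RidgeCV` and `LassoCV`, which select alpha by cross-validation. The sketch below is one possible approach on synthetic data, not the guide's own method; the candidate grid `alphas` and the 5-fold setting are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic data standing in for the scaled predictors and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.5, size=200)

# Candidate alpha values (an arbitrary grid, chosen for illustration)
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

# Each estimator evaluates every candidate with 5-fold cross-validation
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_demo, y_demo)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

print("best ridge alpha:", ridge_cv.alpha_)
print("best lasso alpha:", lasso_cv.alpha_)
```

The selected alpha depends on the data and on the grid, so in practice the grid would be widened or refined until the chosen value no longer sits at its edge.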