Introduction

Supervised Machine Learning is being used by many organizations to identify and solve business problems. The two types of algorithms commonly used are Classification and Regression. In the previous guide, Scikit Machine Learning, we learned how to build a classification algorithm with scikit-learn.

In this guide, the focus will be on Regression. Regression models are models which predict a continuous outcome. A few examples include predicting the unemployment levels in a country, sales of a retail store, number of matches a team will win in the baseball league, or number of seats a party will win in an election.

In this guide, you will learn how to implement the following linear regression models using scikit-learn:

- Linear Regression
- Ridge Regression
- Lasso Regression
- Elastic Net Regression

As always, the first step is to understand the Problem Statement.

Unemployment is a big socio-economic and political concern for any country and, hence, managing it is a chief task for any government. In this guide, we will try to build regression algorithms for predicting unemployment within an economy.

The data used in this project was produced from US economic time series data available from http://research.stlouisfed.org/fred2. The data contains 574 rows and 5 variables, as described below:

- psavert - personal savings rate.
- pce - personal consumption expenditures, in billions of dollars.
- uempmed - median duration of unemployment, in weeks.
- pop - total population, in thousands.
- unemploy - number of unemployed, in thousands (dependent variable).

We will evaluate the performance of the model using two metrics - R-squared value and Root Mean Squared Error (RMSE).

R-squared is a statistical measure of the proportion of the variance in the target variable that is explained by the independent variables. Its values typically range from 0 to 1 and are commonly stated as percentages, although it can turn negative on held-out data when a model fits worse than simply predicting the mean. The other commonly used metric for regression problems is RMSE, which is the square root of the mean of the squared residuals and is expressed in the same units as the target variable.

Ideally, lower RMSE and higher R-squared values are indicative of a good model.
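To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The arrays `y_true` and `y_pred` are made-up toy values chosen for illustration, not values from the unemployment dataset.

```python
import numpy as np

# Toy values chosen only for illustration; they are not from the dataset.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(rmse)  # about 0.707
print(r2)    # about 0.9
```

These are the same quantities that `mean_squared_error` (followed by a square root) and `r2_score` report later in the guide.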

In this guide, we will follow these steps:

*Step 1 - Loading the required libraries and modules.*

*Step 2 - Loading the data and performing basic data checks.*

*Step 3 - Creating arrays for the features and the response variable.*

*Step 4 - Creating the training and test datasets.*

*Step 5 - Build, Predict and Evaluate the regression model.*
We will be repeating Step 5 for the various regression models.

The following sections will cover these steps.

```python
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
```

The *first line of code* below reads the data into a pandas DataFrame, while the *second line* prints its shape: 574 observations of 5 variables. The *third line* gives summary statistics for the numerical variables. The average number of unemployed across the data is about 7,771 thousand. There are no missing values, because the 'count' for every variable is 574, which equals the number of records in the dataset.

```python
df = pd.read_csv('regressionexample.csv')
print(df.shape)
df.describe()
```

Output:

```
(574, 5)
```

| | pce | pop | psavert | uempmed | unemploy |
|-------|--------------|---------------|------------|------------|--------------|
| count | 574.000000 | 574.000000 | 574.000000 | 574.000000 | 574.000000 |
| mean | 4843.510453 | 257189.381533 | 7.936585 | 8.610105 | 7771.557491 |
| std | 3579.287206 | 36730.801593 | 3.124394 | 4.108112 | 2641.960571 |
| min | 507.400000 | 198712.000000 | 1.900000 | 4.000000 | 2685.000000 |
| 25% | 1582.225000 | 224896.000000 | 5.500000 | 6.000000 | 6284.000000 |
| 50% | 3953.550000 | 253060.000000 | 7.700000 | 7.500000 | 7494.000000 |
| 75% | 7667.325000 | 290290.750000 | 10.500000 | 9.100000 | 8691.000000 |
| max | 12161.500000 | 320887.000000 | 17.000000 | 25.200000 | 15352.000000 |

The *first line of code* below creates a list called 'target_column' that holds the name of the target variable. The *second line* gives us the list of all the features, excluding the target variable 'unemploy'.

The *third line* normalizes the predictors. This is done because the units of the variables differ significantly and may influence the modeling process. To prevent this, we divide each predictor by its maximum value, so that all values lie between 0 and 1.

The *fourth line* displays the summary of the normalized data. We can see that every independent variable now has a maximum of 1 and a minimum above 0. The target variable remains unchanged.

```python
target_column = ['unemploy']
predictors = [col for col in df.columns if col not in target_column]  # keeps column order stable across runs
df[predictors] = df[predictors] / df[predictors].max()
df.describe()
```

Output:

| | pce | pop | psavert | uempmed | unemploy |
|-------|------------|------------|------------|------------|--------------|
| count | 574.000000 | 574.000000 | 574.000000 | 574.000000 | 574.000000 |
| mean | 0.398266 | 0.801495 | 0.466858 | 0.341671 | 7771.557491 |
| std | 0.294313 | 0.114466 | 0.183788 | 0.163020 | 2641.960571 |
| min | 0.041722 | 0.619258 | 0.111765 | 0.158730 | 2685.000000 |
| 25% | 0.130101 | 0.700857 | 0.323529 | 0.238095 | 6284.000000 |
| 50% | 0.325087 | 0.788627 | 0.452941 | 0.297619 | 7494.000000 |
| 75% | 0.630459 | 0.904651 | 0.617647 | 0.361111 | 8691.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 15352.000000 |
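Dividing by the maximum is not the same as min-max scaling, which is why the 'min' row above stays above 0. The sketch below contrasts the two on a small made-up array; the matrix `X` and its values are hypothetical and serve only to show the difference.

```python
import numpy as np

# Hypothetical two-column feature matrix (not the guide's data).
X = np.array([[10.0, 200.0],
              [20.0, 300.0],
              [40.0, 500.0]])

# Max scaling, as used in this guide: divide each column by its maximum.
X_max_scaled = X / X.max(axis=0)

# Min-max scaling: subtract the column minimum, then divide by the column range.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Max scaling leaves each column's minimum above 0; min-max maps it to exactly 0.
print(X_max_scaled.min(axis=0), X_max_scaled.max(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Either approach puts the predictors on a comparable scale; which one is preferable depends on the data, and the guide's results are based on max scaling.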

We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation method.

The *first two lines of code* below create arrays of the independent (X) and dependent (y) variables, respectively. The *third line* splits the data into training and test datasets, with the 'test_size' argument specifying the proportion of the data to keep in the test set (0.30, or 30 percent) and 'random_state' fixing the shuffle so the split is reproducible. The *fourth line* prints the shape of the training set (401 observations of 4 variables) and test set (173 observations of 4 variables).

```python
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
```

Output:

```
(401, 4)
(173, 4)
```
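The split sizes and the role of 'random_state' can be checked without the CSV file. The sketch below uses random stand-in arrays (`X_demo` and `y_demo` are hypothetical names) with the same shape as the guide's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the guide's data: 574 rows, 4 predictors.
rng = np.random.default_rng(0)
X_demo = rng.random((574, 4))
y_demo = rng.random((574, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)
print(X_tr.shape, X_te.shape)  # (401, 4) (173, 4)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)
print(np.array_equal(X_te, X_te2))  # True
```

The test set gets 173 rows because scikit-learn rounds 0.30 * 574 = 172.2 up, and the remaining 401 rows form the training set.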

In this step, we will be implementing the various linear regression models using the scikit-learn library.

The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. For statistical inference, the residuals are also assumed to be normally distributed. Another assumption is that the predictors are not highly correlated with each other; when they are, the problem is called multicollinearity.

The linear regression equation can be expressed in the following form:

**y = a1x1 + a2x2 + a3x3 + ..... + anxn + b**

Where the following is true:

- y is the target variable.
- x1, x2, x3,...xn are the features.
- a1, a2, a3,..., an are the coefficients.
- b is the intercept of the model.

The coefficients a1, ..., an and the intercept b are selected through the Ordinary Least Squares (OLS) method, which works by minimizing the sum of the squared residuals (actual value - predicted value).
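As a rough illustration of what OLS does, the sketch below recovers known coefficients from synthetic, noiseless data using NumPy's least-squares solver. The data, the names `true_coefs` and `true_intercept`, and the use of `np.linalg.lstsq` are assumptions made for this example; this is not a description of how scikit-learn is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from y = 2*x1 - 1*x2 + 0.5*x3 + 4 (no noise).
X = rng.random((50, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y = X @ true_coefs + true_intercept

# Append a column of ones so the intercept b is estimated with the coefficients.
X_design = np.column_stack([X, np.ones(len(X))])

# lstsq returns the parameters that minimize the sum of squared residuals.
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coefs, intercept = params[:3], params[3]

print(np.round(coefs, 6))      # close to [2, -1, 0.5]
print(np.round(intercept, 6))  # close to 4
```

Because the data contains no noise, the fitted values match the generating coefficients up to floating-point error; with real data the residuals would not be zero.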

To fit the linear regression model, the first step is to instantiate the algorithm, which is done in the *first line of code* below. The *second line* fits the model on the training set.

```python
lr = LinearRegression()
lr.fit(X_train, y_train)
```

Output:

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
```

Once the model is built on the training set, we can make predictions. The *first line of code* below predicts on the training set. The *second and third lines of code* print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test dataset in the *fourth to sixth lines*.

```python
pred_train_lr = lr.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_lr)))
print(r2_score(y_train, pred_train_lr))

pred_test_lr = lr.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_lr)))
print(r2_score(y_test, pred_test_lr))
```

Output:

```
971.0295047627518
0.8685609551239368
1019.3232585671161
0.8396633322870104
```

The above output shows that the RMSE, one of the two evaluation metrics, is about 971 thousand for the training data and 1,019 thousand for the test data. The R-squared value is about 87 percent for the training data and 84 percent for the test data. The two sets of scores are close, which suggests the model generalizes reasonably well.

As discussed above, linear regression works by selecting coefficients for each independent variable that minimize a loss function. However, if the coefficients are too large, the model can over-fit the training dataset and will not generalize well to unseen data. To overcome this shortcoming, we apply regularization, which penalizes large coefficients. The following sections of the guide discuss the various regularization algorithms.

Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model. This modification is done by adding a penalty term that is proportional to the sum of the squared coefficients.

**Loss function = OLS + alpha * summation (squared coefficient values)**

In the above loss function, alpha is the parameter we need to select. A low alpha value can lead to over-fitting, whereas a high alpha value can lead to under-fitting.
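The effect of alpha on the coefficients can be seen in a small NumPy sketch of the closed-form ridge solution. The helper `ridge_coefficients`, the synthetic data, and the alpha values are all assumptions made for illustration; the data is centered so that the intercept is left unpenalized.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the coefficient vector w.
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=100)

alphas = [0.0, 1.0, 100.0, 10000.0]
norms = [np.linalg.norm(ridge_coefficients(X, y, a)) for a in alphas]

# The coefficient vector shrinks toward zero as alpha grows.
for a, n in zip(alphas, norms):
    print(a, round(float(n), 3))
```

With alpha set to 0 the result is the plain OLS fit; very large values push all coefficients toward zero, which is the under-fitting case described above.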

In scikit-learn, a ridge regression model is constructed by using the Ridge class. The *first line of code* below instantiates the Ridge Regression model with an alpha value of 0.01. The *second line* fits the model to the training data.

The *third line of code* predicts, while the *fourth and fifth lines* print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test dataset in the *sixth to eighth lines of code*.

```python
rr = Ridge(alpha=0.01)
rr.fit(X_train, y_train)
pred_train_rr = rr.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_rr)))
print(r2_score(y_train, pred_train_rr))

pred_test_rr = rr.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_rr)))
print(r2_score(y_test, pred_test_rr))
```

Output:

```
975.8314265299163
0.8672577596814723
1017.3110662731054
0.8402957317988335
```

The above output shows that the RMSE and R-squared values for the Ridge Regression model on the training data are about 976 thousand and 86.7 percent, respectively. For the test data, these metrics are about 1,017 thousand and 84 percent, respectively.

Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In Lasso, the loss function is modified to minimize the complexity of the model by limiting the sum of the absolute values of the model coefficients (also called the l1-norm).

The loss function for Lasso Regression can be expressed as below:

**Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)**

In the above loss function, alpha is the penalty parameter we need to select. The l1-norm penalty can force some coefficients to exactly zero, leaving only the remaining features with non-zero coefficients.
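The zeroing effect is easiest to see on data where some features are known to be irrelevant. The following sketch uses synthetic data in which only the first two of five features influence the target; the data and the alpha value of 0.1 are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Only the first two features drive the target; the other three are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)

# The coefficients of the three irrelevant features are set to exactly zero,
# while the two relevant ones are kept (slightly shrunk toward zero).
print(lasso.coef_)
```

On the guide's dataset with alpha set to 0.01 the penalty is much weaker relative to the scale of the target, so there is no guarantee that any coefficient ends up at zero.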

In scikit-learn, a lasso regression model is constructed by using the Lasso class. The *first line of code* below instantiates the Lasso Regression model with an alpha value of 0.01. The *second line* fits the model to the training data.

The *third line of code* predicts, while the *fourth and fifth lines* print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test dataset in the *sixth to eighth lines of code*.

```python
model_lasso = Lasso(alpha=0.01)
model_lasso.fit(X_train, y_train)
pred_train_lasso = model_lasso.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_lasso)))
print(r2_score(y_train, pred_train_lasso))

pred_test_lasso = model_lasso.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_lasso)))
print(r2_score(y_test, pred_test_lasso))
```

Output:

```
971.0300033264347
0.8685608201522376
1019.2575575977107
0.8396840007744909
```

The above output shows that the RMSE and R-squared values for the Lasso Regression model on the training data are about 971 thousand and 86.9 percent, respectively.

On the test data, these metrics are about 1,019 thousand and 84 percent, respectively. Lasso Regression can also be used for feature selection, because the coefficients of less important features can be reduced to zero.

ElasticNet combines the properties of both Ridge and Lasso regression. It works by penalizing the model using both the l2-norm and the l1-norm.
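The combined penalty can be written out in a few lines. The sketch below follows the form given in the scikit-learn documentation for ElasticNet, where 'l1_ratio' (0.5 by default) sets the mix between the two norms; the helper `elastic_net_penalty` and the coefficient vector `w` are hypothetical and exist only for this example.

```python
import numpy as np

def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty term: alpha * l1_ratio * |w|_1 + 0.5 * alpha * (1 - l1_ratio) * |w|_2^2."""
    coefs = np.asarray(coefs, dtype=float)
    l1 = np.sum(np.abs(coefs))       # l1-norm, as in Lasso
    l2_squared = np.sum(coefs ** 2)  # squared l2-norm, as in Ridge
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

w = [1.0, -2.0, 0.0]  # hypothetical coefficient vector

print(elastic_net_penalty(w, alpha=0.01, l1_ratio=1.0))  # pure l1 penalty
print(elastic_net_penalty(w, alpha=0.01, l1_ratio=0.0))  # pure l2 penalty
print(elastic_net_penalty(w, alpha=0.01, l1_ratio=0.5))  # equal mix (the default)
```

One point worth keeping in mind: scikit-learn's ElasticNet and Lasso divide the squared-error term by twice the number of samples, whereas Ridge does not, so the same alpha value does not represent the same penalty strength across these classes.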

In scikit-learn, an ElasticNet regression model is constructed by using the ElasticNet class. The *first line of code* below instantiates the ElasticNet Regression with an alpha value of 0.01. The *second line* fits the model to the training data.

The *third line of code* predicts, while the *fourth and fifth lines* print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test dataset in the *sixth to eighth lines of code*.

```python
# Elastic Net
model_enet = ElasticNet(alpha=0.01)
model_enet.fit(X_train, y_train)
pred_train_enet = model_enet.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_enet)))
print(r2_score(y_train, pred_train_enet))

pred_test_enet = model_enet.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_enet)))
print(r2_score(y_test, pred_test_enet))
```

Output:

```
1352.6359049952857
0.744952327304685
1379.7820437888938
0.7062147664176855
```

The above output shows that the RMSE and R-squared values for the ElasticNet Regression model on the training data are about 1,353 thousand and 74.5 percent, respectively. On the test data, these metrics are about 1,380 thousand and 70.6 percent, respectively.

In this guide, you have learned about Linear Regression models using the powerful Python library, scikit-learn. You have also learned about Regularization techniques to avoid the shortcomings of the linear regression models. The performance of the models is summarized below:

- Linear Regression Model: test set RMSE of about 1,019 thousand and R-squared of 83.97 percent.
- Ridge Regression Model: test set RMSE of about 1,017 thousand and R-squared of 84.03 percent.
- Lasso Regression Model: test set RMSE of about 1,019 thousand and R-squared of 83.97 percent.
- ElasticNet Regression Model: test set RMSE of about 1,380 thousand and R-squared of 70.62 percent.

The ElasticNet Regression model performs the worst at this alpha value. The other regression models perform better, with decent R-squared and stable RMSE values. The ideal result would be an RMSE of zero and an R-squared of 1, but that is almost impossible with real economic datasets.
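Since the build, predict, and evaluate step is identical for every model, the comparison can also be produced in a single loop. The sketch below runs on synthetic data (the arrays and the `results` dictionary are assumptions for this example), so its numbers will not match the ones reported in this guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the guide's X and y to compare the real models.
rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = X @ np.array([5.0, -3.0, 2.0, 1.0]) + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40
)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=0.01),
    "Lasso": Lasso(alpha=0.01),
    "ElasticNet": ElasticNet(alpha=0.01),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    results[name] = (rmse, r2)
    print(f"{name}: test RMSE = {rmse:.3f}, R-squared = {r2:.3f}")
```

Collecting the scores in one dictionary makes it easier to compare models side by side or to add further models later.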

Further iterations could improve model performance. We set alpha to 0.01, but hyperparameter tuning can be used to arrive at the optimal alpha value. Cross-validation can also be tried, along with feature selection techniques. These are not covered in this guide, which aims to help readers understand and implement the various linear regression models using the scikit-learn library.
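As a starting point for such tuning, here is a minimal sketch that searches over a few alpha values for Ridge with 5-fold cross-validation. The synthetic data, the candidate grid `alphas`, and the choice of `GridSearchCV` are assumptions; other tools such as `RidgeCV` would serve the same purpose.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in practice, use the training set from this guide.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = X @ np.array([5.0, -3.0, 2.0, 1.0]) + rng.normal(scale=0.2, size=200)

# Hypothetical grid of candidate alpha values.
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

print(search.best_params_)
# best_score_ is the negated mean squared error, so flip the sign before the root.
print(np.sqrt(-search.best_score_))
```

The selected alpha should then be confirmed on the held-out test set, which must not be used during the search itself.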
