Non-Linear Regression Trees with scikit-learn

May 21, 2019 • 16 Minute Read

Introduction

Regression is a supervised machine learning technique that predicts a continuous outcome. There are two main types of regression algorithms - linear and non-linear. While linear models are useful, they rely on the assumption of a linear relationship between the independent and dependent variables. In real business settings, this assumption is often difficult to meet. This is where non-linear regression algorithms come into the picture, as they are able to capture the non-linearity within the data.

In this guide, the focus will be on Regression Trees and Random Forest, which are tree-based non-linear algorithms. As always, the first step is to understand the Problem Statement.

Problem Statement

In this guide, we will try to build regression algorithms for predicting unemployment within an economy. The data used in this project was produced from US economic time series data available from [https://research.stlouisfed.org/fred2]. The data contains 574 rows and 5 variables, as described below:

psavert - personal savings rate.
pce - personal consumption expenditures, in billions of dollars.
uempmed - median duration of unemployment, in weeks.
pop - total population, in thousands.
unemploy - number of unemployed in thousands (dependent variable).

Evaluation Metrics

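We will evaluate the performance of the models using two metrics - R-squared and Root Mean Squared Error (RMSE). RMSE is expressed in the units of the target variable, while R-squared is the proportion of the variance in the target that the model explains. A lower RMSE and a higher R-squared indicate a better model.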
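
As a minimal sketch of how these two metrics are computed, the snippet below calculates RMSE and R-squared by hand and compares them with scikit-learn's helpers. The `y_true` and `y_pred` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values should agree with scikit-learn's helpers
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```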

Steps

In this guide, we will work through the following steps:

Step 1 - Loading the required libraries and modules.

Step 2 - Loading the data and performing basic data checks.

Step 3 - Creating arrays for the features and the response variable.

Step 4 - Creating the training and test datasets.

Step 5 - Build, predict, and evaluate the models - Decision Tree and Random Forest.

The following sections will cover these steps.

Step 1 - Loading the Required Libraries and Modules

import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt
    

Step 2 - Reading the Data and Performing Basic Data Checks

The first line of code reads in the data as a pandas DataFrame, while the second line prints its shape - 574 observations of 5 variables. The third line gives summary statistics of the numerical variables.

The mean population is 257 million, while the mean unemployment stands at 7.8 million. There are no missing values, as the 'count' for every variable is 574, which equals the number of records in the data. Another important observation is the difference in the scale of the variables. Population ranges from 198 to 321 million, whereas the personal savings rate, 'psavert', ranges from 1.9 to 17 percent. This difference in scale is addressed by normalizing the predictors in the next step.

df = pd.read_csv('regressionexample.csv')
print(df.shape)
df.describe()
    

Output:

(574, 5)


|       | pce          | pop           | psavert    | uempmed    | unemploy     |
|-------|--------------|---------------|------------|------------|--------------|
| count | 574.000000   | 574.000000    | 574.000000 | 574.000000 | 574.000000   |
| mean  | 4843.510453  | 257189.381533 | 7.936585   | 8.610105   | 7771.557491  |
| std   | 3579.287206  | 36730.801593  | 3.124394   | 4.108112   | 2641.960571  |
| min   | 507.400000   | 198712.000000 | 1.900000   | 4.000000   | 2685.000000  |
| 25%   | 1582.225000  | 224896.000000 | 5.500000   | 6.000000   | 6284.000000  |
| 50%   | 3953.550000  | 253060.000000 | 7.700000   | 7.500000   | 7494.000000  |
| 75%   | 7667.325000  | 290290.750000 | 10.500000  | 9.100000   | 8691.000000  |
| max   | 12161.500000 | 320887.000000 | 17.000000  | 25.200000  | 15352.000000 |
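
The missing-value check described above can also be made explicit. The following is a small sketch on a made-up frame (the `toy` DataFrame and its values are invented for illustration); it shows that the 'count' row of `describe()` excludes missing values, which is why a count equal to the number of rows implies no missing data.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing value, for illustration only
toy = pd.DataFrame({
    "psavert": [10.5, 7.7, np.nan, 5.5],
    "uempmed": [4.0, 6.0, 7.5, 9.1],
})

# Number of missing (NaN) values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# 'count' reports only the non-missing values in each column
counts = toy.describe().loc["count"]
print(counts)
```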
    

Step 3 - Creating Arrays for the Features and the Response Variable

The first line of code creates an object of the target variable called 'target_column'. The second line gives us the list of all the features, excluding the target variable 'unemploy'.

We have seen above that the units of the variables differ significantly, which may influence the modeling process. To prevent this, we will normalize the predictors by dividing each one by its maximum value, which places them on a common scale with a maximum of 1. The third line performs this task.

The fourth line displays the summary of the normalized data. We can see that all the independent variables have now been scaled between 0 and 1. The target variable remains unchanged.

target_column = ['unemploy']
predictors = [col for col in df.columns if col not in target_column]  # preserves the column order
df[predictors] = df[predictors] / df[predictors].max()
df.describe()
    

Output:

|       | pce        | pop        | psavert    | uempmed    | unemploy     |
|-------|------------|------------|------------|------------|--------------|
| count | 574.000000 | 574.000000 | 574.000000 | 574.000000 | 574.000000   |
| mean  | 0.398266   | 0.801495   | 0.466858   | 0.341671   | 7771.557491  |
| std   | 0.294313   | 0.114466   | 0.183788   | 0.163020   | 2641.960571  |
| min   | 0.041722   | 0.619258   | 0.111765   | 0.158730   | 2685.000000  |
| 25%   | 0.130101   | 0.700857   | 0.323529   | 0.238095   | 6284.000000  |
| 50%   | 0.325087   | 0.788627   | 0.452941   | 0.297619   | 7494.000000  |
| 75%   | 0.630459   | 0.904651   | 0.617647   | 0.361111   | 8691.000000  |
| max   | 1.000000   | 1.000000   | 1.000000   | 1.000000   | 15352.000000 |
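
Dividing by the column maximum sets each predictor's maximum to 1 but does not, in general, map its minimum to 0, as the 'min' row above shows. As a hedged illustration on a made-up frame (the `toy` values are invented, not taken from the dataset), the sketch below compares this approach with scikit-learn's `MinMaxScaler`, a different technique that stretches each column over the full 0 to 1 range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values on very different scales, for illustration only
toy = pd.DataFrame({
    "pop": [200000.0, 250000.0, 320000.0],
    "psavert": [2.0, 8.0, 16.0],
})

# Approach used in this guide: divide each column by its maximum
scaled_by_max = toy / toy.max()

# Alternative (not used in this guide): min-max scaling to the full [0, 1] range
scaled_min_max = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns
)

print(scaled_by_max)
print(scaled_min_max)
```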
    

Step 4 - Creating the Training and Test Datasets

We will build our models on the training set and evaluate their performance on the test set. The first two lines of code below create arrays of the independent (X) and dependent (y) variables, respectively. The third line splits the data into training and test sets, with the 'test_size' argument specifying the proportion of the data to be kept in the test set (30 percent here). The fourth line prints the shape of the training set (401 observations of 4 variables) and the test set (173 observations of 4 variables).

X = df[predictors].values
y = df[target_column].values.ravel()  # 1-D target array, as scikit-learn estimators expect

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
    

Output:

(401, 4)
(173, 4)
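
The split sizes can be checked without the original CSV. The sketch below uses placeholder arrays of zeros (`X_demo` and `y_demo` are stand-ins with the same number of rows and features as the data, not the real values) and reproduces the 401/173 split that a 'test_size' of 0.30 gives for 574 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same dimensions as the data (574 rows, 4 predictors)
X_demo = np.zeros((574, 4))
y_demo = np.zeros(574)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)

# The test set gets ceil(0.30 * 574) = 173 rows; the remaining 401 go to training
print(X_tr.shape, X_te.shape)
```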
    

Step 5 - Build, Predict, and Evaluate the Models - Decision Tree and Random Forest

In this step, we will be implementing the various tree-based, non-linear regression models using the scikit-learn library.

Decision Trees

Decision Trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by repeatedly splitting the data into increasingly homogeneous subsets based on the most significant splitter among the independent variables. The best split is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the Gini index, whereas for a regression tree, the default metric is the mean squared error.
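
To make the idea of a cost-minimizing split concrete, here is a simplified sketch of how a single split on one feature could be chosen by minimizing the weighted mean squared error of the two child nodes. The `best_split` helper and the toy data are hypothetical and written for illustration; scikit-learn's actual implementation is more elaborate, although a depth-1 tree fitted on the same data should select the same threshold.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def best_split(x, y):
    """Return (threshold, cost) minimizing the weighted MSE of the two children."""
    best_threshold, best_cost = None, np.inf
    values = np.unique(x)  # sorted unique feature values
    # Candidate thresholds are midpoints between consecutive feature values
    for left_val, right_val in zip(values[:-1], values[1:]):
        threshold = (left_val + right_val) / 2.0
        left, right = y[x <= threshold], y[x > threshold]
        # Variance of a node equals its MSE around the node mean
        cost = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Made-up data with an obvious jump between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

threshold, cost = best_split(x, y)
print(threshold, cost)

# A one-level regression tree on the same data, for comparison
stump = DecisionTreeRegressor(max_depth=1).fit(x.reshape(-1, 1), y)
stump_threshold = stump.tree_.threshold[0]
print(stump_threshold)
```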

The basic workflow of Decision Trees is as follows:

The modeling process starts at the Root Node, which represents the entire data. This node is divided into sub-nodes, a process referred to as splitting. Splitting continues until a stopping criterion is met (for example, a maximum depth or a minimum number of samples per leaf), and a sub-node that is split further is called a decision node. Nodes that do not split any further are called Leaf or Terminal nodes. We can also remove sub-nodes through a process called Pruning.
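
The terms above can be explored on a fitted tree. The sketch below uses synthetic data (`X_demo` and `y_demo` are generated for illustration) and assumes a scikit-learn version that provides `export_text` and the `ccp_alpha` cost-complexity pruning parameter (0.22 or later). It compares the number of leaves in an unrestricted tree and a pruned tree, and prints the pruned tree's decision and leaf nodes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: the target depends on the first feature plus noise
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Unrestricted tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Pruned tree: cost-complexity pruning removes splits that add little
# (the value 0.1 is an arbitrary choice for this illustration)
pruned_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=0.1).fit(X_demo, y_demo)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
print(full_tree.get_depth(), pruned_tree.get_depth())

# Text view of the pruned tree: indented lines are decision nodes,
# lines with 'value' are leaf nodes
print(export_text(pruned_tree, feature_names=["x0", "x1"]))
```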

We will now create a CART regression model using the DecisionTreeRegressor class. The first line of code below instantiates the algorithm, and the second line fits the model on the training set. The arguments used are max_depth, which indicates the maximum depth of the tree, and min_samples_leaf, which indicates the minimum number of samples required at a leaf node. Because min_samples_leaf is given as a float (0.13), scikit-learn interprets it as a fraction of the training samples rather than as an absolute count.

dtree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)

dtree.fit(X_train, y_train)

Output:

      DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
               max_leaf_nodes=None, min_impurity_decrease=0.0,
               min_impurity_split=None, min_samples_leaf=0.13,
               min_samples_split=2, min_weight_fraction_leaf=0.0,
               presort=False, random_state=3, splitter='best')
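
When 'min_samples_leaf' is a float, scikit-learn treats it as a fraction of the training rows, so every leaf must hold at least ceil(0.13 * n_samples) observations. The sketch below checks this on random placeholder data (`X_demo` and `y_demo` are generated for illustration and only share the row count of the training set) by reading the leaf sizes from the fitted tree.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Random placeholder data with the same number of rows as the training set
rng = np.random.RandomState(3)
X_demo = rng.uniform(size=(401, 4))
y_demo = rng.uniform(size=401)

tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)
tree.fit(X_demo, y_demo)

# Minimum leaf size implied by the fraction
min_leaf_size = math.ceil(0.13 * 401)

# Leaves are the nodes without children (children_left == -1)
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print(min_leaf_size)
print(leaf_sizes.min(), len(leaf_sizes))
```

With such a large minimum leaf size, the tree can have only a handful of leaves, so 'min_samples_leaf' rather than 'max_depth' is likely to be the binding constraint for this model.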
    

Once the model is built on the training set, we can make predictions. The first line of code below predicts on the training set. The second and third lines of code print the evaluation metrics - RMSE and R-squared - on the training set. The same steps are repeated on the test dataset in the fourth to sixth lines.

# Code lines 1 to 3
pred_train_tree = dtree.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_tree)))
print(r2_score(y_train, pred_train_tree))

# Code lines 4 to 6
pred_test_tree = dtree.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_tree)))
print(r2_score(y_test, pred_test_tree))
    

Output:

1176.4038283170603
0.807082155638476
1180.376221946623
0.7849943413269588
    

The above output shows that the RMSE is about 1,176 for the training data and 1,180 for the test data. Since 'unemploy' is measured in thousands, these correspond to errors of roughly 1.18 million unemployed persons. The R-squared value is 80.7 percent for the training data and 78.5 percent for the test data. These are decent numbers, but the model can be improved further by parameter tuning. We will change the value of the 'max_depth' parameter to see how it affects model performance.

The first four lines of code below instantiate and fit the regression trees with a 'max_depth' parameter of 2 and 5, respectively. The fifth and sixth lines of code generate predictions on the training data, whereas the seventh and eighth lines of code give predictions on the testing data.

# Code Lines 1 to 4: Fit the regression trees 'dtree1' and 'dtree2'
dtree1 = DecisionTreeRegressor(max_depth=2)
dtree2 = DecisionTreeRegressor(max_depth=5)
dtree1.fit(X_train, y_train)
dtree2.fit(X_train, y_train)

# Code Lines 5 to 6: Predict on training data
tr1 = dtree1.predict(X_train)
tr2 = dtree2.predict(X_train)

# Code Lines 7 to 8: Predict on testing data
y1 = dtree1.predict(X_test)
y2 = dtree2.predict(X_test)
    

The code below generates the evaluation metrics - RMSE and R-squared - for the first regression tree, 'dtree1'.

# Print RMSE and R-squared value for regression tree 'dtree1' on training data
print(np.sqrt(mean_squared_error(y_train, tr1)))
print(r2_score(y_train, tr1))

# Print RMSE and R-squared value for regression tree 'dtree1' on testing data
print(np.sqrt(mean_squared_error(y_test, y1)))
print(r2_score(y_test, y1))
    

Output:

1184.4861869175104
0.8044222060059463
1339.4870036355467
0.7231235652677634
    

The above output for the 'dtree1' model shows that the RMSE is about 1,184 (in thousands) for the training data and 1,339 for the test data. The R-squared value is 80.4 percent for the training data and 72.3 percent for the test data. This model underperforms the previous model on both evaluation metrics.

We will now examine the performance of the decision tree model, 'dtree2', by running the following lines of code.

# Print RMSE and R-squared value for regression tree 'dtree2' on training data
print(np.sqrt(mean_squared_error(y_train, tr2)))
print(r2_score(y_train, tr2))

# Print RMSE and R-squared value for regression tree 'dtree2' on testing data
print(np.sqrt(mean_squared_error(y_test, y2)))
print(r2_score(y_test, y2))
    

Output:

562.5285929502469
0.9558888127723508
686.4818444997242
0.9272777851696767
    

The above output shows significant improvement over the earlier models. The training and test set RMSE come down to about 563 and 686 (in thousands), respectively, while the R-squared values increase to 95.6 percent and 92.7 percent. This shows that the regression tree with a 'max_depth' of 5 performs better, demonstrating how parameter tuning can improve model performance.
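
Instead of fitting one tree per candidate value by hand, the comparison of 'max_depth' values could also be automated. The following is a sketch of one possible approach using `GridSearchCV` with 5-fold cross-validation on synthetic data (`X_demo` and `y_demo` are generated for illustration, and the grid of depths is an arbitrary choice); it is an alternative to the manual comparison in this guide, not the method used above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(300, 4))
y_demo = np.sin(6 * X_demo[:, 0]) + 0.1 * rng.normal(size=300)

# Try each candidate depth with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 5, 8]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# best_score_ is the negated MSE, so flip the sign before taking the root
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmse)
```

Cross-validating on the training data in this way keeps the test set untouched until the final evaluation.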

Random Forest (an Extension of Bootstrap Aggregation)

Decision Trees are useful, but they tend to overfit the training data, which leads to high variance in their predictions on unseen data. Random Forest algorithms address this shortcoming by reducing the variance of the decision trees. They are called 'Forest' because they are a collection, or ensemble, of several decision trees, each trained on a bootstrap sample of the training data, whose predictions are averaged. One major difference between a Decision Tree and a Random Forest model is how the splits happen. In a Random Forest, instead of trying splits on all the features, a random sample of features can be considered at each split, which further reduces the variance of the model.
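
The two ingredients described above, bootstrap samples and random feature selection at each split, can be sketched by hand. The snippet below is a simplified illustration on synthetic data, not scikit-learn's implementation: the data, the number of trees (25), and the choice of one candidate feature per split are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

n_trees = 25
per_tree_predictions = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    rows = rng.randint(0, len(X_demo), size=len(X_demo))
    # max_features=1: only one randomly chosen feature is tried at each split
    tree = DecisionTreeRegressor(max_features=1, random_state=i)
    tree.fit(X_demo[rows], y_demo[rows])
    per_tree_predictions.append(tree.predict(X_demo))

per_tree_predictions = np.array(per_tree_predictions)

# The ensemble prediction is the average over all trees
ensemble_prediction = per_tree_predictions.mean(axis=0)
print(per_tree_predictions.shape, ensemble_prediction.shape)
```

Note that in recent scikit-learn versions, `RandomForestRegressor` considers all features at each split by default, so feature sampling only takes effect when 'max_features' is set to a smaller value.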

In scikit-learn, the RandomForestRegressor class is used for building random forest regression models. The first line of code below instantiates the Random Forest Regression model with an 'n_estimators' value of 500, where 'n_estimators' indicates the number of trees in the forest. The 'oob_score=True' argument additionally asks scikit-learn to compute an out-of-bag R-squared estimate. The second line fits the model to the training data.

The third line of code predicts, while the fourth and fifth lines print the evaluation metrics - RMSE and R-squared - on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.

# RF model
model_rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=100)
model_rf.fit(X_train, y_train)
pred_train_rf = model_rf.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_rf)))
print(r2_score(y_train, pred_train_rf))

pred_test_rf = model_rf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_rf)))
print(r2_score(y_test, pred_test_rf))
    

Output:

138.29018295109884
0.9973341098832793
280.3496374600418
0.9878714471993849
    

The above output shows that the RMSE and R-squared values on the training data are about 138 (in thousands) and 99.7 percent, respectively. For the test data, these metrics are about 280 and 98.8 percent, respectively. The random forest model performs far better than the decision tree models built earlier.
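
The model above is created with 'oob_score=True', but the resulting estimate is never printed. As a sketch of how it could be inspected, the snippet below fits a smaller forest on synthetic data (`X_demo` and `y_demo` are generated for illustration, and 100 trees are used only to keep the example fast) and reads the fitted `oob_score_` and `feature_importances_` attributes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which the first feature matters most
rng = np.random.RandomState(100)
X_demo = rng.uniform(0, 1, size=(300, 4))
y_demo = 10 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=100)
forest.fit(X_demo, y_demo)

# R-squared computed on the rows each tree did not see during training
print(forest.oob_score_)

# Relative importance of each feature; the values sum to 1
print(forest.feature_importances_)
```

The out-of-bag score gives an estimate of generalization performance without using the test set, which makes it a useful cross-check on the test-set R-squared.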

Conclusion

In this guide, you have learned about tree-based non-linear regression models - Decision Tree and Random Forest. You have also learned how to tune the parameters of a Regression Tree.

We also observed that the Random Forest model outperforms the Regression Tree models, with a test set RMSE of about 280 thousand and an R-squared value of 98.8 percent. This is close to the ideal R-squared value of 1, indicating the superior performance of the Random Forest algorithm.

To learn more about Machine Learning using scikit-learn, please refer to the guide, Scikit Machine Learning.