Author avatar

Deepika Singh

Non-Linear Regression Trees with R

Deepika Singh

  • Nov 18, 2019
  • 16 Min read
  • 56 Views
  • Nov 18, 2019
  • 16 Min read
  • 56 Views
Data
R

Introduction

Regression is a supervised machine learning technique that predicts a continuous outcome. There are two types of regression algorithms: linear and nonlinear. While linear models are useful, they rely on the assumption of a linear relationship between the independent and dependent variables. This assumption is difficult to meet in real business scenarios. This is where non-linear regression algorithms come into picture that can capture non-linearity within the data.

In this guide, you'll learn how to implement non-linear regression trees using R.

Data

Unemployment is an important socio-economic and political concern for a country, and managing it is a major task for any government. In this guide, we will build non-linear regression algorithms for predicting unemployment within an economy.

The data comes from the US economic time series data available from http://research.stlouisfed.org/fred2. The data contains 574 rows and 5 variables, as described below:

  1. psavert: personal savings rate

  2. pce: personal consumption expenditures, in billions of dollars

  3. uempmed: median duration of unemployment, in weeks

  4. pop: total population, in millions

  5. unemploy: number unemployed, in millions. This is the dependent variable.

Evaluation Metrics

We will evaluate the performance of the model using two metrics: R-squared value and Root Mean Squared Error (RMSE). Ideally, lower RMSE and higher R-squared values are indicative of a good model.

Let's start by loading the required libraries and the data.

1
2
3
4
5
6
7
8
9
10
11
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(rpart)    
library(rpart.plot)
library(randomForest)

dat <- read_csv("reg_data.csv")
glimpse(dat)
{r}

Output:

1
2
3
4
5
6
7
Observations: 574
Variables: 5
$ pce      <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9...
$ pop      <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1...
$ psavert  <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 1...
$ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4...
$ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2...

The output shows that all the variables in the dataset are numerical variables (labeled as dbl). Let's look at the summary statistics of the data with the summary() function.

1
summary(dat)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
      pce                 pop              psavert             uempmed         
 Min.   :  507.400   Min.   :198.7000   Min.   : 1.900000   Min.   : 4.000000  
 1st Qu.: 1582.225   1st Qu.:224.8750   1st Qu.: 5.500000   1st Qu.: 6.000000  
 Median : 3953.550   Median :253.0500   Median : 7.700000   Median : 7.500000  
 Mean   : 4843.510   Mean   :257.1913   Mean   : 7.936585   Mean   : 8.610105  
 3rd Qu.: 7667.325   3rd Qu.:290.2500   3rd Qu.:10.500000   3rd Qu.: 9.100000  
 Max.   :12161.500   Max.   :320.9000   Max.   :17.000000   Max.   :25.200000  
    unemploy        
 Min.   : 2.700000  
 1st Qu.: 6.300000  
 Median : 7.500000  
 Mean   : 7.770383  
 3rd Qu.: 8.700000  
 Max.   :15.400000  

The output shows that there are no missing values in the data. It also indicates that the unit and magnitude of the variables is different, therefore scaling will be required.

Data Partitioning

We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test sets, while the last two lines print the dimensions of the training and test sets. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.

1
2
3
4
5
6
7
8
9
set.seed(100) 
index = sample(1:nrow(dat), 0.7*nrow(dat)) 

train = dat[index,] # Create the training data 
test = dat[-index,] # Create the test data

dim(train)
dim(test)
 
{r}

Output:

1
2
3
4
[1] 401   5

[1] 173   5
 

Scaling the Numeric Features

The numeric features need to be scaled, otherwise the modeling process may be adversely affected.

The first line of code below creates a list that contains the names of independent numeric variables. The second line uses the preProcess function from the caret package to complete the scaling task. The pre-processing object is fit only to the training data, while the scaling is applied on both the train and test sets. This is done in the third and fourth lines of code below. The fifth line prints the summary of the scaled train dataset. The output shows that now all the numeric features have a mean value of zero except the target variable, unemploy, which was not scaled.

1
2
3
4
5
6
7
8
cols = c('pce', 'pop', 'psavert', 'uempmed')

pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))

train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])

summary(train)
{r}

Output:

1
2
3
4
5
6
7
      pce               pop             psavert            uempmed           unemploy     
 Min.   :-1.2135   Min.   :-1.6049   Min.   :-1.89481   Min.   :-1.1200   Min.   : 2.700  
 1st Qu.:-0.8856   1st Qu.:-0.8606   1st Qu.:-0.76518   1st Qu.:-0.6198   1st Qu.: 6.500  
 Median :-0.2501   Median :-0.1416   Median :-0.05513   Median :-0.2447   Median : 7.500  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 7.791  
 3rd Qu.: 0.7797   3rd Qu.: 0.9060   3rd Qu.: 0.81629   3rd Qu.: 0.1055   3rd Qu.: 8.600  
 Max.   : 2.1244   Max.   : 1.8047   Max.   : 2.91417   Max.   : 4.1571   Max.   :15.300  

Decision Tree

Decision trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by splitting the data into two or more homogeneous sets based on the most significant splitter among the independent variables. The best differentiation is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the gini index, whereas for a regression tree, the default metric is the mean squared error.

The basic workflow of decision trees is as follows:

The modeling process starts at the Root Node, which represents the entire data. This is divided into two or more sub-nodes, also referred to as splitting. This process of splitting continues until the splitting criterion is met, and the sub-node where splitting happens is called a decision node. Once the splitting criterion is met, the nodes do not split any further; such nodes are called a Leaf or Terminal node. We can also remove sub-nodes through a process called Pruning.

Training the Model

We will now create a regression tree model using the rpart library. The first step is to instantiate the algorithm, which is done in the first line of code below. The argument method = "anova" specifies that a regression model is to be built. For classification problem, the argument will be class. The control argument provides a list of options that control the way the decision tree algorithm is trained.

In the code below, the control=rpart.control(minsplit=20, cp=0.001) argument specifies that the minimum number of observations in a node should be 20 before attempting a split. It also specifies that a split happens only if it decreases the error metric by a factor of 0.001, referred to as the cost complexity factor. The second line prints the summary of the model. Note that it is possible to understand all the possible arguments with the ?rpart() command.

1
2
3
tree_model = rpart(unemploy ~ ., data=train, method = "anova", control=rpart.control(minsplit=20, cp=0.001))

summary(tree_model)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
#Truncated Output 

Call:
rpart(formula = unemploy ~ ., data = train, method = "anova", 
    control = rpart.control(minsplit = 20, cp = 0.001))
  n= 401 

               CP nsplit     rel error        xerror          xstd
1  0.508609424691      0 1.00000000000 1.00746991170 0.09222517909
2  0.240199126781      1 0.49139057531 0.53432047449 0.04138079775
3  0.062388054346      2 0.25119144853 0.29990084346 0.02550320375
4  0.027180308031      3 0.18880339418 0.21834245878 0.02083102075
5  0.024078055174      4 0.16162308615 0.20705841419 0.02021514091
6  0.019456798303      5 0.13754503098 0.17518743853 0.01711287778
7  0.015928262852      6 0.11808823267 0.16563982394 0.01633076066
8  0.014943330509      7 0.10215996982 0.15505479787 0.01516085859
9  0.012183802564      8 0.08721663931 0.13629795408 0.01491501765
10 0.007010774289      9 0.07503283675 0.12235179176 0.01517832910
11 0.006864114984     11 0.06101128817 0.11620014096 0.01353111243
12 0.004679462622     12 0.05414717319 0.11258856185 0.01342153785
13 0.002606235055     13 0.04946771057 0.09356044395 0.01191539603
14 0.002392463727     14 0.04686147551 0.08924901899 0.01192405697
15 0.002373022016     15 0.04446901178 0.08914954140 0.01192548795
16 0.001075173879     16 0.04209598977 0.08103175712 0.01180204685
17 0.001054136373     18 0.03994564201 0.07929979931 0.01179186081
18 0.001039626829     19 0.03889150564 0.07916127996 0.01179072614
19 0.001024169682     21 0.03681225198 0.07900442265 0.01178836902
20 0.001000000000     22 0.03578808230 0.07900442265 0.01178836902

Variable importance
    pop uempmed     pce psavert 
     33      32      28       7 

The important variables, according to the regression tree model, are pop, uempmed and pce.

Evaluating the Model

Let's evaluate the model performance on the train and test datasets. The first step is to create a function for calculating the evaluation metrics, R-squared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# Step 1 - create the evaluation metrics function

eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE = sqrt(SSE/nrow(df))
  
# Model performance metrics
  data.frame(
    RMSE = RMSE,
    Rsquare = R_square
  )
  
}

# Step 2 - predicting and evaluating the model on train data

predictions_train_cart = predict(tree_model, data = train)
eval_results(train$unemploy, predictions_train_cart, train)

# Step 3 - predicting and evaluating the model on test data

predictions_test_cart = predict(tree_model, newdata = test)
eval_results(test$unemploy, predictions_test_cart, test)
{r}

Output:

1
2
3
4
5
6
7
8
9
# Training set results

        RMSE   Rsquare
1 0.4644746 0.9642119"

# Test set results

        RMSE   Rsquare
1 0.6106443 0.9591839

The above output shows that the RMSE and R-squared values for the decision tree model on the training data are 0.47 million and 96.4 percent, respectively. For the test data, the results for these metrics are 0.61 million and 96 percent, respectively.

Random Forest (or Bootstrap Aggregation)

Decision Trees are useful, but the problem is that they often tend to overfit the training data, leading to high variances in the test data. Random forest algorithms overcome this shortcoming by reducing the variance of the decision trees. They are called a forest because they are the collection, or ensemble, of several decision trees. One major difference between a decision tree and a random forest model is how the splits happen. In random forest, instead of trying splits on all the features, a sample of features is selected for each split, thereby reducing the variance of the model.

Training the Model

In R, the randomForest package is used to train the random forest algorithm. The first line of code below instantiates the random forest regression model, and the second line prints the summary of the model.

1
2
3
rf_model = randomForest(unemploy ~ ., data=train)
summary(rf_model)
 
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
                Length Class  Mode     
call              3    -none- call     
type              1    -none- character
predicted       401    -none- numeric  
mse             500    -none- numeric  
rsq             500    -none- numeric  
oob.times       401    -none- numeric  
importance        4    -none- numeric  
importanceSD      0    -none- NULL     
localImportance   0    -none- NULL     
proximity         0    -none- NULL     
ntree             1    -none- numeric  
mtry              1    -none- numeric  
forest           11    -none- list     
coefs             0    -none- NULL     
y               401    -none- numeric  
test              0    -none- NULL     
inbag             0    -none- NULL     
terms             3    terms  call   

We can look at the important variables using the code below. The output shows that the variables pop, pce, and uempmed are the significant predictors.

1
importance(rf_model)
{r}

Output:

1
2
3
4
5
        IncNodePurity
pce       695.0615302
pop       759.6686618
psavert   266.2619092
uempmed   676.7521072

Evaluating the Model

Let's evaluate the model performance on the train and test datasets, using the lines of code below.

1
2
3
4
5
6
7
# predicting and evaluating the model on train data
predictions_train_rf = predict(rf_model, data = train)
eval_results(train$unemploy, predictions_train_rf, train)

# predicting and evaluating the model on test data
predictions_test_rf = predict(rf_model, newdata = test)
eval_results(test$unemploy, predictions_test_rf, test)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
# Output on Training Data

        RMSE   Rsquare
1  0.3469372 0.9800328


# Output on Test Data

       RMSE   Rsquare
1  0.511681 0.9713415

The above output shows that the RMSE and R-squared values on the training data are 0.35 million and 98 percent, respectively. For the test data, the results for these metrics are 0.51 million and 97.1 percent, respectively. The performance of the random forest model is superior to the decision tree model built earlier.

Conclusion

In this guide, you learned about tree-based non-linear regression models, including Decision tree and random forest. We observed that the random forest model outperforms the decision tree model. The results from these two algorithms are summarized below:

  1. Decision Tree Model: Test set RMSE of 0.6 million and R-square of 96 percent.

  2. Random Forest Model: Test set RMSE of 0.5 million and R-square of 97.1 percent.

To learn more about data science with R, please refer to the following guides:

0