Author avatar

Deepika Singh

Non-Linear Regression Trees with R

Deepika Singh

  • Nov 18, 2019
  • 16 Min read
  • 3,899 Views
  • Nov 18, 2019
  • 16 Min read
  • 3,899 Views
Data
R

Introduction

Regression is a supervised machine learning technique that predicts a continuous outcome. There are two types of regression algorithms: linear and nonlinear. While linear models are useful, they rely on the assumption of a linear relationship between the independent and dependent variables. This assumption is difficult to meet in real business scenarios. This is where non-linear regression algorithms come into picture that can capture non-linearity within the data.

In this guide, you'll learn how to implement non-linear regression trees using R.

Data

Unemployment is an important socio-economic and political concern for a country, and managing it is a major task for any government. In this guide, we will build non-linear regression algorithms for predicting unemployment within an economy.

The data comes from the US economic time series data available from http://research.stlouisfed.org/fred2. The data contains 574 rows and 5 variables, as described below:

  1. psavert: personal savings rate

  2. pce: personal consumption expenditures, in billions of dollars

  3. uempmed: median duration of unemployment, in weeks

  4. pop: total population, in millions

  5. unemploy: number unemployed, in millions. This is the dependent variable.

Evaluation Metrics

We will evaluate the performance of the model using two metrics: R-squared value and Root Mean Squared Error (RMSE). Ideally, lower RMSE and higher R-squared values are indicative of a good model.

Let's start by loading the required libraries and the data.

1library(plyr)
2library(readr)
3library(dplyr)
4library(caret)
5library(ggplot2)
6library(rpart)    
7library(rpart.plot)
8library(randomForest)
9
10dat <- read_csv("reg_data.csv")
11glimpse(dat)
{r}

Output:

1Observations: 574
2Variables: 5
3$ pce      <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9...
4$ pop      <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1...
5$ psavert  <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 1...
6$ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4...
7$ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2...

The output shows that all the variables in the dataset are numerical variables (labeled as dbl). Let's look at the summary statistics of the data with the summary() function.

1summary(dat)
{r}

Output:

1      pce                 pop              psavert             uempmed         
2 Min.   :  507.400   Min.   :198.7000   Min.   : 1.900000   Min.   : 4.000000  
3 1st Qu.: 1582.225   1st Qu.:224.8750   1st Qu.: 5.500000   1st Qu.: 6.000000  
4 Median : 3953.550   Median :253.0500   Median : 7.700000   Median : 7.500000  
5 Mean   : 4843.510   Mean   :257.1913   Mean   : 7.936585   Mean   : 8.610105  
6 3rd Qu.: 7667.325   3rd Qu.:290.2500   3rd Qu.:10.500000   3rd Qu.: 9.100000  
7 Max.   :12161.500   Max.   :320.9000   Max.   :17.000000   Max.   :25.200000  
8    unemploy        
9 Min.   : 2.700000  
10 1st Qu.: 6.300000  
11 Median : 7.500000  
12 Mean   : 7.770383  
13 3rd Qu.: 8.700000  
14 Max.   :15.400000  

The output shows that there are no missing values in the data. It also indicates that the unit and magnitude of the variables is different, therefore scaling will be required.

Data Partitioning

We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test sets, while the last two lines print the dimensions of the training and test sets. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.

1set.seed(100) 
2index = sample(1:nrow(dat), 0.7*nrow(dat)) 
3
4train = dat[index,] # Create the training data 
5test = dat[-index,] # Create the test data
6
7dim(train)
8dim(test)
9 
{r}

Output:

1[1] 401   5
2
3[1] 173   5
4 

Scaling the Numeric Features

The numeric features need to be scaled, otherwise the modeling process may be adversely affected.

The first line of code below creates a list that contains the names of independent numeric variables. The second line uses the preProcess function from the caret package to complete the scaling task. The pre-processing object is fit only to the training data, while the scaling is applied on both the train and test sets. This is done in the third and fourth lines of code below. The fifth line prints the summary of the scaled train dataset. The output shows that now all the numeric features have a mean value of zero except the target variable, unemploy, which was not scaled.

1cols = c('pce', 'pop', 'psavert', 'uempmed')
2
3pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))
4
5train[,cols] = predict(pre_proc_val, train[,cols])
6test[,cols] = predict(pre_proc_val, test[,cols])
7
8summary(train)
{r}

Output:

1      pce               pop             psavert            uempmed           unemploy     
2 Min.   :-1.2135   Min.   :-1.6049   Min.   :-1.89481   Min.   :-1.1200   Min.   : 2.700  
3 1st Qu.:-0.8856   1st Qu.:-0.8606   1st Qu.:-0.76518   1st Qu.:-0.6198   1st Qu.: 6.500  
4 Median :-0.2501   Median :-0.1416   Median :-0.05513   Median :-0.2447   Median : 7.500  
5 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 7.791  
6 3rd Qu.: 0.7797   3rd Qu.: 0.9060   3rd Qu.: 0.81629   3rd Qu.: 0.1055   3rd Qu.: 8.600  
7 Max.   : 2.1244   Max.   : 1.8047   Max.   : 2.91417   Max.   : 4.1571   Max.   :15.300  

Decision Tree

Decision trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by splitting the data into two or more homogeneous sets based on the most significant splitter among the independent variables. The best differentiation is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the gini index, whereas for a regression tree, the default metric is the mean squared error.

The basic workflow of decision trees is as follows:

The modeling process starts at the Root Node, which represents the entire data. This is divided into two or more sub-nodes, also referred to as splitting. This process of splitting continues until the splitting criterion is met, and the sub-node where splitting happens is called a decision node. Once the splitting criterion is met, the nodes do not split any further; such nodes are called a Leaf or Terminal node. We can also remove sub-nodes through a process called Pruning.

Training the Model

We will now create a regression tree model using the rpart library. The first step is to instantiate the algorithm, which is done in the first line of code below. The argument method = "anova" specifies that a regression model is to be built. For classification problem, the argument will be class. The control argument provides a list of options that control the way the decision tree algorithm is trained.

In the code below, the control=rpart.control(minsplit=20, cp=0.001) argument specifies that the minimum number of observations in a node should be 20 before attempting a split. It also specifies that a split happens only if it decreases the error metric by a factor of 0.001, referred to as the cost complexity factor. The second line prints the summary of the model. Note that it is possible to understand all the possible arguments with the ?rpart() command.

1tree_model = rpart(unemploy ~ ., data=train, method = "anova", control=rpart.control(minsplit=20, cp=0.001))
2
3summary(tree_model)
{r}

Output:

1#Truncated Output 
2
3Call:
4rpart(formula = unemploy ~ ., data = train, method = "anova", 
5    control = rpart.control(minsplit = 20, cp = 0.001))
6  n= 401 
7
8               CP nsplit     rel error        xerror          xstd
91  0.508609424691      0 1.00000000000 1.00746991170 0.09222517909
102  0.240199126781      1 0.49139057531 0.53432047449 0.04138079775
113  0.062388054346      2 0.25119144853 0.29990084346 0.02550320375
124  0.027180308031      3 0.18880339418 0.21834245878 0.02083102075
135  0.024078055174      4 0.16162308615 0.20705841419 0.02021514091
146  0.019456798303      5 0.13754503098 0.17518743853 0.01711287778
157  0.015928262852      6 0.11808823267 0.16563982394 0.01633076066
168  0.014943330509      7 0.10215996982 0.15505479787 0.01516085859
179  0.012183802564      8 0.08721663931 0.13629795408 0.01491501765
1810 0.007010774289      9 0.07503283675 0.12235179176 0.01517832910
1911 0.006864114984     11 0.06101128817 0.11620014096 0.01353111243
2012 0.004679462622     12 0.05414717319 0.11258856185 0.01342153785
2113 0.002606235055     13 0.04946771057 0.09356044395 0.01191539603
2214 0.002392463727     14 0.04686147551 0.08924901899 0.01192405697
2315 0.002373022016     15 0.04446901178 0.08914954140 0.01192548795
2416 0.001075173879     16 0.04209598977 0.08103175712 0.01180204685
2517 0.001054136373     18 0.03994564201 0.07929979931 0.01179186081
2618 0.001039626829     19 0.03889150564 0.07916127996 0.01179072614
2719 0.001024169682     21 0.03681225198 0.07900442265 0.01178836902
2820 0.001000000000     22 0.03578808230 0.07900442265 0.01178836902
29
30Variable importance
31    pop uempmed     pce psavert 
32     33      32      28       7 

The important variables, according to the regression tree model, are pop, uempmed and pce.

Evaluating the Model

Let's evaluate the model performance on the train and test datasets. The first step is to create a function for calculating the evaluation metrics, R-squared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.

1# Step 1 - create the evaluation metrics function
2
3eval_results <- function(true, predicted, df) {
4  SSE <- sum((predicted - true)^2)
5  SST <- sum((true - mean(true))^2)
6  R_square <- 1 - SSE / SST
7  RMSE = sqrt(SSE/nrow(df))
8  
9# Model performance metrics
10  data.frame(
11    RMSE = RMSE,
12    Rsquare = R_square
13  )
14  
15}
16
17# Step 2 - predicting and evaluating the model on train data
18
19predictions_train_cart = predict(tree_model, data = train)
20eval_results(train$unemploy, predictions_train_cart, train)
21
22# Step 3 - predicting and evaluating the model on test data
23
24predictions_test_cart = predict(tree_model, newdata = test)
25eval_results(test$unemploy, predictions_test_cart, test)
{r}

Output:

1# Training set results
2
3        RMSE   Rsquare
41 0.4644746 0.9642119"
5
6# Test set results
7
8        RMSE   Rsquare
91 0.6106443 0.9591839

The above output shows that the RMSE and R-squared values for the decision tree model on the training data are 0.47 million and 96.4 percent, respectively. For the test data, the results for these metrics are 0.61 million and 96 percent, respectively.

Random Forest (or Bootstrap Aggregation)

Decision Trees are useful, but the problem is that they often tend to overfit the training data, leading to high variances in the test data. Random forest algorithms overcome this shortcoming by reducing the variance of the decision trees. They are called a forest because they are the collection, or ensemble, of several decision trees. One major difference between a decision tree and a random forest model is how the splits happen. In random forest, instead of trying splits on all the features, a sample of features is selected for each split, thereby reducing the variance of the model.

Training the Model

In R, the randomForest package is used to train the random forest algorithm. The first line of code below instantiates the random forest regression model, and the second line prints the summary of the model.

1rf_model = randomForest(unemploy ~ ., data=train)
2summary(rf_model)
3 
{r}

Output:

1                Length Class  Mode     
2call              3    -none- call     
3type              1    -none- character
4predicted       401    -none- numeric  
5mse             500    -none- numeric  
6rsq             500    -none- numeric  
7oob.times       401    -none- numeric  
8importance        4    -none- numeric  
9importanceSD      0    -none- NULL     
10localImportance   0    -none- NULL     
11proximity         0    -none- NULL     
12ntree             1    -none- numeric  
13mtry              1    -none- numeric  
14forest           11    -none- list     
15coefs             0    -none- NULL     
16y               401    -none- numeric  
17test              0    -none- NULL     
18inbag             0    -none- NULL     
19terms             3    terms  call   

We can look at the important variables using the code below. The output shows that the variables pop, pce, and uempmed are the significant predictors.

1importance(rf_model)
{r}

Output:

1        IncNodePurity
2pce       695.0615302
3pop       759.6686618
4psavert   266.2619092
5uempmed   676.7521072

Evaluating the Model

Let's evaluate the model performance on the train and test datasets, using the lines of code below.

1# predicting and evaluating the model on train data
2predictions_train_rf = predict(rf_model, data = train)
3eval_results(train$unemploy, predictions_train_rf, train)
4
5# predicting and evaluating the model on test data
6predictions_test_rf = predict(rf_model, newdata = test)
7eval_results(test$unemploy, predictions_test_rf, test)
{r}

Output:

1# Output on Training Data
2
3        RMSE   Rsquare
41  0.3469372 0.9800328
5
6
7# Output on Test Data
8
9       RMSE   Rsquare
101  0.511681 0.9713415

The above output shows that the RMSE and R-squared values on the training data are 0.35 million and 98 percent, respectively. For the test data, the results for these metrics are 0.51 million and 97.1 percent, respectively. The performance of the random forest model is superior to the decision tree model built earlier.

Conclusion

In this guide, you learned about tree-based non-linear regression models, including Decision tree and random forest. We observed that the random forest model outperforms the decision tree model. The results from these two algorithms are summarized below:

  1. Decision Tree Model: Test set RMSE of 0.6 million and R-square of 96 percent.

  2. Random Forest Model: Test set RMSE of 0.5 million and R-square of 97.1 percent.

To learn more about data science with R, please refer to the following guides: