Introduction

*Regression* is a supervised machine learning technique that predicts a continuous outcome. There are two types of regression algorithms: *linear* and *nonlinear*. While linear models are useful, they rely on the assumption of a linear relationship between the independent and dependent variables. This assumption is difficult to meet in real business scenarios. This is where non-linear regression algorithms come into picture that can capture non-linearity within the data.

In this guide, you'll learn how to implement non-linear regression trees using R.

Unemployment is an important socio-economic and political concern for a country, and managing it is a major task for any government. In this guide, we will build non-linear regression algorithms for predicting unemployment within an economy.

The data comes from the US economic time series data available from http://research.stlouisfed.org/fred2. The data contains 574 rows and 5 variables, as described below:

`psavert`

: personal savings rate`pce`

: personal consumption expenditures, in billions of dollars`uempmed`

: median duration of unemployment, in weeks`pop`

: total population, in millions`unemploy`

: number unemployed, in millions. This is the dependent variable.

We will evaluate the performance of the model using two metrics: R-squared value and Root Mean Squared Error (RMSE). Ideally, lower RMSE and higher R-squared values are indicative of a good model.

Let's start by loading the required libraries and the data.

```
1library(plyr)
2library(readr)
3library(dplyr)
4library(caret)
5library(ggplot2)
6library(rpart)
7library(rpart.plot)
8library(randomForest)
9
10dat <- read_csv("reg_data.csv")
11glimpse(dat)
```

{r}

Output:

```
1Observations: 574
2Variables: 5
3$ pce <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9...
4$ pop <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1...
5$ psavert <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 1...
6$ uempmed <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4...
7$ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2...
```

The output shows that all the variables in the dataset are numerical variables (labeled as `dbl`

). Let's look at the summary statistics of the data with the `summary()`

function.

`1summary(dat)`

{r}

Output:

```
1 pce pop psavert uempmed
2 Min. : 507.400 Min. :198.7000 Min. : 1.900000 Min. : 4.000000
3 1st Qu.: 1582.225 1st Qu.:224.8750 1st Qu.: 5.500000 1st Qu.: 6.000000
4 Median : 3953.550 Median :253.0500 Median : 7.700000 Median : 7.500000
5 Mean : 4843.510 Mean :257.1913 Mean : 7.936585 Mean : 8.610105
6 3rd Qu.: 7667.325 3rd Qu.:290.2500 3rd Qu.:10.500000 3rd Qu.: 9.100000
7 Max. :12161.500 Max. :320.9000 Max. :17.000000 Max. :25.200000
8 unemploy
9 Min. : 2.700000
10 1st Qu.: 6.300000
11 Median : 7.500000
12 Mean : 7.770383
13 3rd Qu.: 8.700000
14 Max. :15.400000
```

The output shows that there are no missing values in the data. It also indicates that the unit and magnitude of the variables is different, therefore scaling will be required.

We will build our model on the training set and evaluate its performance on the test set. This is called the *holdout-validation* approach for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test sets, while the last two lines print the dimensions of the training and test sets. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.

```
1set.seed(100)
2index = sample(1:nrow(dat), 0.7*nrow(dat))
3
4train = dat[index,] # Create the training data
5test = dat[-index,] # Create the test data
6
7dim(train)
8dim(test)
9
```

{r}

Output:

```
1[1] 401 5
2
3[1] 173 5
4
```

The numeric features need to be scaled, otherwise the modeling process may be adversely affected.

The first line of code below creates a list that contains the names of independent numeric variables. The second line uses the `preProcess`

function from the `caret`

package to complete the scaling task. The pre-processing object is fit only to the training data, while the scaling is applied on both the train and test sets. This is done in the third and fourth lines of code below. The fifth line prints the summary of the scaled train dataset. The output shows that now all the numeric features have a mean value of zero except the target variable, `unemploy`

, which was not scaled.

```
1cols = c('pce', 'pop', 'psavert', 'uempmed')
2
3pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))
4
5train[,cols] = predict(pre_proc_val, train[,cols])
6test[,cols] = predict(pre_proc_val, test[,cols])
7
8summary(train)
```

{r}

Output:

```
1 pce pop psavert uempmed unemploy
2 Min. :-1.2135 Min. :-1.6049 Min. :-1.89481 Min. :-1.1200 Min. : 2.700
3 1st Qu.:-0.8856 1st Qu.:-0.8606 1st Qu.:-0.76518 1st Qu.:-0.6198 1st Qu.: 6.500
4 Median :-0.2501 Median :-0.1416 Median :-0.05513 Median :-0.2447 Median : 7.500
5 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 7.791
6 3rd Qu.: 0.7797 3rd Qu.: 0.9060 3rd Qu.: 0.81629 3rd Qu.: 0.1055 3rd Qu.: 8.600
7 Max. : 2.1244 Max. : 1.8047 Max. : 2.91417 Max. : 4.1571 Max. :15.300
```

*Decision trees*, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by splitting the data into two or more homogeneous sets based on the most significant splitter among the independent variables. The best differentiation is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the gini index, whereas for a regression tree, the default metric is the mean squared error.

The basic workflow of decision trees is as follows:

The modeling process starts at the `Root Node`

, which represents the entire data. This is divided into two or more sub-nodes, also referred to as splitting. This process of splitting continues until the splitting criterion is met, and the sub-node where splitting happens is called a decision node. Once the splitting criterion is met, the nodes do not split any further; such nodes are called a `Leaf or Terminal node`

. We can also remove sub-nodes through a process called `Pruning`

.

We will now create a regression tree model using the `rpart`

library. The first step is to instantiate the algorithm, which is done in the first line of code below. The argument `method = "anova"`

specifies that a regression model is to be built. For classification problem, the argument will be `class`

. The `control`

argument provides a list of options that control the way the decision tree algorithm is trained.

In the code below, the `control=rpart.control(minsplit=20, cp=0.001)`

argument specifies that the minimum number of observations in a node should be 20 before attempting a split. It also specifies that a split happens only if it decreases the error metric by a factor of 0.001, referred to as the *cost complexity factor*. The second line prints the summary of the model. Note that it is possible to understand all the possible arguments with the `?rpart()`

command.

```
1tree_model = rpart(unemploy ~ ., data=train, method = "anova", control=rpart.control(minsplit=20, cp=0.001))
2
3summary(tree_model)
```

{r}

Output:

```
1#Truncated Output
2
3Call:
4rpart(formula = unemploy ~ ., data = train, method = "anova",
5 control = rpart.control(minsplit = 20, cp = 0.001))
6 n= 401
7
8 CP nsplit rel error xerror xstd
91 0.508609424691 0 1.00000000000 1.00746991170 0.09222517909
102 0.240199126781 1 0.49139057531 0.53432047449 0.04138079775
113 0.062388054346 2 0.25119144853 0.29990084346 0.02550320375
124 0.027180308031 3 0.18880339418 0.21834245878 0.02083102075
135 0.024078055174 4 0.16162308615 0.20705841419 0.02021514091
146 0.019456798303 5 0.13754503098 0.17518743853 0.01711287778
157 0.015928262852 6 0.11808823267 0.16563982394 0.01633076066
168 0.014943330509 7 0.10215996982 0.15505479787 0.01516085859
179 0.012183802564 8 0.08721663931 0.13629795408 0.01491501765
1810 0.007010774289 9 0.07503283675 0.12235179176 0.01517832910
1911 0.006864114984 11 0.06101128817 0.11620014096 0.01353111243
2012 0.004679462622 12 0.05414717319 0.11258856185 0.01342153785
2113 0.002606235055 13 0.04946771057 0.09356044395 0.01191539603
2214 0.002392463727 14 0.04686147551 0.08924901899 0.01192405697
2315 0.002373022016 15 0.04446901178 0.08914954140 0.01192548795
2416 0.001075173879 16 0.04209598977 0.08103175712 0.01180204685
2517 0.001054136373 18 0.03994564201 0.07929979931 0.01179186081
2618 0.001039626829 19 0.03889150564 0.07916127996 0.01179072614
2719 0.001024169682 21 0.03681225198 0.07900442265 0.01178836902
2820 0.001000000000 22 0.03578808230 0.07900442265 0.01178836902
29
30Variable importance
31 pop uempmed pce psavert
32 33 32 28 7
```

The important variables, according to the regression tree model, are `pop`

, `uempmed`

and `pce`

.

Let's evaluate the model performance on the train and test datasets. The first step is to create a function for calculating the evaluation metrics, R-squared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.

```
1# Step 1 - create the evaluation metrics function
2
3eval_results <- function(true, predicted, df) {
4 SSE <- sum((predicted - true)^2)
5 SST <- sum((true - mean(true))^2)
6 R_square <- 1 - SSE / SST
7 RMSE = sqrt(SSE/nrow(df))
8
9# Model performance metrics
10 data.frame(
11 RMSE = RMSE,
12 Rsquare = R_square
13 )
14
15}
16
17# Step 2 - predicting and evaluating the model on train data
18
19predictions_train_cart = predict(tree_model, data = train)
20eval_results(train$unemploy, predictions_train_cart, train)
21
22# Step 3 - predicting and evaluating the model on test data
23
24predictions_test_cart = predict(tree_model, newdata = test)
25eval_results(test$unemploy, predictions_test_cart, test)
```

{r}

Output:

```
1# Training set results
2
3 RMSE Rsquare
41 0.4644746 0.9642119"
5
6# Test set results
7
8 RMSE Rsquare
91 0.6106443 0.9591839
```

The above output shows that the RMSE and R-squared values for the decision tree model on the training data are 0.47 million and 96.4 percent, respectively. For the test data, the results for these metrics are 0.61 million and 96 percent, respectively.

Decision Trees are useful, but the problem is that they often tend to overfit the training data, leading to high variances in the test data. *Random forest* algorithms overcome this shortcoming by reducing the variance of the decision trees. They are called a *forest* because they are the collection, or ensemble, of several decision trees. One major difference between a decision tree and a random forest model is how the splits happen. In random forest, instead of trying splits on all the features, a sample of features is selected for each split, thereby reducing the variance of the model.

In R, the `randomForest`

package is used to train the random forest algorithm. The first line of code below instantiates the random forest regression model, and the second line prints the summary of the model.

```
1rf_model = randomForest(unemploy ~ ., data=train)
2summary(rf_model)
3
```

{r}

Output:

```
1 Length Class Mode
2call 3 -none- call
3type 1 -none- character
4predicted 401 -none- numeric
5mse 500 -none- numeric
6rsq 500 -none- numeric
7oob.times 401 -none- numeric
8importance 4 -none- numeric
9importanceSD 0 -none- NULL
10localImportance 0 -none- NULL
11proximity 0 -none- NULL
12ntree 1 -none- numeric
13mtry 1 -none- numeric
14forest 11 -none- list
15coefs 0 -none- NULL
16y 401 -none- numeric
17test 0 -none- NULL
18inbag 0 -none- NULL
19terms 3 terms call
```

We can look at the important variables using the code below. The output shows that the variables `pop`

, `pce`

, and `uempmed`

are the significant predictors.

`1importance(rf_model)`

{r}

Output:

```
1 IncNodePurity
2pce 695.0615302
3pop 759.6686618
4psavert 266.2619092
5uempmed 676.7521072
```

Let's evaluate the model performance on the train and test datasets, using the lines of code below.

```
1# predicting and evaluating the model on train data
2predictions_train_rf = predict(rf_model, data = train)
3eval_results(train$unemploy, predictions_train_rf, train)
4
5# predicting and evaluating the model on test data
6predictions_test_rf = predict(rf_model, newdata = test)
7eval_results(test$unemploy, predictions_test_rf, test)
```

{r}

Output:

```
1# Output on Training Data
2
3 RMSE Rsquare
41 0.3469372 0.9800328
5
6
7# Output on Test Data
8
9 RMSE Rsquare
101 0.511681 0.9713415
```

The above output shows that the RMSE and R-squared values on the training data are 0.35 million and 98 percent, respectively. For the test data, the results for these metrics are 0.51 million and 97.1 percent, respectively. The performance of the random forest model is superior to the decision tree model built earlier.

In this guide, you learned about tree-based non-linear regression models, including Decision tree and random forest. We observed that the random forest model outperforms the decision tree model. The results from these two algorithms are summarized below:

Decision Tree Model: Test set RMSE of 0.6 million and R-square of 96 percent.

- Random Forest Model: Test set RMSE of 0.5 million and R-square of 97.1 percent.

To learn more about data science with R, please refer to the following guides: