Introduction

1

*Regression* is a supervised machine learning technique that predicts a continuous outcome. There are two types of regression algorithms: *linear* and *nonlinear*. While linear models are useful, they rely on the assumption of a linear relationship between the independent and dependent variables. This assumption is difficult to meet in real business scenarios. This is where non-linear regression algorithms come into picture that can capture non-linearity within the data.

In this guide, you'll learn how to implement non-linear regression trees using R.

Unemployment is an important socio-economic and political concern for a country, and managing it is a major task for any government. In this guide, we will build non-linear regression algorithms for predicting unemployment within an economy.

The data comes from the US economic time series data available from http://research.stlouisfed.org/fred2. The data contains 574 rows and 5 variables, as described below:

`psavert`

: personal savings rate`pce`

: personal consumption expenditures, in billions of dollars`uempmed`

: median duration of unemployment, in weeks`pop`

: total population, in millions`unemploy`

: number unemployed, in millions. This is the dependent variable.

We will evaluate the performance of the model using two metrics: R-squared value and Root Mean Squared Error (RMSE). Ideally, lower RMSE and higher R-squared values are indicative of a good model.

Let's start by loading the required libraries and the data.

`1 2 3 4 5 6 7 8 9 10 11`

`library(plyr) library(readr) library(dplyr) library(caret) library(ggplot2) library(rpart) library(rpart.plot) library(randomForest) dat <- read_csv("reg_data.csv") glimpse(dat)`

{r}

Output:

`1 2 3 4 5 6 7`

`Observations: 574 Variables: 5 $ pce <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9... $ pop <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1... $ psavert <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 1... $ uempmed <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4... $ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2...`

The output shows that all the variables in the dataset are numerical variables (labeled as `dbl`

). Let's look at the summary statistics of the data with the `summary()`

function.

`1`

`summary(dat)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12 13 14`

`pce pop psavert uempmed Min. : 507.400 Min. :198.7000 Min. : 1.900000 Min. : 4.000000 1st Qu.: 1582.225 1st Qu.:224.8750 1st Qu.: 5.500000 1st Qu.: 6.000000 Median : 3953.550 Median :253.0500 Median : 7.700000 Median : 7.500000 Mean : 4843.510 Mean :257.1913 Mean : 7.936585 Mean : 8.610105 3rd Qu.: 7667.325 3rd Qu.:290.2500 3rd Qu.:10.500000 3rd Qu.: 9.100000 Max. :12161.500 Max. :320.9000 Max. :17.000000 Max. :25.200000 unemploy Min. : 2.700000 1st Qu.: 6.300000 Median : 7.500000 Mean : 7.770383 3rd Qu.: 8.700000 Max. :15.400000`

The output shows that there are no missing values in the data. It also indicates that the unit and magnitude of the variables is different, therefore scaling will be required.

We will build our model on the training set and evaluate its performance on the test set. This is called the *holdout-validation* approach for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test sets, while the last two lines print the dimensions of the training and test sets. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.

`1 2 3 4 5 6 7 8 9`

`set.seed(100) index = sample(1:nrow(dat), 0.7*nrow(dat)) train = dat[index,] # Create the training data test = dat[-index,] # Create the test data dim(train) dim(test)`

{r}

Output:

`1 2 3 4`

`[1] 401 5 [1] 173 5`

The numeric features need to be scaled, otherwise the modeling process may be adversely affected.

The first line of code below creates a list that contains the names of independent numeric variables. The second line uses the `preProcess`

function from the `caret`

package to complete the scaling task. The pre-processing object is fit only to the training data, while the scaling is applied on both the train and test sets. This is done in the third and fourth lines of code below. The fifth line prints the summary of the scaled train dataset. The output shows that now all the numeric features have a mean value of zero except the target variable, `unemploy`

, which was not scaled.

`1 2 3 4 5 6 7 8`

`cols = c('pce', 'pop', 'psavert', 'uempmed') pre_proc_val <- preProcess(train[,cols], method = c("center", "scale")) train[,cols] = predict(pre_proc_val, train[,cols]) test[,cols] = predict(pre_proc_val, test[,cols]) summary(train)`

{r}

Output:

`1 2 3 4 5 6 7`

`pce pop psavert uempmed unemploy Min. :-1.2135 Min. :-1.6049 Min. :-1.89481 Min. :-1.1200 Min. : 2.700 1st Qu.:-0.8856 1st Qu.:-0.8606 1st Qu.:-0.76518 1st Qu.:-0.6198 1st Qu.: 6.500 Median :-0.2501 Median :-0.1416 Median :-0.05513 Median :-0.2447 Median : 7.500 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 7.791 3rd Qu.: 0.7797 3rd Qu.: 0.9060 3rd Qu.: 0.81629 3rd Qu.: 0.1055 3rd Qu.: 8.600 Max. : 2.1244 Max. : 1.8047 Max. : 2.91417 Max. : 4.1571 Max. :15.300`

*Decision trees*, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by splitting the data into two or more homogeneous sets based on the most significant splitter among the independent variables. The best differentiation is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the gini index, whereas for a regression tree, the default metric is the mean squared error.

The basic workflow of decision trees is as follows:

The modeling process starts at the `Root Node`

, which represents the entire data. This is divided into two or more sub-nodes, also referred to as splitting. This process of splitting continues until the splitting criterion is met, and the sub-node where splitting happens is called a decision node. Once the splitting criterion is met, the nodes do not split any further; such nodes are called a `Leaf or Terminal node`

. We can also remove sub-nodes through a process called `Pruning`

.

We will now create a regression tree model using the `rpart`

library. The first step is to instantiate the algorithm, which is done in the first line of code below. The argument `method = "anova"`

specifies that a regression model is to be built. For classification problem, the argument will be `class`

. The `control`

argument provides a list of options that control the way the decision tree algorithm is trained.

In the code below, the `control=rpart.control(minsplit=20, cp=0.001)`

argument specifies that the minimum number of observations in a node should be 20 before attempting a split. It also specifies that a split happens only if it decreases the error metric by a factor of 0.001, referred to as the *cost complexity factor*. The second line prints the summary of the model. Note that it is possible to understand all the possible arguments with the `?rpart()`

command.

`1 2 3`

`tree_model = rpart(unemploy ~ ., data=train, method = "anova", control=rpart.control(minsplit=20, cp=0.001)) summary(tree_model)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32`

`#Truncated Output Call: rpart(formula = unemploy ~ ., data = train, method = "anova", control = rpart.control(minsplit = 20, cp = 0.001)) n= 401 CP nsplit rel error xerror xstd 1 0.508609424691 0 1.00000000000 1.00746991170 0.09222517909 2 0.240199126781 1 0.49139057531 0.53432047449 0.04138079775 3 0.062388054346 2 0.25119144853 0.29990084346 0.02550320375 4 0.027180308031 3 0.18880339418 0.21834245878 0.02083102075 5 0.024078055174 4 0.16162308615 0.20705841419 0.02021514091 6 0.019456798303 5 0.13754503098 0.17518743853 0.01711287778 7 0.015928262852 6 0.11808823267 0.16563982394 0.01633076066 8 0.014943330509 7 0.10215996982 0.15505479787 0.01516085859 9 0.012183802564 8 0.08721663931 0.13629795408 0.01491501765 10 0.007010774289 9 0.07503283675 0.12235179176 0.01517832910 11 0.006864114984 11 0.06101128817 0.11620014096 0.01353111243 12 0.004679462622 12 0.05414717319 0.11258856185 0.01342153785 13 0.002606235055 13 0.04946771057 0.09356044395 0.01191539603 14 0.002392463727 14 0.04686147551 0.08924901899 0.01192405697 15 0.002373022016 15 0.04446901178 0.08914954140 0.01192548795 16 0.001075173879 16 0.04209598977 0.08103175712 0.01180204685 17 0.001054136373 18 0.03994564201 0.07929979931 0.01179186081 18 0.001039626829 19 0.03889150564 0.07916127996 0.01179072614 19 0.001024169682 21 0.03681225198 0.07900442265 0.01178836902 20 0.001000000000 22 0.03578808230 0.07900442265 0.01178836902 Variable importance pop uempmed pce psavert 33 32 28 7`

The important variables, according to the regression tree model, are `pop`

, `uempmed`

and `pce`

.

Let's evaluate the model performance on the train and test datasets. The first step is to create a function for calculating the evaluation metrics, R-squared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.

`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25`

`# Step 1 - create the evaluation metrics function eval_results <- function(true, predicted, df) { SSE <- sum((predicted - true)^2) SST <- sum((true - mean(true))^2) R_square <- 1 - SSE / SST RMSE = sqrt(SSE/nrow(df)) # Model performance metrics data.frame( RMSE = RMSE, Rsquare = R_square ) } # Step 2 - predicting and evaluating the model on train data predictions_train_cart = predict(tree_model, data = train) eval_results(train$unemploy, predictions_train_cart, train) # Step 3 - predicting and evaluating the model on test data predictions_test_cart = predict(tree_model, newdata = test) eval_results(test$unemploy, predictions_test_cart, test)`

{r}

Output:

`1 2 3 4 5 6 7 8 9`

`# Training set results RMSE Rsquare 1 0.4644746 0.9642119" # Test set results RMSE Rsquare 1 0.6106443 0.9591839`

The above output shows that the RMSE and R-squared values for the decision tree model on the training data are 0.47 million and 96.4 percent, respectively. For the test data, the results for these metrics are 0.61 million and 96 percent, respectively.

Decision Trees are useful, but the problem is that they often tend to overfit the training data, leading to high variances in the test data. *Random forest* algorithms overcome this shortcoming by reducing the variance of the decision trees. They are called a *forest* because they are the collection, or ensemble, of several decision trees. One major difference between a decision tree and a random forest model is how the splits happen. In random forest, instead of trying splits on all the features, a sample of features is selected for each split, thereby reducing the variance of the model.

In R, the `randomForest`

package is used to train the random forest algorithm. The first line of code below instantiates the random forest regression model, and the second line prints the summary of the model.

`1 2 3`

`rf_model = randomForest(unemploy ~ ., data=train) summary(rf_model)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19`

`Length Class Mode call 3 -none- call type 1 -none- character predicted 401 -none- numeric mse 500 -none- numeric rsq 500 -none- numeric oob.times 401 -none- numeric importance 4 -none- numeric importanceSD 0 -none- NULL localImportance 0 -none- NULL proximity 0 -none- NULL ntree 1 -none- numeric mtry 1 -none- numeric forest 11 -none- list coefs 0 -none- NULL y 401 -none- numeric test 0 -none- NULL inbag 0 -none- NULL terms 3 terms call`

We can look at the important variables using the code below. The output shows that the variables `pop`

, `pce`

, and `uempmed`

are the significant predictors.

`1`

`importance(rf_model)`

{r}

Output:

`1 2 3 4 5`

`IncNodePurity pce 695.0615302 pop 759.6686618 psavert 266.2619092 uempmed 676.7521072`

Let's evaluate the model performance on the train and test datasets, using the lines of code below.

`1 2 3 4 5 6 7`

`# predicting and evaluating the model on train data predictions_train_rf = predict(rf_model, data = train) eval_results(train$unemploy, predictions_train_rf, train) # predicting and evaluating the model on test data predictions_test_rf = predict(rf_model, newdata = test) eval_results(test$unemploy, predictions_test_rf, test)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10`

`# Output on Training Data RMSE Rsquare 1 0.3469372 0.9800328 # Output on Test Data RMSE Rsquare 1 0.511681 0.9713415`

The above output shows that the RMSE and R-squared values on the training data are 0.35 million and 98 percent, respectively. For the test data, the results for these metrics are 0.51 million and 97.1 percent, respectively. The performance of the random forest model is superior to the decision tree model built earlier.

In this guide, you learned about tree-based non-linear regression models, including Decision tree and random forest. We observed that the random forest model outperforms the decision tree model. The results from these two algorithms are summarized below:

Decision Tree Model: Test set RMSE of 0.6 million and R-square of 96 percent.

Random Forest Model: Test set RMSE of 0.5 million and R-square of 97.1 percent.

To learn more about data science with R, please refer to the following guides:

1