H2O is a fast, scalable, open-source machine learning and artificial intelligence platform that can be used to build machine learning models on large data sets. In addition, H2O integrates with widely used tools such as R, Python, and Spark.
In this guide, you will learn the basics of building machine learning models using H2O and R.
Unemployment is a big socio-economic and political concern for any country, and managing it is a chief task for any government. In this guide, you will build regression algorithms for predicting unemployment within an economy.
The data comes from the US economic time series available from http://research.stlouisfed.org/fred2. The file used in this guide contains 564 rows and 5 variables, as described below:
psavert
: personal savings rate
pce
: personal consumption expenditures, in billions of dollars
uempmed
: median duration of unemployment, in weeks
pop
: total population, in thousands
unemploy
: number of unemployed people, in thousands. This is the dependent variable.
You will evaluate the performance of the model using two metrics: R-squared and Root Mean Squared Error (RMSE). A lower RMSE and a higher R-squared indicate a better model.
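For reference, the two metrics can be written as below. The notation is introduced here for illustration only: y_i is an observed value, \hat{y}_i the corresponding prediction, \bar{y} the mean of the observed values, and n the number of observations. These are the usual textbook definitions and match the evaluation function defined later in this guide.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```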
Start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)

dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 564
Variables: 5
$ pce <dbl> 531.5, 534.2, 544.9, 544.6, 550.4, 556.8, 563.8, 567.6, 568.…
$ pop <dbl> 199808, 199920, 200056, 200208, 200361, 200536, 200706, 2008…
$ psavert <dbl> 11.7, 12.2, 11.6, 12.2, 12.0, 11.6, 10.6, 10.4, 10.4, 10.6, …
$ uempmed <dbl> 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4.5, 4.2, 4.6, 4.8, 4.4, 4.4, …
$ unemploy <dbl> 2878, 3001, 2877, 2709, 2740, 2938, 2883, 2768, 2686, 2689, …
The output shows that all the variables in the dataset are numerical variables (labeled as 'dbl').
You will build the model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partition. The next two lines of code create the training and test sets, while the last two lines print the dimensions of the training and test sets. The training set contains 70 percent of the data while the test set contains the remaining 30 percent.
set.seed(100)
index = sample(1:nrow(dat), 0.7*nrow(dat))

train = dat[index,]
test = dat[-index,]

dim(train)
dim(test)
Output:
394 5

170 5
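The split sizes can be checked with a small base R sketch. It uses a synthetic one-column data frame (`toy`, a hypothetical stand-in with the same 564 rows as the data above) rather than the real file. Since 0.7 * 564 is 394.8 and the output above reports 394 training rows, the fractional part is evidently dropped; the sketch makes that explicit with `floor()`.

```r
# Synthetic stand-in for the guide's data: 564 rows, one dummy column.
toy <- data.frame(id = seq_len(564))

set.seed(100)
n_train <- floor(0.7 * nrow(toy))               # 394.8 rounded down to a whole number of rows
index   <- sample(seq_len(nrow(toy)), n_train)  # row numbers chosen for training

toy_train <- toy[index, , drop = FALSE]
toy_test  <- toy[-index, , drop = FALSE]

nrow(toy_train)                                 # number of training rows
nrow(toy_test)                                  # number of test rows
length(intersect(toy_train$id, toy_test$id))    # rows appearing in both sets
```

Because the test set is built with negative indexing, every row lands in exactly one of the two sets.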
You have created the data partition and will build the predictive model using H2O and R. However, before building machine learning models, you must connect h2o with R. The first step is to install the h2o package, which is done with the code below.
install.packages("h2o")
library(h2o)
Once you have installed the library, launch the cluster and initialize it with the code below.
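# Start (or connect to) a local H2O cluster using all available cores
localH2O <- h2o.init(nthreads = -1)
# Calling h2o.init() again connects to the running cluster and prints its status
h2o.init()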
Output:
Connection successful!

R is connected to the H2O cluster:
    H2O cluster uptime: 24 minutes 26 seconds
    H2O cluster timezone: Etc/UTC
    H2O data parsing timezone: UTC
    H2O cluster version: 3.30.0.1
    H2O cluster version age: 2 months and 7 days
    H2O cluster name: H2O_started_from_R_nbuser_xnt904
    H2O cluster total nodes: 1
    H2O cluster total memory: 0.80 GB
    H2O cluster total cores: 2
    H2O cluster allowed cores: 2
    H2O cluster healthy: TRUE
    H2O Connection ip: localhost
    H2O Connection port: 54321
    H2O Connection proxy: NA
    H2O Internal Security: FALSE
    H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
    R Version: R version 3.5.3 (2019-03-11)
The output above shows that the connection between h2o and R is successful. This means you are ready to build the machine learning models. To begin, transfer the data from R to an h2o instance. This is done with the code below.
train.h2o <- as.h2o(train)

test.h2o <- as.h2o(test)
The next step is to identify variables to be used in modeling. This is done with the code below.
# dependent variable (column 5, unemploy)
y.dep <- 5

# independent variables (columns 1 to 4)
x.indep <- c(1:4)
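Positional indices depend on the column order of the data. As a hedged alternative, the same selection can be expressed by name. The sketch below uses only base R and the column order shown in the glimpse() output; the variable names `cols`, `y_name`, `x_names`, `y_pos`, and `x_pos` are illustrative. The H2O modeling functions are documented as accepting either names or indices for x and y, but confirm that against your installed version before relying on it.

```r
# Column order as shown in the glimpse() output earlier in this guide.
cols <- c("pce", "pop", "psavert", "uempmed", "unemploy")

y_name  <- "unemploy"               # dependent variable, by name
x_names <- setdiff(cols, y_name)    # every other column is a predictor

# Positions implied by the names, for comparison with the indices used above.
y_pos <- match(y_name, cols)
x_pos <- match(x_names, cols)

y_pos
x_pos
```

Selecting by name keeps the model definition valid even if columns are later reordered or new ones are added.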
You are now ready to build regression models using R and H2O.
The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The residuals (errors) are assumed to follow a Gaussian distribution. Another assumption is that the predictors are not highly correlated with each other; when they are, the problem is called multicollinearity.
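One simple way to look for multicollinearity is to inspect the pairwise correlations between predictors. The sketch below is a minimal illustration on synthetic data, not on the guide's dataset: `x2` is deliberately built as a noisy copy of `x1`, and the 0.9 cutoff is an arbitrary choice made for this example.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # built to be almost a copy of x1
x3 <- rnorm(n)                  # generated independently of the other two
predictors <- data.frame(x1, x2, x3)

cor_mat <- cor(predictors)
round(cor_mat, 2)

# Flag predictor pairs whose absolute correlation exceeds a chosen cutoff.
# The 0.9 cutoff is an arbitrary illustration, not a rule from this guide.
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, "row"]],
           var2 = colnames(cor_mat)[high[, "col"]])
```

Applying the same `cor()` call to the predictor columns of the training set would show whether any of the four inputs move closely together.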
A multiple linear regression model in H2O can be built using the h2o.glm() function, which can be used for all types of regression algorithms such as linear, lasso, ridge, logistic, etc. The first line of code below builds the multiple linear regression model, while the second line prints the performance of the model on the training dataset.
mlr.model <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o, family = "gaussian")

h2o.performance(mlr.model)
Output:
|======================================================================| 100%
H2ORegressionMetrics: glm
** Reported on training data. **

MSE: 3236195
RMSE: 1798.943
MAE: 1410.828
RMSLE: 0.2562514
Mean Residual Deviance : 3236195
R^2 : 0.4692928
Null Deviance :2402569520
Null D.o.F. :393
Residual Deviance :1275060937
Residual D.o.F. :389
AIC :7036.148
The above output shows that the RMSE and R-squared values for the linear regression model on the training data are about 1,799 and 47%, respectively. Because unemploy is recorded in thousands, this RMSE corresponds to roughly 1.8 million people. These numbers are not great, as the R-squared value is low. Later, you will try the random forest model to improve model performance.
The model evaluation will happen on the test data, but the first step is to use the model to generate predictions on the test data. The code below generates these predictions and saves them as a data frame.
predict.mlr <- as.data.frame(h2o.predict(mlr.model, test.h2o))
For evaluating the model performance on test data, you will create a function to calculate the evaluation metrics, R-squared and RMSE. The code below creates the evaluation metric function.
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE <- sqrt(SSE / nrow(df))

  # Model performance metrics
  data.frame(
    RMSE = RMSE,
    Rsquare = R_square
  )
}
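To build intuition for what the function returns, here is a small sanity check on toy numbers. The block repeats a condensed copy of the function so that it runs on its own. The data frame `toy` and the two scenarios are made up for illustration: predictions that match the truth exactly, and predictions that always equal the mean of the truth.

```r
# Condensed copy of the evaluation function, repeated so this block is self-contained.
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  data.frame(RMSE = sqrt(SSE / nrow(df)), Rsquare = 1 - SSE / SST)
}

toy <- data.frame(y = c(10, 20, 30, 40))

# Predictions identical to the observed values.
perfect <- eval_results(toy$y, toy$y, toy)

# Predictions that ignore the inputs and always return the mean.
baseline <- eval_results(toy$y, rep(mean(toy$y), nrow(toy)), toy)

perfect
baseline
```

A perfect model gives an RMSE of 0 and an R-squared of 1, while always predicting the mean gives an R-squared of 0. Any useful model should land between those two reference points on the same data.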
Now use the predictions and the evaluation function to print the evaluation result on test data.
# evaluation on test data
eval_results(test$unemploy, predict.mlr$predict, test)
Output:
A data.frame: 1 x 2
RMSE Rsquare
<dbl> <dbl>
2037.001 0.5154574
The above output shows that the RMSE and R-squared values on the test data are about 2,037 (roughly two million people) and 51%, respectively. These are still not great results, which suggests that linear regression is not the right algorithm for this data. You will next build a random forest model to see if the performance improves.
Random forest algorithms are called forests because they are the collection, or ensemble, of several decision trees. In random forest, instead of trying splits on all the features, a sample of features is selected for each split, thereby reducing the variance of the model.
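The feature sampling described above can be sketched in a few lines of base R. This is only an illustration of the idea, not how H2O implements it internally: it draws a random subset of three of the four predictors for each of five hypothetical splits. The names `mtries` and `candidate_sets` are chosen for this example, with `mtries` mirroring the argument of the same name used later in this guide.

```r
set.seed(42)
predictors <- c("pce", "pop", "psavert", "uempmed")  # the four inputs used in this guide
mtries     <- 3                                     # features considered at each split

# Draw an independent feature subset for each of five hypothetical splits.
candidate_sets <- replicate(5, sample(predictors, mtries), simplify = FALSE)
candidate_sets
```

Because each split sees a different subset, individual trees are less likely to rely on the same dominant predictor, which is what lowers the variance of the averaged forest.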
In R, the h2o.randomForest() function is used to train the random forest algorithm. The first line of code below builds the model on the training data, while the second line prints the performance summary of the model.
rforest.model <- h2o.randomForest(y = y.dep, x = x.indep, training_frame = train.h2o, ntrees = 1000, mtries = 3, max_depth = 4, seed = 1122)

h2o.performance(rforest.model)
Output:
H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE: 529099.3
RMSE: 727.3921
MAE: 537.3119
RMSLE: 0.08856347
Mean Residual Deviance : 529099.3
The above output shows that the RMSE on the training data, computed on out-of-bag samples, is about 727, or roughly 0.73 million people. The next step is to evaluate the model performance on the test data, which is done with the code below.
predict.rf <- as.data.frame(h2o.predict(rforest.model, test.h2o))
eval_results(test$unemploy, predict.rf$predict, test)
Output:
RMSE Rsquare
<dbl> <dbl>
647.5397 0.9510354
The above output shows that the RMSE and R-squared on the test data are 0.65 million and 95%, respectively. The performance of the random forest model is far superior to the multiple linear regression model built earlier.
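The comparison can be made concrete with a little arithmetic. The sketch below copies the two test-set results from the outputs in this guide into a small data frame (the names `results` and `rmse_reduction` are illustrative), converts RMSE from thousands to people, and computes the fractional reduction in test RMSE.

```r
# Test-set results copied from the outputs earlier in this guide.
results <- data.frame(
  model   = c("linear regression", "random forest"),
  RMSE    = c(2037.001, 647.5397),
  Rsquare = c(0.5154574, 0.9510354)
)

# unemploy is recorded in thousands, so multiply by 1000 to express RMSE in people.
results$RMSE_people <- results$RMSE * 1000

# Fractional reduction in test RMSE when moving from linear regression to random forest.
rmse_reduction <- 1 - results$RMSE[2] / results$RMSE[1]

results
round(rmse_reduction, 3)
```

On these figures the random forest cuts the test RMSE by a little over two thirds relative to the linear model.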
In this guide, you learned about the basics of building machine learning models using H2O and R. You learned how to launch the h2o cluster and integrate it with the R session. Finally, you built a couple of regression models.
To learn more about Data Science with R, please refer to the following guides: