H2O is a fast, scalable, open-source machine learning and artificial intelligence platform that can be used to build machine learning models on large data sets. In addition, H2O integrates with popular languages and platforms such as R, Python, and Spark.
In this guide, you will learn the basics of building machine learning models using H2O and R.
Unemployment is a big socio-economic and political concern for any country, and managing it is a chief task for any government. In this guide, you will build regression algorithms for predicting unemployment within an economy.
The data is the US economic time series available from http://research.stlouisfed.org/fred2. It contains 564 rows and 5 variables, as described below:
psavert: personal savings rate
pce: personal consumption expenditures, in billions of dollars
uempmed: median duration of unemployment, in weeks
pop: total population, in thousands
unemploy: number of unemployed people, in thousands. This is the dependent variable.
You will evaluate the performance of the model using two metrics: R-squared value and Root Mean Squared Error (RMSE). A lower RMSE and a higher R-squared value indicate a better model.
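For reference, both metrics can be written in terms of the observed values, the predictions, and the mean of the observed values. This is the standard formulation, given here as a sketch; the symbols (n observations, observed values y_i, predictions \hat{y}_i, and mean \bar{y}) are introduced for illustration and do not appear in the data description.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

RMSE is expressed in the same units as the dependent variable, while R-squared is unitless.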
Start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)

dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 564
Variables: 5
$ pce      <dbl> 531.5, 534.2, 544.9, 544.6, 550.4, 556.8, 563.8, 567.6, 568.…
$ pop      <dbl> 199808, 199920, 200056, 200208, 200361, 200536, 200706, 2008…
$ psavert  <dbl> 11.7, 12.2, 11.6, 12.2, 12.0, 11.6, 10.6, 10.4, 10.4, 10.6, …
$ uempmed  <dbl> 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4.5, 4.2, 4.6, 4.8, 4.4, 4.4, …
$ unemploy <dbl> 2878, 3001, 2877, 2709, 2740, 2938, 2883, 2768, 2686, 2689, …
The output shows that all the variables in the dataset are numerical variables (labeled as 'dbl').
You will build the model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partition. The next two lines of code create the training and test sets, while the last two lines print the dimensions of the training and test sets. The training set contains 70 percent of the data while the test set contains the remaining 30 percent.
set.seed(100)
index <- sample(1:nrow(dat), 0.7 * nrow(dat))

train <- dat[index, ]
test <- dat[-index, ]

dim(train)
dim(test)
Output:
394 5
170 5
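As a quick check of the arithmetic behind those dimensions, the sketch below reproduces the split sizes using row indices alone, assuming the 564 rows reported by glimpse(). The explicit floor() call and the names n_train and test_index are additions for illustration; the guide's own code passes 0.7*nrow(dat) to sample() directly.

```r
# Row count reported by glimpse(dat) earlier in this guide
n <- 564

set.seed(100)

# 70 percent of the rows, rounded down to a whole number of observations
n_train <- floor(0.7 * n)

# Sample row indices without replacement for the training set
index <- sample(seq_len(n), n_train)

# Every remaining row belongs to the test set
test_index <- setdiff(seq_len(n), index)

length(index)       # 394
length(test_index)  # 170
```

Because the two index sets do not overlap, no observation is used for both training and evaluation.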
You have created the data partition and will build the predictive model using H2O and R. However, before building machine learning models, you must connect h2o with R. The first step is to install the h2o package, which is done with the code below.
install.packages("h2o")
library(h2o)
Once you have installed the library, launch the cluster and initialize it with the code below.
localH2O <- h2o.init(nthreads = -1)
h2o.init()
Output:
 Connection successful!

R is connected to the H2O cluster:
    H2O cluster uptime:         24 minutes 26 seconds
    H2O cluster timezone:       Etc/UTC
    H2O data parsing timezone:  UTC
    H2O cluster version:        3.30.0.1
    H2O cluster version age:    2 months and 7 days
    H2O cluster name:           H2O_started_from_R_nbuser_xnt904
    H2O cluster total nodes:    1
    H2O cluster total memory:   0.80 GB
    H2O cluster total cores:    2
    H2O cluster allowed cores:  2
    H2O cluster healthy:        TRUE
    H2O Connection ip:          localhost
    H2O Connection port:        54321
    H2O Connection proxy:       NA
    H2O Internal Security:      FALSE
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
    R Version:                  R version 3.5.3 (2019-03-11)
The output above shows that the connection between h2o and R is successful. This means you are ready to build the machine learning models. To begin, transfer the data from R to an h2o instance. This is done with the code below.
train.h2o <- as.h2o(train)
test.h2o <- as.h2o(test)
The next step is to identify variables to be used in modeling. This is done with the code below.
#dependent variable
y.dep <- 5

#independent variables
x.indep <- c(1:4)
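Positional indices break silently if the column order ever changes. As an alternative sketch, the same roles can be derived from column names, since the x and y arguments of the H2O modeling functions accept names as well as indices. The small data frame dat_demo below is a stand-in for dat (its column names and first two rows are taken from the glimpse() output), and y.name and x.names are illustrative names that do not appear elsewhere in this guide.

```r
# Stand-in for the guide's data: same columns, first two rows from glimpse()
dat_demo <- data.frame(
  pce = c(531.5, 534.2),
  pop = c(199808, 199920),
  psavert = c(11.7, 12.2),
  uempmed = c(5.1, 4.5),
  unemploy = c(2878, 3001)
)

# Dependent variable, selected by name
y.name <- "unemploy"

# Independent variables: every other column
x.names <- setdiff(names(dat_demo), y.name)

# The names map back to the positions used in the guide
match(y.name, names(dat_demo))   # 5
match(x.names, names(dat_demo))  # 1 2 3 4
```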
You are now ready to build regression models using R and H2O.
The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The residuals (errors) are assumed to be normally distributed. Another assumption is that the predictors are not highly correlated with each other; when they are, the problem is called multicollinearity.
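One simple way to screen for highly correlated predictors is to inspect their pairwise correlations. The sketch below uses simulated data: x1, x2, and x3 are made up for illustration, and the 0.9 cutoff is an arbitrary assumption rather than a rule. With this guide's data, the same cor() call could be applied to the four predictor columns of the training set.

```r
set.seed(1)

# Simulated predictors: x2 is nearly a copy of x1, x3 is independent
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)
x3 <- rnorm(200)
predictors <- data.frame(x1 = x1, x2 = x2, x3 = x3)

# Pairwise Pearson correlations among the predictors
corr <- cor(predictors)

# Flag each pair (upper triangle only) above an arbitrary 0.9 cutoff
high <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]]
)
flagged
```

Here only the x1 and x2 pair is flagged, which is the kind of pair worth reviewing before fitting a linear model.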
A multiple linear regression model in H2O can be built using the h2o.glm() function, which can be used for all types of regression algorithms such as linear, lasso, ridge, logistic, etc. The first line of code below builds the multiple linear regression model, while the second line prints the performance of the model on the training dataset.
mlr.model <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o, family = "gaussian")

h2o.performance(mlr.model)
Output:
  |======================================================================| 100%
H2ORegressionMetrics: glm
** Reported on training data. **

MSE:  3236195
RMSE:  1798.943
MAE:  1410.828
RMSLE:  0.2562514
Mean Residual Deviance :  3236195
R^2 :  0.4692928
Null Deviance :2402569520
Null D.o.F. :393
Residual Deviance :1275060937
Residual D.o.F. :389
AIC :7036.148
The output above shows that, on the training data, the linear regression model has an RMSE of about 1,799 (roughly 1.8 million people, since unemploy is measured in thousands) and an R-squared value of 47 percent. These numbers are not great, as the R-squared value is low. Later, you will try a random forest model to improve performance.
The model will be evaluated on the test data, so the first step is to generate predictions for it. The code below generates predictions on the test data and saves them as a data frame.
predict.mlr <- as.data.frame(h2o.predict(mlr.model, test.h2o))
For evaluating the model performance on test data, you will create a function to calculate the evaluation metrics, R-squared and RMSE. The code below creates the evaluation metric function.
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE <- sqrt(SSE / nrow(df))

  # Model performance metrics
  data.frame(
    RMSE = RMSE,
    Rsquare = R_square
  )
}
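To see what the function returns, the sketch below repeats its definition (so the block runs on its own) and applies it to four made-up observations where the arithmetic is easy to check by hand: SSE is 1 and SST is 5, so R-squared is 0.8 and RMSE is sqrt(1/4) = 0.5. The toy data frame is hypothetical and unrelated to the unemployment data.

```r
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE <- sqrt(SSE / nrow(df))
  data.frame(RMSE = RMSE, Rsquare = R_square)
}

# Four made-up observations; only the last prediction is off, by one unit
toy <- data.frame(actual = c(1, 2, 3, 4), predicted = c(1, 2, 3, 5))

res <- eval_results(toy$actual, toy$predicted, toy)
res  # RMSE 0.5, Rsquare 0.8
```

Note that the df argument is used only for its row count, so it should have one row per element of true and predicted.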
Now use the predictions and the evaluation function to print the evaluation result on test data.
#evaluation on test data
eval_results(test$unemploy, predict.mlr$predict, test)
Output:
A data.frame: 1 x 2
RMSE      Rsquare
<dbl>     <dbl>
2037.001  0.5154574
The output above shows that the RMSE and R-squared values on the test data are about 2.0 million and 52 percent, respectively. These results are still weak, which suggests that linear regression is not well suited to this data. Next, you will build a random forest model to see if the performance improves.
Random forest algorithms are called forests because they are the collection, or ensemble, of several decision trees. In random forest, instead of trying splits on all the features, a sample of features is selected for each split, thereby reducing the variance of the model.
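The per-split feature sampling can be illustrated in a few lines of base R. This is only a conceptual sketch, not H2O's implementation: the feature names are this guide's four predictors, mtries = 3 is the number of candidate features per split used in this guide's random forest call, and the candidates_split_* names are illustrative.

```r
features <- c("pce", "pop", "psavert", "uempmed")
mtries <- 3

set.seed(42)

# At each split, only a random subset of the features is considered
candidates_split_1 <- sample(features, mtries)
candidates_split_2 <- sample(features, mtries)

candidates_split_1
candidates_split_2
```

Because different splits see different candidate features, the individual trees are less correlated with each other, which is what lowers the variance of the averaged prediction.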
In R, the h2o.randomForest() function is used to train the random forest algorithm. The first line of code below builds the model on the training data, while the second line prints the performance summary of the model.
rforest.model <- h2o.randomForest(y = y.dep, x = x.indep, training_frame = train.h2o, ntrees = 1000, mtries = 3, max_depth = 4, seed = 1122)

h2o.performance(rforest.model)
Output:
H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  529099.3
RMSE:  727.3921
MAE:  537.3119
RMSLE:  0.08856347
Mean Residual Deviance :  529099.3
The output above shows that the RMSE on the training data, computed on out-of-bag samples, is about 0.73 million. The next step is to evaluate the model performance on the test data, which is done with the code below.
predict.rf <- as.data.frame(h2o.predict(rforest.model, test.h2o))
eval_results(test$unemploy, predict.rf$predict, test)
Output:
RMSE      Rsquare
<dbl>     <dbl>
647.5397  0.9510354
The output above shows that the RMSE and R-squared on the test data are about 0.65 million and 95 percent, respectively. The random forest model performs far better than the multiple linear regression model built earlier.
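Putting the two test-set results side by side makes the gap concrete. The sketch below only re-uses the numbers printed in this guide; the comparison table and the rmse_reduction name are illustrative additions.

```r
# Test-set metrics as reported in the outputs of this guide
comparison <- data.frame(
  model = c("Multiple linear regression", "Random forest"),
  RMSE = c(2037.001, 647.5397),
  Rsquare = c(0.5154574, 0.9510354)
)

# Relative reduction in test RMSE from linear regression to random forest
rmse_reduction <- 1 - comparison$RMSE[2] / comparison$RMSE[1]

comparison
round(100 * rmse_reduction, 1)  # 68.2
```

In other words, the random forest cuts the test RMSE by roughly two thirds on this particular split.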
In this guide, you learned about the basics of building machine learning models using H2O and R. You learned how to launch the h2o cluster and integrate it with the R session. Finally, you built a couple of regression models.