
Deepika Singh

Getting Started with H2O


  • Jun 23, 2020
  • 11 Min read
  • 105 Views

Introduction

H2O is a fast, scalable, open-source machine learning and artificial intelligence platform that can be used to build machine learning models on large data sets. In addition, H2O can integrate with major programming languages such as R, Python, and Spark.

In this guide, you will learn the basics of building machine learning models using H2O and R.

Data

Unemployment is a big socio-economic and political concern for any country, and managing it is a chief task for any government. In this guide, you will build regression algorithms for predicting unemployment within an economy.

The data comes from the US economic time series data available from http://research.stlouisfed.org/fred2. It contains 564 rows and 5 variables, as described below:

  1. psavert: personal savings rate

  2. pce: personal consumption expenditures, in billions of dollars

  3. uempmed: median duration of unemployment, in weeks

  4. pop: total population, in millions

  5. unemploy: number of unemployed people, in thousands. This is the dependent variable.

Evaluation Metrics

You will evaluate the performance of the model using two metrics: R-squared value and Root Mean Squared Error (RMSE). A lower RMSE and a higher R-squared value indicate a better model.
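As a quick sanity check, both metrics can be computed in a couple of lines of base R. The numbers below are made up for illustration and are not from the guide's dataset.

```r
# Toy illustration of RMSE and R-squared (made-up numbers)
true      <- c(100, 150, 200, 250, 300)
predicted <- c(110, 140, 210, 240, 310)

rmse      <- sqrt(mean((true - predicted)^2))
r_squared <- 1 - sum((predicted - true)^2) / sum((true - mean(true))^2)

rmse       # 10
r_squared  # 0.98
```

Every prediction here is off by exactly 10, so the RMSE is 10, and the model explains 98 percent of the variance in the toy target.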

Start by loading the required libraries and the data.

library(plyr)
library(readr)
library(dplyr)

dat <- read_csv("data.csv")
glimpse(dat)

Output:

Observations: 564
Variables: 5
$ pce      <dbl> 531.5, 534.2, 544.9, 544.6, 550.4, 556.8, 563.8, 567.6, 568.…
$ pop      <dbl> 199808, 199920, 200056, 200208, 200361, 200536, 200706, 2008…
$ psavert  <dbl> 11.7, 12.2, 11.6, 12.2, 12.0, 11.6, 10.6, 10.4, 10.4, 10.6, …
$ uempmed  <dbl> 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4.5, 4.2, 4.6, 4.8, 4.4, 4.4, …
$ unemploy <dbl> 2878, 3001, 2877, 2709, 2740, 2938, 2883, 2768, 2686, 2689, …

The output shows that all the variables in the dataset are numerical variables (labeled as 'dbl').

Data Partitioning

You will build the model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.

The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partition. The next two lines of code create the training and test sets, while the last two lines print the dimensions of the training and test sets. The training set contains 70 percent of the data while the test set contains the remaining 30 percent.

set.seed(100)
index = sample(1:nrow(dat), 0.7*nrow(dat))

train = dat[index,]
test = dat[-index,]

dim(train)
dim(test)

Output:

394 5

170 5

Connecting H2O and R

You have created the data partition and will now build the predictive model using H2O and R. Before building machine learning models, however, you must connect H2O with R. The first step is to install the h2o package, which is done with the code below.

install.packages("h2o")
library(h2o)

Once you have installed the package, launch the cluster and initialize it with the code below.

localH2O <- h2o.init(nthreads = -1)
h2o.init()

Output:

Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         24 minutes 26 seconds 
    H2O cluster timezone:       Etc/UTC 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.30.0.1 
    H2O cluster version age:    2 months and 7 days  
    H2O cluster name:           H2O_started_from_R_nbuser_xnt904 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   0.80 GB 
    H2O cluster total cores:    2 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 3.5.3 (2019-03-11) 

The output above shows that the connection between H2O and R is successful, which means you are ready to build machine learning models. To begin, transfer the data from R to the H2O instance. This is done with the code below.

train.h2o <- as.h2o(train)

test.h2o <- as.h2o(test)

The next step is to identify variables to be used in modeling. This is done with the code below.

# dependent variable
y.dep <- 5

# independent variables
x.indep <- c(1:4)

You are now ready to build regression models using R and H2O.

Linear Regression

The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The input variables are assumed to have a Gaussian distribution. Another assumption is that the predictors are not highly correlated with each other (a problem called multi-collinearity).
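One quick way to check the multicollinearity assumption is a correlation matrix built with base R's cor() function. The snippet below uses synthetic predictors for illustration; with this guide's data you would instead call cor() on the four predictor columns of dat.

```r
# Illustration: spotting multicollinearity with a correlation matrix
# (synthetic predictors; with this guide's data you would use
#  cor(dat[, c("pce", "pop", "psavert", "uempmed")]) instead)
set.seed(42)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(100)                       # independent predictor

round(cor(cbind(x1, x2, x3)), 2)
# x1 and x2 show a correlation close to 1, flagging a problem
```

A pair of predictors with a correlation near 1 or -1 is a sign that one of them carries little additional information and may destabilize the coefficient estimates.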

A multiple linear regression model in H2O can be built using the h2o.glm() function, which covers the family of generalized linear models, including linear, lasso, ridge, and logistic regression. The first line of code below builds the multiple linear regression model, while the second line prints the performance of the model on the training dataset.

mlr.model <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o, family = "gaussian")

h2o.performance(mlr.model)

Output:

|======================================================================| 100%
H2ORegressionMetrics: glm
** Reported on training data. **

MSE:  3236195
RMSE:  1798.943
MAE:  1410.828
RMSLE:  0.2562514
Mean Residual Deviance :  3236195
R^2 :  0.4692928
Null Deviance :2402569520
Null D.o.F. :393
Residual Deviance :1275060937
Residual D.o.F. :389
AIC :7036.148

The above output shows that the RMSE and R-squared values for the linear regression model on the training data are roughly 1,799 and 0.47, respectively. Since the target variable is measured in thousands, this RMSE corresponds to an error of about 1.8 million people. These numbers are not great, as the R-squared value is low. Later, you will try a random forest model to improve performance.
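Before switching algorithms entirely, note that h2o.glm() also supports regularized variants. The sketch below is one option to try and is not part of the original walkthrough: alpha = 0 requests ridge regression (alpha = 1 would be lasso), and lambda_search asks H2O to search for a good penalty strength. It assumes the running cluster and the train.h2o, x.indep, and y.dep objects defined above.

```r
# Sketch: a ridge-regularized linear model with h2o.glm()
# (assumes the H2O cluster and training frame created earlier)
ridge.model <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o,
                       family = "gaussian", alpha = 0, lambda_search = TRUE)
h2o.performance(ridge.model)
```

Regularization mainly helps when multicollinearity or overfitting is the problem, so it may not rescue a model whose underlying linear assumption is wrong.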

Evaluating the Model

The model evaluation will happen on the test data, but the first step is to use the model to generate predictions on the test data. The code below generates the predictions and saves them as a data frame.

predict.mlr <- as.data.frame(h2o.predict(mlr.model, test.h2o))

For evaluating the model performance on test data, you will create a function to calculate the evaluation metrics, R-squared and RMSE. The code below creates the evaluation metric function.

eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE <- sqrt(SSE / nrow(df))

  # Model performance metrics
  data.frame(
    RMSE = RMSE,
    Rsquare = R_square
  )
}

Now use the predictions and the evaluation function to print the evaluation result on test data.

# evaluation on test data
eval_results(test$unemploy, predict.mlr$predict, test)

Output:

A data.frame: 1 x 2
RMSE	Rsquare
<dbl>	<dbl>
2037.001	0.5154574

The above output shows that the RMSE and R-squared values on the test data are roughly 2,037 (about two million people, given the units) and 0.52, respectively. These are still not great results, which suggests that linear regression is not the right algorithm for this data. You will next build a more powerful random forest model to see if performance improves.

Random Forest

Random forest algorithms are called forests because they are the collection, or ensemble, of several decision trees. In random forest, instead of trying splits on all the features, a sample of features is selected for each split, thereby reducing the variance of the model.
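The per-split feature sampling idea can be illustrated in a few lines of base R. The snippet below is purely illustrative; inside H2O this sampling is controlled by the mtries argument used in the next section.

```r
# Illustration: at each split, only a random subset of predictors
# is considered (here 2 of the 4 features in this guide's data)
predictors <- c("pce", "pop", "psavert", "uempmed")

set.seed(1)
sample(predictors, 2)  # candidate features for one split
sample(predictors, 2)  # a different random subset for the next split
```

Because each split sees a different subset, the individual trees are decorrelated, and averaging their predictions lowers the variance of the ensemble.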

Training the Model

In R, the h2o.randomForest() function is used to train the random forest algorithm. The first line of code below builds the model on the training data, while the second line prints the performance summary of the model.

rforest.model <- h2o.randomForest(y = y.dep, x = x.indep, training_frame = train.h2o, ntrees = 1000, mtries = 3, max_depth = 4, seed = 1122)

h2o.performance(rforest.model)

Output:

H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  529099.3
RMSE:  727.3921
MAE:  537.3119
RMSLE:  0.08856347
Mean Residual Deviance :  529099.3

The above output shows that the RMSE on the out-of-bag training samples is roughly 727, or about 0.73 million people, a large improvement over the linear model. The next step is to evaluate the model performance on the test data, which is done with the code below.

predict.rf <- as.data.frame(h2o.predict(rforest.model, test.h2o))
eval_results(test$unemploy, predict.rf$predict, test)

Output:

RMSE		Rsquare
<dbl>		<dbl>
647.5397	0.9510354

The above output shows that the RMSE and R-squared values on the test data are roughly 648 (about 0.65 million people) and 0.95, respectively. The performance of the random forest model is far superior to that of the multiple linear regression model built earlier.
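As optional follow-ups not covered in the walkthrough above, the h2o package also lets you inspect which predictors drove the forest's performance and then release the cluster's resources when you are done:

```r
# Inspect variable importances for the trained forest,
# then shut the cluster down once you are finished
h2o.varimp(rforest.model)
h2o.shutdown(prompt = FALSE)
```

Shutting down the cluster discards any frames and models held in H2O's memory, so export anything you want to keep before calling it.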
