H2O Cluster on AWS EC2: Pros, Cons and How to Get Started

Matthew Cloney · SingleStone · May 29, 2019

While my last few blog articles have focused more on machine learning and deep learning, I thought it might be a nice change of pace to write about DevOps as it relates to ML and data engineering. After all, without data engineering, machine learning opportunities are limited. And without DevOps, setting up a repeatable, cost-effective infrastructure is much more difficult and time consuming. DevOps engineers create infrastructure as code that can be deployed or modified easily, quickly and on-demand. Data engineers create reliable, repeatable pipelines to move data from disparate source systems to a centralized location where data scientists turn it into value. Data engineers, therefore, are the unsung heroes of the analytics landscape (full disclosure: I was a data engineer for many years).

I recently started experimenting with H2O, and I have to say, it’s an interesting option. Having said that, it has its strengths and weaknesses. For instance, I don’t find it to be a good option for data munging. However, once you have a data set ready for machine learning, it’s worth taking the time to learn.

Why Consider H2O?

H2O is “an open source, in-memory, distributed, fast and scalable machine learning” platform. You can use it with either Python or R, and it even has a variant called Sparkling Water that you can use with Apache Spark. Perhaps its most intriguing feature is “AutoML,” which runs multiple supervised algorithms against your data and returns a “leaderboard” ranking the most effective ones.

Scaling Out

One great feature of H2O is that you can “scale out” by creating a multiple-node cluster. While the official cluster setup instructions are fairly straightforward, I have a few tips to add for:

  1. Using Terraform to automatically set up (and destroy) a cluster of virtually any size by running a single command
  2. Setting up a cluster of less expensive instances on Amazon’s EC2 “spot market”
  3. Running H2O as a service so it can be stopped and started with systemctl on systemd Linux variants (Amazon Linux 2, Ubuntu, etc.)

Sure, you can run H2O on your local machine, but when you run it on a cluster of cheap cloud VMs it’s a force to be reckoned with. In this blog post, I’ll show you how to create a three-node H2O cluster for less than $0.10 an hour, a setup that should comfortably outperform running H2O on a decently equipped MacBook Pro.

Caveats

A few disclaimers before we get started:

  1. This should not be used as a guide to create a production cluster.
  2. I’m assuming you’re running some flavor of Linux (or macOS) on your local machine.*
  3. You have installed the AWS Command Line Interface (AWS CLI) and Terraform.
  4. You have your AWS credentials file properly set up in the .aws subdirectory of your user’s home directory.
  5. We will launch instances in your default VPC and assign public IP addresses to each. Amazon creates a default VPC for you in each region when you create your AWS account.
  6. We will use Amazon Linux 2 for this demo.
  7. The scripts use t2.large EC2 instances with a spot price of $0.03 per hour. You’ll need to verify that the current spot price for t2.large instances in your chosen availability zone is below that amount for your spot requests to be fulfilled, or edit the Terraform files to raise the spot price (see the AWS CLI example after the note below).

* Note for Windows users: If you’re running Windows 8 or earlier, the easiest way to play along is to install VirtualBox and Ubuntu. If you’re on Windows 10, consider setting up Windows Subsystem for Linux. If you’re on Windows 10 Enterprise, Pro or Education, you can set up Hyper-V and create an Ubuntu VM there.
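To check the going spot rate mentioned in caveat 7, you can query the spot price history with the AWS CLI; for example:

aws ec2 describe-spot-price-history \
    --instance-types t2.large \
    --product-descriptions "Linux/UNIX" \
    --availability-zone us-east-1a \
    --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)"

Swap in your own availability zone; the most recent entry returned is the current spot price there.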

You can use a machine image to set up an H2O server with all the bells and whistles of the commercial product, including “Driverless AI.” However, a single node of the vendor-recommended instance type (p3.2xlarge) will run you more than $3.00 an hour. True, these instances are much more powerful than the t2.large instances we’ll be using, but you get what you pay for.

Assumptions and Summary of Steps

This article assumes you have decent knowledge of AWS, including setting up an SSH config file on your local machine, creating and working with EC2 instances and security groups, and using the AWS command line interface. Luckily, Terraform does much of the heavy lifting for us.

Here’s an overview of what you’ll need to do:

  1. Clone the GitHub repo with this article’s companion code.
  2. Create and set up a few variables in a terraform.tfvars file in the root directory of the repo you cloned.
  3. Run the bash script to kick off the Terraform code, allowing it to run to completion.
  4. Connect to the Web interface via a browser.
  5. Destroy resources after you’re done experimenting.
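In terminal terms, those five steps boil down to something like this (the repo URL placeholder is intentional; use this article’s companion repo):

git clone <companion-repo-url> h2o-cluster
cd h2o-cluster
vi terraform.tfvars                                # set the four required variables (next section)
sh launch_ondemand.sh /path/to/ssh/key/mykey.pem   # provision the cluster
# ...experiment with H2O Flow at http://127.0.0.1:8157 through the SSH tunnel...
terraform destroy                                  # tear everything down when you're finished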

Briefly, the Terraform code:

a. Creates a security group that allows you to connect to the VMs via SSH (for troubleshooting and SSH tunneling) and opens ports 54321 and 54322 (TCP and UDP) for communication between the nodes.

b. Creates three EC2 instances, writes a file called flatfile.txt to each node’s home directory (here, /home/ec2-user) containing the private IP addresses of all three nodes, and starts H2O on each node as a service.

c. Associates the security group with these instances.
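To give you a feel for what that involves, here is a simplified sketch of the two core resources in the spot version of the template (the resource names, variable names and instance count are illustrative; the repo’s actual code differs in detail):

resource "aws_security_group" "h2o" {
  name   = "h2o-cluster"
  vpc_id = var.vpc_id

  # SSH from anywhere; tighten the CIDR to your own IP in practice
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Inter-node H2O traffic on 54321-54322, TCP and UDP
  ingress {
    from_port = 54321
    to_port   = 54322
    protocol  = "tcp"
    self      = true
  }
  ingress {
    from_port = 54321
    to_port   = 54322
    protocol  = "udp"
    self      = true
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_spot_instance_request" "h2o_node" {
  count                  = 3
  ami                    = var.ami_id
  instance_type          = "t2.large"
  spot_price             = "0.03"
  key_name               = var.key_name
  subnet_id              = var.subnet_id
  iam_instance_profile   = var.iam_instance_profile
  vpc_security_group_ids = [aws_security_group.h2o.id]
  user_data              = file("bootstrap_h2o_python3.sh")
  wait_for_fulfillment   = true
}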

Setup Variables

First, take a look at the variables.tf file. Several variables have default values, and a few have none. You’ll assign values to the variables without defaults next (if you need to override any of the others, you can do that the same way).

Navigate to the repo’s root folder and create or edit the terraform.tfvars file. You’ll need to enter values for at least four of the variables:

  • iam_instance_profile — the name of the IAM instance profile (typically the same name as its IAM role) to attach to the EC2 instances
  • key_name — the name of your SSH key (without the .pem suffix)
  • vpc_id — the id of the VPC you will use
  • subnet_id — the id of the subnet within that VPC

Here is an example of how to assign a value to a variable:

iam_instance_profile = "My-IAM-Role-For-EC2-Instances"
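Putting it together, a complete terraform.tfvars might look like this (all values below are placeholders; substitute your own):

# terraform.tfvars -- example values only
iam_instance_profile = "My-IAM-Role-For-EC2-Instances"
key_name             = "my-ec2-keypair"            # matches my-ec2-keypair.pem
vpc_id               = "vpc-0123456789abcdef0"     # your default VPC's ID
subnet_id            = "subnet-0123456789abcdef0"  # a subnet in that VPC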

You probably won’t have to edit any of the other variable values, but keep in mind that if you’re not launching in the us-east-1 AWS region, you’ll also need to change the availability zone and the AMI ID, because those are region-specific.

Once you’ve edited your terraform.tfvars file, browse to the repo’s root folder in a terminal window and type the following to kick off the code in the Terraform template:

sh launch_ondemand.sh /path/to/ssh/key/mykey.pem

The key you use here should be the same one you named in key_name in your terraform.tfvars file. The user data code in bootstrap_h2o_python3.sh (the first of the three scripts in the repo) runs on each instance to install Java, Python 3, some Python modules, and H2O; copy some files from the GitHub repo; and set up H2O to run as a service on systemd Linux variants like Amazon Linux 2.

The second script defines how the service is started, stopped and reloaded, while the third (liberated from an excellent Stack Overflow post) is the code that actually runs when you type sudo service h2o start (or stop or restart) at the command line.
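If you’re curious what the service boils down to, a minimal systemd unit along these lines would do the job (this is a sketch under my own assumptions about the unit name, memory setting and file paths; the repo’s actual scripts differ):

# /etc/systemd/system/h2o.service -- illustrative sketch only
[Unit]
Description=H2O cluster node
After=network.target

[Service]
Type=simple
User=ec2-user
WorkingDirectory=/home/ec2-user
# Assumed install location; -flatfile tells this node where its peers are
ExecStart=/usr/bin/java -Xmx6g -jar /home/ec2-user/h2o/h2o.jar -flatfile /home/ec2-user/flatfile.txt -port 54321
Restart=on-failure

[Install]
WantedBy=multi-user.target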

Below is a screenshot of the inbound rules for the H2O security group created by the Terraform script. They allow inter-node communication and let you use SSH and SCP with your nodes.

Security group inbound rules.

Set up a local forward in your SSH config file to forward local connections on port 8157 to the H2O port on the remote host.

Below is a sample SSH config file you can use to connect to your H2O instances. Note the local forward on the first server. As long as you’re connected to the h2o1 server via SSH and H2O is running, you’ll be able to reach the Web-based GUI by entering http://127.0.0.1:8157 in your local machine’s Web browser. Edit this file to replace the IP address with the public IP address of your h2o1 server. You may choose to do the same for the other nodes, but you don’t need to unless you plan to connect to them via SSH for troubleshooting.
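A minimal version of that config looks something like this (the host alias, key path and IP address are placeholders; H2O serves Flow on port 54321 by default):

# ~/.ssh/config -- placeholder values
Host h2o1
    HostName 203.0.113.10              # replace with h2o1's public IP
    User ec2-user
    IdentityFile ~/.ssh/mykey.pem
    LocalForward 8157 127.0.0.1:54321  # local port 8157 -> H2O Flow on the node

# Repeat (without the LocalForward) for h2o2 and h2o3 if you want shell access to them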

When the Terraform code completes, it writes the private IPs and public IPs to flatfile.txt and public_ips.txt respectively, and uses the list of public IPs to copy flatfile.txt to the home directory of each H2O node. H2O uses this file to discover the other nodes in the cluster.
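For reference, the flatfile is just a plain-text list of cluster members, one per line in ip:port form (the private IPs below are placeholders):

172.31.20.11:54321
172.31.20.12:54321
172.31.20.13:54321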

Once the launch script completes and you’ve set up your SSH config file for port forwarding, SSH to the node of your H2O cluster you designated as h2o1 and enter the following command:

sudo systemctl status h2o

You should see something like the following:

H2O running on one of the cluster nodes

This is a good sign: H2O has been set up successfully on at least one node. While still connected to h2o1 via SSH, open a Web browser window on your local machine and enter the following to connect to H2O Flow:

http://127.0.0.1:8157

You should see the H2O Flow GUI appear. This works because of the LocalForward instruction you added for the h2o1 host in your SSH config file, and because you have an open SSH connection to that node.

A list of many of the commands you can issue with H2O Flow, the Web-based interface for H2O

You can verify that you properly set up your cluster by clicking the Admin menu and selecting Cluster Status.

Select Cluster Status from the Admin menu to ensure all three nodes were added to the cluster
Example of a properly launched H2O cluster

The values will vary, but you should see these four rows: one for each node of the cluster and one with the aggregate totals for all the nodes.

That’s it! You’ve just set up an inexpensive, powerful H2O cluster for about $0.09 per hour. Feel free to experiment with the H2O Flow interface in your browser for a while to get a feel for what it can do.
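You can also drive the cluster from a local Python session through the same SSH tunnel. Here’s a quick sketch (it assumes the h2o Python package is installed locally and your SSH connection to h2o1 is open):

import h2o

# Connect through the SSH tunnel: local port 8157 forwards to the node's 54321
h2o.connect(ip="127.0.0.1", port=8157)

# Should list all three nodes if the cluster formed correctly
h2o.cluster().show_status()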

Finally, go back to the directory with your Terraform code and enter terraform destroy to delete the cluster nodes and security group. You’ll have to enter yes at the prompt and wait until you see a message confirming all resources were successfully destroyed.

Important Caveat: H2O is not set up for high availability at the time of this writing, so if one of the nodes goes down, you’ll need to restart the whole cluster rather than just the service or the failed node’s VM.

In a future blog post, I’ll show you how to make use of H2O’s powerful AutoML feature to do automated machine learning.
