Running H2O.ai Machine Learning Platform on a Hadoop/Yarn Cluster

Hussein Moghnieh, Ph.D.
9 min read · Feb 2, 2016

Tags: H2O.ai, Hadoop, Yarn, Hive, Cloudera, Machine Learning

For the past few weeks, I've been experimenting with the Hadoop stack and the H2O.ai machine learning library, and I wanted to share my experience. I welcome your feedback, comments, and corrections.

Problem To Solve

Let's consider a simple image recognition problem: the goal is to write a program that distinguishes between the deceptively similar handwritten digits "4" and "9" using machine learning techniques. A large data set will be used to train a machine learning model that can predict whether an image contains a "4" or a "9".

An image representing a hand-written “9”

The data used is downloaded from the University of California, Irvine (UCI) website (https://archive.ics.uci.edu/ml/datasets/Gisette). The training set is composed of two files:

  • gisette_train.data: each row contains 5000 integers representing the pixels of one image.
  • gisette_train.labels: the corresponding classification of each image, where 1 represents "4" and -1 represents "9".
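
For reference, the two files can be loaded and merged in R roughly like this (a minimal sketch, assuming the files sit in a ./data/ folder; the CLASSIFICATION column name matches the one used by the training script later on):

# Read the 5000 pixel columns and the labels, then merge them into one data frame
train_x <- read.table("data/gisette_train.data")    # one image per row, 5000 integers
train_y <- read.table("data/gisette_train.labels")  # 1 = "4", -1 = "9"
train   <- cbind(train_x, CLASSIFICATION = as.factor(train_y$V1))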

This tutorial is divided into two parts. In the first part, the H2O.ai machine learning library is used to train a model on the problem described above without taking advantage of Hadoop; of course, this means the problem is small enough to fit in a single computer's memory. In the second part, a Hadoop cluster will be used to run H2O.ai's algorithms on the same problem.

An important feature of H2O.ai is its ability to export trained models as Plain Old Java Objects (POJOs). These POJOs can be incorporated into Java code to provide real-time predictions.

This tutorial uses the DigitalOcean cloud service to create Ubuntu instances (Ubuntu 14.04, 64-bit).

Running H2O.ai without Hadoop

This section demonstrates the use of H2O.ai library to train a machine learning model. The steps are:

  • Create an Ubuntu (14.04) instance on DigitalOcean
  • Download H2O.ai library
  • Install R
  • Clone the following project from GitHub https://github.com/husseinmoghnieh/h2o_gisette.git
  • Run an R script to generate the model
  • Run a Java code to test the generated POJO model

Setting Ubuntu on DigitalOcean

Log in to the DigitalOcean service and create an Ubuntu instance (14.04 Trusty Tahr, 64-bit). We will name this server ch-master. If you are not familiar with this process, please watch the following short video.

Once the server is created, you will receive an email with the password for root access (note: it is bad security practice to create a server with root password access; use SSH keys only instead).

Configuring the Instance

Uninstall OpenJDK and install Oracle JDK 7:

sudo apt-get purge openjdk-\*
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

Let's continue the installation process:

sudo apt-get install git
sudo apt-get install maven
sudo apt-get install libcurl4-gnutls-dev
sudo apt-get install r-base-core

wget http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/8/h2o-3.6.0.8-cdh5.4.2.zip
unzip h2o-3.6.0.8-cdh5.4.2.zip

Type R to access the R console and run the following commands to install the dependency packages and H2O's R library.

install.packages("RCurl")
install.packages("jsonlite")
install.packages("statmod")
install.packages("/root/h2o-3.6.0.8-cdh5.4.2/R/h2o_3.6.0.8.tar.gz",repos = NULL, type = "source")

Now quit the R console by typing quit().

Running H2O to create the models

Clone this project from GitHub; it contains the training data set and the source code/scripts needed to complete this tutorial.

git clone https://github.com/husseinmoghnieh/h2o_gisette.git
cd h2o_gisette

Now we are ready to train a model by running the following R script.

Rscript scriptLocal.R
Running H2O.ai — Standalone Mode

It is worth going over the steps the R script uses to create and test the model. The script loads the training data set from the ./data/ folder and splits it into two subsets (80% to train the model and 20% to validate it). It then performs binomial regression by calling H2O's Generalized Linear Model function to create the model.
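
For reference, the surrounding steps look roughly like this (a minimal sketch, not the exact contents of scriptLocal.R; the file name and the preset variable are assumptions based on the description above). The model itself is then created with the h2o.glm call shown below.

library(h2o)
h2o.init()                                  # start a local H2O instance

# Import the merged training data (hypothetical file name)
gisette <- h2o.importFile("./data/gisette_train.csv")

# 80/20 split into training and validation frames
splits <- h2o.splitFrame(gisette, ratios = 0.8)
train  <- splits[[1]]
valid  <- splits[[2]]

# 'preset' holds the names of the 5000 predictor (pixel) columns
preset <- setdiff(colnames(gisette), "CLASSIFICATION")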

gisette_model <- h2o.glm(x = preset, y = "CLASSIFICATION", training_frame = train,
                         validation_frame = valid, model_id = "GBMModel", family = "binomial")

Finally, the script takes the first two rows of the data and runs a prediction on them, producing the following results:

predict: 1
p(-1) : 0.07957686
p(1) : 0.9204231
predict: -1
p(-1) : 0.9462881
p(1) : 0.05371189

The above results can be interpreted as follows: for the first image, the prediction is class 1 (i.e., the image contains a "4") with a probability of 0.92. For the second test case, the prediction is class -1 (i.e., the image contains a "9") with a probability of 0.94.
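
In the script, this scoring step boils down to a single call (a sketch):

# Score the first two rows with the fitted model; the result holds the
# predicted class plus one probability column per class
preds <- h2o.predict(gisette_model, train[1:2, ])
print(preds)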

The last step of the R script downloads the POJO model into a ./tmp folder.
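
Exporting the model uses H2O's POJO download function (a sketch; the path matches the ./tmp folder mentioned above):

# Write the generated POJO source (GBMModel.java) into ./tmp
h2o.download_pojo(gisette_model, path = "./tmp")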

Prediction using H2O’s POJO model

Let’s copy the POJO model into the Java folders by running

./copy_model.sh

Now we are ready to run a standalone Java program (PredictMain.java) that uses the saved POJO model to run a prediction on two test cases (i.e., two images) fetched from the ./resources folder.

./run_pojo.sh

The above command will compile, build, and run the project, producing the following results. As you can see, the predictions produced by the Java code match those generated by the R script.

prediction…
prediction: 1 0.0795768580550732 0.9204231419449268
prediction: 0 0.9462881129145255 0.053711887085474495

Prediction using H2O’s POJO

Let's look at the code that generated the above results. In PredictMain.java, the data for two images is loaded from ./resources and passed to EasyPredictModelWrapper (the POJO model wrapper).

// POJO model generated by H2O (class name matches model_id "GBMModel"),
// wrapped in H2O's EasyPredictModelWrapper
GBMModel gbmModel = new GBMModel();
EasyPredictModelWrapper gisetteModel = new EasyPredictModelWrapper(gbmModel);

// Load one test image (a CSV row of pixel values) from ./resources
ClassLoader classLoader = PredictMain.class.getClassLoader();
File file = new File(classLoader.getResource("pred_test_1_new.csv").getFile());
FileReader fileReader = new FileReader(file);
CSVReader reader = new CSVReader(fileReader);
List<String[]> myEntries = reader.readAll();
String[] header = myEntries.get(0);
String[] values = myEntries.get(1);

// Map each column name to its pixel value
RowData test1 = new RowData();
for (int i = 0; i < values.length; i++) {
    test1.put(header[i], values[i]);
}

BinomialModelPrediction p = predictGisette(test1);
System.out.println("prediction: " + p.labelIndex + "\t"
        + p.classProbabilities[0] + "\t" + p.classProbabilities[1]);

private static BinomialModelPrediction predictGisette(RowData row) throws PredictException {
    return gisetteModel.predictBinomial(row);
}

Setting up a Hadoop Cluster on DigitalOcean using Cloudera Manager

Now that we have learned how to use H2O, let's redo the above steps and generate the model by running H2O on a Hadoop cluster. Although the problem we are working on does not fall into the big data space, the same steps apply to much larger problems.

Setting up the servers on DigitalOcean

The Ubuntu instance created in the previous part of this tutorial (i.e. ch-master) will be used as the namenode in the Hadoop cluster.

Now, let's create two additional Ubuntu instances, which we will call ch-slave-1 and ch-slave-2, as shown below. The slave nodes will be used as datanodes in the Hadoop cluster.

Using a DNS service, assign hostnames to the servers created above. For example:

ch-master.logp.info 192.241.188.20
ch-slave-1.logp.info 162.243.81.188
ch-slave-2.logp.info 162.243.64.216

Configuring the Instances

Log in to each of the servers and run the following commands:

sudo apt-get update
sudo apt-get install ntp

Sudo without password

Now, we want these servers to run the sudo command without a password, so on each of the servers, run the following command:

sudo visudo

Modify the following line from:

root ALL=(ALL:ALL) ALL

To

root ALL=(ALL:ALL) NOPASSWD: ALL

Modifying /etc/hosts

Modify /etc/hosts in each of the instances

sudo vi /etc/hosts

remove the following line

127.0.1.1 ch-(node_name) (node_name)

and add

192.241.188.20 ch-master.logp.info ch-master
162.243.20.186 ch-slave-1.logp.info ch-slave-1
162.243.90.152 ch-slave-2.logp.info ch-slave-2

Install Hadoop/Yarn/Hive on the Cluster

In your home directory run the following:

wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

Now, you are ready to start the installation process of Hadoop using Cloudera Manager. Simply browse to the following URL and login:

ch-master.logp.info:7180
username: admin password: admin

Follow the steps in this short video to install Hadoop, Yarn and Hive.

The Hadoop cluster is now set up. Browse to the following URL to check Hadoop jobs:

http://ch-master.logp.info:8088/cluster

As you can see, only 6.57 GB of memory is assigned to Hadoop. At this point, it is important to familiarize yourself with Cloudera Manager to configure your Hadoop cluster. In particular, we want to take advantage of the full memory capacity of our datanodes, so in the Yarn configuration we will increase the container memory to 7.5 GB as shown below.

Cloudera Manager Yarn configuration: Container Memory and Maximum Container Memory

Restart Hadoop from Cloudera Manager for the configuration to take effect.

Now check again that the memory assigned to your Hadoop cluster has increased.

Hadoop web interface.

Running H2O.ai on Hadoop

Set up HDFS folders

sudo -u hdfs hadoop fs -mkdir /user/root
sudo -u hdfs hadoop fs -chown root /user/root

Copy training data to HDFS

./setup_data_to_score.sh

Start H2O on Hadoop

cd /root/h2o-3.6.0.8-cdh5.4.2
hadoop jar h2odriver.jar -nodes 2 -mapperXmx 3g -output output1

Notice the message displayed when executing the above command: "Open H2O Flow in your web browser: <ip address>". In this particular case, the IP address was that of ch-slave-2. At this point, it is important that you modify the R script (script.R) to point to this H2O instance.

localH2O <- h2o.init(ip = "ch-slave-2.logp.info", port = 54321, startH2O = FALSE)
Running H2O on Hadoop

You can log in to H2O's web interface at:

ch-slave-2.logp.info:54321
H2O web interface

Running R script on Hadoop

Now, you are ready to run the R script to generate the model:

Rscript script.R
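
script.R follows the same steps as scriptLocal.R; the main differences are that it attaches to the H2O cloud already running on the cluster (as configured above) and can read its training data directly from HDFS. Roughly (a sketch; the HDFS path is hypothetical and should match wherever setup_data_to_score.sh placed the data):

library(h2o)
# Attach to the existing H2O cloud instead of starting a local one
localH2O <- h2o.init(ip = "ch-slave-2.logp.info", port = 54321, startH2O = FALSE)
# Import the training data straight from HDFS (hypothetical path)
gisette <- h2o.importFile("hdfs://ch-master.logp.info/user/root/gisette_train/")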

Browse to H2O’s web interface: ch-slave-2.logp.info:54321 and click on Admin->Water Meter to view the CPU activities of each of the datanodes (i.e. slave servers). Both are busy training the machine learning model.

H2O web interface showing datanodes CPU activities (Water Meter)

Similar to the first part of this tutorial, copy the model by running

./copy_model.sh

and run a prediction test using

./run_pojo.sh

Run Hive

Hive is an SQL-like interface to data stored in HDFS. In this section, I'll show you how to use Hive to run predictions on large input data sets stored in HDFS.

Access the Hive console by typing: hive

Run the following commands in Hive to create a scoring user-defined function (UDF). The UDF can be used as a query function in a Hive SQL.

ADD JAR ./localjars/h2o-genmodel.jar;
ADD JAR ./target/ScoreGisetteData-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION scoredata AS 'ai.h2o.hive.udf.ScoreDataUDF';

Now, we want to create an external table over the input data (the full code is in hive_temp_table.sql).

DROP TABLE gisette_table;  -- if needed
CREATE EXTERNAL TABLE gisette_table (C1 INT, C2 INT, C3 INT, ..., C5000 INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
  STORED AS TEXTFILE
  LOCATION '/user/root/GisetteScoreTest';

You can use Hive to score a large amount of data in a table, as shown below (the full code is in hive_score.sql):

SELECT scoredata(C1,C2,C3,C4,C5,C6,C7,...,C5000) FROM gisette_table LIMIT 5;
Scoring data stored in a Hive table

Thank you…
