Building a machine learning application using H2O.ai

Deploying your first machine learning application from scratch

The objective of this post is to give an overview of training, building and deploying machine learning models using H2O.ai. We are going to train a model and then build an application for serving predictions.

H2O.ai ecosystem

The Data

The dataset contains a set of features that determine the quality of a wine, such as pH, citric acid, sulphates, and alcohol. The data looks like this:

Red wine quality dataset

Our model will be trained to predict the quality of the wine.

Training the model

import h2o

# Start a local H2O cluster and load the red wine quality dataset
h2o.init()
data = h2o.upload_file("data/winequality-red.csv")

We will train a Gradient Boosting Machine (GBM) model. Let's import the H2OGradientBoostingEstimator and define the model parameters.

from h2o.estimators.gbm import H2OGradientBoostingEstimator

params = {
    "ntrees": 500,
    "learn_rate": 0.01,
    "max_depth": 8,
    "min_rows": 5,
    "sample_rate": 0.8,
    "col_sample_rate": 0.8
}

We need to define our training columns and the target column. Next, we will split the data into train and test sets, using 70% of the data for training and 30% for testing our model's performance.

# The target is the wine quality score; every other column is a feature
target = "quality"
train_cols = [x for x in data.col_names if x != target]

# Split the data: 70% for training, 30% for testing
train, test = data.split_frame(ratios=[0.7])

With the previous steps done, we can now train our model. The nfolds=5 parameter specifies the number of folds used for cross-validation.

model = H2OGradientBoostingEstimator(nfolds=5, **params)
model.train(x=train_cols, y=target, training_frame=train)

Once the model is trained, we can use the test data to validate our model's performance. For this, h2o has a function called model_performance, which gives us a summary of evaluation metrics for our model.

model.model_performance(test)

ModelMetricsRegression: gbm
** Reported on test data. **
MSE: 0.38677682335383257
RMSE: 0.6219138391721417
MAE: 0.4452295071238603
RMSLE: 0.09729164095632066
Mean Residual Deviance: 0.38677682335383257

Now we can export our model as a MOJO (we will talk about what this means soon). The MOJO will be a zip file named after our model.

model.download_mojo(path="model/", get_genmodel_jar=True)

What is a MOJO?

H2O-generated MOJO and POJO models are intended to be easily embeddable in any Java environment. The only compilation and runtime dependency for a generated model is the h2o-genmodel.jar file produced as the build output of these packages.

We can use our MOJO to make batch predictions using the PredictCsv class, or real-time predictions with h2o and Spark Streaming, Kafka, or Storm. We can also expose the model as a REST API (which is what we are going to do next).

Building our REST API

I’m building the application using Maven, so we need to add our dependencies into the pom.xml file.

Required dependencies
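The original dependency listing isn't embedded here, but a minimal pom.xml sketch for this setup could look like the following. The artifact versions are illustrative assumptions, and the Jackson dependency is only needed because the endpoint below returns JSON through Javalin's ctx.json(...):

<dependencies>
    <!-- H2O MOJO scoring runtime -->
    <dependency>
        <groupId>ai.h2o</groupId>
        <artifactId>h2o-genmodel</artifactId>
        <version>3.26.0.3</version>
    </dependency>
    <!-- Javalin, the lightweight web framework used for the /predict endpoint -->
    <dependency>
        <groupId>io.javalin</groupId>
        <artifactId>javalin</artifactId>
        <version>3.6.0</version>
    </dependency>
    <!-- Jackson, used by Javalin's ctx.json(...) serialization -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.10.1</version>
    </dependency>
</dependencies>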

Next, let's create a Model class that receives the path to our MOJO as a parameter and then loads the model.

Our Model class has a predict function which receives as input a RowData containing the feature names and their corresponding values. With that, we can call our model and get the prediction value.
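The original class isn't reproduced here, but a minimal sketch using the h2o-genmodel easy-prediction API could look like this (everything other than the h2o-genmodel classes is an assumption about the original code):

import hex.genmodel.MojoModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.RegressionModelPrediction;

public class Model {

    private final EasyPredictModelWrapper model;

    public Model(String mojoPath) throws Exception {
        // Load the MOJO zip from disk and wrap it for row-by-row scoring
        model = new EasyPredictModelWrapper(MojoModel.load(mojoPath));
    }

    public double predict(RowData row) throws Exception {
        // Wine quality is a numeric score, so we use the regression prediction
        RegressionModelPrediction prediction = model.predictRegression(row);
        return prediction.value;
    }
}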

The next step is to create our web app. For this, we are going to create a Javalin app with a /predict endpoint. To keep things simple, we are going to use a GET method and send the parameters as a query string.

The endpoint handler takes the request parameters and puts them into a new RowData called modelParams. We then call our model.predict(modelParams) function to get the prediction results and return them in JSON format.
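Here is a sketch of what such a Javalin app might look like, assuming a Javalin 3-style API; the MOJO file name and the error handling are assumptions, not the original code:

import hex.genmodel.easy.RowData;
import io.javalin.Javalin;
import java.util.HashMap;
import java.util.Map;

public class App {
    public static void main(String[] args) throws Exception {
        // Path to the exported MOJO (file name is a hypothetical placeholder)
        Model model = new Model("model/gbm_wine_quality.zip");

        Javalin app = Javalin.create().start(8080);

        app.get("/predict", ctx -> {
            // Copy every query-string parameter into a RowData.
            // The keys must match the column names used at training time.
            RowData modelParams = new RowData();
            ctx.queryParamMap().forEach((name, values) -> modelParams.put(name, values.get(0)));

            Map<String, String> response = new HashMap<>();
            try {
                response.put("prediction", String.valueOf(model.predict(modelParams)));
                response.put("status", "ok");
            } catch (Exception e) {
                response.put("status", "error");
            }
            ctx.json(response);
        });
    }
}

Running the main method starts the server on port 8080, which matches the curl request below.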

Getting predictions

curl -X GET "http://localhost:8080/predict?fixed_acidity=7.0&volatile_acidity=0.7&citric_acid=0&residual_sugar=1.9&chlorides=0.076&free_sulfur_dioxide=11&total_sulfur_dioxide=34&density=0.9978&pH=3.51&sulphates=0.56&alcohol=9.4"

And our prediction results:

{
"prediction":"5.472786367941547",
"status":"ok"
}

Conclusions

Please feel free to comment or share suggestions.

Written by Matias Aravena Gamboa, Data Engineer and machine learning enthusiast. Valdivia, Chile.

spikelab: Our writings on Data Science, predictive modeling, big data analytics, and more.