H2o Flow: H2o.ai

Introduction to AutoML: Using H2O Flow

Build a Machine learning model without writing a single line of code

Shashvat G
The Startup

Photo by Marvin Meyer on Unsplash

Have you ever wanted to play around with machine learning? Do you wish you could build a machine learning model without having to write any code? Is a lack of programming experience holding you back? Happy days are here: H2O.ai has a solution for you, and it’s called H2O Flow.

It is widely known that the field of data science is highly dynamic, and there is something new in the DS stack every day. The advent of AutoML is certainly a commendable addition to this stack. Not only does it enable people with zero programming background to build an ML model, it also makes the life of data scientists much easier.

So, What is AutoML?

AutoML, short for Automated Machine Learning, aims to automate the training and building of machine learning models, and it is here to stay. AutoML makes machine learning available even to people with no particular background in programming.

Gone are the days when data scientists needed to build a baseline model from scratch (depending upon the use case, of course!). Enter AutoML: just a few clicks, and your baseline is ready.

H2O is licensed under the Apache License, Version 2.0, and has a web user interface called H2O Flow.

Basically, H2O Flow is a web-based interactive tool that covers data selection from various sources, visualization, model building, prediction, evaluation, and exporting your model in one seamless environment. In my opinion, H2O’s core strength is its distributed in-memory processing.

H2O is written in Java and supports algorithms used frequently in data science, for instance GBM, Random Forest, and Stacked Ensembles. H2O works with R, Python, and Scala, on Hadoop/YARN, Spark, or your laptop.

Installation

You can get H2O here. You will probably need Java installed on your system for H2O Flow to work. If you also have Python installed, you can install the H2O package with conda, as shown below:

conda install -c h2oai h2o

You can safely skip the step above if you are only going to use H2O Flow. Notably, H2O is also available for R and Hadoop. Downloading H2O places a zip file on your system that contains everything you need to get started.

Head to your terminal, unzip, and start H2o. Here are the commands that might be helpful:

cd ~/Downloads
unzip h2o-3.30.0.6.zip
cd h2o-3.30.0.6
java -jar h2o.jar

Go to your browser and open H2O Flow; by default it is served at http://localhost:54321.

H2O Flow Home Page
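If the page does not load, it can help to check whether anything is listening on the port at all. The snippet below is a small sketch of such a check, not part of H2O itself. It assumes the default port 54321, and the helper names flow_url and is_listening are made up for this illustration.

```python
import socket


def flow_url(host: str = "localhost", port: int = 54321) -> str:
    """Build the address to open in the browser (54321 is assumed as the default port)."""
    return f"http://{host}:{port}"


def is_listening(host: str = "localhost", port: int = 54321, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable
        return False


if __name__ == "__main__":
    if is_listening():
        print("Something is listening; try opening", flow_url())
    else:
        print("Nothing is listening on port 54321; is java -jar h2o.jar still running?")
```

This only tells you that a process is accepting connections on that port, not that the process is H2O, so treat it as a first sanity check.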

In this post, we won’t discuss the menu icons and the bar at the top, as these are fairly self-explanatory. You can always check the help section available on the right.

We will use the USA Housing dataset, which is available here, to build a simple regression model that predicts house prices from features such as Area Population and Average Area Income. We will follow these steps to build the model:

  • Import Files
  • Split Frame
  • Build Model
  • Prediction

Please note that a machine learning pipeline includes other steps, such as EDA (Exploratory Data Analysis) and pre-processing, which are not covered in this tutorial.

Let’s get right to the routines that will help you build a simple regression model.

1. Import Files

This is where you define your data source and parse your input file. The parser can be left on AUTO or set explicitly to a format such as CSV, XLSX, or Parquet (a columnar file format common in the Hadoop ecosystem that is efficient in terms of storage and performance).

Import files into H2o

Set up your parser as required, see below for reference.

Parser set up

Once it is loaded into H2o, you should be able to view a nice summary of columns in your dataset, the number of columns, rows, and even missing values.

Loaded Data
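For the curious, here is a rough sketch of the kind of bookkeeping behind such a summary: counting rows, columns, and missing cells per column. It is plain Python and not how H2O does it internally. The summarize helper is hypothetical, and the three sample rows are invented for illustration (only the column names come from the dataset description above).

```python
import csv
import io


def summarize(csv_text: str) -> dict:
    """Count rows, columns, and missing (empty) cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row[name]
            # DictReader yields None for cells missing at the end of a short row
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": rows, "columns": len(columns), "missing": missing}


# Invented sample data, only to exercise the function
sample = """Average Area Income,Area Population,Price
60000,30000,1000000
,40000,1500000
70000,,1200000
"""

print(summarize(sample))
```

H2O Flow shows this information for you after parsing, so you never need to compute it yourself.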

H2O also allows you to impute values for missing fields, for example with the column mean. Simply click on a column label; this takes you to that column’s details, where you can perform actions such as Impute and Inspect. Use Inspect to generate plots such as distribution plots and line charts.
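To make mean imputation concrete, here is a minimal sketch in plain Python. It illustrates the idea only and is not H2O’s implementation; the impute_mean helper and the toy population values are assumptions made up for this example.

```python
from statistics import mean
from typing import List, Optional


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Toy column with one gap; the observed values average to 40000.0
population = [30000.0, None, 50000.0, 40000.0]
print(impute_mean(population))
```

Mean imputation keeps the column average unchanged, which is one reason it is a common default, although it also shrinks the column’s variance.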

2. Split Frame

Splitting a frame divides your dataset into a training set and a test set. If you have some background in machine learning, you will know that we hold out test data so that we can check how well the model generalizes. Essentially, we don’t want the model to simply memorize the input data (overfit) and, as a result, perform poorly on unseen data. You can specify the train and test ratio, names for the training and test sets, and a seed so that the same random set of rows is generated every time (leave the seed as it is if you are not sure).

Split train and test data
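The role of the ratio and the seed is easy to see in a few lines of plain Python. This is a sketch of the general technique and H2O’s own splitter may differ in detail; the split_rows helper and the 0.75 ratio are assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(rows: Sequence, train_ratio: float = 0.75, seed: int = 42) -> Tuple[List, List]:
    """Shuffle rows reproducibly, then cut them into train and test parts."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 dataset rows
train, test = split_rows(rows, train_ratio=0.75, seed=42)
print(len(train), len(test))  # 75 25
```

Running the split again with the same seed returns exactly the same rows in each part, which is what makes experiments repeatable.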

3. Build Model

In this step, we select the model we want to build. The available options include Deep Learning, Gradient Boosting Machine, and GLM (Generalized Linear Model), to name a few.

Since our problem here is to predict house prices from the given set of features, we will select the GLM model, set the model parameters, choose the columns to use in training, and, most importantly, set our target variable, i.e. Price. We should also set the training frame to the train set and the validation frame to the test set that we created in the previous step. Optionally, cross-validation can also be enabled as required.

Model building
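If you wonder what a GLM does for a numeric target like Price, the simplest case (Gaussian family, identity link, no regularization) reduces to ordinary least squares. The sketch below fits a line with a single feature in plain Python. It is not H2O’s solver, which handles many features and may apply regularization; the fit_line and predict helpers and the toy numbers are made up so that the answer is easy to check by hand.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept: float, slope: float, x: float) -> float:
    """Apply the fitted line to a new input."""
    return intercept + slope * x


# Toy data constructed so that price = 100000 + 10 * income exactly
income = [40000.0, 50000.0, 60000.0, 70000.0]
price = [500000.0, 600000.0, 700000.0, 800000.0]

intercept, slope = fit_line(income, price)
print(intercept, slope)
print(predict(intercept, slope, 55000.0))
```

In H2O Flow, the same idea is applied across all the columns you select, and the fitted coefficients appear in the model output.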

Note that there are several advanced options for building a model. If you are not sure what to select, you can safely skip that part, scroll to the bottom, and hit Build Model. Once the job finishes, your model is ready.

Trained GLM Model

We can download the model, export it for further integration with other systems or applications, or continue using H2O Flow to make predictions. H2O also allows us to download the model as a POJO (Plain Old Java Object) or as a MOJO (Model Object, Optimized); both formats let you score new data from Java code without a running H2O cluster.

4. Predict

In the previous step, we built and trained the model on features such as average area income and number of bedrooms. Our final step is to make predictions with the trained model.

Predict on Test Data

As soon as you hit Predict, H2O Flow comes up with a list of predictions on the test frame we created in the second step. Here are some of the predictions (the list below is not exhaustive):

Predictions

Evaluating machine learning models is a significant part of the ML pipeline. The evaluation criteria vary with the business requirement or the use case. In regression, common evaluation criteria are Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the R² score. H2O has all of these built in and displays them once you click the action to inspect predictions.
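For reference, these three metrics follow their standard textbook definitions, which fit in a few lines of plain Python. This is a sketch for intuition rather than H2O’s code, and the actual and predicted values are toy numbers invented for the example.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error: average size of the errors, in the target's units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Squared Error: like MAE, but penalizes large errors more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """R squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values: every prediction is off by exactly 10
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(mae(actual, predicted))
print(rmse(actual, predicted))
print(r2(actual, predicted))
```

Because every error here has the same size, MAE and RMSE agree; on real data RMSE is usually larger, since it weights big misses more heavily.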

Conclusion

In this post, we used H2O Flow to create a very simple regression model that predicts house prices from the USA Housing dataset, and we didn’t need to write a single line of code to do it. Clearly, there is much more to regression and H2O Flow than we have covered here, and you can explore more as you dive in further. Stacked Ensembles and GBMs in H2O Flow are definitely worth a try.

I’d love to hear your thoughts about H2o, Machine learning, and Data Science in the comments below.

If you found this useful and know anyone you think would benefit from this, please feel free to send it their way.

Shashvat G
The Startup

Data Scientist | Analyst who aspires to continuously learn and grow in Data Science Space. Find him on LinkedIn https://www.linkedin.com/in/shashvat-gupta/