H2O’s Automated Machine Learning

Sidra Naseem
Published in DiveDeepAI
6 min read · Mar 4, 2022

What is H2O.ai?

H2O is an open-source platform from the company H2O.ai. It aims to give users and business partners access to machine learning and data insights without requiring in-depth knowledge of deploying and fine-tuning machine learning algorithms. H2O.ai has a large community around its open-source platform, which includes 12,000 large organizations and 129,000 data scientists. Whether you are a data scientist or an organization getting started with artificial intelligence, H2O.ai’s products can help streamline your digital transformation and improve your business efficiency. The products include H2O, Driverless AI, Steam, and Deep Water, among others, and each is designed to give business partners the best experience without requiring them to have specialist expertise.

Figure 1. Products by H2O.ai

H2O Framework

The H2O framework is H2O.ai’s primary offering: an open-source, in-memory, distributed machine learning platform for building and productionizing supervised and unsupervised models, together with supporting features such as Quantiles, Early Stopping, and Word2Vec. It integrates well with the Hadoop ecosystem of large-scale data processing tools, including the Spark processing engine. It also works well with Conda for environment management, quick installation, and package dependency management. A notable feature of H2O is its intelligent parser, which infers the schema and handles data arriving from multiple sources in multiple formats.

Features of H2O framework

The capability suite of H2O framework includes a great number of features such as:

· Clusters: H2O supports parallel computation on clusters to improve efficiency and model scalability. In a cluster, data is held in memory in a compressed columnar format, which makes it easy to read in parallel when it is stored across different nodes.

· Flow: Flow is an interactive user interface for running code snippets and for analyzing and visualizing data, much like a Jupyter notebook. In addition, Flow’s tabular data views let you interact with and modify tables by pointing and clicking. Flow can run on your local host and requires no programming experience, since H2O operations can be performed without writing any code. It also integrates well with the REST API, CoffeeScript, and R scripts.

· AutoML: Automatic Machine Learning is an end-to-end process that applies machine learning algorithms while automating most steps of the machine learning pipeline. It is designed specifically for users from diverse backgrounds who are not machine learning experts, to help them solve complex problems.

H2O’s Automatic Machine Learning

Machine learning has been used to solve a wide variety of problems and has gained considerable popularity among artificial intelligence experts and audiences from many other disciplines. However, the success of an ML project has largely depended on machine learning experts, since they are the ones who can perform the preliminary steps on the data needed to create a machine learning pipeline. The complexity of this process calls for an off-the-shelf solution that lets non-experts carry out their tasks without worrying about the complex underlying details. The AutoML feature of the H2O framework addresses these impediments, and a user needs to perform only a few steps, such as:

Figure 2. User steps

All other stages of a machine learning pipeline, such as feature preprocessing, generating different models, and selecting an ensemble model, are automated with the goal of finding the best model with the minimum number of parameters.
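The "generate several models, score them, keep the best" loop can be sketched in a few lines of plain Python. This is a toy illustration only, not how H2O implements AutoML: the candidate "models" are simple threshold rules, and the function names and validation rows are made up for the example.

```python
from typing import Callable, List, Tuple

Row = Tuple[float, int]  # (feature value, label)


def make_threshold_model(threshold: float) -> Callable[[float], int]:
    """Return a classifier that predicts 1 when the feature exceeds the threshold."""
    return lambda value: 1 if value > threshold else 0


def accuracy(model: Callable[[float], int], rows: List[Row]) -> float:
    """Fraction of rows whose label the model predicts correctly."""
    hits = sum(1 for value, label in rows if model(value) == label)
    return hits / len(rows)


def build_leaderboard(thresholds: List[float], validation: List[Row]) -> List[Tuple[float, float]]:
    """Score every candidate and sort best-first, like a leaderboard."""
    scored = [(t, accuracy(make_threshold_model(t), validation)) for t in thresholds]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up validation rows, for illustration only.
validation_rows = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]
leaderboard = build_leaderboard([0.5, 3.5, 5.5], validation_rows)
best_threshold, best_accuracy = leaderboard[0]
```

A real AutoML run replaces the threshold rules with full algorithms and ensembles, but the selection step follows the same pattern of ranking candidates by a metric.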

Demonstration of AutoML

Since H2O provides a distinctive and varied set of features, it is a driving force behind some of the most efficient and fastest machine learning models. To see how efficient H2O’s AutoML is, let us walk through the following demonstration and learn how to build a model using AutoML.

To keep this demonstration simple, we will build a simple classification model that detects diabetes. The dataset used in this demonstration can be downloaded from here. This demo project is written and run in a Jupyter notebook.

Create a cluster by running the following commands:

import h2o
h2o.init()

The output of the above lines will look like this:

Figure 3. Cluster creation

Once the cluster is created, load the data and then instantiate AutoML.

diabetes_data = h2o.import_file("diabetes.csv")
diabetes_data.head(5)
Figure 4. Data visualization

To understand the data a little better, we use the describe() function, which returns the data types, missing values, and other attribute information for the dataset.

diabetes_data.describe()
Figure 5. Description of data

Note: AutoML treats a numeric target column as a regression problem unless told otherwise. To avoid this, the following command converts the target to an enum (categorical) column, whose values are treated as symbolic class names:

diabetes_data['Outcome'] = diabetes_data['Outcome'].asfactor()

Divide the data into training and test sets using split_frame(), then assign the feature names and the target name to new variables x and y.

diabetes_split = diabetes_data.split_frame(ratios = [0.8])
db_train = diabetes_split[0]
db_test = diabetes_split[1]

x = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
y = 'Outcome'
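One detail worth knowing is that, to our understanding of the H2O documentation, split_frame() splits probabilistically, so ratios=[0.8] yields roughly (not exactly) 80% of the rows for training, and passing a seed makes the split repeatable. The plain-Python sketch below illustrates that idea under those assumptions; probabilistic_split is a hypothetical helper written for this example and is not part of H2O.

```python
import random
from typing import List, Sequence, Tuple


def probabilistic_split(rows: Sequence, train_ratio: float = 0.8, seed: int = 1) -> Tuple[List, List]:
    """Send each row to the training set with probability train_ratio."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_ratio else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in row ids; the size is arbitrary
train_rows, test_rows = probabilistic_split(rows, train_ratio=0.8, seed=1)
```

Because each row is assigned independently, the training share hovers around 80% rather than hitting it exactly, which is why two runs without a fixed seed can produce slightly different results downstream.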

Now that these steps are complete, the dataset can be fed into the AutoML function. As AutoML runs, it builds a leaderboard of all the models it trained, along with their results, showing what worked best.

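from h2o.automl import H2OAutoML

# Train at most 30 models or stop after 300 seconds, whichever comes first
automl = H2OAutoML(max_models=30, max_runtime_secs=300, seed=1)
automl.train(x=x, y=y, training_frame=db_train)

leader = automl.leaderboard
leader.head()
# Show every row of the leaderboard
leader.head(rows=leader.nrows)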

The leaderboard shows that a GBM model gives the best accuracy and the lowest MSE.

Figure 7. Results of best model

Let us make predictions on the test data to check that the model is working correctly.

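# db_test[:-1] selects every column except the last one (Outcome)
predictions = automl.predict(db_test[:-1])
# Fraction of test rows where the predicted label matches the true label
(predictions['predict'] == db_test['Outcome']).as_data_frame(use_pandas=True).mean()
Figure 8. Prediction on test data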
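The accuracy figure comes from comparing predictions with true labels row by row and averaging the matches. A minimal plain-Python sketch of that calculation follows; label_accuracy is a hypothetical helper and the label lists are made up purely to show the arithmetic.

```python
from typing import Sequence


def label_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Mean of element-wise matches between predicted and actual labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("cannot compute accuracy of empty inputs")
    matches = [p == a for p, a in zip(predicted, actual)]
    return sum(matches) / len(matches)


# Made-up labels, for illustration only: 3 of the 5 predictions match.
predicted_labels = [1, 0, 0, 1, 1]
actual_labels = [1, 0, 1, 1, 0]
score = label_accuracy(predicted_labels, actual_labels)
```

Taking the mean of a column of True/False comparisons, as the H2O snippet does, is the same computation: each match counts as 1 and each mismatch as 0.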

The best model achieved 78% accuracy, which is quite good considering that no preprocessing or feature engineering was performed on the dataset. The selected model can be saved using:

h2o.save_model(automl.leader, path="your_directory_path")
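A saved model is only useful if it can be loaded again. The sketch below wraps saving and reloading in one helper; save_and_reload is a hypothetical name introduced for this example, and it assumes a running H2O cluster and a trained model such as automl.leader.

```python
def save_and_reload(model, directory: str):
    """Save an H2O model to `directory`, then load it back from the returned path."""
    import h2o  # imported here so the helper can be defined without a running cluster

    # save_model returns the full path of the file it wrote;
    # force=True overwrites an existing file with the same name
    saved_path = h2o.save_model(model=model, path=directory, force=True)
    restored = h2o.load_model(saved_path)
    return saved_path, restored
```

It could be called as saved_path, restored = save_and_reload(automl.leader, "your_directory_path"). As far as we know, binary models saved this way should be reloaded with the same H2O version that wrote them.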

As a final step, since all of these steps ran on the cluster we created, it is time to release the memory it occupies by running the following command.

h2o.shutdown()
Figure 9. Shutdown

Conclusion

In essence, the main purpose of AutoML is to automate repetitive steps in ML pipeline creation and hyperparameter tuning. For novice users, H2O can be a perfect platform for learning and for solving complex problems through its easy-to-use interface. The H2O framework is also powerful enough for advanced developers, because it provides wrapper functions for modeling-related tasks that would otherwise require a lot of coding.
