H2O-Automated ML

Karteek Menda
Jul 10, 2021


@credits: H2O.ai

Mission to make AI for everyone

Hello Aliens….

H2O.ai is the open source leader in AI and machine learning, with a mission to democratize AI for everyone. The aim of H2O is to provide a platform that makes it easy for non-experts to experiment with machine learning.

Automating your machine learning workflow reduces the routine work of data scientists. It lets us skip the manual selection of baseline models: we can quickly see which model to go with and, once the model is finalized, carry out further optimization on it to get better results.

AutoML automatically trains and tunes many models within a user-specified time limit. Some key features of H2OAutoML: it handles data pre-processing (encoding) and data cleaning (missing value imputation), and it provides a nice leaderboard view of the models with various metrics, so we can pick any model from it and analyze its output further. It also produces deployment-ready code in multiple formats: MOJO, POJO and binary. Of these, MOJO is the recommended format when the model is large. It can use GPUs for XGBoost models.
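To make the export step a bit more concrete, here is a minimal sketch of saving a trained model as a MOJO. The helper name `export_mojo` and the folder `exported_models` are my own placeholders; `download_mojo` is the H2O model method, and `model` is assumed to be a trained H2O model such as `aml.leader` from the steps further down.

```python
import os


def export_mojo(model, out_dir="exported_models"):
    """Save an H2O model as a MOJO zip and return the path of the saved file.

    `model` is assumed to be a trained H2O model, e.g. `aml.leader`.
    `export_mojo` and `out_dir` are hypothetical names used for illustration.
    """
    os.makedirs(out_dir, exist_ok=True)
    # get_genmodel_jar=True also downloads h2o-genmodel.jar, which is needed
    # to score the MOJO from Java without a running H2O cluster.
    return model.download_mojo(path=out_dir, get_genmodel_jar=True)


# Usage (needs a running H2O cluster and a trained model):
# mojo_path = export_mojo(aml.leader)
```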

Architecture:

The H2O architecture can be divided into layers: the top layer consists of the different APIs, and the bottom layer is the H2O JVM.

H2O’s core code is written in Java, which gives the whole framework multi-threading support. Even though it is written in Java, it provides interfaces for R, Python and a few others shown in the architecture.

In short, H2O is an open source, in-memory, distributed, fast and scalable machine learning and predictive analytics platform that makes building machine learning models easy.

Now let me walk you through a use case that will give you some clarity on how it works.

For this demonstration, I am using an imbalanced dataset, the Bank Churn dataset from Kaggle. You can find it here.

Let's go through the code.

Step-1: To install H2O, you need a Java runtime environment, since H2O is developed in Java. So, install the Java runtime environment first.

!apt-get install default-jre
!java -version

Step-2: Install H2O and import it.

!pip install h2o
import h2o

Step-3: Initialize the H2O cluster.

h2o.init()

The h2o.init() command is pretty smart and does a lot of work. It first looks for an active H2O instance and starts a new one only when none is present. Once the instance is up, we can see H2O Flow running on http://127.0.0.1:54321.
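If you want to check for yourself whether something is already listening on that address before calling h2o.init(), a small standard-library probe is enough. This is only a rough sketch: `flow_is_up` is a name I made up, it is not how h2o.init() is implemented, and it only tells you that some HTTP server answers at the URL, not that the server is H2O.

```python
import http.client
import urllib.request


def flow_is_up(url="http://127.0.0.1:54321", timeout=2.0):
    """Return True if an HTTP server answers at `url` (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and HTTP errors all end up here.
        return False


# Usage:
# print(flow_is_up())  # True only if a server is answering on port 54321
```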

Step-4: Load the data to H2O frame.

from h2o.automl import H2OAutoML
bank_data = h2o.import_file('Churn.csv')

I chose this dataset because it has class imbalance and needs a decent amount of preprocessing, so we can see how AutoML deals with all of this.

Step-5: EDA of the data.

bank_data.describe()
bank_data.types

We could do a full EDA here, but it is not our main focus, so I am skipping it. The dataset has 14 columns, and the target variable is “Exited” (which says whether a customer has churned or not). Features like “RowNumber”, “CustomerId” and “Surname” can be removed. Some features are categorical (Gender, Geography).
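One quick check worth doing on an imbalanced dataset is the share of each class in the target. Below is a minimal pure-Python sketch; `class_shares` is a hypothetical helper and the labels are made up for illustration, not taken from the Churn dataset. On the H2O frame itself, `bank_data["Exited"].table()` should give you the per-class counts.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows belonging to each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only (1 = churned, 0 = stayed).
toy_labels = [0, 0, 0, 0, 1]
print(class_shares(toy_labels))
```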

Step-6: Split the data into train and test sets, with roughly 80% and 20% of the rows respectively.

train, test = bank_data.split_frame(ratios=[.8], seed=1234)
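Do not be surprised if the two frames are not exactly 80% and 20% of the rows. As far as I understand, split_frame assigns rows to the splits probabilistically rather than cutting at an exact row count. The sketch below shows that idea in plain Python; it is not H2O's actual implementation, and `probabilistic_split` is a name I made up.

```python
import random


def probabilistic_split(n_rows, train_ratio=0.8, seed=1234):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        # Each row lands in train with probability train_ratio.
        if rng.random() < train_ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


train_idx, test_idx = probabilistic_split(10_000)
# The sizes are close to 8000 / 2000 but usually not exactly that.
print(len(train_idx), len(test_idx))
```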

Step-7: Select the predictors and the target variable.

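y = 'Exited'
x = bank_data.columns
x.remove(y)
x.remove("RowNumber")
x.remove("CustomerId")
x.remove("Surname")
# "Exited" is 0/1, so H2O may parse it as numeric; make it categorical
# so that AutoML treats this as classification rather than regression.
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()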
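A small caveat with the remove calls: list.remove raises a ValueError on a misspelled name (say, "CustomerID" instead of "CustomerId") and changes the list in place. Here is a sketch of a helper that builds the predictor list without mutating anything and reports every missing name at once. `select_predictors` is a hypothetical name, and the column list is only a subset of the dataset's columns.

```python
def select_predictors(columns, target, drop=()):
    """Return the columns to use as predictors, leaving `columns` untouched."""
    excluded = {target, *drop}
    missing = excluded - set(columns)
    if missing:
        raise KeyError(f"columns not found: {sorted(missing)}")
    return [c for c in columns if c not in excluded]


# A subset of the Bank Churn columns, for illustration.
columns = ["RowNumber", "CustomerId", "Surname", "Geography", "Gender", "Exited"]
x = select_predictors(columns, target="Exited",
                      drop=["RowNumber", "CustomerId", "Surname"])
print(x)  # → ['Geography', 'Gender']
```

With the H2O frame, you would pass `bank_data.columns` as `columns`.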

Step-8: Usage of H2OAutoML.

aml = H2OAutoML(max_models=20, seed = 10, balance_classes=True, exclude_algos = ["StackedEnsemble", "DeepLearning"], verbosity = "info", nfolds=0)

Here I cap the run at 20 models. Since we have class imbalance, setting balance_classes to True oversamples the minority class to balance the data. I don’t want algorithms like Stacked Ensemble and Deep Learning, so I exclude them. I set the number of cross-validation folds to 0 to keep it simple, but you can tweak this for better results; by default, “nfolds” is set to 5. We can also set a time box (max_runtime_secs) so that the run does not go beyond the specified time.
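Here is a sketch of how the same settings could be assembled with a time box instead of (or in addition to) a model cap. `automl_settings` is a hypothetical helper of mine, and insisting on at least one budget is my own design choice, not an H2O requirement; `max_models` and `max_runtime_secs` are the real H2OAutoML parameters.

```python
def automl_settings(max_models=None, max_runtime_secs=None, seed=10):
    """Build keyword arguments for H2OAutoML with an explicit budget."""
    if max_models is None and max_runtime_secs is None:
        # My own rule for this sketch: always state a budget explicitly.
        raise ValueError("set max_models, max_runtime_secs, or both")
    settings = {
        "seed": seed,
        "nfolds": 0,
        "balance_classes": True,
        "exclude_algos": ["StackedEnsemble", "DeepLearning"],
    }
    if max_models is not None:
        settings["max_models"] = max_models
    if max_runtime_secs is not None:
        settings["max_runtime_secs"] = max_runtime_secs
    return settings


# Usage (needs h2o installed and a running cluster):
# from h2o.automl import H2OAutoML
# aml = H2OAutoML(**automl_settings(max_runtime_secs=300))  # 5-minute time box
```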

Step-9: Call the train function by passing input features and the output column and training data frame.

aml.train(x=x, y=y, training_frame=train)

It will train up to 20 different models, and the leaderboard gets updated each time.

Step-10: Check the leaderboard to see the top performing models.

lb = aml.leaderboard
lb.head()

In this case, GBM tops the list, followed by XGBoost.

Step-11: Use the “leader” and predict on test dataset.

prediction = aml.leader.predict(test)

You can see the predicted class along with the probabilities of a customer churning or not.

Step-12: Generate a performance report.

aml.leader.model_performance(test)

This report shows the metrics of the model.

Step-13: Out of all the models, let's take a particular one from the leaderboard and analyze it further. Here I want to look at the XGBoost model.

model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
xgb=h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
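Note that indexing with [0] fails with an IndexError when no XGBoost model made it to the leaderboard (for example, if XGBoost is not available on your platform). A slightly safer lookup could look like the sketch below. `first_model_id` is a hypothetical helper, and the model ids are made up to resemble leaderboard ids.

```python
def first_model_id(model_ids, algo):
    """Return the first id containing `algo` (case-insensitive), or None."""
    return next((mid for mid in model_ids if algo.lower() in mid.lower()), None)


# Made-up ids in leaderboard order, for illustration only.
example_ids = [
    "GBM_1_AutoML_example",
    "XGBoost_3_AutoML_example",
    "XGBoost_1_AutoML_example",
]
print(first_model_id(example_ids, "XGBoost"))  # → XGBoost_3_AutoML_example
print(first_model_id(example_ids, "DRF"))      # → None

# Usage with the real leaderboard:
# xgb_id = first_model_id(model_ids, "XGBoost")
# xgb = h2o.get_model(xgb_id) if xgb_id is not None else None
```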

Step-14: Let's see the output, which gives us the model details and the metrics.

xgb

Step-15: Plot the variable importance, which gives us the list of the most significant variables. The top variables contribute more to the model than the bottom ones and also have high predictive power in classifying churn and no-churn customers.

xgb.varimp_plot()
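If you prefer numbers over a plot, `xgb.varimp()` returns the same information as rows. The sketch below assumes each row is laid out as (variable, relative_importance, scaled_importance, percentage); `top_variables` is a hypothetical helper, and the rows are made-up values, not results from this model.

```python
def top_variables(varimp_rows, k=5):
    """Return the k variables with the largest share of total importance."""
    ranked = sorted(varimp_rows, key=lambda row: row[3], reverse=True)
    return [(row[0], row[3]) for row in ranked[:k]]


# Made-up rows in the assumed layout:
# (variable, relative_importance, scaled_importance, percentage)
example_rows = [
    ("NumOfProducts", 60.0, 0.6, 0.3),
    ("Age", 100.0, 1.0, 0.5),
    ("Balance", 40.0, 0.4, 0.2),
]
print(top_variables(example_rows, k=2))  # → [('Age', 0.5), ('NumOfProducts', 0.3)]

# Usage with the real model:
# print(top_variables(xgb.varimp(), k=5))
```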

Conclusion:

H2O provides an easy-to-use open source platform for applying different ML algorithms to a given dataset. During testing, you can fine-tune the parameters of these algorithms. H2O supports AutoML, which ranks several algorithms based on their performance. It can also handle big data. This is definitely a boon for data scientists, who can apply different machine learning models to their dataset and pick the one that best meets their needs.

Happy Learning…………

Thanks for reading the article! If you liked it, do give it a 👏. If you want to connect with me on LinkedIn, please click here.

I will try to explain H2O Flow in my upcoming articles.

This is Karteek Menda.

Signing Off


Karteek Menda

Robotics GRAD Student at ASU, Project Engineer in Dynamic Systems and Control Lab at ASU, Ex - Machine Learning Engineer, Machine Learning Blogger.