Data Scientist in a Box: H2O Driverless AI

Published in Systems AI, Jun 7, 2019 (4 min read)

Over the past few years, machine learning has become an important part of many organizations. There is a gap between the relatively small number of data scientists and the growing demand for data-driven work, which has led to increasing interest in automated machine learning software. To be clear: I am not talking about replacing data scientists, I am talking about making them more efficient.

Automated machine learning is here to automate certain tasks and build the models for you. H2O Driverless AI is designed to help both data scientists and non-data scientists work on their projects much faster by combining automation with state-of-the-art algorithms running on powerful GPUs to accelerate the whole process. It automates several time-consuming aspects of a typical data science workflow: data visualization, feature engineering, predictive modeling, model interpretability, and automatic report creation.

But with all that said, how good is it?

To test H2O Driverless AI’s capabilities, I began looking for an interesting data science project to replicate within the software to get a sense of model training times and performance. I came across a project involving the development of a model for predicting a patient’s length of stay (LOS) at a hospital. The motivation behind the project stemmed from the extremely high cost of U.S. hospital stays ($377.5 billion per year). Being able to predict a patient's LOS can help hospitals optimize treatment plans to shorten the typical stay (and save money) and assist them with logistics in general (better bed allocation planning, etc.). This project made me wonder whether H2O Driverless AI could be used to build a similar model in less time, with acceptable accuracy and less effort.

To give some brief background on the project, the author did an amazing job preparing the data. He used the MIMIC database, a freely available critical care dataset developed by the MIT Lab for Computational Physiology. It has over 25 different datasets and health records associated with around 40,000 critical care patients. A few steps are required to gain access to the dataset, and you can find more information here. By combining 4 different datasets and preprocessing the data, he ended up with a final dataset of 10,621 observations and 52 columns (features). The target column of the dataset is the length of stay, which is the difference between the admission and discharge times, calculated in days.
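The target calculation described above can be sketched in a few lines of Python. This is only an illustration: the field names `admit_time` and `discharge_time`, the timestamp format, and the two records are hypothetical and are not taken from MIMIC or from the original project's code.

```python
from datetime import datetime

def length_of_stay_days(admit_time: str, discharge_time: str) -> float:
    """Return the stay length in fractional days between two timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format
    admitted = datetime.strptime(admit_time, fmt)
    discharged = datetime.strptime(discharge_time, fmt)
    # Convert the time difference from seconds to days.
    return (discharged - admitted).total_seconds() / 86400

# Hypothetical admission records, made up for illustration.
admissions = [
    {"admit_time": "2019-01-01 08:00:00", "discharge_time": "2019-01-04 20:00:00"},
    {"admit_time": "2019-02-10 12:00:00", "discharge_time": "2019-02-11 00:00:00"},
]

los = [length_of_stay_days(a["admit_time"], a["discharge_time"]) for a in admissions]
print(los)  # [3.5, 0.5]
```

Keeping the result as a fractional number of days, rather than rounding to whole days, preserves short stays as distinct values for a regression target.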

With the dataset ready, the author used the Scikit-learn library to develop and train several models. As a first step, he compared Stochastic Gradient Descent regression, Gradient Boosting regression, Linear Regression, K-Nearest Neighbours, and Random Forest using the R2 metric. Gradient Boosting regression achieved the highest R2 score. As a second step, he applied a grid search to the Gradient Boosting regression to find the optimum parameters. In the end, he achieved an RMSE of 9.83. And here’s my question again:
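A rough sketch of that two-step procedure with Scikit-learn might look like the following. It is not the original author's code: the data is synthetic (generated with `make_regression`), and the train/test split, the model settings, and the parameter grid are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data; the real project used the prepared MIMIC dataset.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 1: fit each candidate model and compare R2 on the held-out split.
candidates = {
    "SGD": SGDRegressor(random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
    "LinearRegression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
r2_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    r2_scores[name] = model.score(X_test, y_test)  # score() is R2 for regressors

# Step 2: grid search over a small, illustrative parameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)

print(r2_scores)
print(search.best_params_)
```

The model ranking on synthetic data says nothing about the ranking on the hospital dataset; the sketch only shows the shape of the workflow.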

Can H2O do better than that?

I followed the same data preparation procedure and ended up with exactly the same dataset, so the comparison is fair. I imported the dataset into the Driverless AI software and ran an experiment (training) using the default parameters. The experiment was run on an IBM Power Accelerated Compute Server (AC922) with 4 Nvidia V100 GPUs.

The final H2O Driverless AI model was a Gradient Boosting Machine, the same type of model as in the other article.

Comparison with the article’s model, average, and median

The Root Mean Square Error (RMSE) was used as the metric for comparing the two final models, alongside the Average and Median reference values. Here are the results:

RMSE: H2O Driverless AI model = 9.02, Gradient Boosting model (previous article’s model) = 9.83, Average = 12.55, Median = 13.07

The graph above compares the automatically developed H2O Driverless AI model (a gradient boosting machine) with the author’s model (also a gradient boosting machine), the Average, and the Median.
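For readers unfamiliar with the metric, here is a minimal sketch of RMSE and of how Average and Median reference values could be computed. It assumes those two figures are the RMSE of always predicting the average or the median LOS, which the text does not spell out; the numbers below are made up and are not MIMIC data.

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical lengths of stay in days.
train_los = [2.0, 3.0, 4.0, 5.0, 21.0]
test_los = [2.0, 6.0, 16.0]

# Naive baselines: predict one constant value for every patient.
mean_guess = statistics.mean(train_los)      # 7.0
median_guess = statistics.median(train_los)  # 4.0

rmse_mean = rmse(test_los, [mean_guess] * len(test_los))
rmse_median = rmse(test_los, [median_guess] * len(test_los))

print(round(rmse_mean, 2), round(rmse_median, 2))
```

A trained model is only useful if its RMSE is clearly lower than that of such constant predictions, which is why they make a convenient reference point.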

The H2O Driverless AI model achieved an RMSE roughly 8% lower than that of the model trained in the previous article (9.02 versus 9.83).
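The size of the improvement can be checked from the two rounded RMSE values quoted in the results. The helper name `relative_improvement` is just an illustrative choice.

```python
def relative_improvement(baseline_rmse: float, new_rmse: float) -> float:
    """Fractional reduction in RMSE relative to the baseline model."""
    return (baseline_rmse - new_rmse) / baseline_rmse

# RMSE values quoted in the results: previous article's model vs. Driverless AI.
improvement = relative_improvement(9.83, 9.02)
print(f"{improvement:.1%}")  # 8.2%
```

Computed from the rounded figures, the reduction comes out at about 8.2%, in the same range as the roughly 8% quoted here.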

How does the Driverless AI model achieve better results?

Under the hood, Driverless AI runs an automatic feature engineering process. It employs libraries of algorithms and feature transformations to automatically engineer new, high-value features for the imported dataset, and by taking advantage of the IBM AC922 accelerated server it can do that fast. That was enough to drive the accuracy higher for this specific model.
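To give a feel for what automatic feature engineering means, here is a deliberately simple sketch: it generates pairwise product features and ranks all features by their correlation with the target. This is a generic illustration, not Driverless AI's actual transformations or search strategy; the column names and values are hypothetical, and the target is constructed as a product on purpose so that the engineered feature stands out.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def engineer_features(columns):
    """Return the original columns plus one product feature per column pair."""
    engineered = dict(columns)
    for a, b in combinations(columns, 2):
        engineered[f"{a}*{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
    return engineered

# Hypothetical input columns and a target built as their product.
columns = {"procedures": [1, 2, 3, 4, 5], "severity": [2, 1, 3, 1, 2]}
target = [p * s for p, s in zip(columns["procedures"], columns["severity"])]

features = engineer_features(columns)
# Rank features by absolute correlation with the target, strongest first.
ranked = sorted(
    features, key=lambda name: abs(pearson(features[name], target)), reverse=True
)
print(ranked[0])  # procedures*severity
```

A real system would try many more transformation types and validate each candidate on held-out data rather than rely on a single correlation score.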

This comparison wasn’t made to prove that H2O Driverless AI might be better than humans. It simply shows how automated machine learning can help data scientists and non-data scientists alike by automating certain tasks in their machine learning workflow. Obviously, everyone can build their own models from scratch, and that’s fair enough, but when time becomes highly valuable, you might need to consider other options.

Written by Giorgos Aniftos