Introduction to H2O.ai
Amid a plethora of complex machine learning tools, I want to introduce a company looking to democratize artificial intelligence by making machine learning simple and ubiquitous for all.
Artificial intelligence: at this point you know that it is the wave of the future, that it will make you very rich if you add it to your career’s skills list, and that it will eventually kill us all. Since the first two points excite the average person, the next step is usually trying to figure out how to learn more about the field. They typically find an online course or some other introduction, after which the logical progression is to get some hands-on experience. That is when they run into what I like to call the “tooling issue.”
There are so many tools to choose from that it becomes hard to discern which is right for the job. I’m not going to compare and contrast tools, as the ones in Figure 1 represent less than a tenth of those available. What I will do is introduce a company with a mission to simplify and democratize artificial intelligence for all. Whether they ultimately deliver on that is still up in the air, but at the very least they have given us a good starting point for our machine learning efforts.
Who is H2O.ai?
H2O.ai is the company behind open-source machine learning (ML) products like H2O, aimed at making ML easier for all. Their open-source community includes more than 129,000 data scientists and 12,000 organizations. Almost half of the Fortune 500 use their software, with 330% growth in users over the past two years. Analysts have taken note as well: among the 16 vendors included in Gartner’s 2018 Magic Quadrant for Data Science and Machine Learning Platforms, H2O.ai was classified as a Leader with the most completeness of vision. Their business partners include major players like Microsoft, IBM, NVIDIA, Splunk, Databricks, MapR, Anaconda, Cloudera and a few others. So how did they gain such a large following? Mostly by delivering on their promise to make machine learning accessible and to let business users extract insights from data without needing expertise in deploying or tuning machine learning models.
If you’ve ever worked in a large company and had to deal with a slow-moving digital transformation, you’re probably excited about a company aiming to simplify usage of complex software, but also skeptical, as almost every tech company makes that promise and many fail to deliver. Let’s analyze H2O.ai more deeply and see how they deliver on this promise, along with use cases of their products running in production.
H2O.ai states it is making ML accessible by allowing business users to extract insights from data without needing expertise in deploying or tuning machine learning models. How are they doing this? Through the suite of machine learning products discussed below.
- H2O. The primary offering: an open-source, in-memory, distributed ML and predictive analytics platform that lets you build and productionize ML models. It contains supervised and unsupervised algorithms like GLM and K-means clustering, and a simple-to-use web UI called Flow.
- Deep Water. H2O + a tighter integration with TensorFlow, MXNet and Caffe for GPU-based DL workloads.
- Sparkling Water. H2O + a tighter Spark integration for customers to utilize their existing Spark ecosystem with H2O’s ML. We’ll learn more about this later in the use cases section.
- Steam. The company’s enterprise (meaning not free) offering for building and deploying applications. Data scientists can train and deploy ML models, making them accessible over APIs for developers to integrate into applications.
- Driverless AI. A bit of a misnomer, as it has nothing to do with autonomous driving; it is H2O + a simplified wrapper to help an enterprise’s non-technical employees prepare data, calibrate parameters and determine the optimal algorithms for tackling specific business problems with ML. It makes automatic feature engineering, model tuning, selection and ensembles (using multiple learning algorithms to obtain better predictive performance) easy to use for those who don’t even know what those terms mean. Quick video demo.
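To make the ensembling idea above concrete, here is a minimal sketch in plain Python of combining multiple models’ predictions by averaging. The two toy models and their coefficients are invented for illustration; they are not H2O algorithms.

```python
# Minimal sketch of ensembling: combine several models' predictions
# (here by simple averaging) to get a better overall prediction.
# model_a and model_b are toy stand-ins, not H2O algorithms.

def model_a(x):
    return 2.0 * x          # toy model that overshoots

def model_b(x):
    return 1.6 * x + 0.5    # toy model that undershoots

def ensemble(models, x):
    """Average the predictions of all models."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

print(ensemble([model_a, model_b], 10.0))  # average of 20.0 and 16.5 -> 18.25
```

Real stacked ensembles train a meta-learner over the base models’ outputs rather than averaging, but the core intuition is the same: several imperfect models can outperform any single one.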
Now that we have a high-level overview of the products offered by H2O.ai, let’s explore their technical capabilities.
Technical Features & Capabilities
H2O’s products provide an open-source, in-memory, distributed, fast and scalable machine learning and predictive analytics platform. The “fast” comes from the data being distributed across the cluster and stored in memory in a compressed columnar format, allowing it to be read in parallel. H2O’s core code is written in Java and, like many other modern applications, H2O provides a REST API that gives external programs and scripts access to all of the software’s capabilities via JSON over HTTP. The REST API is used by H2O’s web interface (Flow UI), R binding (H2O-R) and Python binding (H2O-Python).
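To illustrate why a compressed columnar layout makes parallel reads cheap, here is a toy sketch (not H2O’s actual storage engine) in which each column is run-length encoded and stored independently, so scanning one column never touches the others:

```python
# Toy columnar store: each column is compressed independently with
# run-length encoding, so a scan of one column reads only that column.
# This is an illustration of the idea, not H2O's implementation.

def rle_encode(values):
    """Run-length encode a column as [[value, count], ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    return [v for v, count in runs for _ in range(count)]

# A tiny "frame" stored column-by-column rather than row-by-row.
frame = {
    "city":  rle_encode(["NY", "NY", "NY", "SF", "SF"]),
    "stars": rle_encode([4, 4, 5, 5, 5]),
}

# Reading one column decompresses only that column; in a cluster,
# different columns (or chunks of them) can be read in parallel.
print(rle_decode(frame["stars"]))  # [4, 4, 5, 5, 5]
```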
Available Machine Learning Algorithms
When it comes to machine learning algorithms, H2O offers a solid set for users to leverage.
- Supervised Learning. Deep Learning (Neural Networks), Distributed Random Forest (DRF), Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Naïve Bayes Classifier, Stacked Ensembles, XGBoost
- Unsupervised Learning. Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA)
- Others. Quantiles, Early Stopping, Word2Vec
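As a concrete example of one item on this list, here is a minimal sketch of early stopping: halt training once the validation metric stops improving for a set number of rounds. The loss values are made up for illustration.

```python
# Sketch of early stopping: stop training when the validation loss
# has not improved for `patience` consecutive epochs.
# The loss sequence below is invented for illustration.

def train_with_early_stopping(val_losses, patience=2):
    best = float("inf")
    rounds_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return epoch, best   # stop early, keep the best loss seen
    return len(val_losses) - 1, best

# Validation loss improves, then plateaus: training halts at epoch 4.
print(train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.64]))  # (4, 0.6)
```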
H2O’s data parser has built-in intelligence to guess the schema of the incoming dataset and supports data ingest from multiple sources in various formats.
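The schema-guessing idea can be sketched in a few lines: sample each column’s values and pick the narrowest type that fits them all. H2O’s real parser is far more sophisticated; this just shows the concept.

```python
# Toy schema inference: for each column, try progressively wider
# types (int, then float, falling back to string) on every value.
# Illustrates the concept only; H2O's parser does much more.

def guess_type(values):
    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if fits(int):
        return "int"
    if fits(float):
        return "float"
    return "string"

rows = [["1", "3.5", "NY"], ["2", "4.0", "SF"]]
columns = list(zip(*rows))  # transpose rows into columns
print([guess_type(col) for col in columns])  # ['int', 'float', 'string']
```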
Big Data & Anaconda Integration
For big data, H2O integrates well with the Hadoop ecosystem of tools, including Spark. The supported Hadoop distributions are Cloudera CDH, Hortonworks HDP, MapR 4.x+ and IBM Open Platform. While H2O integrates with Hadoop and Spark, they are not required to run H2O unless you want to deploy H2O on a Hadoop or Spark cluster.
H2O also integrates with Conda, the open-source package and environment management system used by data scientists, which quickly installs, runs and updates packages and their dependencies. Conda and Anaconda can be used on Anaconda Cloud, where packages, notebooks, projects and environments are shared during collaboration. Conda is NOT required to run H2O unless you want to run H2O on Anaconda Cloud.
For the sake of brevity I’ll skip over some of the technical details, but if you want to learn more about things like the architecture components, JVM management, CPU management or Fluid Vector Frames, follow the links.
Customers and Use Cases
So who is using this technology, and for what use cases? Glad you asked. H2O highlights usage in the financial, insurance and healthcare industries, among others. Examples of specific solutions enabled by H2O:
- Cisco. Used H2O to create a P2B statistical model that predicts whether a given company will buy a certain product within a given time frame. The model outputs the probability that a company will buy and the amount it will spend if it does. Used by sales and fueled by demographics, past purchase behavior, contacts, marketing interactions, customer satisfaction surveys, macroeconomic indicators, and purchase/non-purchase lists.
- Capital One. Their Systems and Operations Group used H2O for the Capital One banking app (5K users a minute, 300K an hour), finding that H2O satisfied their governance requirements, leveraged their existing data science and ML resources (R, Python, Spark, etc.), and is open source, usable and scalable.
- Equifax. Built a product on top of H2O, called Ignite, which is their “revolutionary portfolio of premier data and advanced analytics solutions.”
- Kaiser Permanente. Found that medical/surgical ward patients urgently transferred to the ICU show evidence of physiologic derangement 6–24 hours before the transfer. These patients are less than 5% of all patients in a hospital but account for 25% of all ICU admissions, 20% of in-hospital deaths and 12.5% of overnight patient days. Used H2O to predict these ICU transfers in advance, which reduced mortality rates by 2–5x.
Too high level for you? Want more detail on how a company can utilize H2O to create ML models, gain better insights and serve their customers through data analysis? Sure, that’s up next.
Large Scale ML and Predictions for Travel Services Customer
I now want to provide a more specific use case: a large travel services organization bringing in over $68B in revenue a year. The company wanted to use ML to score hotels and destination recommendations for millions of users, and to select from 1M+ keywords when bidding on ad platforms, ensuring the most effective ad placements. They also wanted to run machine translation jobs to translate hotel and flight descriptions into one of the website’s 43 supported languages, depending on the user’s language preference.
Machine Learning Requirements
Based on analysis of their existing system and previous attempts at deploying machine learning, they wanted a solution that:
- scales well
- is easy to use
- is statistically sound
- is easily moved to production
Their initial attempts included:
- Using SparkML (2015). They found it unstable, not feature-rich and difficult to productionize, with slow predictions.
- Using Sparkling Water (H2O + Spark). Solved most of their problems except scaling, as Sparkling Water was tied to YARN, which broke when scaling their 1,000+ node cluster.
- Using the Google Translate APIs for machine translation. They found Google’s APIs weren’t as accurate as models trained on their own data sets.
The final solution was to work with H2O and contribute orchestration code upstream to Sparkling Water. In doing so they developed an “external cluster mode” for H2O that removed the YARN dependency, letting them scale past 1,000 nodes with no issues.
We’ll now walk through a high level overview of the architecture and model training used in production to make predictions.
Offline: Model Creation and Training (Safe, non-Production)
- An end user interacts with the website, creating a click-stream event: looking at a hotel, selecting a flight, reading reviews, etc.
- For JSON events not requiring a prediction, the data is sent to a Spark cluster where workflows run on top of it, transforming the data for storage and model training.
- Data scientists then create data frames over the data set in Spark and send them to the H2O cluster.
- In H2O, the scientists choose the appropriate algorithms and train and build models. They also have access to a feature store with a web UI where they can discover or reuse features, see which are available online vs. offline, assign ownership of features and enforce quality.
- After training, they export their models in Plain Old Java Object (POJO) or Model Object, Optimized (MOJO) format to the prediction processing system for use in production.
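The value of a POJO/MOJO export is that the trained model becomes a self-contained scoring artifact with no training-time dependencies. This hypothetical Python function mimics the shape of such an exported GLM-style scorer; the feature names and coefficients are invented, and a real POJO/MOJO is generated Java, not Python.

```python
import math

# Hypothetical exported scorer: pure feature-in, prediction-out logic
# with no training-time dependencies. Feature names and coefficients
# are made up for illustration.

def score(features):
    """Toy GLM-style scorer: logistic(w . x + b)."""
    weights = {"price": -0.02, "rating": 0.8, "distance_km": -0.1}
    bias = 0.5
    z = bias + sum(weights[k] * features[k] for k in weights)
    return 1.0 / (1.0 + math.exp(-z))

p = score({"price": 120.0, "rating": 4.5, "distance_km": 2.0})
print(round(p, 3))  # a probability between 0 and 1
```

Because the scorer is just a pure function of its inputs, the prediction service can load and call it without any of the training stack installed.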
Online: Hotel Scoring Example (Real-Time Processing)
- End user creates a click-stream event requiring a prediction when they search for a hotel in NY.
- That JSON event is sent to a Kafka topic.
- Spark takes the event from Kafka and runs transformations on it, similar to data wrangling done offline, outputting the value to another Kafka topic and Cassandra as the persistent back end.
- The stream processing application then makes an API call to the ML model which has an API gateway in front to scale prediction processing requests as needed.
- For the hotel query, ~10,000 hotels are scored for that specific user and returned, where they hopefully click and book the top result.
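The online flow above can be simulated end-to-end in memory. In this sketch, deques stand in for the Kafka topics, a dict stands in for Cassandra, and predict is a stand-in for the API call to the deployed model; all names and data are illustrative.

```python
from collections import deque

# In-memory simulation of the online scoring flow: event topic ->
# transformation -> model API call -> scored topic + persistent store.
raw_events = deque()       # stands in for the raw-event Kafka topic
scored_events = deque()    # stands in for the downstream Kafka topic
store = {}                 # stands in for Cassandra

def transform(event):
    # Data wrangling similar to what the offline pipeline does.
    return {"user": event["user"], "query": event["query"].lower()}

def predict(event):
    # Stand-in for the API call to the deployed model behind a gateway.
    hotels = ["hotel_b", "hotel_c", "hotel_a"]
    return sorted(hotels)  # pretend these come back ranked by model score

raw_events.append({"user": "u1", "query": "Hotels in NY"})

while raw_events:
    event = transform(raw_events.popleft())
    ranking = predict(event)
    scored_events.append((event["user"], ranking))
    store[event["user"]] = ranking  # persist the result

print(store["u1"][0])  # top-ranked hotel returned to the user
```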
While we’ve covered a lot, we have only scratched the surface of the H2O platform’s capabilities. H2O seems well poised to reach its goal of democratizing AI for everyone, as adoption of its products is growing at a rapid pace. If you want to learn more, the keynote and sessions from the 2017 H2O World are a good place to start. I hope this introduction to H2O has been helpful.