Gentle Introduction to AutoML from H2O.ai
In recent years, the demand for data science skills has outpaced the supply. As artificial intelligence penetrates every corner of the industry, it's hard to place data scientists in every possible use case.
To bridge this gap, companies have started building frameworks that automatically process the dataset and build a baseline model. We see many of these implementations going open-source. According to one of the industry leaders, H2O.ai:
"AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained."
According to Google Trends, interest in AutoML began to rise in Q2 2017.
AutoML is a function in H2O that automates the process of building a large number of models, with the goal of finding the “best” model without any prior knowledge. In this article, we will look into AutoML from H2O.ai.
The implementation is available through both the R and Python APIs, and the current version of AutoML (in H2O 3.20) does the following:
- Trains and cross-validates a default Random Forest (DRF), an Extremely Randomized Forest (XRT), a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a fixed grid of GLMs.
- Then trains two Stacked Ensemble models: the first contains all the models, and the second contains just the best-performing model from each algorithm class.
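To make the difference between the two ensembles concrete, here is a minimal pure-Python sketch of how the "best of family" selection could work. It is not H2O's internal code: the model ids, family names and AUC values are made up for illustration, and best_of_family is a hypothetical helper.

```python
# Toy leaderboard rows: (model_id, algorithm family, cross-validated AUC).
# All ids and scores below are invented for illustration only.
leaderboard = [
    ("GBM_grid_model_1", "GBM", 0.771),
    ("GBM_grid_model_2", "GBM", 0.765),
    ("DRF_model", "DRF", 0.758),
    ("XRT_model", "XRT", 0.752),
    ("GLM_grid_model_1", "GLM", 0.749),
    ("DeepLearning_model_1", "DeepLearning", 0.740),
]


def best_of_family(rows):
    """Return {family: model_id} keeping only the highest-AUC model per family."""
    best = {}
    for model_id, family, auc in rows:
        if family not in best or auc > best[family][1]:
            best[family] = (model_id, auc)
    return {family: model_id for family, (model_id, _auc) in best.items()}


# Base models for the first ensemble: every trained model.
all_models = [model_id for model_id, _family, _auc in leaderboard]

# Base models for the second ensemble: one winner per algorithm class.
family_best = best_of_family(leaderboard)

print(len(all_models), "models in the 'all models' ensemble")
print(sorted(family_best.values()))
```

The second ensemble has fewer base models, which generally makes it lighter to score with than the one built on every model.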
The installation procedure is quite simple. All you need to do is install the following dependencies and then install H2O itself with pip:
pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
If you already have Anaconda installed, you can install H2O directly with conda:
conda install -c h2oai h2o
Note: When installing H2O from pip on OS X El Capitan, users must include the --user flag. For example:
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o --user
For R and Hadoop installation, please refer to the official H2O documentation.
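Before starting H2O, it can help to confirm that the Python packages are actually importable in the current environment. The sketch below uses only the standard library; missing_packages is a hypothetical helper, and the package list simply mirrors the pip commands in this section.

```python
from importlib.util import find_spec

# Top-level module names for the packages installed in this section.
REQUIRED = ["requests", "tabulate", "colorama", "future", "h2o"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name that is reported as missing can be installed with the corresponding pip command before moving on.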
Start the H2O instance by first importing h2o and the H2OAutoML class:
import h2o
from h2o.automl import H2OAutoML
If the setup was successful, calling h2o.init() starts (or connects to) a local H2O cluster and prints its cluster information.
In this example, we are going to use a dataset from the DataHack practice problem Loan Prediction III.
The goal here is to predict whether or not a loan will be paid by the customer, given details such as Gender, Marital Status, Education, and others.
First, let’s import the training set and check out .head() and the datatypes of the data frame.
df = h2o.import_file('train_u6lujuX_CVtuZ9i.csv')
df.head()
Let’s check the datatypes next.
In this example, the datatype of our target variable, Loan_Status, is enum. If it is read as int instead, you must change its datatype to enum using the following command, where target holds the name of the target column (here, "Loan_Status"):
df[target] = df[target].asfactor()
Note: Failing to do so makes AutoML treat this as a regression problem, which is a costly mistake if you are running models for 10+ hours.
So be careful there. I wonder whether the H2O.ai developers could handle this conversion automatically in the backend.
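A small guard along these lines can make the check explicit before training. This is only a sketch, not part of H2O: ensure_categorical_target is a hypothetical helper name, and it assumes the frame exposes a types mapping (as H2OFrame does) in which categorical columns are reported as "enum".

```python
def ensure_categorical_target(frame, target):
    """Convert frame[target] to a categorical (enum) column if it is not one yet.

    `frame` is expected to behave like an H2OFrame: `frame.types` maps column
    names to type strings, and `frame[target].asfactor()` returns the column
    converted to a categorical.
    """
    if frame.types[target] != "enum":
        frame[target] = frame[target].asfactor()
    return frame


# Intended usage with the frame from this tutorial (requires a running H2O cluster):
# df = ensure_categorical_target(df, "Loan_Status")
```

Calling it on a column that is already enum leaves the frame untouched, so it is safe to run unconditionally before handing the frame to AutoML.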
Now we have to separate the features and the target variable. AutoML takes the features and the target as the x and y arguments:
y = "Loan_Status"
x = ['Gender','Married','Education','ApplicantIncome',
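Rather than typing every column name by hand, the feature list can also be derived from the frame's column names. In the sketch below, feature_columns is a hypothetical helper, the plain Python list stands in for df.columns, and excluding Loan_ID assumes the dataset carries such an identifier column.

```python
def feature_columns(columns, target, exclude=()):
    """Return every column name except the target and any explicitly excluded ones."""
    skip = {target, *exclude}
    return [name for name in columns if name not in skip]


# Stand-in for df.columns; with an H2OFrame this would be:
#   x = feature_columns(df.columns, y, exclude=["Loan_ID"])
columns = ["Loan_ID", "Gender", "Married", "Education", "ApplicantIncome", "Loan_Status"]

x = feature_columns(columns, target="Loan_Status", exclude=["Loan_ID"])
print(x)
```

Dropping identifier-like columns matters because they carry no predictive signal and can encourage overfitting.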
Great! Now we are ready to fire up AutoML:
aml = H2OAutoML(max_models = 30, max_runtime_secs=300, seed = 1)
aml.train(x = x, y = y, training_frame = df)
You can configure values for max_runtime_secs and/or max_models to set explicit time or number-of-model limits on your run. AutoML trains models within the limits provided. For this tutorial, we are training on only a few features and for about 2 minutes, well within the 300-second cap set above.
Once the models are trained, you can access the leaderboard. The leader model is stored at aml.leader and the leaderboard is stored at aml.leaderboard. The leaderboard stores a snapshot of the top models. The top models are usually the stacked ensembles, as they can easily outperform a single trained model. To view the entire leaderboard, specify the rows argument of the head() method as the total number of rows:
lb = aml.leaderboard
lb.head(rows=lb.nrows) # Entire leaderboard
The best model scored 0.77431 AUC. That’s a great score given that we have not done preprocessing or model tuning of any sort!
Prediction and Saving the model
You can use the leader model to make predictions. Assuming test is an H2OFrame containing the test set, imported the same way as the training set, this can be done with the following command:
preds = aml.predict(test)
The next step is to save the trained model. There are two ways to save the leader model: binary format and MOJO format. If you’re taking your leader model to production, the MOJO format is recommended since it’s optimized for production use. The command below saves the leader in binary format:
h2o.save_model(aml.leader, path = "./Loan_Pred_Model_III_shaz13")
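Since the command above writes the binary format, a MOJO export might look like the sketch below. export_leader is a hypothetical wrapper; it relies on the model's download_mojo method and on h2o.save_model, and it assumes the leader's algorithm supports MOJO export in your H2O version.

```python
import os


def export_leader(model, directory, fmt="mojo"):
    """Save `model` into `directory` as a MOJO or a binary model; return the file path."""
    os.makedirs(directory, exist_ok=True)
    if fmt == "mojo":
        # download_mojo writes a .zip MOJO file and returns its path.
        return model.download_mojo(path=directory)
    if fmt == "binary":
        import h2o  # imported lazily so the MOJO path has no extra requirements

        return h2o.save_model(model=model, path=directory, force=True)
    raise ValueError("fmt must be 'mojo' or 'binary', got %r" % (fmt,))


# Intended usage after training (requires a running H2O cluster):
# mojo_path = export_leader(aml.leader, "./Loan_Pred_Model_III_shaz13", fmt="mojo")
```

A binary model can only be loaded back into a compatible H2O cluster, whereas a MOJO can be scored from Java without one, which is why it is the usual choice for deployment.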
Our take on AutoML
AutoML is here to stay. I am eager to see where it goes and how it advances data science. A single automated mixer certainly cannot outperform a creative human mind when it comes to feature engineering, but in my experience, AutoML is worth exploring.
Although AutoML alone won’t get you the top spot in machine learning competitions, it is definitely worth considering as an addition alongside your blended and stacked models. In recent competitions, the AutoML model boosted my score considerably, which led me to explore and concentrate on the blending part. I highly recommend checking out H2O.ai’s AutoML. And do let me know what you think about it and your experiences with other automated modelling functions.