Automatic model selection: H2O AutoML

Mohtadi Ben Fraj · Published in All things AI · Jan 19, 2018 · 3 min read

In this post, we will use H2O AutoML for automatic model selection and tuning. It is an easy way to get a well-tuned model with minimal effort spent on model selection and parameter tuning.

We will use the Titanic dataset from Kaggle and apply some feature engineering on the data before using the H2O AutoML.

Load Dataset

# Handle table-like data and matrices
import numpy as np
import pandas as pd
# get titanic & test csv files as a DataFrame
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

Feature Engineering

The first step is to drop the 'Cabin' and 'Ticket' features, since they would require more involved feature engineering, which is beyond the scope of this post.

train.pop('Cabin')
test.pop('Cabin')
train.pop('Ticket')
test.pop('Ticket')

We extract the passenger title from the Name feature and group the titles into 4 categories.

dataset_title = [i.split(',')[1].split('.')[0].strip() for i in train['Name']]
train['Title'] = dataset_title
train['Title'].head()
dataset_title = [i.split(',')[1].split('.')[0].strip() for i in test['Name']]
test['Title'] = dataset_title
test['Title'].head()
# Convert Title to categorical codes: 0 = Master, 1 = Miss/Mrs-type, 2 = Mr, 3 = Rare
train['Title'] = train['Title'].replace(['Lady', 'the Countess', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].map({'Master': '0', 'Miss': '1', 'Ms': '1', 'Mme': '1', 'Mlle': '1', 'Mrs': '1', 'Mr': '2', 'Rare': '3'})
test['Title'] = test['Title'].replace(['Lady', 'the Countess', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
test['Title'] = test['Title'].map({'Master': '0', 'Miss': '1', 'Ms': '1', 'Mme': '1', 'Mlle': '1', 'Mrs': '1', 'Mr': '2', 'Rare': '3'})
train.pop('Name')
test.pop('Name')

Filling missing values

We fill missing values with the mean value for numerical features and the most frequent value for categorical features.

train['Age'] = train['Age'].fillna(train['Age'].mean())
test['Age'] = test['Age'].fillna(test['Age'].mean())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
test['Embarked'] = test['Embarked'].fillna(test['Embarked'].mode()[0])
train['Fare'] = train['Fare'].fillna(train['Fare'].mean())
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())

Mean target encoding

For each feature value, mean target encoding adds a new column containing the average of the target (the survival rate) over the training passengers that share that value.

means = train.groupby('Age').Survived.mean()
train['Age_mean_target'] = train['Age'].map(means)
test['Age_mean_target'] = test['Age'].map(means)
means = train.groupby('Pclass').Survived.mean()
train['PClass_mean_target'] = train['Pclass'].map(means)
test['PClass_mean_target'] = test['Pclass'].map(means)
means = train.groupby('Title').Survived.mean()
train['Title_mean_target'] = train['Title'].map(means)
test['Title_mean_target'] = test['Title'].map(means)
means = train.groupby('Embarked').Survived.mean()
train['Embarked_mean_target'] = train['Embarked'].map(means)
test['Embarked_mean_target'] = test['Embarked'].map(means)
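One thing to watch: mapping train-derived means onto the test set leaves NaN for values that appear only in test (continuous Age values in particular). A minimal sketch that falls back to the overall training survival rate; the fallback choice is an assumption, not part of the original post:

# Values seen only in the test set have no train-derived mean and map to NaN;
# fall back to the global training survival rate for those rows (assumed default).
global_mean = train['Survived'].mean()
for col in ['Age_mean_target', 'PClass_mean_target', 'Title_mean_target', 'Embarked_mean_target']:
    test[col] = test[col].fillna(global_mean)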

Log transformation for Fare

train['Fare'] = train['Fare'].apply(np.log1p)
test['Fare'] = test['Fare'].apply(np.log1p)

Convert numerical feature to categorical

def num2cat(x):
    return str(x)
train['Pclass_cat'] = train['Pclass'].apply(num2cat)
test['Pclass_cat'] = test['Pclass'].apply(num2cat)
train.pop('Pclass')
test.pop('Pclass')

Family size feature

We compute the family size for each passenger (siblings/spouses + parents/children + the passenger themselves).

train['Family'] = train['SibSp'] + train['Parch'] + 1
test['Family'] = test['SibSp'] + test['Parch'] + 1
train.pop('SibSp')
test.pop('SibSp')
train.pop('Parch')
test.pop('Parch')

Getting Dummies from all other categorical features

We apply one-hot encoding to the remaining categorical (object-typed) features.

for col in train.dtypes[train.dtypes == 'object'].index:
    for_dummy = train.pop(col)
    train = pd.concat([train, pd.get_dummies(for_dummy, prefix=col)], axis=1)
for col in test.dtypes[test.dtypes == 'object'].index:
    for_dummy = test.pop(col)
    test = pd.concat([test, pd.get_dummies(for_dummy, prefix=col)], axis=1)
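One caveat worth noting: get_dummies can produce different columns for train and test when a category value appears in only one of the two sets. A minimal sketch to align the test columns with the training features; this alignment step is an assumption, not part of the original post:

# Align test with the training feature columns (everything except the target).
# Categories missing from test become all-zero columns; extra test columns are dropped.
feature_cols = [c for c in train.columns if c != 'Survived']
test = test.reindex(columns=feature_cols, fill_value=0)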

Model selection and tuning

This is the core of this post. We will use H2O AutoML for model selection and tuning.

import h2o
from h2o.automl import H2OAutoML
h2o.init()
(Output: H2O cluster status table)

We load the train and test data into H2O frames and select the training features and the target feature.

htrain = h2o.H2OFrame(train)
htest = h2o.H2OFrame(test)
x = htrain.columns
y = 'Survived'
x.remove(y)
# For classification, the target must be converted to a factor
htrain[y] = htrain[y].asfactor()

For the AutoML function, we just specify how long we want to train for and we’re set. For this example, we will train for 120 seconds.

aml = H2OAutoML(max_runtime_secs=120)
aml.train(x=x, y=y, training_frame=htrain)
lb = aml.leaderboard
print(lb)
print('Generate predictions...')
test_y = aml.leader.predict(htest)
test_y = test_y.as_data_frame()
(Figure: H2O AutoML leaderboard)

In 120 seconds, AutoML trained 14 models, including Gradient Boosting, Extremely Randomized Trees, Random Forest, and Deep Learning models. It also built stacked ensembles of these models to get a better AUC.
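To finish the Kaggle workflow, the leader model's predictions can be written out as a submission file. A minimal sketch, assuming PassengerId is still present in the test data and that the prediction frame uses H2O's default predict column; neither detail is shown in the original post:

# Build a Kaggle submission from the leader model's predictions.
# 'predict' is H2O's default prediction column for classification models.
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': test_y['predict'].astype(int)
})
submission.to_csv('submission.csv', index=False)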

This is very powerful: it saves a lot of time when first deciding on a model and its parameters, and it can point you in the right direction.

If you want to learn more about parameter tuning, you can check these specific parameter tuning posts for Gradient Boosting, Random Forest, SVC and KNN.
