Automate the building of machine learning models using AutoML (H2O) in Python and R, so easy!

Mochamad Kautzar Ichramsyah
Published in CodeX, Dec 9, 2023

About the H2O package

H2O is open-source software for data analysis that provides a platform for building machine learning models. It is designed to be scalable, fast, and easy to use, and it supports the most widely used machine learning algorithms. H2O can be used for classification, regression, clustering, and other tasks.

For further explanation about H2O, you can visit the website here.

H2O logo

Key features of the H2O package

  1. Scalability
    H2O is designed to scale horizontally, allowing users to efficiently handle large datasets and models. It can be deployed on a single machine or in a distributed computing environment, such as Apache Hadoop or Spark.
  2. Ease of Use
    H2O provides a user-friendly interface for building and deploying machine learning models. It supports multiple programming languages, including Python, R, and others.
  3. Machine Learning Algorithms
    H2O includes a variety of machine learning algorithms, such as generalized linear models, deep learning, gradient boosting machines, random forests, k-means clustering, and more.
  4. AutoML
    H2O’s AutoML functionality automates the machine learning model-building process. It can automatically train and tune various models, allowing users to find the best-performing model for their specific task without extensive manual tuning.
  5. Data Exploration and Visualization
    H2O provides tools for exploring and visualizing data, making it easier for users to understand the characteristics of their datasets.
  6. Integration
    H2O can be integrated with popular data science tools and platforms, including Jupyter Notebooks, RStudio, and more.

Of all the key features explained above, I am most interested in AutoML, because automating the machine learning model-building process would make the job much easier.

Using Google Colab

In this post, I will use Google Colab, which can run both Python and R. If you don’t know how to change the runtime type, go to Runtime > Change runtime type, then pick the one you prefer. At the end of the post, I will share my Google Colab notebooks with the Python and R code used in this post.

Change runtime type in Google Colab

H2O AutoML in Python

First of all, we need to install the H2O package.

# Installing the H2O package 
!pip install h2o

Then, we need to import the package, especially the AutoML module, and initialize H2O.

# Importing necessary packages 
import h2o
from h2o.automl import H2OAutoML

# Initialize H2O
h2o.init()

Some steps we need to do before using the AutoML:

  1. We will use the Breast Cancer dataset from the sklearn package.
  2. We split the dataset into train and test.
  3. We must convert the dataset into H2OFrame, so we can automate the machine learning model-building using AutoML.
# Import the Breast Cancer dataset from sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset
data = load_breast_cancer()
x = data.data
y = data.target
df = pd.DataFrame(x, columns = data.feature_names)

# Split the dataset into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 13)

# Convert x_train and y_train to pandas DataFrames
x_train_df = pd.DataFrame(x_train, columns = data.feature_names)
y_train_df = pd.DataFrame({'target': y_train})

# Merge x_train and y_train into a single DataFrame
train_df = pd.concat([x_train_df, y_train_df], axis = 1)

# Convert x_test and y_test to pandas DataFrames
x_test_df = pd.DataFrame(x_test, columns = data.feature_names)
y_test_df = pd.DataFrame({'target': y_test})

# Merge x_test and y_test into a single DataFrame
test_df = pd.concat([x_test_df, y_test_df], axis = 1)

# Convert the pandas DataFrame to an H2OFrame
breast_cancer_h2o_train = h2o.H2OFrame(train_df)
breast_cancer_h2o_test = h2o.H2OFrame(test_df)

The last few steps before we use the AutoML:

  1. We know that the Breast Cancer dataset has a target with two possible outcomes, 0 and 1, which means we must convert it into a factor.
  2. We need to assign the independent variables and the dependent variable to x and y.
  3. Finally, we can use AutoML to automate the machine-learning model-building, as sketched below.
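A minimal sketch of those three steps could look like the code below; the max_models and seed values are illustrative assumptions, and the AutoML object is named model so it matches the later cells.

# Convert the target column into a factor (categorical) for binary classification
breast_cancer_h2o_train['target'] = breast_cancer_h2o_train['target'].asfactor()
breast_cancer_h2o_test['target'] = breast_cancer_h2o_test['target'].asfactor()

# Assign the independent variables (x) and the dependent variable (y)
y = 'target'
x = [col for col in breast_cancer_h2o_train.columns if col != y]

# Run AutoML to automatically train and tune multiple models
model = H2OAutoML(max_models = 20, seed = 13)
model.train(x = x, y = y, training_frame = breast_cancer_h2o_train)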
Python AutoML output (1)

The AutoML output (1) above shows us the best-performing model AutoML created, compared to the other models, including the model summary and model metrics. In this case, the chosen model:

  1. Uses a Gradient Boosting Machine (GBM) with the model ID GBM_4_AutoML_1_20231207_133923.
  2. AutoML performs its own internal split and cross-validation of the training data, but we will ignore that, because we already split the dataset into train and test manually so we can verify the model's performance on held-out data.
Python AutoML output (2)

The AutoML output (2) above shows us the confusion matrix for the chosen model, including the maximum metrics at their respective thresholds. We can see that it has 100% accuracy in predicting the target feature values of 0 and 1.
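If you want to reproduce parts of this output outside the AutoML summary, the leader model exposes the same information; a minimal sketch, assuming a binomial leader such as the GBM above:

# Confusion matrix of the leader model on the training data
# (computed at the max-F1 threshold by default)
print(model.leader.confusion_matrix())

# F1 score at its optimal threshold, returned as a [threshold, value] pair
print(model.leader.F1())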

Python AutoML output (3)

The AutoML output (3) above shows us the Gain/Lift table, which tells us the cumulative effect of the calculated metrics. It also shows the metrics for the model on “cross-validation data”, which is still derived from the training data rather than a truly held-out set, which is why I preferred to split the data into train and test manually before using AutoML.
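The Gain/Lift table can also be retrieved directly from the leader model; a small sketch, assuming a binomial leader:

# Gain/Lift table computed on the training data
print(model.leader.gains_lift())

# Gain/Lift table computed on the cross-validation data
print(model.leader.gains_lift(xval = True))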

Python AutoML output (4)

The AutoML output (4) above shows us the confusion matrix for the model on the “cross-validation data”, which, surprisingly, gives a slightly different result, but the conclusion is still the same.

  1. The model is very good: it has 97%+ accuracy and 100% precision (the sketch after this list shows one way to pull these values programmatically).
  2. You can also check the other numbers, such as the TNR (true negative rate), FPR (false positive rate), FNR (false negative rate), and so on.
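A quick sketch of how the cross-validation accuracy and precision could be pulled from the leader model; each accessor returns a [threshold, value] pair at the threshold that maximizes that metric:

# Cross-validation accuracy and precision of the leader model
print(model.leader.accuracy(xval = True))
print(model.leader.precision(xval = True))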
Python AutoML output (5)

The AutoML output (5) above shows us the Gain/Lift table, which is very similar to AutoML output (3); the difference, of course, is that it is computed on the “cross-validation data”.

Python AutoML output (6)

The AutoML output (6) above shows us the cross-validation metrics summary for the model, with the results from each fold summarized in the mean and sd columns.
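This summary table can also be pulled out directly from the leader model; a minimal sketch:

# Per-fold cross-validation metrics plus their mean and standard deviation
cv_summary = model.leader.cross_validation_metrics_summary()
print(cv_summary)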

Python AutoML output (7)

The AutoML output (7) above shows us the scoring history including the timestamp, duration, and other details such as number_of_trees, training_logloss, training_classification_error, and so on.
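The scoring history is also available as a pandas DataFrame if you want to inspect or plot it yourself; a short sketch (the selected column names are assumptions based on the output above):

# Scoring history of the leader (GBM) model as a pandas DataFrame
history = model.leader.scoring_history()
print(history[['timestamp', 'number_of_trees', 'training_logloss']].head())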

Python AutoML output (8)

The AutoML output (8) above shows us the variable importances based on the chosen model. We can see that:

  1. worst_radius is the variable with the highest impact on the prediction of breast cancer; it accounts for roughly 58% of the total importance in the model.
  2. Other important variables are worst_perimeter (15%+), worst_concave_points (7%+), mean_concave_points and worst_area (each 3%+), and so on.
  3. The least important variables are concavity_error (0.007%), symmetry_error (0.03%), texture_error (0.07%), and so on.
# Display the leaderboard
leaderboard = model.leaderboard
print(leaderboard)

If we want to see the model leaderboard, which lists all the machine-learning models created by AutoML, we can execute the code above to get the result below.

Python AutoML output (9)

For the last step, we need to validate the chosen model by applying it to the test data we created manually before running AutoML.

# Display the prediction using test data with the chosen model
predictions = model.leader.predict(breast_cancer_h2o_test)
print(predictions)
Python AutoML output (10)

The AutoML output (10) shows us the predicted classes and class probabilities from the chosen model on the test dataset.

# Display the model performance
performance = model.leader.model_performance(breast_cancer_h2o_test)
print(performance)
Python AutoML output (11)

The AutoML output (11) shows us the confusion matrix using the chosen model on the test dataset, including metrics such as the following (the sketch after this list shows how to pull them from the performance object):

  1. Accuracy is 97%+
  2. Precision is 100%
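As mentioned above, these numbers can be pulled directly from the performance object; a minimal sketch:

# Extract individual metrics from the test-set performance object
print(performance.auc())        # area under the ROC curve
print(performance.accuracy())   # [threshold, max accuracy] pair
print(performance.precision())  # [threshold, max precision] pair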

If we want to add some visualizations of the AutoML outputs, we can easily use the code below:

# Example for the chosen model
feature_importance = model.leader.varimp(use_pandas = True)
feature_importance.plot(kind = 'bar', x = 'variable', y = 'percentage')
Python Feature importance plot
# Confusion matrix visualization
from sklearn.metrics import confusion_matrix
import seaborn as sns

y_true = breast_cancer_h2o_test['target'].as_data_frame()
y_pred = predictions['predict'].as_data_frame()
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
Python Confusion matrix visualization

For the finishing touch, don’t forget to shut down the H2O cluster.

# Stop the H2O cluster
h2o.shutdown()

H2O AutoML in R

The workflow in R is quite similar to Python. In this case, we will build a regression model rather than a classification model like the previous Python example.

# Install and load h2o package
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")
library(h2o)

# Initialize H2O cluster
h2o.init()
# Load the mtcars dataset
data(mtcars)

# Convert the dataset to an H2O Frame
h2o_df <- as.h2o(mtcars)

# Split data into training and testing sets
splits <- h2o.splitFrame(h2o_df, ratios = c(0.8), seed = 123)
train_data <- h2o.assign(splits[[1]], "train_data")
test_data <- h2o.assign(splits[[2]], "test_data")
R AutoML output (1)

As usual, we split the data into train and test sets so we can validate the models on held-out data after AutoML creates them.

# Identify the response variable
response_column <- "mpg"

# Run H2O AutoML for regression
aml <- h2o.automl(x = setdiff(names(train_data), response_column),
                  y = response_column,
                  training_frame = train_data,
                  max_runtime_secs = 60) # You can adjust the runtime or other parameters

# View AutoML Leaderboard
leaderboard <- h2o.get_leaderboard(aml)
print(leaderboard)
leader_model <- aml@leader
R AutoML output (2)

From R AutoML output (2), we also get the leaderboard listing all the machine learning models automatically created by AutoML; we found that GBM_grid_1_AutoML_1_20231208_40808_model_27 is the best model for predicting the mpg feature of our mtcars dataset.

# Make predictions on test data
predictions <- h2o.predict(leader_model, newdata = test_data)
print(predictions)

# Look into the model performance
performance <- h2o.performance(leader_model, newdata = test_data)
print(performance)
# Get the summary of leader_model
summary(leader_model)
R AutoML output (3)

Based on R AutoML output (3), we can see that hp (horsepower) is the variable with the biggest importance for predicting mpg (miles per gallon) in the mtcars dataset.
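If we want to see the full variable importance table or plot it directly, the h2o R package provides helpers for this; a small sketch (these work for tree-based leaders such as the GBM chosen here):

# Variable importance table and plot for the leader model
h2o.varimp(leader_model)
h2o.varimp_plot(leader_model)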

# Extract predicted values from the H2O Frame
predicted_values <- as.vector(predictions$predict)
actual_values <- as.vector(test_data$mpg)

plot(actual_values, predicted_values, main = "Predicted vs. Actual",
     xlab = "Actual Values", ylab = "Predicted Values")
abline(0, 1, col = "red", lty = 2) # Add a 45-degree line for comparison
R Predicted vs Actual

Using simple code, we can visualize the Predicted vs Actual values from the predictions and the test_data.

# Compute additional metrics
mse <- h2o.mse(performance)
rmse <- sqrt(mse)
r_squared <- h2o.r2(performance)

cat("Mean Squared Error:", mse, "\n")
cat("Root Mean Squared Error:", rmse, "\n")
cat("R-squared:", r_squared, "\n")
R Additional metrics

We can also extract metrics such as MSE, RMSE, and R-squared (common metrics for regression machine learning models).

Conclusion

In conclusion, H2O AutoML stands as a powerful tool for automating machine learning model building, allowing data analysts, data scientists, or anyone who needs to explore machine learning, to rapidly experiment with various algorithms and models.

In Python, the h2o Python library seamlessly integrates with our existing data science and data analytics stack, enabling us to uncover insights such as feature importances.

In R, the h2o R package provides a similar experience. With just a few lines of code, we can generate leaderboards, choose the best model, make predictions, and so on.

Whether we choose R or Python, H2O AutoML empowers us to focus on the essence of the data science, data analytics, or machine learning problem rather than getting bogged down in the intricacies of model selection.

I hope this post encourages you to start creating machine-learning models to solve your own problems!

Google Colab for Python and R code in this post

  1. Python
  2. R Code

Mochamad Kautzar Ichramsyah is a data analytics professional with 10 years of experience at tech companies in Indonesia.