Automate the building of machine learning models using AutoML (H2O) in Python and R, so easy!
About the H2O package
H2O is open-source software for data analysis that provides a platform for building machine learning models. It is designed to be scalable, fast, and easy to use, and it supports the most widely used machine learning algorithms. H2O can be used for classification, regression, clustering, and other tasks.
For further explanation about H2O, you can visit the website here.
Key features of the H2O package
- Scalability: H2O is designed to scale horizontally, allowing users to efficiently handle large datasets and models. It can be deployed on a single machine or in a distributed computing environment, such as Apache Hadoop or Spark.
- Ease of Use: H2O provides a user-friendly interface for building and deploying machine learning models. It supports multiple programming languages, including Python, R, and others.
- Machine Learning Algorithms: H2O includes a variety of machine learning algorithms, such as generalized linear models, deep learning, gradient boosting machines, random forests, k-means clustering, and more.
- AutoML: H2O’s AutoML functionality automates the machine learning model-building process. It can automatically train and tune various models, allowing users to find the best-performing model for their specific task without extensive manual tuning.
- Data Exploration and Visualization: H2O provides tools for exploring and visualizing data, making it easier for users to understand the characteristics of their datasets.
- Integration: H2O can be integrated with popular data science tools and platforms, including Jupyter Notebooks, RStudio, and more.
Of all the key features explained above, I am most interested in AutoML, because automating the machine learning model-building process would make the job much easier.
Using Google Colab
In this post, I will use Google Colab, which can run both Python and R. If you don’t know how to change the runtime type, go to Runtime > Change runtime type and pick the one you prefer. At the end of the post, I will share my Google Colab notebook with the Python and R code used in this post.
H2O AutoML in Python
First of all, we need to install the H2O package.
# Installing the H2O package
!pip install h2o
Then, we need to import the package, in particular the AutoML class, and initialize H2O.
# Importing necessary packages
import h2o
from h2o.automl import H2OAutoML
# Initialize H2O
h2o.init()
Some steps we need to do before using AutoML:
- We will use the Breast Cancer dataset from the sklearn package.
- We split the dataset into train and test.
- We must convert the dataset into an H2OFrame, so we can automate the machine learning model-building using AutoML.
# Import the Breast Cancer dataset from sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
# Load the dataset
data = load_breast_cancer()
x = data.data
y = data.target
df = pd.DataFrame(x, columns = data.feature_names)
# Split the dataset into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 13)
# Convert x_train and y_train to pandas DataFrames
x_train_df = pd.DataFrame(x_train, columns = data.feature_names)
y_train_df = pd.DataFrame({'target': y_train})
# Merge x_train and y_train into a single DataFrame
train_df = pd.concat([x_train_df, y_train_df], axis = 1)
# Convert x_test and y_test to pandas DataFrames
x_test_df = pd.DataFrame(x_test, columns = data.feature_names)
y_test_df = pd.DataFrame({'target': y_test})
# Merge x_test and y_test into a single DataFrame
test_df = pd.concat([x_test_df, y_test_df], axis = 1)
# Convert the pandas DataFrame to an H2OFrame
breast_cancer_h2o_train = h2o.H2OFrame(train_df)
breast_cancer_h2o_test = h2o.H2OFrame(test_df)
The last few steps before we use AutoML:
- We know that the Breast Cancer dataset has a target with two possible outcomes, 0 and 1, which means we must convert it into a factor.
- We need to assign the independent variables and the dependent variable to x and y.
- Finally, we can use AutoML to automate the machine-learning model-building.
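The three steps above can be sketched as follows. This is a minimal sketch, assuming the H2OFrames breast_cancer_h2o_train and breast_cancer_h2o_test built earlier and a running H2O cluster. The helper names predictor_columns and run_automl, and the max_models and seed values, are illustrative assumptions and not necessarily the settings used to produce the outputs discussed in this post. The result is stored in a variable called model to match the later snippets.

```python
def predictor_columns(columns, target):
    """Return every column name except the target (the independent variables)."""
    return [name for name in columns if name != target]


def run_automl(train_frame, target="target", max_models=10, seed=13):
    """Train H2O AutoML on an H2OFrame and return the fitted AutoML object.

    max_models and seed are illustrative values (assumptions).
    """
    # Imported here so predictor_columns can be used without H2O installed
    from h2o.automl import H2OAutoML

    # Step 1: a 0/1 target must be a factor, otherwise H2O treats it as regression
    train_frame[target] = train_frame[target].asfactor()

    # Step 2: independent variables (x) and dependent variable (y)
    x = predictor_columns(train_frame.columns, target)
    y = target

    # Step 3: let AutoML train and tune a set of candidate models
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=x, y=y, training_frame=train_frame)
    return aml


# Usage (requires h2o.init() and the frames built earlier):
# breast_cancer_h2o_test['target'] = breast_cancer_h2o_test['target'].asfactor()
# model = run_automl(breast_cancer_h2o_train)
# print(model.leader)
```

Capping the run with max_models (or max_runtime_secs) keeps the experiment short on a free Colab runtime; larger values give AutoML more candidates to compare.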
The AutoML output (1) above shows us the best-performing machine-learning model among those trained, including the model summary and model metrics. In this case, the chosen model:
- Uses a Gradient Boosting Machine (GBM) with the ID GBM_4_AutoML_1_20231207_133923.
AutoML automatically “splits” the training data internally (by default it uses cross-validation), but we will not rely on that alone: we split the dataset manually beforehand so we can check how the model performs on data it has never seen.
The AutoML output (2) above shows us the confusion matrix for the chosen model, including the maximum metrics at their respective thresholds. We can see that it has 100% accuracy in predicting the target feature with the values of 0 and 1.
The AutoML output (3) above shows us the Gain/Lift table, which tells us the cumulative effect of the calculated metrics. It also shows the metrics for the model on “cross-validation data”. These are computed on held-out folds of the training frame, not on an independent test set, which is why I preferred to split the data into train and test manually before using AutoML.
The AutoML output (4) above shows us the confusion matrix for the model on the “cross-validation data”. Its numbers differ from the earlier confusion matrix, which is expected because it is evaluated on held-out folds, but the conclusion is still the same.
- The model is very good to use: it has 97% accuracy and 100% precision.
- You can also check the other numbers such as TNs (true negatives), FPs (false positives), FNR (false negative rate), and so on.
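To make these numbers less abstract, here is a small sketch of how accuracy, precision, and the related rates follow from the four cells of a binary confusion matrix. The function name confusion_metrics is hypothetical, and the counts are made up to mirror roughly 97% accuracy and 100% precision; they are not the actual AutoML output.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive common rates from the four cells of a binary confusion matrix.

    Assumes every denominator is non-zero (each class is present and at
    least one positive prediction was made).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all rows predicted correctly
        "precision": tp / (tp + fp),     # share of predicted positives that are real
        "recall": tp / (tp + fn),        # true positive rate (TPR)
        "tnr": tn / (tn + fp),           # true negative rate
        "fpr": fp / (fp + tn),           # false positive rate
        "fnr": fn / (fn + tp),           # false negative rate
    }


# Made-up counts for illustration only
metrics = confusion_metrics(tp=60, fp=0, tn=37, fn=3)
print(metrics)
```

With zero false positives the precision is exactly 1.0, while the three false negatives are what pull the accuracy below 100%.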
The AutoML output (5) above shows us the Gain/Lift table, which is very similar to AutoML output (3); the difference is that it uses the “cross-validation data”.
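For readers unfamiliar with Gain/Lift tables, the idea can be sketched in plain Python. This is a simplified illustration and not H2O's implementation: H2O's table has more columns and its own grouping rules. The function name cumulative_lift and the labels and scores are made up.

```python
def cumulative_lift(y_true, scores, n_groups=4):
    """Return (data fraction, capture rate, lift) tuples, all cumulative, per group."""
    # Rank the true labels from the highest to the lowest predicted probability
    pairs = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    ranked = [label for _, label in pairs]
    total_positives = sum(ranked)

    rows = []
    for group in range(1, n_groups + 1):
        cutoff = round(len(ranked) * group / n_groups)
        data_fraction = cutoff / len(ranked)
        # Share of all positives found in the top-ranked rows so far
        capture_rate = sum(ranked[:cutoff]) / total_positives
        rows.append((data_fraction, capture_rate, capture_rate / data_fraction))
    return rows


# Made-up labels and predicted probabilities
labels = [1, 1, 1, 0, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
lift_rows = cumulative_lift(labels, probabilities)
for row in lift_rows:
    print(row)
```

A lift above 1 in the first groups means the model ranks positives ahead of negatives better than random ordering would; the cumulative lift always ends at 1 once all rows are included.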
The AutoML output (6) above shows us the “cross-validation data” metrics summary: the metrics from each cross-validation fold, summarized in the columns mean and sd.
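The mean and sd columns are simply summaries across folds. A tiny sketch with made-up per-fold accuracies is shown below; it uses the sample standard deviation, and I am assuming rather than asserting that this matches the variant H2O reports.

```python
from statistics import mean, stdev

# Hypothetical accuracies from five cross-validation folds (made-up numbers)
fold_accuracy = [0.96, 0.98, 0.97, 0.95, 0.99]

summary = {
    "mean": mean(fold_accuracy),  # average score across folds
    "sd": stdev(fold_accuracy),   # sample standard deviation across folds
}
print(summary)
```

A small sd relative to the mean suggests the model behaves consistently from fold to fold.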
The AutoML output (7) above shows us the scoring history, including the timestamp, duration, and other details such as number_of_trees, training_logloss, training_classification_error, and so on.
The AutoML output (8) above shows us the variable importances based on the chosen model. We can see that:
- worst_radius is the variable with the highest impact on the prediction of breast cancer. It accounts for a little over 58% of the total variable importance.
- The other important variables are worst_perimeter (15%+), worst_concave_points (7%+), mean_concave_points and worst_area (each 3%+), and so on.
- The least important variables are concavity_error (0.007%), symmetry_error (0.03%), texture_error (0.07%), and so on.
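The percentages in a variable importance table are raw importances normalized to sum to 1, and the scaled values are raw importances divided by the largest one. The sketch below shows that arithmetic; the function name importance_table and the raw numbers are made up, with all remaining variables lumped into one hypothetical entry.

```python
def importance_table(relative_importance):
    """Turn raw importances into (variable, scaled, percentage) rows, largest first."""
    largest = max(relative_importance.values())
    total = sum(relative_importance.values())
    ranked = sorted(relative_importance.items(), key=lambda item: item[1], reverse=True)
    # scaled: relative to the top variable; percentage: share of the total
    return [(name, value / largest, value / total) for name, value in ranked]


# Made-up raw importances; "all_other_variables" is a hypothetical lumped entry
raw = {"worst_radius": 58.0, "worst_perimeter": 15.0, "all_other_variables": 27.0}
table = importance_table(raw)
for name, scaled, percentage in table:
    print(f"{name}: scaled={scaled:.3f}, percentage={percentage:.1%}")
```

Because the percentages always sum to 100%, a single variable at 58% means the remaining variables share the other 42% between them.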
# Display the leaderboard
leaderboard = model.leaderboard
print(leaderboard)
If we want to see the model leaderboard, which lists all the machine-learning models created by AutoML, we can execute the code above to get the result below.
For the last step, we need to validate the chosen model by applying it to the test data we created manually before running AutoML.
# Display the prediction using test data with the chosen model
predictions = model.leader.predict(breast_cancer_h2o_test)
print(predictions)
The AutoML output (10) shows us the predicted probability value using the chosen model on the test dataset.
# Display the model performance
performance = model.leader.model_performance(breast_cancer_h2o_test)
print(performance)
The AutoML output (11) shows us the confusion matrix using the chosen model on the test dataset, including metrics such as:
- Accuracy is 97%+
- Precision is 100%
If we want to visualize some of the AutoML outputs, we can use the code below:
# Example for the chosen model
feature_importance = model.leader.varimp(use_pandas = True)
feature_importance.plot(kind = 'bar', x = 'variable', y = 'percentage')
# Confusion matrix visualization
from sklearn.metrics import confusion_matrix
import seaborn as sns
y_true = breast_cancer_h2o_test['target'].as_data_frame()
y_pred = predictions['predict'].as_data_frame()
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
For the finishing touch, don’t forget to shut down the H2O cluster.
# Stop the H2O cluster
h2o.cluster().shutdown()
H2O AutoML in R
The workflow is quite similar to Python. In this case, we will build a regression model rather than a classification model like the previous example in Python.
# Install and load h2o package
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")
library(h2o)
# Initialize H2O cluster
h2o.init()
# Load the mtcars dataset
data(mtcars)
# Convert the dataset to an H2O Frame
h2o_df <- as.h2o(mtcars)
# Split data into training and testing sets
splits <- h2o.splitFrame(h2o_df, ratios = c(0.8), seed = 123)
train_data <- h2o.assign(splits[[1]], "train_data")
test_data <- h2o.assign(splits[[2]], "test_data")
As before, we split the data into training and test sets so that we can validate the machine learning models on held-out data after they are created.
# Identify the response variable
response_column <- "mpg"
# Run H2O AutoML for regression
aml <- h2o.automl(x = setdiff(names(train_data), response_column),
y = response_column,
training_frame = train_data,
max_runtime_secs = 60) # You can adjust the runtime or other parameters
# View AutoML Leaderboard
leaderboard <- h2o.get_leaderboard(aml)
print(leaderboard)
leader_model <- aml@leader
From R AutoML output (2), we also get the leaderboard, which lists all the machine learning models automatically created by AutoML. We found that GBM_grid_1_AutoML_1_20231208_40808_model_27 is the best at predicting the mpg feature of our mtcars dataset.
# Make predictions on test data
predictions <- h2o.predict(leader_model, newdata = test_data)
print(predictions)
# Look into the model performance
performance <- h2o.performance(leader_model, newdata = test_data)
print(performance)
# Get the summary of leader_model
summary(leader_model)
Based on R AutoML output (3), we can see that hp (horsepower) is the most important variable for predicting mpg (miles per gallon) in the mtcars dataset.
# Extract predicted values from the H2O Frame
predicted_values <- as.vector(predictions$predict)
actual_values <- as.vector(test_data$mpg)
plot(actual_values, predicted_values, main = "Predicted vs. Actual",
xlab = "Actual Values", ylab = "Predicted Values")
abline(0, 1, col = "red", lty = 2) # Add a 45-degree line for comparison
Using simple code, we can visualize the Predicted vs Actual values from the predictions and the test_data.
# Compute additional metrics
mse <- h2o.mse(performance)
rmse <- sqrt(mse)
r_squared <- h2o.r2(performance)
cat("Mean Squared Error:", mse, "\n")
cat("Root Mean Squared Error:", rmse, "\n")
cat("R-squared:", r_squared, "\n")
We can also extract metrics such as MSE, RMSE, and R-squared (common metrics for regression machine learning models).
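For readers following along in Python, the textbook formulas behind these three metrics can be sketched by hand. This illustrates the definitions and is not H2O's internal code; the function name regression_metrics and the values are made up.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, and R-squared from paired actual and predicted values."""
    n = len(actual)
    # Sum of squared residuals between actual and predicted values
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mean_actual = sum(actual) / n
    # Total sum of squares around the mean of the actual values
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r_squared": 1 - ss_res / ss_tot}


# Made-up values for illustration only
actual_values = [20.0, 22.0, 18.0, 30.0]
predicted_values = [21.0, 21.0, 19.0, 29.0]
scores = regression_metrics(actual_values, predicted_values)
print(scores)
```

RMSE is in the same unit as the response, which makes it easier to interpret than MSE, while R-squared expresses how much of the variance in the actual values the predictions explain.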
Conclusion
In conclusion, H2O AutoML is a powerful tool for automating machine learning model building. It lets data analysts, data scientists, and anyone else who needs to explore machine learning rapidly experiment with various algorithms and models.
In Python, the h2o library integrates seamlessly with our existing data science and data analytics stack, enabling us to uncover insights such as feature importances.
In R, the h2o package provides a similar experience. With just a few lines of code, we can generate leaderboards, choose the best model, make predictions, and so on.
Whether we choose R or Python, H2O AutoML lets us focus on the essence of the data science, data analytics, and machine learning problem rather than getting too busy with the intricacies of model selection.
I hope this post encourages you to start creating machine-learning models to solve your own problems!