Databricks Freaky Friday Pills #6: Auto ML

Gonzalo Zabala
SDG Group

--

Welcome back! We’ve reached the final step of our journey through our ML end-to-end solution in Databricks. Over the past six months, we’ve covered a lot of ground, learning not only about Databricks but also about key concepts related to ML Architecture, MLOps solutions, Model Governance, and more.

In this final article of the series (for now), we will focus on exploring AutoML and how we can apply this paradigm using only Databricks tools.

Let’s get started!

1. Auto ML

AutoML is the process of automating the tasks required to apply ML solutions to real-world problems. It is generally aimed at users who are not programming or machine learning experts. AutoML products usually provide a drag-and-drop interface, or similar frameworks, that turn the inner procedures into a black box.

An ML solution typically involves a series of steps that experts must consider and work on to provide a robust solution. Among these steps, we can find feature extraction, model selection, hyperparameter tuning, and model monitoring. The idea behind AutoML is to absorb all these steps into an automated process where the user is agnostic to the inner functions and transformations applied. This “black box” is not identical across frameworks; each one can absorb different steps of the ML solution.
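To make the idea concrete, the core loop that such a black box automates can be sketched in a few lines: try several candidate models, score each with cross-validation, and keep the best performer. This is a minimal illustration using scikit-learn; the candidate list and metric are illustrative, not what any particular AutoML product uses.

```python
# A minimal sketch of what an AutoML "black box" automates: trying several
# candidate models with cross-validation and keeping the best performer.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Illustrative candidate models; real AutoML tools search a larger space.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Score every candidate with 5-fold cross-validation and keep the best one.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```

Everything an AutoML tool adds on top of this loop — preprocessing, feature engineering, hyperparameter search, deployment — is a refinement of the same select-and-score pattern.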

Auto ML Scope

As you can see in the image below, there are multiple AutoML solutions, each addressing different parts of the ML pipeline.

AutoML allows data scientists to focus on areas that require deep expertise and creativity. This hybrid approach maximizes efficiency, ensuring that the overall ML solution is both robust and high-performing. Indeed, recent AutoML solutions from cloud hyperscalers and specific AutoML SaaS platforms illustrate the diverse and specialized approaches to automating machine learning processes. These platforms often offer segmented products tailored to various aspects of the ML workflow, providing comprehensive and specialized tools:

1. AutoML Complete:

End-to-end pipeline solutions encompass the entire ML lifecycle: data preparation, feature engineering, model selection, hyperparameter tuning, deployment, and monitoring. They aim to automate the whole process, making it accessible and efficient for users with varying levels of expertise.

2. Auto ML Feature & Model Engineering:

The Feature Engineering part involves tools that focus on cleaning and enhancing data quality by handling missing values, detecting outliers, and ensuring consistency. Additionally, pre-processing tasks such as normalization, encoding categorical variables, and scaling are automated. Feature extraction processes automatically identify and generate relevant features from raw data.

On the other hand, Model Selection lets users choose among various ready-made algorithms and pick the ones that best fit the specific dataset and problem. Additionally, these platforms systematically compare model performance to identify the most effective model.
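The feature-engineering side of this category can be sketched as a preprocessing pipeline: impute missing values, scale numeric columns, and encode categoricals. The snippet below uses scikit-learn; the column names and tiny dataset are purely illustrative.

```python
# A sketch of the preprocessing steps an AutoML tool typically automates:
# imputing missing values, scaling numerics, and encoding categoricals.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative raw data with missing values in every column type.
df = pd.DataFrame(
    {
        "age": [25, np.nan, 47, 32],
        "income": [30000.0, 52000.0, np.nan, 41000.0],
        "segment": ["retail", "pharma", "retail", np.nan],
    }
)

numeric = ["age", "income"]
categorical = ["segment"]

preprocess = ColumnTransformer(
    [
        # Numeric columns: fill gaps with the mean, then standardize.
        ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", StandardScaler())]), numeric),
        # Categorical columns: fill gaps with the mode, then one-hot encode.
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ]
)

features = preprocess.fit_transform(df)
print(features.shape)  # 4 rows; 2 scaled numerics + 2 one-hot columns
```

An AutoML platform decides these imputation, scaling, and encoding choices for you — which is exactly the convenience, and exactly the loss of control discussed later.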

3. Auto ML Param & Monitor:

The third group of AutoML solutions covers the last two steps of ML development: hyperparameter tuning and deployment monitoring. Model hyperparameter tuning automates the search for the best hyperparameter settings, enhancing model performance by efficiently exploring a wide range of configurations to identify the optimal setup.
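This kind of automated trial loop can be sketched with a randomized search: sample configurations from a search space, cross-validate each, and keep the best. The search space below is illustrative.

```python
# A minimal sketch of automated hyperparameter search: sample candidate
# configurations, score each with cross-validation, keep the best one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={          # illustrative search space
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=10,                     # number of sampled configurations (trials)
    cv=3,
    scoring="f1",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

AutoML platforms run the same kind of trial loop at scale, typically parallelizing the trials across compute nodes.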

Once our model is deployed, this solution provides an intuitive tool for monitoring various assets, including validation metrics performance and statistical features for input and output data. These solutions typically trigger alerts, allowing users to easily identify critical issues related to the ML solution.
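One common check behind such monitoring tools is input drift detection: compare the distribution of a live feature against its training distribution and raise an alert when they diverge. Here is a hedged sketch using a two-sample Kolmogorov-Smirnov test; the threshold and the synthetic data are illustrative.

```python
# A sketch of post-deployment input monitoring: compare the live feature
# distribution against the training distribution and alert on drift.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # training data
live_feature = rng.normal(loc=0.8, scale=1.0, size=1000)   # shifted live input

# Kolmogorov-Smirnov test: a small p-value means the distributions differ.
stat, p_value = ks_2samp(train_feature, live_feature)
drift_alert = p_value < 0.01  # illustrative alert threshold
print(f"KS statistic={stat:.3f}, drift_alert={drift_alert}")
```

Monitoring platforms wrap checks like this one in dashboards and alerting channels, and extend them to validation metrics and output distributions.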

It is also worth mentioning the rise of use-case-specific AutoML solutions. Driven by advances in GenAI, multiple platforms now offer pre-configured pipelines tailored for specific use cases like text, video, or image processing.

Other examples gaining share in the AutoML market are marketing tools like Cassandra, an MMM (marketing mix modeling) tool that leverages AI architectures behind a polished, easy-to-use interface. This solution provides an integrated marketing tool to measure the impact of your company's marketing channels.

We commented earlier that these solutions are offered by all the cloud hyperscalers and some other platforms. Some of the most famous AutoML tools are:

Limitations

As you may have already imagined, AutoML has its limitations. The main one concerns solution design: AutoML constrains customization. We will also cover other limitations related to output and cost control:

Limitations in terms of design capabilities

  • Custom metrics: AutoML solutions lack many design customization capabilities. It is common to find only a subset of the most commonly used metrics. For instance, to build the example shown in the following section, we are not able to reproduce complex validation metrics to improve our model's performance or explainability. Moreover, there is no way to define custom business metrics for some models. This implies limitations in architecture and solution design for complex business requirements. In an AutoML tool, we are normally restricted to the metrics the solution provides.
  • Model selection: The previous point applies not only to metrics but also to model selection. AutoML tools usually provide a limited set of models for each use case, and it is possible that none of the provided models covers our use case at all. Again, the complexity of our business case can affect the underlying mathematical paradigm needed to solve the problem; if the AutoML provider does not support that paradigm, the use case simply cannot be solved with the tool.
  • Libraries: When we customize a use case solution, there is a dependency on the libraries we use to address it. These libraries might solve our specific needs in ways that the AutoML provider does not support. As mentioned earlier, the model or feature engineering libraries we rely on for a custom solution may not be available in the AutoML tool. Additionally, the library used by the AutoML tool might not align with our desired solution, especially if a specific version of the library is required.
  • Feature selection: These automatic tools usually expect the data to be already treated and cleaned, with missing values handled, classes balanced, and column schemas properly formatted, among other things. Our solution may require specially engineered data, and that is rarely the initial state in which ML solutions find the data.

Limitations in terms of output control and accuracy (over-fitting)

AutoML solutions usually aim for accuracy instead of explainability, and they struggle to balance building models for complex applications with interpreting them in real-world environments. The feature extraction phase or model selection can result in a highly accurate application, but one that does not offer actionable insights for business decisions. Consequently, it may not guide business direction toward improving use case efficiency. In other words, the potential business improvements suggested by the model may not be practical due to real-world constraints or a lack of alignment between the model and real-world scenarios.

Limitations in terms of cost control

One of the significant challenges in AutoML applications is cost control. Resource costs can escalate considerably due to inefficient pipelines and suboptimal model selection processes. Without well-designed and streamlined workflows, the computational and storage resources required for training, tuning, and deploying models can become prohibitively expensive. Additionally, financial costs can also rise if the AutoML processes are not optimized. Poorly managed resources lead to higher expenses for cloud services, infrastructure, and maintenance. This inefficiency can diminish the overall return on investment (ROI) for an AutoML project.

That covers AutoML definitions, types, pros, and cons. But this is a Databricks article, so let's focus on the AutoML solution Databricks offers.

2. Auto ML on Databricks

Having covered the main topics about AutoML, let's dive into the Databricks solution: its requirements, capabilities, and limitations.

AutoML Workflow

The implementation of AutoML in Databricks follows the workflow steps below, all performed automatically by the solution:

  1. Data Preparation: AutoML prepares the dataset for model training, including detecting imbalanced data for classification problems. For splitting datasets, we can choose between random, chronological, or manual splits. We are also able to integrate the Feature Store for enhanced feature extraction and explainability.
  2. Model Training and Tuning: It iterates over the dataset to train and tune multiple models, using open-source components that can be easily edited and integrated into ML pipelines. Databricks AutoML automatically distributes hyperparameter tuning trials across cluster worker nodes. Note that with Databricks Runtime 9.1 LTS ML or above, AutoML samples large datasets so they fit into the memory of a single worker node. Models are selected from decision trees, random forests, logistic or linear regressions, XGBoost, LightGBM, Prophet, and Auto-ARIMA, covering classification, regression, and forecasting use cases.
  3. Algorithm Evaluation: The solution provided by Databricks evaluates models using algorithms from scikit-learn, xgboost, LightGBM, Prophet, and ARIMA.
  4. Results: Finally, it displays results and provides Python notebooks with the source code for each trial run, allowing review, reproduction, and modification. It also calculates and saves summary statistics about the dataset in a notebook for later review. These result notebooks can also include SHAP values on the feature importance page, increasing the solution's explainability.
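Besides the UI flow shown next, this workflow can also be driven from the AutoML Python API. The sketch below is based on the documented `databricks.automl.classify` call; it only runs on a Databricks cluster with an ML runtime, and the table and column names are hypothetical.

```python
# Sketch of the Databricks AutoML Python API (ML runtime required).
# The table name and target column below are hypothetical examples.
from databricks import automl

summary = automl.classify(
    dataset=spark.table("main.sales.churn_training"),  # hypothetical table
    target_col="churned",
    primary_metric="f1",
    timeout_minutes=30,
)

print(summary.best_trial.metrics)       # validation metrics of the best run
print(summary.best_trial.notebook_url)  # auto-generated source notebook
```

The returned summary points back to the generated trial notebooks, so the same review-and-modify loop described in step 4 is available programmatically.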

Databricks AutoML Experiment Example

Let’s put this knowledge into practice and create a solution using AutoML UI in Databricks. To create a new AutoML experiment, we can go to “Experiments” and create a new AutoML Experiment:

Then we proceed to input the main characteristics of our AutoML experiment, like the ML problem type, the input dataframe, or the prediction target. In the figure below, on the right side, we can see some options related to how missing values are treated for a schema (replaced by mean, 0, or any other specific value).

For the input training dataset, you can choose directly from the workspace data catalog:

In the advanced options shown below, there are configuration inputs to be considered. From this perspective, we can choose between evaluation metrics, model frameworks to use, timeout, a time column used for validation and testing split, and the positive label for our target column.

In the next step, we can manage the feature store and select feature tables to join our main input training dataset.

Once the configuration is ready and the features selected, we can proceed to the training step. In this step, a series of training runs is triggered until the timeout is reached. At that point, the best-performing run according to the evaluation metric is selected:

Among the conveniences provided by AutoML in Databricks, clicking on the training source notebook redirects us to the notebook auto-generated by the tool. Any issue or change we would like to apply directly to the notebook can be handled from here.

Once the training phase is completed, the subsequent steps align with an MLOps workflow using MLflow. You can select the model generated from the best-performing run and choose any action from the MLflow page, such as registering the model or downloading result artifacts.

As you may have noticed, this process does not cover important parts of the model solution used in our Model Governance article. With the AutoML solution provided by Databricks, we are not able to apply important configurations such as multiple binary target columns or the procedures used to balance classes. This is the biggest downside of AutoML: the lack of customization that data engineers and data scientists usually apply to their models. This lack of customization also affects the computation of metrics and the model's explainability.

Conclusions

AutoML solutions like the one provided by Databricks offer significant advantages: they accelerate processes for teams without ML expertise, quickly create model mock-ups for POCs, and integrate parts of ML end-to-end solutions that might otherwise be blocked by limited technical or human resources. However, they also have limitations in terms of customization. It is crucial to recognize these limitations when developing solutions with AutoML tools to fully grasp their capabilities and the challenges they address. If these downsides are properly understood, AutoML tools become a genuine advantage for accelerating project processes.

This is the end of a series of six articles that provide a holistic view of ML solutions applied using just the Databricks platform. It was an amazing journey for both of us. Angel and I are incredibly grateful for your feedback. This is not a goodbye, but a “See you later”, the Freaky Friday Pills will come back!

References

For this article, we have used the following references:

  1. AutoML. (2024). What is AutoML? Retrieved June 19, 2024, from https://www.automl.org/automl/#:~:text=What%20is%20AutoML%3F,accelerate%20research%20on%20Machine%20Learning.
  2. Databricks. (n.d.). AutoML. In Databricks documentation. Retrieved June 19, 2024, from https://docs.databricks.com/en/machine-learning/automl/index.html

Who we are

  • Gonzalo Zabala is a consultant in AI/ML projects in the Data Science practice unit at SDG Group España, with experience in the retail and pharmaceutical sectors, giving value to the business by providing end-to-end Data Governance solutions for multi-affiliate brands. https://www.linkedin.com/in/gzabaladata/
  • Ángel Mora is an ML architect and specialist lead in the architecture and methodology area of the Data Science practice unit at SDG Group España. He has experience in different sectors such as pharmaceuticals, insurance, telecommunications, and utilities, managing different technologies in the Azure and AWS ecosystems. https://www.linkedin.com/in/angelmoras/
