Creating Configurable Data Pre-Processing Pipelines by Combining Hydra and Sklearn

Eli Simhayev
BeyondMinds
Sep 9, 2021 · 5 min read

This post is authored by Eli Simhayev, Research Engineer at BeyondMinds, and Benjamin Bodner, Algorithm Researcher at BeyondMinds.

Data pre-processing pipelines are an integral component of any machine learning (ML) system and can have a significant effect on the performance of ML models. This article presents a software tool that supports efficient experimentation and deployment of ML systems. Specifically, the tool generates configurable data pre-processing pipelines, which facilitate collaboration and reproducibility of results.

The proposed “Hydra-Sklearn pipeline”, as the name implies, is based on the combination of two highly effective open-source frameworks, Sklearn and Hydra. It enables storing pre-processing pipelines in config (yaml) files instead of in the source code itself. This provides several clear benefits over traditional pipelines, making them easier to store, change, share and version, which is needed for reproducibility. These benefits make the experimentation phase of building machine learning systems more efficient and simplify their deployment process.

Sklearn Pipelines

Sklearn pipelines are widely used in a variety of tabular and time-series tasks, such as classification, regression, anomaly detection and more (for a great introduction to sklearn pipelines, check out these articles on Medium and Towards Data Science).

The sklearn library offers a plethora of pre-processing methods, such as feature transformers and selectors, and machine learning models which can be combined seamlessly into unified sklearn pipelines.

Though highly versatile, these pipelines are typically hard-coded and non-configurable, making it hard to add/remove steps, as well as share and track the pipeline configurations. This means that there’s no simple way to change the order or selection of components in the pipeline, without changing the source code itself. As a result, tracking the different pipeline configurations and understanding how they affect the performance of the ML system becomes a difficult task.

The following examples show how any change in the sklearn pipeline components or hyperparameters requires changing the source code itself:

Code Example 1: A standard pre-processing pipeline. Steps are hardcoded into the code itself
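
A minimal sketch of such a pipeline (the specific steps and parameters here are illustrative, chosen to match the SimpleImputer example discussed later in this post):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Every step and every hyperparameter lives in the source code itself.
preprocessing_pipeline = Pipeline([
    ("SimpleImputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("StandardScaler", StandardScaler()),
    ("VarianceThreshold", VarianceThreshold(threshold=0.0)),
])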

Worse yet, we are often experimenting with several different pipelines, as in this example, which uses different pipelines for different models:

Code Example 2: Different pipelines for different models. The code becomes messy as pipelines are added
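
A sketch of this pattern, with the model names and step choices picked purely for illustration:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def get_preprocessing_pipeline(model_name):
    # Each model gets its own hard-coded pipeline -- every new variant
    # adds another branch to the source code.
    if model_name == "decision_tree":
        steps = [
            ("SimpleImputer", SimpleImputer(strategy="constant", fill_value=0)),
        ]
    elif model_name == "logistic_regression":
        steps = [
            ("SimpleImputer", SimpleImputer(strategy="mean")),
            ("StandardScaler", StandardScaler()),
        ]
    else:
        raise ValueError(f"No pipeline defined for model: {model_name}")
    return Pipeline(steps)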

This means that we need to keep copies of these different pipeline versions in the source code itself. This is difficult to manage, creates messy code and makes it hard to track the relationships between the performance metrics and the pipeline that generated them.

Ideally, we would want to have the source code fixed while enabling any selection and order of the pre-processing steps, via a configuration.

Enter Hydra

Hydra is an open-source Python framework created by Facebook AI Research (FAIR) for elegantly configuring complex applications. It enables efficient hierarchical configuration management, such as choosing a selection of components in which each component has its own parameters. Most importantly for our purposes, it enables instantiating any Python object, from local files or imported libraries, with configurable input arguments.

This allows us to instantiate any pre-processing step, using the same command:

step = hydra.utils.instantiate(step_config)

The instructions for which components to instantiate, as well as the input arguments, are defined in “step_config”. We now move on to explaining how to construct a “step_config” and combine multiple steps into a full pipeline.
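
For illustration, a minimal “step_config” could be a plain dict (or an OmegaConf config) whose _target_ field names the class to build; this snippet assumes Hydra 1.1+ and is purely illustrative:

import hydra

step_config = {
    "_target_": "sklearn.impute.SimpleImputer",  # class to instantiate
    "strategy": "constant",                      # keyword arguments
    "fill_value": 0,
}
step = hydra.utils.instantiate(step_config)
# step is now SimpleImputer(strategy='constant', fill_value=0)

In practice, Hydra builds these configs for us from yaml files, as shown next.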

Hydra Sklearn Pipeline

Here is an example yaml configuration file that generates the exact same pre-processing pipeline that is hard-coded in Code Example 1 above:

Code Example 3: Hydra configuration file, used to create the pipeline shown in Code Example 1
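
Based on the breakdown below, such a config looks roughly like this (the file name and the exact step selection are illustrative):

# preprocessing_pipeline/decision_tree.yaml
_target_: hydra_sklearn_pipeline.make_pipeline
steps_config:  # a yaml list, to preserve the order of the steps
  - SimpleImputer:
      _target_: sklearn.impute.SimpleImputer
      strategy: 'constant'
      fill_value: 0
  - StandardScaler:
      _target_: sklearn.preprocessing.StandardScaler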

This yaml config is a Hydra config, meaning it provides the instructions used by Hydra to instantiate configurable objects (the pre-processing steps, in our case).

Let’s break down what is going on here:

  1. _target_: hydra_sklearn_pipeline.make_pipeline — specifies which object or method to instantiate (the implementation of this method is described in the next example).
  2. steps_config — a list that defines the pre-processing steps we want to execute. Each step is organized as follows:
  • _target_ — the class used to instantiate the step, e.g. sklearn’s SimpleImputer
  • param1 — the step’s first parameter, e.g. strategy: 'constant' for SimpleImputer
  • param2 — the step’s second parameter, and so on.

For more information regarding configurable instantiation using Hydra, please see their great documentation and tutorials.

Note that the steps_config field includes all the instructions for which pre-processing steps to create, as well as their parameters. Thus, adding a new step can be done easily via the config file, without changing the source code.
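
For example, appending a feature-selection step to the pipeline above only takes a few extra yaml lines under steps_config (the step itself is illustrative):

  - VarianceThreshold:
      _target_: sklearn.feature_selection.VarianceThreshold
      threshold: 0.0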

Given this config file, the function that creates a Hydra-Sklearn pipeline is:

Code Example 4: Function to create a Hydra-Sklearn pipeline
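
A sketch of this function, assuming each entry in steps_config maps a step name to its parameters, as in the yaml above:

import hydra
from omegaconf import ListConfig
from sklearn.pipeline import Pipeline


def make_pipeline(steps_config: ListConfig) -> Pipeline:
    """Creates a pipeline with all the pre-processing steps specified in
    steps_config, in the order they are listed."""
    steps = []
    for step_config in steps_config:
        # retrieve the name and the parameters of the current step
        step_name, step_params = list(step_config.items())[0]
        # instantiate the step with Hydra, using the command from the previous section
        pipeline_step = (step_name, hydra.utils.instantiate(step_params))
        steps.append(pipeline_step)
    return Pipeline(steps)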

As you can see, each step is instantiated using the same Hydra command we saw in the previous section. The instantiated steps are then added to a list and “piped” together using the sklearn Pipeline object. This gives us the configurable pipeline we’ve been looking for!

We can also rewrite the pre-processing steps from Code Example 2 as config files, as we did above. This is done using multiple yaml files, one for each pipeline, in the following structure defined by Hydra:

Configs hierarchy, enabling selection from multiple pre-processing pipeline options
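
In Hydra’s config-group convention this corresponds to a directory layout roughly like the following (file names beyond decision_tree.yaml are illustrative):

configs/
├── config.yaml                    # main config, selects one pipeline
└── preprocessing_pipeline/
    ├── decision_tree.yaml
    └── logistic_regression.yaml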

Even though we made yaml files for multiple pipeline configurations, in each experiment we will select only one specific configuration to work with. This is done using a main config, which contains a “pointer” to the pipeline we want to run:

Code Example 6: config.yaml, which points to “decision_tree.yaml” during the creation of the preprocessing_pipeline
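
Using Hydra’s defaults list, such a “pointer” can be expressed as follows (a minimal sketch):

# config.yaml
defaults:
  - preprocessing_pipeline: decision_tree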

This also simplifies the driver code, which now only contains a single line for any configurable pipeline, regardless of the number of preprocessing steps!

Code Example 7: The full driver code, which reproduces Code Example 1
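
A minimal sketch of such a driver, assuming the configs directory above and Hydra 1.1+ (passing _recursive_=False leaves the instantiation of the individual steps to make_pipeline):

import hydra
from omegaconf import DictConfig


@hydra.main(config_path="configs", config_name="config")
def main(config: DictConfig) -> None:
    # a single line builds the whole pipeline, whatever the config selects
    preprocessing_pipeline = hydra.utils.instantiate(
        config.preprocessing_pipeline, _recursive_=False
    )
    # the result is a regular sklearn Pipeline, e.g.:
    # preprocessing_pipeline.fit_transform(X)


if __name__ == "__main__":
    main()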

Most importantly, we have now successfully detached the pipeline configuration from the source code. This enables us to track different pipeline versions, associate them with performance metrics and share the configurations efficiently between team members.

Making pipelines more configurable than ever before

To summarize, we’ve shown here the Hydra-Sklearn pipeline, a configurable way to use sklearn pipelines, which enables us to store pre-processing pipelines in config (yaml) files instead of in the source code itself, making it easy to add or remove steps by simply commenting them in or out.

Most importantly, these Hydra-Sklearn pipelines are easy to track and share, and they facilitate reproducibility of experiments.

The full code examples can be found in the accompanying GitHub repository:
