How to Structure Your Machine Learning Project to Run Hundreds of Pipelines?

A simple framework to organize your project

Ali Larian
7 min read · Feb 11, 2022

Introduction

A Machine Learning (ML) project begins with problem definition and then data collection. Once the data gathering and problem definition steps are complete, it’s time to code. In this article, we propose an organized structure for your ML projects that simplifies not only finding the best hyperparameter configuration but also finding the best combination of pipeline components.

Most real-world ML problems deal with imbalanced data, so that is the kind of problem we tackle in this article. Consider an imbalanced classification task: you will probably face a bunch of ML algorithms, over-sampling methods, statistical imputation approaches for missing values, encoding methods, and feature scaling techniques. In this article we use the H1N1 and Seasonal Flu Vaccines dataset.

You can find the complete code here.

Training and Tuning Pipeline

Each main component in the above pipeline has dozens of candidate models; for example, there is a whole family of oversampling algorithms. That leads to many combinations, and the goal is to find the best one. Each combination also needs to be tuned, meaning the whole pipeline should be tuned over its hyperparameter space.

Now lots of questions arise: How should I structure my project folders? How do I design a pipeline for training and tuning? How do I manage the hyperparameters (HPs) of the ML algorithms, oversamplers, and imputers? How can I handle the huge amount of results?

Prerequisites

Familiarity with scikit-learn pipelines, hyperparameter tuning with RandomizedSearchCV, and the imbalanced-learn library will give you a good idea of what we are going to talk about.

In this article we are going to address these practical issues in the following order:

  1. Organizing project directories
  2. Data flow between directories
  3. Managing the models (ML models, imputers, oversamplers, etc.)
  4. Designing pipeline
  5. Training and tuning pipelines
  6. Handling results of running configurations

1. Organizing Project Directories

You can organize your project as follows:
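The tree below is a sketch assembled from the directories and modules described in this article; the file names come from the text, and the comments are illustrative:

    project/
    ├── configs/                    # one YAML file per tunable pipeline component
    │   └── ml_model_hp.yaml
    ├── data/                       # raw data
    ├── lib/                        # customized third-party algorithms
    ├── models/
    │   ├── _ml_algorithms.py
    │   ├── _oversamplers.py
    │   ├── _imputers.py
    │   ├── _encoders.py
    │   └── _fetch_hyperparameter.py
    ├── pipelines/
    │   └── _make_pipeline.py
    ├── utils/
    │   ├── _load_data.py
    │   └── _evaluation_metrics.py
    └── main_run.ipynb              # training, tuning, and testing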

The foundation of the above structure is based on [1]. In this section we describe the structure in general.

configs contains YAML files; each YAML file corresponds to a pipeline component that should be tuned. For example, if you want to use SVM and XGBoost, you should put their HPs into the ml_model_hp.yaml file. Separating the HPs of each pipeline component makes them easier to manage.

The following example shows a YAML config file:
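As a sketch of what ml_model_hp.yaml might contain (the keys and value ranges are illustrative, and the model__ prefix assumes the classifier step in the pipeline is named model):

    SVC:
      model__C: [0.1, 1, 10, 100]
      model__kernel: [rbf, poly]
      model__random_state: [42]
    XGB:
      model__n_estimators: [100, 300, 500]
      model__max_depth: [3, 5, 7]
      model__learning_rate: [0.01, 0.1, 0.3]
      model__random_state: [42]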

Config file structure

In the models directory, we create a Python file for each component in the pipeline. In this example the pipeline contains an encoder, an imputer, an oversampler, and an ML algorithm, so we have a Python file for each of them.

The pipelines directory contains the _make_pipeline.py module, which holds the pipeline implementation.

2. Data Flow Between Directories

Data Flow Between Directories

(The diagram above depicts the data flow between the directories described in the previous section.)

The main_run.ipynb notebook is where training, tuning, and testing are done. It must access the pipeline components in models, and to create the pipeline itself, it accesses pipelines.

To read processed data in main_run.ipynb, it accesses utils; the utils module in turn reads the raw data from the data directory.

The _load_data.py module in utils is responsible for reading the raw data and performing some basic preprocessing steps on it.
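As a rough sketch of what _load_data.py might do, assuming the H1N1 dataset's CSV layout with a respondent_id index column (the function name, paths, and preprocessing choices are illustrative):

    # utils/_load_data.py: illustrative sketch, not the repository's exact code.
    import pandas as pd

    def load_data(features_path="data/training_set_features.csv",
                  labels_path="data/training_set_labels.csv",
                  target="h1n1_vaccine"):
        # Read the raw CSV files.
        X = pd.read_csv(features_path, index_col="respondent_id")
        y = pd.read_csv(labels_path, index_col="respondent_id")[target]
        # Basic preprocessing: drop columns that are mostly missing (threshold illustrative).
        X = X.loc[:, X.isna().mean() < 0.5]
        return X, y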

In models, there is a module called _fetch_hyperparameter.py which is responsible for fetching HPs from configs.

3. Managing the Models (ML Algorithms, Imputers, Oversamplers, etc.)

To manage the models, which include ML models, imputers, oversamplers, etc., we separate them into different modules. In our example, the following modules are defined:

  • _ml_algorithms.py: All ML algorithms you use in your problem are imported here. They can come from a third-party library (scikit-learn, XGBoost) or be your own implementations. The imported ML algorithms are packed with their HPs into a dictionary and returned.

When get_ml_algo(["SVC", "XGB"]) is called, a dictionary like the following is returned:
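A sketch of the returned structure (the nesting is illustrative; the HP keys mirror the YAML config):

    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    {
        "SVC": {
            "model": SVC(),
            "hyperparameters": {"model__C": [0.1, 1, 10, 100],
                                "model__kernel": ["rbf", "poly"]},
        },
        "XGB": {
            "model": XGBClassifier(),
            "hyperparameters": {"model__n_estimators": [100, 300, 500],
                                "model__max_depth": [3, 5, 7]},
        },
    }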

ML models with their hyperparameters
  • _oversamplers.py: In this example, we address the class imbalance problem with oversampling algorithms. Oversamplers also have HPs that need to be tuned. The functionality of this module is like that of _ml_algorithms.py.

When get_oversampler(["SMOTE", "SVMSMOTE"]) is called, a dictionary like the following is returned:
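Analogously, a sketch of what it might return (the oversampler__ prefix assumes the pipeline step is named oversampler; the HP values are illustrative):

    from imblearn.over_sampling import SMOTE, SVMSMOTE

    {
        "SMOTE": {"model": SMOTE(),
                  "hyperparameters": {"oversampler__k_neighbors": [3, 5, 7]}},
        "SVMSMOTE": {"model": SVMSMOTE(),
                     "hyperparameters": {"oversampler__k_neighbors": [3, 5, 7],
                                         "oversampler__m_neighbors": [5, 10]}},
    }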

Oversamplers with their hyperparameters
  • _imputers.py: Another main component in our pipeline is the imputer, which fills in missing values. If your dataset has many features with many missing values, be sure to choose a good imputer. In this example KNNImputer (from scikit-learn) is chosen. We customize these algorithms and put them into the lib directory.
  • _encoders.py: You have probably heard of one-hot encoding for categorical features, but there are several less well-known encoding methods as well. There is an excellent article about them that I recommend reading [2]. In this example we use JamesStein and BackwardDifference (from the category_encoders library).
  • _fetch_hyperparameter.py: This module is used to read HPs from the YAML files; a minimal sketch follows this list.
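The sketch below assumes PyYAML and one YAML file per component (the function and argument names are illustrative):

    # models/_fetch_hyperparameter.py: illustrative sketch.
    import yaml

    def fetch_hyperparameters(config_path, names):
        # Load the whole YAML file, then keep only the requested entries.
        with open(config_path) as f:
            all_hps = yaml.safe_load(f)
        return {name: all_hps[name] for name in names}

For example, fetch_hyperparameters("configs/ml_model_hp.yaml", ["SVC", "XGB"]) would return the HP settings for those two models.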

4. Designing Pipeline

To create our pipeline, we need to gather all the components together and start feeding in data. But before creating the pipeline, we must answer one question: in what order should these components be applied?

In the previous section we reviewed the main components we want to apply to an imbalanced classification problem: imputing missing values, encoding categorical features, scaling continuous features, oversampling, and classifying.

The pipeline should start with imputing missing values, because most of the subsequent steps need complete data. Classification (the ML algorithm) is certainly the final step. The encoding and scaling steps should be done before oversampling: oversampling methods rely on learning algorithms themselves (e.g., SMOTE and SVMSMOTE internally use KNN and SVM, respectively), so it is better to feed them preprocessed data.

Oversampling techniques must be applied only at training time, never at test time. Using the Pipeline from the imbalanced-learn library solves this problem; note that an error will be raised if the scikit-learn Pipeline is used instead.

As noted earlier, the pipeline implementation lives in the _make_pipeline.py module in the pipelines directory.

The order of the pipeline components follows what was said earlier: imputing, encoding, scaling, oversampling, and classifying. Each component in the pipeline has a name, and these names are reused in HP tuning. ColumnTransformer is used to apply different preprocessing to different subsets of features, and RobustScaler is used for normalizing continuous features. A sketch of the implementation is shown below.
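This sketch assumes the component objects are passed in and that the step names match the prefixes used in the YAML configs (all names illustrative):

    # pipelines/_make_pipeline.py: illustrative sketch.
    from imblearn.pipeline import Pipeline  # unlike sklearn's Pipeline, supports resamplers
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import RobustScaler

    def make_pipeline(imputer, encoder, oversampler, classifier,
                      categorical_cols, continuous_cols):
        # Encode categorical features and scale continuous ones separately.
        preprocessor = ColumnTransformer([
            ("encoder", encoder, categorical_cols),
            ("scaler", RobustScaler(), continuous_cols),
        ])
        # Step names ("imputer", "oversampler", "model") are the prefixes
        # used in the hyperparameter search space.
        return Pipeline([
            ("imputer", imputer),
            ("preprocessor", preprocessor),
            ("oversampler", oversampler),
            ("model", classifier),
        ])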

Pipeline example

5. Training and Tuning Pipelines

Now it’s time to look at main_run.ipynb, where all the parts mentioned so far come together. Training and tuning are done in a function called run. This function contains four nested loops that iterate over ML models, imputers, oversamplers, and encoders. In each round, a pipeline is created and passed to RandomizedSearchCV. So iterating over the model combinations acts like a grid search, while each pipeline configuration is tuned with RandomizedSearchCV. The run function is given below:
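A sketch of run(), assuming the dictionary shapes from section 3 and the make_pipeline signature sketched above (n_iter, the metric names, and the refit choice are illustrative):

    # main_run.ipynb: illustrative sketch of the run function.
    from itertools import product

    from sklearn.model_selection import StratifiedKFold
    from dask_ml.model_selection import RandomizedSearchCV  # the patched fork, see below

    RANDOM_SEED = 42

    def run(X_train, y_train, ml_algos, imputers, oversamplers, encoders,
            categorical_cols, continuous_cols, scoring, n_iter=50):
        searches = {}
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
        # Four nested loops, one per component family, via itertools.product.
        for (ml_name, ml), (imp_name, imp), (ovs_name, ovs), (enc_name, enc) in product(
                ml_algos.items(), imputers.items(),
                oversamplers.items(), encoders.items()):
            pipe = make_pipeline(imp["model"], enc["model"], ovs["model"], ml["model"],
                                 categorical_cols, continuous_cols)
            # One search space merging the HPs of every component in this combination.
            param_space = {**ml["hyperparameters"], **imp["hyperparameters"],
                           **ovs["hyperparameters"], **enc["hyperparameters"]}
            search = RandomizedSearchCV(pipe, param_space, n_iter=n_iter,
                                        scoring=scoring, refit="f1",
                                        cv=cv, random_state=RANDOM_SEED)
            search.fit(X_train, y_train)
            combo = "_".join((ml_name, imp_name, ovs_name, enc_name))
            save_results(search, combo)  # detailed in section 6
            searches[combo] = search
        return searches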

save_results() is a function that handles the outputs and saves the pipeline; it is detailed in section 6.

There are some important points in this part, described as follows:

  • Use a fixed random seed. There is a good article about the importance of fixing the random seed [3]. We fixed random_seed as a global variable in main_run.ipynb, and it is used in train_test_split, StratifiedKFold, and RandomizedSearchCV. The random_state parameter of the models, including ML models, imputers, oversamplers, and encoders, is also fixed in the YAML files.
  • Because the problem is imbalanced, it’s important to set the stratify parameter of the train_test_split function, so that the label distribution is the same in the train and test data.
  • We used RandomizedSearchCV from the dask-ml library instead of scikit-learn’s to tune hyperparameters; dask-ml is much faster than scikit-learn [4]. However, RandomizedSearchCV from dask-ml does not support the oversampling methods from the imbalanced-learn library. The reason is that dask-ml uses the scikit-learn Pipeline, which doesn’t handle fit_resample and also doesn’t pass transformed labels down the pipeline [5]. There is a repository which solves this problem, and you should install dask-ml from it: pip install git+https://github.com/Alilarian/dask-ml.
  • To evaluate multiple metrics during tuning, a dictionary is passed to the scoring parameter of RandomizedSearchCV, with names as keys and metric callables as values. A function in the _evaluation_metrics.py module in the utils directory generates this dictionary (see the sketch after this list). The module contains another function for evaluating the test data with different metrics.
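A sketch of the stratified split and the scoring dictionary, assuming binary labels (the metric choices and the get_scoring name are illustrative):

    from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    RANDOM_SEED = 42

    # X, y as returned by load_data(); stratify=y keeps the class distribution
    # identical in the train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=RANDOM_SEED)

    # What utils/_evaluation_metrics.py could return: names as keys, callables as values.
    def get_scoring():
        return {
            "f1": make_scorer(f1_score),
            "precision": make_scorer(precision_score),
            "recall": make_scorer(recall_score),
        }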

6. Handling Results of Running Configurations

In RandomizedSearchCV, the n_iter parameter specifies the number of hyperparameter settings that are sampled, so if you set it to 100, the pipeline is run 100 times with different hyperparameters. Here we want the hyperparameter configuration that maximizes the f1 score, and we save the corresponding results. We also save the whole pipeline as a pkl file. In utils there is a function called save_results, sketched below:
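A sketch of save_results, assuming the dictionary-based scoring above so that cv_results_ contains a mean_test_f1 column (the paths and column choices are illustrative):

    # utils: illustrative sketch of save_results.
    import os
    import pickle

    import pandas as pd

    def save_results(search, combo_name, out_dir="results"):
        os.makedirs(out_dir, exist_ok=True)
        cv_results = pd.DataFrame(search.cv_results_)
        # Keep only the row of the best configuration by mean f1 score.
        best = cv_results.loc[[cv_results["mean_test_f1"].idxmax()]]
        best.to_csv(os.path.join(out_dir, f"{combo_name}.csv"), index=False)
        # Persist the refitted best pipeline for later reuse.
        with open(os.path.join(out_dir, f"{combo_name}.pkl"), "wb") as f:
            pickle.dump(search.best_estimator_, f)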

Conclusion

In this article, we proposed an organized structure for an ML project that simplifies finding the best hyperparameter configuration as well as the best combination of pipeline components.

If you have any insights or suggestions, please feel free to make changes to the code or let me know.

You can find the complete code here.
