Orchestrating Machine Learning Pipelines


Machine Learning (ML) Pipelines maintain the order of the sequential steps in a workflow, from data ingestion to cleaning, model training, deployment, and monitoring. Codifying the entire process from start to finish in a single automated flow increases the reliability and efficiency of ML models deployed in production.

Orchestrating an ML Pipeline

A typical well-planned pipeline comprises several main stages and substages while remaining easy to understand. Its complexity is determined by how far the pipeline can diagnose and correct errors in its own input data, shift to an alternative path, gather additional input, apply the corrective action, and move forward. In some scenarios, it picks the most suitable algorithm from a predefined set and finds the right hyperparameters to train the model, based on situational trends in the data that yield the most precise outputs.

In simple terms, each stage of a pipeline should process the input it receives from the preceding stage and deliver its output to the succeeding one.
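As a minimal Python sketch of this hand-off pattern (the stage functions and the run_pipeline helper are illustrative, not a prescribed API), each stage simply consumes the output of the stage before it:

```python
# Minimal sketch of the stage-to-stage hand-off described above;
# the stage functions and run_pipeline helper are illustrative only.

def get_source_data(_):
    # Stage 1: ingest raw records from a source
    return [{"age": 34, "spend": 120.0}, {"age": 41, "spend": 80.0}]

def engineer_features(rows):
    # Stage 2: derive a simple engineered feature per record
    return [{**r, "spend_per_year": r["spend"] / r["age"]} for r in rows]

def train_model(features):
    # Stage 3: placeholder for model fitting on the feature table
    return {"model": "fitted", "n_samples": len(features)}

def run_pipeline(stages, data=None):
    # Each stage consumes the output of its preceding stage
    for stage in stages:
        data = stage(data)
    return data

print(run_pipeline([get_source_data, engineer_features, train_model]))
```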

Fig 1: Steps and basic components of an ML Pipeline

Stage 1: Get Source Data

In an ML Pipeline, data can be ingested from several sources: structured data from data engineering pipelines, or unstructured data from direct sources such as blogs, websites, surveys, forms, etc. The variety of data types and the velocity at which the data arrives therefore need to be planned for carefully, while ensuring that the data flowing into the subsequent stage stays uniform. Further, if the pipeline detects missing data points in one source, it can fall back on other sources to replace them.
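One hedged way to sketch this in Python (the column names, the fallback rule, and the ingest helper are assumptions for illustration) is a uniformity check that switches to a secondary source when expected fields are missing:

```python
# Illustrative sketch: ingest from a primary source, check uniformity,
# and fall back to another source when expected columns are missing.
import pandas as pd

EXPECTED_COLUMNS = {"customer_id", "event_date", "amount"}

def ingest(primary: pd.DataFrame, fallback: pd.DataFrame) -> pd.DataFrame:
    """Return a frame with a uniform schema, using the fallback source if needed."""
    if EXPECTED_COLUMNS.issubset(primary.columns):
        df = primary
    else:
        missing = EXPECTED_COLUMNS - set(primary.columns)
        print(f"Primary source missing {missing}; switching to fallback source")
        df = fallback
    # Enforce uniform types before handing over to the next stage
    return df.astype({"customer_id": "int64", "amount": "float64"})
```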

Stage 2: Get Engineered Features

Data cleansing and quality checks occur in the pre-processing stage, and the cleaned data can be used to directly derive variables and engineered features. At this stage of the pipeline, it is important to connect all required variables and create a master table so that the granularity of the data set is maintained. Further, combining the features should not introduce erroneous aggregates or duplicates.
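A small sketch of that granularity guard, assuming a customer-level grain and hypothetical table and column names, could validate the join and check for duplicates explicitly:

```python
# Minimal sketch of building a master table while guarding its grain;
# the table and column names are assumptions, not from the article.
import pandas as pd

def build_master_table(base: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Join engineered features onto the base table at customer grain."""
    master = base.merge(features, on="customer_id", how="left", validate="one_to_one")
    # Guard against fan-out or duplicates introduced by the join
    assert len(master) == len(base), "Join changed the grain of the master table"
    assert not master["customer_id"].duplicated().any(), "Duplicate keys in master table"
    return master
```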

Stage 3: Train the model

The training component of a fully automated model-training pipeline should be empowered to make its own decisions in line with the key metrics it monitors. Here are a few additional steps the main training pipeline should follow (sketched in code after this list):

· Based on the size of the master table, decide on the sample size to be trained so that resources are managed efficiently

· Split the training, test, and validation data sets based on dynamic and intelligent criteria

· Reduce the dimensions while keeping the evaluation metrics within an acceptable operating range

· Tune hyperparameters and obtain the optimum set of values

· Preserve the model artefacts for future runs
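The following scikit-learn sketch ties several of these steps together; the sampling rule, parameter grid, and file name are illustrative assumptions rather than a recommended recipe:

```python
# Hedged sketch of the training steps listed above, using scikit-learn.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Decide on a sample size based on the size of the master table
sample_size = min(len(X), 2000)
X, y = X[:sample_size], y[:sample_size]

# Split train / test (a further validation split could follow the same pattern)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune hyperparameters and obtain the optimum set of values
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10]},
    cv=3,
)
search.fit(X_train, y_train)

# Preserve the model artefacts for future runs
joblib.dump(search.best_estimator_, "model_artifact.joblib")
print("Best params:", search.best_params_, "Test score:", search.score(X_test, y_test))
```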

Stage 4: Apply

Applying, or predicting, is another important stage of the pipeline. It can involve many calculations, and a single prediction might not be directly usable for the end delivery objective. In that case, the data set is divided into subsets (segmentation), and a separate model is created and applied for each subset based on its characteristics.
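One way to picture segment-wise scoring, with hypothetical column names and placeholder models standing in for the real segment models, is to route each subset to its own model:

```python
# Illustrative sketch of segment-wise scoring: a separate model is applied
# to each subset based on its characteristics (names are hypothetical).
import pandas as pd
from sklearn.dummy import DummyClassifier

def predict_by_segment(df: pd.DataFrame, models: dict) -> pd.DataFrame:
    """Apply the model registered for each segment to that segment's rows."""
    scored_parts = []
    for segment, rows in df.groupby("segment"):
        scored = rows.copy()
        scored["prediction"] = models[segment].predict(rows[["feature"]])
        scored_parts.append(scored)
    return pd.concat(scored_parts).sort_index()

# Tiny demo with placeholder models, one per segment
data = pd.DataFrame({"segment": ["a", "a", "b"], "feature": [1.0, 2.0, 3.0]})
models = {
    "a": DummyClassifier(strategy="constant", constant=0).fit([[0.0]], [0]),
    "b": DummyClassifier(strategy="constant", constant=1).fit([[0.0]], [1]),
}
print(predict_by_segment(data, models))
```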

Stage 5: Generate the Intervention

This stage can be coupled with the business metrics that monitor the value levers and the dispersion of actionable endpoints. All outputs from ML Pipelines need to carry enough supporting data points for the end-user to interpret them comfortably and accurately.
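As a small illustration of that last point, with field names that are purely hypothetical, the final output can package the prediction together with the context an end-user needs to act on it:

```python
# Hypothetical example of packaging a prediction with interpretable context.
intervention = {
    "customer_id": 1043,
    "prediction": 0.87,                    # model score
    "value_lever": "churn_risk",           # business metric the score maps to
    "recommended_action": "retention_offer",
    "model_version": "v2.3.1",
    "scored_at": "2022-06-01T09:30:00Z",
}
print(intervention)
```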

Rewards of a Pipeline Structure

Today, many service providers specialize in platforms that make building ML Pipelines easier and faster; Azure, Kubeflow, and AWS are a few examples.

1. Reusability of Modules & Consistency

A typical example of code reuse is treating the test data frame and the prediction data frame in the same way. If this is codified manually, the same functionality may end up being written twice. Likewise, there are many instances where the same preprocessing or treatment needs to be performed several times in an ML process. Such repeated definitions of functions can be minimized when all components are embedded in the pipeline properly and called where necessary.
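A minimal sketch of that reuse point, with an assumed column name and treatment, is a single preprocessing function defined once and called for both training and scoring data:

```python
# Sketch of the reuse point above: one shared preprocessing function,
# called identically for training and prediction frames.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Shared treatment applied to both training and scoring data."""
    out = df.copy()
    out["amount"] = out["amount"].fillna(0.0)
    out["log_amount"] = np.log1p(out["amount"])
    return out

train_df = preprocess(pd.DataFrame({"amount": [10.0, None, 25.0]}))
score_df = preprocess(pd.DataFrame({"amount": [7.5]}))   # same treatment, no duplicated code
```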

2. Modularity

This helps to structure and split the different processes of a pipeline, which keeps even complex code properly organized. Isolating each functional module yields robust, failure-resistant, high-quality code.

3. Ease of Tracking and Versioning

Data and output paths can be tracked easily through different iterations of pipeline runs. A properly structured configuration file can feed all dynamic and situational parameters into the pipeline. The scripts and data can be managed separately for increased productivity.

4. Extensive Collaboration

Machine learning engineers and data scientists get the opportunity to work together in orchestrating these ML pipelines, achieving a smooth flow and transforming the complex system into a fully automated masterpiece!

5. Avoid Data Leakage

ML data leakage refers to the problem of unintentionally passing valuable information about the holdout data sets (test and validation) to the training data set. This occurs when the Data Scientist normalizes, standardizes, or applies other data-scaling methods at the data-processing stage using global minimum and maximum values for most parameters, then splits the data into train and test sets and feeds them into model training. It can be avoided if the split happens before the pre-processing or feature-engineering stage. The modularized functions of the pipeline can then be reused to minimize the burden of processing the data across several iterations.
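A short scikit-learn sketch of the leakage-safe ordering (the data set and model here are placeholders) is to split first and let a Pipeline fit the scaler on the training fold only:

```python
# Sketch of leakage-safe scaling: split first, then the Pipeline learns
# the scaler statistics from the training data only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)            # scaler never sees the holdout data
print("Test accuracy:", pipe.score(X_test, y_test))
```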

6. Flexibility

Modularized components of the pipeline are easy to replace, modify, or rework without affecting other stages/parts of the pipeline. This enables the Plug & Play capabilities where needed.

7. Improved Scalability

When the complete workflow appears as a pipeline, the different components can be scaled separately.

Features of a Production-Ready Pipeline

1) Unit Testing

All the benefits of writing unit tests in software engineering projects apply to data science projects as well. Unit tests help ensure that a particular modularized function in the code executes and generates outputs as expected. This is especially handy when a specific code block needs to be changed or modified without affecting the intended output, so that the lineage does not break.

Therefore, unit testing serves as a safeguard that catches edge cases and prevents unexpected outputs.
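A minimal pytest-style sketch, where the preprocess function under test and the expected behaviour are illustrative assumptions, might look like this:

```python
# Minimal pytest-style test for a preprocessing function (illustrative only).
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna({"amount": 0.0})

def test_preprocess_fills_missing_amounts():
    result = preprocess(pd.DataFrame({"amount": [10.0, None]}))
    assert result["amount"].isna().sum() == 0      # no missing values remain
    assert result.loc[1, "amount"] == 0.0          # edge case: NaN replaced by 0
```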

2) Error Handling and Logging

This is vital when troubleshooting an ML Pipeline. General errors, exceptional errors, warnings, and messages all help the ML Engineer correctly identify the problem, debug or react, and ensure that the error is traceable. There are multiple ways to code the error-handling mechanism in different languages such as Python, Scala, and Java, but the fundamentals are the same: trace statements should be self-explanatory, recording the problem and guiding the team toward the root cause.
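In Python, one common pattern (the stage function and messages below are hypothetical) combines the standard logging module with explicit exception handling so the orchestrator sees the failure while the log explains it:

```python
# Illustrative error-handling and logging pattern for a pipeline stage.
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("ml_pipeline.training")

def train_stage(features):
    try:
        if features is None or len(features) == 0:
            raise ValueError("Training stage received an empty feature set")
        logger.info("Training started on %d rows", len(features))
        # ... model fitting would happen here ...
    except ValueError:
        logger.exception("Training stage failed; check the upstream feature table")
        raise   # re-raise so the orchestrator marks the run as failed
```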

3) Manage Configurations Separately

When orchestrating a data pipeline, environment variables and other dynamic variables need to be fed into the code blocks. For example, the storage locations of the database, intermediate files, and output files used when testing locally differ from the ones targeted in production; hence, the name or address of the database should be a configurable variable. Moreover, in the algorithm section there will be many configurable parameters, such as the number of iterations the model should be trained for, sample sizes, particular variables to include or exclude under different conditions, and the timing windows (date ranges) in which each iteration should execute.

Managing such configuration variables is crucial, since the predictions or main outputs can be affected drastically by these inputs. A well-organized, structured YAML file makes it easy to set these variables painlessly: load the file at application start-up and assign the values to their respective variables.
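A hedged sketch of that start-up step, using PyYAML and a configuration schema that is purely an assumption, might look like this:

```python
# Sketch of loading a YAML configuration at start-up; the keys shown
# (database, training) are assumptions, not a prescribed schema.
import yaml  # provided by the PyYAML package

CONFIG_TEXT = """
database:
  uri: postgresql://localhost/dev_db   # replaced with the production address at deploy time
training:
  n_iterations: 50
  sample_size: 20000
  date_range: {start: 2022-01-01, end: 2022-03-31}
"""

config = yaml.safe_load(CONFIG_TEXT)   # in practice: yaml.safe_load(open("config.yaml"))
db_uri = config["database"]["uri"]
n_iterations = config["training"]["n_iterations"]
date_range = config["training"]["date_range"]   # timing window for this run
```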

Available ML Ops Tools for Pipelines

Azure, Kubeflow, and AWS pipelines are popular platforms for Data Scientists and ML Engineers to build and experiment with ML Pipelines. These are independently executable workflows that encapsulate a series of steps or tasks within the pipeline. Kubeflow, which runs on Docker containers, can be deployed to build ML Pipelines on both Azure and AWS environments.

These off-the-shelf ML Pipeline tools are vital for companies that have budget constraints and cannot maintain a full Data Science team to implement ML projects. They help build stable data flows and environments for data-processing pipelines, so that even without a full-fledged workforce, modern businesses can implement and monitor complex ML Pipelines efficiently.

Conclusion

This article explored how ML Pipelines can be orchestrated from data ingestion to value delivery. The ML components can be divided into different steps, processed, and passed on to the next layer. There are ample rewards in such a pipeline, and it is important to adhere to best practices to ensure a smooth, hassle-free flow. Businesses are now shifting towards ML Pipelines for the flexibility and scalability they provide.

Written by Nadeesha Ekanayake, Senior Data Scientist.

OCTAVE - John Keells Group

OCTAVE, the John Keells Group Centre of Excellence for Data and Advanced Analytics, is the cornerstone of the Group’s data-driven decision making.