7 Key Design Principles to Build Effective Data Pipelines

Design a data pipeline that is efficient and scalable. Learn key principles such as modularity, simplicity, automation and monitoring for optimal performance.


Facing an ever-growing set of new tools and technologies, high-functioning analytics teams have come to rely increasingly on data engineers. Building and managing production data engineering pipelines is an inherently complex process that can prove hard to scale without a systematic approach.

To help navigate this complexity, we have compiled our top pieces of advice for deploying successful machine learning solutions. In this article, we examine some of the key design principles that help data engineers of all experience levels effectively build and manage data pipelines.

Follow a multi-layered approach for processing the data

To maintain high data quality and build pipelines on a strong foundation, consider processing the raw data in the following stages:

1) Read the ingested data sets from disparate sources in data lakes using data connectors and perform data typing, basic cleansing and standardisation.

2) Harmonise and simplify the disparate data sources by combining various data sets and build a unified business domain layer, which can be reused for various analytics and reporting use cases.

3) Engineer the model features that represent the underlying business problem and assist in testing the initial list of hypotheses. In this layer, consider designing for reusability, discoverability, backfilling and precomputation of features.

4) Create a unified view of all the features of the use case at the granularity of the unit of analysis. Thereafter, you can apply feature selection, encoding and imputation to prepare the final model input layer, which contains the actual features to be used by the ML model.
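
To make this concrete, the sketch below shows how the four layers can map onto small, reusable functions, here using pandas with illustrative column names such as order_id, order_value and customer_id. It is a minimal sketch rather than a production implementation; in practice each layer would typically be a separate pipeline stage persisting its output to the data lake.

```python
import pandas as pd


def standardise_raw(raw: pd.DataFrame) -> pd.DataFrame:
    """Layer 1: typing, basic cleansing and standardisation of an ingested source."""
    df = raw.copy()
    df.columns = [c.strip().lower() for c in df.columns]
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["order_value"] = pd.to_numeric(df["order_value"], errors="coerce")
    return df.dropna(subset=["order_id"])


def build_domain_layer(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    """Layer 2: harmonise disparate sources into a reusable business domain table."""
    return orders.merge(customers, on="customer_id", how="left")


def engineer_features(domain: pd.DataFrame) -> pd.DataFrame:
    """Layer 3: engineer reusable features at the chosen unit of analysis (customer)."""
    return (
        domain.groupby("customer_id")
        .agg(total_spend=("order_value", "sum"), order_count=("order_id", "nunique"))
        .reset_index()
    )


def build_model_input(features: pd.DataFrame) -> pd.DataFrame:
    """Layer 4: select and impute features to form the final model input."""
    model_input = features[["customer_id", "total_spend", "order_count"]].copy()
    return model_input.fillna({"total_spend": 0.0, "order_count": 0})
```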

Conceptual view of training pipeline for machine learning

Democratise data with metadata

Data dumped into a data lake is less likely to be reusable if there is no metadata. Leverage an enterprise-grade data governance tool to understand the data’s origin, format, lineage, and how it is organised, classified and connected.

Consider collecting business metadata, which focuses on the content and condition of the data and details related to data governance. In addition, collect technical and operational metadata using metadata collection scripts. This can be done in multiple ways, for example with custom Python functions through Hooks in Kedro, the Get Metadata activity in Azure Data Factory, or ML Metadata in TensorFlow.
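
As a minimal sketch of the scripted approach (assuming pandas DataFrames; the dataset name and source label are illustrative), the function below captures basic technical and operational metadata. In practice you would call it from your pipeline framework, for example from a Kedro hook or an orchestrator task, and push the records to your governance tool.

```python
import datetime as dt

import pandas as pd


def collect_technical_metadata(dataset_name: str, df: pd.DataFrame, source: str) -> dict:
    """Capture basic technical and operational metadata for one data set."""
    return {
        "dataset_name": dataset_name,
        "source": source,  # e.g. upstream system or file path
        "collected_at": dt.datetime.utcnow().isoformat(),
        "row_count": len(df),
        "column_count": df.shape[1],
        "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "null_fraction": df.isna().mean().round(4).to_dict(),
    }


# Illustrative usage with a toy data set
orders = pd.DataFrame({"order_id": [1, 2], "order_value": [10.0, None]})
record = collect_technical_metadata("orders", orders, source="erp_extract")
print(record["schema"], record["null_fraction"])
```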

Establishing a reliable metadata layer improves data alignment across data silos and helps create a uniform language for interpreting data. Depending on the scenario, you could follow one of several metadata architecture patterns: centralised, decentralised or hybrid. Preferably, store and control critical metadata centrally, and manage less critical metadata decentrally.

Make the training data reproducible

In machine learning, reproducibility is the ability to recreate a workflow that reaches the same conclusion as the original work. Irreproducible models can have a significant business impact, leading to lost time and effort, and even loss of reputation. The most straightforward way to achieve reproducibility is to save a snapshot of the raw, curated and model input data every time a predictive model is trained.

Using a specific version of the pipeline source code (maintained in a code repository such as Git) together with the corresponding curated data, it is easier to reproduce a specific version of the model training data, provided the versions are tracked properly. Consider choosing tools that support versioning of both code and data.
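
One lightweight way to do this, shown in the sketch below, is to stamp every training data snapshot with the current Git commit hash and a content hash of the data, so a given model can later be traced back to the exact code and data that produced it. The paths are illustrative, and the function assumes it runs inside a Git working copy; purpose-built tools such as DVC provide similar guarantees with less custom code.

```python
import hashlib
import subprocess
from pathlib import Path

import pandas as pd


def snapshot_training_data(model_input: pd.DataFrame, snapshot_dir: str = "data/snapshots") -> Path:
    """Persist a versioned snapshot of the model input data for reproducibility."""
    # Version of the pipeline code that produced this data (requires a Git repo).
    code_version = (
        subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()
    )

    # Content hash of the data itself, so identical data maps to the same snapshot.
    payload = model_input.to_csv(index=False).encode()
    data_version = hashlib.sha256(payload).hexdigest()[:12]

    out_dir = Path(snapshot_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"model_input_{code_version}_{data_version}.parquet"
    model_input.to_parquet(out_path, index=False)  # requires pyarrow or fastparquet
    return out_path
```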

Use the right version of code, configuration and data to reproduce ML results

Embrace automation in the data pipeline by following MLOps methodology

Adopting an MLOps methodology helps scale analytics use cases from the development stage to the production-ready stage quickly and safely. To efficiently deploy ML models from development into operation, the pre-production environments used to develop and test code must be as close to the production environments as possible.

Automation ensures that responses to various data compliance violations can be made in a timely, reliable and sustainable way. Consider using robust orchestration and workflow management tools to schedule your data pipeline jobs and to automatically retry and restart failed workflow runs when possible. To select the best candidate model, carry out many ML experiments with various features, configurations and models while tracking the metrics and logs.
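
As a small sketch of what automated retries can look like in an orchestrator, the flow below uses Prefect 2.x; the task names and the placeholder extraction logic are illustrative, and other tools such as Airflow offer equivalent retry settings. A transient failure in the extraction step is retried automatically before the downstream step runs.

```python
from prefect import flow, task


@task(retries=3, retry_delay_seconds=60)
def load_orders() -> list[dict]:
    """Extraction step: transient source failures are retried automatically."""
    # Placeholder for a real connector call.
    return [{"order_id": 1, "order_value": 10.0}]


@task
def total_order_value(orders: list[dict]) -> float:
    """Transformation step: runs only once the upstream task has succeeded."""
    return sum(o["order_value"] for o in orders)


@flow
def daily_orders_pipeline() -> float:
    orders = load_orders()
    return total_order_value(orders)


if __name__ == "__main__":
    print(daily_orders_pipeline())
```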

Make the pipeline flexible to handle concept and data drifts

Over time, machine learning models can deteriorate in accuracy and predictive power due to concept drift and data drift. It is important to detect and address these drifts by implementing business-driven data validation rules on the independent and target variables, by monitoring the statistical properties of variables over time, and by continuously re-fitting the models to handle the drift.
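
A simple statistical check can flag when a feature's live distribution has drifted away from the training distribution. The sketch below uses SciPy's two-sample Kolmogorov-Smirnov test; the significance threshold and the simulated data are illustrative choices, and dedicated drift monitoring tools offer richer detection out of the box.

```python
import numpy as np
from scipy.stats import ks_2samp


def feature_has_drifted(train_values: np.ndarray, live_values: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True if the live distribution differs significantly from training."""
    result = ks_2samp(train_values, live_values)
    return result.pvalue < alpha


# Illustrative example: the live feature has shifted upwards.
rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
live = rng.normal(loc=0.5, scale=1.0, size=5_000)
print(feature_has_drifted(train, live))  # True -> trigger investigation or retraining
```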

Illustration of different concept drifts in machine learning (Source: IEEE Computer Society)

Build a flexible and scalable data processing pipeline

To process big data for analytics with different latency requirements, two data processing architectures serve as a backbone: Lambda and Kappa. For many industries where batch and streaming use cases differ, Lambda is the more reliable choice for updating the data lake with large data sets and for building ML models that predict upcoming events robustly. It offers a good balance between speed and reliability by using separate batch and speed layers.

On the other hand, if you want a big data architecture built on a single, less expensive technology stack that can react to unique events as they occur at runtime, select the Kappa architecture for your real-time data processing needs.

Consider storing the data in the data lake in a storage-optimised format such as Parquet, ORC or Avro, as per your needs. Parquet with Snappy compression is highly optimised for Spark-based pipelines.
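
For example, with PySpark the curated layer can be written back as partitioned, Snappy-compressed Parquet. The paths and the partition column below are illustrative; Snappy is also Spark's default Parquet codec, so the explicit option mainly documents intent.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curated-layer-writer").getOrCreate()

# Read a raw CSV drop and write it back as partitioned, Snappy-compressed Parquet.
orders = spark.read.option("header", "true").csv("s3://my-lake/raw/orders/")
(
    orders.write.mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("order_date")
    .parquet("s3://my-lake/curated/orders/")
)
```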

Lambda architecture for data analytics

Leverage a development workflow framework

Frameworks like Kedro, Azure ML pipeline (Python SDK), Prefect and Cookiecutter provide a standard approach that allows you to:

  • Worry less about how to write production-ready code.
  • Spend more time building data pipelines that are robust, modular, scalable, deployable, reproducible and versioned.
  • Standardise the way that your team collaborates across your project.
  • Easily package the project as a Docker container for scalable deployments.
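
For instance, a minimal Kedro sketch might look like the snippet below. The node functions and the catalogue entries raw_orders, orders_cleaned and model_input are illustrative and would be declared in the project's Data Catalog; newer Kedro versions also offer a pipeline() factory in place of the Pipeline class.

```python
import pandas as pd
from kedro.pipeline import Pipeline, node


def clean_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    """Standardise types and drop records without an identifier."""
    df = raw_orders.copy()
    df["order_value"] = pd.to_numeric(df["order_value"], errors="coerce")
    return df.dropna(subset=["order_id"])


def build_model_input(cleaned: pd.DataFrame) -> pd.DataFrame:
    """Aggregate to the unit of analysis used by the model."""
    return cleaned.groupby("customer_id", as_index=False).agg(total_spend=("order_value", "sum"))


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(clean_orders, inputs="raw_orders", outputs="orders_cleaned", name="clean_orders_node"),
            node(build_model_input, inputs="orders_cleaned", outputs="model_input", name="model_input_node"),
        ]
    )
```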

In our experience, it is more effective to use a notebook environment (Jupyter, Databricks) for interactive data analysis, and an integrated development environment (such as PyCharm or VS Code) for developing data pipelines, where coding standards can be enforced.

These guiding principles have been born out of the last 10 years of data engineering for end-to-end machine learning solutions. We are sure there are many other principles, so please do let us know of any further approaches you have found effective in managing data pipelines. We hope you have found this useful and informative in your pursuit of deploying analytics projects.

Authored by Saravanakumar Subramaniam, Principal Data Engineer, and Toby Sykes, Global Head of Data Engineering.

Many thanks to contributors Tom Goldenberg, Junior Principal and Evangelos Theodoridis, Principal, QuantumBlack.

References:

https://databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html

https://kedro.readthedocs.io/en/stable/

https://www.saagie.com/blog/dataops-devops-2-0/

https://dama.org/content/dama-dmbok-2

https://www.tamr.com/blog/key-principles-of-a-dataops-ecosystem/

https://learning.oreilly.com/library/view/data-management-at/9781492054771/

QuantumBlack, AI by McKinsey

We are the AI arm of McKinsey & Company. We are a global community of technical & business experts, and we thrive on using AI to tackle complex problems.