Pipeline Jungles

Suteja Kanuri
Apr 26 · 4 min read

This term came to mind after I had numerous workloads running in production and could no longer guarantee the SLA of my machine learning workloads. I felt I had eventually run into a pipeline jungle problem.

What is a pipeline jungle?

Machine learning workloads are powered by multiple real-time or batch ingestion jobs and feature engineering jobs, most of which are developed and owned by various departments and multiple engineering teams. The data in these jobs is generally sourced from a variety of places: data lakes, data warehouses and event streams. Responsibility and ownership of the correctness and completeness of the data resides with the individual engineering teams. Machine learning workloads are always consumers of these various upstream workloads. When the output of the machine learning model is incorrect, a considerable amount of time is spent debugging the data from all the upstream jobs.

This leads to the problem of the pipeline jungle.

What does a pipeline jungle look like?

Illustration of a typical pipeline jungle

A pipeline jungle consists of various jobs which together solve multiple business problems end to end.

Does this look complex? Definitely to me! How do we decipher the pipeline jungle? How do we navigate it and provide the correct SLA per job to the technology and business teams? How do we ensure the right contract per job is guaranteed for the lifetime of the job?

I have some ideas around it —

  1. Spin up monitoring on top of each job to check for data recency, data quality, data correctness, schema changes and more, and assign a status to each check

Monitoring on top of the pipeline jungle is essential and should run periodically, in batch or in real time depending on how the data flows. This helps in reporting the total number of SLA breaches over a given span of time, and each engineering team can be held responsible for owning the quality of its data. Understanding the various metrics, data quality checks and SLA breaches is essential, as it gives quantifiable reliability on the delivery to the business teams.
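As an illustration, here is a minimal PySpark sketch of such a per-job check. The table layout, the event_ts timestamp column and the particular set of checks are assumptions made for the example, not a prescribed design.

```python
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def check_job_output(table: str, expected_columns: set, max_age_hours: int = 24) -> dict:
    """Run recency, completeness and schema checks on one upstream job's output."""
    df = spark.table(table)

    # Data recency: the latest event timestamp (hypothetical event_ts column)
    # must fall inside the allowed freshness window
    latest_ts = df.agg({"event_ts": "max"}).collect()[0][0]
    recency_ok = latest_ts is not None and \
        latest_ts >= datetime.utcnow() - timedelta(hours=max_age_hours)

    # Data completeness: the table (or latest partition) must not be empty
    count_ok = df.count() > 0

    # Schema check: every expected column is still present
    schema_ok = expected_columns.issubset(set(df.columns))

    return {
        "recency": "Pass" if recency_ok else "Fail",
        "completeness": "Pass" if count_ok else "Fail",
        "schema": "Pass" if schema_ok else "Fail",
    }
```

Each check result can then be written out with a status, so breaches can be counted and reported over time.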

Advantages of monitoring a pipeline jungle

  1. Engineering teams can confidently guarantee the SLA of the workloads to the product owners with greater ease by quantifying them

Ideas to build monitoring jobs around pipeline jungles

(if buying a data observability or monitoring platform is not feasible)

Each engineering team is responsible for delivering certain use cases and workloads in production. Build a reusable monitoring component for every use case, wired to its dependent upstream jobs, which monitors the various metrics and assigns a status to each metric check. Only when all the checks pass should the engineering/ML workload be triggered. This can be a quick, simple way to enforce the idea of monitoring and get initial value from it.

Example: Suppose there is an ML workload W whose upstream jobs are J1, J2 and J3. Develop three monitoring jobs M1, M2 and M3, one per upstream job, that run the various checks and assign a status to each check. M1, M2 and M3 can be Spark jobs, and the results can be stored in Hive tables for simplicity. Workload W should only run if every status from M1, M2 and M3 is a Pass.
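A rough sketch of how that gate could look in PySpark, assuming M1, M2 and M3 write one row per check into a shared Hive table. The monitoring.check_results table and the trigger_workload_w / notify_owning_teams helpers are hypothetical names used only for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Illustrative Hive table written by M1, M2 and M3, one row per check:
# columns: job_id ("J1"/"J2"/"J3"), check_name, status ("Pass"/"Fail"), run_date
results = (
    spark.table("monitoring.check_results")
         .filter(F.col("run_date") == F.current_date())
)

# Gate: workload W is triggered only if no check failed in today's run
failed_checks = results.filter(F.col("status") != "Pass").count()

if failed_checks == 0:
    trigger_workload_w()            # hypothetical trigger for the ML workload W
else:
    notify_owning_teams(results)    # hypothetical alert back to the upstream job owners
```

Keeping the gate as a thin job in front of W means the same pattern can be reused for other workloads by pointing it at their own monitoring tables.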

Sample monitoring output

What are your thoughts on this? How are you navigating your pipeline jungles? I would love to hear from you.
