This term came to mind after I had numerous workloads running in production and could no longer guarantee the SLA of my machine learning workloads. I realised I had run into the pipeline jungle problem.
What is a pipeline jungle?
Machine learning workloads are powered by multiple real-time or batch ingestion jobs and feature engineering jobs, most of which are developed and owned by different departments and engineering teams. The data in these jobs is generally sourced from a variety of places: data lakes, data warehouses and event streams. The responsibility and ownership for the correctness and completeness of the data resides with the individual engineering teams, and machine learning workloads are always consumers of these upstream workloads. When the output of a machine learning model is incorrect, a considerable amount of time is spent debugging the data from all the upstream jobs.
This leads to the pipeline jungle problem.
What does a Pipeline Jungle look like?
A pipeline jungle consists of various jobs that solve multiple business problems end to end.
Does this look complex? It definitely does to me! How do we decipher the pipeline jungle? How do we navigate it and provide the correct SLA per job to the technology and business teams? How do we ensure the right contract per job is guaranteed for the lifetime of the job?
I have some ideas around this:
- Spawn monitoring on top of each job to check for data recency, data quality, data correctness, schema changes and more, and assign a status to each check
- If any of the checks fail, the downstream dependent job should not progress
- The responsibility of monitoring should not be centralised in one team (an SRE or production support team) but must be decentralised across all the engineering teams. This is a scalable way to make the monitoring reliable.
Monitoring on top of the pipeline jungle is essential and should run periodically, in a batch or real-time manner depending on the data flow. This helps in reporting the total number of SLA breaches over a given span of time, so that each engineering team can be held accountable for the quality of its data. Understanding the various metrics, data quality checks and SLA breaches is essential, as it gives quantifiable reliability over the delivery to the business teams.
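As a rough illustration, the per-job checks described above can start out as a handful of small functions that each return a Pass/Fail status. This is a minimal sketch; the check names, thresholds and field names are hypothetical, not a prescribed implementation:

```python
from datetime import datetime, timedelta

# Hypothetical per-job checks: each inspects a job's output and
# returns "Pass" or "Fail". Thresholds and field names are illustrative.

def check_recency(last_updated, max_age=timedelta(hours=1), now=None):
    """Data recency: the job's output must be fresher than max_age."""
    now = now or datetime.utcnow()
    return "Pass" if now - last_updated <= max_age else "Fail"

def check_completeness(rows, required_fields):
    """Data completeness: no required field may be missing or null."""
    ok = all(r.get(f) is not None for r in rows for f in required_fields)
    return "Pass" if ok else "Fail"

def check_schema(rows, expected_fields):
    """Schema change: every row must carry exactly the expected fields."""
    ok = all(set(r) == set(expected_fields) for r in rows)
    return "Pass" if ok else "Fail"

def run_checks(job_name, rows, last_updated, fields, now=None):
    """Run all checks for one job and report a status per check."""
    return {
        "job": job_name,
        "recency": check_recency(last_updated, now=now),
        "completeness": check_completeness(rows, fields),
        "schema": check_schema(rows, fields),
    }
```

A scheduler can invoke `run_checks` for each upstream job on every batch run (or continuously for streaming jobs) and persist the statuses, so SLA breaches can be counted per job over time.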
Advantages of monitoring a Pipeline Jungle
- Engineering teams can confidently guarantee the SLA of their workloads to the product owners by quantifying it
- Time spent debugging by the engineering teams can be drastically reduced when every job is monitored
- Compute resources (which are proportional to cost) can be saved. Example: if monitoring detects a contract failure on one of the many upstream jobs, the downstream jobs do not run until the issue is resolved by the owning team, thereby saving cluster compute resources
- Reports for each job can be pulled periodically to identify the jobs reporting the most SLA breaches; one can then dig deeper to understand which types of checks fail most often and take corrective action
Ideas to build monitoring jobs around pipeline jungles
(if buying a data observability or monitoring platform is not feasible)
Each engineering team is responsible for delivering certain use cases and workloads in production. Build a reusable monitoring component for every use case, driven by the respective dependent upstream jobs, which monitors various metrics and assigns a status to each metric check. Only when all the checks pass should the engineering/ML workload be triggered. This can be a quick and simple way to enforce the idea of monitoring and get initial value from it.
Example: suppose there is an ML workload W whose upstream jobs are J1, J2 and J3. Develop three monitoring jobs M1, M2 and M3, one for each job, to run the various checks and assign a status to each check. M1, M2 and M3 can be Spark jobs, and the results can be stored in Hive tables for simplicity. Workload W should run only if every status from M1, M2 and M3 is Pass.
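A minimal sketch of that gating logic, assuming the statuses written by M1, M2 and M3 are already queryable somewhere (a plain dict stands in for the Hive status tables here):

```python
# Hypothetical gate for workload W: read the latest status reported by
# each monitoring job (a dict stands in for the Hive status tables)
# and trigger W only when every check is a Pass.

def latest_statuses(status_store, monitoring_jobs):
    """Collect the most recent status per monitoring job."""
    return {m: status_store[m] for m in monitoring_jobs}

def can_run_workload(status_store, monitoring_jobs):
    """W may run only if all monitoring jobs reported Pass."""
    statuses = latest_statuses(status_store, monitoring_jobs)
    return all(s == "Pass" for s in statuses.values())

def run_workload_w(status_store):
    """Trigger W, or report which monitoring jobs blocked it."""
    monitoring_jobs = ["M1", "M2", "M3"]
    if not can_run_workload(status_store, monitoring_jobs):
        statuses = latest_statuses(status_store, monitoring_jobs)
        failed = [m for m, s in statuses.items() if s != "Pass"]
        return f"Blocked: {', '.join(failed)} failed"
    return "W triggered"
```

In practice the same gate can live in whatever orchestrator schedules W, so a failed contract upstream stops the run before any cluster resources are spent.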
What are your thoughts on this? How are you navigating your pipeline jungles? I would love to hear about it.