Data Pipelines — Agile considerations

Important aspects of data pipelines

Vimarsh Karbhari
Acing AI
5 min read · May 12, 2020


In the life cycle of an organization, data science teams start by getting all the data into a centralized location such as a data lake. That is followed by building scalable pipelines to ensure data keeps flowing into the data lake on an ongoing basis. The journey of an organization from a nascent stage to a mature company is reflected in its data lineage. Building a data platform and a network of data pipelines is a foundational step for a data science organization. However, I have seen teams optimize far too often for the wrong considerations and principles while building these pipelines.

If the data science team builds the products (the “cars”), then the data pipelines are the “roads” those cars run on.


Typically, the destination for a data pipeline is a data lake, such as Hadoop or Parquet files on S3, or a relational warehouse, such as Redshift or BigQuery, or, increasingly, a Snowflake or Delta Lake instance. While building a data pipeline, it is important to consider a few key aspects:

  • Scalability: If the organization is investing heavily in data science, the goal is to uncover network effects and build insights and products with data. The pipelines will always need to process more data than they were originally built for. A data pipeline should be able to scale to billions of data points, and potentially trillions as the product scales. A high-performing system should not only store this data but also make the complete data set available for querying. The pipelines should be built with vertical and horizontal scalability constraints in mind.
  • Ease of Recovery: People who have worked on data pipelines know that this is one of the most important issues to tackle. Whenever a network partition happens, the nodes in the event bus or the pipelines should be able to recover with minimal human intervention. Much of the investment in data infrastructure over the last few years has gone into this area.
  • Latency: The data science teams should be able to monitor and query recent event data in the pipeline. These query results should be available within minutes or seconds of the event being sent to the data collection endpoint. This is useful for testing purposes and for building data products that need to update in near real-time.
  • Ad-hoc Querying: An ideal data pipeline should support both long-running batch queries and smaller interactive queries that enable teams to explore data and understand the relationships within the data without having to wait minutes or hours when sampling data.

The longer the team has to wait, the larger the entropy between ideas and execution in the team.

  • Versioning: This is an important aspect of a pipeline that sometimes gets ignored. It should be easy to move to older or newer versions of the pipeline if a version change causes data issues on the destination side. Event metadata and the event data itself both need version control, as described in data science version control.
  • Monitoring: Like all engineering systems, pipelines need monitoring. If data stops arriving, either entirely or for a particular section of the pipeline, the data pipeline should generate alerts through the incident management process.
  • Testing: QA in data science talks about how to perform QA for data science projects and pipelines. Testing these pipelines belongs to the ‘system integration testing’ step. Data pipelines usually run either as batch jobs or as long-running streaming applications, but they can become a source of integration issues if their output changes and affects the model or the destination data set. Therefore, it is important to have integration and data tests as part of the deployment pipeline to catch those issues; a minimal sketch of such a data test follows this list.
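To make the testing point concrete, here is a minimal sketch of a data test that could run as part of a deployment pipeline. It assumes a PySpark batch job whose output lands in a hypothetical events_daily table; the table name, columns, and checks are illustrative placeholders rather than a prescribed setup.

```python
# Minimal data-test sketch (pytest + PySpark). The table name, columns and
# checks are hypothetical placeholders; adapt them to your own pipeline output.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Local Spark session, enough for contract/data-quality tests in CI.
    return (
        SparkSession.builder.master("local[2]")
        .appName("pipeline-data-tests")
        .getOrCreate()
    )


def test_events_daily_schema_contract(spark):
    df = spark.table("events_daily")  # output of the batch job under test

    # Schema contract: downstream models break if these columns disappear.
    expected = {"event_id", "event_type", "user_id", "event_ts"}
    assert expected.issubset(set(df.columns))


def test_events_daily_quality(spark):
    df = spark.table("events_daily")

    # Basic data quality: no null keys, no duplicate event ids.
    assert df.filter(df.event_id.isNull()).count() == 0
    assert df.count() == df.select("event_id").distinct().count()


def test_events_daily_freshness(spark):
    # Monitoring-style check: the latest event timestamp should exist.
    latest_ts = spark.table("events_daily").agg({"event_ts": "max"}).collect()[0][0]
    assert latest_ts is not None
```

The same kind of checks, scheduled on a cadence rather than run only at deploy time, can double as the freshness alerts described under Monitoring.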

There are a number of other useful properties that a data pipeline should have, but this is a good starting point for thinking long term.

Tools

If the team prefers open-source tools, it can be easier to version-control, test, and deploy the pipeline. For example, if the team is using Spark, the data pipeline can be written in Scala, and ScalaTest or spark-testing-base can be used to test it end to end. Additionally, the team could package the job as a JAR artifact that can be versioned and deployed through a deployment pipeline using GoCD. This is one end-to-end scenario the team can use. If the team does not need to look back at aggregations, Storm could be used. The event framework could be Kafka or Kinesis, and the destination for this data could be S3. Apache Airflow can also be used to orchestrate pipeline schedules (a minimal DAG sketch follows below). If the team is using Snowflake, Snowpipe provides a good pipeline framework. Databricks Delta Lake also provides methods to build scalable pipelines. Pachyderm uses containers to execute the different steps of the pipeline and also addresses data versioning and data provenance by tracking data commits and optimizing the pipeline. MLflow defines a file format to specify the environment and the steps of the pipeline, and provides both an API and a CLI tool to run the project locally or remotely. DVC can help with versioning.
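As a sketch of the orchestration piece, the DAG below wires a daily extract step to a load step in Apache Airflow. The callables, bucket paths, and table names are hypothetical placeholders, not a reference implementation; the same shape applies whether the heavy lifting happens in Spark, Snowflake, or elsewhere.

```python
# Minimal Airflow DAG sketch: one daily batch run that extracts raw events
# and loads them into the warehouse. Callables, buckets and table names are
# placeholders, not references to a specific production setup.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_events(**context):
    # Placeholder: pull the day's raw events and write them to object storage,
    # e.g. as Parquet files under an s3://example-bucket/events/ prefix.
    ...


def load_events(**context):
    # Placeholder: copy the day's Parquet files into the warehouse, e.g. a
    # COPY or MERGE into a hypothetical events_daily table.
    ...


with DAG(
    dag_id="events_daily",
    start_date=datetime(2020, 5, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract = PythonOperator(task_id="extract_events", python_callable=extract_events)
    load = PythonOperator(task_id="load_events", python_callable=load_events)

    extract >> load  # load only runs after a successful extract
```

Retries and backfill behavior (catchup) are where the ease-of-recovery property from the list above shows up in practice: the scheduler re-runs failed or missed intervals with minimal human intervention.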

Recommendations

Depending on their maturity, teams could adopt one or more of the aspects or properties described above. Additional properties can be added to make pipelines more robust as the pipelines and the organization scale. Teams with heavier data loads (consumer internet companies doing over 100B events/day) usually go with homegrown pipelines, which suit their scale, are relatively cheaper to build and maintain, and are flexible to scale. Teams in enterprise companies that do not need that kind of scale can go with other, less flexible alternatives. Homegrown, flexible pipelines built from open-source components are the holy grail of the ‘roads’ that need to be built for data science teams to produce high-quality ‘cars’.

Subscribe to our Acing Data Science newsletter for more such content.

Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.
