Structuring a Robust Data Pipeline

Talha Malik · Published in The Startup · Mar 3, 2020

We all know and love data. Data holds insight. Data helps you make decisions. But data is a dog (pun intended)… things can get real messy if you’re not careful. Let’s talk about how to build good dogs :)

Prologue

  • “Data pipeline” is a pretty broad term; when I say “data pipeline”, I’m referring to infrastructure that supports ETL workflows. If you don’t know what that is, familiarizing yourself with ETL would provide valuable context.
  • Most ETL workflows are built on top of existing data sources, and the resulting pipelines are brittle and hard to test. To build good pipelines, we need to think about data from the beginning instead of as an afterthought. This article will talk about how to build good data sources, which will make our ETL workflows simpler. We’ll go over properties every ETL workflow should have, but it might be hard to introduce some of these concepts into existing ETL workflows, especially ones that work with aggregated data sources.

Motivation

A data pipeline is just like any other software system:

  • It will start off simple and grow in complexity over time.
  • You will need to make decisions involving tech debt and how to pay that debt down in the future.
  • You will need to worry about scalability, maintainability, and operational stability.

If your data pipeline doesn’t have a strong architectural foundation, it will break. I’ve had the fortune (at the misfortune of my company) of experiencing such a failure. A critical data pipeline for our FinOps department exploded.

  • Some employees were stalled; they couldn’t do their work without the data.
  • Some employees were distracted; they paused their work in progress to get the pipeline working again.
  • Some employees were going crazy; they were trying to find people that could put out the fire.

This incident pointed to the brittleness and instability of not just the pipeline that failed, but other pipelines in our system. After this incident, we invested time in making our pipelines more reliable but our infrastructure is still far from perfect. The anecdotes described in books like The Phoenix Project about what happens when things go wrong are all too real.

In contrast, I’ve also had the fortune of working with an operationally sound pipeline that handles hundreds of millions of events per month. Weeks or even months go by and I almost forget this pipeline exists. It’s reliable, well tested, and easy to maintain. In this article, we’ll be looking at this pipeline and seeing what makes it robust.

Avoiding Data Armageddon


Like many other companies, our engineering department is slowly transitioning from monolithic to service-oriented architecture. Observability is a mandatory requirement for any critical service in our ecosystem. As we spin up more and more critical services, we’ll be generating more and more data. Without a sophisticated solution to manage all that data, we’ll head towards data armageddon — a place where people are running around screaming because data is disorganized, hard to access, and always causing fires.

First off, I would like to give credit where credit is due. “Data Armageddon” was a term coined by a very talented engineer, and my colleague/mentor/friend, Christian B. The pipeline I’m about to introduce was his brainchild.

We worked on separate teams: I was on the Platform team and Christian was on the Data team. Our teams collaborated on a joint initiative to:

  • Increase the observability of our services.
  • Improve our data turnaround time from days to minutes.
  • Explore any future projects Data and Platform might collaborate on.

Our solution involved collecting data generated by several backend systems and storing them in our data warehouse.

Data Structure

Data structure is very important, as it heavily influences your data pipeline. Namely, it will guide the infrastructure you’ll need to support your extraction and transformation steps. We knew this pipeline would be used by all of our backend services. What would our infrastructure look like if each service were responsible for its own data structure?

  • Each service/team would spend time coming to a consensus on a data structure.
  • If services were expected to push their data into the pipeline, they would need to write validation, error handling, and data batching logic.
  • If our pipeline instead pulled data from services, service owners would need to modify the pipeline’s code so that it could understand & pull their data.

What I’ve described is a lot of work, even more so considering every service would need to do it. This isn’t ideal. Our pipeline should be plug-n-play so that we can:

  • Increase operational stability, since service owners aren’t introducing changes to the pipeline and are relying on well-tested code.
  • Decrease onboarding time for services wanting to participate in the pipeline.

In order to create a pipeline any service can use, we need a common data/event structure across all our services. Such an event structure needs to be intuitive and versatile in order to satisfy all of our services’ use cases. An easy way to solve this problem is to choose from structures people are already familiar with. One such class of structures you might consider comes from linguistic typology, or in other words, the sentence structures that underpin different languages.

Deciding on a fixed structure unlocks the ability for you to abstract away the extraction and transformation steps. This abstraction enables you to bake important properties like durability and resiliency into your data pipeline.

We decided to use the subject-verb-object (SVO) structure, leaving services the freedom to decide which subjects, verbs, and objects they want to use. Check out this article for more information on event grammar. Here are some examples of what events would look like (obtained from our data warehouse):
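
The original examples aren’t reproduced here, but to illustrate the idea, SVO events (with a couple of the metadata fields discussed below) might look roughly like this; all field names and values are made up for this sketch:

```python
# Illustrative SVO events; the field names and values are invented for this
# sketch, not copied from the real warehouse.
example_events = [
    {
        "subject": "user:1234",
        "verb": "submitted",
        "object": "assignment:5678",
        "emitted_at": "2020-03-03T14:05:12Z",
        "service_name": "gradebook-service",
    },
    {
        "subject": "instructor:42",
        "verb": "created",
        "object": "course:314",
        "emitted_at": "2020-03-03T14:06:01Z",
        "service_name": "course-service",
    },
]
```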

Once you’ve found a structure that works for you, you can easily enhance the events you emit by attaching:

  • Emitted timestamp / Ingested timestamp
  • Service Name / Process Name
  • Event Version
  • and anything else you might find useful!

Extraction

Once you’ve decided on a data structure, abstracting away the extraction step is relatively straightforward. The approach you adopt will depend on your existing architecture and tools but will likely involve writing a client library that will either:

  • Push data into your pipeline.
  • Configure your service so that your data pipeline can start pulling data.

A common theme, however, is developer experience. We want to make it as easy as possible for developers to use the abstraction. Here’s what our client library looked like:
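
The original library code isn’t shown here; as a rough sketch only, the kind of interface such a library might expose could look like the following (the SVOEvent and EventEmitter names, and everything else below, are illustrative assumptions rather than the real implementation):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SVOEvent:
    """A single subject-verb-object event plus metadata the library fills in."""
    subject: str
    verb: str
    obj: str
    service_name: str
    event_version: str = "1.0"
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class EventEmitter:
    """Developer-facing entry point: build an SVO event and hand it to the pipeline."""

    def __init__(self, service_name: str, transport=print):
        # `transport` is whatever actually ships the event; the real library
        # would default to something like the Kinesis sender sketched further below.
        self.service_name = service_name
        self._transport = transport

    def emit(self, subject: str, verb: str, obj: str) -> None:
        event = SVOEvent(subject=subject, verb=verb, obj=obj,
                         service_name=self.service_name)
        self._transport(event)
```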

Any service wanting to participate in the data pipeline would install the client library in their service. This library would expose an interface enabling developers to send structured SVO events to Amazon Kinesis.

Code that developers would be writing:
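
Using the hypothetical interface sketched above, it would boil down to a couple of lines, something like:

```python
# Inside an API endpoint handler, for example:
emitter = EventEmitter(service_name="gradebook-service")
emitter.emit(subject="user:1234", verb="submitted", obj="assignment:5678")
```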

Here’s the idea of what the library is doing behind the scenes:
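
Roughly speaking, it serializes the event and pushes it into Kinesis. A simplified sketch using boto3 follows; the stream name and partition-key choice are assumptions, not the real configuration, and in this sketch the library would wire send_to_kinesis in as the emitter’s transport:

```python
import json
from dataclasses import asdict

import boto3

kinesis = boto3.client("kinesis")


def send_to_kinesis(event: SVOEvent, stream_name: str = "svo-events") -> None:
    """Serialize an SVO event and push it into the Kinesis stream."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(asdict(event)).encode("utf-8"),
        PartitionKey=event.subject,  # spreads records across shards
    )
```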

The code I’ve shown is very rudimentary. To create a robust client library, you must also consider:

  • Testability - you shouldn’t be emitting events when running local unit tests or continuous integration tests. To solve this problem, you can use dependency injection/patching so that your test framework uses a “mock emitter” while running unit tests.
  • Network Failures - a network request can fail for any number of reasons. Having a retry / alerting policy in place will help you catch & diagnose issues early in the pipeline.
  • Batching - services may emit millions of events. Instead of making a separate network request for each event, you should batch several events into the same network request. Of course, this is only possible if the service ingesting the data supports batched events.
  • Asynchronous Emitting - the code I’ve shown emits events synchronously, resulting in slower response times for the end-user (e.g. if an API endpoint has a response time of 100ms and we introduce a synchronous call with a response time of 50ms to the endpoint handler, then our new API endpoint will have a response time of 150ms). In most cases, events can be emitted asynchronously. This can be done by placing all events on a queue and having a worker running on a separate thread process that queue; see the sketch after this list.
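
Here is a minimal sketch of the queue-plus-worker pattern from the last bullet; it leaves out batching, retries, and graceful shutdown, which a real library would need:

```python
import queue
import threading

_event_queue: queue.Queue = queue.Queue()


def emit_async(event) -> None:
    """Called from request handlers; enqueues the event and returns immediately."""
    _event_queue.put(event)


def _drain_queue() -> None:
    """Background worker: pull events off the queue and send them synchronously."""
    while True:
        event = _event_queue.get()
        try:
            send_to_kinesis(event)  # the synchronous send sketched earlier
        except Exception:
            pass  # a real library would retry and/or alert here
        finally:
            _event_queue.task_done()


threading.Thread(target=_drain_queue, daemon=True).start()
```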

Transformation

Now we’ll develop the transformation step. The transformation step is arguably the most failure-prone component of your pipeline because it’s the part most subject to change, even more so if you’re working with multiple data sources/sinks. When designing your transformation step, make sure you think about:

  • Validation - If you have data coming from many different places, you’ll need to validate the data to make sure it isn’t corrupted. The transformation step is an excellent place to validate data coming in and data going out; a small validation sketch follows this list.
  • Testability - Pipelines can be hard to test, especially when data comes from many different places with no deterministic structure. In cases where you do know the structure of the data (like in our pipeline), you should write integration tests to make sure your transformations work as expected.
  • Scaling - If you’re streaming data (like we are) then chances are your data volume will vary throughout the day. Adopting a solution that can scale may result in substantial cost savings and will ensure your pipeline doesn’t crash under high pressure.
  • Deployment Automation - Making use of scripts or infrastructure as code for deployments will increase your operational stability and help you quickly recover from mistakes.
  • Monitoring - Capturing metrics such as CPU usage, memory usage, events processed, number of exceptions, etc. allows you to get a holistic view of how well your pipeline is working. Solutions that are ill-suited to capture these metrics are hard to debug.
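
As a small example of the validation point above (and since JSON Schema comes up again later in this article), here’s roughly what validating an incoming SVO event might look like with the jsonschema library; the schema itself is an illustrative guess, not our real one:

```python
import jsonschema

# Illustrative schema; the real one would be stricter and versioned.
SVO_EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "subject": {"type": "string"},
        "verb": {"type": "string"},
        "object": {"type": "string"},
        "emitted_at": {"type": "string"},
        "service_name": {"type": "string"},
    },
    "required": ["subject", "verb", "object"],
}


def validate_event(event: dict) -> None:
    """Raises jsonschema.exceptions.ValidationError if the event is malformed."""
    jsonschema.validate(instance=event, schema=SVO_EVENT_SCHEMA)
```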

There are many different ways to implement your transformation step. It could be in the form of a cron job that runs every 5 minutes, a process running on your server, etc…

We decided to use an AWS Lambda function which gets triggered when there are new events to process. Adopting a serverless solution with a mature cloud provider meant we would get scaling and monitoring for free!

Tying It All Together

Here’s a holistic view of our pipeline:

  • The pipeline starts with services sending their SVO events to AWS Kinesis.
  • AWS Kinesis is configured to batch incoming events into files. Each file holds approximately 5 minutes’ worth of events.
  • These files are stored in AWS S3 where they await processing.
  • When a new file is stored in S3, an AWS Lambda function is triggered with that file’s key.
  • The AWS Lambda function reads, validates, transforms, and loads the events in the file to Google BigQuery (our data warehouse).
  • Once the file has been processed, the lambda deletes the file from S3.
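
To make the flow concrete, here’s a rough sketch of what such a Lambda handler might look like; the bucket layout, table name, newline-delimited file format, and the validate_event helper are all assumptions for illustration, not the real implementation:

```python
import json

import boto3
from google.cloud import bigquery

s3 = boto3.client("s3")
bq = bigquery.Client()

TABLE_ID = "my-project.events.svo_events"  # illustrative BigQuery table


def handler(event, context):
    """Triggered by S3: read a batch file, validate, load to BigQuery, then delete it."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the batch file Kinesis wrote to S3 (newline-delimited JSON assumed).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line]

        # Validate (and, if needed, transform) each event before loading it.
        for row in rows:
            validate_event(row)  # e.g. the JSON Schema check sketched earlier

        # Load into the warehouse; only delete the source file if the load succeeded.
        errors = bq.insert_rows_json(TABLE_ID, rows)
        if errors:
            raise RuntimeError(f"BigQuery load failed: {errors}")

        s3.delete_object(Bucket=bucket, Key=key)
```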

Let's see what makes this a robust data pipeline.

No Data Loss

Before our events are processed, they are saved in a durable store: S3. If our lambda throws an exception while it's processing an event file, we don’t lose any data as the file still exists in S3. Once we’ve diagnosed and resolved the issue, we can have the lambda try to reprocess that file. We can safely delete the file from S3 after we’ve successfully loaded the data into our data warehouse.

Saving data in a temporary location while it awaits processing is a very powerful pattern used by various distributed systems. For example, in message brokers, events remain in the queue until a consumer deletes them. Using this pattern helps us avoid failures due to:

  • Malformed/corrupted events and files
  • Buggy code
  • Transient network issues

Bought Not Built

Building your own solutions is a lot of fun but…

Using third-party tools helps us deliver more value at less cost and in less time.

By leveraging AWS and other off-the-shelf tools, we’ve avoided having to build a lot of functionality:

  • AWS Kinesis handles data ingestion and the batching of events into files.
  • AWS Lambda automatically scales to handle the number of incoming events/files. It’s also tightly integrated with AWS CloudWatch, which enables us to monitor several metrics.
  • JSON Schema helps us validate events against an incoming/outgoing schema.

By using mature & well-tested technologies to obtain the properties we wanted our pipeline to have, we’ve saved costs pertaining to development, maintenance, and operations.

Plug-n-Play

Services wanting to participate in the pipeline can do so by simply installing & using the client library. They don’t have to introduce any changes to the pipeline, and with fewer changes there are fewer failures. As a bonus, we’ve saved developers the time it would’ve taken to:

  • Agree on a data structure.
  • Write code to enable batched asynchronous event transmission.
  • Write unit & integration tests to make sure their solution works.
  • Debug any issues & put out any fires.
  • and much more!

In fact, our pipeline is so easy to use that in addition to all backend services, we’ve onboarded our frontend and mobile clients.

Conclusion

We’ve covered a lot of ground. We’ve talked about certain properties data pipelines should have and seen how they can be used in practice to avoid data armageddon. Of course, good code/architecture by itself isn’t enough to avoid failures. You’ll need to strive to create a culture of operational excellence. There’s a lot to say about that, but here are some ways to get started:

  • Make sure you’re not the only one that understands how the pipeline works.
  • Hold monthly fire drills to make sure everyone on your team knows what to do in the event of a failure.
  • Plot important metrics on a dashboard and check that dashboard once every day. (This can be done as a team exercise during agile standups).
