A real data pipeline manifest

Saeed Zareian
2 min read · Nov 26, 2018


I have run into situations where I was told that someone had built a “data pipeline” for some use case, and I was asked for my opinion in high-level terms.

I can imagine that sharing my thoughts could benefit many people out there, so here we go.

A pipeline is a pipeline is a pipeline

Water filtration pipeline model [1]
  1. Pipes are blind to the data, but they have a capacity. When you send data through a direct TCP/API connection, does that mean your data is transferred over a medium with capacity? Of course not.
  2. Valves are important to redirect the data. They are the control mechanism.
  3. Buffer tanks sit in the middle to process the water; however, they don’t keep state, and their capacity depends on the pipes and the flow. Usually there is one type of task per buffer tank to simplify the evaluation.
  4. The water flows one-way and there are absolutely no loops.
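
The "no loops" rule can be checked mechanically. Below is a minimal sketch, assuming the pipeline is described as a mapping from each stage to the stages that feed it; the stage names are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# Hypothetical stage names; each key lists the stages that feed into it.
upstream = {
    "collector": set(),
    "enricher": {"collector"},
    "loader": {"enricher"},
}

def flow_order(upstream):
    """Return the stages in flow order, or raise ValueError if the flow loops."""
    try:
        return list(TopologicalSorter(upstream).static_order())
    except CycleError as err:
        # The second element of err.args holds the stages forming the loop.
        raise ValueError(f"pipeline contains a loop: {err.args[1]}") from err

order = flow_order(upstream)
print(order)
```

If a stage were wired back into an earlier one, `flow_order` would raise instead of returning an order, which is the one-way property from point 4.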

Now we can compare the model above to a classic ETL pipeline for web analytics, for example the Snowplow pipeline:

Blocks are the processing buffer tanks and the circles are the pipes

Pipeline Engineering

Capacity estimation

It is hard to believe how many engineers have designed their data pipelines without estimating and experimentally testing the capacity of the flow. Most of the time they say: “We didn’t have the time.” Some bottlenecks need to be identified before it is too late.
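
Even a back-of-the-envelope estimate helps here. The sketch below assumes you have measured a sustained capacity for each element; all element names and numbers are made up for illustration. The pipeline as a whole can only sustain what its narrowest element can.

```python
# Hypothetical sustained capacity per element, in events per second.
capacity_eps = {
    "collector": 5000,
    "raw stream": 2000,
    "enricher": 1200,
    "enriched stream": 2000,
    "loader": 3000,
}
expected_peak_eps = 1500  # assumed peak inflow, also hypothetical

# The narrowest element limits the whole pipeline.
bottleneck = min(capacity_eps, key=capacity_eps.get)
pipeline_capacity = capacity_eps[bottleneck]
headroom = pipeline_capacity - expected_peak_eps

print(f"bottleneck: {bottleneck} at {pipeline_capacity} events/s")
print(f"headroom at peak: {headroom} events/s")
```

A negative headroom, as with these example numbers, means the pipeline would fall behind at peak load, and it points at the element to experiment with first.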

Latency estimation

A fast pipeline is ideal, but guess what! Wider pipes (i.e. many Kinesis shards) don’t mean your data arrives faster. Latency depends more on the number of steps and their processing time.
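
To make the difference concrete, here is a rough sketch with hypothetical numbers: throughput scales with the number of shards, while end-to-end latency is approximated as the sum of the per-step times and does not change when the pipe gets wider. The step times and the per-shard rate are assumptions, not measurements.

```python
# Hypothetical time one event spends in each step, in milliseconds.
step_ms = [200, 1500, 300, 4000]

def end_to_end_latency_ms(step_ms):
    """Latency of one event: every step adds its own time."""
    return sum(step_ms)

def throughput_eps(shards, per_shard_eps=1000):
    """Width of the pipe: more shards carry more events per second."""
    return shards * per_shard_eps

for shards in (1, 4, 16):
    print(shards, throughput_eps(shards), end_to_end_latency_ms(step_ms))
```

Going from 1 to 16 shards multiplies the throughput by 16 in this model, but each event still needs 6000 ms to get through. Shortening or removing a step is what reduces latency.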

Crisis plans

Most failure scenarios raise open-ended questions:

  1. A broken pipe/tank: Is there secondary hardware to take over?
  2. Too much data flowing in or out: Are the flow statistics visible for every element? Has someone thought about scalability?
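
One way to make flow statistics visible is to keep in/out counters per element and watch the backlog. This is only a sketch: the `FlowStats` class, the element names, the counts and the threshold are all hypothetical, and a real setup would read these values from whatever monitoring is already in place.

```python
from dataclasses import dataclass

@dataclass
class FlowStats:
    """In/out counters for one pipeline element (hypothetical structure)."""
    name: str
    events_in: int = 0
    events_out: int = 0

    @property
    def backlog(self):
        # Events that entered the element but have not left it yet.
        return self.events_in - self.events_out

def overloaded(stats, max_backlog):
    """Names of elements whose backlog exceeds the allowed maximum."""
    return [s.name for s in stats if s.backlog > max_backlog]

stats = [
    FlowStats("collector", events_in=10_000, events_out=10_000),
    FlowStats("enricher", events_in=10_000, events_out=7_500),
    FlowStats("loader", events_in=7_500, events_out=7_400),
]
print(overloaded(stats, max_backlog=1_000))
```

With these example counts only the enricher is flagged, which is the kind of signal you want before too much data turns into a crisis.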

Footnotes:

[1] Taken from: https://medium.com/@anasbaig/how-safe-is-your-municipal-water-supply-c60dda744516
