Building Robust Data Pipelines

Rajat Gupta
Aug 26

Technology has always been about the data. Nowadays, we deal with ever more data that we bring to bear to solve problems. Bringing various bits of data together starts with a data pipeline. A pipeline can be a large funnel, like a feed of Facebook or Bloomberg data, or a small one carrying a few thousand records a day. Companies tend to have a few large feeds and many smaller ones. The large feeds get a lot of attention, and IT works on them. The smaller feeds are handled somewhat manually: a large number of people across the organization end up checking daily that things are working. What this really means is that data flowing around the company isn’t reliable, and when change occurs, things break in unpredictable ways.

Change will often impact systems. The key point is that the impact of a breakage is unpredictable, because pipelines lack sufficient checks and lack ways to monitor and manage breaks. Once ‘bad’ data gets through, it takes a lot of effort to remove it.

In a more ideal world, a data pipeline would be modeled with these stages: sourcing, preparation, validation, repair, and publishing.

There are many tools for sourcing and preparing data. Validations tend to be added over time, as breaks occur. Repair tools are very rarely bundled as part of a data pipeline. Very often, bad data is ingested and only validated later, which forces the applications to deal with it. Some applications have repair and re-processing capabilities. Most often, though, the expectation is that data coming from another system should be good, and that IT people can deal with the breaks manually.
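To make the idea of validating and repairing before ingestion concrete, here is a minimal Python sketch. It is an illustration only and not how DPR or any particular tool works. The record shape (a dict with `id` and `amount`), the specific checks, and the names `validate`, `repair`, and `run_pipeline` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterable

Record = dict  # hypothetical record shape: one plain dict per row


@dataclass
class PipelineResult:
    published: list = field(default_factory=list)    # clean records
    quarantined: list = field(default_factory=list)  # (record, problems) pairs


def validate(record: Record) -> list:
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    amount = record.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount is not a number")
    return problems


def repair(record: Record) -> Record:
    """Attempt one simple automatic fix: parse numeric strings."""
    fixed = dict(record)
    amount = fixed.get("amount")
    if isinstance(amount, str):
        try:
            fixed["amount"] = float(amount.strip())
        except ValueError:
            pass  # leave as-is; re-validation will flag it
    return fixed


def run_pipeline(source: Iterable[Record]) -> PipelineResult:
    """Source -> prep -> validate -> repair -> publish (or quarantine)."""
    result = PipelineResult()
    for raw in source:
        # Prep: normalize field names (an assumed preparation step).
        record = {key.strip().lower(): value for key, value in raw.items()}
        problems = validate(record)
        if problems:
            record = repair(record)
            problems = validate(record)  # re-check after the repair attempt
        if problems:
            # Bad data is held back instead of reaching downstream applications.
            result.quarantined.append((record, problems))
        else:
            result.published.append(record)
    return result


feed = [
    {"ID": "a1", "Amount": 10},
    {"id": "a2", "amount": " 12.5 "},
    {"id": "", "amount": 3},
    {"id": "a4", "amount": "n/a"},
]
result = run_pipeline(feed)
```

In this sketch, records that still fail validation after the repair attempt are set aside in `quarantined` together with the reasons, so someone can fix and re-process them without the bad rows ever reaching the consuming applications.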

Do your warehouses have validation and repair facilities?

With data lakes, how often is bad data part of the mix?

Imagine how much faster your company could be if data were clean and people could focus on using it.

DPR by Qvikly is built with this data pipeline design in mind — it fundamentally includes the validations and repair capabilities alongside sourcing, prep, and publishing. Read more about how DPR can help you implement robust data pipelines.
