Technology has always been about data. Today we bring ever more data to bear to solve problems, and the act of bringing various bits of data together starts with a data pipeline. A pipeline can be a large funnel, like a feed of Facebook or Bloomberg data, or a small one carrying a few thousand records a day. Companies tend to have a few large feeds and many smaller ones. The large feeds get most of the attention, and IT maintains them. The smaller ones are handled somewhat manually: a large number of people across the organization spend part of each day making sure things are working. What this really means is that the data flowing around the company isn't reliable; things break when change occurs, and they break in unpredictable ways.
Change will often impact systems. The key problem is that the impact of a breakage is unpredictable, because pipelines lack sufficient checks, as well as ways to monitor and manage breaks. And once bad data gets through, it takes a lot of effort to remove.
In a more ideal world, a data pipeline would be modeled as a sequence of first-class stages: source, prep, validate, repair, and publish.
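As a minimal sketch of that model (all names here are illustrative, not taken from any specific product): records are validated before publishing, failing records are passed through repair rules and revalidated, and anything still broken is quarantined instead of flowing downstream.

```python
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    # validators: record -> error message, or None if the record is clean
    validators: list = field(default_factory=list)
    # repairers: record -> (possibly) repaired record
    repairers: list = field(default_factory=list)

    def run(self, records):
        """Validate each record; attempt repair on failures; quarantine
        anything that still fails so bad data is never published."""
        published, quarantined = [], []
        for record in records:
            errors = [e for v in self.validators if (e := v(record))]
            if errors:
                for repair in self.repairers:
                    record = repair(record)
                errors = [e for v in self.validators if (e := v(record))]
            (quarantined if errors else published).append(record)
        return published, quarantined

# Example: a price must be positive; one repair rule fixes sign errors.
pipe = Pipeline(
    validators=[lambda r: None if r.get("price", 0) > 0 else "non-positive price"],
    repairers=[lambda r: {**r, "price": abs(r["price"])}],
)
ok, bad = pipe.run([{"price": 10}, {"price": -5}, {"price": 0}])
# ok  -> [{"price": 10}, {"price": 5}]   (the -5 record was repaired)
# bad -> [{"price": 0}]                  (still invalid, so quarantined)
```

The point of the sketch is structural: validation and repair sit inside the pipeline, before publishing, rather than being left to each consuming application.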
There are many tools for sourcing and preparing data. Validations are added over time, as breaks occur. Repair tools are very rarely bundled as part of a data pipeline. Bad data is very often ingested first and validated only later, which forces applications downstream to deal with it. Some applications have repair and re-processing capabilities, but most often the expectation is that data coming from another system should be good, and that IT can deal with breaks manually.
Do your warehouses have validation and repair facilities?
With data lakes, how often is bad data part of the mix?
Imagine how much faster your company could move if its data were clean and people focused on using it.
DPR by Qvikly is built with this data pipeline design in mind: it includes validation and repair capabilities alongside sourcing, prep, and publishing. Read more about how DPR can help you implement robust data pipelines.