Every data pipeline needs a real-time dashboard of business data
Show them the data; they’ll tell you when it’s wrong
How often do you build a data pipeline to ingest some data without being quite sure that you are processing the data correctly? Are the outliers you are clipping really a result of malfunctioning equipment? Is the timestamp really in UTC? Is this field populated only if the customer accepts the order?
If you are diligent, you will ask a stakeholder these questions at the time you are building the pipeline. But what about the questions you didn’t know you had to ask? What if the answer changes next month?
One of the best ways to get many eyes, especially eyes that belong to domain experts, continually on your data pipeline is to build a visual representation of the data flowing through it. By this, I don’t mean the engineering bits — not the amount of data flowing through, the number of errors, the number of connections, etc. You should build a visual representation of the business data flowing through.
Build a dashboard of what’s meaningful about your data to your stakeholders. This means that you will show the number of times a particular piece of equipment malfunctioned in the past, and whether it is malfunctioning now. To figure this out, you will use the number of outliers that you were clipping out in your data pipeline. Build the dashboard, share it widely, and wait.
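As a rough illustration of the idea above, here is a minimal Python sketch of a clipping step that also counts how many values it clipped per piece of equipment, so that the count can feed a dashboard. The names (Reading, clip_and_count, the pump IDs) and the 0 to 100 valid range are all made up for illustration; your pipeline and thresholds will look different.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Reading(NamedTuple):
    """Hypothetical sensor reading; the field names are assumptions."""
    equipment_id: str
    value: float


def clip_and_count(
    readings: Iterable[Reading], low: float, high: float
) -> Tuple[List[Reading], Counter]:
    """Clip values to [low, high] and count clipped values per equipment."""
    clipped: List[Reading] = []
    outliers: Counter = Counter()
    for r in readings:
        if r.value < low or r.value > high:
            # This is the business signal: treat it as a possible malfunction.
            outliers[r.equipment_id] += 1
        clipped.append(Reading(r.equipment_id, min(max(r.value, low), high)))
    return clipped, outliers


if __name__ == "__main__":
    sample = [
        Reading("pump-1", 5.0),
        Reading("pump-1", 250.0),  # out of range, will be clipped
        Reading("pump-2", -3.0),   # out of range, will be clipped
        Reading("pump-2", 40.0),
    ]
    cleaned, malfunction_counts = clip_and_count(sample, low=0.0, high=100.0)
    # malfunction_counts is what you would chart on the dashboard.
    print(dict(malfunction_counts))
```

The point is not the clipping itself but that the pipeline keeps the per-equipment count instead of throwing it away; that count is the number your domain experts can sanity-check.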
Real-time dashboards are catnip. People are drawn to them. The day that those outlier values are produced for some reason other than malfunctioning equipment, someone will call you and let you know.
This works only if the dashboard in question is web-based or integrated into the everyday systems that decision-makers use. There are free dashboard tools like Data Studio, and tools like Tableau and Looker have free tiers. Learn how to use them to spread your semantic burden.
[For more tips on building data science pipelines, read my O’Reilly Media book Data Science on the Google Cloud Platform.]