Perfect Data Quality?
Can a Data Manager ever raise a hand in front of their CTO or DGO and say they have achieved perfect data quality? The answer, unfortunately, is a big ‘NO’.
Why? Because there is no such thing as perfect data; it’s a mirage which, if you start chasing it, will leave you stranded in the middle of a desert. So, is it worth trying to remediate bad data? Yes, of course, but we have to take a selective approach. The Pareto principle, the 80–20 rule, comes to the rescue here: identify the 20% of issues that, when fixed, clean up 80% of the data. Easier said than done, but at least it is achievable and worth aspiring to.
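To make the 80–20 idea concrete, here is a minimal sketch of that prioritization: given counts of bad records per issue type, pick the smallest set of issues that covers roughly 80% of the bad data. The issue names and counts below are purely hypothetical, and real prioritization would also weigh business impact and cost to fix.

```python
# Hypothetical counts of bad records per DQ issue type.
issue_counts = {
    "missing_country_code": 41_000,
    "invalid_email_format": 23_500,
    "duplicate_customer_id": 12_000,
    "stale_address": 2_100,
    "bad_phone_precision": 900,
}

total = sum(issue_counts.values())
covered, selected = 0, []

# Greedily take the biggest issues until ~80% of bad records are covered.
for issue, count in sorted(issue_counts.items(), key=lambda kv: kv[1], reverse=True):
    if covered / total >= 0.80:
        break
    selected.append(issue)
    covered += count

print(selected)  # the few issues worth fixing first
```

In this toy example, two of the five issue types account for over 80% of the bad records, which is exactly the selectivity the Pareto approach is after.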
Even more important is to realize that fixing the process matters far more than fixing the data. I am not trying to imply that data remediation is not important. It is, and it needs to be done. But it should be more of a correction for exceptions, or a one-time effort when you are laying down the foundations at the outset of your Data Quality program.
The focus should always be to understand, first, the data flow: How is the data generated? Which systems process, transform or tinker with it? Where is it stored for consumption? Which systems consume it? What is the impact of bad data on the consumers?
The second step should be to understand where DQ errors can be ingested into the system. What are the manual data entry points? Which data transformation systems touch the data type, precision, etc.?
Once you have identified the vulnerable points and risks, the next step is mitigation. Usually this is done through native error handling and exception handling in your ETL systems. Establishing validation points before a data handshake happens between two systems, and at manual data entry points, will prevent many DQ errors.
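A validation point at a handshake can be as simple as a gate that quarantines records failing basic rules instead of passing them downstream. The sketch below is illustrative only: the field names (`customer_id`, `email`) and the rules are assumptions, not a prescription for any particular system.

```python
import re

# Assumed rule: a loose email pattern, just for illustration.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record):
    """Return a list of DQ errors; an empty list means the record may pass."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    email = record.get("email", "")
    if email and not EMAIL_RE.match(email):
        errors.append(f"invalid email: {email}")
    return errors

def handshake(records):
    """Split records into those safe to hand off and those to quarantine."""
    passed, quarantined = [], []
    for r in records:
        errs = validate_record(r)
        (quarantined if errs else passed).append((r, errs))
    return passed, quarantined
```

The point is that bad records never silently cross the system boundary; they land in a quarantine queue where the stewardship team can review them as exceptions.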
Now that you have reached a stage where you understand the data flow and the DQ pain points, and have taken steps to mitigate the risks, what next? The buck definitely doesn’t stop here.
Data Quality is a continuous, iterative endeavor. Once you have the basic framework mentioned above, the focus should shift to the exception identification process. Regular monitoring and checks need to be established to help identify data quality errors, and the frequency of such checks should be based on the data’s criticality and the type of data store (operational/transactional/analytical/legal). Where possible, open channels of communication between your Data Stewardship team and your Data Analysts & Business Analysts. The Data Stewards may, over time, become the subject matter experts for the data, but the DAs & BAs will always have more insight into the business, and more often than not they will be able to tell when something looks unusual on their reports or queries. It may not be a data problem every time, but it should definitely raise a flag for the stewards to verify that everything looks good.
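The monitoring idea above can be sketched as a simple check whose frequency is driven by the type of data store. Everything here is an assumption for illustration: the frequencies, the 2% null-rate threshold, and the monitored column would all come out of the conversation between stewards and the business.

```python
from datetime import timedelta

# Assumed check cadence per store type; real values are a business decision.
CHECK_FREQUENCY = {
    "operational": timedelta(hours=1),
    "transactional": timedelta(hours=4),
    "analytical": timedelta(days=1),
    "legal": timedelta(days=7),
}

def null_rate(rows, column):
    """Fraction of rows where the monitored column is missing or empty."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) in (None, "")) / len(rows)

def dq_check(rows, column, threshold=0.02):
    """Raise a flag (exception) when the null rate exceeds the agreed threshold."""
    rate = null_rate(rows, column)
    return {"column": column, "null_rate": rate, "exception": rate > threshold}
```

A check like this would run on the schedule implied by `CHECK_FREQUENCY`, and any flagged exception feeds the stewards’ review queue rather than triggering an automatic fix.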
Trust me, here again, you will not be able to fix everything; review, analysis and prioritization will be key. Yet again, the communication and dialog between the business and the stewards will come in handy in determining priority. Also, beware of enhancements that the business wishes for arriving in your queue disguised as data quality issues. Part of the prioritization process should be to segregate DQ issues from enhancements so that they can be routed to your production support or development teams as needed. The correction will usually be handled by these two teams, but the DQ or stewardship team should validate that the fix actually resolved the issue before asking the business to verify at their end.
So, overall, at a very high level, this is what the entire process should look like:
Data Governance & Quality is a huge and growing topic; we will continue to deep dive into its various aspects in the following posts. But based on this post, is there any other step you think should be included? Any other factors that need to be considered? Let me know your thoughts via the comments below.
Originally published at theobservinganalyst.blogspot.com on May 17, 2016.