Your Report is Wrong.

Why data quality matters, how to transform poor data quality into good data quality and, most importantly: how to keep it there.

Julie Thomson
Slalom Data & AI
Aug 16, 2019

--

Packing up your things for the day before heading home, you spot a project manager out of the corner of your eye, rushing to your desk.

“Can we sit down? This report is wrong, can you fix it please? It’s urgent.”

“Wrong” is typically caused by one of two issues:

1. The calculation you made in the report is wrong. Whoops. Mistakes happen.

2. The data pipeline that feeds this report is inaccurate. Not your fault, but you are responsible for the data within the report, even if the data pipeline falls ‘outside’ of your own responsibility.

Your laptop comes back out of the backpack (feeling 10 lbs heavier than it did 5 minutes ago) and your ‘Inspector Gadget mode’ switches on. The root cause of the incorrect figures in the report is Issue #2: the data pipeline that feeds the report is inaccurate.

Companies strive towards the green field of ‘data-driven decision making’, whether through simple interactive dashboards or real-time machine learning output. Data is often the highest-valued asset in any company, and the foundation of most decision making.

While this post is not a comprehensive list, below are a few lessons learned outlining why data quality matters, steps to move historical data towards better data quality, and ways to build resiliency into future data pipelines.

Why this matters: Data quality = Water Quality.

  1. Decisions matter.
  2. Reports that enable decisions matter.
  3. Data presented within those reports that enable decisions matter.
  4. Data quality matters.

I think about data quality as water quality. We expect that when we turn on the tap, the water is clean and safe to repurpose for whatever end product we need it for (drinking, showering, cooking). When water is not safe to drink it can make a community sick, just as a bad business decision can be made from bad data. Public service announcements urging families to boil the heck out of whatever is coming into the kitchen can lead to prolonged distrust, or to excessive time spent preparing and ensuring water is safe to consume. That is a similar time investment to analysts spending countless hours reconciling report numbers, making sure each figure matches exactly across disparate systems before publishing to the rest of the division.

Just as water is the foundation for an individual maintaining a standard level of health, strong data quality is the foundation for a healthy business.

How do I know if my department has good quality data?

Generally, data quality issues are not announced over a loudspeaker at a company. They are identified over multiple shoulder taps asking an analyst to explain why reports do not match, or to research why “this number feels wrong”. In most cases, data quality issues are not tracked, recorded or reported.

Identifying ‘good data quality’ can typically be done by asking a set of example questions: Do reports of the same metric match across systems? Do the numbers reconcile back to the source? Do analysts trust the figures enough to publish them without hours of manual checking? If any of these questions leads to anything less than a ‘yes’, there is an area of opportunity for better data quality.

If I don’t trust the data quality, what should I do?

Below are a few recommended first steps towards retroactively fixing poor data quality.

1. Marie Kondo your VIP historical data.

You know, keep the data that ‘sparks joy’.

It’s great to keep ‘all of the data’, but if it’s not trusted, pulling trivial numbers won’t get a business stakeholder very far when they are trying to put together a bulletproof analysis. Keep the pieces that are usable, trustworthy and able to be repurposed quickly. The rest can be put into an ‘archive’ for back-of-the-napkin number crunching.

Historical data often propagates variations of itself as analyses are completed and then abandoned after the next round of business questions. Tables such as “JT_User_Agg_V2” should be moved to the archive if they were created for an analysis that has come and gone.
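As a minimal sketch of how to find those candidates, assuming you can pull last-queried timestamps from your warehouse’s metadata (the table names, dates and 180-day threshold below are all hypothetical), a small script can flag what belongs in the archive:

```python
from datetime import datetime, timedelta

# Hypothetical metadata pulled from a warehouse's information schema:
# table name -> the date it was last queried.
last_queried = {
    "daily_sales": datetime(2019, 8, 10),
    "JT_User_Agg_V2": datetime(2018, 11, 2),   # one-off analysis table
    "customer_master": datetime(2019, 8, 15),
}

# Assumption: six months of inactivity is the archive threshold.
ARCHIVE_AFTER = timedelta(days=180)

def archive_candidates(tables, today):
    """Return tables that have not been queried within the archive window."""
    return [name for name, last in tables.items()
            if today - last > ARCHIVE_AFTER]

print(archive_candidates(last_queried, today=datetime(2019, 8, 16)))
# -> ['JT_User_Agg_V2']
```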

2. Put VIP data in a place and structure that users know how to use.

Let me repeat: historical data often propagates variations of itself as ‘high priority’ analyses are completed and then abandoned after the next round of business questions. This can leave countless unnecessary tables co-existing with repurposed information, refreshing on an ongoing basis, and causing confusion for analysts about which table is the best one to use.

Keep historical data in a structure that is flexible for ongoing analyses, so an analyst or data scientist can avoid creating duplicate tables or copying information. The more flexible the structure, the less likely it will need to be modified and materialized. Keep this data in a well-known location that is trusted and well documented.
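One possible (and entirely hypothetical) way to make that ‘well-known location’ explicit is a lightweight registry that records where each VIP dataset lives, who owns it, and what it is meant for; the dataset names and paths below are placeholders:

```python
# A minimal, hypothetical registry of trusted "VIP" datasets.
# In practice this might live in a data catalog tool; a plain
# dictionary is enough to illustrate the idea.
VIP_DATASETS = {
    "orders_curated": {
        "location": "warehouse.analytics.orders_curated",  # placeholder path
        "owner": "data-engineering",
        "refreshed": "daily",
        "notes": "One row per order; source of truth for revenue KPIs.",
    },
    "customers_curated": {
        "location": "warehouse.analytics.customers_curated",
        "owner": "data-engineering",
        "refreshed": "daily",
        "notes": "One row per customer; deduplicated across source systems.",
    },
}

def describe(name: str) -> None:
    """Print where a trusted dataset lives and how it should be used."""
    entry = VIP_DATASETS[name]
    print(f"{name}: {entry['location']} (owner: {entry['owner']}, "
          f"refreshed {entry['refreshed']}) - {entry['notes']}")

describe("orders_curated")
```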

3. Move forward.

In most cases, the time spent trying to find the needle in a haystack of old, tangled information leaves an analyst exhausted, with trivial output to show for it. Invest a set amount of time gathering what is crucial from historical data for ongoing business decisions, then move your efforts towards building trust in future pipelines and incoming data. The right amount of time will depend on your department; it may range from one week to a six-month dedicated effort.

Future Facing: Get it right the first time.

Once past the historical mess, setting up data quality reinforcements ranging from data quality checks within a given pipeline to setting up a data governance group is key for maintaining future data quality.

1. Validate along the journey

Taking control to change the “eh, this report is wrong” situation at work into “Hey, heads up, the report will be refreshed again at noon today once the data is corrected after the nightly job finishes” moves a data team from reactive to proactive.

Set up data quality checks along the data transformation journey, with proper alerts, especially when transforming and materializing data. For example, check the start and end of a pipeline to ensure that record counts and aggregations match as anticipated. If they do not match, troubleshooting starts proactively, rather than a co-worker interrupting you later in the afternoon after catching the error or, worse, carrying a business decision forward with half of the information.
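A minimal sketch of what those checks might look like, assuming the counts, totals and tolerance below are placeholders and the “alert” is just a log message rather than a real paging system:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline_checks")

def check_counts(source_rows: int, target_rows: int) -> bool:
    """Fail if rows were dropped or duplicated between pipeline start and end."""
    if source_rows != target_rows:
        log.error("Row count mismatch: source=%s target=%s", source_rows, target_rows)
        return False
    return True

def check_totals(source_total: float, target_total: float, tolerance: float = 0.001) -> bool:
    """Fail if an aggregate (e.g. total revenue) drifts beyond a small tolerance."""
    if abs(source_total - target_total) > tolerance * max(abs(source_total), 1.0):
        log.error("Aggregate mismatch: source=%.2f target=%.2f", source_total, target_total)
        return False
    return True

# Hypothetical figures captured at the start and end of a nightly job.
checks_passed = all([
    check_counts(source_rows=1_204_331, target_rows=1_204_331),
    check_totals(source_total=9_874_120.55, target_total=9_874_120.55),
])

if not checks_passed:
    # In a real pipeline this would page the team or hold back the publish step.
    raise RuntimeError("Data quality checks failed; report refresh held back.")
```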

2. Set up flexible architecture structures

Allow the data analyses to transform just as much as the business does. While the total percent of inventory may be important one year, the variance of remaining inventory by product may be the KPI the next year. A flexible architecture strikes a balance between saving everything and saving only what is important at the time. In general, this data structure will be able to answer 80% of the business questions with 20% of the business’s total data.

What does a flexible architecture structure look like? It depends. A ‘flexible structure’ will be based entirely on the business, the data type, the data size and the speed of delivery a given department requires. The simplified example below may not fit every use case of an evolving business.

Simple example: rather than keeping a single column of precomputed percentage values, keep data in its raw form as much as compute allows (for instance, the underlying counts the percentage was derived from). Keeping data in this form allows the user to add, multiply, and divide KPIs as the business demands, even as the business goes through its own data transformation.
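A short pandas sketch of the same idea, with made-up inventory numbers: store the raw counts and derive whichever ratio the business cares about this year at query time, instead of persisting a single precomputed percentage column.

```python
import pandas as pd

# Raw form: keep the underlying counts (values are illustrative only).
inventory = pd.DataFrame({
    "product": ["widget", "gadget", "gizmo"],
    "units_remaining": [120, 45, 300],
    "units_received": [500, 200, 320],
})

# This year's KPI: percent of inventory remaining, derived on demand.
inventory["pct_remaining"] = 100 * inventory["units_remaining"] / inventory["units_received"]

# Next year's KPI: units sold, derived from the very same raw columns.
inventory["units_sold"] = inventory["units_received"] - inventory["units_remaining"]

print(inventory)
```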

3. Documentation.

Some of the greatest companies I have had the chance to work inside have begun attempting to automate documentation as much as possible. While I have not seen a perfect solution, companies that proactively talk about data quality are often those that are further along in their analytics journey. These companies see evolving data quality, and the corresponding documentation, as a key input to a team’s quality output.
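As one small illustration of that idea (not any particular company’s tooling), assuming the table can be loaded into a pandas DataFrame, even a basic data dictionary can be generated from the schema itself so the documentation refreshes alongside the data:

```python
import pandas as pd

def data_dictionary(df: pd.DataFrame, table_name: str) -> str:
    """Build a simple markdown data dictionary from a DataFrame's schema."""
    lines = [
        f"## {table_name}",
        "",
        "| Column | Type | Non-null rows | Example |",
        "|---|---|---|---|",
    ]
    for col in df.columns:
        example = df[col].dropna().iloc[0] if df[col].notna().any() else ""
        lines.append(f"| {col} | {df[col].dtype} | {df[col].notna().sum()} | {example} |")
    return "\n".join(lines)

# Illustrative data only; in practice this would be the curated table itself.
orders = pd.DataFrame({
    "order_id": [1001, 1002],
    "order_date": pd.to_datetime(["2019-08-01", "2019-08-02"]),
    "revenue": [49.99, 120.00],
})

print(data_dictionary(orders, "orders_curated"))
```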

The more up to date the information, the stronger the analysis and the more confident the business decision. These organizations actually treat ‘data as an asset’.

While writing documentation is not considered as ‘flashy’ as AI/ML modeling, if no one can reproduce or understand how an analysis was created or what was used to create it, its impact typically lasts only as long as the individual remains in that department.

Context Matters.

Data, in most cases, needs to be accurate. How accurate will depend on the department and the type of data it leverages.

For example, the ‘accuracy standard’ required for processing an individual’s paycheck is very different from that of a quick ad hoc analysis of users who likely returned to a website over the last month. While both are important, the consequences of being incorrect in these two situations are very different. A given department should be aware of the standards it and others hold themselves to, and balance accordingly.

The same applies to the trade-offs between speed and accuracy. These trade-offs should be known and discussed with the primary stakeholders who analyze and use this information to make decisions, so the business is better informed.

Start the conversation with your organization.

Strengthening data quality is typically moved to the ‘we will get to it eventually’ section of most organizations’ tech-debt budgets; however, I suggest it should be put at the forefront, given the increasing frequency of data hacks and policy changes such as GDPR and CCPA. These changes require an organization to know what information is collected, and where and how it is propagated and stored. This new standard will raise data quality to a new level of importance, touching almost every organization and industry.

Whether it is new policy standards or a renewed focus on a data-driven mindset within your organization, data quality will continue to take more of the spotlight in the near future.
