Build Trust Through Test Automation and Monitoring
“Trust takes years to build, seconds to break, and forever to repair.”
We recently talked to a data team in a financial services company that lost the trust of their users. They lacked the resources to implement quality controls so bad data sometimes leaked into user analytics. After several high-profile episodes, department heads hired their own people to create reports. For a data-analytics team, this is the nightmare scenario, and it could have been avoided.
Organizations trust their data when they believe it is accurate. A data team can struggle to produce high-quality analytics when resources are limited, business logic keeps changing, and the data sources themselves are of imperfect quality. Accurate data analytics are the product of quality controls and sound processes.
The data team can’t spend 100% of its time checking data, but if data analysts or scientists spend 10–20% of their time on quality, they can build an automated testing and monitoring system that does the work for them. Automated testing can work 24x7 to ensure that bad data never reaches users, and when a mishap does occur, the team can assure users that new tests will be written so the same error never happens again. Automated testing and monitoring greatly multiplies the impact of the effort that a data team invests in quality.
Data Flow as a Pipeline
Think of data analytics as a manufacturing pipeline. There are inputs (data sources), processes (transformations) and outputs (analytics). A typical manufacturing process includes tests at every step in the pipeline that attempt to identify problems as early as possible. As every manufacturer knows, it is far more efficient and less expensive to catch a problem at incoming inspection than in finished goods.
Figure 1 depicts the data-analytics pipeline. In this diagram, databases are accessed and then data is transformed in preparation for being input into models. Models output visualizations and reports that provide critical information to users.
Along the way, tests ask important questions. Are data inputs free from issues? Is business logic correct? Are outputs consistent? As in lean manufacturing, tests are performed at every step in the pipeline. For example, data input tests are analogous to manufacturing incoming quality control. Figure 2 shows examples of data input, output and business logic tests.
Data input tests strive to prevent any bad data from being fed into subsequent pipeline stages. Allowing bad data to progress through the pipeline wastes processing resources and increases the risk that an issue is never caught. Input tests also focus attention on the quality of data sources, which must be actively managed; manufacturers call this supply chain management.
Data output tests verify that a pipeline stage executed correctly. Business logic tests validate data against tried-and-true assumptions about the business. For example, perhaps all European customers are assigned to a member of the Europe sales team.
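As a rough sketch rather than a prescription, the Python snippet below shows what a data input test and a business logic test might look like. The column names, the pandas DataFrames, and the Europe sales-team rule are illustrative assumptions, not features of any particular tool.

```python
import pandas as pd

def check_inputs(orders: pd.DataFrame) -> list:
    """Data input tests: reject obviously bad data before it enters later stages."""
    problems = []
    if orders.empty:
        problems.append("FATAL: orders extract is empty")
    if orders["customer_id"].isna().any():
        problems.append("FATAL: null customer_id values in orders extract")
    return problems

def check_business_logic(customers: pd.DataFrame) -> list:
    """Business logic test: every European customer should have a Europe sales rep."""
    european = customers[customers["region"] == "EU"]
    unassigned = european[european["sales_team"] != "Europe"]
    if not unassigned.empty:
        return [f"WARN: {len(unassigned)} EU customers not assigned to the Europe sales team"]
    return []
```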
Test results saved over time provide a way to check and monitor quality versus historical levels.
Failure Modes
A disciplined data production process classifies failures according to severity level. Some errors are fatal and require the data analytics pipeline to be stopped. In a manufacturing setting, the most severe errors “stop the line.”
Some test failures are warnings. They require further investigation by a member of the data analytics team. Was there a change in a data source? Or a redefinition that affects how data is reported? A warning gives the data-analytics team time to review the changes, talk to domain experts, and find the root cause of the anomaly.
Many test outputs will be informational. They help the data engineer, who oversees the pipeline, to monitor routine pipeline activity or investigate failures.
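To make the severity levels concrete, here is one minimal way to encode them in Python. The handler is only a sketch; raising an exception to “stop the line” is an assumption about how the surrounding orchestration reacts, not a requirement.

```python
from enum import Enum

class Severity(Enum):
    FATAL = "fatal"  # stop the line: bad data must not reach users
    WARN = "warn"    # pipeline continues, but a team member investigates
    INFO = "info"    # routine output used for monitoring and debugging

def handle_result(test_name: str, severity: Severity, message: str) -> None:
    # Record every result so quality can be tracked against historical levels.
    print(f"[{severity.value.upper()}] {test_name}: {message}")
    if severity is Severity.FATAL:
        # Stopping the line: raising an exception halts the run in most orchestrators.
        raise RuntimeError(f"Pipeline stopped by test '{test_name}': {message}")
    if severity is Severity.WARN:
        # In practice this branch would page or message the on-call data engineer.
        pass
```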
Types of Tests
The data team may sometimes feel that its work product is under a microscope. If the analytics look “off,” users can often tell immediately. They are experts in their own domain and will often see problems in analytics with only a quick glance.
Finding issues before your internal customers do is critically important for the data team. There are three basic types of tests that will help you find issues before anyone else: location balance, historical balance and statistical process control.
Location Balance Tests
Location Balance tests ensure that data properties match business logic at each stage of processing. For example, an application may expect 1 million rows of data to arrive via FTP. The Location Balance test could verify that the correct quantity of data arrived initially, and that the same quantity is present in the database, in other stages of the pipeline and finally, in reports.
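A Location Balance test can be as simple as counting rows at each stage and comparing the counts. In the sketch below, the stage names and counts are assumed to be collected elsewhere and passed in.

```python
def location_balance(expected_rows: int, counts_by_stage: dict) -> list:
    """Compare the row count observed at each pipeline stage to the expected count."""
    failures = []
    for stage, count in counts_by_stage.items():
        if count != expected_rows:
            failures.append(
                f"FATAL: expected {expected_rows:,} rows at '{stage}', found {count:,}"
            )
    return failures

# Example: 1 million rows arrive via FTP and should flow through unchanged.
issues = location_balance(
    1_000_000,
    {"ftp_landing": 1_000_000, "warehouse": 1_000_000, "report": 998_500},
)
print(issues)  # flags the report stage, which lost rows along the way
```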
Historical Balance Tests
Historical Balance tests compare current data to previous or expected values. These tests rely upon historical values as a reference to determine whether data values are reasonable (or within the range of reasonable). For example, a test can check the top fifty customers or suppliers. Did their values unexpectedly or unreasonably go up or down relative to historical values?
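One way to express such a check is sketched below with made-up customer figures: compare each top customer’s current value to its stored historical reference and warn when the change exceeds a tolerance (the 25% default is an arbitrary illustrative choice).

```python
import pandas as pd

def historical_balance(current: pd.Series, historical: pd.Series,
                       tolerance: float = 0.25) -> list:
    """Warn when a value moves more than `tolerance` from its historical reference."""
    warnings = []
    for customer, hist_value in historical.items():
        curr_value = current.get(customer, 0.0)
        if hist_value and abs(curr_value - hist_value) / abs(hist_value) > tolerance:
            warnings.append(
                f"WARN: {customer} moved from {hist_value:,.0f} to {curr_value:,.0f}"
            )
    return warnings

# Compare this period's top-customer revenue to the stored historical reference.
historical = pd.Series({"Acme": 1_200_000, "Globex": 950_000})
current = pd.Series({"Acme": 1_150_000, "Globex": 410_000})
print(historical_balance(current, historical))  # Globex dropped far beyond tolerance
```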
It’s not enough for analytics to be correct. Accurate analytics that “look wrong” to users raise credibility questions. Figure 4 shows how a change in the allocation of SKUs, moving from pre-production to production, affects the sales volumes for product groups G1 and G2. You can bet that the VP of sales will notice this change immediately and report that the analytics look wrong. This is a common issue for analytics: the report is correct, but because it looks wrong to users it reflects poorly on the data team. What has changed? When confronted, the data-analytics team has no ready explanation. Guess who is in the hot seat.
Historical Balance tests could have alerted the data team ahead of time that product group sales volumes had shifted unexpectedly. This would give the data-analytics team a chance to investigate and communicate the change to users in advance. Instead of hurting credibility, this episode could help build it by showing users that the reporting is under control and that the data team is on top of changes that affect analytics. “Dear sales department, you may notice a change in the sales volumes for G1 and G2. This is driven by a reassignment of SKUs within the product groups…”
Statistical Process Control
Lean manufacturing operations measure and monitor every aspect of their process in order to detect issues as early as possible. These are called Time Balance tests or, more commonly, statistical process control (SPC). SPC tests repeatedly measure an aspect of the data pipeline, screening for errors or warning patterns. SPC gives the data team a critical tool for catching failures before users see them in reports.
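A basic control-chart rule illustrates the idea: flag any measurement that falls outside the mean plus or minus three standard deviations of recent history. The metric below (nightly load time in minutes) and the three-sigma limits are illustrative choices.

```python
import statistics

def spc_check(history: list, latest: float, sigmas: float = 3.0) -> str:
    """Flag the latest measurement if it falls outside the control limits
    derived from historical measurements (mean +/- `sigmas` standard deviations)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    lower, upper = mean - sigmas * stdev, mean + sigmas * stdev
    if not lower <= latest <= upper:
        return f"WARN: {latest:.1f} is outside control limits [{lower:.1f}, {upper:.1f}]"
    return f"INFO: {latest:.1f} is within control limits"

# Example: minutes taken by the nightly load over recent runs, then today's run.
print(spc_check([42.0, 40.5, 43.2, 41.8, 42.6, 39.9, 41.1], 58.4))
```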
Notifications
A complex process could have thousands of tests running continuously. When an error or warning occurs, a person on the data team should be alerted in real time through email, text or a notification service like Slack. This frees the data team from the distraction of having to periodically poll test results. If and when an event takes place, they’ll be notified and can take action.
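As a minimal sketch, a test failure could be pushed to a chat channel through an incoming webhook. The webhook URL below is a placeholder, and a production system would typically route alerts through a dedicated notification or paging service instead.

```python
import json
import urllib.request

def notify(message: str, webhook_url: str) -> None:
    """Post an alert to a chat channel so the team hears about failures in real time
    instead of having to poll test results."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)

# notify("FATAL: orders extract is empty; pipeline stopped",
#        "https://hooks.slack.com/services/...")  # placeholder webhook URL
```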
Conclusion
Automated tests and alerts enforce quality and greatly lessen the day-to-day burden of monitoring the pipeline. The organization’s trust in data is built and maintained by producing consistent, high-quality analytics that help users understand their operational environment. That trust is critical to the success of an analytics initiative. After all, trust in the data is really trust in the data team.
Want to learn more about data testing and DataOps? Download the Second Edition of the DataOps Cookbook or visit datakitchen.io
Read the next article on analytic/data code testing.