Disband Your Impact Review Board: Automate Analytics Testing

DataKitchen
Sep 18, 2018 · 9 min read

Some companies take six months to write 20 lines of SQL and move it into production.

The last thing that an analytics professional wants to do is introduce a change that breaks the system. Nobody wants to be the object of scorn, the butt of jokes, or a cautionary tale. If that 20-line SQL change is misapplied, it can be a “career-limiting move” for an analytics professional.

Analytics systems grow so large and complex that no single person in the company understands them from end to end. A large company often institutes slow, bureaucratic procedures for introducing new analytics in order to reduce fear and uncertainty. They create a waterfall process with specific milestones. There is a lot of documentation, checks and balances, and meetings — lots of meetings.

Impact Analysis

One of the bottlenecks in an analytics release process is called “impact analysis.” Impact analysis gathers experts on all of the various subsystems (data feeds, databases, transforms, data lakes/warehouses, tools, reports, …) so they can review the proverbial 20 lines of SQL and try to anticipate if/how it will adversely impact data operations.

Imagine you are building technical systems that integrate data, build models and produce visualizations. How does a change in one area affect other areas? In a traditional, established company, that information is locked in various people’s heads. The company may think it has no choice but to gather these experts together in one room to discuss and analyze proposed changes. This is called an “impact analysis meeting.” The process includes the company’s most senior technical contributors, the backbone of data operations. Naturally, these individuals are extremely busy and subject to high-priority interruptions. Sometimes it takes weeks to gather them in one room. It can take additional weeks or months for them to approve a change.

The impact analysis team is a critical bottleneck that slows down updates to analytics. A DataOps approach to improving analytics cycle time adopts process optimization techniques from the manufacturing field. In a factory environment, a small number of bottlenecks often limit throughput. This observation is the basis of the Theory of Constraints. Optimize the throughput of bottlenecks and your end-to-end cycle time improves (check out “The Goal” by Goldratt).

Get Out of Your Head

The Impact Analysis Meeting is a bottleneck because it relies upon your top technical experts — one of the most oversubscribed resources in the company. What if you could extract all the knowledge and experience trapped in the brains of your company’s experts and code it into a series of tests that would perform the impact analysis for you? This would give you a quick way to test out changes to analytics without requiring bureaucratic procedures and meetings. If the tests pass, you could deploy with confidence. No more waiting on the impact review team. With a comprehensive test suite, you reduce reliance on the impact analysis bottleneck and move a lot faster.

Automating Impact Analysis

Manual testing moves the bottleneck from impact review to the testing team. Manual testing is performed step-by-step, by a person. This tends to be expensive as it requires someone to create an environment and run tests one at a time. It can also be prone to human error.

DataOps automates testing. Environments are spun up under machine control and test scripts, written in advance, are executed in batch. Automated testing is much more cost-effective and reliable than manual testing, but the effectiveness of automated testing depends on the quality and breadth of the tests. In a DataOps enterprise, members of the analytics team spend 20% of their time writing tests. Whenever a problem is encountered, a new test is added. New tests accompany every analytics update. The breadth and depth of the test suite continuously grow.
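As a minimal sketch of what "test scripts, written in advance, executed in batch" can look like, the snippet below keeps a registry of checks and runs them all in one pass. The `check` decorator, `run_all` function, sample rows and check names are all hypothetical, invented for illustration; a real team would more likely use an established test framework.

```python
from typing import Callable, Dict, List

# Hypothetical registry: every check is a function that raises AssertionError on failure.
CHECKS: List[Callable[[], None]] = []

def check(fn: Callable[[], None]) -> Callable[[], None]:
    """Register a check so the batch runner picks it up automatically."""
    CHECKS.append(fn)
    return fn

def run_all() -> Dict[str, str]:
    """Run every registered check and report pass/fail by name."""
    results = {}
    for fn in CHECKS:
        try:
            fn()
            results[fn.__name__] = "pass"
        except AssertionError as exc:
            results[fn.__name__] = f"fail: {exc}"
    return results

# Illustrative data standing in for a table in the pipeline.
SALES_ROWS = [
    {"customer_id": 1, "sales_volume": 1200},
    {"customer_id": 2, "sales_volume": 300},
]

@check
def sales_volume_is_non_negative():
    assert all(r["sales_volume"] >= 0 for r in SALES_ROWS), "negative sales volume"

# Added after a (hypothetical) incident in which a customer was loaded twice.
@check
def customer_ids_are_unique():
    ids = [r["customer_id"] for r in SALES_ROWS]
    assert len(ids) == len(set(ids)), "duplicate customer_id"
```

Calling `run_all()` returns a dictionary of results keyed by check name. Adding a test after a problem is found is then a matter of writing one more decorated function.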

One advantage of automated testing is that it’s easier to run so it’s executed repeatedly and regularly. Manual testing is often too expensive and slow to run on a regular basis. To ensure high quality, you have to be able to consistently and regularly test your data and code.

These concepts are new to many data teams, but they are well established in the software industry. As Figure 1 shows, the cycle time of software development releases has been (and continues to be) reduced by orders of magnitude through automation and process improvements. The automation of impact analysis can have a similar positive effect on your organization’s analytics cycle time.

Figure 1: Software developers have reduced the cycle time for new releases by orders of magnitude using automation and process improvements.

Analytics is Code

At this point some of you are thinking, “This has nothing to do with me. I am a data analyst/scientist, not a coder. I am a tool expert. What I do is just a sophisticated form of configuration.” This is a common point of view in data analytics. However, it leads to a mindset that slows down analytics cycle time.

Tools vendors have a business interest in perpetuating the myth that if you stay within the well-defined boundaries of their tool, you are protected from the complexity of software development. This is ill-considered.

Don’t get us wrong. We love our tools, but don’t buy into this falsehood.

The $100B analytics market is divided into two segments: tools that create code and tools that run code. The point is — data analytics is code. The data professional creates code and must own, embrace and manage the complexity that comes along with it.

Figure 2: From data access to visualization to reports, there is code running at every stage of the data operations pipeline

Figure 2 shows a data operations pipeline with code at every stage of the pipeline. Python, SQL, R: these are all code. The tools of the trade (Informatica, Tableau, Excel, …) are code too. If you open an Informatica or Tableau file, it’s XML. It contains conditional branches (if-then-else constructs) and loops, and you can embed Python or R in it.

Figure 3: Tableau files are stored as XML, and can contain conditional branches, loops and embedded code.

Figure 5: Informatica presents a UI that creates ETL in an XML format that is then converted to Java and executed on the machine.

Remember our 20-line SQL change that took six months to implement? The problem is that analytics systems become so complex that they can easily break if someone makes one misbegotten change. The average data-analytics pipeline encompasses many tools (code generators) and runs lots of code. Between all of the code and people involved, data operations becomes a combinatorially complex hairball of systems that could come crashing down with one little mistake.

For example, imagine that you have analytics that sorts customers into five bins based on some conditional criterion. Deep inside your tool’s XML file is an if-then-else construct that is responsible for sorting the customers correctly. You have numerous reports based off of a template that contains this logic. They provide information to your business stakeholders: top customers, middle customers, gainers, decliners, whales, profitable customers, …

There’s a team of IT engineers, database developers, data engineers, analysts and data scientists that manages the end-to-end system that supports these analytics. One of these individuals makes a change. They convert the sales volume field from an integer into a decimal. Perhaps they convert a field that was US dollars into a different currency. Maybe they rename a column. Because everything in the analytics pipeline is so interdependent, the change breaks all of the reports that contain the if-then-else logic upon which the original five categories are built. All of a sudden, your five customer categories become one category or the wrong customers are sorted into the wrong bins. None of the dependent analytics are correct, reports are showing incorrect data, and the VP of Sales is calling you hourly.
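To make the scenario concrete, here is a small sketch of a test that would catch this kind of break before the VP of Sales does. The bin thresholds, bin labels, field names and reference customers are all invented for illustration; the point is only that a fixed reference set with known expected bins turns a silent failure into a failing test.

```python
# Hypothetical thresholds in US dollars, checked from highest to lowest.
BIN_FLOORS = [
    (100_000, "whale"),
    (50_000, "top"),
    (10_000, "middle"),
    (1_000, "small"),
]

def assign_bin(sales_volume):
    """The if-then-else logic that sorts one customer into one of five bins."""
    for floor, label in BIN_FLOORS:
        if sales_volume >= floor:
            return label
    return "minimal"

def bin_customers(rows):
    """Map each customer_id to its bin."""
    return {r["customer_id"]: assign_bin(r["sales_volume"]) for r in rows}

# Reference customers chosen so that every bin is exercised.
REFERENCE_ROWS = [
    {"customer_id": "A", "sales_volume": 250_000},
    {"customer_id": "B", "sales_volume": 75_000},
    {"customer_id": "C", "sales_volume": 20_000},
    {"customer_id": "D", "sales_volume": 5_000},
    {"customer_id": "E", "sales_volume": 500},
]
EXPECTED_BINS = {"A": "whale", "B": "top", "C": "middle", "D": "small", "E": "minimal"}

def test_reference_customers_land_in_expected_bins(rows=REFERENCE_ROWS):
    assert bin_customers(rows) == EXPECTED_BINS

# Simulate an upstream change: someone now reports sales in thousands of dollars.
rescaled_rows = [
    {"customer_id": r["customer_id"], "sales_volume": r["sales_volume"] / 1000}
    for r in REFERENCE_ROWS
]
```

Running the test against `rescaled_rows` fails, because every reference customer falls below the lowest threshold and lands in the same bin. That is the "five categories become one" failure described above, caught in seconds.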

At an abstract level, every analytic insight produced, every deliverable, is an interconnected chain of code modules delivering value. The data analytics pipeline is best represented by a directed acyclic graph (DAG). For example, see Figure 4.

Figure 4: The Directed Acyclic Graph (DAG) models the steps in the data analytics pipeline

Whether you use an analytics tool like Informatica or Tableau, an Integrated Development Environment (IDE) like Microsoft Visual Studio (Figure 6) or even a text editor like Notepad, you are creating code. The code that you create interacts with all of the other code that populates the DAG that represents your data pipeline.
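One way to picture this is to write the DAG down and ask it which steps sit downstream of a change. The sketch below does that with a breadth-first traversal. The step names and the `downstream` helper are hypothetical, and a real pipeline would derive the graph from its orchestration tool rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical pipeline: each key feeds the steps listed as its value.
PIPELINE = {
    "raw_sales_feed": ["clean_sales"],
    "clean_sales": ["customer_bins", "revenue_forecast"],
    "customer_bins": ["top_customers_report", "decliners_report"],
    "revenue_forecast": ["executive_dashboard"],
    "top_customers_report": [],
    "decliners_report": [],
    "executive_dashboard": [],
}

def downstream(dag, changed):
    """Return every step reachable from `changed`, in breadth-first order."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in dag.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Which steps could a change to `clean_sales` affect?
affected = downstream(PIPELINE, "clean_sales")
```

In this toy graph, a change to `clean_sales` touches both reports and the dashboard, while a change to `revenue_forecast` touches only the dashboard. The list of affected steps tells you which tests matter most for a given change.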

To automate impact analysis, think of the end-to-end data pipeline holistically. Your test suite should verify software entities on a stand-alone basis as well as how they interact.

Figure 6: Developers write SQL, Python and other code using an integrated development environment or sometimes a simple editor like Notepad.
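As a small illustration of testing code both on a stand-alone basis and in combination, the sketch below uses Python's standard `unittest` module. The two functions under test (`to_usd` and `total_sales_usd`), the field names and the exchange rates are made up for the example.

```python
import unittest

def to_usd(amount, rate):
    """Convert an amount into US dollars using an exchange rate (hypothetical step)."""
    return round(amount * rate, 2)

def total_sales_usd(rows, rates):
    """Sum sales in US dollars; combines the conversion step with a rate lookup."""
    return round(sum(to_usd(r["amount"], rates[r["currency"]]) for r in rows), 2)

class StandAloneTests(unittest.TestCase):
    """Verify one piece of code in isolation."""

    def test_to_usd_applies_rate(self):
        self.assertEqual(to_usd(10, 1.25), 12.5)

class InteractionTests(unittest.TestCase):
    """Verify that the pieces work together."""

    def test_total_across_currencies(self):
        rows = [
            {"amount": 10, "currency": "EUR"},
            {"amount": 5, "currency": "USD"},
        ]
        rates = {"EUR": 1.25, "USD": 1.0}
        self.assertEqual(total_sales_usd(rows, rates), 17.5)
```

Saved to a file, these tests run with `python -m unittest`. If someone later changes the currency of a field without updating the rate lookup, the interaction test is the one that fails.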

Types of Tests

The software industry has decades of experience ensuring that code behaves as expected. Each type of test has a specific goal. If you spend any time discussing testing with your peers, these terms are sure to come up: unit tests, integration tests, functional tests, performance tests and regression tests.

Tests Target Data or Code or Both

It’s also helpful to frame the purpose and context of a test. Tests can target data or code. In our last blog, we discussed tests that run as part of the data operations pipeline. We talked specifically about location balance, historical balance and statistical process controls (time balance) tests. These tests are directed at the data flowing through an operations pipeline. The code that runs the data processing steps in the pipeline is fixed. The code is tightly controlled and only changed via a release process. Data that moves through operations, on the other hand, is variable. New data flows through the pipeline continuously. As Figure 7 shows, the data operations pipeline delivers value to users. DataOps terms this the Value Pipeline.

Figure 7: Data Operations: The Value Pipeline
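As a rough sketch of what tests aimed at data can look like, the functions below implement simplified versions of the three checks named above. The definitions are shorthand for this example: row counts should agree between steps, the current value should be close to the previous one, and a new value should fall within control limits computed from history. The function names, the 10% tolerance and the three-sigma limit are assumptions, not fixed rules.

```python
from statistics import mean, stdev

def location_balance(rows_in, rows_out):
    """Rows leaving a step should match the rows that entered it."""
    return rows_in == rows_out

def historical_balance(current, previous, tolerance=0.10):
    """The current value should be within `tolerance` (a fraction) of the previous value."""
    return abs(current - previous) <= tolerance * abs(previous)

def within_control_limits(history, value, sigmas=3.0):
    """SPC-style check: is `value` within `sigmas` standard deviations of the historical mean?

    `history` needs at least two points so a standard deviation can be computed.
    """
    center = mean(history)
    spread = stdev(history)
    return abs(value - center) <= sigmas * spread

# Illustrative daily row counts from earlier pipeline runs.
daily_rows = [1000, 1010, 990, 1005, 995]
```

With this history, a run that delivers 1,012 rows passes the control-limit check, while a run that delivers 400 rows fails it and should stop the pipeline or raise an alert.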

The development of new analytics follows a different path, which is shown in Figure 8 as the Innovation Pipeline. The Innovation Pipeline delivers new insights to the data operations pipeline, regulated by the release process. To safely develop new code, the analyst needs an isolated development environment. When creating new analytics, the developer creates an environment analogous to the overall system. If the database is terabytes in size, the data professional might copy it for test purposes. If the data is petabytes in size, it may make sense to sample it; for example, take 10% of the overall data. If there are concerns about privacy or other regulations, then sensitive information is removed. Once the environment is set up, the data typically remains stable.

Figure 8: New analytics are developed in the Innovation Pipeline
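Setting up such an environment can be scripted. The sketch below takes a reproducible 10% sample and drops sensitive fields, under the assumption that rows are plain dictionaries. The field names in `SENSITIVE_FIELDS` and the `build_dev_sample` helper are hypothetical, and real privacy requirements usually call for more than deleting columns.

```python
import random

# Hypothetical names of columns that must not reach a development environment.
SENSITIVE_FIELDS = {"email", "ssn"}

def build_dev_sample(rows, fraction=0.10, seed=42):
    """Return a reproducible random sample of `rows` with sensitive fields removed."""
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed keeps the development data stable
    size = max(1, round(len(rows) * fraction))
    sampled = rng.sample(rows, size)
    return [
        {key: value for key, value in row.items() if key not in SENSITIVE_FIELDS}
        for row in sampled
    ]

# Illustrative production-like rows.
production_rows = [
    {"customer_id": i, "email": f"user{i}@example.com", "sales_volume": i * 100}
    for i in range(100)
]
dev_rows = build_dev_sample(production_rows)
```

Because the seed is fixed, rebuilding the environment yields the same sample, which keeps test results comparable from one code change to the next.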

In the Innovation Pipeline, code is variable, but data is fixed. Tests target the code, not the data. The unit, integration, functional, performance and regression tests that were mentioned above are aimed at vetting new code. All tests are run before promoting (merging) new code to production. Code changes should be managed using a version control system, for example, Git. A good test suite serves as an automated form of impact analysis that can be run on any and every code change before deployment.

Table 1: In the Value Pipeline code is fixed and data is variable. In the Innovation Pipeline, data is fixed and code is variable.
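Because the data is fixed in the Innovation Pipeline, a regression test can be as simple as running the current code and the proposed code on the same rows and comparing the results. The sketch below assumes a made-up aggregation (`sales_by_region`) and a proposed rewrite of it; the `regression_diff` helper is likewise hypothetical.

```python
from collections import defaultdict

# Fixed development data: it does not change between test runs.
FIXED_ROWS = [
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 250},
    {"region": "east", "sales": 50},
]

def sales_by_region(rows):
    """Current production logic (hypothetical)."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["sales"]
    return totals

def sales_by_region_candidate(rows):
    """Proposed change: different implementation, same result expected."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["sales"]
    return dict(totals)

def regression_diff(baseline, candidate):
    """Return {key: (baseline_value, candidate_value)} for every mismatch."""
    keys = baseline.keys() | candidate.keys()
    return {
        key: (baseline.get(key), candidate.get(key))
        for key in keys
        if baseline.get(key) != candidate.get(key)
    }

diff = regression_diff(sales_by_region(FIXED_ROWS), sales_by_region_candidate(FIXED_ROWS))
```

An empty `diff` means the change did not alter results on the fixed data. A non-empty one shows exactly which keys moved, which is the information an impact review meeting would otherwise try to predict.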

Some tests are aimed at both data and code. For example, a test that makes sure that a database has the right number of rows helps your data and code work together. Ultimately both data tests and code tests need to come together in an integrated pipeline as shown in Figure 9. DataOps enables code and data tests to work together so all-around quality remains high.

Figure 9: Ultimately the Value and Innovation Pipelines work together to maintain data and code quality
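A row-count test of the kind mentioned above might look like the following sketch, which uses Python's built-in `sqlite3` module with an in-memory database. The table names, columns and load step are invented for the example; the same idea applies to any database your pipeline uses.

```python
import sqlite3

KNOWN_TABLES = {"staging_orders", "orders"}  # hypothetical table names

def row_count(conn, table):
    """Count rows in a table. Table names cannot be bound as SQL parameters,
    so only names from a known list are accepted."""
    if table not in KNOWN_TABLES:
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def rows_match(conn):
    """Every staged row should arrive in the target table."""
    return row_count(conn, "staging_orders") == row_count(conn, "orders")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?)",
    [(1, 9.5), (2, 20.0), (3, 5.25)],
)

# The load step under test: both the data arriving and the SQL can break it.
conn.execute("INSERT INTO orders SELECT id, amount FROM staging_orders WHERE amount > 0")
```

The check fails if bad data arrives (a zero or negative amount gets filtered out) and also if someone edits the load SQL incorrectly, which is why it counts as a test of both data and code.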

Conclusion

A unified, automated test suite that tests/monitors both production data and analytic code is the linchpin that makes DataOps work. Robust and thorough testing removes or minimizes the need to perform manual impact analysis, which avoids a bottleneck that slows innovation. Removing constraints helps speed innovation and improve quality by minimizing analytics cycle time. With a highly optimized test process, you’ll be able to expedite new analytics into production with a high level of confidence.

20 new lines of SQL? You’ll have it right away.


Read the previous article about operational testing and monitoring.


