Making Data Analytics Responsive to the Organization’s Needs

Seven Steps to Implementing DataOps: Step 6 — Parameterize Your Processing

DataKitchen
data-ops
3 min read · Apr 25, 2017


In a previous blog we introduced DataOps, a new approach to data analytics that can put analytics professionals at the center of the company's strategy, advancing its most important objectives. DataOps doesn't require you to throw away your existing tools and start from scratch; you can keep the tools you currently use and love. You may be surprised to learn that an analytics team can migrate to DataOps in seven simple steps. This blog entry covers step 6 of 7.

Imagine a pharmaceutical company that obtains prescription data from a third-party company. The data is incomplete, so the data producer uses algorithms to fill in the gaps. In the course of improving its product offering, the data producer develops an alternate algorithm. The data has the same shape (rows and columns), but certain values are modified by the new algorithm.

The third party now provides the pharmaceutical company with two versions of the prescription data: one produced with the original algorithm and the other with the alternate algorithm. The alternate algorithm may ultimately be an improvement, but it starts out as a work in progress, and the vendor requests feedback on potential improvements. While the application team tests the alternate algorithm, the original algorithm remains in place for production. The software application that uses the prescription data therefore needs to run with either algorithm, depending on whether it is executing in production or in testing.

This situation illustrates one of many cases in which the data-analytic pipeline needs to be flexible enough to accommodate different run-time conditions. Which version of the raw data should be used? Is the output directed to production or to testing? Should records be filtered according to some criterion (such as excluding private health-care data)? Should a specific set of processing steps in the workflow be included or skipped? To increase development velocity, these choices need to be built into the pipeline as options.

A robust pipeline design will allow the engineer or analyst to invoke or specify these options using parameters. In software development, a parameter is some information (e.g. a name, a number, an option) that is passed to a program that affects the way that it operates. Consider an example program that adds together the numbers 3 and 5. This might be useful in a specific circumstance, but it only does one thing. Imagine if the program is rewritten to add together the numbers a and b, two values passed to the program at start time. Now the program adds any two numbers — much more useful. Parameters allow code to be generalized so that it can operate on a variety of inputs and respond to a range of conditions.
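The add-two-numbers example above can be sketched in a few lines of Python (names and the command-line interface here are illustrative, not part of any particular tool):

```python
import sys

def add_three_and_five() -> int:
    # Hard-coded version: does exactly one thing.
    return 3 + 5

def add(a: float, b: float) -> float:
    # Parameterized version: adds any two numbers passed to it.
    return a + b

if __name__ == "__main__":
    # The operands arrive as parameters at start time
    # instead of being fixed in the code.
    a, b = float(sys.argv[1]), float(sys.argv[2])
    print(add(a, b))
```

The same shift, from values baked into the code to values supplied at run time, is what generalizes a data pipeline.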

Parameters can also be used to improve productivity. Consider a preprocessing run that performs an operation on some data. After running for several hours, it stops unexpectedly due to an error. An inflexible program might need to be restarted at the beginning, losing the several hours of processing. A program using parameters could be designed to be restarted at any specified point, in this case where the processing left off. Using parameters, the program completes the processing in much less time.
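One way such a restartable design might look, sketched in Python with a hypothetical `--start` parameter that tells the program where to resume (the per-record operation is a placeholder):

```python
import argparse

def process_record(record: str) -> str:
    # Placeholder for the real, expensive per-record operation.
    return record.upper()

def run(records: list, start: int = 0) -> list:
    # Begin at `start` rather than index 0, so a run interrupted
    # partway through can resume instead of repeating finished work.
    return [process_record(r) for r in records[start:]]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--start", type=int, default=0,
                        help="record index to resume processing from")
    args = parser.parse_args()
    data = ["alpha", "beta", "gamma", "delta"]
    print(run(data, start=args.start))
```

After a failure at, say, record 2, rerunning with `--start 2` skips the records already processed.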

Returning to the case of the pharmaceutical company described earlier, an engineer or analyst could easily build a parallel data mart with the alternate algorithm and have both the original and alternate versions accessible through a parameter change. Parameters allow the data-analytic pipeline to be designed to accommodate different run-time circumstances. This flexibility is critical for DataOps which seeks to make analytics more responsive to the needs of the organization. We will provide additional ways that DataOps improves flexibility and responsiveness in our next blog.
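A minimal sketch of that parameter change, in Python: a single `algorithm_version` parameter selects which data mart the pipeline reads from (the paths and names below are hypothetical placeholders, not real locations):

```python
def build_source_path(algorithm_version: str) -> str:
    # Map the algorithm-version parameter to its data mart.
    # Production uses "original"; testing can pass "alternate".
    sources = {
        "original": "/data/prescriptions/original/",
        "alternate": "/data/prescriptions/alternate/",
    }
    if algorithm_version not in sources:
        raise ValueError(f"unknown algorithm version: {algorithm_version}")
    return sources[algorithm_version]
```

The rest of the pipeline stays identical; only the parameter value changes between a production run and a test run.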
