DataOps in Seven Steps

DataKitchen
Sep 19, 2017 · 6 min read

In our latest blog series, we explore data analytics in the on-demand economy. Companies like Amazon and Google have turned instant fulfillment into competitive advantages and are being rewarded in the marketplace. As consumers adapt to this “new normal,” the expectation of instant delivery is crossing into other domains. For example, data analytics users can’t or won’t wait weeks or months for new analytics. Data analytics teams that can successfully meet the requirements for rapid delivery of new analytics will play a high-visibility role in helping their organizations compete in the on-demand economy. Improving the speed and robustness of analytics can be achieved using a process and tools approach called DataOps. DataOps draws from process innovations in software development and lean manufacturing. Organizations that implement DataOps correctly have experienced significant improvements in the ability to produce robust and adaptive analytics. DataOps may be implemented in seven simple steps without discarding an organization’s existing analytics tools. This blog is part 4 of a 4-part series.

To implement DataOps, an analytics team does not need to throw away any of their beloved tools. There are tools that can help optimize the data analytics pipeline, but the methodology and philosophy of DataOps are just as important as the tools. An organization can migrate to DataOps in seven simple steps.

Step 1 — Add Data and Logic Tests

To be certain that the data analytics pipeline is functioning properly, it must be tested. Tests of inputs, outputs, and business logic should be applied at each stage of the pipeline. These tests catch errors and raise warnings before new analytics are released, so quality remains high. Manual testing is time-consuming and laborious. A robust, automated test suite is a key element in achieving continuous delivery, which will be essential for companies in the on-demand economy.
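As a rough illustration, input and business-logic tests can start out as plain functions that return a list of problems. The sketch below is not tied to any particular tool; the record fields (`order_id`, `region`, `amount`) and the function names are hypothetical and chosen only for this example.

```python
from typing import Dict, List

Record = Dict[str, object]


def check_input(rows: List[Record]) -> List[str]:
    """Input test: return the problems found in the incoming rows (empty if clean)."""
    problems = []
    if not rows:
        problems.append("input is empty")
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            problems.append(f"row {i}: missing order_id")
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            problems.append(f"row {i}: invalid amount {amount!r}")
    return problems


def total_by_region(rows: List[Record]) -> Dict[str, float]:
    """The pipeline stage under test: sum amounts per region."""
    totals: Dict[str, float] = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def check_output(rows: List[Record], totals: Dict[str, float]) -> List[str]:
    """Business-logic test: regional totals must add up to the grand total."""
    grand_total = sum(row["amount"] for row in rows)
    if abs(sum(totals.values()) - grand_total) > 1e-9:
        return ["regional totals do not sum to the grand total"]
    return []
```

A scheduler or test runner could call `check_input` before the stage and `check_output` after it, and halt the pipeline whenever either returns a non-empty list.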

Step 2 — Use a Version Control System

All of the processing steps that turn raw data into useful information are source code. Code can control the entire data analytics pipeline from end to end in an automated and reproducible fashion. In many organizations, however, the files associated with analytics are scattered across various locations without any governing control. A revision control tool, such as Git, stores and manages all of the changes to code. It also keeps code organized in a known repository and provides for disaster recovery. Revision control also helps teams parallelize their efforts by allowing them to branch and merge.

Step 3 — Branch and Merge

When an analytics professional wants to make updates, he or she checks a copy of all of the relevant code out of the revision control system. He or she then can make changes to a local, private copy of the code. These local changes are called a branch. Revision control systems boost team productivity by allowing many developers to work on branches concurrently. When changes to the branch are complete, tested and known to be working, the code can be checked back into revision control, thus merging back into the trunk or main code base.

Branching and merging allow the data analytics team to run their own tests, make changes, take risks and experiment. If a set of changes proves to be unfruitful, the branch can be discarded and the analytics team member can start over.

Step 4 — Use Multiple Environments

In addition to having a local copy of the code, data analytics professionals need a private copy of the relevant data. In many organizations, team members work directly on the production database, which often leads to conflicts and inefficiencies. With on-demand storage from cloud services, a terabyte-scale data set can be copied quickly and inexpensively, reducing conflicts and dependencies. If the data is too large to copy, give your staff an easy way to switch between environments.
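One simple way to make switching easy is to keep each environment's settings in one place and select among them by name. The following is a minimal sketch under stated assumptions: the `ANALYTICS_ENV` variable, the environment names, and the connection strings are all invented for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    database_url: str  # hypothetical connection string
    writable: bool     # whether analysts may modify data here


# Hypothetical environments; a real team would load these from configuration.
ENVIRONMENTS = {
    "dev": Environment("dev", "postgresql://localhost/analytics_dev", True),
    "test": Environment("test", "postgresql://test-db/analytics", True),
    "prod": Environment("prod", "postgresql://prod-db/analytics", False),
}


def current_environment(default: str = "dev") -> Environment:
    """Pick the environment named by ANALYTICS_ENV, falling back to `default`."""
    name = os.environ.get("ANALYTICS_ENV", default)
    if name not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {name!r}")
    return ENVIRONMENTS[name]
```

Defaulting to a private development environment means that a team member has to ask for production explicitly, rather than landing there by accident.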

Step 5 — Reuse & Containerize

Data analytics team members typically have a difficult time leveraging each other’s work. Code reuse is a vast topic, but the basic idea is to componentize functionalities in ways that can be shared. Complex functions, with lots of individual parts, can be containerized using a container technology like Docker. Containers are ideal for highly customized functions that require a skill set that isn’t widely shared among the team.
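To make the idea of componentizing concrete, here is a small sketch in which every processing step shares one interface (rows in, rows out), so that steps written by different team members can be combined freely. The step names and the `revenue` and `cost` columns are hypothetical. Containerization itself is not shown; a step like these could be packaged in a container when its dependencies are complex.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_nulls(column: str) -> Step:
    """Build a reusable step that removes rows where `column` is missing."""
    def step(rows: List[Row]) -> List[Row]:
        return [row for row in rows if row.get(column) is not None]
    return step


def add_margin(rows: List[Row]) -> List[Row]:
    """Reusable step that adds a `margin` column (revenue minus cost)."""
    return [{**row, "margin": row["revenue"] - row["cost"]} for row in rows]


def run_pipeline(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        rows = step(rows)
    return rows
```

For example, `run_pipeline(rows, [drop_nulls("revenue"), add_margin])` cleans the data and then computes margins, and either step can be reused in another pipeline unchanged.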

Step 6 — Parameterize Your Processing

The data analytics pipeline should be designed with run-time flexibility. Which dataset should be used? Is a new data warehouse used for production or testing? Should data be filtered? Should specific workflow steps be included or not? These types of conditions are coded in different phases of the data analytics pipeline using parameters. In software development, a parameter is some information (e.g. a name, a number, an option) that is passed to a program that affects the way that it operates. With the right parameters in place, accommodating the day-to-day needs of the users and data analytics professionals becomes a routine matter.
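The questions above map naturally onto command-line parameters. Below is a minimal sketch using Python's standard `argparse` module; the flag names (`--dataset`, `--target`, `--min-date`, `--skip`) and the step names are assumptions made for this example, not part of any particular product.

```python
import argparse
from typing import Dict, List, Optional

ALL_STEPS = ["extract", "transform", "load", "publish"]  # hypothetical steps


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run an analytics pipeline (illustrative).")
    parser.add_argument("--dataset", required=True, help="which dataset to process")
    parser.add_argument("--target", choices=["test", "production"], default="test",
                        help="which data warehouse to write to")
    parser.add_argument("--min-date", default=None,
                        help="optional filter: ignore records before this date")
    parser.add_argument("--skip", action="append", default=None, metavar="STEP",
                        help="workflow step to leave out (may be repeated)")
    return parser


def plan_run(argv: Optional[List[str]] = None) -> Dict[str, object]:
    """Turn command-line parameters into a description of the run."""
    args = build_parser().parse_args(argv)
    skipped = args.skip or []
    return {
        "dataset": args.dataset,
        "target": args.target,
        "min_date": args.min_date,
        "steps": [step for step in ALL_STEPS if step not in skipped],
    }
```

With this in place, the same code serves a test run (`--dataset sales --skip publish`) and a production run (`--dataset sales --target production`) without any edits to the pipeline itself.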

Step 7 — Work Without Fear™

Many data analytics professionals dread the prospect of deploying changes that break production systems or allowing poor quality data to reach users. This state of constant anxiety is no way to live. Addressing this requires optimization of two key workflows:

  • Value Pipeline — Data flows into production and creates value for the organization.
  • Innovation Pipeline — Ideas in the form of new analytics undergo development and are added to the production pipeline. The Value and Innovation pipelines intersect in production.

The DataOps enterprise masters both the orchestration of data to production and the deployment of new features, all while maintaining impeccable quality. With tests (statistical process control) controlling and monitoring both the data and new development pipelines, the dev team can deploy without worrying about breaking the production systems. With Agile Development and DevOps, the velocity of new analytics is maximized. Work Without Fear™.
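A statistical process control test can be as simple as checking that a pipeline metric stays within control limits derived from its own history. The sketch below assumes a three-sigma rule applied to a daily row count; the choice of metric, the sample numbers, and the function names are illustrative assumptions only.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def control_limits(history: Sequence[float], sigmas: float = 3.0) -> Tuple[float, float]:
    """Lower and upper control limits: mean plus or minus `sigmas` standard deviations.

    Requires at least two historical values, because a sample standard
    deviation is undefined for fewer.
    """
    center = mean(history)
    spread = stdev(history)
    return center - sigmas * spread, center + sigmas * spread


def in_control(value: float, history: Sequence[float], sigmas: float = 3.0) -> bool:
    """True if `value` falls within the control limits implied by `history`."""
    lower, upper = control_limits(history, sigmas)
    return lower <= value <= upper


# Hypothetical daily row counts from recent pipeline runs.
recent_row_counts = [1000, 1020, 980, 1010, 990]
```

Here `in_control(1005, recent_row_counts)` passes, while a sudden drop to 400 rows falls outside the limits and could be used to stop the run before bad data reaches users.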

DataOps Case Study

Companies that are implementing DataOps are seeing tremendous improvements in their data analytics cycle time and ability to adapt to new analytics requirements. The benefits of rapid and robust analytics flow through the entire organization.

At one pharmaceutical company, DataOps enables data analytics to be a self-service function for many individuals in the organization. They have kept the analytics tools that they rely upon, but have woven them together into a cohesive pipeline. They can easily make changes to data marts and data warehouses and a robust test suite verifies that none of the changes interrupt the flow of analytics. Enhancements are implemented quickly and released confidently, satisfying the many requests that flow in from the users.

The salespeople have dashboards with forecasts, opportunities, bookings, shipments and all of the other basic information that they need in order to do their jobs. This frees up the analytics team to focus on higher-value analytics, which helps the growing business understand their fast-changing marketplace. Analytics are updated with internal users “shoulder to shoulder,” facilitating immediate feedback and greatly shortening the time it takes to provide users with the useful analytical tools.

Conclusion

DataOps is a methodology that enables data analytics teams to thrive in the on-demand economy. It allows data analytics to be updated nimbly while still maintaining a high level of quality. Companies that have embraced DataOps, using seven simple steps, have seen tremendous improvement in user satisfaction and in their development of analytics as a key competitive advantage.

This completes our blog series exploring how data analytics teams can deliver analytics at Amazon speed using DataOps. You can return to the beginning of the series here. For more information on the Seven Steps, please download our white paper “Seven Steps to Implement DataOps.”

DataKitchen helps organizations turn data into value by offering the world’s first DataOps platform. DataKitchen is leading the DataOps movement to incorporate Agile Software Development, DevOps, and manufacturing-based statistical process control into analytics and data management.

