DataOps Puts Agility into Agile Data Warehousing

Data analytics professionals are used to being in no-win situations. Internal customers make a seemingly simple request, such as adding a new file to the database. Users expect requests like these to take days, yet in many large organizations they take months to complete. At DataKitchen, we repeatedly hear from companies that they need to improve their cycle time for new analytics. One approach, Agile Data Warehousing, applies Agile principles to data warehouse projects in an attempt to speed innovation. However, many companies quickly discover that simply implementing Scrum is not sufficient to attain results.

Imagine that you oversee a fifty-person team managing numerous large integrated databases (DBs) for a big insurance or financial services company. You have 300 terabytes (TB) of data which you manage using a proprietary database. Between software licensing, maintenance, support and associated hardware, you pay $10M per year in fees. Even putting a single additional CPU into production could cost hundreds of thousands of dollars.

Someday these large databases will move to the cloud at a fraction of the cost. New databases will be turned on and off like light bulbs, with the enterprise paying only for the resources it consumes. That’s a long-term goal. In the short term, the team has to produce results using the existing platform.

You can’t afford separate instantiations of the entire data set for development, quality assurance (QA), performance testing and production, so non-production machines are given subsets of the data. The necessity of provisioning physically separate hardware instantiations is one barrier to greater Agility.

The machine environments are different and have to be managed and maintained separately. New analytics are tested on each machine in turn: first in dev, then QA and finally production. You may not catch every problem in dev and QA, since those machines don’t use the same data and environment as production.

Running regression tests manually is time-consuming, so it can’t be done often. This creates risk whenever new code is deployed. Also, when changes are made on one machine, they have to be manually installed on the others. The steps in this procedure are detailed in a 30-page text document, which is updated by a committee through a cumbersome series of reviews and meetings. It is a siloed, fractured and inefficient process. During upgrades, the DB is offline, so new work is temporarily on hold.

In our hypothetical company, the organization of the workforce is also a factor in slowing the team’s velocity. Everyone is assigned a fixed role. Adding a table to a database involves several discrete roles: a Data Quality person who analyzes the problem, a Schema Architect who designs the schema, an ETL Engineer who writes the ETL, a Test Engineer who writes tests and a Release Engineer who handles deployment. Each of these functions is performed sequentially and requires considerable documentation and committee review before any action is taken. Hand-off meetings mark the transition from one stage to the next.

The team wants to move faster but is prevented from doing so due to heavyweight processes, serialization of tasks, overhead, difficulty in coordination and lack of automation. They need a way to increase collaboration and streamline the many inefficiencies of their current process without having to abandon their existing tools.

How DataOps Helps

DataOps is a new approach to data analytics that automates both the orchestration of data to production and the deployment of new features, while maintaining impeccable quality in each. DataOps does not mandate the use of any particular tool or technology, but support in the following areas can be critical to Agile Data Warehousing in large teams, such as the one described:

  • Shared Workspace — DataOps creates a shared workspace so team members have visibility into each other’s work. This enables the team to work more collaboratively and seamlessly outside the formal structure of the hand-off meeting. DataOps also streamlines documentation and reduces the need for formal meetings as a communication forum.
  • Orchestration — DataOps deploys code updates to each machine instantiation and automates the execution of tests at each stage of the data analytics pipeline. This includes data and logic tests that validate both the production and feature deployment pipelines. Tests are parameterized so they can run equally well against the subset database in each machine environment. As the test suite improves, it grows to reflect the full breadth of the production environment. Automated tests are run repeatedly so you can be confident that new features have not broken old ones.
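
To make the idea of parameterized tests concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular DataOps product: the `Environment` class, the `policies` table, the row counts and the two checks are all hypothetical, and an in-memory SQLite database stands in for each machine's subset of the data. The point is only that the same test functions run unchanged against dev, QA and production because everything environment-specific is passed in as a parameter.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Hypothetical description of one machine environment."""
    name: str
    row_count: int  # size of the data subset provisioned on this machine


def build_database(env: Environment) -> sqlite3.Connection:
    """Stand-in for connecting to an environment's database.

    Assumption: a real setup would connect to the existing platform;
    here we create an in-memory SQLite database with a data subset.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE policies (id INTEGER PRIMARY KEY, premium REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO policies (id, premium) VALUES (?, ?)",
        [(i, 100.0 + i) for i in range(env.row_count)],
    )
    return conn


def check_row_count(conn: sqlite3.Connection, env: Environment) -> bool:
    """Data test: the table holds the number of rows expected for this environment."""
    (count,) = conn.execute("SELECT COUNT(*) FROM policies").fetchone()
    return count == env.row_count


def check_no_negative_premiums(conn: sqlite3.Connection, env: Environment) -> bool:
    """Logic test: no policy has a negative premium, whatever the subset size."""
    (bad,) = conn.execute(
        "SELECT COUNT(*) FROM policies WHERE premium < 0"
    ).fetchone()
    return bad == 0


# The same checks are reused in every environment.
CHECKS = [check_row_count, check_no_negative_premiums]

# Hypothetical environments with different subset sizes.
ENVIRONMENTS = [
    Environment("dev", 10),
    Environment("qa", 100),
    Environment("production", 1000),
]


def run_suite(envs):
    """Run every check in every environment and collect pass/fail results."""
    results = {}
    for env in envs:
        conn = build_database(env)
        try:
            results[env.name] = {
                check.__name__: check(conn, env) for check in CHECKS
            }
        finally:
            conn.close()
    return results


if __name__ == "__main__":
    for env_name, outcome in run_suite(ENVIRONMENTS).items():
        print(env_name, outcome)
```

Because the checks take the environment as an argument rather than hard-coding production values, adding a new check to `CHECKS` extends coverage on every machine at once, and an orchestrator could call something like `run_suite` after each deployment.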

These tools and process changes together break down the organizational and technology barriers that prevent the team from implementing Agile methods in data analytics. DataOps relieves the team of non-value-add tasks and empowers them to self-organize around new creative initiatives. When the team is free to innovate, the continuous improvement culture built into DataOps will begin working to reduce the cycle time of new analytics from months to days (or less). This ultimately puts the Agility back into Agile Data Warehousing by delivering high-quality analytics to users in a timely fashion.