DataOps: what it is and why bother

Petr Travkin
13 min read · May 23, 2020


With this article I’d like to start a series of texts describing the DataOps practice and framework, from its basic definition to application and best practices. If you are not familiar with DataOps, these few pages will not only provide useful and (I hope!) interesting details, but also leave some space for further reflection.

(Very) Short Intro

The goal of making data universally accessible within a company has been around for ages, even before there were computers. Despite the evolution of business processes and data-related technologies, the problem is still with us; today it takes the form of creating a data-driven company.

Old New Challenges

Hardly anyone would dispute that the path to having every decision backed by data is long and cumbersome. Here are the challenges I usually come across in practice and in conversations with data professionals.

1. Data and Analytics Pipelines are still immature

The development of data and analytics pipelines is still a handcrafted and largely non-repeatable process with minimal reuse. The result is both a plodding development environment that can’t keep pace with the demands of a data-driven business and an error-prone operational environment that is opaque and slow to respond to change requests. Immature development and delivery processes force business users to build their own pipelines that result in an ever-expanding universe of data silos that go unnoticed until a major decision backfires.

2. Taming Big Data is difficult

While some companies are still trying to figure out how to use big data, others have embraced it but still struggle with its sheer size. Much of this struggle comes from the new systems and technologies that have emerged to address big data needs. Since the pace of innovation does not seem to be slowing down, it is very difficult for businesses to acquire the vision and expertise needed to build and, what is even more challenging, operate these platforms.

3. There is no ideal Data Unification system

The benefits of unifying data sources are obvious. Unfortunately, enterprises typically operate at a large scale, with orders of magnitude more data than ETL tools can manage. Everything from accounting software to factory applications is producing data that yields valuable operational insight to analysts working to improve enterprise efficiency. The easy availability and value of data sources on the web compounds the scalability challenge. Moreover, enterprises are not static. Scalable data unification systems must accommodate the reality of shifting data environments.

4. Finding qualified personnel is difficult

An existing data team may be very familiar with data warehouses, which have a fairly mature architecture and strong integration with the ecosystem of data tools. By contrast, the relatively new concepts, be it a Data Lake, an Operational Data Hub or a Data Factory, are less mature and require expertise that is rather difficult to find. In addition, since the operational tools for managing these new approaches are also still evolving, it remains much harder for companies to support them from an operational perspective than data warehouses.

5. Creating a pervasive Self-Service data access is difficult

To truly democratize data, a company needs to transform both data access tools and infrastructure provisioning to a self-service mode. This requires a thoughtful combination of business and technical efforts and takes a lot of time. Moreover, it cannot succeed without close collaboration and a constant feedback loop between data consumers, analysts, data scientists, and data engineers, which is frequently overlooked.

In addition to these long-standing problems, there are also business drivers:

  • Competitive pressure of digital-native companies in traditional industries.
  • Opportunities presented by the “democratization of analytics”, driven by new products and companies that have enabled broad use of analytic tools.
  • The need for more agility with data, since data that does not move at the pace of the business gets dropped from the decision-making process.
  • Data becoming more mainstream with a proliferation of data sources such as Internet of Things (IoT) and social media.

Data Debt

The challenges and goals mentioned above are all valid and real, but DataOps is, strictly speaking, a process optimization, and process optimization is usually needed when there is some kind of mess to be structured. That is why it occurs to me that the ultimate driver of the DataOps discipline is so-called “data debt”.

Data debt stems naturally from the way that companies do business, especially when they run it as a loosely connected portfolio. Lines of business want control and rapid access to their mission-critical data, so they start making “free rider” decisions about data management and procure their own applications, thus creating data silos. Managers move talented personnel from project to project, so data system owners turn over often. As a result, most large enterprises still face the reality of intensely fractured data environments and general data heterogeneity, which can be defined as “data debt”.

These problems and drivers beg for a new approach to building data analytics solutions. The goal of increasing agility and shortening cycle times while reducing data defects is well defined, but the approach to reach it is still emerging and promising. That approach is DataOps.

So, what is DataOps?

From the data debt perspective, DataOps is a discipline intended to enable companies to pay down their data debt by rapidly and continuously delivering high-quality, unified data at scale from a wide variety of enterprise data sources. However, the DataOps framework can cover a much wider scope.

DataOps: the definition

It may come as a surprise that the DataOps idea has actually been around since 2010, but it was only after its debut in the 2018 Gartner Hype Cycle for Data Management that it started to spread into the minds of a wider group of data professionals.

There exist multiple definitions of DataOps. Here are some examples:

  • DataOps is a collaborative data management practice focused on improving communication, integration, and automation of data flow between managers and consumers of data within an organization.
  • DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics.
  • DataOps is the set of best practices that improve coordination between data science and operations (there is a section on this below).
  • DataOps is an emerging methodology for building data analytics solutions that deliver business value. Building on modern principles of software engineering, DataOps applies rigor to developing, testing, and operating code that manages data flows and creates analytic solutions.

All these definitions are valid, but they only tell what DataOps is and what it does, skipping the “how” perspective. To me, the main idea behind DataOps is that it emerges from the recognition that separating the product (production-ready data) from the process that delivers it (operations) impedes quality, timeliness, transparency, and agility.

Taking a cue from DevOps (of course!), DataOps looks to combine the production and delivery of data into a single, Agile practice that directly supports specific business functions.

DataOps: the core principles

You can read through (and even sign!) The DataOps Manifesto yourself, and here are the core principles.

  • Apply Agile process and software engineering best practices. Short time-to-delivery and responsiveness to change are mandatory. Version control, automated regression testing of everything, and clean code design and factoring are mandatory too.
  • Integrate with your customer and deliver business value. The DataOps team has the advantage that the customers, the engineering teams they support, are in-house, and therefore readily available for daily interaction. Gather feedback as frequently as you can. Data is not an end in itself, but a means to deliver insights that add value to the business and satisfy the customer.
  • Collaboration and Communication. Share knowledge, simplify communication, and provide feedback at every stage of the data analytics lifecycle.
  • Analytics as Code. Treat data artifacts, such as models and visualizations, as code and apply software methods like version control, automated testing, and continuous deployment to them. This also covers host configuration, network configuration, automation, gathering and publishing test results, service installation and startup, error handling, and so on. Everything needs to be code.
  • End-to-End Processes and Continuous Improvement. Avoid data silos and consider analytics an enterprise endeavor. Orchestrate data, schema, tools, code, and stakeholders throughout the data landscape. Learn from mistakes, review processes continuously, and adapt to changing circumstances.
  • Maintain multiple environments and Integrate the toolchains. Keep development, acceptance testing, and production environments separate. Never test in production, and never run production from development. Maintain multiple environments, but within each environment, everything needs to work together.
  • Reuse and Automate. Automate wherever possible and reuse existing artifacts to avoid unnecessary rework and repetition.
  • Short Cycles and Incremental Change. Avoid “big bang” releases and bloated processes. Iterate in short cycles so you can adapt quickly to new and changing needs.
  • Test everything. Make quality and testing a top priority and ensure that no untested artifact reaches production. Automated testing is what allows teams to make changes quickly, confident that problems will be found early, long before they reach production (see the sketch right after this list).
  • Full-stack Monitoring and Data-Driven Improvement. Continuously monitor applications down to infrastructure and use the resulting insights to enhance performance and reliability.
  • (And finally) Keep It Simple! Whenever an easier solution appears, it is most likely also a superior one.
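As a small illustration of treating analytics as code and testing everything, here is a sketch in Python of a hypothetical transformation kept under version control together with an automated regression test for it. The file, function, and column names are assumptions made up for the example, not part of any particular toolchain.

```python
# transform.py -- a transformation treated as versioned, testable code
import pandas as pd

def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows and derive a revenue column (hypothetical logic)."""
    df = raw.dropna(subset=["order_id", "quantity", "unit_price"]).copy()
    df["revenue"] = df["quantity"] * df["unit_price"]
    return df

# test_transform.py -- an automated regression test run on every change
def test_clean_orders_drops_incomplete_rows_and_adds_revenue():
    raw = pd.DataFrame(
        {
            "order_id": [1, 2, None],
            "quantity": [2, 1, 5],
            "unit_price": [10.0, None, 3.0],
        }
    )
    result = clean_orders(raw)
    assert list(result["order_id"]) == [1]      # incomplete rows removed
    assert "revenue" in result.columns          # derived column present
    assert result["revenue"].iloc[0] == 20.0    # 2 * 10.0
```

Because both the transformation and its test live in version control, every change can be regression-tested automatically before it reaches production.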

Sound a bit familiar? It should!

DataOps and DevOps

Much like DevOps in the enterprise, the emergence of enterprise DataOps mimics the practices of modern data management at large internet companies over the past 10 years. The engineering framework that DevOps created is great preparation for DataOps. Just like the internet companies needed DevOps to provide a high-quality, consistent framework for feature development, data-driven enterprises need a high-quality, consistent framework for rapid data engineering and analytic development.

(However!) Despite a whole bunch of similarities, there are several key differences between DataOps and DevOps.

1. The Human Factor

One key difference between DataOps and DevOps relates to the needs and preferences of stakeholders. DevOps was created to serve the needs of software developers. Dev engineers love coding and embrace technology. DataOps users are often the opposite of that. They are data scientists or analysts who are focused on building and deploying models and visualizations and are typically not as technically savvy as engineers.

2. The process

The DataOps lifecycle shares the iterative properties of DevOps, but an important difference is that DataOps consists of two active and intersecting pipelines: the data pipeline and the analytics development process. The data pipeline takes raw data sources as input and, through a series of orchestrated steps, produces an input for analytics. DataOps automates this orchestration and monitors the quality of the data flowing into analytics. Analytics development is the process by which new analytic ideas are introduced. It conceptually resembles a DevOps development process, but upon closer examination, several factors described below make the DataOps development process more challenging than DevOps.

3. Orchestration

Orchestration is required in both the data pipeline and the analytics development process. Analytics development orchestration occurs in conjunction with testing and prior to the deployment of new analytics. Orchestration of the data factory is the second orchestration in the DataOps process: it drives and monitors the dataflow.
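To make the data factory orchestration more concrete, here is a minimal sketch of a daily pipeline expressed as a DAG in Apache Airflow. Airflow is used purely as one example of an orchestrator (the article does not prescribe a tool), and the pipeline and task names are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from a hypothetical source system")

def validate():
    print("run data tests before anything flows downstream")

def transform():
    print("produce the analytics-ready dataset")

with DAG(
    dag_id="daily_orders_pipeline",      # hypothetical pipeline name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # the orchestrated dataflow: each step runs only after the previous one succeeds
    extract_task >> validate_task >> transform_task
```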

4. Testing

Tests in DataOps play a role in both the data pipeline and the analytics development process. In the former, tests monitor the data values flowing through the data factory to catch anomalies or flag values outside the norms; in the latter, they validate new analytics before deploying them.
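As an example of the first kind of test, here is a sketch of simple value checks that could run inside the data factory to flag values outside the norms. The column names and thresholds are assumptions for illustration only.

```python
import pandas as pd

def check_orders_batch(batch: pd.DataFrame) -> list:
    """Return a list of warnings for values outside the expected norms."""
    warnings = []
    if len(batch) == 0:
        warnings.append("empty batch: the upstream extract may have failed")
        return warnings
    if (batch["quantity"] <= 0).any():
        warnings.append("non-positive quantities found")
    if (batch["unit_price"] > 10_000).any():   # hypothetical upper bound
        warnings.append("suspiciously high unit prices found")
    return warnings

# In the pipeline, a non-empty warning list would flag the batch for review
# (or stop the flow) instead of silently feeding bad data into analytics.
```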

5. Test data management

The concept of test data management is a first order problem in DataOps whereas in most DevOps environments, it is an afterthought. To accelerate analytics development, DataOps has to automate the creation of development environments with the needed data, software, hardware and libraries so innovation keeps pace with Agile iterations.
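One small piece of that automation might look like the following sketch: producing a masked, down-sampled copy of a production table that can be loaded into a freshly provisioned development environment. The sampling rate and the masked column are assumptions for the example.

```python
import hashlib
import pandas as pd

def make_dev_dataset(prod: pd.DataFrame, fraction: float = 0.05) -> pd.DataFrame:
    """Sample a fraction of production data and mask direct identifiers."""
    dev = prod.sample(frac=fraction, random_state=42).copy()
    # replace a hypothetical customer e-mail with a stable, anonymous hash
    dev["customer_email"] = dev["customer_email"].map(
        lambda value: hashlib.sha256(str(value).encode()).hexdigest()[:16]
    )
    return dev
```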

6. Tools

Unlike DevOps, the tools required to support DataOps are in their infancy. For example, testing automation plays a major role in DevOps, but most DataOps practitioners have to build or modify testing automation tools to adequately test data and analytics pipelines and analytic solutions.

7. Exploratory environment management

Sandbox creation in software development is typically straightforward: the engineer usually receives a bunch of scripts from teammates and can configure a sandbox in a day or two. Exploratory environments in data analytics are often more challenging from a tools and data perspective. Data teams collectively tend to use many more tools than typical software dev teams. Without the centralization that is characteristic of most software development teams, data teams tend to naturally diverge with different tools and data islands scattered across the enterprise.

So, given so many challenges, why bother exploring DataOps for analytic processes?

DataOps benefits

DataOps can help to deal with complex data landscapes and analytic solutions that require the coordination of a broad range of stakeholders and technologies. More precisely, DataOps can contribute in the following areas.

1. Accelerate Time to Production

A major driver for DataOps is speed. The idea of streamlined and largely automated analytics pipelines helps deliver new features and insights quickly and reduces manual effort. Moreover, the short feedback and testing cycles help speed up reactions to changing business requirements and increase flexibility.

2. Increase the value proposition of data and analytics by industrializing processes

The stages and steps that must be orchestrated in a data analytics pipeline are not always serial; often multiple steps happen in parallel, and once completed, a step might be repeated as part of an iterative, agile workflow to refine output until it gains user acceptance. This approach increases quality, because it ensures that no untested change makes it to production. It also improves orchestration and collaboration, as the different actors in the pipeline rely on one another and work together in a fluid process.

3. Support the management and orchestration of heterogeneous technologies

A key role of DataOps is to orchestrate and automate the flow of data and code between people and tools in an efficient manner that ensures clean handoffs and minimal errors and disruptions. With complex pipelines, this can be challenging, making orchestration and automation key facilities in any DataOps implementation.

4. Improve Collaboration and establish a culture of continuous improvement

DataOps comes with a change in culture that promotes collaboration, trust, and responsibility. The goals are to blur the lines between departments and functions, encourage the exchange of knowledge, reduce conflicts, and eventually increase productivity. The convergence of different roles helps align changes throughout the various stages, such as when a data engineer learns about the cleansing issues a data scientist encounters downstream, or about the poor performance of an ETL process in production.

5. Assure a stable and efficient operation of applications and infrastructure

Well-defined analytics pipelines enhance both speed and robustness. Multiple stages of automated and manual tests prevent the deployment of flawed updates. In addition, DataOps includes monitoring of production environments to identify bottlenecks or potential issues, thereby improving the efficiency and stability of infrastructure and applications.
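As a tiny illustration of such monitoring, here is a sketch of a check that compares the latest pipeline run against simple baselines and raises alerts when something looks off; the baseline values and thresholds are assumptions.

```python
def check_pipeline_run(duration_seconds: float, rows_loaded: int,
                       baseline_duration: float = 600.0,
                       baseline_rows: int = 100_000) -> list:
    """Return alerts when a run deviates notably from its baseline."""
    alerts = []
    if duration_seconds > 2 * baseline_duration:
        alerts.append("run took more than twice the usual time")
    if rows_loaded < 0.5 * baseline_rows:
        alerts.append("row count dropped to less than half of the baseline")
    return alerts

# A scheduler or monitoring agent would call this after every run and
# forward any alerts to the team's chat or incident tooling.
```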

6. Enable Self-Service

With greater automation and machine learning algorithms that simplify development, deployment and performance management tasks, organizations need fewer experts to build and manage data and analytics pipelines. Business users with some degree of technical savvy can build their own pipelines or move code into production.

7. (Last but not least) Operationalize data science to provide more value to the business

I suggest we discuss this in more detail, as I truly believe it is one of the greatest challenges for modern organizations, even those that are mature from a data management perspective.

Operationalizing Data Science

Advancements in big data, machine learning, and artificial intelligence enable many new applications and at the same time pose two major challenges to analytics practice.

Operationalization of Data Science

In other words, transforming statistical models from an experimental stage to production so that they deliver ROI. Data science still lacks standards and robust best practices. Consequently, data scientists often have custom-tailored processes and sometimes use exotic tools. This rarely becomes an issue until an isolated data science approach must be converted into a repeatable, efficient, and flawless application. For instance, a statistical model might behave differently when it is run in a large big data cluster, and it can be hard to tell why. However, this often difficult operationalization is necessary to get ROI from data science.

Aligning Data Scientists with product owners, and data engineers

A key to data science is the combination of business knowledge, statistical experience, programming, and data skills. These different skill sets are rarely found in one person and usually require different experts to work together. However, it can be difficult to align the various mindsets and interests of data scientists, product owners, and data engineers. Getting the most out of data science requires demystifying the discipline by establishing defined structures and processes. The hard part here is to put data science into a frame without infringing on the necessary autonomy of data scientists.

DataOps can provide at least three starting points:

  1. Bring Data Science to Production with Analytics Pipelines. An analytics pipeline for data science can help structure the process of moving statistical models from the lab into the field. Here, models need to pass different tests that measure their accuracy and scalability, bringing together the knowledge of data engineers, data scientists, and the operations team (a minimal sketch of such a gate follows this list). A streamlined deployment process with comprehensive monitoring can help to refine statistical models, ensure scalability, prevent one-off models, and deliver continuous business value.
  2. Orchestration and Reuse. DataOps can discover common ground where various stakeholders and technologies can act in concert. Here, orchestration tools enable automatic combinations of various technologies. Such a tool, for instance, allows a data scientist to check in her custom Python code, and then ensures that it automatically gets tested and converted so it can be used subsequently by other ETL or visualization tools. Therefore, it puts an end to technology discussions and shifts focus to actual business problems.
  3. Collaboration and Cross-Functional Thinking. Communication is key to bridging the gaps among the magic of data scientists, the daily work of a data engineer, and the business. The DataOps way of work includes cross-functional thinking and short feedback cycles to increase the relevance of data science, as the business always has a seat at the table. Similarly, integrating the knowledge of data engineers early simplifies the scaling of statistical models for production.
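To make the first point more tangible, here is a sketch of a simple acceptance gate that a candidate model would have to pass before deployment. The accuracy threshold and the promotion rule are assumptions; real pipelines would also check scalability and other criteria.

```python
from sklearn.metrics import accuracy_score

def promote_if_better(candidate, production, X_holdout, y_holdout,
                      min_accuracy: float = 0.80) -> bool:
    """Allow deployment only if the candidate clears an absolute bar
    and beats the model currently in production (hypothetical gate)."""
    candidate_acc = accuracy_score(y_holdout, candidate.predict(X_holdout))
    production_acc = accuracy_score(y_holdout, production.predict(X_holdout))
    return candidate_acc >= min_accuracy and candidate_acc > production_acc

# In a DataOps pipeline this check runs automatically after training;
# only a True result lets the deployment step proceed.
```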

Still not sure why you should bother?

Over the last decade, the DataOps idea has evolved from an emerging trend into an applicable set of principles and even technologies. Moreover, DataOps adoption has skyrocketed in the past year, driven by data sprawl across hybrid and multi-cloud environments, increased data privacy regulations, and the need for companies to accelerate innovation in an extremely dynamic digital landscape.

Companies that want to implement DataOps should focus their efforts in three areas: culture, organization, and technology.

This is what I would like to go into in more detail in the next chapter.

Stay tuned.
