Cloud Trends: A Mainstream Evolution to DataOps

Why DataOps is becoming an impactful tech approach for organizations

Justin Van Wygerden
Slalom Data & AI
5 min read · May 8, 2023

Photo by Desola Lanre-Ologun on Unsplash

Over the years, the traditional DevOps concept has done an extraordinary job of providing end-user value — specifically when streamlining skill sets around development and cloud-native operations. In return, teams have been able to build end-to-end, product-centered models that (1) decrease time to market for new applications and (2) increase overall application availability. Yet, in the new advanced era of the cloud, company needs are becoming increasingly complex. One such complexity lies in aggregating cloud-native, actionable data insights, including how to automate cloud-native data infrastructure to obtain those insights rapidly.

For more clarity on how to obtain and use this data, companies should consider the following set of questions:

  1. How do I create a data lake and/or data mesh (and what is the difference between the two)?
  2. What cloud-native components of a data lake or data mesh can be provisioned with automation?
  3. How can we automate data processing early using clearly defined, repeatable pipelines?
  4. How can we increase compliance and security of our data?

In order to fully answer these questions, many teams are turning to solutions that address a new mainstream “ops” concept: DataOps. DataOps is a model that combines managed cloud-native services and DevOps skills for the benefit of data-driven initiatives. This article will cover the four essential components of DataOps, as well as details on how to kick-start your DataOps strategy!

Data lakes and meshes

Data lakes and data meshes are trending toward becoming cloud-native entities, but the difference between the two is critical. Traditionally, data lakes are created to aggregate data from all areas of an organization, typically in a centralized location. For a cloud-migrated organization, this would normally reside in a cloud-native data warehouse (think Amazon Redshift or Snowflake). However, a single data warehouse is becoming more difficult to manage as cloud workloads increase.

The data mesh concept, per Zhamak Dehghani, increases the emphasis on product-based value. In order to deliver actionable insight to an organization, the IT industry appears to be trending toward leveraging “mini” data lakes created through repeatable processes. For example, if your organization has four key lines of business (each requiring separate, actionable data insights), it would likely be more valuable to have distinct data “pools” or “puddles.” Each of these more granular data lake entities can then serve a clearer, more direct purpose (ideally speeding up valuable, market-based insights), allowing for a more well-defined, scalable data strategy across an organization.

Automation of data lake/mesh cloud components

Whichever data strategy an organization chooses, the concept of cloud specialization likely becomes essential. For a data lake or data mesh, knowing the best practices for data architecture in each of the major cloud vendors (AWS, Azure, and GCP) is especially important. Additionally, even if an organization uses a cloud-agnostic solution like Snowflake, some degree of knowledge of the components “under the cloud vendor covers” becomes critical. Cloud-agnostic data warehouses are often deployed on a major cloud vendor’s infrastructure, which creates complications for teams that do not understand that vendor’s services.

In order to automate data-based cloud infrastructure, teams will likely turn to an infrastructure as code (IaC) tool. By utilizing IaC tools such as Terraform, automated pipelines can keep track of updates made to separate, distinct instances of data warehouses. The purpose is to increase the velocity of creating and updating DataOps cloud infrastructure, while strengthening compliance and security capabilities. For technical examples, below are relevant Terraform provider resources for each major cloud vendor (note all are available on the HashiCorp Terraform Registry), followed by a brief provisioning sketch:

  1. Amazon Redshift Terraform Resource (aws_redshift_cluster)
  2. Azure Synapse Terraform Resource (azurerm_synapse_workspace)
  3. GCP BigQuery Terraform Resource (google_bigquery_dataset)
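
As a minimal sketch of the first resource above, the snippet below provisions a single-node Redshift cluster. All names and sizing values are illustrative assumptions, not recommendations:

```hcl
variable "redshift_password" {
  type      = string
  sensitive = true                          # supply via a secret store, never hardcode
}

# Minimal single-node Amazon Redshift cluster managed as code.
resource "aws_redshift_cluster" "analytics" {
  cluster_identifier = "dataops-analytics"  # hypothetical identifier
  database_name      = "analytics"
  master_username    = "dataops_admin"
  master_password    = var.redshift_password
  node_type          = "ra3.xlplus"
  cluster_type       = "single-node"
}
```

A real deployment would add networking, encryption, and snapshot settings, but even this small block gives the warehouse a tracked, reviewable definition.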

Repeatable, data-processing pipelines

Data-processing pipelines often follow repeatable steps that can be automated within a cloud infrastructure footprint. For example, data pipeline logic (which often runs using languages such as Python or R) needs to run within an underlying DataOps infrastructure. Two core options are a container orchestration strategy and a serverless strategy. For container orchestration, a managed container cluster (e.g., Kubernetes) can be used to execute the application code. Popular choices for a managed Kubernetes (K8s) service include EKS (Amazon), AKS (Azure), and GKE (GCP).
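
The cluster itself can be provisioned through the same IaC workflow. Below is a minimal Terraform sketch of an EKS control plane; the cluster name, IAM role, and subnets are assumed placeholders, and a real setup would also need node groups (or Fargate profiles) to actually schedule pods:

```hcl
variable "eks_cluster_role_arn" { type = string }       # assumed pre-existing IAM role
variable "private_subnet_ids"   { type = list(string) } # assumed existing private subnets

# Minimal managed EKS control plane for containerized pipeline steps.
resource "aws_eks_cluster" "dataops" {
  name     = "dataops-pipelines"      # hypothetical cluster name
  role_arn = var.eks_cluster_role_arn

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```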

A container orchestration model allows ETL (extract, transform, load) processes to be built at scale. Additionally, each of these steps can be made more granular by running in a container that does not need to be long-running (reducing DataOps-related compute and storage costs). Moreover, managed K8s services tie directly into their respective cloud identity solutions, allowing for repeatable authentication/authorization solutions.
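
As one way to express such a short-lived step, the Terraform kubernetes provider can define a Kubernetes Job that runs a single ETL task and exits. The image and command below are hypothetical:

```hcl
# A run-once ETL step: the pod executes the transform script and terminates.
resource "kubernetes_job" "transform_step" {
  metadata {
    name = "daily-transform"                          # hypothetical job name
  }
  spec {
    template {
      metadata {}
      spec {
        container {
          name    = "transform"
          image   = "registry.example.com/etl:latest" # assumed pipeline image
          command = ["python", "transform.py"]
        }
        restart_policy = "Never"                      # run once, do not restart in place
      }
    }
    backoff_limit = 2                                 # retry a failed run up to twice
  }
}
```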

On the other end of the spectrum, serverless strategies can also be used for data-processing logic. Often, this is a valuable choice for teams that have a uniform set of cloud services (whereas the K8s option is more beneficial for teams wishing to remain cloud agnostic). While it is extra work to maintain data pipelines using separate serverless FaaS-based cloud services (AWS Lambda, Azure Functions, or GCP Cloud Functions), the major benefit is the ability to create truly event-driven data pipelines. Event-driven logic is innately more seamless in the serverless strategy than in a Kubernetes-based one.
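
For instance, a minimal event-driven sketch on AWS could wire an S3 bucket to a Lambda function so that each new object triggers a pipeline step. All names, the IAM role, and the packaged code artifact below are assumptions:

```hcl
variable "lambda_role_arn" { type = string } # assumed pre-existing execution role

# Landing zone for raw data; new objects here kick off the pipeline.
resource "aws_s3_bucket" "raw" {
  bucket = "dataops-raw-landing-zone"        # hypothetical bucket name
}

# Pipeline step packaged as a Lambda function.
resource "aws_lambda_function" "ingest" {
  function_name = "dataops-ingest"           # hypothetical function name
  role          = var.lambda_role_arn
  handler       = "ingest.handler"
  runtime       = "python3.11"
  filename      = "ingest.zip"               # assumed packaged pipeline code
}

# Allow S3 to invoke the function.
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ingest.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw.arn
}

# Fire the function whenever a new object lands in the bucket.
resource "aws_s3_bucket_notification" "raw_events" {
  bucket = aws_s3_bucket.raw.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.ingest.arn
    events              = ["s3:ObjectCreated:*"]
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
```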

Increasing data compliance and security

Of all components, this may be the most valuable for DataOps. By creating automated cloud infrastructure, a company’s security posture can improve drastically. The key is to keep data cloud infrastructure under source control. For example, by provisioning infrastructure from version-controlled code with an IaC tool like Terraform, auditing capabilities vastly increase. Any changes to the data cloud infrastructure can be tracked, whereas manual changes in the past were more difficult to audit. In addition, any configuration drift of a compliant, secure landing zone can be detected early, revealing deviation from the desired state.

DataOps capabilities via IaC also allow teams to use cloud-native security services in a consistent, standardized way that keeps pace with new features. With data cloud infrastructure stored in a repository, a security engineer can simply open a GitHub pull request to add or update a security service.
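
As a sketch of that workflow, enabling a service such as Amazon GuardDuty becomes a one-resource pull request rather than a manual console change; drift from the committed state can then be caught in CI (for example, `terraform plan -detailed-exitcode` exits with code 2 when live infrastructure differs from the code):

```hcl
# Threat detection enabled declaratively; the change is reviewable and auditable.
resource "aws_guardduty_detector" "main" {
  enable = true
}
```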

Conclusion

For organizational leaders, addressing the different topics related to the growth of DataOps is difficult, especially as the strategy continues to evolve. At a high level, the thought process is that new value streams can be added by combining data and cloud-native initiatives, specifically by leveraging your cloud vendor(s) of choice. Often this means combining skill sets between cloud infrastructure and data, as well as looking at investment areas more broadly (whereas in the past, these two divisions may have been more isolated). Going back to foundational DevOps theory, unifying these two tech units creates more effective ways to increase innovation, and thinking about what the two disciplines can deliver together opens the door to capabilities neither could achieve alone.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.
