Understanding DataOps Through DataOps

Denzel S. Williams
The Data Driven Diaries
11 min read · Jun 15, 2023

Currently, I’m part of a team responsible for constructing a Snowflake-based Enterprise Data Warehouse and Data Processing Platform. Naturally, a project of this scale requires a sizeable team and a wide range of tools. In the project’s early stages, our lead Architect emphasized the vital role of DataOps (the tool) and urged me to quickly familiarize myself with the product.

When it comes to familiarizing myself with a new tool, my initial instinct is to dive into the documentation provided by the creators. However, I found myself grappling with the DataOps documentation quite a bit. It wasn’t because the documentation was poorly written; rather, I was having difficulty grasping the purpose of this particular tool.

It didn’t take long for me to realize that the DataOps tool derives its name from the broader DataOps concept. In light of this realization, I decided to take a step back from the tool and go on a quest to gain a deeper understanding of the underlying idea.

During my research for the project, I gathered a collection of excerpts, definitions, ideas, and comments from around the internet. I must clarify that the vast majority of what follows, roughly 97%, is not my own words; it was copied, pasted, and rearranged in my notes. At the time, I had no intention of publishing this material, so unfortunately, I can’t recall the exact sources.

Deriving the Need for DataOps

According to The Economist, data has overtaken oil as the world’s most valuable resource. More and more companies are aligning with this conclusion as they awaken to the fact that the data they are generating is very valuable.

Let’s take Netflix as an example. As a user, you have the ability to express your preference by liking or loving a show you’re watching. It may seem like a small and insignificant action, but it actually plays a significant role in driving Netflix’s multi-billion dollar revenue stream. By indicating your preferences, you enable Netflix to make tailored recommendations, suggesting more content that aligns with your tastes or even content that appeals to users similar to you. Taking it a step further, this valuable information provides insights on how they can create content that has a high probability of success.

With this new knowledge, the number of data customers is skyrocketing, and with it the demand for access to data throughout the company. What businesses increasingly demand from data is insight, and delivering that insight with greater speed, efficiency, and reliability is a critical goal.

A data customer is anybody related to an organization who needs the data for a particular function. For example, a financial controller needing sales data, a marketing person needing product data, a customer wanting to see their account details and marketing preferences, a data scientist needing specific information for predictive modeling, or software developers requiring data for application development.

Similar to oil, data requires processing before it can be effectively utilized by data customers. The challenge lies in the fact that data processing is far from a straightforward task, and yet it must be capable of meeting the increasing demand.

There has now been a shift in how organizations think about their data. Data is now considered a product in itself, and delivering it to the business needs to work the same way as delivering a new software product to the public. When you switch your viewpoint from data project to data product, you realize that, like any product, it needs to be built well and to keep evolving.

A data product is an application or tool that uses data to help an organization improve their decisions and processes. A trusted data product is a reusable business asset used in multiple places and multiple times.

DataOps Definitions

After understanding the origins of DataOps I wanted a concrete definition. Fortunately, or rather unfortunately, there was no shortage of them. I broke down different definitions into three levels of simplicity.

Level I Definitions

  1. DataOps is an umbrella term that encompasses everything that is involved in the operations around the data.
  2. DataOps means building, testing, and deploying data platforms, the exact same way we do software platforms.
  3. DataOps is the use of agile practices to create, deliver, and manage data applications quickly and reliably.
  4. DataOps is how modern companies strategically manage and integrate analytics to uncover new opportunities, quickly respond to issues, and even predict the future.

Level II Definitions

  1. DataOps is a pervasive, automated methodology for optimization of the development, management, and execution of data pipelines ensuring compliance, data quality, and quick time to market for analytics.
  2. DataOps is the collection of processes, patterns, and best practices around delivering data products. It is all about orchestrating data, tools, code, and environments from beginning to end, with the aim of providing reproducible results. DataOps encompasses everything from collection to delivery.
  3. DataOps is a practice that applies agile engineering and DevOps best practices in the field of data management to better organize, analyze, and leverage data to unlock business value. It’s a collaboration between DevOps teams, data engineers, data scientists, and analytics teams to accelerate the collection and implementation of Data-Driven Business insights.
  4. DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable:
    - Rapid innovation and experimentation delivering new insights to customers with increasing velocity.
    - Extremely high data quality and very low error rates.
    - Collaboration across complex arrays of people, technology, and environments.
    - Clear measurement, monitoring, and transparency of results.

Level III Definitions

  1. DataOps, short for “data operations,” brings rigor to the development and management of data pipelines. It promises to turn the creation of analytic solutions from an artisanal undertaking by a handful of developers and analysts to an industrial operation that builds and maintains hundreds or thousands of data pipelines. DataOps not only increases the scale of data analytics development, but it accelerates delivery while improving quality and staff productivity. In short, it creates an environment where “faster, better, cheaper” is the norm, not the exception.
  2. DataOps applies the use of DevOps technologies to the collaborative development and maintenance of data and analytical pipelines in order to produce trusted, reusable and integrated data and analytical products. These products include trusted, reusable datasets, predictive models, prescriptive models, decision services, BI reports and dashboards. The objective is to accelerate the creation and maintenance of these data and analytical products via continuous, component-based development of data and analytical pipelines that assemble and orchestrate data cleansing, data transformation, data matching, data integration, and analytic component-based services. In addition, all changes to versions and operating configurations are managed with build, test, and deployment automation in order to shorten time to value.

After diving into the origins of DataOps and exploring different definitions, one common thread emerges: DataOps is all about helping organizations get the most out of their data — faster, better, cheaper. Definitions may vary in their level of detail, but my personal favorite is the simplest one: “DataOps covers everything related to operations around data.” There’s a whole lot that falls under data operations — setting up roles and permissions, following coding standards, keeping an eye on data warehouses, building reports, testing data quality, creating documentation, sharing files, and yes, even scheduling meetings. It’s a lot to handle, which is why it’s hard to nail down a one-size-fits-all definition. And if you could, it would either be too much to remember or way too vague (i.e., the Level III definitions).

In my search to understand DataOps, TrueDataOps has been one of the best resources. Since I’m pretty sure the site is heavily affiliated with DataOps and is low-key advertising the tool, I wouldn’t take it as the end-all, be-all. However, it provides tons of helpful links and breaks the idea down from start to finish. With the help of my past experiences, I am going to discuss four of the seven pillars that I have come to understand while working with DataOps (the tool):

  1. ELT (and the Spirit of ELT)
  2. CI/CD (inc Orchestration)
  3. Component Design & Maintainability
  4. Environment Management

ELT (and the Spirit of ELT)

“Lift & Shift — Building a future that you can’t yet anticipate”

The very first data project that I executed on my own, from ideation to completion, was called Bike Share USA. My aim was to introduce bike sharing to my neighborhood and develop a model to predict the optimal number of stations required. It was during this project that I discovered my passion for Data Engineering, as a significant 75% of the effort revolved around that field.

To kickstart the project, I had to create ETL jobs in Python that extracted, transformed, and loaded data into an RDS database. Writing all the logic for those jobs took an insane amount of time. But here’s the kicker: when I finished the ETL phase and started the analytics portion, I changed my mind about what I wanted to do and had to go back and edit all of those jobs. It turned into an absolute nightmare due to the lack of readability, flexibility, and extensibility in the code I had initially written (more on this later).

All in all, it took me 5 of the project’s 7 months just to get through the ETL phase. Considering the current landscape, dedicating over a third of a year to simply loading data feels excessive in a real-world scenario. Opting for the ELT approach, on the other hand, would likely have let me replicate that work within a week and deliver a much better data product (see the sketch after the bullets below).

  • As a Data Engineer/Analytics Engineer, I truly appreciate the value of raw data. It eliminates the need for an extensive understanding of the data or how it will be used by the business in order for me to perform my tasks effectively. My role involves handling the data, ensuring its accurate placement, and providing appropriate documentation.
  • Switching gears to an Analytics Engineer/Data Analyst perspective, raw data is the definition of potential energy. It offers an untainted view of the information, enabling me to work with it in its purest form. I have the freedom to construct my analyses and solutions according to my own preferences and requirements, while also having the flexibility to start from scratch whenever I feel the need.
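To make that contrast concrete, here is a minimal sketch of what the ELT version of that workflow could look like. It assumes the snowflake-connector-python package, and the stage, table, and column names are all hypothetical. The point is that the data lands raw, and every transformation is just SQL inside the warehouse that can be rewritten later without touching the load.

```python
import snowflake.connector

# Connect to the warehouse (credentials and object names are placeholders).
conn = snowflake.connector.connect(
    user="MY_USER",
    password="MY_PASSWORD",
    account="MY_ACCOUNT",
    warehouse="TRANSFORM_WH",
    database="BIKE_SHARE",
    schema="RAW",
)
cur = conn.cursor()

# E + L: land the files exactly as they arrive, with no business logic yet.
cur.execute("""
    CREATE TABLE IF NOT EXISTS RAW.TRIPS_RAW (
        payload VARIANT,  -- the raw record, untouched
        loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
    )
""")
cur.execute("""
    COPY INTO RAW.TRIPS_RAW (payload)
    FROM @RAW.TRIP_FILES_STAGE  -- hypothetical external stage
    FILE_FORMAT = (TYPE = JSON)
""")

# T: the transformation lives in SQL inside the warehouse. When the analytics
# requirements change, only this statement changes; the raw data never has to
# be re-extracted or re-loaded.
cur.execute("""
    CREATE OR REPLACE TABLE ANALYTICS.TRIPS AS
    SELECT
        payload:station_id::INT       AS station_id,
        payload:started_at::TIMESTAMP AS started_at,
        payload:duration_sec::INT     AS duration_sec
    FROM RAW.TRIPS_RAW
""")

cur.close()
conn.close()
```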

Component Design & Maintainability

“Small atomic pieces of code and configuration”

As far as I’ve come, my experience converges around this pillar. When you build something, design it to be readable and flexible for the future. Although I’m sure there are other pillars of “good code” that I have not mentioned, these two have really resonated with me so far.

It is the readability and flexibility of the code that make it easy to maintain. Every piece of code plays a crucial role in keeping the business running, and every piece of code will eventually be read again, by you or by someone else. The little things make the biggest difference for readability and maintainability, such as how you name objects and organize code. It’s not so much about being smart and doing something slick. In fact, I look back on my slick solutions and wonder what was wrong with me.
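As a toy illustration of the difference (the order data and field layout are made up), here is a “slick” calculation next to the version I would actually want to maintain a year from now:

```python
# "Slick": correct, but every read requires decoding the magic indexes.
def rev(rs):
    return sum(r[2] * r[3] * (1 - r[4]) for r in rs if r[1] == "closed")


# Readable: the names carry the intent, so future-you doesn't have to guess.
def total_closed_revenue(orders):
    """Sum revenue for closed orders, net of any discount.

    Each order is a (order_id, status, quantity, unit_price, discount_rate) tuple.
    """
    total = 0.0
    for order_id, status, quantity, unit_price, discount_rate in orders:
        if status != "closed":
            continue
        total += quantity * unit_price * (1 - discount_rate)
    return total
```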

As this pillar is mainly concerned with the code that is being written, I want to extend this pillar to the tools that are being used as well. As you build out an enterprise data warehouse, you will probably use a lot of tools, and there is a good chance that your stack’s capabilities will overlap. As much as possible, I recommend using a “separation of responsibilities” approach.

During the early prototyping at the beginning of the project, we used Azure Data Factory to read and write data in blob storage, create tables in Snowflake, execute stored procedures, and even perform some data transformation. But we also needed DataOps SOLE to create other tables and DataOps MATE to handle other transformations. Getting anything done required a lot of mental overhead and insider knowledge; the intricacies and nuances of each system had to be understood to work effectively or to fix things when they broke.

We have now separated out each tool’s responsibilities based on what it does best: Azure Data Factory reads and writes data in blobs, DataOps SOLE creates external tables pointing at those blobs, and DataOps MATE transforms those tables. When you are trying to do one thing, you work in one system without having to think about how a thousand other things work: if you are doing a transformation, you work in MATE; if a table isn’t being created, then SOLE is the culprit.

CI/CD (inc Orchestration)

“Repeatable and orchestrated pipelines for building/deploying everything data”

I discovered CI/CD when I started using the Serverless Framework instead of manually packaging and uploading Python code to AWS Lambda, and I immediately fell in love with it. That simple piece of orchestration easily saved me hundreds of hours of time and effort.

A major advantage of CI/CD is that everything is stored in a version control system. With DataOps, everything about everything lives in one place, from Snowflake RBAC configurations to MATE transformations, so you can see your infrastructure from end to end in one repository.

CI/CD lets robots do what they are good at. It allows changes to be deployed quickly and easily, which is amazing, but it can also lead to inconsistent behavior. Developers are free to develop without worrying about deployment, and that can backfire because each developer develops differently. That freedom is fine, but for maintainability’s sake you still want everyone developing in the same way.

The main thing I learned about CI/CD, especially when it comes to Infrastructure as Code, was the need to create supporting structures like style guides, templates, branching strategies, and branch protections. Unless you take care of these, it will be the wild west. These are the things that really matter and, for me, tend to sit at the back of my mind (along with updating DevOps/Jira boards 😵‍💫).
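As one example of the kind of guardrail I mean (the naming convention and directory path here are made up), a small check like this can run in the pipeline before a merge, so the style guide enforces itself instead of living in someone’s head:

```python
import re
import sys
from pathlib import Path

# Hypothetical convention: transformation models live under this directory
# and must be snake_case with a layer prefix like stg_, int_, or mart_.
MODEL_DIR = Path("dataops/modelling/models")
NAME_PATTERN = re.compile(r"^(stg|int|mart)_[a-z0-9_]+\.sql$")


def main() -> int:
    bad = [
        str(path)
        for path in MODEL_DIR.rglob("*.sql")
        if not NAME_PATTERN.match(path.name)
    ]
    if bad:
        print("Model files violating the naming convention:")
        for path in bad:
            print(f"  {path}")
        return 1  # non-zero exit fails the CI job
    return 0


if __name__ == "__main__":
    sys.exit(main())
```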

Environment Management

“Branching data environments like we branch code”

My favorite DataOps feature is zero-copy cloning of the production environment for my feature branches; it is what makes this tool exceptionally powerful. It gives developers the freedom to develop, period. DataOps creates a clone of the production database for every feature branch I create, so I can use it as I please without feeling nervous about what I can and cannot do. As part of continuous integration, it destroys the clone once I am done developing and ready to merge.
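DataOps handles all of this behind the scenes, but the underlying Snowflake mechanics look roughly like the sketch below (the database names and connection details are placeholders). A zero-copy clone is metadata-only, so it is created in seconds and only starts consuming storage when you change data inside it.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    user="MY_USER",
    password="MY_PASSWORD",
    account="MY_ACCOUNT",
    role="SYSADMIN",
)
cur = conn.cursor()

feature_branch = "feature_add_station_dim"
clone_db = f"PROD_CLONE_{feature_branch.upper()}"

# Branch the data like we branch code: a zero-copy clone of production.
cur.execute(f"CREATE DATABASE IF NOT EXISTS {clone_db} CLONE PROD_DB")

# ...develop and test against the clone without touching production...

# When the branch is merged, the clone is torn down again.
cur.execute(f"DROP DATABASE IF EXISTS {clone_db}")

cur.close()
conn.close()
```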

As someone who deals with infrastructure/analytics code on a daily basis, I cannot express enough how much this feature has improved the development process. I can explore and experiment without worrying about the consequences. This has allowed me to create better, more robust code, faster.

DataOps has helped me understand the previous pillars. TrueDataOps lists a number of other pillars that I don’t comprehend well enough to discuss. In the future, I would like to explore and experiment with Automated Data Testing and Monitoring more. I’ve used DBT tests and packages, but that’s far from automated. Moreover, I’d like to experiment with these pillars outside of DataOps so that I can really put them into practice. While I love DataOps and would use it 100% if I were building a big enterprise, at my stage it feels like cheating and I’m robbing myself of valuable lessons. It’s almost like learning how to multiply for the first time by using a calculator.

I will continue to use (and enjoy) DataOps in our enterprise project, but I’m excited to head back into the lab and keep playing around and see what else is out there. Hopefully I’ll find some interesting things…fingers crossed for how to do a zero copy clone off a feature branch.

A3I Written (Accelerated & Augmented w/ Artificial Intelligence)
