You had me at DataOps

Chantelle Whelan
4 min read · May 19, 2023


Photo by charlesdeluvio on Unsplash

I’ve been working as a data analyst for about five years. Over the last year or so, my work has taken me into a slightly different realm. I was tasked with creating data tables for other teams on the project to use. The tables needed to be refreshed regularly and saved in a location from which they could be transferred to the database available to the other teams. As this process would be triggered by Engineering as part of their full daily run, it was vital that the code ran smoothly and efficiently. Through doing this work I became familiar with the concept of writing ‘clean code’, as well as code repositories, unit testing and SSH.
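To give a flavour of what ‘clean code’ and unit testing mean in practice, here is a minimal, entirely invented sketch: a small cleaning helper and a test for it. The function name and data are hypothetical, not from the pipeline described above.

```python
# Hypothetical sketch: a small cleaning helper plus a unit test for it.
# The function and the sample IDs are invented for illustration only.

def standardise_ids(ids):
    """Strip whitespace and upper-case record IDs before loading."""
    return [s.strip().upper() for s in ids]

def test_standardise_ids():
    # A test runner such as pytest would discover and run this automatically.
    assert standardise_ids(["  ab1 ", "cd2"]) == ["AB1", "CD2"]
```

Keeping helpers this small makes them easy to test, and the tests act as a safety net when the code is refactored later.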

I found myself really enjoying this new task and quickly volunteered to take on more work that made use of these types of skills. It was so satisfying to see that multiple lines of repetitive code could be replaced by one elegant line. For the first time I thought about how the data I use in analysis got to my environment and what was going on behind the scenes when I queried it: why some queries take so long to run, and how I could reduce the computational load by making minor adjustments to my code. And lo and behold, there is a name for this type of work — DataOps!
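As a rough sketch of the kind of repetition I mean (the column names and data here are made up), several near-identical lines can often collapse into a single comprehension:

```python
# Invented example: summing several columns over a list of row dicts.
rows = [
    {"q1": 10, "q2": 20, "q3": 5},
    {"q1": 7, "q2": 3, "q3": 8},
]

# Before: one near-identical line per column.
# totals = {}
# totals["q1"] = sum(row["q1"] for row in rows)
# totals["q2"] = sum(row["q2"] for row in rows)
# totals["q3"] = sum(row["q3"] for row in rows)

# After: the same result in one line, and adding a new column
# means editing the tuple rather than copy-pasting another line.
totals = {col: sum(row[col] for row in rows) for col in ("q1", "q2", "q3")}
```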

Definitions of DataOps

There are a number of definitions of DataOps, depending on who you speak to. From my understanding, the term was first coined by Andy Palmer (CEO and co-founder of Tamr, Inc.) in 2015. Andy defines DataOps as:

“A data management method that emphasises communication, collaboration, integration, automation, and measurement of cooperation between data engineers, data scientists and other data professionals.”

I think this definition covers the field well. DataOps (or data operations) requires collaboration between all teams who make use of the data to create and monitor data products in the most streamlined way possible. Take a machine learning model as an example: many people and steps are needed to build and test the model, push it into production and monitor it over time. Data engineers must create the architecture that enables data scientists to access the data and build the model. For the model to be used by a business, it needs to be automated, with an agreed-upon output. And its performance needs to be monitored to ensure it remains fit for the business’s needs.

To me, DataOps acts as a bridge between data analytics and data engineering. DataOps is required for any data product that is to be used on an ongoing basis. To ensure the product can run smoothly, the analytical code needs to adhere to best practice and be automated. This means the initial code used to run the analysis, which is often not in a fit state to be productionised, needs to be refactored and maintained under version control. As the initial code is written by a data analytics team, its members are best placed to make these changes, since they understand the product and the process that created it. However, data management is a core principle of data engineering, so knowledge of that field is a big component of DataOps. The difference between the two is the stage at which the skills are required: whilst DataOps deals with the day-to-day operation of data pipelines, data engineers design and build those pipelines. This great blog post by DataTalks.Club dives into the similarities and differences between DataOps and data engineering.
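What refactoring exploratory code for production might look like, in a deliberately simplified and hypothetical form: analysis that was copy-pasted once per region becomes a single reusable function that can live in version control alongside a unit test. All names and numbers below are invented.

```python
# Hypothetical refactor: per-region copy-paste analysis turned into one
# reusable, testable function.

def average_spend(records, region):
    """Mean spend for a region, or None if the region has no records."""
    values = [r["spend"] for r in records if r["region"] == region]
    return sum(values) / len(values) if values else None

# Illustrative data, standing in for rows pulled from a database.
records = [
    {"region": "north", "spend": 100.0},
    {"region": "north", "spend": 50.0},
    {"region": "south", "spend": 30.0},
]
```

Because the logic sits in one function, a scheduled production job and an analyst’s notebook can both call the same code, and a change only ever needs to be made in one place.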

Future of DataOps

So where could this field go in future? DataOps falls under the ‘XOps’ umbrella, along with DevOps, MLOps and others. As more and more products are underpinned by data, the need to operationalise them will only grow. The process of automating and streamlining may be broken down even further, resulting in additions to that ‘XOps’ list.

One area where I foresee a big change is job roles within data. I have already seen many data science job listings that ask for DevOps experience. It is increasingly important that data scientists are able to put the models they have built into production. The same applies to any data product, suggesting that job roles in other areas will continue to change.

So I can’t go through a whole blog post in 2023 without mentioning Generative AI, can I?! Generative AI can be used to create documentation, a very important but often neglected part of the data operations process. Other DataOps tasks it can help with include refactoring code and identifying the most efficient infrastructure for a given use case. And if it hasn’t been done already, I’m sure it won’t be long until it is used to review pull requests and maintain models.

Final thoughts…

I have loved working in and learning as much as I can about this field over the last year. Not only has it improved my understanding of data, and therefore the analysis I do, but I have also been able to share my learnings with others in my team. Knowing how to make our code more efficient and fit for production is a vital skill for a data analyst in today’s data-driven world.

So that’s all from me in my first post. If you’ve got this far — thank you so much for reading! Do leave comments with any thoughts or questions. I’d love to hear how others have been using DataOps and learn from you…


Chantelle Whelan

Academic turned data analyst with a passion for DataOps and ethical machine learning