DataOps: Bringing DevOps to Analytics Teams
Data teams have, for years, taken inspiration from software engineering teams on team structure, tooling, and general principles. The analyst has become the analytics engineer, aiming to bring software engineering principles to analytics code.
DevOps and DataOps sit in parallel: both aim to help software and data teams, respectively, produce more efficient and reliable outputs. Where they differ is in execution: one focuses on application quality, the other on data quality.
This difference manifests itself in what each team is tasked with monitoring within an organization.
Focus areas of DevOps versus DataOps
You don’t have to trust me on this one — let’s get some definitions from industry leaders.
Atlassian defines DevOps as:
“DevOps is a set of practices, tools, and a cultural philosophy that automate and integrate the processes between software development and IT teams. It emphasizes team empowerment, cross-team communication and collaboration, and technology automation.”
Let’s digest this a bit. Infrastructure and deployment, previously a responsibility of IT, are no longer siloed from code and are now maintained by software engineering teams. With this, teams can deploy faster and focus on the reliability and availability of the applications being deployed. Deployment times shrink, and with them the feedback loop for iterating and making changes.
DataOps, on the other hand, is defined by IBM as:
“DataOps [is] the orchestration of people, process, and technology to deliver trusted, high-quality data to data citizens fast. The practice is focused on enabling collaboration across an organization to drive agility, speed, and new data initiatives at scale.”
The similarities are obvious, starting with unifying the responsibility of infrastructure and code. However, DataOps is focused on data, not applications. Of course, the infrastructure and applications are a means to an end: without them, the data cannot flow through to the appropriate stakeholders. But testing and reliability don’t stop at the infrastructure; they extend to the data actually flowing through it.
Data teams are focused on enabling key decision-makers to leverage data in their decisions, products, and services. The role of DataOps is to ensure the data team can do this work efficiently, consistently, and reliably.
So what tools can DataOps leverage to be as thorough and efficient as possible?
DataOps and the data tooling landscape
I’d argue there are many pieces to the DataOps puzzle, contributing to the grand vision of a fast-moving but highly reliable data team.
Data infrastructure as code: Software infrastructure looks less like a boutique pet groomer and more like a cattle ranch as scales increase. To get there, configuring various pieces of infrastructure can’t rely on button clicking and interfaces anymore — it has to be managed as code that can be scripted, copied, easily rolled back, etc. The same is true for both data code and infrastructure. We can look to dbt Labs as a great example of defining data transformations by checking them into a repo. Data teams can also benefit from adopting typical infrastructure-as-code tools like Terraform, bringing version control beyond code and to the infrastructure itself.
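As an illustrative sketch (function and field names are hypothetical, not any particular tool’s API), a transformation defined as code rather than assembled by hand in a UI might look like:

```python
# Hypothetical example: a transformation defined as code, so it can be
# reviewed, versioned, and rolled back like any other source file.

def build_orders_summary(orders):
    """Aggregate raw order rows into per-customer spend totals."""
    summary = {}
    for order in orders:
        customer = order["customer_id"]
        summary[customer] = summary.get(customer, 0) + order["amount"]
    return summary

raw_orders = [
    {"customer_id": "a", "amount": 10.0},
    {"customer_id": "b", "amount": 5.0},
    {"customer_id": "a", "amount": 2.5},
]

print(build_orders_summary(raw_orders))  # → {'a': 12.5, 'b': 5.0}
```

Because the transformation lives in a repository, a bad change is a revert away, and every change carries an author, a review, and a history.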
CI/CD: Just like DevOps, continuous integration and continuous deployment enable data teams to iterate quickly and reliably. With versioning tools like GitHub and build tools like CircleCI, data teams can write code, test it, and deploy it without involving other teams’ resources.
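In that spirit, here is a minimal sketch (names hypothetical) of the kind of check a CI job could run on every commit, failing the build before a bad change reaches production:

```python
# Hypothetical CI check: assert a basic invariant of a transformation
# so the build fails before the change ever ships.

def deduplicate_events(events):
    """Keep only the first occurrence of each event id."""
    seen = set()
    result = []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            result.append(event)
    return result

def test_deduplicate_events():
    events = [{"id": 1}, {"id": 2}, {"id": 1}]
    deduped = deduplicate_events(events)
    assert [e["id"] for e in deduped] == [1, 2]

test_deduplicate_events()
print("all checks passed")  # → all checks passed
```

A CI tool like CircleCI would run a test suite like this automatically on every pull request, so reviewers only see changes that already pass.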
Communication: While DevOps focuses on communication between software engineers and IT, DataOps enables analytics teams to communicate data and insights with the rest of the business, and do so reliably. Efficient DataOps means analytics teams can address requests from business stakeholders iteratively, quickly, and confidently. Task management tools like Jira and Trello help keep software engineering teams organized, and can do the same for data teams.
Observability and monitoring: Automated testing frameworks are usually limited to unit tests when it comes to software applications. In the data world, pipeline tests and ongoing monitoring of metrics fill the need for alerting when things are amiss. Within a pipeline, testing stops inaccurate data from reaching production tools, whether through reverse ETL or dashboards for executives. However, you don’t know what you don’t know: pipeline testing relies on human-coded tests. In parallel, monitoring tools (like Bigeye) collect a broad array of statistics from the data and detect anomalies automatically to help discover these unknowns. Some of my colleagues at Bigeye wrote about their take on observability and monitoring.
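To make the idea concrete, here is a minimal sketch (not any vendor’s implementation) of anomaly detection on a pipeline metric: flag today’s value, such as a table’s row count, when it falls far outside the historical distribution.

```python
import statistics

# Sketch of automated anomaly detection on a monitored metric:
# flag a value more than z_threshold standard deviations from
# the historical mean.

def is_anomalous(history, today, z_threshold=3.0):
    """Return True if `today` deviates from the mean of `history`
    by more than z_threshold sample standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Hypothetical daily row counts for a table.
row_counts = [10_000, 10_250, 9_900, 10_100, 10_050, 9_950, 10_150]

print(is_anomalous(row_counts, 10_080))  # → False (within normal range)
print(is_anomalous(row_counts, 1_200))   # → True (pipeline likely dropped rows)
```

The point is that nobody had to write a test saying “row counts should never drop to 1,200”; the monitor learns the normal range from history and surfaces the unknown unknowns.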
Just like DevOps, however, tools on their own don’t make for effective DataOps. The people and processes are just as important.
Setting up the analytics team for success
DataOps is a key factor in the trust built around data. If teams lack scalable practices — like version control, CI/CD, and observability — they’ll either fail to move fast enough to keep their org fed with the data it needs, or they’ll fail to deliver enough reliability for that data to be trusted. This trust measures how effective analytics teams are at enabling stakeholders to rely on data day-to-day. Answering questions quickly is what keeps stakeholders going to analytics teams instead of working around them.
Observability and alerting keep analytics teams confident they won’t lose that trust and reliability. With that, I encourage you to check out tools like Bigeye, dbt, and CircleCI.