Towards Effective DataOps
Gain the confidence to mess with your data without making a mess of your data.
“If it hurts, do it more often” is a wise piece of advice that DevOps engineers often repeat.
Unless you are a masochist, following this advice will naturally lead you to finding ways to make the process being repeated less painful.
In the world of DevOps, these processes are typically technical deployments such as creating cloud resources or applying version upgrades.
Once it is simple to do these tasks, we no longer need to fear them. We can build applications that are reliable and can feel like true professionals at our craft.
From DevOps to DataOps
The core principles of DevOps that allow us to make things simple are Collaboration, Automation, and Continuous Improvement. Following them leads towards a painless developer experience.
For the purposes of this article, we will explore how these principles need not be limited to traditional software developers, but can be applied to what’s referred to as data-intensive development as well.
Enter DataOps. Why should we care about this concept?
Well, look around. Many modern data environments are fragile and error-prone precisely because they were built without DataOps-related considerations.
If we apply DevOps principles to data and pick a specific technology like a data warehouse, we are encouraged to undertake behaviors like:
- Automate the warehouse’s creation via an infrastructure-as-code (IaC) tool like Terraform.
- Automate the creation of database schemas, table definitions, and user accounts.
- Eliminate any manual operations involved in the updating or altering of data collections.
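As a rough illustration of the second practice, here is a minimal Python sketch that declares tables in code and applies them idempotently. It uses the standard library's sqlite3 module as a stand-in for a real warehouse, and the table names and columns are hypothetical; a real warehouse would need its own driver and SQL dialect.

```python
import sqlite3

# Hypothetical desired state: table names and column definitions are illustrative only.
TABLES = {
    "events": "event_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, occurred_at TEXT NOT NULL",
    "users": "user_id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE",
}


def apply_schema(conn: sqlite3.Connection, tables: dict[str, str]) -> list[str]:
    """Create any missing tables and return the sorted names of all tables present."""
    for name, columns in tables.items():
        # IF NOT EXISTS makes the statement safe to run repeatedly.
        conn.execute(f"CREATE TABLE IF NOT EXISTS {name} ({columns})")
    conn.commit()
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(apply_schema(connection, TABLES))
    print(apply_schema(connection, TABLES))  # the second run changes nothing
```

Because the script can be run any number of times with the same result, it can live in version control and run from a deployment pipeline instead of being typed by hand.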
How many of the above practices does your data org follow?
From working with a variety of data teams, what we see is that most check the box on only one, maybe two. As a result, many data teams ship data products slowly, and what they ship is error-prone.
As data professionals, we can do better. Let’s see how we can get on the path towards effective DataOps.
Taking the First Steps Towards Effective DataOps
If you are looking to improve your DataOps practices, we advise starting by taking note of operations that require you to run commands manually or in a GUI.
Perhaps when someone new joins the team, you have to run a CREATE USER command for them in one or more systems. It might not seem like a big deal to do this every month or so. But as these manual processes accumulate, you lose the ability to deploy new data stacks quickly or guarantee the state of an existing one.
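One way to take the manual CREATE USER step out of onboarding is to keep the list of expected users in code and compute the statements needed to reach it. The Python sketch below is only an illustration under stated assumptions: the function name is hypothetical, and the exact CREATE USER and DROP USER syntax differs between systems.

```python
import re

# Only allow simple identifiers, so names can be embedded in SQL text safely.
_VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def plan_user_changes(desired: set[str], existing: set[str]) -> list[str]:
    """Return the statements that move the existing users to the desired set."""
    for name in desired | existing:
        if not _VALID_NAME.fullmatch(name):
            raise ValueError(f"unsupported user name: {name!r}")
    to_create = sorted(desired - existing)
    to_drop = sorted(existing - desired)
    # Generic SQL for illustration; adapt the syntax to the target system.
    statements = [f'CREATE USER "{name}"' for name in to_create]
    statements += [f'DROP USER "{name}"' for name in to_drop]
    return statements


if __name__ == "__main__":
    for statement in plan_user_changes({"alice", "bob"}, {"bob", "carol"}):
        print(statement)
```

With this approach, adding a teammate becomes a one-line change to the desired set, reviewed like any other code change, and the same plan can be applied to every system that needs the account.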
Think of how amazing it would be if you could recreate your production environment — query engines, queues, dashboards, orchestrators, and even the data itself — in a matter of minutes as opposed to days or weeks.
You would be able to recover from outages more quickly and more reliably. It would be much easier to test and understand the impact of an update deployed to production.
From our experience, the upfront investment required to make this happen pays off in the long run.
What Effective DataOps Looks Like
When DevOps is done right, DevOps engineers allow software developers to focus on building applications and the application logic instead of getting bogged down in infrastructure concerns. This division of labor proves effective since it is not easy to maintain expertise in both areas.
In the same way, data engineers and data scientists can see a bump in productivity once freed from the concerns of their data infrastructure.
When starting a new project and architecting the solution, engineers should first spell out all the resources required to get it working so the DataOps team can begin configuring the automation that will create it.
We’re keeping an eye on projects like the Open Data Hub that are standardizing Kubernetes-based deployments of popular data technologies to help prevent individual companies from having to reinvent the DataOps wheel.
Next, for their part, the data engineers and scientists should use shell scripts or code to create data tables or transformation jobs, rather than a GUI. This removes the need to retrace one’s steps by hand, an error-prone process, when going from a dev environment to production.
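A common way to carry table changes from dev to production without retracing steps is an ordered list of migrations that each environment applies exactly once. The sketch below assumes sqlite3 as a stand-in database; the migration names, the orders table, and the schema_migrations bookkeeping table are all hypothetical.

```python
import sqlite3

# Hypothetical migrations, applied in order. Each one runs once per environment.
MIGRATIONS = [
    ("001_create_orders",
     "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"),
    ("002_add_currency",
     "ALTER TABLE orders ADD COLUMN currency TEXT NOT NULL DEFAULT 'USD'"),
]


def migrate(conn: sqlite3.Connection, migrations: list[tuple[str, str]]) -> list[str]:
    """Apply migrations not yet recorded in this database and return their names."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    newly_applied = []
    for name, statement in migrations:
        if name in applied:
            continue  # already applied in this environment
        conn.execute(statement)
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        newly_applied.append(name)
    conn.commit()
    return newly_applied


if __name__ == "__main__":
    dev = sqlite3.connect(":memory:")
    prod = sqlite3.connect(":memory:")
    print("dev:", migrate(dev, MIGRATIONS))
    print("prod:", migrate(prod, MIGRATIONS))
```

Running the same function against dev and then production means both environments converge on the same table definitions, regardless of which migrations each had already received.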
What makes DataOps especially tricky, though, is the final step of hydrating a data environment with its most important resource: the data itself. Data is arguably the largest thing in the digital universe, and replicating it across multiple environments can be burdensome and costly.
Luckily, this is where lakeFS shines.
lakeFS as the DataOps Solution for Data Lakes
lakeFS is an open source project that lets you create data repositories, which enable Git workflows over data lakes.
We designed lakeFS from the beginning to promote effective DataOps practices when it comes to managing data of any size. As a result, we prioritized things like:
- Committing to an open source model, allowing for easily automated deployment.
- Exposing comprehensive and robust APIs that let you automate repository and user creation.
- Building scalable Git-inspired operations like merge to make it easy to perform CI/CD data deployments and hydrate development environments with data.
- Exposing lakeFS hooks functionality that lets you link data quality tests to commit and merge operations.
Most critically, through its novel data versioning engine, lakeFS makes it as easy as running a one-line branch create command to populate a data environment with a full, isolated replica of your data (that also minimizes any copying of the underlying data objects).
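To make the one-line branch create concrete, here is a small Python sketch that builds and optionally runs the command with the lakectl CLI. It assumes lakectl is installed and configured against a lakeFS server; the repository and branch names are hypothetical, and the flags should be checked against the lakectl version in use.

```python
import subprocess


def branch_create_command(repo: str, new_branch: str, source_branch: str = "main") -> list[str]:
    """Build the lakectl argument list that creates new_branch from source_branch."""
    return [
        "lakectl", "branch", "create",
        f"lakefs://{repo}/{new_branch}",
        "--source", f"lakefs://{repo}/{source_branch}",
    ]


def create_branch(repo: str, new_branch: str, source_branch: str = "main") -> None:
    """Run the command; raises CalledProcessError if lakectl exits with an error."""
    subprocess.run(branch_create_command(repo, new_branch, source_branch), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without a server.
    print(" ".join(branch_create_command("example-repo", "dev-alice")))
```

Wrapping the command in a function lets an environment setup script create a fresh, isolated data branch for each developer or CI run.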
See the lakeFS documentation for more details on using lakeFS in your development environments.
This makes lakeFS an indispensable part of the data stack for an org that wants to maintain good DataOps principles for its 1) infrastructure, 2) application, and 3) data layers.
The best time to get started building a data environment people enjoy developing in is today!
The lakeFS project is an open source technology that provides a Git-like version control interface for data lakes, with seamless integration with popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.
This post was originally published on the lakeFS blog by Paul Singman.