Observability for Data Engineering

Evgeny Shulman
Databand, an IBM Company
8 min read · Jan 7, 2020


Observability is a fast-growing concept in the Ops community that caught fire in recent years, led by major monitoring/logging companies and thought leaders like Datadog, Splunk, New Relic, and Sumo Logic. It’s described as Monitoring 2.0, but it’s really much more than that. Observability allows engineers to understand whether a system is working the way it is supposed to, based on a deep understanding of its internal state and the context in which it operates.

Sumo Logic describes Observability as the following:

It is the capability of monitoring and analyzing event logs, along with KPIs and other data, that yields actionable insights. An observability platform aggregates data in the three main formats (logs, metrics, and traces), processes it into events and KPI measurements, and uses that data to drive actionable insights into system security and performance.

Observability goes deeper than Monitoring by adding more context to system metrics, providing a deeper view of system operations, and indicating whether engineers need to step in for a fix. In other words, while Monitoring tells you that some microservice is consuming a given amount of resources, Observability tells you that its current state is associated with critical failures and you need to intervene.

But What About Data?

Until now, Observability has lived in the realm of DevOps or DevSecOps, focused on applications, microservices, network, and infrastructure health. But the teams responsible for managing data pipelines (Data Engineers and DataOps) have largely been forced to figure things out on their own. This might work for organizations that aren’t heavily invested in their data capabilities, but for companies with serious data infrastructure, the lack of specialized management tools leads to major inefficiencies and productivity gaps.

Why Don’t Existing Tools Cut It?

The typical course of action for Data Engineers today is to take a standard monitoring tool that was originally built for applications or infrastructure and try using it for their data processes. Using these general-purpose tools, Data Engineering teams can gain insight into high-level job (or DAG) statuses and summary database performance, but they will lack visibility into the right level of information they need to manage their pipelines. This gap causes many teams to spend a lot of time tracking down issues or to work in a state of constant paranoia.

The reason standard tools don’t cut it is that data pipelines behave very differently from software applications and infrastructure.

Zabbix and Airflow, good examples of each category: how do we get all the right data in one place?

The Differences

Here are some of the main differences between data pipelines, especially batch processes, and other kinds of infrastructure.

Periodicity

Most monitoring tools were built to oversee systems that are supposed to run consistently, 24/7. Any downtime is a bad thing: it means visitors can’t access a website or users can’t access an application. Batch data processes, on the other hand, run for discrete periods of time by design. As a result, they require a different kind of monitoring, because most of the questions you’ll ask do not have a simple binary answer like “on/off”, “up/down”, “green/red”. There are more dimensions to understand: scheduled start times, actual start times, end times, and acceptable ranges.

Unlike other systems, it’s also totally normal for data pipelines to regularly fail before they succeed. A DAG might fail 6 or 7 times before it finally kicks in and runs successfully. This could be due to some type of desirable job throttling, like database connection pooling.

The point being: typical behavior for DAGs is very atypical of other infra services. With standard alerting, data teams get flooded with meaningless alerts and hundreds of unread notification emails, unable to sift through the noise.
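To make “more dimensions than up/down” concrete, here is a minimal Python sketch. The thresholds and function name are hypothetical and would need tuning per pipeline; the point is to judge a batch run by its start delay and retry count rather than a single binary status.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds -- tune per pipeline instead of reusing app-style up/down checks.
ACCEPTABLE_START_DELAY = timedelta(minutes=30)
MAX_EXPECTED_RETRIES = 3  # a few failed attempts before success is normal for batch jobs


def check_run_health(scheduled_start: datetime, actual_start: datetime,
                     retry_count: int, succeeded: bool) -> list:
    """Return alert messages for one batch run instead of a single binary status."""
    alerts = []
    start_delay = actual_start - scheduled_start
    if start_delay > ACCEPTABLE_START_DELAY:
        alerts.append(f"run started {start_delay} after its scheduled time")
    if retry_count > MAX_EXPECTED_RETRIES:
        alerts.append(f"run needed {retry_count} retries (expected <= {MAX_EXPECTED_RETRIES})")
    if not succeeded:
        alerts.append("run did not finish successfully")
    return alerts
```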

Long-Running Processes

Data pipelines are often long-running processes that take many hours to complete. Our own customers report average runtimes of around 6 hours. What’s challenging about monitoring long-running processes is that errors can surface late in the job, and you need to wait and watch for a long time to know whether it succeeded or failed. This leads to a higher cost of failure because of the added time of restarting jobs from the beginning if an issue comes up downstream. Teams need ways of gathering early warning signs and smarter methods of analyzing histories to anticipate failures.

Complex Dependencies

DAGs are complex pretty much by definition. They have internal dependencies in the form of their sequence of tasks, as well as external dependencies like when data becomes available and the outputs of preceding jobs. This web of interdependencies creates a unique set of monitoring requirements where you need to understand the broader context around a process so that you know how issues cascade or trace back.

Data Flows

Of course, we can’t forget that data pipelines run on data. This is another complex dimension that needs to be monitored and understood, including schemas, distributions, and data completeness. For example, most batch jobs operate on some window of data, e.g., the last 60 days. You need to know if there was a problem anywhere in that window, like data not being generated at all.
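As a minimal sketch of that kind of window check, assuming you can list the dates for which data actually landed (the `missing_days` helper below is hypothetical), you could flag silent gaps before the job quietly runs over an incomplete window:

```python
from datetime import date, timedelta
from typing import Optional


def missing_days(days_with_data: set, window_days: int = 60,
                 end: Optional[date] = None) -> list:
    """Return dates in the processing window that have no data at all, so a silent
    gap (e.g. a source that stopped generating data) is caught before the batch
    job consumes an incomplete window."""
    end = end or date.today()
    window = [end - timedelta(days=offset) for offset in range(window_days)]
    return sorted(day for day in window if day not in days_with_data)
```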

While there are tons of solutions for tracking data, what’s really different about teams today is that there is much more specialization and open source usage, and it’s hard to find frameworks that are easy to integrate with the modern stack of tools (e.g., Airflow, Spark, Kubernetes, Snowflake), are generally applicable, and provide the right level of extensibility.

Cost Attribution

Last but not least, cost attribution is harder when it comes to pipeline monitoring because teams need to look at processes from many different angles to understand their ROI. Examples include looking at cost by the following dimensions (a small aggregation sketch follows the list):

  • Environment (the pipelines on my production Spark cluster cost X)
  • Data source provider (the pipelines reading data from Salesforce cost Y)
  • Data consumer (the pipelines delivering data to the Data Science team cost Z)
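Here is a minimal sketch of that kind of attribution, assuming each run is tagged with these dimensions plus an estimated cost (the run records and the `cost_by` helper are hypothetical):

```python
from collections import defaultdict

# Hypothetical run records: each pipeline run is tagged with the attribution
# dimensions above plus an estimated cost.
runs = [
    {"pipeline": "ingest_salesforce", "environment": "prod-spark",
     "source": "salesforce", "consumer": "data-science", "cost_usd": 42.0},
    {"pipeline": "sessionize_events", "environment": "prod-spark",
     "source": "event-stream", "consumer": "analytics", "cost_usd": 17.5},
]


def cost_by(dimension):
    """Aggregate run cost along one attribution dimension (environment, source, consumer)."""
    totals = defaultdict(float)
    for run in runs:
        totals[run[dimension]] += run["cost_usd"]
    return dict(totals)


print(cost_by("environment"))  # {'prod-spark': 59.5}
print(cost_by("consumer"))     # {'data-science': 42.0, 'analytics': 17.5}
```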

Bringing these factors all together, most of the differences between monitoring data pipelines and monitoring other kinds of infrastructure boil down to the fact that pipelines have many more dimensions that you need to watch and require very granular reporting on statuses. These issues are compounded for teams leveraging more complex systems like Kubernetes and Spark. Without observability, Data Engineering teams run blind and will spend the vast majority of their time trying to track down issues and debugging things.

What We Suggest

In our previous lives managing Data Engineering teams, we always struggled with maintaining good visibility into projects and infrastructure. We suggest giving more thought to Observability for your own data stack, and considering the factors that make it unique. This will help you build a more robust data operation by making it easier to align your team on statuses, identify issues faster, and debug more quickly. Here are some best practices we recommend to begin:

(1) Make incremental investments in data/metadata collection from your DAGs. Start with tracking basic metrics about your pipeline inputs and outputs so you can figure out right away if there are significant data changes, and whether those changes are internal to your pipeline or caused by external events. Examples include reporting the number and size of input/output files.
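As a rough sketch of that first step, assuming file-based inputs and outputs (the paths and helper name below are hypothetical), a task could log file counts and sizes before and after it runs:

```python
import logging
from pathlib import Path

log = logging.getLogger("pipeline_metrics")


def report_dataset_metrics(name, directory):
    """Log the file count and total size for a pipeline input or output directory.
    A sudden drop in either is an early signal that something changed upstream or inside the DAG."""
    files = [path for path in Path(directory).rglob("*") if path.is_file()]
    metrics = {
        "dataset": name,
        "file_count": len(files),
        "total_bytes": sum(path.stat().st_size for path in files),
    }
    log.info("dataset_metrics %s", metrics)
    return metrics


# Hypothetical paths -- call at the start and end of a task run:
# report_dataset_metrics("orders_input", "/data/raw/orders/2020-01-07")
# report_dataset_metrics("orders_output", "/data/processed/orders/2020-01-07")
```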

The next step would be extracting info about pipeline internals: the intermediate results, meaning the inputs and outputs passed between tasks in a pipeline. Having internal visibility will help you drill into exactly where in the DAG issues or changes are happening.

The next most significant addition to data monitoring would be layering in tracking of schemas, distributions, and other statistical metrics of your input and output files.
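Here is a minimal sketch of that kind of profiling, assuming intermediate results can be loaded as pandas DataFrames (the `profile_dataframe` helper is hypothetical). Comparing these profiles across runs is what surfaces schema drift, growing null rates, or shifting distributions:

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> dict:
    """Capture the schema and simple distribution stats for an intermediate result."""
    return {
        "schema": {column: str(dtype) for column, dtype in df.dtypes.items()},
        "row_count": len(df),
        "null_fraction": df.isna().mean().round(4).to_dict(),
        "numeric_summary": df.describe().to_dict(),
    }
```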

Making these improvements incrementally will help you gain more and more awareness of your pipelines without undertaking a massive project, and enable you to experiment with the right tools and approach for each layer of visibility.

(2) Define pipeline regression tests and track your test metrics. Just like Software Engineers test application code before it goes into production, Data Engineers must test pipeline code.

For teams that have a testing or CI/CD process for their pipelines, we see two common issues. First, make sure that the data used in your regression tests is up to date and represents real production data; we recommend using data from the latest successful production pipeline run. Second, have some basic automation that runs new DAG code on that data and alerts on issues before pushing into prod.
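Here is a hedged sketch of what such a regression test could look like with pytest, assuming a sample exported from the latest successful production run and a transform that lives in your pipeline code (the paths, module, and column names below are hypothetical):

```python
import pandas as pd
import pytest

from my_pipeline.transforms import clean_orders  # hypothetical module and transform under test

# Hypothetical path to a sample exported from the latest successful production run.
LATEST_PROD_SAMPLE = "/data/regression/latest_successful_run/orders.parquet"


@pytest.fixture
def prod_sample() -> pd.DataFrame:
    return pd.read_parquet(LATEST_PROD_SAMPLE)


def test_clean_orders_regression(prod_sample):
    result = clean_orders(prod_sample)
    # Alert on schema or volume regressions before the new DAG code reaches prod.
    assert {"order_id", "amount", "created_at"}.issubset(result.columns)
    assert len(result) >= 0.9 * len(prod_sample)  # no unexpected row loss
    assert result["amount"].notna().all()
```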

Automating tests for your pipelines will help you understand more nuances about your data flows and pick up on issues before your data consumers struggle with them downstream. You can catch bugs in your pipelines, identify changes in data quality before they surprise data analysts and scientists, and make decisions about updating/retraining an ML model.

(3) Define & monitor standard KPIs that you can align all roles on — data engineers, analysts, and scientists. For a team working on machine learning, this might be data engineers having exposure to model performance indicators that are built by data scientists (like R2), and data scientists having metrics of the data ingestion process managed by data engineers (like a number of filtered events). Creating alignment across the team on these shared metrics is powerful because each side will understand issues without so much back and forth.

Here’s an example — let’s say a data engineer adds an extra data source. The source contains a lot of noise, so the engineer adds filters to make sure only useful data is getting through. A data scientist starts using the data and trains a model. Knowing how the data was prepared empowers the data scientist to manage their model. They can anticipate problems that would happen if the model is productized in environments without similarly filtered data, and can advise on how the data should be treated when their model needs to be retrained in the future.
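As a minimal sketch of what such shared KPIs could look like in practice (the in-memory metric store, run IDs, and metric names below are hypothetical placeholders for whatever metrics system you use), both roles report to the same per-run log:

```python
# Hypothetical shared metrics log: engineers and scientists report KPIs for the
# same pipeline run, keyed by run_id, so each side sees the other's indicators
# without back and forth.
shared_metrics = {}


def report_metric(run_id, name, value):
    shared_metrics.setdefault(run_id, {})[name] = value


# Data engineering side: ingestion KPIs for this run.
report_metric("2020-01-07_run", "filtered_events", 12408)
report_metric("2020-01-07_run", "rows_ingested", 1250000)

# Data science side: model KPIs computed on the same run.
report_metric("2020-01-07_run", "r2_score", 0.83)

print(shared_metrics["2020-01-07_run"])
```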

Wrapping It Up

Beyond getting started, Observability will become more and more important as your data engineering team scales, and it’s critical to use the right tool for the job.

Don’t assume the tools you use to run your processes can monitor themselves. The kinds of tools most teams use to run their data stack have significant gaps when it comes to Observability; e.g., relying on Airflow to monitor Airflow can easily snowball into excessive complexity and instability. Also, don’t assume you can do it with a standard time-series monitoring tool built for watching scheduled jobs, because of how different data processes are.

To really gain Observability, you need execution metrics (CPU, time to run, I/O, data read and write), pipeline metrics (how many tasks are in a pipeline, the SLA of each task), data metrics, and ML metrics (R2, MAE, RMSE) in one place, plus a dedicated system that can make sure your logs are accurate and your statuses are in sync.

If you would like to share your perspectives on data engineering observability, we’d love to hear from you! Check out our website or message the team at contact@databand.ai.
