From Software Observability to Data Observability

Seckin Dinc
Apr 26, 2023 · 8 min read

The data department’s core responsibility is to serve accurate and reliable data to its organization in a timely fashion. With the increasing volume, variety, and speed of data sources, fulfilling this responsibility is getting harder every day.

As the organization scales, the number of product and operations teams grows exponentially. While this is necessary for the growth of the organization, it brings a problem of data ownership.

In the absence of properly designed data ownership and data quality best practices, data engineering teams find themselves building infrastructure to enforce enterprise data quality measures without a proper understanding of the business, product, and operational changes.

Creating data quality measures such as data validation, testing, and alarming mechanisms only prevents known-unknown data quality issues. As the organization scales across various teams, it becomes nearly impossible to track all the changes and prevent them. In this case, data teams need continuous monitoring and observability of their data pipelines to understand and ensure the quality of the data flowing through them.

In this article, we will dive into data observability as the core measure that prevents our data engineering teams from turning into 24/7 firefighters.

We Have the SLAs, What is Next?

Consider SLAs as new year’s resolutions. You define various resolutions for the next year. Some of them are directly under your control, and others are aligned with your partner, stakeholders, etc. After you set these resolutions, you need to establish an environment where you can continuously monitor the related processes, raise alarms when something is not going as expected, find ways to improve your approach, and so on.

Setting up SLAs is just the beginning of a long-term process. Under the guidance of the SLAs, you need to get full control of your data pipelines. To do that, you need data reliability solutions that, following data observability best practices, continuously monitor the pipelines, check the lineage, test the data at rest, apply machine learning and time series techniques to separate normal behavior from anomalies, generate alarms on those anomalies, and automate as much as possible.
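To make the anomaly detection part concrete, here is a minimal sketch in Python. It assumes we already collect a daily row-count metric for a table (the values and threshold below are made up for illustration) and flags days whose count deviates strongly from a trailing window using a simple z-score; commercial tools use far more sophisticated models, but the idea is the same:

```python
import statistics

def detect_anomalies(daily_row_counts, window=14, threshold=3.0):
    """Flag days whose row count deviates from the trailing window's mean
    by more than `threshold` standard deviations (a simple z-score check)."""
    anomalies = []
    for i in range(window, len(daily_row_counts)):
        history = daily_row_counts[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            continue  # flat history, skip to avoid division by zero
        z_score = abs(daily_row_counts[i] - mean) / stdev
        if z_score > threshold:
            anomalies.append((i, daily_row_counts[i], round(z_score, 2)))
    return anomalies

# Example: the sudden drop on the last day should be flagged.
counts = [1000, 1020, 980, 1010, 990, 1005, 1015, 995, 1000, 1010,
          1020, 985, 1005, 1000, 1010, 120]
print(detect_anomalies(counts))
```

The same pattern extends to freshness, null rates, or schema-change counts; the point is that the baseline is learned from history rather than hard-coded, which is what helps catch unknown-unknowns.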

We need a solution to monitor everything about our data to inform us about the unknown-unknown problems in our data stack before we are bombarded by our stakeholders or clients. We need data observability! But what is observability?

What is Observability?

Observability is a term that originated in the field of engineering and computer science to describe the ability to understand the internal state of a system by examining its outputs, without requiring knowledge of the system’s inner workings.

In software engineering, observability refers to the ability to understand how a system is performing based on its external signals, such as logs, metrics, traces, and events. Observability enables developers to monitor and debug complex distributed systems by providing them with actionable insights into the system’s behavior and performance.

The origins of observability in software engineering can be traced back to the emergence of distributed systems and cloud computing in the early 2000s. In recent years there have been enormous advances in both the open-source and commercial spaces. Below you can find the best-known observability solutions for software engineering teams;

  1. Amazon CloudWatch: A monitoring service provided by Amazon Web Services (AWS) that collects and tracks metrics, logs, and events from various AWS resources and applications.
  2. Google Cloud Monitoring: A monitoring service provided by Google Cloud Platform (GCP) that provides visibility into the performance and health of GCP resources and applications.
  3. Microsoft Azure Monitor: A monitoring service provided by Microsoft Azure that collects and analyzes data from various Azure resources and applications, providing visibility into their performance and health.
  4. Datadog: A cloud-based monitoring and analytics platform that provides real-time visibility into the performance and health of cloud infrastructure, applications, and logs.
  5. New Relic: A cloud-based observability platform that provides real-time visibility into the performance and health of cloud infrastructure, applications, and logs.
  6. AppDynamics: A monitoring and analytics platform that provides real-time visibility into the performance and health of cloud applications and infrastructure.
  7. Splunk: A cloud-based platform that collects and analyzes machine-generated data, providing insights into the performance, security, and compliance of cloud infrastructure and applications.

The Pillars of Observability

Software engineering observability is built on three pillars: metrics, traces, and logs. Let’s deep dive into them, with a small code sketch after the list;

  • Metrics are quantifiable measurements of a system or application’s performance or behavior. They are typically numerical values collected over time, and they can provide insights into how a system is performing, how it is being used, and where potential issues or bottlenecks may exist.
  • Traces are records of the activity and interactions that occur as a request or transaction flows through a system. A trace typically consists of a sequence of events or spans, each of which represents a specific operation or step within the transaction.
  • Logs are chronological records of events or messages that are generated by a software system. Logs are typically used to capture information about the behavior of a system, such as errors, warnings, and other events that may be of interest to developers or operations teams.
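As a toy illustration of the three pillars, the sketch below emits all of them for a single request using only the Python standard library; in a real system you would use an instrumentation library such as OpenTelemetry, and the service name, metric names, and latencies here are made up:

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("checkout-service")  # hypothetical service name

metrics = {"requests_total": 0, "request_latency_ms": []}  # metrics: numbers aggregated over time
spans = []                                                  # traces: one span per step of a request

def handle_request(order_id):
    trace_id = str(uuid.uuid4())                 # ties logs and spans to one request
    start = time.perf_counter()
    logger.info("order received order_id=%s trace_id=%s", order_id, trace_id)  # log: a discrete event

    step_start = time.perf_counter()
    time.sleep(0.05)                             # stand-in for a downstream call, e.g. payment
    spans.append({"trace_id": trace_id, "name": "charge_payment",
                  "duration_ms": (time.perf_counter() - step_start) * 1000})

    latency_ms = (time.perf_counter() - start) * 1000
    metrics["requests_total"] += 1
    metrics["request_latency_ms"].append(latency_ms)
    logger.info("order completed order_id=%s latency_ms=%.1f", order_id, latency_ms)

handle_request("order-42")
print(metrics, spans)
```

A log line answers “what happened at this moment?”, a metric answers “how is the system trending?”, and a span ties an individual step back to the request it belonged to.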

If observability is that important for the software engineering teams, how is the situation for the data domain?

What is Data Observability?

Data observability is the practice of ensuring that all data flowing through a system is complete, accurate, and reliable. It involves monitoring data in real-time to detect and address data quality issues as they arise.

Data observability is an important part of data reliability, and it helps organizations to manage their data effectively by identifying issues, anomalies, and errors in real time. This process involves collecting, analyzing, and visualizing data to gain insights into data quality, data lineage, and data usage.

Here are the most critical reasons why every data organization needs to establish data observability;

  • The data changes quite often on the data producers' side. Most of the time the data engineering team is the latest team to be informed about these changes.
  • Due to a lack of data ownership and data quality best practices on the data producers’ side, data pipelines are fed garbage data that breaks many things downstream.
  • End-to-end data pipeline testing is about preventing known-unknown data quality issues. The unknown-unknown data quality issues may arise at any time and they are harder to detect as we don’t know their baselines.
  • When data downtime happens, it is quite time-consuming to trace back to the root cause due to a lack of data lineage.

The Pillars of Data Observability

We can group the pillars of data observability into three groups: metrics, metadata, and lineage. Let’s deep dive into them, with a small code sketch after the list;

  • Metrics can include a wide range of measures; e.g. data quality metrics, latency metrics, throughput metrics, error metrics, compliance metrics, etc
  • Metadata refers to information that describes the characteristics and properties of data, including its structure, format, schema, freshness, etc.
  • Lineage refers to the ability to trace the origin and movement of data throughout a system or pipeline. Lineage information typically includes the source of the data, any transformations or processing that occurs, and the destination where the data ends up. This information can be used to troubleshoot problems and diagnose issues in the pipeline.
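Here is a minimal sketch of what these three pillars can look like for a single table, using pandas and a made-up orders dataset; the source, transformation, and destination names in the lineage record are hypothetical:

```python
from datetime import datetime, timezone
import pandas as pd

# Hypothetical orders table loaded from the warehouse.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "amount": [10.0, None, 25.5, 25.5],
    "loaded_at": pd.to_datetime(["2023-04-25 10:00", "2023-04-25 10:00",
                                 "2023-04-26 09:00", "2023-04-26 09:00"], utc=True),
})

# Pillar 1 - metrics: quality and volume measures collected on every run.
metrics = {
    "row_count": len(orders),
    "null_rate_amount": orders["amount"].isna().mean(),
    "duplicate_rate": orders.duplicated(subset=["order_id"]).mean(),
}

# Pillar 2 - metadata: structure and freshness of the table itself.
metadata = {
    "columns": {col: str(dtype) for col, dtype in orders.dtypes.items()},
    "freshness_hours": (datetime.now(timezone.utc) - orders["loaded_at"].max()).total_seconds() / 3600,
}

# Pillar 3 - lineage: where the data came from and which job produced it.
lineage = {
    "source": "raw.orders_export",           # hypothetical upstream table
    "transformation": "stg_orders",           # hypothetical transformation job
    "destination": "analytics.orders",
}

print(metrics, metadata, lineage, sep="\n")
```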

What are the Key Capabilities of Data Observability Tools?

Data observability tools should provide a range of features that enable data teams to monitor, diagnose, and troubleshoot issues in their data pipelines. Let’s deep dive into the details, with a small alerting sketch after the list;

  1. Data quality monitoring: These features enable data teams to monitor the quality of their data in real time, identifying issues such as data anomalies, missing values, and duplicates.
  2. Alerting and notification: These features enable data teams to receive alerts and notifications when issues are detected, allowing them to respond quickly and take corrective action.
  3. Lineage tracking: These features enable data teams to trace the origin and movement of data throughout their systems, providing insights into the quality, reliability, and accuracy of the data.
  4. Data visualization and analytics: These features enable data teams to visualize and analyze data in real time, providing insights into trends, patterns, and anomalies.
  5. Collaboration and sharing: These features enable data teams to collaborate and share information with other team members, improving communication and reducing the time required to resolve issues.
  6. Data compliance and governance: These features enable data teams to ensure compliance with regulations and data governance policies, such as data privacy and security regulations.
  7. Integration with data tools: These features enable data teams to integrate data observability tools with other data tools and systems, such as data warehouses, data lakes, and ETL tools, to provide a comprehensive view of their data pipelines.
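To illustrate the first two capabilities, here is a small sketch that compares observed metrics (such as those computed in the previous example) against hypothetical thresholds and posts an alert to a Slack incoming webhook; the webhook URL and the threshold values are placeholders:

```python
import json
import urllib.request

# Hypothetical thresholds agreed with stakeholders, e.g. as part of an SLA.
THRESHOLDS = {"null_rate_amount": 0.05, "duplicate_rate": 0.01, "freshness_hours": 24}
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def check_and_alert(metrics, dry_run=False):
    """Compare observed metrics against thresholds and send one alert per breach."""
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            message = f"{name}={value:.3f} exceeds limit {limit}"
            if dry_run:
                print("ALERT:", message)  # useful when testing the checks locally
            else:
                payload = json.dumps({"text": message}).encode("utf-8")
                req = urllib.request.Request(SLACK_WEBHOOK_URL, data=payload,
                                             headers={"Content-Type": "application/json"})
                urllib.request.urlopen(req)  # fire the notification

# dry_run=True prints the alert instead of calling the webhook.
check_and_alert({"null_rate_amount": 0.25, "duplicate_rate": 0.0, "freshness_hours": 3}, dry_run=True)
```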

Overall, data observability tools provide a range of features that enable data teams to ensure that their data is accurate, reliable, and compliant with regulations and best practices.

Data Observability Market Status Quo

Data observability is a rather niche area compared to other fields of the data domain. Because of this, the open-source ecosystem is still in an early development phase.

In contrast, the commercial market is quite competitive these days. The broader adoption of data and the major impact of data downtimes have pushed companies to invest in this field.

Below I will list the competitors in the market and their positioning according to G2.

Conclusion

Data observability is at the core of data reliability and data quality, enabling trusted data pipelines that generate trust and impact in the organization. Without data observability, we can only test and validate our pipelines against known use cases. Due to the changing nature of data, these cover only a small portion of the problem. In that sense, data teams should invest in data observability just as software engineering teams have been doing over the last 20 years.

Thanks a lot for reading 🙏

If you are interested in data quality topics, you can check my other articles;

If you want to get in touch, you can find me on Linkedin and Mentoring Club!
