The need for data observability

Antonio Aguilar
Published in Globant
Jul 12, 2022 · 7 min read

World data trends

According to Statista, the total amount of data created, captured, copied, and consumed globally reached 64.2 zettabytes in 2020 and is forecast to keep increasing rapidly: over the five years up to 2025, global data creation is projected to grow to more than 180 zettabytes.

In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, driven by increased demand during the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.

How Much Data Is on the Internet?

  • The amount of data in the world was estimated to be 44 zettabytes at the dawn of 2020.
  • By 2025, the amount of data generated each day is expected to reach 463 exabytes globally.
  • Google, Facebook, Microsoft, and Amazon store at least 1,200 petabytes of information.
  • The world spends almost $1 million per minute on commodities on the Internet.
  • Electronic Arts processes roughly 50 terabytes of data every day.
  • By 2025, there will be 75 billion Internet-of-Things (IoT) devices in the world.
  • By 2030, nine out of every ten people aged six and above will be digitally active.

How much data is generated every day?

The entire digital universe was expected to reach 44 zettabytes by 2020.

Here are some key daily statistics highlighted in the infographic:

  • 500 million tweets are sent
  • 294 billion emails are sent
  • 4 petabytes of data are created on Facebook
  • 4 terabytes of data are created from each connected car
  • 65 billion messages are sent on WhatsApp
  • 5 billion searches are made
  • By 2025, it’s estimated that 463 exabytes of data will be created each day globally, that’s the equivalent of 212,765,957 DVDs per day!

A very graphic example of this trend is the “Internet in a Minute” infographic by Lori Lewis: this amount of data moves every single minute, which makes it more and more complex to assess data quality.

Seagate currently accounts for around 4 ZB, with the total estimated to reach 175 ZB by 2025.

Given the huge volume of data, expected to exceed 180 ZB by 2025, data growth is drawing more concern and, at the same time, providing more value to organizations, so solutions for ensuring data quality are receiving more attention.

The importance of data quality

Data quality is a function of measuring the reliability, completeness, and accuracy of data as a way to understand whether your data fits the needs of your business.

The rise of the cloud, distributed data architectures and teams, and the move toward data production have placed the onus on data leaders to help their businesses move toward more trusted analytics (beyond data architects, new roles such as data steward, data owner, and data strategist are blooming). Achieving reliable data is a long race that involves many stages of the data pipeline. Furthermore, committing to improving data quality is much more than a technical challenge; it is an organizational and cultural commitment.

Data reliability is the ability to deliver data with high availability and health throughout the entire data lifecycle; it is typically reported as part of a data quality report produced during data pipeline execution.
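
To make this concrete, here is a minimal sketch (the function and field names are illustrative assumptions, not any specific tool's API) of a pipeline step emitting a small data quality report alongside its output:

```python
import json
from datetime import datetime, timezone

def build_quality_report(rows, required_fields):
    """Summarize a batch of records into a simple data quality report.

    `rows` is a list of dicts produced by a pipeline step; `required_fields`
    are the columns that must be present and non-null.
    """
    total = len(rows)
    missing = {
        field: sum(1 for r in rows if r.get(field) in (None, ""))
        for field in required_fields
    }
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "row_count": total,
        "missing_values": missing,
        # Completeness: share of required cells that are actually populated.
        "completeness": 1 - sum(missing.values()) / (total * len(required_fields)) if total else 0.0,
    }

if __name__ == "__main__":
    batch = [
        {"order_id": 1, "customer": "a@example.com", "amount": 10.5},
        {"order_id": 2, "customer": None, "amount": 7.0},
    ]
    print(json.dumps(build_quality_report(batch, ["order_id", "customer", "amount"]), indent=2))
```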

If making key strategic decisions based on inaccurate data or wasting valuable time finding and diagnosing issues with data sounds commonplace, then companies suffer from data downtime.

The root of data downtime? Unreliable data, and lots of it.

Data downtime refers to periods of time when your data is partial, erroneous, missing, or otherwise inaccurate. It is highly costly for any organization that considers itself data-driven, affecting almost every team; yet it is typically addressed on an ad hoc basis and in a reactive manner.

Data Platform

In addition to monitoring and alerting for data issues at all stages of the data pipeline, delivering reliable data requires a smart data platform: a combination of technologies that enables you to manage data holistically, from ingestion to analytics.

While building a data platform, it’s important to account for six foundational, interconnected layers.

Data quality is the degree to which data is accurate, complete, timely, and consistent with your business’s requirements.

Data governance, in very basic terms, is a framework to proactively manage your data in order to help your organization achieve its goals and business objectives by improving the quality of your data.

End-to-end data observability is crucial for ensuring data quality.

Effective observability tooling will connect to your existing data stack, providing end-to-end lineage that allows you to surface downstream dependencies and automatically monitor your data at rest without extracting data from your data store and risking your security or compliance.
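
As a hedged sketch of the “monitor data at rest” idea, the job below runs only lightweight aggregate queries and keeps the resulting metadata, never pulling raw rows out of the store. SQLite stands in for your warehouse, and the table and column names are assumptions:

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stands in for the data warehouse; only aggregates leave the store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 10.5, '2022-07-11T09:00:00+00:00')")
conn.execute("INSERT INTO orders VALUES (2, NULL, '2022-07-12T09:00:00+00:00')")

# Aggregate queries only: row volume, null counts, and the latest update time.
row_count, null_amounts, last_update = conn.execute(
    """
    SELECT COUNT(*),
           SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END),
           MAX(updated_at)
    FROM orders
    """
).fetchone()

# Freshness: hours since the most recent update.
age_hours = (
    datetime.now(timezone.utc) - datetime.fromisoformat(last_update)
).total_seconds() / 3600

print({"row_count": row_count, "null_amounts": null_amounts, "age_hours": round(age_hours, 1)})
```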

Having observability makes audits, breach investigations, and other possible data disasters much easier to understand and resolve.

The need for data observability

Data observability is an organization’s ability to fully understand the health of the data in their systems, eliminating data downtime.

The challenge outlined above is the exact problem that data observability aims to fix.

Data observability gives you transparency and control over the health of your data pipeline, such that when an issue does occur, you can quickly understand:

  1. Where is the problem?
  2. Who needs to resolve it?

Knowing this information makes it possible to find and resolve issues far quicker and minimize data downtime.

Observability is not only monitoring: monitoring covers the “known unknowns”, whereas observability covers the “unknown unknowns”.

To identify and eliminate data downtime, teams must leverage the five pillars of data observability and embrace automated checks to monitor pipeline performance.

Observability is an adequate option when combined with best practices for data logging and monitoring, for data quality management and analytics.

Data observability makes it much easier to find and fix the root cause of problems.

Given a potentially high risk of problems, a data observability solution can help to:

  • Detect and stop incident propagation
  • Find and fix the root cause of problems

Data observability provides an organization with the ability to measure the health and usage of the data within its systems, as well as health indicators of the overall system.

By using automated logging and tracing information that allows an observer to interpret the health of datasets and pipelines, data observability enables data engineers and data teams to identify, and be alerted about, data quality issues across the overall system. This makes it much faster and easier to resolve data-induced issues at their root, rather than just patching issues as they arise.
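
A minimal, hypothetical alerting check in that spirit (the thresholds and the send_alert stand-in are assumptions, not a specific product's interface):

```python
def send_alert(message: str) -> None:
    # Stand-in for a paging, Slack, or email integration.
    print(f"[ALERT] {message}")

def check_dataset_health(metrics: dict, max_age_hours: float = 24, min_rows: int = 1000) -> None:
    """Compare logged dataset metrics against simple thresholds and alert on breaches."""
    if metrics["age_hours"] > max_age_hours:
        send_alert(f"{metrics['table']} is stale: {metrics['age_hours']:.1f}h since last update")
    if metrics["row_count"] < min_rows:
        send_alert(f"{metrics['table']} volume anomaly: only {metrics['row_count']} rows")

# Example: metrics produced by an upstream logging step.
check_dataset_health({"table": "analytics.daily_orders", "age_hours": 30.2, "row_count": 480})
```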

How to achieve data observability

There are several indicators to consider when pursuing data observability and data quality; these indicators can feed the Key Performance Indicator formulas used to track the Objectives and Key Results on the data side. A minimal sketch of how a few of them can be measured follows the indicator list below.

Data quality indicators:

  • Accuracy
  • Completeness
  • Consistency
  • Conformity
  • Integrity
  • Timeliness
  • Uniqueness
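
As a hedged illustration of turning a few of these indicators into numbers (the field names and the 24-hour freshness threshold are assumptions, not a standard), consider:

```python
from datetime import datetime, timedelta, timezone

# Toy dataset: records as they might land after ingestion.
records = [
    {"id": 1, "email": "a@example.com", "updated_at": datetime.now(timezone.utc)},
    {"id": 2, "email": None, "updated_at": datetime.now(timezone.utc) - timedelta(days=2)},
    {"id": 2, "email": "b@example.com", "updated_at": datetime.now(timezone.utc)},
]

total = len(records)

# Completeness: share of records with a non-null email.
completeness = sum(1 for r in records if r["email"]) / total

# Uniqueness: share of distinct ids among all records.
uniqueness = len({r["id"] for r in records}) / total

# Timeliness: share of records updated within the last 24 hours (assumed SLA).
fresh_cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
timeliness = sum(1 for r in records if r["updated_at"] >= fresh_cutoff) / total

print(f"completeness={completeness:.2f} uniqueness={uniqueness:.2f} timeliness={timeliness:.2f}")
```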

How do you ensure these quality indicators are all present within your datasets and systems?

By taking principles similar to software application observability and reliability and applying them to data quality, these issues can be identified, resolved, and proactively prevented.

To generate this type of end-to-end data observability, it’s necessary to log and trace the following information within your data pipelines and datasets (a minimal sketch of such a record follows the list):

  • Application
  • User
  • Lineage
  • Distribution
  • Time-based metrics
    Frequency
    Freshness
    Time frame
  • Completeness
    Missing values
    Volume
  • Schema
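
A minimal, hypothetical shape for such a log/trace record, emitted at the end of a pipeline step; every field name here is an assumption, since real tools define their own schemas:

```python
import json
from datetime import datetime, timezone

# Hypothetical observability record emitted at the end of a pipeline step.
pipeline_trace = {
    "application": "orders_etl",                 # which job produced the data
    "user": "svc-airflow",                       # who or what triggered the run
    "lineage": {
        "inputs": ["raw.orders", "raw.customers"],
        "output": "analytics.daily_orders",
    },
    "distribution": {"amount_min": 0.0, "amount_max": 950.0, "amount_null_pct": 0.4},
    "time_metrics": {
        "frequency": "daily",
        "freshness": datetime.now(timezone.utc).isoformat(),
        "time_frame": "2022-07-11",
    },
    "completeness": {"missing_values": {"customer": 12}, "volume": 48210},
    "schema": {"id": "INTEGER", "customer": "TEXT", "amount": "REAL"},
}

# Emit as a structured log line that downstream monitors can parse.
print(json.dumps(pipeline_trace))
```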

Considerations to implement a data observability solution

To achieve data resiliency and avoid data incidents, there are 3 key steps organizations must take:

  1. Increase the visibility and awareness of data concerns
  2. Contextually log, measure, and trace data usage
  3. Use automation to continuously validate and test the quality of data in production

Given the three pillars of observability in software engineering:

  1. Metrics
  2. Traces
  3. Logs

The extension of this concept in data observability is the five pillars, commonly given as freshness, distribution, volume, schema, and lineage.

Every pillar covers the metrics needed to build a smart dashboard for data observability. Each pillar also comes with guiding questions that serve as examples for defining the formulas behind the chosen KPIs.
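
As an illustrative, non-authoritative mapping, each pillar can be paired with a guiding question and an example metric to seed the KPI formulas:

```python
# Illustrative mapping from observability pillars to guiding questions and
# example metrics; the questions and metric names are assumptions, meant only
# to seed KPI definitions for a dashboard.
PILLARS = {
    "freshness":    {"question": "How recent is the data?",
                     "metric": "hours_since_last_update"},
    "distribution": {"question": "Are field values within expected ranges?",
                     "metric": "pct_values_outside_expected_range"},
    "volume":       {"question": "Did the expected number of rows arrive?",
                     "metric": "row_count_vs_7_day_average"},
    "schema":       {"question": "Has the structure of the data changed?",
                     "metric": "count_of_added_or_dropped_columns"},
    "lineage":      {"question": "Which upstream sources and downstream consumers are affected?",
                     "metric": "count_of_impacted_downstream_tables"},
}

for pillar, info in PILLARS.items():
    print(f"{pillar:>12}: {info['question']} -> {info['metric']}")
```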

There are many software tools on the market to build great dashboards, from simple ones requiring little effort to more complicated and sophisticated options involving ML and AI; the selection depends on the company, budget, project, technology platform, and other variables.

The data observability journey depends on the particulars of the team and on the goals and KPIs of the project, focusing on solving the biggest pain points rather than starting in places that already operate smoothly enough.

Keep in mind the inclination toward moving fast, demonstrating high value and ROI, and tackling the work iteratively.

References

  1. Statista — Amount of data created, consumed, and stored 2010–2025, published by Arne von See (Statista is a German company specializing in market and consumer data).
  2. The World Economic Forum (WEF) — How much data is generated each day.
  3. Lori Lewis Media — Every Second Counts, Every Person Counts.
