Defusing Data ‘Time Bombs’ with DataHub Observability

Thosan Girisona
Data Engineering Indonesia
7 min read · Sep 27, 2023


Photo by Frederick Marschall on Unsplash

Data is like garbage. You’d better know what you are going to do with it before you collect it. — Mark Twain

This article takes you on a short journey into how we, as a team of data engineers at eFishery, strive to continuously enhance the quality of our data platform. We aim to defuse the data ‘time bombs’ that we continually monitor, or rather, tame them. The term ‘Observability’ in the title may have a slightly different connotation than usual. To us, ‘Observability’ means the ability of the data team to understand the current state of data comprehensively, encompassing aspects like integrity, quality, and timeliness.

In this post, we’ll explore how we implement data observability by integrating various systems, making them easily accessible and usable. This integration includes data cataloging, data profiling, data lineage, data job profiling, impact analysis, and anomaly detection.

The Background

Initially, our data platform was simple and straightforward. Data from spreadsheets and some production databases were ingested into a central database for analysis. Analysts would then process this data and create various dashboards for business needs in a data visualization platform.

As the business grew, the number of dashboards increased significantly without clear curation. To facilitate data users in finding and curating dashboards, we built a simple platform that included metadata for all dashboards, such as descriptions and owner details. This allowed all users, whether from the data team or not, to easily find the information they needed through this dashboard catalog.

Early phase of metadata handling

However, as the volume and diversity of data grew, the need for curation shifted. Users now required curation of ‘data’ rather than just dashboards. They had become accustomed to building tables and running queries independently for their specific needs. It was for these reasons that the data ‘time bombs’ started to form.

The ‘urgent need’

As the need for cataloging shifted, so did our data platform. In this second phase, we realized that the OLAP database being used needed better management. This is when we began implementing a simple data warehouse. The OLAP database, previously acting as a data lake (where all data was entered without clear rules), was reconfigured to act as a data warehouse, focusing on good data modeling (Bronze-Silver-Gold layering).

Aside from the data warehouse layering, another focus in this phase was providing effective data curation, through what is technically known as a Data Catalog. There are many open-source solutions for data catalogs, such as OpenMetadata, Amundsen, and CKAN. However, considering the diversity of supported data sources, the availability of technical documentation, and the growth of community and support, we decided to use DataHub.

The implementation of DataHub in the data platform was straightforward. DataHub was set up to regularly read and store metadata for every table in the data warehouse.
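In DataHub, this kind of scheduled metadata ingestion is configured through a YAML recipe. Below is a minimal sketch of what such a recipe might look like for a Postgres-based warehouse; the hostnames, database name, and credentials are hypothetical placeholders, and the recipe shape follows DataHub’s Postgres source and `datahub-rest` sink:

```yaml
# Hypothetical DataHub ingestion recipe: read table/column metadata
# from the warehouse and push it to the DataHub metadata service.
source:
  type: postgres
  config:
    host_port: "warehouse.internal:5432"   # placeholder host
    database: analytics                    # placeholder database
    username: datahub_reader
    password: "${POSTGRES_PASSWORD}"       # injected from the environment
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"      # placeholder DataHub endpoint
```

A recipe like this can be run on a schedule (e.g., via cron or an orchestrator) with `datahub ingest -c recipe.yml`, so the catalog stays in sync with the warehouse.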

Data warehouse and data catalog phase

At this phase, DataHub only stored table descriptions and column structures. Here’s an initial glimpse of DataHub.

DataHub homepage
Table & column metadata on DataHub

After this initial implementation, we realized that many tables and schemas (Postgres Datasets) needed tidying up. For example, we found tables containing experimental data in the production environment that had not been deleted, causing confusion for analysts searching for the data they needed in DataHub. This is what we mean by a data ‘time bomb.’

Yeah, we had successfully discovered and defused the data ‘time bomb,’ or so we thought.

The ‘explosion’

After believing we had defused the data ‘time bomb’ before it exploded, we took a moment to reflect and asked ourselves, “Is that the only data ‘time bomb’?”

Answering this question was challenging because, in reality, the metadata related to the data platform was limited to the catalog. However, to answer this question, we needed more in-depth information. That’s when we set out to gather more information through Data Profiling.
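In DataHub this is done by enabling profiling in the ingestion recipe, which computes per-column statistics such as null counts and distinct values. As an illustration only (not DataHub’s internals), here is a minimal pure-Python sketch of the kind of per-column profile involved, using made-up example rows:

```python
from collections import Counter

def profile_column(rows, column):
    """Compute the basic stats a data profiler typically reports per
    column: null count, null proportion, distinct count, sample values."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    distinct = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "null_proportion": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "unique_count": len(distinct),
        "sample_values": [v for v, _ in distinct.most_common(3)],
    }

# Hypothetical rows from a warehouse table.
rows = [
    {"pond_id": 1, "region": "Jawa Barat"},
    {"pond_id": 2, "region": None},
    {"pond_id": 3, "region": "Jawa Barat"},
]
print(profile_column(rows, "region"))
```

A profile like this is what makes a table look “suspicious” at a glance: an unexpected null proportion or row count is often the first visible symptom of a broken build upstream.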

Example for data profile on a certain table

After running data profiling for some time, we discovered another interesting piece of information: if the data profile of a table appeared suspicious (e.g., an unusually large number of rows or strange column names), it was highly likely that the data in its upstream tables was also suspicious. This led us to start collecting the interaction relationships between tables, known as Data Lineage.

Example for data lineage

After running this process for a while and collecting sufficient information, we finally agreed that we were too late in defusing the data ‘time bomb’. We concluded that it had already exploded.

This was because, at this stage, we realized that several serious issues existed in the data platform. Some of these problems included improper use of upstream tables, tables that were ‘dead’ but still considered active, and more.

Data profiling and data lineage phase

The ‘Time Bomb’ Detector

Once most of the recovery processes from the previous explosion were in place (leaving lasting scars to this day), we turned the previous problems into lessons to avoid repeating them in the future.

We realized that these mistakes, which we call data ‘time bomb’ explosions, might occur again in the future, but we believe that they won’t be as extensive as before. Therefore, we decided to build a detector for metadata platforms. This detector would be based on several factors.

The first detection factor was data quality through Data Quality Checks. We aimed to detect explosions through data quality, where poor data quality indicated errors in the data build-up process. The checks performed here are quality assurance, not a quality gate, meaning that data quality can only be verified after the data build-up process is complete. In this process, we integrated the Great Expectations data quality framework with DataHub.
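The distinction matters: a quality gate would block bad data from landing, whereas quality assurance verifies data after it has landed and raises an alert. Great Expectations has its own API for this; purely as an illustration of the idea, here is a sketch of two post-build checks in the style of its expectations, over hypothetical rows:

```python
def expect_column_values_to_not_be_null(rows, column):
    """Quality *assurance* check run after the table is built:
    it reports failures rather than blocking the build."""
    nulls = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"success": not nulls, "unexpected_count": len(nulls)}

def expect_row_count_to_be_between(rows, min_value, max_value):
    n = len(rows)
    return {"success": min_value <= n <= max_value, "observed": n}

# Hypothetical freshly-built table with one bad row.
rows = [{"order_id": 1}, {"order_id": None}]
results = [
    expect_column_values_to_not_be_null(rows, "order_id"),
    expect_row_count_to_be_between(rows, 1, 1_000_000),
]
print(all(r["success"] for r in results))  # False: one null order_id
```

In the real setup, Great Expectations runs suites of such expectations against the warehouse and its validation results are pushed to DataHub, where they appear alongside the table’s metadata.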

Great Expectations results on the DataHub page

Furthermore, utilizing the established data lineage, we also built detectors for factors related to failed build-up on upstream tables. This facilitated our Impact Analysis. So, if an anomaly was found in certain data, we could easily determine if it was due to the upstream table’s impact or the logic of the table itself. This could all be done without manually inspecting each upstream table, thanks to our Custom Test feature in DataHub.
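The mechanics of that impact analysis can be sketched as a traversal in the opposite direction: given the tables whose build-up failed, walk the lineage graph downstream to flag everything affected. The table names are hypothetical:

```python
def impacted_downstreams(failed, lineage):
    """Given tables whose build-up failed, return every table that is
    transitively downstream of them, i.e. everything whose data may
    now be stale or wrong."""
    # Invert the upstream mapping (table -> upstreams) into downstream edges.
    downstream = {}
    for table, ups in lineage.items():
        for up in ups:
            downstream.setdefault(up, []).append(table)
    impacted, queue = set(), list(failed)
    while queue:
        t = queue.pop()
        for d in downstream.get(t, []):
            if d not in impacted:
                impacted.add(d)
                queue.append(d)
    return impacted

LINEAGE = {
    "silver.orders": ["bronze.raw_orders"],
    "gold.daily_sales": ["silver.orders"],
}
print(sorted(impacted_downstreams({"bronze.raw_orders"}, LINEAGE)))
# ['gold.daily_sales', 'silver.orders']
```

This is what removes the manual inspection step: when an anomaly appears, one lookup tells you whether an upstream failure explains it or whether the table’s own logic is at fault.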

Upstream table build-up status indicator

Another factor we tried to detect was Data Build-up Jobs. Based on this factor, we created a Stagnant Checker to identify tables that never changed or were ‘dead,’ as well as Anomaly Detection to monitor the normalcy of row count changes in a table. We referred to this job detection process as Job Profiling. Besides the mentioned aspects, job profiling also helped us calculate the data pipeline service-level agreement (SLA).
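Both checks can be driven by a simple time series of row-count snapshots per table. As a minimal sketch (the window size, threshold, and counts below are illustrative assumptions, not our production values): a table is flagged stagnant when its count stops changing, and anomalous when the latest daily delta deviates far from its history:

```python
from statistics import mean, stdev

def is_stagnant(row_counts, window=7):
    """Flag a table as 'dead' if its row count has not changed
    over the last `window` snapshots."""
    recent = row_counts[-window:]
    return len(set(recent)) == 1

def is_anomalous(row_counts, z_threshold=3.0):
    """Flag the latest row-count delta if it deviates from the
    historical deltas by more than `z_threshold` standard deviations."""
    deltas = [b - a for a, b in zip(row_counts, row_counts[1:])]
    history, latest = deltas[:-1], deltas[-1]
    if len(history) < 2 or stdev(history) == 0:
        return False  # not enough (or no) variation to judge against
    return abs(latest - mean(history)) / stdev(history) > z_threshold

counts = [100, 110, 121, 130, 141, 150, 900]  # sudden jump on the last day
print(is_stagnant(counts), is_anomalous(counts))  # False True
```

The same snapshot timestamps double as the raw material for SLA calculation: comparing when a table’s count last changed against its expected build schedule tells you whether the pipeline delivered on time.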

Job profile examples

In the end, here’s a general overview of the data observability architecture up to this phase. With these changes, we successfully detected some data ‘time bombs’ before they exploded. But have we found them all?

The data ‘time bomb’ detector phase

Conclusion

In conclusion, our journey through data observability with DataHub has shed light on its transformative potential. But this is just the beginning.

The next steps are clear: assess data challenges, harness the power of solutions like DataHub, and explore its capabilities further. Implement data quality checks with tools like Great Expectations, seamlessly integrated with DataHub to ensure data reliability.

Stay attuned to data observability trends and nurture collaboration within your teams. By collectively sharing insights, we propel ourselves toward a data-driven future. Remember, the road to data excellence is an ongoing one, and with each step forward, we conquer the ‘time bombs’ lurking in our data landscape.

Credit

This article is the product of collaborative efforts from the dedicated members of the eFishery data engineering team. Special thanks to my leads and friends who enthusiastically contributed to this article: Mas Rifan Kurnia (Our beloved VP), Wa’ Dimas Gilang, Yusuf Maulana, Agung Fajar Gumilar, and Fajar Muslim.

We also acknowledge Data Engineering Indonesia community for their valuable insights and review, helping us share this article with a broader audience. Your contributions have played a crucial role in making this knowledge accessible. Thank you!

P.S.: Beware of the second part!

Thosan Girisona is a Data Engineer and Mentor at Data Engineering Indonesia. He loves handling the last-mile part of a data pipeline! https://www.linkedin.com/in/thosan/