From Chaos to Clarity: How DataHub Transformed our Data Utilization


In today’s data-intensive landscape, organizations face the daunting task of managing and understanding vast quantities of information. As a company handling hundreds of terabytes of data each month, we encountered significant challenges in organizing and accessing our thousands of datasets. To address these issues and unlock the full potential of our data assets, we implemented DataHub as our centralized data catalog solution.

Our Motivation

We wanted to provide a unified platform where users can easily find answers to their data-related questions, a starting point from which to access relevant information and delve deeper as needed. By simplifying the discovery of data assets, whether specific metrics, dashboards, privacy information, lineage, or impact, users can seamlessly explore and interact with the data.

This streamlined approach empowers them to make informed, data-driven decisions with greater efficiency and confidence, ensuring they have everything they need in one place to navigate their data ecosystem effectively.

But managing such a vast amount of data led to an endless stream of questions directed at our Data Team, both from other functional units and from the various roles within our own data teams, such as:

  • Data Acquisition: How is this data obtained?
  • Data Content: What information does this data contain?
  • Data Access: Where is this data stored so I can access it?
  • Data Ownership: Who is responsible for this data, so I know whom to contact with inquiries?
  • Data Utilization: Who utilizes this data and for what purposes?
  • Dashboard Availability: Do we have this dashboard or KPI available in our BI tools?

“Data Disaster: Our Misadventures Before the Catalog Came to the Rescue”

We ran a lot of data literacy sessions and learned that different people on the team face different problems depending on their role in the ecosystem. For instance:

  • Business Stakeholders: Inability to access and understand key performance indicators (KPIs) hindered strategic decision-making.
  • Product: Lack of data visibility led to delays in product development and iteration.
  • Analyst: Excessive time spent searching for data reduced efficiency and productivity.
  • Engineers: Faced challenges in maintaining data pipelines due to unclear data flow and dependencies.
  • Security: Ensuring data compliance and security was difficult without a clear understanding of data flows and access points; the security team found it especially challenging to keep track of sensitive data points.
  • Architect: Data architects struggled to design scalable and efficient data architectures without comprehensive metadata.
  • DevOps: DevOps teams faced difficulties in understanding what data belongs to which product.

Choosing the Right Data Catalog Solution

“No silver bullet works for all”

There are many metadata catalog solutions available today that aim to solve this ever-growing problem. We did a deep dive and compared many of them against a set of principles that stood out for us.

To name a few, below are the criteria that helped us make a better call:

  • Easy-to-use UI with quick search that makes data easy for people to find and understand
  • Easy onboarding
  • Automatic scanning of metadata
  • Diverse capabilities like E2E Data Lineage and support for our data stack
  • Strong community support
  • Open-source availability and ease of installation and maintenance

Based on these factors, we chose DataHub.

Implementing DataHub: Our Journey

  • Deployment: Using a Helm Chart for efficient deployment and management.
  • Metadata Ingestion: Establishing connections to and ingesting metadata from NiFi, Athena, Redshift, Tableau, Redash, Kafka, and OpenSearch (see the sketch after this list)
  • Metadata Completion: Completing metadata for data assets
  • Defining a Business Glossary: We defined standard business terms and linked data assets to them.
  • Implementing Domains: We organized our data assets into logical collections called Domains in DataHub, aligning them with business units for streamlined management and easy access.
  • E2E Lineage Tracking: Implementing full lineage tracking across systems. This was the hardest and most time-consuming part of the entire implementation for the data engineering team.
  • Data Quality Metrics Reports: Currently in progress
  • Alerts Management Integration: Connecting our alerts management system with DataHub Incidents (currently in progress)
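
For illustration, here is a minimal sketch of how such an ingestion can be run programmatically with DataHub’s Python SDK (acryl-datahub). The Redshift connection values and the GMS endpoint below are placeholders, and the same pattern applies to the other sources we connected.

```python
from datahub.ingestion.run.pipeline import Pipeline

# Programmatic equivalent of a YAML ingestion recipe; connection
# values and the DataHub endpoint are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                "host_port": "redshift.example.com:5439",
                "database": "analytics",
                "username": "datahub_reader",
                "password": "********",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},
        },
    }
)
pipeline.run()                # pull metadata from the source and push it to DataHub
pipeline.raise_from_status()  # fail loudly if the run reported errors
```

Recipes like this can be scheduled so the catalog stays in sync with the underlying systems.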

Practical use-cases that we identified for our Data Catalog to address:

  • Grouping of data assets into logical folders based on business process
  • Faster impact assessment and reduced turnaround time (TAT) to resolve issues in case of pipeline failures
  • Quicker onboarding for new team members and elimination of dependency on a “Data Hero”
  • PII data identification in our system
  • Removal of dead data movement and pipeline components
  • A central component for defining standard business definitions and tagging data assets to them

How we addressed these use cases using DataHub:

End-to-End Data Lineage: This provided us with a clear view of how data flows through our systems, from ingestion to transformation and consumption. Understanding data lineage helped us ensure data quality, compliance, and governance, and facilitated better collaboration among teams.

By integrating our alert management system with DataHub, we can now use DataHub lineage to identify which downstream systems are affected by pipeline failures and determine who needs to be notified.

Downstream system impact due to job failure
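
Most lineage comes from connectors, but edges a connector cannot infer can also be emitted manually via the Python SDK. A rough sketch follows; the dataset names and the GMS endpoint are hypothetical.

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter("http://datahub-gms:8080")  # assumed GMS endpoint

# Declare that the Redshift table is derived from the Kafka topic
# (both names are hypothetical examples).
lineage = make_lineage_mce(
    upstream_urns=[make_dataset_urn("kafka", "orders_events")],
    downstream_urn=make_dataset_urn("redshift", "analytics.orders"),
)
emitter.emit(lineage)
```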

Tagging Sensitive Data Points: Tagging sensitive data points helps our security team and DevOps easily identify and manage sensitive information, enhancing data protection and operational efficiency.

PII Data Tagging
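
A minimal sketch of applying such a tag programmatically; the dataset name and endpoint are hypothetical, and note that this write replaces any tags already on the dataset rather than appending to them.

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

emitter = DatahubRestEmitter("http://datahub-gms:8080")  # assumed GMS endpoint

# Attach a PII tag to a (hypothetical) dataset; this overwrites the
# existing globalTags aspect, so read-modify-write if tags may exist.
pii_tags = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("PII"))])
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("redshift", "analytics.customers"),
        aspect=pii_tags,
    )
)
```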

Repository Links for Data Pipelines: Repository links are added in DataHub for Airflow DAGs and Spark jobs so that data engineers can quickly access and review the code in the event of a pipeline failure, facilitating faster debugging and resolution.

Airflow DAG with repository link
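
One way to attach such links is DataHub’s institutional memory aspect. A hedged sketch, with hypothetical DAG, task, and repository names:

```python
import time

from datahub.emitter.mce_builder import make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    InstitutionalMemoryClass,
    InstitutionalMemoryMetadataClass,
)

emitter = DatahubRestEmitter("http://datahub-gms:8080")  # assumed GMS endpoint
now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub")

# Link an Airflow task (hypothetical DAG/task ids) to its source repository.
job_urn = make_data_job_urn(orchestrator="airflow", flow_id="orders_dag", job_id="load_orders")
links = InstitutionalMemoryClass(
    elements=[
        InstitutionalMemoryMetadataClass(
            url="https://git.example.com/data/orders-dag",  # hypothetical repo URL
            description="Airflow DAG source code",
            createStamp=now,
        )
    ]
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=links))
```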

Business Glossary: A standardized business glossary now fosters a shared language across the teams, enhancing clarity and consistency in data management.
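
For example, linking a data asset to a glossary term takes only a few lines of the Python SDK; the term and dataset names here are made up, and as with tags this write replaces the existing glossary-terms aspect.

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

emitter = DatahubRestEmitter("http://datahub-gms:8080")  # assumed GMS endpoint

# Associate a (hypothetical) dataset with a (hypothetical) glossary term.
terms = GlossaryTermsClass(
    terms=[GlossaryTermAssociationClass(urn=make_term_urn("ActiveUser"))],
    auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:datahub"),
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("redshift", "analytics.daily_kpis"),
        aspect=terms,
    )
)
```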

Data Catalog as a Catalyst: The DataHub data catalog significantly supported our AI/ML team’s efforts by offering seamless access to detailed column descriptions and table schemas through its APIs. This comprehensive data accessibility enabled the team to efficiently develop a text-to-SQL tool, which translates natural language queries into SQL commands. By leveraging the structured information stored in the catalog, the team could ensure accuracy and consistency in query generation, ultimately enhancing the tool’s effectiveness and reliability.
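
To illustrate the kind of API access involved, here is a minimal sketch of fetching a table’s schema from DataHub with the Python client; the dataset name and endpoint are hypothetical.

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import SchemaMetadataClass

graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))  # assumed endpoint

# Pull the schema aspect for a (hypothetical) table and print the
# column names, types, and descriptions a text-to-SQL prompt would use.
urn = make_dataset_urn("redshift", "analytics.orders")
schema = graph.get_aspect(entity_urn=urn, aspect_type=SchemaMetadataClass)
if schema is not None:
    for field in schema.fields:
        print(field.fieldPath, field.nativeDataType, field.description)
```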

Customization & Giving back to the Community

End-to-end lineage with DataHub worked like a breeze: a few configurations and voila, it’s there. We have been using Apache NiFi heavily for our real-time data ingestion and data flow use cases. However, we noticed that DataHub currently has very limited support for NiFi from a lineage perspective, so we added our own custom plugin to derive more advanced automatic lineage from our NiFi engine.

E2E lineage with NiFi processors for Kafka, RMQ, and OpenSearch

As soon as we have tested the feasibility of this plugin across all our pipelines, we will be happy to share it with the community.

Conclusion

Implementing DataHub as our data catalog solution has significantly improved our data management capabilities. By providing a centralized repository of metadata, DataHub has enhanced data visibility, fostered collaboration, and enabled data-driven decision-making across the organization. The integration of diverse data sources and the establishment of end-to-end data lineage have empowered our teams to work more efficiently and effectively, ultimately driving better business outcomes.
