Data Catalogs — Unlocking Value in your Data Lakes

Sajjad Syed
6 min read · Apr 6, 2020


It’s increasingly clear that successful data lake transformation and adoption of self-service rest on the findability and accessibility of enterprise data, enabled by an emerging data capability: the data catalog.

Figure 1 : Data Catalog Reference Architecture

The reference architecture above takes a holistic approach to developing an end-to-end data catalog, including functional and global governance features. It acknowledges that functional or use-case-specific catalogs exist alongside an enterprise-level catalog, and that both are stitched together with a data marketplace to support a data-driven company. Companies that want to discover data and generate insights across all their data assets, including data warehouses and data lakes, and to improve the productivity of their data science and business communities, need the ability to search, discover, and provision any data asset across the organization. That means taking a universal approach that connects metadata across data silos, data warehouses, data lakes, and analytical environments.

With new data privacy laws being enacted, it is becoming more important to understand data from origin to consumption. Not being able to trace data across the pipeline now comes with significant risk. Unfortunately, the majority of organizations are not equipped for this new reality, and the multiplication of new data sources makes the task even more difficult. In the age of big data, companies can capture large and rapidly growing volumes and varieties of data in data lakes, but the inability to efficiently search, discover, and access all of that data hinders insight generation. A vast amount of data remains unknown across the organization, and the value in those data sets goes untapped, as the figure below shows:

Figure 2 : A depiction of undiscoverable data assets

How are organizations addressing this challenge and overcoming the risk? This is the enormous promise of data catalog technology. A data catalog is a company-wide inventory of data assets that enables discovery, collaboration, trust, provisioning, and governance. A full-featured data catalog helps its users discover, understand, and trust potentially relevant data assets that they do not know well or did not know about before. It contains information that explains both the technical characteristics and the business context of all of a company's data assets. In doing so, a data catalog should rebuild trust in data and resolve many of the inefficiencies described above. Figure 3 below shows the core capabilities of the data catalog in the end-to-end data supply chain, from data producers and curators to data consumers.

Figure 3 : Data Catalog core capabilities in end to end data supply chain
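To make the idea of an inventory that combines technical characteristics with business context a little more concrete, here is a minimal Python sketch. The `CatalogEntry` fields, the `search` helper, and the sample assets are all hypothetical and chosen only for illustration; they are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One data asset in a hypothetical catalog."""
    name: str
    location: str                      # technical: where the asset lives
    columns: Dict[str, str]            # technical: column name -> data type
    description: str = ""              # business: what the asset means
    owner: str = ""                    # business: who is accountable for it
    tags: List[str] = field(default_factory=list)


def search(entries, keyword):
    """Return names of entries whose name, description, tags, or columns mention the keyword."""
    needle = keyword.lower()
    hits = []
    for entry in entries:
        haystack = [entry.name, entry.description, *entry.tags, *entry.columns]
        if any(needle in text.lower() for text in haystack):
            hits.append(entry.name)
    return hits


# Illustrative sample data only.
catalog = [
    CatalogEntry(
        name="sales_orders",
        location="warehouse.sales.orders",
        columns={"order_id": "int", "customer_id": "int", "amount": "decimal"},
        description="Confirmed customer orders, one row per order",
        owner="Sales Operations",
        tags=["sales", "revenue"],
    ),
    CatalogEntry(
        name="web_clickstream",
        location="lake/raw/clickstream/",
        columns={"session_id": "string", "customer_id": "int", "url": "string"},
        description="Raw website click events",
        owner="Digital Marketing",
        tags=["web", "behavior"],
    ),
]

print(search(catalog, "customer"))  # both assets mention customers
print(search(catalog, "revenue"))   # only the sales asset is tagged with revenue
```

Even in this toy form, a keyword search finds an asset in the lake and an asset in the warehouse through the same entry point, which is the findability a catalog is meant to provide.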

Benefits

A data catalog will eliminate many of the pain points that now exist when business users and data scientists try to gain business insight from data. The key benefits are:

  • Improve productivity and reduce the time teams spend searching for relevant information or data
  • Increase visibility of the key datasets held across different teams
  • Avoid duplicate purchases of similar datasets by different teams
  • Improve collaboration between data science and business teams
  • Speed up the process of accessing and interpreting data
  • Facilitate compliance with growing international privacy and reporting regulations
  • Establish common KPIs and data definitions that make data comparable and understandable
  • Facilitate data relevancy and usage tracking

The State of Data Catalogs

Amid all the hype and high expectations, it’s becoming clear that the Data Catalog space is rife with confusion. Vendors in the space have differing capabilities that address a multitude of problems, from technical metadata and taxonomies to data profiling, self-service & discovery.

  • Tools can amplify or simplify a business definition but cannot create one. Most tools use ML to crawl sources and automatically ingest technical and derived metadata, but none can create business metadata.
  • The underlying problem lies somewhere between data collection and data curation. Finding business people who have the time and motivation to curate and define business metadata continues to be a challenge.
  • In this environment, it’s no surprise that most organizations struggle to find a data catalog that meets their specific needs. Not all data catalogs have the same capabilities, and some might even make the problem worse.
  • Bigger organizations end up with more than one catalog, and findability across functional boundaries remains a challenge.
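The split between what a tool can derive and what a person has to supply can be illustrated with a small sketch. The profiler below is a deliberately simple stand-in for the crawling that catalog tools perform (no ML involved); the sample data, the `infer_type` and `profile` helpers, and the `business_definition` field are all hypothetical.

```python
import csv
import io

# Illustrative sample extract; the third row has a missing amount.
SAMPLE = """order_id,amount,region
1,19.99,EMEA
2,5.00,APAC
3,,EMEA
"""


def infer_type(values):
    """Guess a column type from its non-empty sample values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    for caster, label in ((int, "integer"), (float, "decimal")):
        try:
            for v in non_empty:
                caster(v)
            return label
        except ValueError:
            continue
    return "string"


def profile(csv_text):
    """Derive technical metadata per column; business metadata stays empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    result = {}
    for column in reader.fieldnames:
        values = [row[column] for row in rows]
        result[column] = {
            "type": infer_type(values),
            "null_count": sum(1 for v in values if v == ""),
            "distinct_count": len({v for v in values if v != ""}),
            # Cannot be derived from the data; a data steward must supply it.
            "business_definition": None,
        }
    return result


for column, metadata in profile(SAMPLE).items():
    print(column, metadata)
```

Types, null counts, and distinct counts fall out of the data automatically, while `business_definition` stays empty until someone who understands the business fills it in.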

Vendor Landscape

To successfully navigate this increasingly complicated space, organizations should understand the distinction between the two categories of data catalogs: Specialized/Embedded catalogs and Independent/Unified catalogs.

Specialized/Embedded data catalogs are geared toward use-case-specific outcomes and come as an embedded capability of other tool categories. Because of this, they often end up specializing in a particular department and use case, such as data integration platforms (Denodo, Talend, etc.), data preparation (Tamr, Paxata, Dataiku, etc.), or data lake enablement tools (Zaloni, Cloudera Navigator, Glue, Azure Catalog, etc.). Implementing a specialized data catalog may get organizations quick results for certain use cases, but it will limit them in the long term. These catalogs often lack key functionality, such as the ability to ingest data from any source and an end-to-end workflow to enable governance.

Independent/Unified data catalogs provide comprehensive end-to-end capabilities, with a strong focus on ML-driven features and on building trust in data by capturing data lineage and enabling strong data governance. These catalogs are also becoming the centerpiece of the data marketplace, with the ability to explore and provision data assets. They break down silos to empower solutions across the entire enterprise, and they give an organization the flexibility to address future use cases that may be hard to imagine today, without the need for retrofitting. Companies like Collibra, Alation, and Informatica are leading the way.
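Data lineage, mentioned above as a basis for trust, is essentially a graph of which assets are derived from which. As a rough sketch only, the snippet below walks a hypothetical lineage map to trace a report back to its origins; the asset names and the `trace_origins` helper are assumptions for illustration and do not reflect how any of the named vendors store lineage.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it is derived from.
UPSTREAM = {
    "revenue_dashboard": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
}


def trace_origins(asset, upstream):
    """Return every upstream asset of `asset`, in breadth-first order."""
    seen = []
    queue = deque(upstream.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


print(trace_origins("revenue_dashboard", UPSTREAM))
```

With lineage captured this way, a consumer of the dashboard can see that it ultimately depends on the raw orders feed and the CRM export, which is the origin-to-consumption view that privacy regulations increasingly call for.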

Data Catalog Capability Checklist

Organizations are overwhelmed by the choice of data catalog solutions on the market. When vetting a solution, they can avoid simply creating a new silo by making third-party integration and machine-learning-augmented capabilities key criteria in their evaluation. The checklist in Figure 4 below can help organizations determine whether the data catalog solution they are vetting offers enterprise functionality.

Figure 4 : Data Catalog Capabilities Checklist

Key Takeaways:

Enable a unified/enterprise data catalog using a top-down approach, with light governance, for a few core cross-functional data assets. Use-case-specific or embedded catalogs can be developed and maintained at a region or function level in a bottom-up fashion.
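One way to picture how an enterprise catalog and several functional catalogs can coexist is a single search entry point that fans out across all of them. The sketch below is only a toy model under that assumption: the catalog names, asset names, descriptions, and the `federated_search` helper are invented for illustration.

```python
# Hypothetical catalogs: catalog name -> {asset name: business description}.
CATALOGS = {
    "enterprise": {
        "customer_master": "Golden record for customers across all regions",
    },
    "marketing": {
        "campaign_results": "Response rates per campaign and customer segment",
    },
    "finance": {
        "gl_postings": "General ledger postings by cost center",
    },
}


def federated_search(catalogs, keyword):
    """Search every catalog and report which catalog each matching asset lives in."""
    needle = keyword.lower()
    results = []
    for catalog_name, assets in catalogs.items():
        for asset_name, description in assets.items():
            if needle in asset_name.lower() or needle in description.lower():
                results.append((catalog_name, asset_name))
    return sorted(results)


print(federated_search(CATALOGS, "customer"))
```

Each function keeps ownership of its own catalog, while a user searching for customer data still sees matches from both the enterprise and the marketing catalogs.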

Below are some additional points to consider:

  • Look to create a truly end-to-end data marketplace with a combination of specialized and enterprise data catalogs.
  • Make the data catalog seamless by integrating it with consuming applications. A business user or data scientist should be able to access it from any collaboration or analytical environment.
  • Enable stewardship at the edges to ensure high-quality metadata creation and maintenance for department or use-case-specific data.
  • Create an enterprise data council that is responsible for governing, maintaining, and curating cross-functional metadata definitions and usage policies.
  • It is impossible to curate all metadata manually. Leverage ML-augmented capabilities to automate as much as possible.
  • Start with highly used or frequently accessed data assets and work your way down to all the data in a systematic way.
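The last point, starting with the most heavily used assets, can be expressed as a simple ranking. The following is a minimal sketch assuming that usage statistics are available per asset; the `monthly_reads` and `curated` fields, the sample numbers, and the `curation_backlog` helper are hypothetical.

```python
# Illustrative usage statistics; the field names are assumptions.
ASSETS = [
    {"name": "sales_orders", "monthly_reads": 1200, "curated": False},
    {"name": "web_clickstream", "monthly_reads": 300, "curated": False},
    {"name": "customer_master", "monthly_reads": 2500, "curated": True},
    {"name": "legacy_exports", "monthly_reads": 4, "curated": False},
]


def curation_backlog(assets, top_n=3):
    """List uncurated assets, most heavily used first."""
    uncurated = [a for a in assets if not a["curated"]]
    ranked = sorted(uncurated, key=lambda a: a["monthly_reads"], reverse=True)
    return [a["name"] for a in ranked[:top_n]]


print(curation_backlog(ASSETS, top_n=2))
```

Assets that are already curated drop out of the backlog, and rarely read assets such as `legacy_exports` wait until the high-traffic ones have been documented.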

In a future article, I will share more details on the reference architecture components, the data marketplace, and the governance model.


Sajjad Syed

Data & Analytics leader with more than 20 years of experience in design, architecture, and enterprise implementations for Fortune 100 companies.