Data Catalogs and your data rocks

Vincent Rejany
7 min read · Jul 2, 2020


In the first article of this series, we talked about why Data Catalogs are so trendy. It is time now to put some meat on the bones and walk the different miles of that stairway to heaven.

Any reference to rock songs is purely coincidental!

Start me up: Data Catalog classic features

Let’s open up the beast. What is inside a data catalog? Of course, approaches to data cataloging differ from one software vendor to another, but most of the existing solutions rely on the following four main components:

  • A flexible data model for storing the metadata objects and their relationships
  • A set of data discovery services that allow you to extract metadata from structured and unstructured data sources, as well as to enrich that metadata (discovery, scoring) with additional information and insight
  • Search and indexing services that allow you to make the information available as quickly as possible and to formulate complex search queries
  • An intuitive, easy to use, and collaborative user interface so that any kind of user can search and find what he or she needs
Data Catalog Components

You say “One” Central and Agile Metadata Repository

Metadata must be stored and represented in a flexible format, allowing for performant search and retrieval of content, high data volumes, security, and versioning. Most data catalog solutions rely on a graph database, as it puts the focus on the relationships between data asset elements and makes the repository easier to query; a minimal sketch of that graph view follows the list below. The following is a non-exhaustive list of the requirements or features expected from a data catalog in terms of metadata management:

  • Metadata can either be extracted through metadata crawlers or created manually by data catalog administrators.
  • Properties or attributes can be created by administrators on metadata objects, and they can be inherited from parent objects to their children.
  • Application domains (for example, HR, CRM, analytics, and so on) or departments for grouping data sources together can be created by administrators to organize the catalog.
  • Purpose and appropriate use can be defined by data catalog users (typically owners, stewards, or SMEs) to indicate what an asset is for, and what its appropriate use would and would not be.
  • Tables can be linked to other tables (for example, a fact table related to its dimensions or to reference data).
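
To make the graph idea concrete, here is a minimal sketch, not any vendor’s actual model, of how tables, columns, and domains could be stored as nodes with typed relationships, using networkx; all asset names and properties are invented for illustration:

```python
import networkx as nx

# Hypothetical metadata graph: assets are nodes, relationships are typed edges.
g = nx.DiGraph()

# An application domain grouping a data source
g.add_node("CRM", type="domain")
g.add_node("crm_db", type="data_source")
g.add_edge("CRM", "crm_db", relation="contains")

# A table, its columns, and an attribute meant to be inherited
g.add_node("crm_db.customers", type="table", sensitivity="personal")
g.add_node("crm_db.customers.email", type="column")
g.add_edge("crm_db", "crm_db.customers", relation="contains")
g.add_edge("crm_db.customers", "crm_db.customers.email", relation="contains")

# A fact table linked to its dimension table
g.add_node("crm_db.orders", type="table")
g.add_edge("crm_db.orders", "crm_db.customers", relation="references")

def effective_property(graph, node, prop):
    """Resolve a property, inheriting it from parents along 'contains' edges."""
    if prop in graph.nodes[node]:
        return graph.nodes[node][prop]
    for parent, _, data in graph.in_edges(node, data=True):
        if data.get("relation") == "contains":
            value = effective_property(graph, parent, prop)
            if value is not None:
                return value
    return None

print(effective_property(g, "crm_db.customers.email", "sensitivity"))  # -> personal
```

The point of the graph representation is that both property inheritance and relationship queries (which tables reference this one, which assets belong to this domain) become simple traversals.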

Can’t stop extracting Metadata and discovering New Insights

Technical Metadata Extraction

But where does this metadata come from? Metadata about the structured or unstructured data sources connected to the platform must be extracted and made available to the end user. Most data catalog solutions rely on content discovery techniques. These go by different names (for example, crawlers, sniffers, bots, and connectors), but they all have the same objective: connecting to databases and either querying the database dictionary and system tables or running a set of queries to read metadata. Depending on the size of the database or data lake storage, this step can take time. Metadata refreshes can be triggered automatically when a change is detected, run on a schedule, or executed manually.
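
As an illustration of what such a crawler does under the hood, the following sketch reads column metadata from the standard information_schema views with SQLAlchemy; the connection string, credentials, and schema name are placeholders, and real crawlers handle far more dialects and object types:

```python
from sqlalchemy import create_engine, text

# Hypothetical connection string; any database exposing information_schema works.
engine = create_engine("postgresql://catalog_crawler:secret@dwh-host/sales")

METADATA_QUERY = text("""
    SELECT table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = :schema
    ORDER BY table_name, ordinal_position
""")

def crawl_schema(schema: str) -> list[dict]:
    """Return one metadata record per column, ready to be loaded into the catalog."""
    with engine.connect() as conn:
        rows = conn.execute(METADATA_QUERY, {"schema": schema})
        return [dict(row._mapping) for row in rows]

for record in crawl_schema("public"):
    print(record)
```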

Data management platforms and databases usually support metadata import/export through standards like CWM or SQL DDL, or even through specific text-formatted files. This is often a time-consuming and error-prone approach, as it is usually not automated. Finally, REST API integration and manual entry are other options.

A data catalog must master metadata. If the same definition of a data asset (say, a table) comes from multiple sources (crawling, a third-party tool, or manual entry), it should be mastered and versioned in the data catalog, and the asset should be presented only once.
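
A minimal sketch of that mastering step, assuming a simple model in which each source contributes a definition keyed by the asset’s fully qualified name and the catalog keeps one entry with a version history:

```python
from datetime import datetime, timezone

catalog: dict[str, dict] = {}  # one entry per fully qualified asset name

def upsert_asset(fqn: str, source: str, definition: dict) -> dict:
    """Master an incoming definition: a single asset entry with a version history."""
    entry = catalog.setdefault(fqn, {"versions": [], "sources": set()})
    entry["sources"].add(source)
    # Only record a new version when the definition actually changes.
    if not entry["versions"] or entry["versions"][-1]["definition"] != definition:
        entry["versions"].append({
            "definition": definition,
            "source": source,
            "captured_at": datetime.now(timezone.utc),
        })
    return entry

# The same table arriving from a crawler and from a manual entry
upsert_asset("crm_db.customers", "crawler", {"columns": 12, "owner": "CRM team"})
upsert_asset("crm_db.customers", "manual", {"columns": 12, "owner": "CRM team"})
print(len(catalog))  # -> 1: the asset is presented only once
```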

The following tables illustrate some examples of the information that can be retrieved by content crawlers or imported into the data catalog.

A non-exhaustive list of Table Metadata
Non-exhaustive List of Column Metadata

Support Metadata Discovery

Getting technical metadata is one critical step, but delivering insight about the data assets is also key for data catalogs. Data profiling metrics are the first level of analysis. Profiling is a discovery process that examines data by collecting statistics and information about it to gain insight and uncover potential data quality issues. It involves gathering measurements for key metrics about specific data elements. Data profiling is a step in the assessment stage of the data quality lifecycle. It should not be mistaken for a complete data quality assessment, in that it does not explicitly determine whether a data defect exists. Instead, profiling enhances knowledge of the data and raises awareness of potential data quality issues that might require a thorough data quality assessment to determine the true quality status of the data in question. The table below presents a list of typical data profiling metrics.

A non-exhaustive list of Column-Level Profiling Metrics
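
As an illustration, the following sketch computes a handful of those column-level metrics with pandas; the DataFrame and its columns are invented for the example:

```python
import pandas as pd

# Invented sample data standing in for a crawled table
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "email": ["a@x.com", None, "c@x.com", "d@x.com", "d@x.com"],
})

def profile_column(series: pd.Series) -> dict:
    """Compute a few basic column-level profiling metrics."""
    non_null = series.dropna()
    return {
        "count": int(series.size),
        "missing": int(series.isna().sum()),
        "completeness_pct": round(100 * series.notna().mean(), 1),
        "distinct": int(non_null.nunique()),
        "uniqueness_pct": round(100 * non_null.nunique() / max(series.size, 1), 1),
        "min": non_null.min() if not non_null.empty else None,
        "max": non_null.max() if not non_null.empty else None,
    }

for name, column in df.items():
    print(name, profile_column(column))
```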

Depending on the information available, additional analysis can be orchestrated, such as the following:

  • Data inventory/tagging: This aims at answering questions like “Do you have any data of type X?” To address this requirement, you need to be able to orchestrate field name and field content analysis to tag variables and tables. Such analysis can rely on machine learning, natural language processing, identification analysis, regular expressions, rules, or dictionaries, and it is often backed by a manual remediation process and/or collaborative work from the catalog community (a minimal tagging sketch follows this list).
  • Scoring: Many metrics can be calculated at the table level:
    • a data quality score, obtained by analyzing the completeness and redundancy of the information;
    • an ABT score, indicating how fit a table is for analytics;
    • a privacy score, obtained by analyzing the number of variables identified as personal or sensitive data, their completeness, and the likelihood of re-identification;
    • a risk score, combining the privacy score, the information security metadata, and the degree of exposure of the table within the organization.
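
As a minimal sketch of the rule- and dictionary-based side of tagging (leaving the machine learning variants aside), the patterns, thresholds, and column names below are invented for illustration:

```python
import re

# Hypothetical tagging rules: regular expressions on field names and field content.
NAME_RULES = {
    "email": re.compile(r"e.?mail", re.IGNORECASE),
    "phone": re.compile(r"(phone|mobile|tel)", re.IGNORECASE),
}
CONTENT_RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.IGNORECASE),
}

def tag_column(name: str, sample_values: list[str]) -> set[str]:
    """Tag a column from its name and from a sample of its content."""
    tags = {tag for tag, pattern in NAME_RULES.items() if pattern.search(name)}
    for tag, pattern in CONTENT_RULES.items():
        matches = sum(1 for v in sample_values if v and pattern.match(v))
        if sample_values and matches / len(sample_values) > 0.8:
            tags.add(tag)
    return tags

print(tag_column("contact_mail", ["john@corp.com", "jane@corp.com"]))  # -> {'email'}
```

In practice such rules only produce candidate tags; the catalog community then confirms or corrects them, which is where the manual remediation and collaboration mentioned above come in.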

“Rock the Casbah” with Smart and Easy Search Engine & UI

User interface complexity is one of the main reasons why metadata management has failed in the past. Now that personas have changed, users want a UI where they can search and query a catalog of available assets. It must be as easy as possible, with no technical knowledge required. The main objective is to facilitate access to data for non-technical people and to allow them to be autonomous. Searching a data catalog must return results in sub-second time and be as easy as searching on Google or Amazon.

Free-text search on metadata through facets, keywords, or even natural language is a must-have. All properties and attributes can be used for searching, so that users can formulate queries such as “revenue for French retail stores in 2018” or “tables containing sensitive data with more than 10000 records.”

Apache Atlas Basic Search

Most of the data catalog solutions rely on well-known search engines such as Elasticsearch, Solr, or Lucene and expose an API so that search can be embedded into other applications.
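
As a sketch of how a query like “tables containing sensitive data with more than 10000 records” could be expressed against such a backend, assuming an Elasticsearch 8.x Python client and a hypothetical index of metadata documents:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical catalog search backend

response = es.search(
    index="catalog-assets",  # hypothetical index holding one document per asset
    query={
        "bool": {
            "must": [{"match": {"description": "sensitive data"}}],
            "filter": [
                {"term": {"asset_type": "table"}},
                {"range": {"row_count": {"gt": 10000}}},
            ],
        }
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_source"]["name"])
```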

Of course, this component list is not exhaustive. Data catalogs can be deeply integrated with other governance activities such as business glossary management, data lineage and impact analysis, or reference data and data quality rules management.

Data Democracy “Uprising”

The world of metadata management is now waking up, empowered by the democratization of data access and consumption through self-service capabilities and by the call for more ease of use and simplification for faster time to value. Business users need a centralized repository for data that has been categorized and classified. However, data catalogs need to be more ambitious and extend to other metadata objects such as reports, models, process flows, data pipelines, business rules, and any other objects that are part of the data and analytics lifecycle. Support for data lineage capabilities is key to giving visibility into where the data comes from, where it is used, and by whom.

Data catalogs need to transition to “information catalogs” and embed built-in security and governance capabilities. This transition could ideally be driven by AI/ML features for strengthening collaboration, policy definition, error detection, and data privacy principles. Automation is the only way to address the big data challenge and to comply with data privacy regulations.

Moreover, the openness and extensibility of catalogs are becoming a critical topic. With an increasing number of solutions delivered by cloud vendors and data management software vendors, the ability to integrate catalogs through APIs or standards like the ODPI Egeria project is a judicious bet on the future. Without such an approach, the ambition of bringing clarity to the metadata confusion will surely fail. Getting metadata out of a platform, or generating it, is the easy part; maintaining a catalog and sustaining governance is a far higher hill to climb.
