Testing a Data Catalog

Published in Globant · 6 min read · Dec 31, 2020

Introduction

It is now widely accepted that companies embracing data-driven decision making achieve productivity 5–6% higher than would be expected given their other investments and information technology usage. (1)

The amount of data generated and collected around the world is increasing exponentially: some reports predict that worldwide data will grow 61% by 2025, with a massive buildup of IP traffic driven mainly by IoT devices (2) (3). Despite this, there is concern that companies are less data-driven than ever before.

With more data available and lower costs for obtaining and processing it, what is stopping organizations from using technology to become more data-driven?

A useful comparison can be drawn between today’s data challenges and the problems the recycling industry faces every day.

The costs of producing something usable are extremely high due to the high cost of cleaning up the mixed mess of bagged plastic, cardboard, trash, and metals that is dumped into the recycling plant. The huge mistake here is assuming that a business can take whatever you give it and generate a profitable product with it.(4)

The evidence suggests that more precise and accurate information should facilitate greater use of information in decision making and, in the end, lead to higher performance. Technologies that enable the greater collection of information, or facilitate more efficient distribution of information within an organization, should lower costs and improve performance.

In this context, the data catalog emerges as an essential tool for ensuring that the materials used to assemble new data products are fit for their intended use, resulting in a high-quality final product.

Defining a data catalog

There are multiple definitions of a data catalog.

From a business point of view:

“A data catalog maintains an inventory of data assets through the discovery, description, and organization of datasets. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value.”(5)

Or a more technical perspective:

“A data catalog is defined as a collection of metadata, combined with data management and search tools that serves as an inventory of available data, and provides information to evaluate the data health for an intended use”.

The two definitions complement each other. Data should be treated as an asset that must be evaluated for fitness for an intended use. A given dataset may be the right input for one data product, yet lack the volume or quality needed for another.
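The idea of fitness for an intended use can be sketched in a few lines of Python. This is only an illustration: the `DatasetProfile` and `ProductRequirement` classes, their fields, and the thresholds are hypothetical and do not come from any particular catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProfile:
    """Quality facts a catalog might record about one dataset (hypothetical fields)."""
    name: str
    row_count: int
    completeness: float  # share of non-null values, between 0 and 1


@dataclass(frozen=True)
class ProductRequirement:
    """Minimum quality a data product needs from its input (hypothetical fields)."""
    product: str
    min_rows: int
    min_completeness: float


def is_fit_for_use(profile: DatasetProfile, req: ProductRequirement) -> bool:
    """Return True when the dataset meets every threshold of the requirement."""
    return (
        profile.row_count >= req.min_rows
        and profile.completeness >= req.min_completeness
    )


# The same dataset, evaluated against two different data products.
orders = DatasetProfile(name="orders", row_count=50_000, completeness=0.92)
weekly_report = ProductRequirement("weekly_sales_report", min_rows=10_000, min_completeness=0.90)
churn_model = ProductRequirement("churn_model", min_rows=1_000_000, min_completeness=0.99)

fit_for_report = is_fit_for_use(orders, weekly_report)  # both thresholds are met
fit_for_model = is_fit_for_use(orders, churn_model)     # too few rows and too many nulls
```

The point of the sketch is that fitness is a property of the dataset and the intended use together, not of the dataset alone.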

What is needed to have a data catalog?

As data in any organization is distributed across multiple sources, with varying degrees of accessibility, a centralized catalog that captures relevant metadata is essential. The catalog should capture:

Where the data is (e.g. database, excel files, blob storage)
What it means (e.g. column descriptions, data quality notes)
How to use it (e.g. suggested joins, sample queries)
Who produces, uses, or knows about it
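The four questions above could be modelled as a single catalog record. The sketch below is a hypothetical structure, not the schema of any real catalog; the `CatalogEntry` class, its field names, and the `missing_facets` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One dataset in the catalog, organized around the four questions (hypothetical schema)."""
    name: str
    # Where the data is
    location: str = ""
    # What it means
    description: str = ""
    column_descriptions: dict[str, str] = field(default_factory=dict)
    quality_notes: list[str] = field(default_factory=list)
    # How to use it
    suggested_joins: list[str] = field(default_factory=list)
    sample_queries: list[str] = field(default_factory=list)
    # Who produces, uses, or knows about it
    producers: list[str] = field(default_factory=list)
    consumers: list[str] = field(default_factory=list)
    experts: list[str] = field(default_factory=list)


def missing_facets(entry: CatalogEntry) -> list[str]:
    """List which of the four questions the entry does not answer yet."""
    missing = []
    if not entry.location:
        missing.append("where")
    if not (entry.description or entry.column_descriptions):
        missing.append("what")
    if not (entry.suggested_joins or entry.sample_queries):
        missing.append("how")
    if not (entry.producers or entry.consumers or entry.experts):
        missing.append("who")
    return missing


orders = CatalogEntry(
    name="orders",
    location="warehouse database, table sales.orders",
    description="One row per customer order.",
    column_descriptions={"order_id": "Unique order identifier"},
    producers=["etl_team"],
)
gaps_before = missing_facets(orders)  # no joins or sample queries recorded yet

orders.sample_queries.append("SELECT COUNT(*) FROM sales.orders")
gaps_after = missing_facets(orders)
```

A check like `missing_facets` is one simple way to test how complete the catalog is before analysts start relying on it.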

Many catalogs have started to use the transaction logs of source systems to discover who produces or uses specific datasets. This offers another way to find the people to ask about a dataset, and to learn whether and how heavily the ecosystem uses it, without depending on human intervention.(4)
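As a rough illustration of log-based discovery, the sketch below scans a list of `(user, sql)` pairs and records who writes to and who reads from each table. The log format is an assumption, and the regular expressions are deliberately naive (for example, they would count `DELETE FROM` as a read); a real catalog would rely on a proper SQL parser and on the source system's own log format.

```python
import re
from collections import defaultdict

# Naive patterns: table names that follow read or write keywords.
READ_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
WRITE_PATTERN = re.compile(r"\b(?:INSERT\s+INTO|UPDATE)\s+([\w.]+)", re.IGNORECASE)


def summarize_usage(log):
    """Map each table to its producers, its consumers, and the number of reads."""
    usage = defaultdict(lambda: {"producers": set(), "consumers": set(), "reads": 0})
    for user, sql in log:
        for table in WRITE_PATTERN.findall(sql):
            usage[table.lower()]["producers"].add(user)
        for table in READ_PATTERN.findall(sql):
            stats = usage[table.lower()]
            stats["consumers"].add(user)
            stats["reads"] += 1
    return dict(usage)


# Hypothetical query log entries: (user, SQL statement).
query_log = [
    ("etl_job", "INSERT INTO sales.orders SELECT * FROM staging.orders_raw"),
    ("ana", "SELECT * FROM sales.orders o JOIN sales.customers c ON o.customer_id = c.id"),
    ("bruno", "select count(*) from sales.orders"),
]

usage = summarize_usage(query_log)
```

Here `usage["sales.orders"]` ends up with `etl_job` as its producer and `ana` and `bruno` as its consumers, which is the kind of information a catalog can surface without asking anyone to fill in a form.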

Some examples of data catalogs are: Collibra, Waterline, Alation, Amundsen.

[Figure: example of how a data catalog such as Alation looks]

Why are data catalogs needed?

Without a catalog, analysts look for data by checking documentation, talking to colleagues, relying on people’s knowledge, or simply working with familiar datasets because they know about them.

Because metadata provides context and information, it lets the organization align on what the data means. The lack of agreement on the meaning of data is one of the reasons the creation of new and valuable data products slows down.

The data catalog serves as a bridge that links multiple data tools, databases and systems together, binding the data ecosystem. (7)

That’s because an enterprise data catalog is truly the foundation of data empowerment and not just a place to index all the information you have. Enterprise data catalogs unify your people, data, and analysis in a way that makes it easier to build a data-driven culture.(6)

With a data catalog, analysts can search and find data quickly, see all of the available datasets, make informed choices about which data to use, and perform data preparation and analysis efficiently and with confidence. It then becomes realistic to aim at flipping the typical 80/20 split in data science to 20/80, whereby 20% of an analyst’s time is spent finding and preparing data and 80% on analysis.(8)

How is a data catalog built?

Assess the metadata across all the organization’s databases to identify data tables, files and databases, then incorporate the metadata into the data catalog.
Pull descriptions of all data points into the data catalog and create profiles so data consumers can understand data.
Identify relationships between data across databases to create linkages within the data catalog that can make query results more robust.
Track data lineage to understand origin data and its transformations over time to its current state. This can help troubleshoot analytical errors.
Organize data through an intuitive system, using tagging and/or sorting by user type or usage frequency.
Implement data security measures, such as access controls and data de-identification to ensure the right users have access to the right information at the right time. (11)
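The first three steps above (harvesting metadata, profiling, and identifying relationships) can be sketched against a single SQLite database using Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions, not how any commercial catalog works: the `harvest_metadata` function, the shape of its output, and the sample tables are hypothetical.

```python
import sqlite3


def harvest_metadata(conn: sqlite3.Connection) -> dict:
    """Collect tables, columns, a basic profile, and foreign keys from a SQLite database."""
    catalog = {}
    names = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in names:
        if table.startswith("sqlite_"):
            continue  # skip SQLite's internal tables
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        profile = {}
        for _cid, col, col_type, _notnull, _default, _pk in columns:
            null_count = conn.execute(
                f'SELECT COUNT(*) FROM "{table}" WHERE "{col}" IS NULL'
            ).fetchone()[0]
            profile[col] = {"type": col_type, "null_count": null_count}
        # PRAGMA foreign_key_list yields (id, seq, table, from, to, ...) per key column.
        foreign_keys = [
            (fk[3], fk[2], fk[4])
            for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        ]
        catalog[table] = {
            "row_count": row_count,
            "columns": profile,
            "foreign_keys": foreign_keys,  # (column, referenced table, referenced column)
        }
    return catalog


# Hypothetical source database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), total REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, None)],
)
conn.execute("INSERT INTO orders (id, customer_id, total) VALUES (1, 1, 9.5)")

catalog = harvest_metadata(conn)
```

The declared foreign key gives the catalog a linkage between `orders` and `customers`, and the null counts give data consumers a first profile of each column. Lineage, tagging, and access controls need information that a database schema alone does not carry, so they are left out of this sketch.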

Benefits of a Data Catalog

Improved data efficiency
Improved data context
Reduced risk of error
Improved data analysis
A common business language

This list of benefits is described in detail in the following figure.

As you can see, the data catalog avoids the rework and wasted time caused by poor or nonexistent knowledge of the available data.

Corollary

Traditional database systems require the user to know the location of a data source’s documentation in order to understand its intended use. Often this information is not available.

A data catalog is self-documenting and the documentation resides side-by-side with the data it is documenting, not in a separate system. (10)

The greatest value, however, is often seen in the impact on analysis activities. Today’s business and data analysts are often working blind, without visibility into the datasets that exist, the contents of those datasets, and the quality and usefulness of each. They spend too much time finding and understanding data, often recreating datasets that already exist. They frequently work with inadequate datasets, which leads to inadequate or incorrect analysis, and the process then has to start over once the flawed data product has been delivered.

A data catalog enables data discovery and exploration for self-service analytics by providing a single source of reference and a simple way for data consumers to access the data they need.

A data catalog is also the place in the ecosystem where data quality needs to be documented.

Data quality is one of the challenges we currently face, because assessing it is a very time-consuming activity in the creation of data products.

Including it as part of data operations within the development lifecycle will increase the productivity of any company that intends to be data-driven.