Google Cloud Data Catalog hands-on guide: a mental model

Published in Google Cloud - Community, Jun 28, 2019

This quickstart guide is part of a series that brings a practitioner approach to Data Catalog, a recently announced member of Google Cloud’s Data Analytics services family.

Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, understand, and manage their data in Google Cloud.

Mental model

To provide some context about Data Catalog and help data citizens get up to speed with the service, let me describe my mental model of its core features. The model summarizes my learning path since I started using Data Catalog, and it will also be the basis for the next articles I’ll write for this series.

Disclaimer: this is my personal way of thinking as a Data Catalog early adopter, nothing more. The model is not based on any official or supported reference.

Basic concepts

Data Catalog is a centralized service, fully managed by Google Cloud, that keeps an optimized search index for data assets belonging to GCP projects. By data assets I mean datasets, tables, views, text/CSV files, spreadsheets, and data streams. To build its index, Data Catalog relies on each asset’s metadata, i.e. name, description, and column definitions.

It also stores metadata for assets managed by other GCP services so that users may get details about them using only Data Catalog’s UI or API. Metadata is stored/updated when assets are indexed for the first time, changed in their source systems, or tagged using Data Catalog.

Privacy and information security are first-class citizens for Data Catalog. Assets’ IAM roles and ACLs are considered before providing any information for a given user or service account.

Search Catalog

The first contact with Data Catalog usually happens through its search feature, which is powerful and simple to use. Please take a look at the image below:

A result set is returned when someone searches the Catalog. Keep in mind search results are just “summaries” of what Data Catalog knows about the indexed assets and each SearchResult has a small set of fields — most notably: searchResultType, searchResultSubtype, relativeResourceName, and linkedResource. Possible values for such fields are listed below:

search result types:

 ENTRY, TAG_TEMPLATE

search result subtypes:

entry.dataset, entry.table, entry.data_stream.topic, tag_template

relative resource names:

projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id> (ENTRY / entry.dataset or entry.table),
  
projects/<project-id>/locations/US/entryGroups/@pubsub/entries/<entry-id> (ENTRY / entry.data_stream.topic),
  
projects/<project-id>/locations/us-central1/tagTemplates/<tag-template-id> (TAG_TEMPLATE)

linked resources:

//bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id> (ENTRY / entry.dataset),

//bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id>/tables/<table-id> (ENTRY / entry.table),

//pubsub.googleapis.com/projects/<project-id>/topics/<topic-id> (ENTRY / entry.data_stream.topic),

//datacatalog.googleapis.com/projects/<project-id>/locations/us-central1/tagTemplates/<tag-template-id> (TAG_TEMPLATE)

You may notice from the above examples that search results are split into two major groups: ENTRY and TAG_TEMPLATE. Entries refer to data assets managed by other Google Cloud services. Data Catalog automatically indexes assets managed by BigQuery and Pub/Sub; please check the official docs for additional integrations. Tag Templates are native Data Catalog entities, and we’ll learn more about them later in this article.
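To make the two groups concrete, here is a minimal sketch that mirrors the four fields in a local data class and splits a result set by type. The `SearchResult` class and `group_by_type` helper are hypothetical names of mine (not the client library’s types), and the sample values are made up following the patterns listed above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical local mirror of the four fields discussed above."""
    search_result_type: str      # "ENTRY" or "TAG_TEMPLATE"
    search_result_subtype: str   # e.g. "entry.table"
    relative_resource_name: str  # Data Catalog's own identifier
    linked_resource: str         # pointer to the asset in its source system


def group_by_type(results):
    """Split results into the two major groups: ENTRY and TAG_TEMPLATE."""
    groups = defaultdict(list)
    for result in results:
        groups[result.search_result_type].append(result)
    return dict(groups)


# Made-up sample values that follow the patterns listed above.
results = [
    SearchResult(
        "ENTRY",
        "entry.table",
        "projects/my-project/locations/US/entryGroups/@bigquery/entries/abc123",
        "//bigquery.googleapis.com/projects/my-project/datasets/sales/tables/orders",
    ),
    SearchResult(
        "TAG_TEMPLATE",
        "tag_template",
        "projects/my-project/locations/us-central1/tagTemplates/pii",
        "//datacatalog.googleapis.com/projects/my-project/locations/us-central1/tagTemplates/pii",
    ),
]

grouped = group_by_type(results)
print({kind: len(items) for kind, items in grouped.items()})
# {'ENTRY': 1, 'TAG_TEMPLATE': 1}
```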

Pay attention to the relativeResourceName and linkedResource fields when the search result type is ENTRY. In this case, relativeResourceName is the identifier created for Data Catalog’s internal metadata record when its underlying data asset is added to the index, while linkedResource simply points to the data asset in its source system, acting as an external reference.

Quick tip for entries: relativeResourceName ends with a system-generated ID that is meaningless to humans, while linkedResource is largely human-readable and may be useful when analyzing search results.
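As an illustration of why linkedResource is handy for analysis, the sketch below breaks one into its service and path components. `parse_linked_resource` is a hypothetical helper of mine, and it assumes the `//<service>.googleapis.com/<collection>/<id>/...` shape seen in the examples above.

```python
def parse_linked_resource(linked_resource):
    """Break a linkedResource string into its service and path components.

    Assumes the '//<service>.googleapis.com/<collection>/<id>/...' shape
    shown in the examples above.
    """
    if not linked_resource.startswith("//"):
        raise ValueError(f"unexpected linkedResource: {linked_resource!r}")
    host, _, path = linked_resource[2:].partition("/")
    parts = path.split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"path is not made of collection/id pairs: {path!r}")
    info = {"service": host.split(".")[0]}
    # The path alternates collection names and ids: projects/<id>/datasets/<id>/...
    info.update(zip(parts[0::2], parts[1::2]))
    return info


parsed = parse_linked_resource(
    "//bigquery.googleapis.com/projects/my-project/datasets/sales/tables/orders"
)
print(parsed)
# {'service': 'bigquery', 'projects': 'my-project', 'datasets': 'sales', 'tables': 'orders'}
```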

Get Entry

To retrieve more information about a given data asset, you may perform a Get Entry operation. It receives a name parameter, which a SearchResult exposes as the relativeResourceName field. For each ENTRY result returned from a Search Catalog operation, there should be one and only one catalog Entry.
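The hop from a search result to Get Entry can be sketched as building the REST URL for the call. This only constructs the URL and sends nothing; a real call also needs authentication. The `v1` API root is my assumption (pick whichever version you actually use), and `get_entry_url` is a hypothetical helper name.

```python
API_ROOT = "https://datacatalog.googleapis.com/v1"  # assumed API version


def get_entry_url(search_result):
    """Build the Get Entry URL from a search result given as a JSON-like dict."""
    if search_result["searchResultType"] != "ENTRY":
        # Tag templates are not entries, so Get Entry does not apply to them.
        raise ValueError("only ENTRY results map to a catalog Entry")
    # The relativeResourceName is used as the 'name' of the entry to fetch.
    return f"{API_ROOT}/{search_result['relativeResourceName']}"


sample_result = {
    "searchResultType": "ENTRY",
    "searchResultSubtype": "entry.table",
    "relativeResourceName": (
        "projects/my-project/locations/US/entryGroups/@bigquery/entries/abc123"
    ),
}
entry_url = get_entry_url(sample_result)
print(entry_url)
```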

***Image 2.*** *Catalog, Search Result, and Entry relationships*

Entry, a native Data Catalog entity, represents an asset’s technical metadata, and its set of fields varies according to the asset’s type. This means the fields of an entry for a BigQuery date-sharded table will not be the same as those of an entry for a Pub/Sub topic, although some fields are common to all types. Another example: the schema field stores the column schema if an entry refers to a table, but it is not available in entries referring to datasets.
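In practice this means code reading entries should treat type-specific fields as optional. Here is a small sketch that lists column names when a schema is present and returns an empty list otherwise. The dicts are hand-written samples shaped like the JSON I would expect from the REST API (a `schema` holding `columns`, each with a `column` name and optional `subcolumns`); treat those key names as my assumption and check them against a real response.

```python
def _flatten(columns, prefix=""):
    """Collect column names depth-first, joining nested names with dots."""
    names = []
    for col in columns:
        name = prefix + col["column"]
        names.append(name)
        names.extend(_flatten(col.get("subcolumns", []), name + "."))
    return names


def column_names(entry):
    """Return flattened column names, or [] when the entry has no schema."""
    return _flatten(entry.get("schema", {}).get("columns", []))


# Hand-written sample entries, not real API output.
table_entry = {
    "linkedResource": "//bigquery.googleapis.com/projects/p/datasets/d/tables/t",
    "schema": {
        "columns": [
            {"column": "id", "type": "INT64"},
            {
                "column": "address",
                "type": "RECORD",
                "subcolumns": [{"column": "city", "type": "STRING"}],
            },
        ]
    },
}
dataset_entry = {"linkedResource": "//bigquery.googleapis.com/projects/p/datasets/d"}

print(column_names(table_entry))    # ['id', 'address', 'address.city']
print(column_names(dataset_entry))  # []
```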

Lookup Entry

If you need to find the catalog Entry associated with a data asset whose name you already know, you don’t need to search the catalog first. The Lookup Entry operation allows you to go from an asset’s name to its catalog entry in one step.

Recall the common pattern for GCP data asset resource names:

//bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id>/tables/<table-id>

Notice that these names follow a convention that makes assets uniquely identifiable, even across projects. The resource name is stored in the entry’s linkedResource field and is enough to retrieve a catalog entry.
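A minimal sketch of that one-step lookup: compose the resource name from the IDs you already know, then build the lookup URL. As before, nothing is sent and authentication is left out. The `v1` API root, the `entries:lookup` path with a `linkedResource` query parameter, and both helper names reflect my reading of the REST API, so verify them against the official reference.

```python
from urllib.parse import urlencode

API_ROOT = "https://datacatalog.googleapis.com/v1"  # assumed API version


def bigquery_table_resource(project_id, dataset_id, table_id):
    """Compose the resource name of a BigQuery table, following the pattern above."""
    return (
        f"//bigquery.googleapis.com/projects/{project_id}"
        f"/datasets/{dataset_id}/tables/{table_id}"
    )


def lookup_entry_url(linked_resource):
    """Build the Lookup Entry URL, percent-encoding the resource name."""
    return f"{API_ROOT}/entries:lookup?" + urlencode({"linkedResource": linked_resource})


resource = bigquery_table_resource("my-project", "sales", "orders")
lookup_url = lookup_entry_url(resource)
print(resource)
print(lookup_url)
```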

What we have seen so far is the very basic part of Data Catalog’s search and data discovery capabilities. Even for organizations that keep a large amount of data across their projects in Google Cloud, Data Catalog provides an easy and fast way to find out what kind of data is there, as well as how and where it’s stored.

Templates and Tags

Once people are able to discover data belonging to their organizations, they should also be allowed to better manage it. In this sense, Data Catalog comes with another feature — Tagging — that may be used to improve data governance, among other possibilities.

A Tag is a native Data Catalog entity that allows people and automated processes to attach additional metadata to any data asset indexed by the catalog. Tags also make it easier to find such data assets later using qualified search predicates. For example, Data Governance teams may use search capabilities to find tables storing sensitive data (emails, Social Security Numbers, and so on), and then tag them in order to make periodic security auditing more straightforward.
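The governance example could start from a search request like the one sketched below, which looks for tables with a column whose name suggests email addresses. The `type=table` and `column:` qualifiers and the `scope.includeProjectIds` field reflect my understanding of the search syntax and REST request body; treat them as assumptions to confirm in the docs. `sensitive_tables_search_request` is a hypothetical helper name.

```python
import json


def sensitive_tables_search_request(project_ids, column_keyword):
    """Build a Search Catalog request body for tables with a matching column name."""
    return {
        "scope": {"includeProjectIds": list(project_ids)},
        # Qualified predicates: restrict to tables, then match on column name.
        "query": f"type=table column:{column_keyword}",
    }


request_body = sensitive_tables_search_request(["my-project"], "email")
print(json.dumps(request_body, indent=2))
```

Each table found this way can then be tagged, so later audits can search by tag instead of guessing at column names.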

A Tag is attached to an Entry, as shown in the next picture:

***Image 3.*** Template and Tag related entities

Tagging is a flexible feature: a tag may have as many fields as required to get the data classification job done (actually, there’s a limit, but it’s pretty high). Fields may be of different types: boolean, double, string, timestamp, and custom enumerated values. To make creating tags easier and safer (from an IAM perspective), Data Catalog provides a templating mechanism. Each Tag must be created according to a user-defined TagTemplate.
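To show what "created according to a template" means, here is a conceptual model in plain Python. It is not the Data Catalog API: the `TagTemplate` class, the `validate_tag` helper, and the `pii` template are all hypothetical, and the real service enforces these rules on its side.

```python
from dataclasses import dataclass
from datetime import datetime

# Python types standing in for the primitive field types mentioned above.
PRIMITIVE_TYPES = {"bool": bool, "double": float, "string": str, "timestamp": datetime}


@dataclass
class TagTemplate:
    """Conceptual stand-in for a template (hypothetical, not the API type)."""
    template_id: str
    # Field id -> primitive type name, or a tuple of allowed enum values.
    fields: dict


def validate_tag(template, values):
    """Return a list of problems; an empty list means the tag fits the template."""
    problems = []
    for field_id, value in values.items():
        if field_id not in template.fields:
            problems.append(f"unknown field: {field_id}")
            continue
        spec = template.fields[field_id]
        if isinstance(spec, tuple):
            if value not in spec:
                problems.append(f"{field_id}: {value!r} is not one of {spec}")
        elif not isinstance(value, PRIMITIVE_TYPES[spec]):
            problems.append(f"{field_id}: expected {spec}")
    return problems


pii = TagTemplate(
    "pii",
    {"has_pii": "bool", "pii_type": ("EMAIL", "SSN", "NONE"), "reviewed_on": "timestamp"},
)

good_tag = {"has_pii": True, "pii_type": "EMAIL", "reviewed_on": datetime(2019, 6, 28)}
bad_tag = {"has_pii": "yes", "owner": "me"}

print(validate_tag(pii, good_tag))  # []
print(validate_tag(pii, bad_tag))
```

The template acts as a schema for tags, which is what keeps tags on different entries consistent and searchable.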

What Should I Read Next?

Well, this brief introduction to Data Catalog finishes here. Please check other articles from this series, listed below, and the official docs (see References section) to go further.