Google Cloud Data Catalog hands-on guide: a mental model
This quickstart guide is part of a series that brings a practitioner approach to Data Catalog, a recently announced member of Google Cloud’s Data Analytics services family.
Mental model
To provide some context about Data Catalog, and to help data citizens get up to speed with the service, let me describe my mental model of its core features. The model summarizes my learning path since I started using Data Catalog, and will also be the basis for the next articles I write for this series.
Disclaimer: this is my personal way of thinking, as a Data Catalog early adopter — only & simply this. The model is not based on any official/supported reference.
Basic concepts
Data Catalog is a centralized service, fully managed by Google Cloud, that keeps an optimized search index for data assets belonging to GCP projects. By data assets I mean: datasets, tables, views, text/CSV files, spreadsheets, and data streams. To build its index, Data Catalog relies on the assets’ metadata, i.e. name, description, and column definitions.
It also stores metadata for assets managed by other GCP services so that users may get details about them using only Data Catalog’s UI or API. Metadata is stored/updated when assets are indexed for the first time, changed in their source systems, or tagged using Data Catalog.
Privacy and information security are first-class citizens for Data Catalog. Assets’ IAM roles and ACLs are considered before providing any information for a given user or service account.
Search Catalog
The first contact with Data Catalog usually happens through its search feature, which is powerful and simple to use. Please take a look at the image below:
A result set is returned when someone searches the Catalog. Keep in mind search results are just “summaries” of what Data Catalog knows about the indexed assets and each SearchResult has a small set of fields, most notably: searchResultType, searchResultSubtype, relativeResourceName, and linkedResource. Possible values for such fields are listed below:
- search result types:
ENTRY, TAG_TEMPLATE
- search result subtypes:
entry.dataset, entry.table, entry.data_stream.topic, tag_template
- relative resource names:
projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id> (ENTRY / entry.dataset or entry.table),
projects/<project-id>/locations/US/entryGroups/@pubsub/entries/<entry-id> (ENTRY / entry.data_stream.topic),
projects/<project-id>/locations/us-central1/tagTemplates/<tag-template-id> (TAG_TEMPLATE)
- linked resources:
//bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id> (ENTRY / entry.dataset),
//bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id>/tables/<table-id> (ENTRY / entry.table),
//pubsub.googleapis.com/projects/<project-id>/topics/<topic-id> (ENTRY / entry.data_stream.topic),
//datacatalog.googleapis.com/projects/<project-id>/locations/us-central1/tagTemplates/<tag-template-id> (TAG_TEMPLATE)
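To make the shape of a result more concrete, here is a minimal Python sketch that models the four fields above as a plain dataclass and splits a relative resource name into its parts. The SearchResult class and the parse_relative_resource_name helper are hypothetical names I chose for illustration, not part of the Data Catalog client library, and the sample identifiers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """Illustrative stand-in for one item of a Search Catalog result set."""

    search_result_type: str
    search_result_subtype: str
    relative_resource_name: str
    linked_resource: str


def parse_relative_resource_name(name: str) -> dict:
    """Split 'projects/p/locations/l/...' into a {collection: id} mapping."""
    parts = name.split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"unexpected resource name: {name!r}")
    # Even positions hold collection names, odd positions hold identifiers.
    return dict(zip(parts[0::2], parts[1::2]))


# Made-up sample values that follow the patterns listed above.
result = SearchResult(
    search_result_type="ENTRY",
    search_result_subtype="entry.table",
    relative_resource_name=(
        "projects/my-project/locations/US/entryGroups/@bigquery/entries/abc123"
    ),
    linked_resource=(
        "//bigquery.googleapis.com/projects/my-project"
        "/datasets/sales/tables/orders"
    ),
)

parsed = parse_relative_resource_name(result.relative_resource_name)
print(parsed["entryGroups"], parsed["entries"])
```

The same helper works for tag template names, since they follow the same alternating collection/identifier layout.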
You may notice from the above examples that search results are split into 2 major groups: ENTRY and TAG_TEMPLATE. Entries refer to data assets managed by other Google Cloud services. Data Catalog automatically indexes assets managed by BigQuery and Pub/Sub; please check the official docs for additional integrations. Tag Templates are Data Catalog’s native entities, and we’ll learn more about them later in this article.
Pay attention to the relativeResourceName and linkedResource fields when the search result type is ENTRY. In this case, relativeResourceName is the identifier created for Data Catalog’s internal metadata record when its underlying data asset is added to the index, while linkedResource simply points to the data asset in its source system, acting as a kind of external reference.
Quick tip for entries: relativeResourceName ends with a system-generated id that looks meaningless to humans, while linkedResource is pretty much human-readable and may be useful when analyzing search results.
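Because linkedResource is human-readable, it is also easy to take apart in code. The sketch below is a small helper of my own (describe_linked_resource is a hypothetical name, not a library function) that extracts the owning service and the asset identifiers from a linked resource string; the sample value is made up.

```python
def describe_linked_resource(linked_resource: str) -> dict:
    """Break '//service/collection/id/...' into a readable mapping."""
    if not linked_resource.startswith("//"):
        raise ValueError(f"not a linked resource: {linked_resource!r}")
    # Everything up to the first '/' after the leading '//' is the service.
    service, _, path = linked_resource[2:].partition("/")
    parts = path.split("/")
    if not path or len(parts) % 2 != 0:
        raise ValueError(f"unexpected resource path: {path!r}")
    details = {"service": service}
    # The rest alternates between collection names and identifiers.
    details.update(zip(parts[0::2], parts[1::2]))
    return details


details = describe_linked_resource(
    "//bigquery.googleapis.com/projects/my-project/datasets/sales/tables/orders"
)
print(details["service"], details["datasets"], details["tables"])
```

A helper like this can be handy for grouping search results by project or dataset without making any further API calls.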
Get Entry
To retrieve more information regarding a given data asset, you may perform a Get Entry operation. It receives a name parameter, which is represented in a SearchResult by the relativeResourceName field. For each result returned from a Search Catalog operation, there should be one and only one catalog Entry.
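As a rough sketch of this search-then-get flow in Python: the helper below accepts any client object exposing a get_entry(name=...) method, which is how I understand the DataCatalogClient from the google-cloud-datacatalog library to work, and passes the result's relative resource name to it. The FakeClient class and the sample names are made up so the example runs without credentials; treat it as an illustration under those assumptions rather than official usage.

```python
from types import SimpleNamespace


class FakeClient:
    """Hypothetical in-memory stand-in for a Data Catalog client."""

    def __init__(self, entries_by_name):
        self._entries_by_name = entries_by_name

    def get_entry(self, name):
        return self._entries_by_name[name]


def get_entry_for_result(client, search_result):
    """Use the result's relative resource name as the Get Entry 'name'."""
    return client.get_entry(name=search_result.relative_resource_name)


# Made-up identifiers that follow the patterns shown earlier.
entry_name = (
    "projects/my-project/locations/US/entryGroups/@bigquery/entries/abc123"
)
table_resource = (
    "//bigquery.googleapis.com/projects/my-project/datasets/sales/tables/orders"
)

client = FakeClient(
    {entry_name: SimpleNamespace(name=entry_name, linked_resource=table_resource)}
)
search_result = SimpleNamespace(
    relative_resource_name=entry_name, linked_resource=table_resource
)

entry = get_entry_for_result(client, search_result)
print(entry.name)
```

With the real library installed and credentials configured, the fake would presumably be swapped for datacatalog_v1.DataCatalogClient(), and search_result would come from a Search Catalog call.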
Entry, a native Data Catalog entity, represents an asset’s technical metadata and contains a set of fields that varies according to its type. This means the fields of an entry for a BigQuery date-sharded table will not be the same as the ones representing a Pub/Sub topic, although some are common to all types. Another example: the schema field stores a table’s column schema if an entry refers to a table, but it is not available in entries referring to datasets.
Lookup Entry
In case you need to find the catalog Entry associated with a data asset whose name you already know, you don’t need to perform a catalog search first. The Lookup Entry operation allows you to go from an asset’s name to its catalog entry in one step.
Recall the common pattern for GCP data assets’ resource names:
//bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id>/tables/<table-id>
Notice they follow a convention that makes assets uniquely identifiable, even across projects. The resource name is stored in the entry’s linkedResource field and is enough to retrieve a catalog entry.
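Here is a minimal sketch of that one-step path in Python. It builds the resource name for a BigQuery table from the pattern above and hands it to a client's lookup_entry method as a request with a linked_resource key, which matches my understanding of recent versions of the google-cloud-datacatalog library. The helper names, the FakeClient class, and the identifiers are all hypothetical, chosen so the example runs offline.

```python
from types import SimpleNamespace


def bigquery_table_resource(project_id, dataset_id, table_id):
    """Build a BigQuery table resource name following the pattern above."""
    return (
        f"//bigquery.googleapis.com/projects/{project_id}"
        f"/datasets/{dataset_id}/tables/{table_id}"
    )


class FakeClient:
    """Hypothetical in-memory stand-in for a Data Catalog client."""

    def __init__(self, entries_by_linked_resource):
        self._entries = entries_by_linked_resource

    def lookup_entry(self, request):
        return self._entries[request["linked_resource"]]


def lookup_table_entry(client, project_id, dataset_id, table_id):
    """Go from a table's identifiers to its catalog entry in one call."""
    linked_resource = bigquery_table_resource(project_id, dataset_id, table_id)
    return client.lookup_entry(request={"linked_resource": linked_resource})


resource = bigquery_table_resource("my-project", "sales", "orders")
client = FakeClient(
    {
        resource: SimpleNamespace(
            name="projects/my-project/locations/US"
            "/entryGroups/@bigquery/entries/abc123",
            linked_resource=resource,
        )
    }
)

entry = lookup_table_entry(client, "my-project", "sales", "orders")
print(entry.name)
```

No search query is involved: the asset's own identifiers are all the caller needs.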
What we have seen so far is the very basic part of Data Catalog’s search and data discovery capabilities. Even for organizations that keep a large amount of data across their projects in Google Cloud, Data Catalog provides an easy and fast way to find out what kind of data is there, as well as how and where it’s stored.
Templates and Tags
Once people are able to discover data belonging to their organizations, they should also be allowed to better manage it. In this sense, Data Catalog comes with another feature — Tagging — that may be used to improve data governance, among other possibilities.
A Tag is a native Data Catalog entity that allows people and automated processes to attach additional metadata to any data asset indexed by the catalog, which also makes it easier to find such data assets in the future using qualified search predicates. For example, Data Governance teams may use search capabilities to find tables storing sensitive data (emails, Social Security Numbers, and so on), and then tag them in order to make periodic security auditing more straightforward.
A Tag is attached to an Entry, as shown in the next picture:
Tagging is a flexible feature: tags may be composed of as many fields as required to get the data classification job done (actually, there’s a limit, but it’s pretty high), and fields can be of different types: boolean, double, string, timestamp, and custom enumerated values. To make creating tags easier and safer (from an IAM perspective), Data Catalog provides a templating mechanism. Each Tag must be created according to a user-defined TagTemplate.
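To illustrate the idea that a template constrains what a tag may contain, here is a small, library-free Python sketch. It is only my own toy model of the concept, not how Data Catalog implements it: a template maps field ids to either a primitive type label or a list of allowed enumerated values, and validate_tag (a hypothetical helper) reports every field that does not fit. The field ids and values are made up.

```python
from datetime import datetime

# Primitive field types named in the text, mapped to Python types.
PRIMITIVE_TYPES = {
    "BOOL": bool,
    "DOUBLE": float,
    "STRING": str,
    "TIMESTAMP": datetime,
}


def validate_tag(template: dict, tag: dict) -> list:
    """Return a list of problems found when checking a tag against a template."""
    errors = []
    for field_id, value in tag.items():
        if field_id not in template:
            errors.append(f"{field_id}: not defined in the template")
            continue
        field_type = template[field_id]
        if isinstance(field_type, str):
            # Primitive field: the value must have the matching Python type.
            if not isinstance(value, PRIMITIVE_TYPES[field_type]):
                errors.append(f"{field_id}: expected {field_type}")
        elif value not in field_type:
            # Enumerated field: the value must be one of the allowed names.
            errors.append(f"{field_id}: {value!r} is not an allowed value")
    return errors


# A made-up template for classifying sensitive data.
template = {
    "has_pii": "BOOL",
    "owner_email": "STRING",
    "pii_type": ["EMAIL", "SOCIAL_SECURITY_NUMBER", "NONE"],
}

good_tag = {"has_pii": True, "pii_type": "EMAIL"}
bad_tag = {"has_pii": "yes", "pii_type": "PHONE", "retention_days": 30.0}

print(validate_tag(template, good_tag))
print(validate_tag(template, bad_tag))
```

Keeping the allowed fields in one shared template is what lets many people and processes tag assets consistently.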
What Should I Read Next?
Well, this brief introduction to Data Catalog finishes here. Please check other articles from this series, listed below, and the official docs (see References section) to go further.
- Data Catalog hands-on guide: search, get & lookup with Python: https://medium.com/google-cloud/data-catalog-hands-on-guide-search-get-lookup-with-python-82d99bfb4056
- Data Catalog hands-on guide: templates & tags with Python: https://medium.com/google-cloud/data-catalog-hands-on-guide-templates-tags-with-python-c45eb93372ef
Hope it helps!
The PlantUML files used to generate the class diagrams above are available on GitHub: https://github.com/ricardolsmendes/gcp-datacatalog-diagrams.
References
- Data Catalog official website: https://cloud.google.com/data-catalog
- Data Catalog overview: https://cloud.google.com/data-catalog/docs/concepts/introduction-data-catalog
- Getting started with Data Catalog: https://cloud.google.com/data-catalog/docs/quickstarts/quickstart-search-tag