This quickstart guide is part of a series that aims to bring a practitioner approach to Data Catalog, a recently announced member of Google Cloud’s Big Data services family.
To provide some context about Data Catalog and help data citizens get up to speed with the service, let me describe my mental model of its core features. The model summarizes my learning path since I started using Data Catalog, and it will also serve as the basis for the next articles I'll write in this series.
Disclaimer: this is my personal way of thinking, as a Data Catalog early adopter — only & simply this. The model is not based on any official/supported reference.
Data Catalog is, in essence, a centralized service, fully managed by Google Cloud, that keeps an optimized search index of data assets belonging to GCP projects. By data assets I mean datasets, tables, views, text/CSV files, spreadsheets, and data streams. To build its index, Data Catalog relies on the assets' metadata: name, description, and column definitions.
It also stores metadata for assets managed by other GCP services so that users may get details about them using only Data Catalog’s UI or API. Metadata is stored/updated when assets are indexed for the first time, changed in their source systems, or tagged using Data Catalog.
Privacy and information security are first-class citizens for Data Catalog. Assets’ IAM roles and ACLs are considered before providing any information for a given user or service account.
The first contact with Data Catalog usually happens through its search feature, which is powerful and simple to use. Please take a look at the image below:
A result set is returned when someone searches the Catalog. Keep in mind that search results are just "summaries" of what Data Catalog knows about the indexed assets; each SearchResult has a small set of fields, most notably searchResultType, searchResultSubtype, relativeResourceName, and linkedResource. Possible values for such fields are listed below:
- search result types: ENTRY, TAG_TEMPLATE
- search result subtypes: entry.dataset, entry.table, entry.data_stream.topic, tag_template
- relative resource names:
projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id> (ENTRY / entry.dataset or entry.table),
projects/<project-id>/locations/US/entryGroups/@pubsub/entries/<entry-id> (ENTRY / entry.data_stream.topic),
projects/<project-id>/locations/us-central1/tagTemplates/<tag-template-id> (TAG_TEMPLATE)
- linked resources:
//bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id> (ENTRY / entry.dataset),
//bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id>/tables/<table-id> (ENTRY / entry.table),
//pubsub.googleapis.com/projects/<project-id>/topics/<topic-id> (ENTRY / entry.data_stream.topic),
//datacatalog.googleapis.com/projects/<project-id>/locations/us-central1/tagTemplates/<tag-template-id> (TAG_TEMPLATE)
You may notice from the above examples that search results are split into two major groups: ENTRY and TAG_TEMPLATE. Entries refer to data assets managed by other Google Cloud services; Data Catalog automatically indexes assets managed by BigQuery and Pub/Sub, and additional integrations are expected over time. Tag Templates refer to Data Catalog's native entities; we'll learn more about them later in this article.
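As a plain-Python illustration of this split (the project, dataset, and topic ids below are made up, and the real classification is of course done by Data Catalog's Search API, not by prefix matching on the client side), the group a result belongs to can be read off the linkedResource prefix:

```python
# Hypothetical linkedResource values, shaped like the examples above.
RESULTS = [
    "//bigquery.googleapis.com/projects/my-project/datasets/sales",
    "//bigquery.googleapis.com/projects/my-project/datasets/sales/tables/orders",
    "//pubsub.googleapis.com/projects/my-project/topics/clickstream",
    "//datacatalog.googleapis.com/projects/my-project/locations/us-central1"
    "/tagTemplates/pii_template",
]

def classify(linked_resource: str) -> str:
    """Derive the search result type from a linkedResource prefix."""
    if linked_resource.startswith("//datacatalog.googleapis.com/"):
        return "TAG_TEMPLATE"  # native Data Catalog entity
    return "ENTRY"  # asset managed by another GCP service

types = [classify(r) for r in RESULTS]
print(types)  # ['ENTRY', 'ENTRY', 'ENTRY', 'TAG_TEMPLATE']
```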
Pay attention to the relativeResourceName and linkedResource fields when the search result type is ENTRY. In this case, relativeResourceName is the identifier created for Data Catalog's internal metadata record when its underlying data asset is added to the index, while linkedResource simply points to the data asset in its source system, working as a kind of external reference.
Quick tip for entries: relativeResourceName ends with a system-generated id that looks meaningless when read by humans, while linkedResource is pretty much human-readable and may be useful when analyzing search results.
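To see why the tip holds, compare the last path segment of each field. The values below are invented for illustration (the entry id in particular is just a made-up opaque string):

```python
# Hypothetical identifiers for the same BigQuery table.
relative_resource_name = (
    "projects/my-project/locations/US/entryGroups/@bigquery"
    "/entries/cHJvamVjdHM6bXktcHJvamVjdA"  # opaque, system-generated id
)
linked_resource = (
    "//bigquery.googleapis.com/projects/my-project"
    "/datasets/sales/tables/orders"  # ends with the human-readable table id
)

entry_id = relative_resource_name.rsplit("/", 1)[-1]
table_id = linked_resource.rsplit("/", 1)[-1]
print(entry_id)  # cHJvamVjdHM6bXktcHJvamVjdA
print(table_id)  # orders
```

The table id is immediately recognizable; the entry id tells a human reader nothing.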
To retrieve more information regarding a given data asset, you may perform a Get Entry operation. It receives a name parameter, which is represented in a SearchResult by the relativeResourceName field. For each result returned from a Search Catalog operation, there should be one and only one catalog Entry.

Entry, a native Data Catalog entity, represents an asset's technical metadata and contains a fieldset that varies according to its type. This means the fields of an entry related to a BigQuery date-sharded table will not be the same as the ones representing a Pub/Sub topic, although some are common to all types. Another example: the schema field stores a table's column schema if an entry refers to a table, but it is not available in entries referring to datasets.
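The type-dependent fieldset can be pictured with plain dictionaries. The shapes below are illustrative only, not the API's exact message fields:

```python
# Illustrative entries: only table-like entries carry a schema.
table_entry = {
    "type": "TABLE",
    "linked_resource": "//bigquery.googleapis.com/projects/p/datasets/d/tables/t",
    "schema": {"columns": [{"name": "email", "type": "STRING"}]},
}
dataset_entry = {
    "type": "DATASET",
    "linked_resource": "//bigquery.googleapis.com/projects/p/datasets/d",
    # no "schema" key: datasets have no column schema
}

print("schema" in table_entry)    # True
print("schema" in dataset_entry)  # False
```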
In case you need to find the catalog Entry associated with a data asset whose name you already know, you don't need to perform a catalog search first. The Lookup Entry operation allows you to go from an asset's name to its catalog entry in one step.
Recall the common pattern for GCP data assets' resource names: they follow a convention that makes assets uniquely identifiable, even across projects. The resource name is stored in the entry's linkedResource field and is enough to retrieve a catalog entry.
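Because the convention is deterministic, you can assemble a linked resource name from the ids you already know and hand it to Lookup Entry. A minimal sketch with hypothetical ids (the helper function below is my own, not part of any client library):

```python
def bigquery_table_resource(project_id: str, dataset_id: str,
                            table_id: str) -> str:
    """Build the linked resource name for a BigQuery table,
    following the //bigquery.googleapis.com/... convention."""
    return (f"//bigquery.googleapis.com/projects/{project_id}"
            f"/datasets/{dataset_id}/tables/{table_id}")

name = bigquery_table_resource("my-project", "sales", "orders")
print(name)
# //bigquery.googleapis.com/projects/my-project/datasets/sales/tables/orders
```

With the google-cloud-datacatalog client library, a string like this is what you would pass as the linked_resource argument of the Lookup Entry operation.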
What we have seen so far is the very basic part of Data Catalog’s search and data discovery capabilities. Even for organizations that keep a large amount of data across their projects in Google Cloud, Data Catalog provides an easy and fast way to find out what kind of data is there, as well as how and where it’s stored.
Templates and Tags
Once people are able to discover data belonging to their organizations, they should also be allowed to better manage it. In this sense, Data Catalog comes with another feature — Tagging — that may be used to improve data governance, among other possibilities.
A Tag is a Data Catalog native entity that allows people and automated processes to attach additional metadata to any data asset indexed by the catalog, also making it easier to find such data assets in the future using qualified search predicates. For example, Data Governance teams may use the search capabilities to find tables storing sensitive data (emails, Social Security Numbers, and so on), and then tag them in order to make periodic security auditing more straightforward.
A Tag is attached to an Entry, as shown in the next picture:
Tagging is a flexible feature: tags may be composed of as many fields as required to get the data classification job done (actually, there's a limit, but it's pretty high), of different types: boolean, double, string, timestamp, and custom enumerated values. To make creating tags easier and safer (from an IAM perspective), Data Catalog provides a templating mechanism: each Tag must be created according to a user-defined Tag Template.
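The templating mechanism can be sketched as a typed field set that every tag must conform to. This is a plain-Python illustration of the idea, not Data Catalog's actual API, and the template and field names are invented:

```python
# A hypothetical tag template: field name -> allowed type
# (a set of strings stands in for an enumerated type).
PII_TEMPLATE = {
    "has_pii": bool,
    "row_count": float,  # Data Catalog's DOUBLE
    "data_owner": str,
    "classification": {"PUBLIC", "INTERNAL", "RESTRICTED"},
}

def validate_tag(tag: dict, template: dict) -> bool:
    """Check that a tag only uses template fields with the declared types."""
    for field, value in tag.items():
        allowed = template.get(field)
        if allowed is None:
            return False  # field not declared in the template
        if isinstance(allowed, set):
            if value not in allowed:
                return False  # value outside the enumerated options
        elif not isinstance(value, allowed):
            return False  # wrong primitive type
    return True

print(validate_tag({"has_pii": True, "data_owner": "governance-team"},
                   PII_TEMPLATE))  # True
print(validate_tag({"foo": 1}, PII_TEMPLATE))  # False
```

The IAM benefit mentioned above comes from the fact that permissions to manage templates and permissions to attach tags based on them can be granted separately.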
Well, this brief introduction to Data Catalog ends here. Please check the other articles in this series, listed below, and the official docs (see the References section) to go further.
- Data Catalog hands-on guide: search, get & lookup with Python: https://medium.com/google-cloud/data-catalog-hands-on-guide-search-get-lookup-with-python-82d99bfb4056
- Data Catalog hands-on guide: templates & tags with Python: https://medium.com/google-cloud/data-catalog-hands-on-guide-templates-tags-with-python-c45eb93372ef
Hope it helps!
The PlantUML files used to generate the class diagrams above are available on GitHub: https://github.com/ricardolsmendes/gcp-datacatalog-diagrams.
References

- Data Catalog official website: https://cloud.google.com/data-catalog
- Data Catalog overview: https://cloud.google.com/data-catalog/docs/concepts/introduction-data-catalog
- Getting started with Data Catalog: https://cloud.google.com/data-catalog/docs/quickstarts/quickstart-search-tag