Google Cloud Data Catalog hands-on guide: search, get & lookup with Python
This quickstart guide is part of a series that brings a practitioner approach to Data Catalog, a recently announced member of Google Cloud’s Data Analytics services family.
The content below shows how to use Data Catalog’s search, get, and lookup entry features with the Python GRPC client library. If you need a bit of context before getting into practice, please take a look at the article I wrote describing my mental model of these features.
Environment setup
Create sample data assets
In case you don’t have data assets that can be used in the practical exercises, please create a BigQuery dataset with two tables, similar to what is displayed in the image below.
The first table may have only one field; give it a name according to your preference. Create two fields in the second table, setting their names to name and email. There’s no need to insert data.
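If you prefer scripting over the console, the sample assets can also be created with the google-cloud-bigquery library. The sketch below is my own helper, not part of this guide’s repository; it assumes the library is installed and the authenticated account can create BigQuery resources (the field names match the ones used throughout this guide):

```python
def create_sample_assets(project_id, dataset_id='datacatalog_quickstart'):
    """Create the sample dataset and two tables used in this guide.

    Sketch only: assumes google-cloud-bigquery is installed and the
    account has permission to create datasets/tables in the project.
    """
    from google.cloud import bigquery  # imported lazily inside the helper

    client = bigquery.Client(project=project_id)
    dataset = client.create_dataset(dataset_id)

    # table_1: a single field with an arbitrary name.
    client.create_table(bigquery.Table(
        dataset.table('table_1'),
        schema=[bigquery.SchemaField('field_1', 'STRING')]))

    # table_2: the "name" and "email" fields mentioned above.
    client.create_table(bigquery.Table(
        dataset.table('table_2'),
        schema=[bigquery.SchemaField('name', 'STRING', mode='REQUIRED'),
                bigquery.SchemaField('email', 'STRING', mode='REQUIRED')]))
```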
I’ll refer to this fictitious data structure in the next section, but you will notice it’s easy to adapt the samples to actual data.
Enable Data Catalog’s API
Open your GCP Project’s API Library: https://console.cloud.google.com/apis/library?project=<project-id>. Type Data Catalog in the search box and you’ll see something similar to the picture below:
Click on Google Cloud Data Catalog API. On the next page, click the ENABLE button.
Install the client library
To install the GRPC/idiomatic client library, use pip install --upgrade google-cloud-datacatalog, or add google-cloud-datacatalog to requirements.txt and then run pip install --upgrade -r requirements.txt.
Import dependencies
In the Python code, import Data Catalog’s package:
from google.cloud import datacatalog
Initialize the API client
The API client needs to be authenticated with a service account to make requests. One option is to provide auth credentials through the GOOGLE_APPLICATION_CREDENTIALS environment variable. In this case, initializing the client object is as simple as:
datacatalog_client = datacatalog.DataCatalogClient()
Another option consists of explicitly passing a service account filename to the constructor:
datacatalog_client = datacatalog.DataCatalogClient.from_service_account_file(filename)
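Both options can be combined in a small helper of my own (the function name is illustrative, not part of the library): pass a service account file when you have one, and fall back to the environment-based default credentials otherwise. A hedged sketch:

```python
import os

def make_datacatalog_client(service_account_file=None):
    """Return a DataCatalogClient, preferring an explicit service account file.

    Falls back to default credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS)
    when no file is provided. Sketch only; the helper name is illustrative.
    """
    # Fail fast on an invalid path before touching the client library.
    if service_account_file and not os.path.exists(service_account_file):
        raise FileNotFoundError(service_account_file)

    from google.cloud import datacatalog  # lazy import

    if service_account_file:
        return datacatalog.DataCatalogClient.from_service_account_file(
            service_account_file)
    return datacatalog.DataCatalogClient()
```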
Search, get entry & lookup using Python
Search Catalog
Let’s start by looking, inside your organization, for all BigQuery datasets the service account has at least read access to.
(1)
scope = datacatalog.SearchCatalogRequest.Scope()
(2)
scope.include_org_ids.append(<organization-id>)
(3)
results = datacatalog_client.search_catalog(scope=scope, query='system=bigquery type=dataset')
The first line creates a search Scope object. The second uses it to tell Data Catalog we want to search at the organization level. To be more restrictive, scope.include_project_ids could be used instead of scope.include_org_ids; in that case, search would be scoped only to the provided projects. Note that both fields are lists, and at least one of them must be provided in any search request. The third line makes the API request and stores the response in the results variable.
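The scope-building logic can be wrapped in a small helper that enforces the “at least one of them” rule up front. A sketch of my own (the function name is illustrative):

```python
def make_search_scope(org_ids=(), project_ids=()):
    """Build a search Scope from organization and/or project IDs.

    The search API requires at least one of the two lists to be non-empty,
    so this helper fails fast when both are missing. Sketch only.
    """
    if not (org_ids or project_ids):
        raise ValueError('Provide at least one organization or project ID.')

    from google.cloud import datacatalog  # lazy import

    scope = datacatalog.SearchCatalogRequest.Scope()
    scope.include_org_ids.extend(org_ids)
    scope.include_project_ids.extend(project_ids)
    return scope
```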
There’s a trick here: search_catalog() returns a GRPC iterator that holds a kind of “remote references collection” pointing to the search results. The actual data is fetched only when you iterate over it, as follows:
(4.1)
fetched_results = [result for result in results]

OR

(4.2)
fetched_results = []
for page in results.pages:
    fetched_results.extend(page)
    # Extra processing might be added here.

(5)
print(fetched_results)
4.1 is a straightforward way to fetch all results, since the Iterator transparently handles pagination for you; 4.2 is a bit more verbose, iterating over one page at a time. The second approach allows you to add extra logic to the processing, such as checking whether more data needs to be fetched when iterating over large result sets, or adding a delay between page requests to keep API usage within quota.
If you created a dataset as suggested in the Environment setup | Create sample data assets section, you should see the content below among the search results:
search_result_type: ENTRY
search_result_subtype: "entry.dataset"
relative_resource_name: "projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>"
linked_resource: "//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart"
To get the expected results, the service account needs at least Data Catalog Viewer and BigQuery Data Viewer IAM roles. As Data Catalog automatically indexes assets managed by BigQuery and Pub/Sub, granting it the Pub/Sub Viewer role is also recommended for a better experience, although not required to follow this guide’s exercises.
Depending on the number of results returned by the first search, it may be difficult to analyze them. Also, dataset names may not tell us meaningful information about their content. We can use another search qualifier and look for more specific sensitive data. Searching for assets containing columns with the word “email” in their metadata, which includes name and description, might be a good starting point:
datacatalog_client.search_catalog(scope=scope, query='column:email')
WOW! There’s a table with “email” information:
search_result_type: ENTRY
search_result_subtype: "entry.table"
relative_resource_name: "projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>"
linked_resource: "//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart/tables/table_2"
Let’s get more details…
Get Entry
Get Entry allows us to view the catalog Entry associated with the suspicious search result. As it refers to a table, we have good chances of finding useful information, including its column schema. As mentioned in the Data Catalog hands-on guide: a mental model article, relative_resource_name is the link between a search result and its underlying catalog Entry.
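Since relative_resource_name follows the fixed format shown in the search results, it can also be split into its components with a regular expression. A sketch of my own, based only on the name format shown above:

```python
import re

_ENTRY_NAME = re.compile(
    r'projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)'
    r'/entryGroups/(?P<entry_group>[^/]+)/entries/(?P<entry>[^/]+)')

def parse_entry_name(relative_resource_name):
    """Split an entry's relative resource name into its components.

    Returns a dict with project, location, entry_group and entry keys,
    or None when the name does not match the expected format. Sketch only.
    """
    match = _ENTRY_NAME.fullmatch(relative_resource_name)
    return match.groupdict() if match else None
```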
How does it work?
datacatalog_client.get_entry(name='projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>')
And the expected result is:
name: "projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>"
type: TABLE
schema {
columns {
type: "STRING"
mode: "REQUIRED"
column: "name"
}
columns {
type: "STRING"
mode: "REQUIRED"
column: "email"
}
}
source_system_timestamps {
create_time {
seconds: 1561338828
nanos: 211000000
}
update_time {
seconds: 1561338828
nanos: 335000000
}
}
linked_resource: "//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart/tables/table_2"
bigquery_table_spec {
table_source_type: BIGQUERY_TABLE
}
Interesting! There’s an “email” column in datacatalog_quickstart.table_2
.
Thanks, Data Catalog! We wouldn’t guess it just by reading the dataset and table names. Spoiler: this amazing finding will be useful when demonstrating how to attach tags :).
Lookup Entry
In the previous example, we got the catalog Entry in two steps: we identified a suspicious search result and then requested its associated entry from the relative_resource_name. There will probably be cases in which you know (or guess…) the asset’s resource name before performing a catalog search, but still need to get its catalog entry: for example, to attach Tags to a well-known data asset belonging to one of your most important projects.
Coming back to our fictitious scenario: since we found a table_2 table with an “email” column, let’s also check for the existence of a table_1 in the same dataset. If it exists, we can analyze its column metadata.
Replacing the table’s name in the already known linked_resource value allows us to understand Lookup Entry usage:
request = datacatalog.LookupEntryRequest()
request.linked_resource = '//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart/tables/table_1'
datacatalog_client.lookup_entry(request=request)
And the expected result:
name: "projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>"
type: TABLE
source_system_timestamps {
create_time {
seconds: 1561346951
nanos: 38000000
}
update_time {
seconds: 1561346951
nanos: 86000000
}
}
linked_resource: "//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart/tables/table_1"
bigquery_table_spec {
table_source_type: BIGQUERY_TABLE
}
Well, the catalog Entry shows us the dataset also contains a table_1, but it has no columns (I created it empty when setting up my demo environment). We will use this finding to demonstrate tagging in the Data Catalog hands-on guide: templates & tags with Python article.
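The linked resource names used in lookups follow the pattern //bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id>/tables/<table-id>, so a tiny helper can build them before calling lookup_entry(). A sketch based on the values seen in the results above:

```python
def bigquery_table_linked_resource(project_id, dataset_id, table_id):
    """Build the linked resource name for a BigQuery table, following the
    pattern seen in the search and lookup results above. Sketch only."""
    return ('//bigquery.googleapis.com'
            '/projects/{}/datasets/{}/tables/{}').format(
                project_id, dataset_id, table_id)
```

For example, bigquery_table_linked_resource('my-project', 'datacatalog_quickstart', 'table_1') produces a value in the same shape as the one assigned to request.linked_resource above.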
Conclusion
This article explained how to use 3 core Data Catalog features with the Python GRPC client library: search catalog, get entry, and lookup entry. Although demonstrated with simple, fictitious examples, the concepts are the basis for real scenarios, where much more data is expected to be found, analyzed, and understood with these helpful Data Catalog features.
There are more core features to learn if you are new to Data Catalog. We haven’t yet seen how it can help teams manage data: templates and tags are explored in the next article. Keep reading!
The source files used to generate the above code samples are available on GitHub: https://github.com/ricardolsmendes/gcp-datacatalog-python.
References
- Data Catalog getting started guide: https://cloud.google.com/data-catalog/docs/quickstarts/quickstart-search-tag
- Data Catalog search syntax: https://cloud.google.com/data-catalog/docs/how-to/search-reference
- Python Client for Google Cloud Data Catalog API docs: https://googleapis.github.io/google-cloud-python/latest/datacatalog/index.html
Changelog
2020–10–03: Updated the code snippets to enforce compliance with version 2.0.0 of the client library.