Google Cloud Data Catalog hands-on guide: search, get & lookup with Python
This quickstart guide is part of a series that brings a practitioner approach to Data Catalog, a recently announced member of Google Cloud’s Data Analytics services family.
The content below shows how to use Data Catalog’s search, get, and lookup entry features with the Python GRPC client library. If you need a bit of context before getting into practice, please take a look at the article I wrote describing my mental model of these features.
Environment setup
Create sample data assets
In case you don’t have data assets that can be used in the practical exercises, please create a BigQuery dataset with two tables, similar to what is displayed in the image below.
The first table may have only one field; give it a name according to your preference. Create two fields in the second table, setting their names to name and email. There’s no need to insert data.
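If you prefer scripting over the console, the sample assets can also be created with the google-cloud-bigquery library. The sketch below is my own helper, not part of this guide’s repository; it assumes the library is installed and the authenticated account can create BigQuery resources (the field names match the ones used throughout this guide):

```python
def create_sample_assets(project_id, dataset_id='datacatalog_quickstart'):
    """Create the sample dataset and two tables used in this guide.

    Sketch only: assumes google-cloud-bigquery is installed and the
    account has permission to create datasets/tables in the project.
    """
    from google.cloud import bigquery  # imported lazily inside the helper

    client = bigquery.Client(project=project_id)
    dataset = client.create_dataset(dataset_id)

    # table_1: a single field with an arbitrary name.
    client.create_table(bigquery.Table(
        dataset.table('table_1'),
        schema=[bigquery.SchemaField('field_1', 'STRING')]))

    # table_2: the "name" and "email" fields mentioned above.
    client.create_table(bigquery.Table(
        dataset.table('table_2'),
        schema=[bigquery.SchemaField('name', 'STRING', mode='REQUIRED'),
                bigquery.SchemaField('email', 'STRING', mode='REQUIRED')]))
```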
I’ll refer to this fictitious data structure in the next section, but you will notice it’s easy to adapt the samples to actual data.
Enable Data Catalog’s API
Open your GCP Project’s API Library: https://console.cloud.google.com/apis/library?project=<project-id>. Type Data Catalog in the search box and you’ll see something similar to the picture below:
Click on Google Cloud Data Catalog API. On the next page, click the ENABLE button.
Install the client library
To install the GRPC/idiomatic client library, use pip install --upgrade google-cloud-datacatalog, or add google-cloud-datacatalog to requirements.txt and then run pip install --upgrade -r requirements.txt.
Import dependencies
In the Python code, import Data Catalog’s package:
from google.cloud import datacatalog
Initialize the API client
The API client needs to be authenticated with a service account to make requests. One option is to provide auth credentials through the GOOGLE_APPLICATION_CREDENTIALS environment variable. In this case, initializing the client object is as simple as:
datacatalog_client = datacatalog.DataCatalogClient()
Another option consists of explicitly passing a service account filename to the constructor:
datacatalog_client = datacatalog.DataCatalogClient.from_service_account_file(filename)
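Both options can be combined in a small helper of my own (the function name is illustrative, not part of the library): pass a service account file when you have one, and fall back to the environment-based default credentials otherwise. A hedged sketch:

```python
import os

def make_datacatalog_client(service_account_file=None):
    """Return a DataCatalogClient, preferring an explicit service account file.

    Falls back to default credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS)
    when no file is provided. Sketch only; the helper name is illustrative.
    """
    # Fail fast on an invalid path before touching the client library.
    if service_account_file and not os.path.exists(service_account_file):
        raise FileNotFoundError(service_account_file)

    from google.cloud import datacatalog  # lazy import

    if service_account_file:
        return datacatalog.DataCatalogClient.from_service_account_file(
            service_account_file)
    return datacatalog.DataCatalogClient()
```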
Search, get entry & lookup using Python
Search Catalog
Let’s start by looking, inside your organization, for all BigQuery datasets the service account has at least read access to.
(1)
scope = datacatalog.SearchCatalogRequest.Scope()
(2)
scope.include_org_ids.append(<organization-id>)
(3)
results = datacatalog_client.search_catalog(scope=scope, query='system=bigquery type=dataset')
The first line creates a search Scope object. The second uses it to tell Data Catalog we want to search at the organization level. To be more restrictive, scope.include_project_ids could be used instead of scope.include_org_ids; in that case, search would be scoped only to the provided projects. Note that both fields are lists, and at least one of them must be provided in any search request. The third line makes the API request and stores the response in the results variable.
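The scope-building logic can be wrapped in a small helper that enforces the “at least one of them” rule up front. A sketch of my own (the function name is illustrative):

```python
def make_search_scope(org_ids=(), project_ids=()):
    """Build a search Scope from organization and/or project IDs.

    The search API requires at least one of the two lists to be non-empty,
    so this helper fails fast when both are missing. Sketch only.
    """
    if not (org_ids or project_ids):
        raise ValueError('Provide at least one organization or project ID.')

    from google.cloud import datacatalog  # lazy import

    scope = datacatalog.SearchCatalogRequest.Scope()
    scope.include_org_ids.extend(org_ids)
    scope.include_project_ids.extend(project_ids)
    return scope
```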
There’s a trick here: search_catalog() returns a GRPC iterator that holds a kind of “remote references collection” pointing to the search results. The actual data is fetched only when you iterate over it, as follows:
(4.1)
fetched_results = [result for result in results]

OR

(4.2)
fetched_results = []
for page in results.pages:
    fetched_results.extend(page)
    # Extra processing might be added here.

(5)
print(fetched_results)
4.1 is a straightforward way to fetch all results, since the Iterator transparently handles pagination for you; 4.2 is a bit more verbose, iterating over one page at a time. The second approach allows you to add extra logic to the processing, such as checking whether more data needs to be fetched when iterating over large result sets, or adding a delay between page requests to keep API usage within quota.
If you created a dataset as suggested in the Environment setup | Create sample data assets section, you should see the content below among the search results:
search_result_type: ENTRY
search_result_subtype: "entry.dataset"
relative_resource_name: "projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>"
linked_resource: "//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart"
To get the expected results, the service account needs at least Data Catalog Viewer and BigQuery Data Viewer IAM roles. As Data Catalog automatically indexes assets managed by BigQuery and Pub/Sub, granting it the Pub/Sub Viewer role is also recommended for a better experience, although not required to follow this guide’s exercises.
Depending on the number of results returned by the first search, it may be difficult to analyze them. Also, dataset names may not tell us meaningful information about their content. We can use another search qualifier and look for more specific sensitive data. Searching for assets containing columns with the word “email” in their metadata, which includes name and description, might be a good starting point:
datacatalog_client.search_catalog(scope=scope, query='column:email')
WOW! There’s a table with “email” information:
search_result_type: ENTRY
search_result_subtype: "entry.table"
relative_resource_name: "projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>"
linked_resource: "//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart/tables/table_2"
Let’s get more details…
Get Entry
Get Entry allows us to view the catalog Entry associated with the suspicious search result. As it refers to a table, we have good chances of finding useful information, including its column schema. As mentioned in the Data Catalog hands-on guide: a mental model article, relative_resource_name is the link between a search result and its underlying catalog Entry.
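Since relative_resource_name follows the fixed format shown in the search results, it can also be split into its components with a regular expression. A sketch of my own, based only on the name format shown above:

```python
import re

_ENTRY_NAME = re.compile(
    r'projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)'
    r'/entryGroups/(?P<entry_group>[^/]+)/entries/(?P<entry>[^/]+)')

def parse_entry_name(relative_resource_name):
    """Split an entry's relative resource name into its components.

    Returns a dict with project, location, entry_group and entry keys,
    or None when the name does not match the expected format. Sketch only.
    """
    match = _ENTRY_NAME.fullmatch(relative_resource_name)
    return match.groupdict() if match else None
```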
How does it work?
datacatalog_client.get_entry(name='projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>')
And the expected result is:
name: "projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>"
type: TABLE
schema {
columns {
type: "STRING"
mode: "REQUIRED"
column: "name"
}
columns {
type: "STRING"
mode: "REQUIRED"
column: "email"
}
}
source_system_timestamps {
create_time {
seconds: 1561338828
nanos: 211000000
}
update_time {
seconds: 1561338828
nanos: 335000000
}
}
linked_resource: "//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart/tables/table_2"
bigquery_table_spec {
table_source_type: BIGQUERY_TABLE
}
Interesting! There’s an “email” column in datacatalog_quickstart.table_2
.
Thanks, Data Catalog! We wouldn’t guess it just by reading the dataset and table names. Spoiler: this amazing finding will be useful when demonstrating how to attach tags :).
Lookup Entry
In the previous example, we got the catalog Entry in two steps: we identified a suspicious search result and then requested its associated entry from the relative_resource_name. There will probably be cases in which you know (or guess…) the asset’s resource name before performing a catalog search, but still need to get its catalog entry: for example, to attach Tags to a well-known data asset belonging to one of your most important projects.
Coming back to our fictitious scenario: since we found a table_2 table with an “email” column, let’s also check for the existence of a table_1 in the same dataset. If it exists, we can analyze its column metadata.
Replacing the table’s name in the already known linked_resource value allows us to understand Lookup Entry usage:
request = datacatalog.LookupEntryRequest()
request.linked_resource = '//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart/tables/table_1'
datacatalog_client.lookup_entry(request=request)
And the expected result:
name: "projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<entry-id>"
type: TABLE
source_system_timestamps {
create_time {
seconds: 1561346951
nanos: 38000000
}
update_time {
seconds: 1561346951
nanos: 86000000
}
}
linked_resource: "//bigquery.googleapis.com/projects/<project-id>/datasets/datacatalog_quickstart/tables/table_1"
bigquery_table_spec {
table_source_type: BIGQUERY_TABLE
}
Well, the catalog Entry shows us the dataset also contains a table_1, but it has no columns (I created it empty when setting up my demo environment). We will use this finding to demonstrate tagging in the Data Catalog hands-on guide: templates & tags with Python article.
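The linked resource names used in lookups follow the pattern //bigquery.googleapis.com/projects/<project-id>/datasets/<dataset-id>/tables/<table-id>, so a tiny helper can build them before calling lookup_entry(). A sketch based on the values seen in the results above:

```python
def bigquery_table_linked_resource(project_id, dataset_id, table_id):
    """Build the linked resource name for a BigQuery table, following the
    pattern seen in the search and lookup results above. Sketch only."""
    return ('//bigquery.googleapis.com'
            '/projects/{}/datasets/{}/tables/{}').format(
                project_id, dataset_id, table_id)
```

For example, bigquery_table_linked_resource('my-project', 'datacatalog_quickstart', 'table_1') produces a value in the same shape as the one assigned to request.linked_resource above.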
Conclusion
This article explained how to use 3 core Data Catalog features with the Python GRPC client library: search catalog, get entry, and lookup entry. Although demonstrated with simple, fictitious examples, the concepts are the basis for real scenarios, where much more data is expected to be found, analyzed, and understood with these helpful Data Catalog features.
There are more core features to learn if you are new to Data Catalog. We haven’t yet seen how it can help teams manage data: templates and tags are explored in the next article. Keep reading!
The source files used to generate the above code samples are available on GitHub: https://github.com/ricardolsmendes/gcp-datacatalog-python.
References
- Data Catalog getting started guide: https://cloud.google.com/data-catalog/docs/quickstarts/quickstart-search-tag
- Data Catalog search syntax: https://cloud.google.com/data-catalog/docs/how-to/search-reference
- Python Client for Google Cloud Data Catalog API docs: https://googleapis.github.io/google-cloud-python/latest/datacatalog/index.html
Changelog
2020–10–03: Updated the code snippets to enforce compliance with version 2.0.0 of the client library.