Where is my data? The answer is Google Data Catalog

David Verdejo
Published in bluekiri · 14 min read · May 4, 2020
Photo by Kolar.io on Unsplash

Google Cloud Platform has announced that its Data Catalog service is Generally Available, and we want to share our experience with you.

But before we start to explain this service, what is a data catalog?

From Gartner: “A data catalog maintains an inventory of data assets through the discovery, description, and organization of datasets. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value.”

To put it simply, it uses metadata to help organizations manage their data. But what is metadata? There are three kinds of metadata:

  • Technical metadata: Schemas, tables, columns, …
  • Business metadata: business descriptions, comments, annotations, classifications, and more.
  • Operational metadata: data owner, dependencies, update frequency, …

The main benefits when we use a Data Catalog are:

  • Data governance
  • Operational efficiency
  • Competitive advantages
  • Risk detection
  • Cost savings

Speaking of GCP Data Catalog, the service offers us the following features:

  • Serverless
  • Central Catalog
  • Governance
  • Metadata-as-a-Service
  • Search and discovery
  • Schematized metadata
  • On-prem connectors
  • Cloud DLP integration
  • BigQuery integration
  • Cloud IAM integration

Let’s start with GCP Data Catalog. Log in to the GCP console and go to the “Data Catalog” section.

If it’s the first time you access it and you haven’t enabled the Data Catalog API yet, you’ll be prompted to enable it
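If you prefer the command line, the API can also be enabled with gcloud (a minimal sketch, assuming the Cloud SDK is installed and a project is selected):

# Enable the Data Catalog API for the current project
gcloud services enable datacatalog.googleapis.com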

We are going to create a table in BigQuery. Our first step will be the dataset creation (select “Data location: US” in order to follow the example; otherwise, you won’t be able to copy the table from the public dataset, which lives in another region)

Next we are going to copy a table from a public dataset to ours. In our case we are going to use “bigquery-public-data:usa_names.usa_1910_2013”. Click on “Copy Table”

And as the destination we are going to point to our dataset
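For reference, the same two steps can be done with the bq CLI (a sketch; the dataset and destination table names below are illustrative):

# Create a dataset in the US multi-region
bq --location=US mk --dataset ds_dc_01
# Copy the public table into our dataset
bq cp bigquery-public-data:usa_names.usa_1910_2013 ds_dc_01.usa_1910_2013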

Go back to the Data Catalog page, and in the “Explore data assets” section, click on “All BigQuery resources”

And then we can see our newly created resources

Click on our table, and we can see its related details, schema and tags

Data Catalog tag templates help you create and manage common metadata about data assets in a single location. Tags are attached to data assets, which means they can be discovered in the Data Catalog system.

Are you ready to create our first template? From the Data Catalog home, select “Create tag template”:

Enter the “Template display name” and the “Template ID”, and select the appropriate “Location”. Then “Add attribute”

Then add the “Display name”, the “Attribute ID”, the “Type” and whether it’s required or not

And add a second attribute

And finally click on “Create” to finish the tag template
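The same template can also be created from the command line (a rough sketch; the template ID, location and field names below are illustrative, and depending on your SDK version the command may live under gcloud beta data-catalog):

gcloud data-catalog tag-templates create demo_template \
  --location=us-central1 \
  --display-name="Demo Tag Template" \
  --field=id=data_owner,display-name="Data Owner",type=string,required=TRUE \
  --field=id=update_frequency,display-name="Update Frequency",type=string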

Go back to our table and click on “Attach Tag”, then in “Attach to” select the table or a column, select the template and fill out the required fields:

Change to the “Schema & column tags” tab and we can see the table schema and the column descriptions

In the next step, we are going to modify the BigQuery table schema. Go back to BigQuery, go to the Schema tab of our table and click on “Edit Schema”

Add a new field called “country” and “Save” it:

Click “Edit schema” again, add a description to our new column and “Save”:
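If you prefer the command line, the same change can be made by dumping the schema, appending the new column, and updating the table (a sketch; the table name and the description text are illustrative):

# Dump the current schema to a file
bq show --schema --format=prettyjson ds_dc_01.usa_1910_2013 > schema.json
# Edit schema.json and append the new column, for example:
#   {"name": "country", "type": "STRING", "mode": "NULLABLE", "description": "Country of the record"}
# Apply the updated schema
bq update ds_dc_01.usa_1910_2013 schema.json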

Then, return to Data Catalog and we can see that the schema has been updated automatically with the new column:

We can add tags to columns too. First, we are going to create a new template to tag those columns that contain PII (Personally Identifiable Information) data:

And then move to the “Schema & column tags” tab in our BigQuery table and assign the template to a column. In our case, for testing purposes, we are going to set the tag “is_pii” to true for the name column:

In this new section, we are going to integrate Data Catalog with the Data Loss Prevention (DLP) service. To see it in action, we are going to create a new table in our BigQuery dataset with PII (Personally Identifiable Information) data using bq_fake_pii_table_creator (a Python 3 library for creating BQ tables with fake PII data):

bq-fake-pii-table-creator --project-id $GOOGLE_CLOUD_PROJECT --bq-dataset-name ds_dc_01 --num-rows 5000 --num-cols 10

Note: you need the Storage and BigQuery Admin roles (or project Owner) to execute the script
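If the command isn’t available on your machine yet, it can typically be installed from PyPI (assuming the package name matches the CLI name, as in the project’s README):

pip3 install bq-fake-pii-table-creator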

And now we need to enable the DLP API if you haven’t enabled it before
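Again, this can also be done with gcloud:

# Enable the Cloud DLP API for the current project
gcloud services enable dlp.googleapis.com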

In the DLP UI, create a “Job or job trigger”

Then enter the “Job ID”, the “Storage Type” (in our case, BigQuery), the “Project ID”, the “Dataset ID” and the “Table ID”.

Note: it’s recommended that you set up sampling to reduce the DLP cost if possible.

In “Configure detection”, you can select a template or select the InfoTypes to check during the job (keep in mind that DLP costs depend on the number of InfoTypes to check)

In “Actions”, choose “Publish to Data Catalog”

Note: I recommend using “Publish to Pub/Sub” or “Notify by email” if you want to be alerted in real time, and “Publish to Security Command Center” if you want to see all your security-related information in one single place.

And in “Schedule”, in our case we want the job to run immediately. Click on “Create”
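The equivalent inspection job can also be created through the DLP REST API; a minimal sketch (the project, dataset, table and InfoTypes come from this example and should be adapted to yours):

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dlp.googleapis.com/v2/projects/daveres-datacatalog/dlpJobs" \
  -d '{
    "inspectJob": {
      "storageConfig": {
        "bigQueryOptions": {
          "tableReference": {
            "projectId": "daveres-datacatalog",
            "datasetId": "ds_dc_01",
            "tableId": "org_utilize_end_to_end_channels_f16c"
          }
        }
      },
      "inspectConfig": {
        "infoTypes": [
          {"name": "EMAIL_ADDRESS"},
          {"name": "PERSON_NAME"},
          {"name": "IBAN_CODE"}
        ]
      },
      "actions": [
        {"publishFindingsToCloudDataCatalog": {}}
      ]
    }
  }'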

During the job, we can return to the Data Catalog UI and verify that DLP has already added some tags to our table during the job execution

Wait until the DLP job finishes

Note: one interesting point you can observe from the result is that you have to take care when selecting the InfoTypes to search for (in this case, IBAN_CODE doesn’t detect BBAN codes). Make sure you select the correct ones before launching your DLP jobs.

And then go back to Data Catalog and we can see the DLP job results in the “Tags” tab:

Another interesting feature of Data Catalog is that we can restrict access to BigQuery at the column level with policy tags (currently in beta).

To see it in action, we are going to take advantage of the previous table, which contains PII data.

Our first step will be the creation of a taxonomy in Data Catalog (Data Catalog Policy Admin role is required). Click on “Create and manage policy tags” in the “Policy tags” section:

Click on “Create”

Then provide the Name, Description, Project and Location, and create the policy tags (and our desired hierarchy). After that, press “Save”:

The next step will be to enforce our policy and set permissions. On the “Policy Tags” page, select the recently created policy:

Enable “Enforce access control” (a warning screen advises that only members with the “Fine-Grained Reader” role on the policy tags will be able to access BigQuery columns with those policy tags assigned)

Important: before putting this into production, you can run the policy in monitor-only mode. To use monitor-only mode, do not enforce access control yet. Then have your previously authorized users continue to use the system. As they use the system, an audit trail is generated. You can scan the audit logs to determine whether any unexpected PERMISSION_DENIED errors were encountered. Once you are satisfied that the column-level security is properly set up, enforce access control.
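While in monitor-only mode, one possible starting point for scanning those audit logs is gcloud logging read (the filter below is only a rough sketch and may need tuning for your environment; status code 7 is PERMISSION_DENIED):

gcloud logging read \
  'resource.type="bigquery_resource" AND protoPayload.status.code=7' \
  --limit=20 --freshness=7d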

Then click on the copy button next to the ID of the desired policy tag to get the resource name (we’ll need it later):

Before we can assign the policy tag to our data, we need the required roles to do it: Policy Tags Admin and Project Viewer. Go to the IAM page and grant ourselves Policy Tags Admin on the project (for more fine-grained access control, you can grant access per policy tag too):

Note: don’t confuse the Data Catalog Admin and Data Catalog Policy Tags Admin roles (they have similar names). The Data Catalog Policy Tags Admin role is specifically targeted at managing policy tags and, by design, is not as broad a role as Data Catalog Admin. Also, the Data Catalog Admin role does not contain the policy tag permissions that are granted to the Data Catalog Policy Tags Admin role.

And then go to the BigQuery UI and edit the schema

And surprise!!! At the time of writing, we can’t add a policy tag from the UI (remember that this feature is in beta)

After I spent several hours trying to work around it, I finally decided to do it via the bq command. First, save the table schema to a file:

bq show --schema --format=prettyjson daveres-datacatalog:ds_dc_01.org_utilize_end_to_end_channels_f16c > schema.json

Modify the previous file to add the policy tag “High PII” to the email and company_email fields:

[
...
  {
    "mode": "NULLABLE",
    "name": "email",
    "type": "STRING",
    "policyTags": {
      "names": ["projects/daveres-datacatalog/locations/us/taxonomies/7455225971757727231/policyTags/8641215098440885555"]
    }
  },
...
  {
    "mode": "NULLABLE",
    "name": "company_email",
    "type": "STRING",
    "policyTags": {
      "names": ["projects/daveres-datacatalog/locations/us/taxonomies/7455225971757727231/policyTags/8641215098440885555"]
    }
  },
...
]

And update the table schema

bq update daveres-datacatalog:ds_dc_01.org_utilize_end_to_end_channels_f16c schema.json

And finally we see our change applied in the BigQuery console

Let’s try to query our table

As we expected, we receive an error message because we don’t have permission to query the email column

If we remove the above-mentioned column, the query works fine:
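For reference, here is the same pair of queries from the bq CLI (using the table from this example; the second query simply skips the protected columns):

# Expected to fail: we don't have the Fine-Grained Reader role on the email column
bq query --use_legacy_sql=false \
  'SELECT email FROM `daveres-datacatalog.ds_dc_01.org_utilize_end_to_end_channels_f16c` LIMIT 10'
# Works: exclude the policy-tagged columns
bq query --use_legacy_sql=false \
  'SELECT * EXCEPT(email, company_email) FROM `daveres-datacatalog.ds_dc_01.org_utilize_end_to_end_channels_f16c` LIMIT 10'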

To permit a user to query this column, we have to assign the Fine-Grained Reader role to the user. Go to our policy tag and click on “Show Info Panel” (upper right corner)

And add the “Fine-Grained Reader” role to the desired users:

Re-execute the query in BigQuery and… it works fine

Now it’s time to review the integration with Pub/Sub. Go to the Pub/Sub UI and create a new topic
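Or from the command line (the topic name is illustrative):

gcloud pubsub topics create demo_dc_topic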

Return to the Data Catalog UI, and this time we are going to explore “Data streams and Pub/Sub topics”

We can see our recently created topic

And we can see the related details and we have the option to “attach tag”

Now it’s time to use Data Catalog with our Google Cloud Storage buckets. First of all, we are going to create a new Storage bucket from the console (select the US region in order to export data from BigQuery public datasets)

And we are going to export data from a BigQuery public dataset in Avro format (this time we are going to use bigquery-public-data:usa_names.usa_1910_current).
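A command-line equivalent of these two steps (the bucket name and path are illustrative):

# Create a bucket in the US multi-region
gsutil mb -l US gs://my-dc-demo-bucket
# Export the public table to Avro files in the bucket
bq extract --destination_format=AVRO \
  bigquery-public-data:usa_names.usa_1910_current \
  gs://my-dc-demo-bucket/usa_names/usa_names-*.avro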

Go to Data Catalog and let’s create an entry group

Then “Create” the entry

And add the bucket, the fileset name and Fileset ID and select the file patterns (we can add up to 5 patterns)

Next, click on “Define Schema” and add the related schema

And finally click on “Create” and we can verify that our fileset is created
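The same entry group and fileset can also be registered with gcloud (a sketch; names, location and the inline schema format are illustrative, and depending on your SDK version these commands may require the beta component):

# Create the entry group
gcloud data-catalog entry-groups create demo_entry_group --location=us-central1
# Create a fileset entry pointing at the exported Avro files
gcloud data-catalog entries create demo_fileset \
  --location=us-central1 \
  --entry-group=demo_entry_group \
  --type=FILESET \
  --gcs-file-patterns=gs://my-dc-demo-bucket/usa_names/*.avro \
  --schema="state=STRING,gender=STRING,year=INTEGER,name=STRING,number=INTEGER"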

And now it’s time to see the searching capabilities of Data Catalog in action. Imagine that we want to search for data where the data_owner is Bluekiri. From the Data Catalog UI home, search for “tag:data_owner:Bluekiri”

Another useful search is to look for PII data: “tag:is_pii=true”

We can even review the DLP findings with “tag:has_findings=true”:

The documentation about the search syntax can be found at the following link
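The same searches can also be run through the Data Catalog API; a minimal sketch with curl (adapt the project ID and query to your environment):

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://datacatalog.googleapis.com/v1/catalog:search" \
  -d '{
    "query": "tag:is_pii=true",
    "scope": {"includeProjectIds": ["daveres-datacatalog"]}
  }'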

And one more thing… integration with our on-premises databases.

To simulate our on-premises database, we are going to create a MySQL instance on GCP. To make it easy, go to the GCP Marketplace and select a deployment (I love this feature for quick tests)

After our instance is up and running, we connect to it and create a new database and a table

mysql> CREATE DATABASE db_dc;
mysql> USE db_dc;
mysql> CREATE TABLE tbl_demo (
         id smallint unsigned not null auto_increment COMMENT 'User Id - PK',
         username varchar(20) not null COMMENT 'User name',
         email varchar(50) not null COMMENT 'User email',
         constraint pk_example primary key (id)
       ) COMMENT='Demo Data Catalog - MYSQL';
mysql> INSERT INTO tbl_demo ( id, username, email ) VALUES
         ( null, 'annibal', 'annibal@ateam.com' ),
         ( null, 'barracus', 'barracus@ateam.com' ),
         ( null, 'murdock', 'murdock@ateam.com' ),
         ( null, 'fenix', 'fenix@ateam.com' );

We are now ready to connect our instance to Data Catalog. Go to the data connector repository, and you can see the current integrations:

  • MySQL
  • PostgreSQL
  • SQL Server
  • Oracle

Important: This is not an officially supported Google product.

In order to setup the integration, we need the following prerequisites:

  • Auth credentials: create a service account, grant it the Data Catalog Admin role, and download the JSON key (we’ll need to move it to our database instance later)
  • Python 3 (pip and virtualenv)

Clone the repository and move to the mysql2datacatalog folder. Create a python3 virtualenv and activate it:

python3 -m virtualenv --python python3 env
source ./env/bin/activate

Install the required dependencies:

pip install ./lib/datacatalog_connectors_commons-1.0.0-py2.py3-none-any.whl
pip install ./lib/rdbms2datacatalog-1.0.0-py2.py3-none-any.whl
pip install --editable .

Before we can run the integration, we need to set up the following environment variables:

export GOOGLE_APPLICATION_CREDENTIALS=data_catalog_credentials_file
export MYSQL2DC_DATACATALOG_PROJECT_ID=google_cloud_project_id
#Database location - metadata that we can see in the database details
export MYSQL2DC_DATACATALOG_LOCATION_ID=google_cloud_location_id
export MYSQL2DC_MYSQL_SERVER=mysql_server
export MYSQL2DC_MYSQL_USERNAME=mysql_username
export MYSQL2DC_MYSQL_PASSWORD=mysql_password
export MYSQL2DC_MYSQL_DATABASE=mysql_database

Then, we can launch the script:

mysql2datacatalog \
  --datacatalog-project-id=$MYSQL2DC_DATACATALOG_PROJECT_ID \
  --datacatalog-location-id=$MYSQL2DC_DATACATALOG_LOCATION_ID \
  --mysql-host=$MYSQL2DC_MYSQL_SERVER \
  --mysql-user=$MYSQL2DC_MYSQL_USERNAME \
  --mysql-pass=$MYSQL2DC_MYSQL_PASSWORD \
  --mysql-database=$MYSQL2DC_MYSQL_DATABASE

Go to the Data Catalog UI and click on “Entry groups”

Then we can observe that a new entry group “mysql” has been created

Inside this entry group, we can see the database and the table:

And we can check the table definition

Note: at the time of writing, the column descriptions aren’t exported by the script to Data Catalog.

As for the price of the service, Data Catalog charges apply to:

  • Metadata storage: Business metadata as well as any on-premises metadata ingested by Data Catalog.
  • Data Catalog API calls: Data Catalog read, write, and search API calls.

For more information about pricing, follow the link.

That is all for now, but don’t miss the next episodes from the Bluekiri team!
