5 key steps for building a centralized and secured Data Catalog on GCP

Help your people find what they’re looking for

Elena Mata Yandiola
Google Cloud - Community
9 min read · Feb 4, 2022


My team and I recently finished a beautiful project. Our goal was to make sure the employees of a large company never again have to think, “I’ve spent 10 minutes looking for this data and still haven’t found what I was looking for.”

Photo by Tim Gouw on Unsplash

Does anyone know how much time we waste looking for data to do our work?

Many of us have lived the nightmare of wasting time looking for corporate data that we suspected existed but couldn’t find. According to a study carried out by Dresner in 2020 [1], 51% of employees say they struggle to find analytical data, a trend that grows in the largest companies (>1,000 employees). It has also been observed that the harder it is to find content, the more likely a business initiative is to fail. To demonstrate this, let’s throw in some bleak figures from various studies [1][2] about the first phase of most business initiatives, the initial information search & analysis:

  • 90% of the work week is spent on data-related activities.
  • 92% of that time is spent searching for data assets.
  • 44% of the time is wasted because data workers are unsuccessful in their activity.

and there is one thing that we are quite clear about: time is money. To address this issue and dramatically reduce search time, Google Cloud created Data Catalog, releasing it as generally available (GA) in April 2020.

Data Catalog is a fully managed and highly scalable data discovery and metadata management service. In this article, I’d like to show how to build your end-to-end enterprise Data Catalog using this product. To achieve this, I will keep it simple: just 5 steps.

Before that, I want to make a clarification (perhaps obvious to some): to build our business Data Catalog, what is fundamental (and what we are going to manage with this product) is not the data itself, but the metadata that describes it.

Our fundamental objective with this Catalog is for our colleagues to be able to find the information they are looking for and, even if they don’t have access to the data itself, to at least know the specific information that we do want them to know about that data. For example, we might want our employees to see:

  • who created a table or who is responsible for a file (so that, if it’s not sensitive information, they know who to ask for access to it)
  • when that information was generated and when it was updated
  • how much storage it occupies
  • if it contains sensitive information and what kind
  • and so on.

All of this without necessarily having access to the data underlying that metadata.

Now that we have understood the use case for Data Catalog, let’s see how to get value from this product. To make things easier to follow, I have added demo videos illustrating some of the 5 steps. The demo is based on 2 different projects (finance and hr) and 3 different identities (in this case, 3 individual accounts, but in a real use case they might be user groups or even service accounts). Here are the permissions granted to each of these identities (a sketch of how these grants could be scripted follows the list):

  • The Data Analyst can discover data in the finance project but not access the data itself. This is granted through the Data Catalog Viewer role, which provides metadata read access to all data assets and read access to all Tag Templates and Tags. The Data Analyst cannot discover or access data in the hr project.
  • The Data Curator can discover and access data in the finance and hr projects, granted through the Project Viewer role. They can also create and attach tags using tag templates that reside in either of the two projects (Data Catalog TagTemplate User role) and edit the attached tags (Data Catalog Tag Editor role).
  • The Data Governor can discover and access data in the finance and hr projects, granted through the Project Viewer role. In addition, they can access, create and edit Tag Templates in both projects thanks to the Data Catalog TagTemplate Owner role, which allows creating, updating and deleting tag templates and their associated tags.

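These grants can also be scripted instead of being set through the console. Below is a minimal sketch using the Resource Manager Python client; the project IDs (finance, hr) come from the demo, while the account e-mails and the choice of this particular client are my own assumptions for illustration:

```python
from google.cloud import resourcemanager_v3  # pip install google-cloud-resource-manager
from google.iam.v1 import policy_pb2


def grant_role(project_id: str, member: str, role: str) -> None:
    """Append an IAM binding granting `role` to `member` on the given project."""
    client = resourcemanager_v3.ProjectsClient()
    resource = f"projects/{project_id}"
    policy = client.get_iam_policy(request={"resource": resource})
    policy.bindings.append(policy_pb2.Binding(role=role, members=[member]))
    client.set_iam_policy(request={"resource": resource, "policy": policy})


# Placeholder identities for the three demo personas.
grant_role("finance", "user:data-analyst@example.com", "roles/datacatalog.viewer")
for project in ("finance", "hr"):
    grant_role(project, "user:data-curator@example.com", "roles/viewer")
    grant_role(project, "user:data-curator@example.com", "roles/datacatalog.tagTemplateUser")
    grant_role(project, "user:data-curator@example.com", "roles/datacatalog.tagEditor")
    grant_role(project, "user:data-governor@example.com", "roles/viewer")
    grant_role(project, "user:data-governor@example.com", "roles/datacatalog.tagTemplateOwner")
```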
With that said, let’s see how we can build our Data Catalog on Google Cloud in 5 steps.

1. Detect and enable the data that you want to make discoverable

To start using Data Catalog, you just need to open the GCP console and enable the product’s API here.

Once you do that, Google Cloud sources become discoverable automatically in your Catalog. This includes BigQuery (datasets, tables and BQML models), Pub/Sub, and Dataproc Metastore data. In addition, you can also send Data Loss Prevention (DLP) scan results to Data Catalog.

Apart from Google Cloud data sources, you can also integrate on-premises data sources into your Data Catalog, such as typical RDBMSs (MySQL, PostgreSQL, SQL Server, Teradata, etc.), BI tools (Looker, Qlik and Tableau) and Hive data sources. See all the supported sources here. The process for integrating these sources is explained in Step 3: Use entry groups, which also enables us to integrate Google Cloud Storage files into our Catalog.

Once we have all our metadata ingested, we can use Data Catalog for two main purposes:

  • Make our (meta)data discoverable through search (see Step 2).
  • Enrich our data with additional business metadata through tags (see Step 4).

2. Know (and make your users know) how to search for data

Once we have all our metadata ingested and all the necessary permissions set, our users are able to discover all the assets that they have metadata-level access to. This discovery is usually done through queries in Data Catalog’s search bar.

These searches can be quite simple: you can search for a substring of characters and get back all the assets related in some way to that substring. For example, if we simply searched for “num”, the search engine would return all assets that contain “num” in the asset name, in a column of some table in our environment, in a project name, a description, the value of a label, etc. If we want to do more advanced queries, we have to stick to the Data Catalog search syntax, whose qualifiers (operators) can be found in Data Catalog’s documentation.
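If you prefer to run the same searches programmatically, the Data Catalog client library exposes search as well. Here is a minimal Python sketch; the project IDs match the demo projects, and the exact queries are just illustrative:

```python
from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

datacatalog = datacatalog_v1.DataCatalogClient()

# Limit the search to the projects this identity can see (demo project IDs).
scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.extend(["finance", "hr"])

# Simple substring search: anything whose name, columns, description,
# labels, etc. contain "num".
for result in datacatalog.search_catalog(scope=scope, query="num"):
    print(result.relative_resource_name)

# Qualified search using the documented syntax: only BigQuery tables
# that have a column matching "num".
for result in datacatalog.search_catalog(scope=scope, query="type=table column:num"):
    print(result.linked_resource)
```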

To better understand how powerful this search engine is, let me show it here (click “See in YouTube” for easier viewing):

3. (Optional) Use entry groups

Entries represent data resources that can be created, searched and managed from Data Catalog. These resources can be:

  • Google Cloud resources, such as a BigQuery dataset or table, Pub/Sub topic, etc.
  • Custom resources with custom data types.

Entries are contained in an entry group, which is a set of logically related entries together with IAM policies that specify the users who can create, edit, and view those entries.

As we said, we can create entry groups with custom entries (e.g. files coming from sources such as MySQL, Postgres or SQL Server databases, or other data lakes & warehouses such as Redshift, Teradata, etc.). We can also create entry groups formed by Google Cloud Storage (GCS) filesets that we define.

GCS filesets can only be formed by a subset of objects from the same bucket, but you can have more than one fileset in an entry group. Filesets are defined using (potentially complex) wildcard patterns. For example, we could define a fileset by specifying the pattern gs://data-emy/2021_*/**/file?.[xls,csv]. The pattern’s components mean:

  • * — Match any number of characters at that directory level
  • ** — Match any number of characters across directory boundaries.
  • ? — Match a single character. E.g. gs://bucket/??.txt only matches objects with two characters followed by .txt.
  • [] — Match any character in the specified set or range.

With all this, we could say that the specified pattern includes any file with an xls or csv extension whose name is “file” plus one more character (file1, file2, …, file9), inside any subdirectory of those folders in the “data-emy” bucket whose names start with “2021_”.

A fileset can have between 1 and 500 patterns (see wildcard names and GcsFilesetSpec.filePatterns API reference documentation for more information).
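For reference, here is a minimal Python sketch of how an entry group and a fileset entry defined by the pattern above could be created with the client library; the project ID, location and resource IDs are placeholders I chose for illustration:

```python
from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

datacatalog = datacatalog_v1.DataCatalogClient()

# Entry groups are regional resources; project and location are placeholders.
parent = datacatalog_v1.DataCatalogClient.common_location_path("finance", "europe-west1")

# 1. Create the entry group that will hold our custom entries.
entry_group = datacatalog.create_entry_group(
    parent=parent,
    entry_group_id="gcs_landing_zone",
    entry_group=datacatalog_v1.EntryGroup(display_name="GCS landing zone"),
)

# 2. Create a fileset entry inside it, defined by the wildcard pattern above.
entry = datacatalog_v1.Entry()
entry.display_name = "2021 files"
entry.type_ = datacatalog_v1.EntryType.FILESET
entry.gcs_fileset_spec.file_patterns.append("gs://data-emy/2021_*/**/file?.[xls,csv]")

entry = datacatalog.create_entry(
    parent=entry_group.name, entry_id="files_2021", entry=entry
)
print(entry.name)
```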

4. Use Tags & Templates

Data Catalog Tags are business metadata that you (better said, the Data Governor) can create and manage in order to make information discoverable and complete for the rest of the users. These tags can be applied at the column or table level.

Tags can also be created automatically when executing a Cloud Data Loss Prevention (DLP) job, which scans data for sensitive information. The auto-generated tag will contain as many fields as infoTypes (sensitive information types such as email addresses or credit card numbers) we were searching for, specifying as a value the number of findings the DLP API got for each of those infoTypes.
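As a rough idea of what that looks like, here is a hedged sketch of a DLP inspection job whose action publishes its findings summary to Data Catalog; the project, dataset, table and infoTypes are placeholders, and in a real setup you would tune the inspect_config to your needs:

```python
from google.cloud import dlp_v2  # pip install google-cloud-dlp

dlp = dlp_v2.DlpServiceClient()

# Placeholder project, dataset and table to scan.
inspect_job = {
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": "finance",
                "dataset_id": "payments",
                "table_id": "transactions",
            }
        }
    },
    "inspect_config": {
        "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "CREDIT_CARD_NUMBER"}],
    },
    # This action is what writes the summary tag on the table's entry in Data Catalog.
    "actions": [{"publish_findings_to_cloud_data_catalog": {}}],
}

job = dlp.create_dlp_job(parent="projects/finance", inspect_job=inspect_job)
print(job.name)
```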

In order to achieve metadata consistency, the Data Governor can create tag templates. These templates are formed by metadata fields: key-value pairs whose values can be of type string, double, boolean, enumeration, or datetime, and which can be marked as required or optional.

Data Catalog tag templates will allow our Data Curator users to create homogeneous tags with all the information (metadata) we need to have a robust and consistent catalog.

As this is one of the most important topics in our Catalog, I have created a video where you can see how the Data Governor would create a tag template including what type of business information is relevant to classify our data. After that, a Data Curator could use the created template to attach tags to BigQuery tables or columns. You can see it here:
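If you prefer code to clicks, the same flow the video walks through can also be scripted with the client library. Below is a minimal sketch in which the Data Governor creates a template and the Data Curator attaches a tag to a BigQuery table’s entry; the template fields, project, location and table names are illustrative placeholders, not the ones used in the video:

```python
from google.cloud import datacatalog_v1  # pip install google-cloud-datacatalog

datacatalog = datacatalog_v1.DataCatalogClient()
location = datacatalog_v1.DataCatalogClient.common_location_path("finance", "europe-west1")

# -- Data Governor: define the template once --
template = datacatalog_v1.TagTemplate()
template.display_name = "Data governance"

template.fields["data_owner"] = datacatalog_v1.TagTemplateField()
template.fields["data_owner"].display_name = "Data owner"
template.fields["data_owner"].type_.primitive_type = (
    datacatalog_v1.FieldType.PrimitiveType.STRING
)
template.fields["data_owner"].is_required = True

template.fields["has_pii"] = datacatalog_v1.TagTemplateField()
template.fields["has_pii"].display_name = "Contains PII"
template.fields["has_pii"].type_.primitive_type = (
    datacatalog_v1.FieldType.PrimitiveType.BOOL
)

template = datacatalog.create_tag_template(
    parent=location, tag_template_id="data_governance", tag_template=template
)

# -- Data Curator: look up a BigQuery table's entry and attach a tag to it --
entry = datacatalog.lookup_entry(
    request={
        "linked_resource": "//bigquery.googleapis.com/projects/finance"
        "/datasets/payments/tables/transactions"
    }
)

tag = datacatalog_v1.Tag()
tag.template = template.name
tag.fields["data_owner"] = datacatalog_v1.TagField()
tag.fields["data_owner"].string_value = "finance-team@example.com"
tag.fields["has_pii"] = datacatalog_v1.TagField()
tag.fields["has_pii"].bool_value = True

datacatalog.create_tag(parent=entry.name, tag=tag)
```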

5. (Optional) Make use of Policy Tags for restricting access to BigQuery columns

The four steps that we have just seen are all we need to achieve a complete Data Catalog, boosting our Data Governance through homogeneous and consistent metadata tags. However, there’s another Data Catalog feature that we have not explored yet and that provides column-level security in BigQuery: Policy Tags. If you are not using (or interested in using) BigQuery, then you can stop reading here. If you want to learn how to control access to BigQuery tables at the column level, please keep reading.

By applying these policy tags, we can hide BigQuery columns from users who shouldn’t see that information. This way, they will simply see the tables as if those specific columns didn’t exist.

To define column-level, fine-grained access to our BigQuery tables, we need to carry out three steps (please note the order of steps 2 and 3 can be interchanged); a code sketch of the whole flow follows the list:

  1. Define a taxonomy in Data Catalog and create our policy tags inside it. By defining our taxonomy, we’re defining the different levels of security where our policy tags will reside. For instance, we could define levels based on how sensitive our data is (e.g. high, medium or low sensitivity), and then assign the policy tags we want to the different levels.
  2. Apply the policy tags to the columns we want to restrict access to. As we are now at the data level and not at the metadata one (we’re controlling who has access to the data itself, not to the metadata), this is applied in BigQuery and not in Data Catalog. We apply policy tags by editing the schema of the corresponding table and attaching the policy tags to the specific columns we want to protect.
  3. Control the access by applying the IAM “Fine-Grained Reader” role at the taxonomy level that we want. This role exists specifically and solely to cover this use case: restricting access at the BigQuery column level. Note: column-level security is enforced in addition to existing dataset ACLs; a user needs both dataset permission and policy tag permission in order to access data protected by column-level security.

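For reference, here is a minimal Python sketch of the three steps; the taxonomy name, policy tag, project, table and column are placeholders I chose for illustration, and step 3 is shown only as a comment because it is a plain IAM grant:

```python
from google.cloud import bigquery, datacatalog_v1  # pip install google-cloud-bigquery google-cloud-datacatalog

# 1. Create a taxonomy and a policy tag for the highest sensitivity level.
ptm = datacatalog_v1.PolicyTagManagerClient()
taxonomy = ptm.create_taxonomy(
    parent="projects/finance/locations/europe-west1",  # placeholder project/region
    taxonomy=datacatalog_v1.Taxonomy(
        display_name="Sensitivity",
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    ),
)
high = ptm.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(display_name="High sensitivity"),
)

# 2. Attach the policy tag to a column by rewriting that column in the table schema.
bq = bigquery.Client(project="finance")
table = bq.get_table("finance.payments.transactions")  # placeholder table
new_schema = []
for field in table.schema:
    if field.name == "credit_card_number":  # placeholder column to protect
        field = bigquery.SchemaField(
            field.name,
            field.field_type,
            mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[high.name]),
        )
    new_schema.append(field)
table.schema = new_schema
bq.update_table(table, ["schema"])

# 3. Grant roles/datacatalog.categoryFineGrainedReader on the taxonomy (or on an
#    individual policy tag) to the identities that should still read the column;
#    everyone else will see the table as if the column did not exist.
```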
I know what you’re thinking: this process can seem a bit cumbersome the first time you apply it. However, it can save you tons of time if you define a good taxonomy that can serve as a basis for all your BigQuery tables. If you want to see how to create and apply these policy tags following Google Cloud’s best practices, please watch the following video:

As we have seen, finding data in your organization does not have to be a nightmare. With these 5 fairly simple steps you can make your data more accessible, richer, more shareable and centrally controlled.

I hope you enjoyed the article and found this guide to creating a Data Catalog on Google Cloud practical. Whether you managed to apply these steps or not, I would love to hear your opinion (feedback is a gift) in the comments or on LinkedIn.

Thanks so much for reading this far!

[1] Data Catalog Study, Dresner Advisory Services, LLC, June 15, 2020, https://hdresner.gumroad.com/l/YKFrv

[2] State of Data Science and Analytics, An IDC IntroBrief, Sponsored by Alteryx, 2019, https://pages.alteryx.com/idc-infobrief-state-data-science-analytics


Elena Mata Yandiola
Google Cloud - Community

Quite interested in Big Data & ML topics. I’m currently a Data & Analytics Cloud Consultant on Google Cloud’s PSO team in Madrid, where I was born and raised.