Google Cloud Data Catalog Filesets: unlock their full potential

Enrich your Google Cloud Storage Filesets with useful statistics about your files

Marcelo Costa
Google Cloud - Community


Photo by bendavisual on Unsplash

The Dilemma

Have you ever lost track of how many files you have stored with your cloud provider? Is your object storage disorganized? Do you have empty buckets you are not even aware of? Do you want to manage your files better, so you can comply with all those new data protection regulations?

Data Catalog to the rescue!

If you are not familiar with Data Catalog, a recently announced member of Google Cloud’s Big Data services family, there’s a great series on Medium that explains some of its capabilities.

Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, manage, and understand all their data in Google Cloud

With its newest version, users are now able to create Filesets.

From the official Google docs

A Cloud Storage Fileset is a set of one or more files in Cloud Storage. Its cardinality is defined by the file pattern provided at creation time.

We will talk more about the file patterns and give some concrete examples later on.

In order to create your Fileset, you first need a user-defined Entry Group. Let’s understand this relationship:

Based on the official Google docs

The Fileset Entry is contained within a user-created Entry Group, so we must create the Entry Group beforehand.

If you want to test it out and create your Fileset Entry, please follow the instructions in the official Google docs; there are samples in multiple languages showcasing how to do it.
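For reference, here is a minimal Python sketch based on those samples, assuming a recent google-cloud-datacatalog client library; the project ID, location, entry IDs and display names below are illustrative:

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    project_id = "my-project"    # illustrative values
    location = "us-central1"     # Fileset Entries live in a specific region

    # 1. Create the user-defined Entry Group that will contain the Fileset Entry.
    entry_group = client.create_entry_group(
        parent=f"projects/{project_id}/locations/{location}",
        entry_group_id="my_entry_group",
        entry_group=datacatalog_v1.EntryGroup(display_name="My Fileset Entry Group"),
    )

    # 2. Create the Fileset Entry itself, pointing at a Cloud Storage file pattern.
    entry = datacatalog_v1.Entry()
    entry.display_name = "Orders CSV files"
    entry.type_ = datacatalog_v1.EntryType.FILESET
    entry.gcs_fileset_spec.file_patterns.append("gs://orders*/*csv")

    entry = client.create_entry(
        parent=entry_group.name, entry_id="orders_csv_fileset", entry=entry
    )
    print(f"Created Fileset Entry: {entry.name}")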

Once you have your Fileset Entry set up, you are able to discover it using Data Catalog’s search engine and add business Tags to it. But this raises a question… what if the Fileset Entry does not point to anything?

A file pattern may match zero files in a bucket, and that probably wouldn’t be very helpful… How can we enrich the Entry with useful metadata about our files, improving Data Catalog’s search capabilities and, on top of that, our own metadata management?

Fileset Enricher

This is the goal of the Fileset Enricher:

It’s part of the datacatalog-util Python package: using the enrich command, your Data Catalog Fileset Entries will be tagged with statistics about the files matching the provided file pattern.

You can install it with pip3 install datacatalog-util.

Disclaimer: this is not an officially supported Google product; it’s an open source Python package, open to contributions =)

To demonstrate this feature, let’s look at a scenario where we have 3 different buckets containing files:

GCP project and storage buckets with files

Now we are going to create 4 Fileset Entries in our project (a scripted version of this step follows the list):

  • file pattern gs://orders*/*csv
  • file pattern gs://orders*/*
  • file pattern gs://users_pii/*
  • file pattern gs://new_orders/*
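If you prefer to script this step instead of using the UI, a minimal sketch reusing the client and Entry Group from the earlier snippet could look like this (the entry IDs are illustrative):

    # Reuses `client` and `entry_group` from the earlier snippet.
    file_patterns = {
        "orders_csv_fileset": "gs://orders*/*csv",
        "orders_fileset": "gs://orders*/*",
        "users_pii_fileset": "gs://users_pii/*",
        "new_orders_fileset": "gs://new_orders/*",
    }

    for entry_id, pattern in file_patterns.items():
        entry = datacatalog_v1.Entry()
        entry.display_name = entry_id
        entry.type_ = datacatalog_v1.EntryType.FILESET
        entry.gcs_fileset_spec.file_patterns.append(pattern)
        client.create_entry(parent=entry_group.name, entry_id=entry_id, entry=entry)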

Let’s take a look at them:

Created Fileset Entries

Not very useful, right?

The Fileset Enricher feature will add the following Tag fields to each Fileset Entry:

Tag fields options

It even allows the user to choose which fields should be added to the Entry; if no fields are specified, all of the fields above are applied.
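To give a sense of what these fields contain, here is a rough sketch of how such statistics can be computed with the google-cloud-storage client. This is an illustration of the idea, not the package’s actual code; it assumes a literal bucket name and matches object names with fnmatch, so wildcard bucket patterns like gs://orders*/* are not covered:

    import os
    from collections import Counter
    from fnmatch import fnmatch

    from google.cloud import storage

    def fileset_stats(file_pattern):
        """Compute simple statistics for a gs://bucket/object-pattern string."""
        bucket_name, _, object_pattern = file_pattern[len("gs://"):].partition("/")
        client = storage.Client()

        files, total_size, files_by_type = 0, 0, Counter()
        for blob in client.list_blobs(bucket_name):
            if not fnmatch(blob.name, object_pattern):
                continue
            files += 1
            total_size += blob.size or 0
            ext = os.path.splitext(blob.name)[1].lstrip(".") or "unknown_file_type"
            files_by_type[ext] += 1

        return {"files": files, "size_bytes": total_size,
                "files_by_type": dict(files_by_type)}

    print(fileset_stats("gs://users_pii/*"))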

So let’s run the script:

datacatalog-util filesets enrich --project-id my-project
  • file pattern gs://orders*/*csv
CSV Orders Fileset Enriched with Tags

We can see that two buckets were found, and we have 4 CSV files matching the pattern.

  • file pattern gs://orders*/*
Orders Fileset Enriched with Tags

We can see that 7 files were found now, and we have an unknown_file_type. Did someone make a mistake? :)

  • file pattern gs://users_pii/*
PII Data Fileset Enriched with Tags

We can see that 6 files were found now, and we have 5 different file types. Maybe we should standardize this bucket?

  • file pattern gs://new_orders/*
New Orders Fileset Enriched with Tags

Wait, this bucket doesn't even exist; maybe we should delete this Entry?

  • Bonus: file pattern gs://*/*
My Project Fileset Enriched with Tags

You can even run it for all buckets in your project; by doing that, we found out that we have 868 files, 89 of which have an unknown file type.
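Conceptually, the project-wide numbers come from the same kind of listing, just repeated over every bucket in the project. Again, this is only an illustration that reuses the fileset_stats sketch above, not the package’s code:

    from collections import Counter

    from google.cloud import storage

    # Reuses fileset_stats() from the earlier sketch.
    client = storage.Client(project="my-project")
    totals = Counter()
    for bucket in client.list_buckets():
        stats = fileset_stats(f"gs://{bucket.name}/*")
        totals["files"] += stats["files"]
        totals["size_bytes"] += stats["size_bytes"]
    print(dict(totals))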

Data Catalog Search

This is where Data Catalog shines. Imagine that you have thousands of storage buckets and Filesets… it does not matter how good your data quality is if you are unable to search for and find it.

If you want to understand the basics of search, please read this great article, which explains its features, or the official docs.

Let’s run a few queries to showcase this:

  • tag:fileset_enricher_findings.buckets_found=0
Search UI, zero buckets found
  • tag:fileset_enricher_findings.buckets_found>0
Search UI, buckets greater than zero found
  • tag:fileset_enricher_findings.files_by_type:png
Search UI, png files found

Those are just a few options, but you can use any Tag Field to search and get meaningful insights out of your metadata!
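The same queries can also be issued programmatically with the Data Catalog client. A minimal sketch, assuming the project used earlier:

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.append("my-project")

    # Find Fileset Entries whose enrichment Tag reported zero matching buckets.
    results = client.search_catalog(
        scope=scope, query="tag:fileset_enricher_findings.buckets_found=0"
    )
    for result in results:
        print(result.relative_resource_name)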

Load Test

To wrap it up… with an object storage solution like Google Cloud Storage, it’s common to deal with a very large number of files. To see how this Python script performs at scale, we created a scenario with 1,008,689 files in a single bucket. Since we are only dealing with file metadata, it shouldn’t take long, right? Let’s see how it performed.

Results from 1 million+ files

An e2-medium VM (2 vCPUs and 4 GB memory) was used to run the Python script, and it took 3 minutes to finish the execution.

Closing thoughts

In this article, we covered how to enrich your Data Catalog Fileset Entries with useful statistics about your files and then search for them. It’s important to point out that this Data Catalog feature is currently in Beta and will most likely be improved later on; in the meantime, this Python script can be a good option to improve your metadata management. Also keep in mind that the script uses the GCS list_buckets and list_blobs APIs to extract the metadata matching the file pattern and generate the file statistics, so their billing policies apply.
Hope it helps! Cheers!

[Update] Changed to use the datacatalog-util Python package.

Resources

  1. Data Catalog official Fileset docs: https://cloud.google.com/data-catalog/docs/how-to/filesets
  2. Data Catalog Utils GitHub: https://github.com/mesmacosta/datacatalog-util
  3. Data Catalog Medium series: https://medium.com/google-cloud/data-catalog-hands-on-guide-a-mental-model-dae7f6dd49e
  4. Data Catalog getting started guide: https://cloud.google.com/data-catalog/docs/quickstarts/quickstart-search-tag
