Google Cloud Data Catalog Filesets: unlock their full potential
Enrich your Google Cloud Storage Filesets with useful statistics about your files
The Dilemma
Have you ever lost track of how many files you have stored at your cloud provider? Is your Object Storage solution disorganized? Do you have empty buckets you are not aware of? Do you want to better manage your files to be compliant with all those new data protection regulations?
Data Catalog to the rescue!
If you are not familiar with Data Catalog, a recently announced member of Google Cloud’s Big Data services family, there’s a great series on Medium that explains some of its capabilities.
With its newest version, users are now able to create Filesets.
A Cloud Storage Fileset is a set of one or more files in Cloud Storage. Its cardinality is defined by the file pattern provided at creation time. We will talk more about file patterns and give some concrete examples later on.
In order to create your Fileset, you first need a user-defined Entry Group. Let’s understand this relationship: the Fileset Entry is contained within a user-created Entry Group, so we must create the Entry Group beforehand.
Please follow the instructions in the official Google docs if you want to try it out and create your Fileset Entry; there are samples in multiple languages showcasing how to do it.
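To make the Entry Group → Entry relationship concrete, here is a minimal sketch in Python, loosely following the official client-library samples (assuming the google-cloud-datacatalog package and valid credentials; the IDs and display name are illustrative, not from the article):

```python
def entry_group_path(project: str, location: str, entry_group_id: str) -> str:
    """Build the Entry Group resource name; Entries live inside it."""
    return f"projects/{project}/locations/{location}/entryGroups/{entry_group_id}"


def create_fileset_entry(project: str, location: str,
                         entry_group_id: str, entry_id: str,
                         file_pattern: str):
    """Create the Entry Group first, then a Fileset Entry inside it."""
    # Imported here so the pure helper above also works without the
    # client library installed.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    parent = f"projects/{project}/locations/{location}"

    # Step 1: the user-defined Entry Group must exist beforehand.
    entry_group = client.create_entry_group(
        parent=parent,
        entry_group_id=entry_group_id,
        entry_group=datacatalog_v1.EntryGroup(),
    )

    # Step 2: the Fileset Entry, with its file pattern.
    entry = datacatalog_v1.Entry()
    entry.display_name = "Orders fileset"  # illustrative
    entry.type_ = datacatalog_v1.EntryType.FILESET
    entry.gcs_fileset_spec.file_patterns.append(file_pattern)

    return client.create_entry(
        parent=entry_group.name, entry_id=entry_id, entry=entry
    )
```

Note how the Entry’s parent is the Entry Group’s resource name, which is why the group has to be created first.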
Once you have your Fileset Entry set up, you can discover it using Data Catalog’s search engine and add business Tags to it. But this raises a question… what if the Fileset Entry does not point to anything? A file pattern can match zero files in a bucket, and that probably wouldn’t be very helpful. How can we enrich it with useful metadata about our files, improving Data Catalog’s search capabilities and, on top of that, our own metadata management?
Fileset Enricher
This is the goal of the Fileset Enricher: it’s part of the datacatalog-util Python package, and with the enrich command your Data Catalog Fileset Entries are tagged with statistics about the files matching the provided file pattern.
You can install it with pip3 install datacatalog-util.
Disclaimer: this is not an officially supported Google product; it’s an open source Python package, open to contributions =)
So, to demonstrate this feature, let’s look at a scenario where we have 3 different buckets with files:
Now we are going to create 4 Fileset Entries in our project, with the following file patterns:
- gs://orders*/*csv
- gs://orders*/*
- gs://users_pii/*
- gs://new_orders/*
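To build intuition for which files each pattern will match, here is a simplified sketch using Python’s fnmatch, matching the bucket and object name parts separately (the object names below are hypothetical, and real Data Catalog patterns have richer wildcard semantics than fnmatch):

```python
from fnmatch import fnmatch


def matches(uri: str, pattern: str) -> bool:
    """Check a gs://bucket/object URI against a file pattern,
    matching the bucket and the object name separately."""
    bucket, _, name = uri.removeprefix("gs://").partition("/")
    pat_bucket, _, pat_name = pattern.removeprefix("gs://").partition("/")
    return fnmatch(bucket, pat_bucket) and fnmatch(name, pat_name)


# Hypothetical objects -- the article's real bucket contents may differ.
objects = [
    "gs://orders_2019/file1.csv",
    "gs://orders_2020/report.txt",
    "gs://users_pii/users.avro",
]

print([o for o in objects if matches(o, "gs://orders*/*csv")])
# -> ['gs://orders_2019/file1.csv']
print([o for o in objects if matches(o, "gs://orders*/*")])
# -> ['gs://orders_2019/file1.csv', 'gs://orders_2020/report.txt']
```

The first pattern only keeps CSV files inside orders* buckets, while the second keeps every file in those buckets; that difference drives the different statistics we will see below.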
Let’s take a look at them:
Not very useful, right?
The Fileset Enricher feature will add the following Tag fields to each Fileset Entry:
It even lets you choose which fields should be added to the Entry; if no fields are specified, all of the fields above are applied.
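As a rough idea of what computing those statistics involves, here is a small sketch that aggregates a listing of (URI, size) pairs; the field names are illustrative and the enricher’s actual tag template may use different names:

```python
from collections import Counter


def fileset_stats(blobs):
    """Aggregate simple statistics over (uri, size_bytes) pairs,
    in the spirit of the enricher's Tag fields (names illustrative)."""
    buckets = set()
    by_type = Counter()
    total_size = 0
    for uri, size in blobs:
        bucket, _, name = uri.removeprefix("gs://").partition("/")
        buckets.add(bucket)
        # Treat the text after the last dot as the file type.
        ext = name.rsplit(".", 1)[1] if "." in name else "unknown_file_type"
        by_type[ext] += 1
        total_size += size
    return {
        "files": len(blobs),
        "buckets_found": len(buckets),
        "files_by_type": dict(by_type),
        "total_size_bytes": total_size,
    }


stats = fileset_stats([
    ("gs://orders_2019/a.csv", 120),
    ("gs://orders_2020/b.csv", 80),
    ("gs://orders_2020/notes", 10),  # no extension -> unknown type
])
print(stats)
# -> {'files': 3, 'buckets_found': 2,
#     'files_by_type': {'csv': 2, 'unknown_file_type': 1},
#     'total_size_bytes': 210}
```

This is exactly the kind of summary that turns an opaque Fileset Entry into something searchable.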
So let’s run the script:
datacatalog-util filesets enrich --project-id my-project
file pattern: gs://orders*/*csv
We can see that two buckets were found, and that 4 CSV files match the pattern.
file pattern: gs://orders*/*
We can see that 7 files were found now, and we have an unknown_file_type. Did someone make a mistake? :)
file pattern: gs://users_pii/*
We can see that 6 files were found now, spanning 5 different file types. Maybe we should standardize this bucket?
file pattern: gs://new_orders/*
Wait, this bucket doesn’t even exist; maybe we should delete this Entry?
- Bonus: file pattern: gs://*/*
You can even run it for all buckets in your project; by doing that, we found out that we have 868 files, 89 of which are of unknown type.
Data Catalog Search
This is where Data Catalog shines. Imagine having thousands of storage buckets and Filesets: no matter how good the quality of your data is, it’s worthless if you are unable to search for and find it.
If you want to understand the basics of search, please read this great article, which explains its features, or the official docs.
Let’s run a few queries to showcase this:
tag:fileset_enricher_findings.buckets_found=0
tag:fileset_enricher_findings.buckets_found>0
tag:fileset_enricher_findings.files_by_type:png
Those are just a few options, but you can use any Tag Field to search and get meaningful insights out of your metadata!
Load Test
To wrap it up… when we think about an object storage solution like Google Cloud Storage, it’s common to deal with a large number of files. So, to see how this Python script performs, we created a scenario with 1,008,689 files in a single bucket. Since we are only dealing with file metadata, it shouldn’t take long, right? Let’s see how it performed.
An e2-medium VM (2 vCPUs and 4 GB memory) was used to run the Python script, and it took 3 minutes to finish the execution.
Closing thoughts
In this article, we covered how to enrich your Data Catalog Fileset Entries with useful statistics about your files and then search for them. It’s important to point out that this Data Catalog feature is currently in Beta and will most likely be improved later on; in the meantime, this Python script can be a good option to improve your metadata management. Also keep in mind that the script uses the GCS list_buckets and list_blobs APIs to extract the metadata that matches the file pattern and generate the file statistics, so their billing policies apply.
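As a rough sketch of what that listing looks like with the GCS client library (assuming google-cloud-storage; the enricher’s actual implementation may differ, and every list call here is billed):

```python
from fnmatch import fnmatch


def split_pattern(pattern: str):
    """Split gs://bucket_pat/name_pat into its two halves."""
    bucket_pat, _, name_pat = pattern.removeprefix("gs://").partition("/")
    return bucket_pat, name_pat


def count_matching_files(project_id: str, pattern: str) -> int:
    """Count objects matching a file pattern by listing buckets and
    blobs -- one list_buckets call plus one list_blobs call per bucket."""
    from google.cloud import storage  # needs credentials at call time

    bucket_pat, name_pat = split_pattern(pattern)
    client = storage.Client(project=project_id)
    count = 0
    for bucket in client.list_buckets():
        if not fnmatch(bucket.name, bucket_pat):
            continue
        for blob in client.list_blobs(bucket.name):
            if fnmatch(blob.name, name_pat):
                count += 1
    return count
```

Listing is paginated server-side, which is why even a million-file bucket can be scanned in minutes, but also why broad patterns like gs://*/* touch every bucket in the project.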
Hope it helps! Cheers!
[Update] Changed to use the datacatalog-util Python package.
Resources
- DataCatalog official Fileset docs: https://cloud.google.com/data-catalog/docs/how-to/filesets
- DataCatalog Utils GitHub: https://github.com/mesmacosta/datacatalog-util
- DataCatalog medium series: https://medium.com/google-cloud/data-catalog-hands-on-guide-a-mental-model-dae7f6dd49e
- Data Catalog getting started guide: https://cloud.google.com/data-catalog/docs/quickstarts/quickstart-search-tag