Cataloging Metadata in Snowflake

A simple, opinionated approach to data cataloging and data discovery in Snowflake

Venkat Sekar
Hashmap, an NTT DATA Company
8 min read · Jan 29, 2021


Your organization has successfully implemented Snowflake, and various departments have embraced the migration and started adopting it. You have migrated from a traditional data warehouse, appliance, or big data platform (Netezza, Teradata, Hadoop, Exadata, Greenplum). You might also be ingesting data from 3rd-party vendor APIs. Your Snowflake account might have 10 databases, each with 5 or more schemas and probably thousands of tables, columns, and views. There are multiple users across various departments and groups running queries, executing jobs, and serving various business needs.

At this point, you need to assess the data inventory in Snowflake and determine:

  • Who is using what data?
  • How are the tables and views related to each other?
  • When was the table last loaded? Is it being used?
  • Which columns are important in a table, and which columns are least used?

The typical way to answer these questions is with an enterprise-wide data catalog. You could implement a commercial offering like Alation, Collibra, or Dataedo, or an open source product like Lyft’s Amundsen, Magda, Airbnb’s Dataportal, Netflix’s Metacat, or WeWork’s Marquez.

However, not every adoption looks the same. Your enterprise might have already standardized on a vendor (like Collibra), in which case you still need to set up, configure, and integrate the solution, which typically involves working with multiple teams. These products target enterprise-wide rollouts, and small to medium-sized companies often lack the capacity to host and maintain such services. Even a lighter commercial offering (like https://treeschema.com) may need approval before it can be adopted.

An alternative is to implement a minimalistic starter solution: crawl Snowflake resources using Snowflake's own capabilities. How do we achieve that?

  • Snowflake Account Usage has views that give insights into various objects & activities.
  • The Query History view lists all the queries issued by various actors. We can deconstruct these queries to determine the tables, views, and related columns involved (a sample query against this view appears after this list).
  • JavaScript UDFs offer the ability to run this parsing logic inside Snowflake with a well-known, widely adopted language.
  • A virtual warehouse offers an isolated compute option, so the data can be processed and stored entirely within Snowflake.
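
For example, the raw material for all of this can be pulled with a simple query against the Account Usage share; the columns selected below are just an illustration:

    SELECT query_id,
           query_type,
           query_text,
           database_name,
           schema_name,
           user_name,
           start_time
    FROM   snowflake.account_usage.query_history
    WHERE  execution_status = 'SUCCESS'
      AND  start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    LIMIT  100;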

If we store the results in tables, we can build views that answer queries like:

  • Which databases, schemas, and tables/views are actively used and critical in nature?
  • If we need to drop table A, what other tables will be affected?
  • And more

The good news is that this approach is doable. A great example of this kind of implementation is described in this blog post from Uber: Queryparser, an Open Source Tool for Parsing and Analyzing SQL.

In my previous article, Housekeeping in Snowflake with SQL and dbt, I also demonstrate this approach.

You can also watch my Hashmap Megabytes video where I walk you through the steps outlined in this blog post:

Hashmap Megabytes — Ep 13 — Cataloging Metadata in Snowflake: Data Cataloging & Data Discovery

Here’s What To Do

  • Copy the Snowflake Account Usage views into a database & schema.
  • Filter out records that currently offer little cataloging value, such as SHOW/USE/DESCRIBE commands (a sketch of these first two steps appears after this list).
  • Cleanse, parse, and tokenize the query text from the Query_History table.
  • Join back with the “Tables” & “Views” Account Usage views to identify the tables, views, databases, and schemas used.
  • Build views/materialized tables on top of the result that offer various insights related to data cataloging & data discovery.
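
As a rough sketch of the first two steps (the dbt models in the repo differ in the details, and the database and schema names below are placeholders), you could snapshot the query history into a transient table and then drop the low-value statement types:

    -- Snapshot the Account Usage view into a local transient table
    -- (transient, because the whole thing can be rebuilt at any time)
    CREATE OR REPLACE TRANSIENT TABLE catalog_db.staging.query_history_raw AS
    SELECT query_id, query_type, query_text, database_name, schema_name, user_name, start_time
    FROM   snowflake.account_usage.query_history
    WHERE  execution_status = 'SUCCESS';

    -- Filter out statement types that add little cataloging value
    CREATE OR REPLACE TRANSIENT TABLE catalog_db.staging.query_history_filtered AS
    SELECT *
    FROM   catalog_db.staging.query_history_raw
    WHERE  query_type NOT IN ('SHOW', 'USE', 'DESCRIBE', 'GRANT');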

The approach above is not meant for real-time needs; you can typically run this processing at the beginning of the workday or on demand.
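
If you want the refresh to run entirely inside Snowflake on a schedule, one option is a task; the sketch below assumes the refresh logic has been packaged into a stored procedure, and every name here is a placeholder (the repo itself drives the refresh through dbt):

    -- Hypothetical daily refresh at 06:00 UTC
    CREATE OR REPLACE TASK catalog_db.staging.refresh_catalog
      WAREHOUSE = catalog_wh
      SCHEDULE  = 'USING CRON 0 6 * * * UTC'
    AS
      CALL catalog_db.staging.sp_refresh_catalog();

    ALTER TASK catalog_db.staging.refresh_catalog RESUME;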

Based on processing around 1 million records in my lab account, an XS warehouse was adequate; end to end, the process took around 5 minutes.

The materialized tables are all transient, since the entire content can be recreated at any time.

The code is available in this GitLab Repo: https://gitlab.com/hashmapinc/oss/SnowflakeCataloguer

Views

Here are some of the views that have been implemented:

  • TABLES_W_USAGE_COUNT: This is a list of tables/views reflecting how often they have been used in various queries and the most recent time the table was accessed.
  • TABLE_NEIGHBOURS: Most typical queries use 2 or more tables. By dissecting the queries, as implemented here, we could determine if TABLE-A has a close relationship with TABLE-D and how close this relationship is compared with TABLE-B. The closeness (LINK_COUNT) is based on the count of distinct queries that reference these tables.
  • TABLE_COLUMNS_USAGE: While a table can have 5 or 50 columns, not all columns have the same importance; many may not be used at all. This dataset shows which columns of a given table actually appear in queries. The count (COL_USAGE_COUNT) reflects the frequency: a higher count suggests the column is more critical, while columns that never appear are likely unused. (Example queries against these views follow this list.)
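
To make these views concrete, here are the kinds of questions they let you answer. The column names below follow the descriptions above but are illustrative and may differ slightly from the repo:

    -- Tables that have not been touched in the last 90 days (cleanup candidates)
    SELECT table_name, usage_count, last_accessed
    FROM   tables_w_usage_count
    WHERE  last_accessed < DATEADD(day, -90, CURRENT_TIMESTAMP())
    ORDER BY usage_count;

    -- If we plan to drop TABLE_A, which tables are most often queried alongside it?
    SELECT related_table, link_count
    FROM   table_neighbours
    WHERE  table_name = 'TABLE_A'
    ORDER BY link_count DESC;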

Query Parsing

The SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view contains various queries issued by users against any object. Every SQL execution is logged and is available through this view. The actual statement issued is available in the ‘QUERY_TEXT’ column.

Here is an illustrative example of the kind of query text you might find there (the table and column names are made up):
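
    -- Illustrative only: note the comments, the join, and the quoted string literal,
    -- all of which the parser has to cope with
    SELECT /* nightly revenue rollup */
           o.order_date,
           c.region,
           SUM(o.amount) AS total_amount
    FROM   sales_db.core.orders o
      JOIN sales_db.core.customers c
        ON c.customer_id = o.customer_id   -- inline comment
    WHERE  o.order_date >= '2021-01-01'
    GROUP BY o.order_date, c.region;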

Query parsing helps in understanding lineage, usage, etc. Although SQL is standard, the exact way it is written and executed varies widely. In my development, I have come across queries embedded with inline comments, multi-line comments, quoted strings, and so on. Articles such as the Uber Queryparser post mentioned above give good insights into these complexities.

Those solutions, however, parse the queries outside of the datastore. You would need to understand the implementation language and libraries, and you would need a hosting environment to run and maintain these programs.

Instead of processing outside of Snowflake, I chose to parse the query text inside Snowflake. This eliminates the need for any extra infrastructure, keeping the approach simple, and it is maintainable by Snowflake admins. Adopting SQL & JS UDFs, which are core Snowflake capabilities, also avoids any costly political battles.
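
To give a flavor of what in-database parsing can look like, here is a heavily simplified JavaScript UDF that pulls candidate table identifiers out of a query string. The real implementation in the repo handles far more edge cases (comments inside strings, quoted identifiers, CTEs, subqueries), and the function and table names here are placeholders:

    CREATE OR REPLACE FUNCTION extract_table_tokens(query_text STRING)
      RETURNS ARRAY
      LANGUAGE JAVASCRIPT
    AS
    $$
      // Naive sketch: strip comments, then capture identifiers that follow FROM or JOIN
      var sql = QUERY_TEXT
                  .replace(/\/\*[\s\S]*?\*\//g, ' ')   // block comments
                  .replace(/--.*$/gm, ' ');            // line comments
      var tokens = [];
      var re = /\b(?:from|join)\s+([a-zA-Z0-9_."]+)/gi;
      var m;
      while ((m = re.exec(sql)) !== null) {
        tokens.push(m[1].replace(/"/g, '').toUpperCase());
      }
      return tokens;
    $$;

    -- Example call against the staged query history (placeholder names)
    SELECT query_id, extract_table_tokens(query_text) AS tables_referenced
    FROM   catalog_db.staging.query_history_filtered
    LIMIT  10;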

Another reason to keep the parsing inside Snowflake is that queries can sometimes contain PII. By not exporting the query text, we avoid any privacy/compliance issues. The parsing result is just metadata (table names, database names, table record IDs, etc.), which carries no sensitive data and can be exported if you choose to.

Current Limitations

The code and logic provided do not cover every scenario or every environment, mainly because usage patterns and query text vary; expect some modifications based on your situation. Such modifications, however, should not be a big constraint or take too much time.

Below are some limitations to consider:

  • Dataset size and processing time: As mentioned, my numbers come from a lab account, where the query history holds only around 1 million records. If the count is higher in your organization, processing may take longer than 5 minutes.
  • Temporary tables: Temporary tables are not accounted for, as their definitions are not available in the Tables/Views catalog.
  • Stored procedures, streams, tasks, etc.: SQL statements executed by stored procedures, streams, and tasks are currently ignored. Depending on how this concept gets adopted, support for these statements could be added down the road.

Implementation Code

The implementation code is JS UDFs & SQL (thank you, Snowflake!), all executed within the Snowflake environment. The code has been implemented as a dbt project and is available in the GitLab repo linked above.

For a more in-depth walk-through of installation, execution, etc., refer to the project's README.md.

Final Thoughts

The approach demonstrated in this article is a great place to start. Using the “TABLES_W_USAGE_COUNT” records, I was able to catalog and organize my own Snowflake environment, and I am sure you will find more benefits in your adoption as well. More implementations could follow, provided there is interest from the community.

I hope this Snowflake data cataloging guide provides valuable insight into getting more out of your Snowflake cloud data warehouse. It would be great to hear your thoughts on the approach demonstrated in this article. Reach out to start a conversation and review the next steps in your Snowflake journey.

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better, together.

If you are considering moving data and analytics products and applications to the cloud, or if you would like help and guidance (and a few best practices) in delivering higher-value outcomes in your existing cloud program, please contact us.

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.

Other Tools and Content You Might Like

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Venkat Sekar is a Senior Architect for Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
