Hunting Stale Tables in Snowflake

How to quickly identify tables that are no longer providing value.


Living the dream

Imagine: you are a DBA, and your organization finally got Snowflake and was able to centralize all its data, overcome its data silos, and deliver on the data-driven vision of the future.

Your data pipelines are humming along efficiently, you’ve been liberated from tedious infrastructure management and tuning, the data scientists are delivering new insights faster than ever, you can’t remember the last time anyone complained their dashboards were slow, and you have finally been able to stop using SFTP to transfer files to business partners.

Not only that, but even though you know your Snowflake account is protected against an availability zone failure within a region out-of-the-box, you had time to set up a disaster recovery site in case your cloud provider's entire region went down. Turns out it only took a few minutes thanks to the turn-key setup of DR. (Sub-5-minute RPO/RTO, anyone?)

As a bonus, you even established some secure, self-service workspace environments for business teams to do the ad-hoc analysis, data enrichment, and experimentation that is key to a data-driven business. Dynamic Data Masking and Row Level Access Control policies ensure teams are only seeing data they are authorized to see, without data duplication or an explosion of bespoke views for access control. Since the workspace schemas leverage Managed Access, security teams are happy as well.

Things are going pretty well… but something has been bothering you.

Trouble in paradise…

You have been watching the number of tables in business team workspace environments grow, and you are pretty sure a lot of them are not providing value anymore. Some of those DEV/QA environments are looking fairly old as well. But how can you tell what is important? It is not scalable to investigate every table or ask around about whether each one is still being used. Sure, data storage in Snowflake is cheap and essentially unlimited, but that is no reason not to be tidy. (You wanted a data lake, not a data swamp.)


Psst… just looking for the code? Scroll to the bottom of the article. ⬇️

Locking the business out is not an option

If the above sounds like you, you are not alone. Many organizations struggle to strike the right balance between enabling self-service for the business and the ever-present threat of data sprawl. Locking the business out of the core data platform only gives rise to “shadow IT,” with ungoverned data exports being shuffled away to rogue data marts, desktop databases, BI tool extracts, and “spread-marts,” holding data that is out of sync with the rest of the organization's data and nearly impossible to audit. That makes it an unattractive and risky option.

The better approach is to give business teams read access to the data they need directly on Snowflake and a workspace environment (dedicated database or schema) to create new objects with that data.

So if having business workspaces on the core data platform is the best option, how can we tell what tables have outlived their usefulness to anyone?

When I have asked data teams how they deal with this issue, their answers range from “we just delete the table and see if anyone complains” to “I just try not to think about it.” Neither of these options is particularly appealing.

Fortunately, with Snowflake, we have all the tools we need to discover exactly which tables are still useful to the business.

The “Snowflake” database and the Account Usage Schema

Snowflake provides a secure, tamper-proof database that contains metadata about every object in your Snowflake account. It also contains rich logging of any activity on the account over the past 365 days (with options to persist it longer): every query, session, login attempt, and much more. As a bonus, you don’t pay for the storage of this vast amount of information, as it is provided via Snowflake’s Snowgrid data-sharing technology. While this data can be leveraged for a number of use cases, we are going to use it for data observability, specifically the “freshness” pillar.

Note: By default, the SNOWFLAKE database is available only to the ACCOUNTADMIN role. To grant access to another role refer to the Snowflake documentation here: Enabling Snowflake Database Usage for Other Roles.
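In practice, the grant looks like this (run as ACCOUNTADMIN; REPORTING_ROLE is a placeholder for your own role):

-- REPORTING_ROLE is a placeholder; substitute the role that will run the queries.
USE ROLE ACCOUNTADMIN;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE REPORTING_ROLE;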

Let’s zero in on a few key views in the Account Usage schema that can help us craft a query to identify tables that have gone “stale.”

The first view we want to take a look at is the TABLES view.
According to the Snowflake documentation:

This Account Usage view displays a row for each table and view in the account.

This view contains information on every table and view that currently exists, or has ever existed, on the account.
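As a quick sketch, a query along these lines lists the current (not yet dropped) base tables with their creation and last-altered timestamps:

-- Current base tables and when they were created or last altered.
-- DELETED is NULL for tables that still exist.
SELECT table_catalog,
       table_schema,
       table_name,
       table_owner,
       created,
       last_altered
FROM snowflake.account_usage.tables
WHERE deleted IS NULL
  AND table_type = 'BASE TABLE'
ORDER BY last_altered;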

Next up is the ACCESS_HISTORY view.
According to the Snowflake documentation:

This Account Usage view can be used to query the access history of Snowflake objects (e.g. table, view, column) within the last 365 days.

This view reveals specifically which tables were accessed, down to the column level, by any query in the last 365 days. Even better, it doesn’t matter whether the table was accessed directly in the query or indirectly via a view based on the table. It also logs any tables that have been written to by any query in the last 365 days.

Note: This view is available only on Enterprise Edition and higher.
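As an illustration, a query in this spirit pulls the most recent read access per table. Flattening the BASE_OBJECTS_ACCESSED column means access through views is attributed back to the underlying tables:

-- Most recent read access per table in the last 365 days.
-- BASE_OBJECTS_ACCESSED is an array of objects, so we flatten it first.
SELECT obj.value:"objectName"::STRING AS table_name,
       MAX(ah.query_start_time)      AS last_accessed
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
WHERE obj.value:"objectDomain"::STRING = 'Table'
GROUP BY table_name
ORDER BY last_accessed;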

With these two views, we could answer the following questions:

  • When was a table last CREATED or ALTERED, and by whom?
  • When was a table last ACCESSED or MODIFIED, and by whom?

Of course, we might need more information than this to decide whether a particular table still has value. So let’s author a query that gives us more context by mixing in even more metadata that the Snowflake database makes available to us. Here is a list of the additional views we will use (a sketch combining a few of them follows the list).

  • QUERY_HISTORY: Can be used to query Snowflake query history by various dimensions (time range, session, user, warehouse, etc.) within the last 365 days (1 year).
  • SESSIONS: Provides information on the session, including information on the authentication method to Snowflake and the Snowflake login event. Snowflake returns one row for each session created over the last year.
  • TABLE_STORAGE_METRICS: Displays table-level storage utilization information, which is used to calculate the storage billing for each table in the account, including tables that have been dropped, but are still incurring storage costs.
  • OBJECT_DEPENDENCIES: Displays one row for each object dependency. For example, while creating a view from a single table, the view is dependent on the table. Snowflake returns one row to record the dependency of the view on the table.
  • TAG_REFERENCES: Displays the tags associated with each object. This view only records the direct relationship between the object and the tag.
    Note: The view only displays objects for which the current role for the session has been granted access privileges.
  • REFERENTIAL_CONSTRAINTS: Displays a row for each referential integrity constraint defined in the account.
    Note: The view only displays objects for which the current role for the session has been granted access privileges.
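To give a flavor of how these combine, here is a sketch that lists current tables no other object depends on, together with their active storage. It assumes TABLES.TABLE_ID lines up with TABLE_STORAGE_METRICS.ID and OBJECT_DEPENDENCIES.REFERENCED_OBJECT_ID, and that the referenced-object domain literal is 'TABLE'; spot-check both against a few rows on your account:

-- Current tables with no recorded dependents, plus their active storage.
-- Assumption: the ID columns across these views refer to the same table IDs.
SELECT t.table_catalog,
       t.table_schema,
       t.table_name,
       ROUND(m.active_bytes / POWER(1024, 3), 2) AS active_gb
FROM snowflake.account_usage.tables AS t
JOIN snowflake.account_usage.table_storage_metrics AS m
  ON m.id = t.table_id
LEFT JOIN snowflake.account_usage.object_dependencies AS d
  ON d.referenced_object_id = t.table_id
 AND d.referenced_object_domain = 'TABLE'
WHERE t.deleted IS NULL
  AND d.referenced_object_id IS NULL
ORDER BY active_gb DESC NULLS LAST;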

Example Stored Procedure

Congratulations if you have made it this far!

At the bottom of the article, I have provided the code for an example stored procedure that writes its results to a table, listing the tables in your Snowflake account that have not been accessed for a configurable period of time.

While Snowflake gives you the flexibility to develop stored procedures in JavaScript, Java, and Python, this procedure was developed in SQL by way of Snowflake Scripting.

To run the procedure, just call it after it is installed.

The first parameter is the minimum number of days without activity you consider a table to be “stale.” The second parameter controls whether the output table should be truncated, or if rows should just be inserted. (The table has a “report run id” column to discriminate between runs.)
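For example, a hypothetical invocation (the actual procedure name and signature are in the linked script, so adjust to match):

-- Flag tables idle for 90+ days and truncate the report table before inserting.
-- stale_table_report_proc is a placeholder name; see the linked script.
CALL stale_table_report_proc(90, TRUE);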


You can choose to run the procedure periodically via a native Snowflake task in your account.
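For example, a sketch of a weekly schedule (task and procedure names are placeholders; omitting WAREHOUSE makes this a serverless task, which requires the EXECUTE MANAGED TASK privilege):

-- Run the report every Monday at 06:00 UTC.
CREATE OR REPLACE TASK weekly_stale_table_report
  SCHEDULE = 'USING CRON 0 6 * * 1 UTC'
AS
  CALL stale_table_report_proc(90, TRUE);

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK weekly_stale_table_report RESUME;

Happy hunting!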

Notes

  • Detailed instructions for installing and using this procedure are in the code below.
  • Depending on the number of objects and query activity on your account, this procedure could take several minutes to complete.
  • The output table STALE_TABLE_REPORT contains 1 row for every table that meets the threshold set in the parameter of the stored procedure.
  • The output table will be located in the same schema as the procedure because it executes as OWNER.

Find the code here: stale_table_report_procedure.sql
