Housekeeping in Snowflake with SQL and dbt

Be Sure to Take Advantage of Account Usage History in Snowflake

Venkat Sekar
Hashmap, an NTT DATA Company
7 min read · Apr 28, 2020

After several months of very rapid adoption of Snowflake across a range of business use cases, you’ve probably created and loaded a sizeable number of existing tables from various sources. In addition, new tables and views have also been created within Snowflake as part of various data pipelines and new analytics applications.

Although Snowflake’s storage and compute are inexpensive and can be monitored, it is beneficial to maintain some housekeeping of the various objects that are created within Snowflake. This housekeeping would involve some of the following:

  • List out the tables/views that are least used.
  • Identify when the tables/views were accessed and by whom.
  • List out the tables that can be dropped.
  • Are there any tables that are loaded but not used?

You can do this now and get a better handle on how you use the service; importantly, no additional infrastructure is needed since everything is done within Snowflake’s SaaS service.

Of course, you could interview all the various Snowflake users and get an answer, but that would take far too long and not be very reliable in the end. Some users would have moved on to other parts of the organization. There could also be “ghost tables” that get loaded from sources but don’t actually get used in any data pipelines.

My Recommended Solution Approach

Another way to approach this problem is by using the information present in the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY table. If this sounds interesting, then please follow along, as I demonstrate.

In this section, I will walk through the implementation at a high level. The SQL code used for the implementation is available in my GitLab repo.

NOTE: I am using my sandbox environment, which is reserved for learning, training, and other non-critical purposes, so conventions such as SQL coding standards and object (tables, views, etc.) naming are not strictly followed.

Usage Query History in Snowflake

The SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view contains the various queries issued by users against any object. Every SQL execution is logged and is available through this view. The actual statement issued is available in the ‘QUERY_TEXT’ column.

Here is a sample query that was issued:

SELECT count('x') AS count
FROM (
    SELECT foo.SUBSCRIPTION_ID, foo.ITEMS, foo.__SDC_PRIMARY_KEY,
           foo.DATASOURCE, foo._SDC_SEQUENCE, foo._SDC_RECEIVED_AT,
           foo._SDC_BATCHED_AT, foo._SDC_TABLE_VERSION
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY STAGING_DATA_13_LOADER_SNOW10_11761_72FD67BD_4453_4EC1_8E05_83351F9F767B.__SDC_PRIMARY_KEY ORDER BY _sdc_sequence DESC) AS rnk
        FROM STAGING_DATA_LOADER ) AS foo
    WHERE rnk = 1
)
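
Before doing anything more elaborate, you can simply browse the view to see what is captured. A minimal example (these are documented QUERY_HISTORY columns):

-- most recent queries logged for the account
SELECT QUERY_ID,
       USER_NAME,
       START_TIME,
       QUERY_TEXT
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
ORDER BY START_TIME DESC
LIMIT 100;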

To determine which table is used in a query, we can do a LIKE match of the query_text against the list of table names from the view SNOWFLAKE.ACCOUNT_USAGE.TABLES.

To determine the view used, we can follow a similar approach using the view: SNOWFLAKE.ACCOUNT_USAGE.VIEWS.

To determine the underlying tables that are involved in the view, you can use the ‘VIEW_DEFINITION’ column.
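
Here is a rough sketch of that LIKE-based association, run directly against the ACCOUNT_USAGE views (the full implementation works off snapshot copies, as described later):

-- associate each logged query with the tables whose names appear in its text
SELECT q.QUERY_ID,
       q.START_TIME,
       q.USER_NAME,
       t.TABLE_CATALOG,
       t.TABLE_SCHEMA,
       t.TABLE_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY q
JOIN SNOWFLAKE.ACCOUNT_USAGE.TABLES t
  ON UPPER(q.QUERY_TEXT) LIKE '%' || t.TABLE_NAME || '%'
WHERE t.DELETED IS NULL;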

Once the queries have been associated with the tables and views, we can use the query time, the number of queries issued, and so on to answer the housekeeping questions mentioned above (repeated below):

  • List out the tables/views that are least used.
  • Identify when the tables/views were accessed and by whom.
  • List out the tables that can be dropped.
  • Are there any tables that are loaded but not used?

A Prerequisite — Make Sure You Do This First

The solution relies on views from SNOWFLAKE.ACCOUNT_USAGE, so I recommend the following steps first:

  • Define a custom role (e.g. HOUSEKEEPER).
  • Grant ‘imported privileges’ to this role. Reference: Enabling Account Usage for Other Roles
  • Create a separate schema (e.g. HOUSEKEEPING) in which the HOUSEKEEPER role can create and query tables, as sketched after the note below.

NOTE: Be careful if you have sensitive data

  • The ‘QUERY_TEXT’ column contains the actual query issued by various users and processes, and it’s possible that the query could contain sensitive data. I recommend that the ‘HOUSEKEEPER’ role be given only to specific people.
  • Limit access to the schema ‘HOUSEKEEPING’, as it will contain copies of the query history data.
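
Here is a minimal setup sketch along those lines; the role, database, and schema names (HOUSEKEEPER, MY_DB, HOUSEKEEPING) are placeholders for illustration:

-- custom role for housekeeping work (name is an example)
CREATE ROLE IF NOT EXISTS HOUSEKEEPER;

-- let the role read the SNOWFLAKE.ACCOUNT_USAGE views
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE HOUSEKEEPER;

-- dedicated schema for snapshot and result tables (MY_DB is a placeholder)
CREATE SCHEMA IF NOT EXISTS MY_DB.HOUSEKEEPING;
GRANT USAGE ON DATABASE MY_DB TO ROLE HOUSEKEEPER;
GRANT ALL ON SCHEMA MY_DB.HOUSEKEEPING TO ROLE HOUSEKEEPER;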

Logic Flow

Here is the screenshot of the sequence of steps:

Below is an explanation of the various stages; the stage names are referenced in the corresponding steps:

Create a copy of the following tables in the schema ‘HOUSEKEEPING’

  • SNOWFLAKE.ACCOUNT_USAGE.TABLES
  • SNOWFLAKE.ACCOUNT_USAGE.VIEWS
  • SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY

I recommend taking a snapshot copy, since queries issued directly against these views can be slow depending on conditions.
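
The snapshot itself is just a CREATE TABLE AS SELECT per view; a sketch, assuming the HOUSEKEEPING schema from the prerequisite step (the SNAPSHOT_* table names are illustrative):

CREATE OR REPLACE TABLE HOUSEKEEPING.SNAPSHOT_QUERY_HISTORY AS
    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY;

CREATE OR REPLACE TABLE HOUSEKEEPING.SNAPSHOT_TABLES AS
    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TABLES;

CREATE OR REPLACE TABLE HOUSEKEEPING.SNAPSHOT_VIEWS AS
    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.VIEWS;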

  1. [Stage: active_base_tables] Get the list of active tables. Active tables are those that have not been deleted.
  2. [Stage: active_views & tables_definedin_views] Get the list of active views and identify the tables used in them.
  3. [Stage: active_tables] Create a temp table ‘active_tables’ that contains the combined results of #1 and #2.
  4. [Stage: deduplicated_queries] Deduplicate the queries found in SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY. Query entries are typically unique, but on occasion I have come across two identical records in the table (a sketch of this step follows the list).
  5. [Stage: queries_and_views, queries_and_tables, queries_with_metadata] Using the query_text, we associate each query with the tables and views it references. We store the result in the table ‘queries_with_metadata’.
  6. [Stage: views_usage, tables_usage, tables_not_used] We can now issue different queries to get the various housekeeping-related answers.
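
For illustration, the deduplication in step 4 can be done with a window function over the snapshot table (names follow the sketches above and may differ from the repo):

CREATE OR REPLACE TABLE HOUSEKEEPING.DEDUPLICATED_QUERIES AS
    SELECT *
    FROM HOUSEKEEPING.SNAPSHOT_QUERY_HISTORY
    -- keep one row per query_id, preferring the latest record
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY QUERY_ID
        ORDER BY START_TIME DESC
    ) = 1;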

Results of the Housekeeping Queries

Below are some screenshots of results in my sandbox environment.

VIEWS_USAGE

This is the list of views and when they were last used, based on queries issued against them:

TABLES_USAGE

This is the list of tables and when they were last used, based on queries issued against them.

TABLES_NOT_USED

This is a list of tables, along with database and schema information, that were not used in any queries. Based on the last-used timestamp and usage count, you can determine which tables to retain and which to drop.
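
As a rough sketch of what such a result query can look like, this anti-join lists snapshot tables whose names never appear in any query text (table names follow the earlier sketches and are illustrative only):

SELECT t.TABLE_CATALOG,
       t.TABLE_SCHEMA,
       t.TABLE_NAME
FROM HOUSEKEEPING.SNAPSHOT_TABLES t
LEFT JOIN HOUSEKEEPING.DEDUPLICATED_QUERIES q
  ON UPPER(q.QUERY_TEXT) LIKE '%' || t.TABLE_NAME || '%'
WHERE t.DELETED IS NULL
  AND q.QUERY_ID IS NULL;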

Current Limitations

The code/logic does not solve for every type of scenario or every environment; expect some modifications based on your situation.

Below are some limitations to consider:

  • Snowflake Usage: As I mentioned, this walkthrough was done in my sandbox environment. It is not a very active environment, so your mileage will vary; the time and cost of the required processing will also depend on the warehouse used.
  • Exact Database and Schema: In your environment, it is highly likely that tables or views with the same name exist across different schemas and databases. For this reason, filtering down to the exact database/schema in which the required table/view resides will require some work, but it is very doable.

Implementation Code

The code is pure SQL (thank you Snowflake!) and all executed within the Snowflake environment. I’ve implemented the code as a dbt project in my GitLab repo: identobjectaccesstime.

You can see the SQL used in the project README.md.
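
To give a feel for the dbt side, each stage is just a SQL model. A minimal sketch of what the dedup stage could look like as a dbt model (the model name and the referenced model are illustrative, not necessarily what the repo uses):

-- models/deduplicated_queries.sql (illustrative file name)
{{ config(materialized='table') }}

SELECT *
FROM {{ ref('snapshot_query_history') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY QUERY_ID
    ORDER BY START_TIME DESC
) = 1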

What’s Next?

What I demonstrated is just the basic procedure, but this approach could answer many more questions too. Here are some more example queries:

  • Which tables are inter-related?
  • Which tables are the most active?
  • Who would be affected if a table were accidentally dropped?

These questions can be solved by defining additional SQL queries on top of the basics that have already been implemented.
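
For example, “Which tables are the most active?” could be answered with something like the following, again building on the illustrative tables sketched above:

SELECT t.TABLE_NAME,
       COUNT(q.QUERY_ID) AS query_count,
       MAX(q.START_TIME) AS last_used
FROM HOUSEKEEPING.SNAPSHOT_TABLES t
JOIN HOUSEKEEPING.DEDUPLICATED_QUERIES q
  ON UPPER(q.QUERY_TEXT) LIKE '%' || t.TABLE_NAME || '%'
GROUP BY t.TABLE_NAME
ORDER BY query_count DESC;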

Final Thoughts

Understanding Snowflake’s account usage dataset and thinking outside the box will help you develop operational insights into your Snowflake environment. Doing these exercises will also help you make prudent decisions when buying SaaS services and tackle your Snowflake usage issues.

I hope this Snowflake housekeeping guide has helped you and provided valuable insight into getting even more value from your Snowflake cloud data warehouse. It would be great to hear your thoughts and what your next set of moves is with Snowflake!

Need Help with Your Cloud Initiatives?

If you are considering the cloud for migrating or modernizing data and analytics products and applications or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud (and Snowflake) service offerings.

Other Tools and Content For You

Feel free to share on other channels and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s Data Rebels on Tap podcast on Spotify, Apple, Google, and other popular apps.

Venkat Sekar is Regional Director for Hashmap Canada and is an architect and consultant providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
