Housekeeping in Snowflake with SQL and dbt

Be Sure to Take Advantage of Account Usage History in Snowflake

Venkat Sekar
Hashmap, an NTT DATA Company
7 min read · Apr 28, 2020

After several months of very rapid adoption of Snowflake across a range of business use cases, you’ve probably created and loaded a sizeable number of existing tables from various sources. In addition, new tables and views have also been created within Snowflake as part of various data pipelines and new analytics applications.

Although Snowflake’s storage and compute are inexpensive and can be monitored, it is beneficial to maintain some housekeeping of the various objects that are created within Snowflake. This housekeeping would involve some of the following:

  • List out the tables/views that are least used.
  • Identify when the tables/views were accessed and by whom.
  • List out the tables that can be dropped.
  • Are there any tables that are loaded but not used?

You can do this now and get a better handle on how you use the service; importantly, no additional infrastructure is needed since everything is done within Snowflake’s SaaS service.

Of course, you could interview all the various Snowflake users and get an answer, but that would take far too long and not be very reliable in the end. Some users would have moved on to other parts of the organization. There could also be “ghost tables” that get loaded from sources but don’t actually get used in any data pipelines.

My Recommended Solution Approach

Another way to approach this problem is by using the information present in the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY table. If this sounds interesting, then please follow along, as I demonstrate.

In this section, I will walk through the implementation at a high level. The SQL code used for the implementation is available in my GitLab repo.

NOTE: I am using my sandbox environment, which is reserved for learning, training, and other non-critical purposes, so conventions such as SQL coding standards and object (tables, views, etc.) naming are not strictly followed.

Usage Query History in Snowflake

The SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view contains the various queries issued by users against any object. Every SQL execution is logged and is available through this view. The actual statement issued is available in the ‘QUERY_TEXT’ column.

Here is a sample query that was issued:

SELECT count('x') AS count
FROM (
    SELECT foo.SUBSCRIPTION_ID, foo.ITEMS, foo.__SDC_PRIMARY_KEY,
           foo.DATASOURCE, foo._SDC_SEQUENCE, foo._SDC_RECEIVED_AT,
           foo._SDC_BATCHED_AT, foo._SDC_TABLE_VERSION
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY STAGING_DATA_13_LOADER_SNOW10_11761_72FD67BD_4453_4EC1_8E05_83351F9F767B.__SDC_PRIMARY_KEY ORDER BY _sdc_sequence DESC) AS rnk
        FROM STAGING_DATA_LOADER ) AS foo
    WHERE rnk = 1
)
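
Before doing anything more elaborate, you can simply browse the view to see what is captured. A minimal example (these are documented QUERY_HISTORY columns):

-- most recent queries logged for the account
SELECT QUERY_ID,
       USER_NAME,
       START_TIME,
       QUERY_TEXT
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
ORDER BY START_TIME DESC
LIMIT 100;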

To determine which table is used in a query, we can do a LIKE match of the query_text against the list of table names from the view SNOWFLAKE.ACCOUNT_USAGE.TABLES.

To determine the view used, we can follow a similar approach using the view: SNOWFLAKE.ACCOUNT_USAGE.VIEWS.

To determine the underlying tables that are involved in the view, you can use the ‘VIEW_DEFINITION’ column.
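
Here is a rough sketch of that LIKE-based association, run directly against the ACCOUNT_USAGE views (the full implementation works off snapshot copies, as described later):

-- associate each logged query with the tables whose names appear in its text
SELECT q.QUERY_ID,
       q.START_TIME,
       q.USER_NAME,
       t.TABLE_CATALOG,
       t.TABLE_SCHEMA,
       t.TABLE_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY q
JOIN SNOWFLAKE.ACCOUNT_USAGE.TABLES t
  ON UPPER(q.QUERY_TEXT) LIKE '%' || t.TABLE_NAME || '%'
WHERE t.DELETED IS NULL;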

Once the queries have been associated with the tables and views, we can use the query time, the number of queries issued, and so on to answer the housekeeping questions mentioned above (repeated below):

  • List out the tables/views that are least used.
  • Identify when the tables/views were accessed and by whom.
  • List out the tables that can be dropped.
  • Are there any tables that are loaded but not used?

A Prerequisite — Make Sure You Do This First

The solution relies on views from SNOWFLAKE.ACCOUNT_USAGE, so I recommend the following steps first:

  • Define a custom role (e.g. HOUSEKEEPER).
  • Grant ‘imported privileges’ to this role. Reference: Enabling Account Usage for Other Roles
  • Create a separate schema (e.g. HOUSEKEEPING) in which the HOUSEKEEPER role can create and query tables, as sketched after the note below.

NOTE: Be careful if you have sensitive data

  • The ‘QUERY_TEXT’ column contains the actual query issued by various users and processes, and it’s possible that the query could contain sensitive data. I recommend that the ‘HOUSEKEEPER’ role be given only to specific people.
  • Limit access to the schema ‘HOUSEKEEPING’, as it will contain copies of the query history data.
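
Here is a minimal setup sketch along those lines; the role, database, and schema names (HOUSEKEEPER, MY_DB, HOUSEKEEPING) are placeholders for illustration:

-- custom role for housekeeping work (name is an example)
CREATE ROLE IF NOT EXISTS HOUSEKEEPER;

-- let the role read the SNOWFLAKE.ACCOUNT_USAGE views
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE HOUSEKEEPER;

-- dedicated schema for snapshot and result tables (MY_DB is a placeholder)
CREATE SCHEMA IF NOT EXISTS MY_DB.HOUSEKEEPING;
GRANT USAGE ON DATABASE MY_DB TO ROLE HOUSEKEEPER;
GRANT ALL ON SCHEMA MY_DB.HOUSEKEEPING TO ROLE HOUSEKEEPER;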

Logic Flow

Here is the screenshot of the sequence of steps:

Below is an explanation of the various stages; the stage names are referenced in the corresponding steps:

Create a copy of the following tables in the schema ‘HOUSEKEEPING’

  • SNOWFLAKE.ACCOUNT_USAGE.TABLES
  • SNOWFLAKE.ACCOUNT_USAGE.VIEWS
  • SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY

I recommend taking a snapshot copy, since queries issued directly against these views can be slow depending on conditions.
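
The snapshot itself is just a CREATE TABLE AS SELECT per view; a sketch, assuming the HOUSEKEEPING schema from the prerequisite step (the SNAPSHOT_* table names are illustrative):

CREATE OR REPLACE TABLE HOUSEKEEPING.SNAPSHOT_QUERY_HISTORY AS
    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY;

CREATE OR REPLACE TABLE HOUSEKEEPING.SNAPSHOT_TABLES AS
    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TABLES;

CREATE OR REPLACE TABLE HOUSEKEEPING.SNAPSHOT_VIEWS AS
    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.VIEWS;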

  1. [Stage: active_base_tables] Get the list of active tables. Active tables are those that have not been deleted.
  2. [Stage: active_views & tables_definedin_views] Get the list of active views and identify the tables used in them.
  3. [Stage: active_tables] Create a temp table ‘active_tables’ that contains the combined results of #1 and #2.
  4. [Stage: deduplicated_queries] Deduplicate the queries found in SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY. Query entries are typically unique, but on occasion I have come across two identical records in the table (a sketch of this step follows the list).
  5. [Stage: queries_and_views, queries_and_tables, queries_with_metadata] Using the query_text, we associate each query with the tables and views it references. We store the result in the table ‘queries_with_metadata’.
  6. [Stage: views_usage, tables_usage, tables_not_used] We can now issue different queries to get the various housekeeping-related answers.
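
For illustration, the deduplication in step 4 can be done with a window function over the snapshot table (names follow the sketches above and may differ from the repo):

CREATE OR REPLACE TABLE HOUSEKEEPING.DEDUPLICATED_QUERIES AS
    SELECT *
    FROM HOUSEKEEPING.SNAPSHOT_QUERY_HISTORY
    -- keep one row per query_id, preferring the latest record
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY QUERY_ID
        ORDER BY START_TIME DESC
    ) = 1;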

Results of the Housekeeping Queries

Below are some screenshots of results in my sandbox environment.

VIEWS_USAGE

This is the list of views and when they were last used, based on queries issued against them:

TABLES_USAGE

This is the list of tables and when they were last used, based on queries issued against them.

TABLES_NOT_USED

This is a list of tables, along with database and schema information, that were not used in any queries. Based on the last-used timestamp and usage count, you can determine which tables to retain and which to drop.
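
As a rough sketch of what such a result query can look like, this anti-join lists snapshot tables whose names never appear in any query text (table names follow the earlier sketches and are illustrative only):

SELECT t.TABLE_CATALOG,
       t.TABLE_SCHEMA,
       t.TABLE_NAME
FROM HOUSEKEEPING.SNAPSHOT_TABLES t
LEFT JOIN HOUSEKEEPING.DEDUPLICATED_QUERIES q
  ON UPPER(q.QUERY_TEXT) LIKE '%' || t.TABLE_NAME || '%'
WHERE t.DELETED IS NULL
  AND q.QUERY_ID IS NULL;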

Current Limitations

The code/logic does not solve for every type of scenario or every environment; expect some modifications based on your situation.

Below are some limitations to consider:

  • Snowflake Usage: As I mentioned, this walkthrough was done in my sandbox environment. It is not a very active environment, so your mileage will vary; the time and cost of the required processing will also depend on the warehouse used.
  • Exact Database and Schema: In your environment, it is highly likely that tables or views with the same name exist across different schemas and databases. For this reason, filtering down to the exact database/schema in which the required table/view resides will require some work, but it is very doable.

Implementation Code

The code is pure SQL (thank you Snowflake!) and all executed within the Snowflake environment. I’ve implemented the code as a dbt project in my GitLab repo: identobjectaccesstime.

You can see the SQL used in the project README.md.
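
To give a feel for the dbt side, each stage is just a SQL model. A minimal sketch of what the dedup stage could look like as a dbt model (the model name and the referenced model are illustrative, not necessarily what the repo uses):

-- models/deduplicated_queries.sql (illustrative file name)
{{ config(materialized='table') }}

SELECT *
FROM {{ ref('snapshot_query_history') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY QUERY_ID
    ORDER BY START_TIME DESC
) = 1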

What’s Next?

What I demonstrated is just the basic procedure, but this approach could answer many more questions too. Here are some more example queries:

  • Which tables are inter-related?
  • Which tables are the most active?
  • Who would be affected if a table were accidentally dropped?

These questions can be solved by defining additional SQL queries on top of the basics that have already been implemented.
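
For example, “Which tables are the most active?” could be answered with something like the following, again building on the illustrative tables sketched above:

SELECT t.TABLE_NAME,
       COUNT(q.QUERY_ID) AS query_count,
       MAX(q.START_TIME) AS last_used
FROM HOUSEKEEPING.SNAPSHOT_TABLES t
JOIN HOUSEKEEPING.DEDUPLICATED_QUERIES q
  ON UPPER(q.QUERY_TEXT) LIKE '%' || t.TABLE_NAME || '%'
GROUP BY t.TABLE_NAME
ORDER BY query_count DESC;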

Final Thoughts

Understanding Snowflake’s account usage dataset and thinking outside the box will help you develop operational insights into your Snowflake environment. Doing these exercises will also help you make prudent decisions when buying SaaS services and tackle your Snowflake usage issues.

I hope this Snowflake housekeeping guide has helped you and provided valuable insight into getting even more value from your Snowflake cloud data warehouse. It would be great to hear your thoughts and what your next set of moves is with Snowflake!

Need Help with Your Cloud Initiatives?

If you are considering the cloud for migrating or modernizing data and analytics products and applications or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud (and Snowflake) service offerings.

Other Tools and Content For You

Feel free to share on other channels and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s Data Rebels on Tap podcast on Spotify, Apple, Google, and other popular apps.

Venkat Sekar is Regional Director for Hashmap Canada and is an architect and consultant providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
