Snowflake: Observability of Access to Your Data

Cesar Segura
10 min read · Apr 28, 2024


Image generated by AI @ https://tengr.ai

Let's start by highlighting why observability is so important in our data platform. Different data architecture platforms demand deep insight into what is happening in their systems: who is doing what, how, and when? Finally, we need to monitor through comfortable dashboards that provide valuable metrics, KPIs and, above all, the capability to deliver all this information in real time. This is where observability comes into play, providing these features and more so that our professionals can take the right actions to improve, correct and design their systems and data.

Snowflake as a platform offers a wide range of observability features. We can mention the following:

Snowsight Dashboards: monitoring dashboards available directly in the Snowflake UI.

Data Objects: views that provide valuable information about queries, logins, costs, timings, and more, and that can be accessed through SQL commands.

As part of this observability journey, in this chapter we will see how to use these data objects to check all the access transactions performed against each object, such as tables and their columns, in our Snowflake database. We want to obtain, in a structured way, all the tables and columns accessed by each query, the query type, who executed the query to request this data, and the timestamp of the transaction. At the end of this article, we will see some useful cases where this can be applied.

This is the first part of the Observability on Access Transactions series on Snowflake. If you haven't seen the other parts about DML or DDL, don't worry: in this one we will review similar concepts again and introduce new ones for this case, explaining our new goal step by step.

Hands On

I will start by checking the different objects we have available in Snowflake; these are located in the SNOWFLAKE.ACCOUNT_USAGE schema.

QUERY_HISTORY data object

Here we find the QUERY_HISTORY view, which provides useful information about all the queries executed on the platform.

An example of the output of this view could be the following (note that I will only show the information relevant to this article):

A random extract from QUERY_HISTORY view

We will check the meaning of these columns (a small sample query follows the list):

  • QUERY_ID: Unique identifier of each query.
  • QUERY_TEXT: The SQL text of the query itself.
  • START_TIME / END_TIME: Start and end time of the query.
  • QUERY_TYPE: The type of statement executed; it covers a great variety of query types (procedure calls, DML, DDL, SELECT, etc.). For details, refer to the documentation.
  • USER_NAME: The user who executed the query.
  • ROLE_NAME: The role used by the user who executed the query.
  • WAREHOUSE_NAME: The name of the warehouse.
  • WAREHOUSE_SIZE: The size of the warehouse; the list of available sizes is in the documentation.
  • TOTAL_ELAPSED_TIME: The elapsed time of the query, in milliseconds.
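
If you want to reproduce this kind of extract yourself, a minimal query like the one below returns these fields for the most recent queries (the ordering and LIMIT are just illustrative choices):

SELECT QUERY_ID, QUERY_TEXT, START_TIME, END_TIME, QUERY_TYPE,
       USER_NAME, ROLE_NAME, WAREHOUSE_NAME, WAREHOUSE_SIZE, TOTAL_ELAPSED_TIME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
ORDER BY START_TIME DESC   -- most recent executions first
LIMIT 10;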

You can see that this view contains neither the table name nor the columns (in a structured way) affected by the query. We will find that information in the next view.

ACCESS_HISTORY data object

Here we find the ACCESS_HISTORY view, which provides useful information about the relationship between queries and the objects accessed or modified on the platform.

An example of the output of this view could be the following (note again that I will only show the information relevant to this article):

Random extract from the ACCOUNT_USAGE.ACCESS_HISTORY view

We will check the meaning of these columns:

QUERY_ID: Unique identifier of each query (the same as above).

BASE_OBJECTS_ACCESSED: The list of base objects accessed by the query (in a detailed way): tables, views, columns, procedures, UDFs, etc. As you can imagine, this data is stored as semi-structured JSON.

DIRECT_OBJECTS_ACCESSED: The list of objects accessed directly by the query (in a detailed way): tables, views, columns, procedures, UDFs, etc. This data is also stored as semi-structured JSON.

The main difference between the two fields is the following: when you execute a statement that uses a view (e.g. one named MYVIEW), the view will be shown in the DIRECT_OBJECTS_ACCESSED field, while the underlying objects referenced by that view will be listed in the BASE_OBJECTS_ACCESSED field.

For example, I am using a view in the query below:

Example of a statement that uses a VIEW to insert into a table, checked in the QUERY_HISTORY view

If we check the view definition in the database, it looks like this (shown here only for better understanding):

SELECT
    T.COD_DATASET_FIELD, T.COD_BUSINESS_DEFINITION, T.COD_DATASET,
    T.DT_LOAD, T.COD_COLUMN_NAME_TARGET_FLEX_GROUP, T.DT_END, T.SW_TRACK_DIFF,
    T.SW_VIRTUAL_FIELD_VALUE, T.COD_TYPE, T.DESC_BUSINESS_DEFINITION,
    T.ID_ORDER, T.SW_CLUSTERIZABLE
FROM EDW.do_meta_psa.datasets_fields_tmp T
LEFT JOIN (
    SELECT
        COD_MODEL,
        DT_MODEL_LOAD,
        ID_MODEL_RUN,
        ROW_NUMBER() OVER (PARTITION BY COD_MODEL ORDER BY DT_MODEL_LOAD DESC) ULTIMO_REG
    FROM edw.do_meta_psa.model_load_runs
    WHERE COD_TYPE_RUN = 'DEPLOY'
) M
    ON M.COD_MODEL = 'DV_SDG_TR'
    AND M.ULTIMO_REG = 1

Now, if we look up this statement in the ACCESS_HISTORY view once it has been executed, it will be shown like this:

Checking the use of a VIEW in ACCESS_HISTORY

Now we are going to compare the contents of both fields. Taking into account the view definition seen previously, we will see the following (on the left side the view itself, and on the right side all the tables/views, with their fields, accessed through the view on the left):

Comparing DIRECT_OBJECTS_ACCESSED vs BASE_OBJECTS_ACCESSED
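
To make the comparison concrete, this is roughly the shape of both fields for the statement above (illustrative only; <MY_VIEW_NAME> is a placeholder for the actual view name, and the ... marks columns omitted for brevity):

DIRECT_OBJECTS_ACCESSED (the view used in the statement):
[ { "objectDomain": "View",
    "objectName": "EDW.DO_META_PSA.<MY_VIEW_NAME>",
    "columns": [ { "columnName": "COD_DATASET_FIELD" }, ... ] } ]

BASE_OBJECTS_ACCESSED (the underlying tables of that view):
[ { "objectDomain": "Table",
    "objectName": "EDW.DO_META_PSA.DATASETS_FIELDS_TMP",
    "columns": [ { "columnName": "COD_DATASET_FIELD" }, ... ] },
  { "objectDomain": "Table",
    "objectName": "EDW.DO_META_PSA.MODEL_LOAD_RUNS",
    "columns": [ { "columnName": "COD_MODEL" }, ... ] } ]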

It’s time to query. We will:

  1. Join both views on QUERY_ID (see the sketch below).
  2. Filter DATABASE_ID IS NOT NULL: exclude statements executed by service compute.
  3. Filter DIRECT_OBJECTS_ACCESSED::VARCHAR <> '[]': exclude statements that do not access any data objects.
Sample of BASE_OBJECTS_ACCESSED column in ACCESS_HISTORY view
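
A minimal sketch of that join with both filters applied could look like this (the full version appears later as the CTE named T in the final view):

SELECT AH.QUERY_ID, QH.QUERY_TEXT, QH.START_TIME, QH.END_TIME,
       AH.DIRECT_OBJECTS_ACCESSED, AH.BASE_OBJECTS_ACCESSED,
       QH.QUERY_TYPE, QH.USER_NAME, QH.ROLE_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY AH
INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY QH
    ON AH.QUERY_ID = QH.QUERY_ID
WHERE QH.DATABASE_ID IS NOT NULL                       -- exclude statements run by service compute
  AND AH.DIRECT_OBJECTS_ACCESSED::VARCHAR <> '[]';     -- exclude statements that access no data objects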

Now that we have this information, the next step is to extract the table information in a structured way. Our strategy is based on getting both fields (this is why you will see a UNION of two SELECTs), so in this case:

  • The BASE_OBJECTS_ACCESSED field will provide the rows tagged with TYPE_ACCESS = 'BASE_OBJECT'
  • The DIRECT_OBJECTS_ACCESSED field will provide the rows tagged with TYPE_ACCESS = 'DIRECT_OBJECT'

The use of the FLATTEN functionality will allow us to extract that information.

In our case we will use: LATERAL FLATTEN(BASE_OBJECTS_ACCESSED)

In this case the approach changes with respect to retrieving only the tables/views accessed: we also have to flatten on the columns, so this part is important:

LATERAL FLATTEN(INPUT => F1.VALUE:"columns")

Note that, because we flatten on the columns attribute, the query implicitly filters out any entries that do not contain columns. So the objects retrieved will be tables and views only.
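
Putting both FLATTENs together, a minimal sketch of the extraction over BASE_OBJECTS_ACCESSED looks like this (the DIRECT variant is identical, flattening DIRECT_OBJECTS_ACCESSED instead):

SELECT QUERY_ID,
       F1.VALUE:"objectDomain"::VARCHAR  TYPE_OF_OBJECT,
       F1.VALUE:"objectName"::VARCHAR    TABLE_NAME_FULL,
       col.VALUE:"columnName"::VARCHAR   COLUMN_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY,
     LATERAL FLATTEN(BASE_OBJECTS_ACCESSED) F1,           -- one row per object accessed
     LATERAL FLATTEN(INPUT => F1.VALUE:"columns") col;    -- one row per column of that object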

Now we are finally going to code our query to retrieve the list of data objects and columns accessed by any statement, for both the BASE and DIRECT cases, adding the rest of our auditing information from the QUERY_HISTORY view:

Joining QUERY_HISTORY vs ACCESS_HISTORY (adding both DIRECT and BASE access information)

In the last steps, we are also interested in getting some information about the tables themselves. This information will be extracted from the ACCOUNT_USAGE.TABLES view. The metadata we want to extract could be:

TABLE_OWNER: Owner of the table.

TABLE_TYPE: Type of the object. It can be BASE TABLE, TEMPORARY TABLE, EXTERNAL TABLE, EVENT TABLE, VIEW, or MATERIALIZED VIEW.

ROW_COUNT: Number of rows prior to dropping (aliased in the next image as ROW_COUNT_BEFORE_DROP).

We have to take care when joining with this view, because it gives us the different states of each table as it is created, dropped and last updated. So, with all the previous queries in one CTE named OBJ_ACC, we will join our new data object query not only by its qualified name, but also by the time interval of the table's life cycle. Check this join:

Query objects accessed previously, with the ACCOUNT_USAGE.TABLES view to get some attributes
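
In essence, the extra join condition adds the table's validity interval, so each access is matched against the version of the table that existed at that moment. A sketch of the join condition (OBJ_ACC being the CTE with the flattened accesses):

LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE."TABLES" TB
    ON TB.TABLE_CATALOG = OBJ_ACC.TABLE_CATALOG
    AND TB.TABLE_SCHEMA = OBJ_ACC.TABLE_SCHEMA
    AND TB.TABLE_NAME = OBJ_ACC.TABLE_NAME
    -- match the access against the table version alive at that time
    AND OBJ_ACC.LAST_ACCESS BETWEEN TB.CREATED AND COALESCE(TB.DELETED, CURRENT_TIMESTAMP())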

Finally, we are interested in getting some information about the columns as well. This information will be extracted from the ACCOUNT_USAGE.COLUMNS view. The metadata we want to extract could be:

IS_NULLABLE: Flag that indicates whether the column can be null (YES/NO).

DATA_TYPE: The data type of the column.

In this case, there is a lot of additional information you can retrieve, but as an example we are retrieving only these columns.

We have to take care when joining with this view too: it gives us the different states of each table column as it is created, dropped and last updated. So, with all the previous queries in the OBJ_ACC CTE, we will join our new table-column access view not only by table and column, but also by the time interval of the column's life cycle, as previously done with the ACCOUNT_USAGE.TABLES view. In this case, however, we don't have a CREATED field, so we will have to build it based on the history of changes in the DELETED field. We will use the LEAD function here. Check this part:

Query objects accessed previously, with the ACCOUNT_USAGE.COLUMNS view to get some attributes
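
The key part is rebuilding a CREATED timestamp per column from its DELETED history with LEAD, and then joining on that interval. A sketch of this join (the complete version appears in the final view below):

LEFT JOIN (
    SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, IS_NULLABLE, DATA_TYPE, DELETED,
           -- the previous version's DELETED timestamp acts as this version's CREATED timestamp
           COALESCE(
               LEAD(DELETED, 1) OVER (PARTITION BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
                                      ORDER BY DELETED DESC),
               TO_DATE('1900-01-01', 'YYYY-MM-DD')
           ) CREATED
    FROM SNOWFLAKE.ACCOUNT_USAGE."COLUMNS"
) CO
    ON CO.TABLE_CATALOG = OBJ_ACC.TABLE_CATALOG
    AND CO.TABLE_SCHEMA = OBJ_ACC.TABLE_SCHEMA
    AND CO.TABLE_NAME = OBJ_ACC.TABLE_NAME
    AND CO.COLUMN_NAME = OBJ_ACC.COLUMN_NAME
    AND OBJ_ACC.LAST_ACCESS BETWEEN CO.CREATED AND COALESCE(CO.DELETED, CURRENT_TIMESTAMP())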

At the end of our process, we get our monitoring table with the summarized example and structure below:

So we have finished! We only have to create a view in order to make this new custom data observability object available for our needs. Here is the full code:

CREATE OR REPLACE VIEW OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES AS
WITH T AS (
    -- Join ACCESS_HISTORY with QUERY_HISTORY and keep only statements that access data objects
    SELECT AH.QUERY_ID, QH.QUERY_TEXT, QH.START_TIME, QH.END_TIME,
           AH.DIRECT_OBJECTS_ACCESSED, AH.BASE_OBJECTS_ACCESSED, QH.QUERY_TYPE,
           QH.USER_NAME LAST_USER_ACCESS_BY,
           QH.ROLE_NAME LAST_ROLE_ACCESS_BY,
           QH.WAREHOUSE_NAME, QH.WAREHOUSE_SIZE, QH.WAREHOUSE_TYPE, QH.TOTAL_ELAPSED_TIME
    FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY AH
    INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY QH
        ON AH.QUERY_ID = QH.QUERY_ID
        AND QH.DATABASE_ID IS NOT NULL                      -- exclude service compute statements
        AND AH.DIRECT_OBJECTS_ACCESSED::VARCHAR <> '[]'     -- exclude statements without data objects
),
OBJECTS_ACCESS AS (
    -- Flatten BASE objects and their columns
    SELECT T.QUERY_ID, T.QUERY_TEXT, END_TIME LAST_ACCESS,
           T.QUERY_TYPE,
           'BASE_OBJECT' TYPE_ACCESS,
           F1.VALUE:"objectDomain"::VARCHAR TYPE_OF_OBJECT,
           F1.VALUE:"objectName"::VARCHAR TABLE_NAME_FULL,
           SPLIT_PART(TABLE_NAME_FULL, '.', 1) TABLE_CATALOG,
           SPLIT_PART(TABLE_NAME_FULL, '.', 2) TABLE_SCHEMA,
           SPLIT_PART(TABLE_NAME_FULL, '.', 3) TABLE_NAME,
           col.VALUE:"columnName"::VARCHAR COLUMN_NAME,
           LAST_USER_ACCESS_BY, LAST_ROLE_ACCESS_BY,
           WAREHOUSE_NAME, WAREHOUSE_SIZE, WAREHOUSE_TYPE, TOTAL_ELAPSED_TIME
    FROM T,
         LATERAL FLATTEN(T.BASE_OBJECTS_ACCESSED) F1,
         LATERAL FLATTEN(INPUT => F1.VALUE:"columns") col
    UNION
    -- Flatten DIRECT objects and their columns (only when they differ from the BASE ones)
    SELECT T.QUERY_ID, T.QUERY_TEXT, END_TIME LAST_ACCESS,
           T.QUERY_TYPE,
           'DIRECT_OBJECT' TYPE_ACCESS,
           F1.VALUE:"objectDomain"::VARCHAR TYPE_OF_OBJECT,
           F1.VALUE:"objectName"::VARCHAR TABLE_NAME_FULL,
           SPLIT_PART(TABLE_NAME_FULL, '.', 1) TABLE_CATALOG,
           SPLIT_PART(TABLE_NAME_FULL, '.', 2) TABLE_SCHEMA,
           SPLIT_PART(TABLE_NAME_FULL, '.', 3) TABLE_NAME,
           col.VALUE:"columnName"::VARCHAR COLUMN_NAME,
           LAST_USER_ACCESS_BY, LAST_ROLE_ACCESS_BY,
           WAREHOUSE_NAME, WAREHOUSE_SIZE, WAREHOUSE_TYPE, TOTAL_ELAPSED_TIME
    FROM T,
         LATERAL FLATTEN(T.DIRECT_OBJECTS_ACCESSED) F1,
         LATERAL FLATTEN(INPUT => F1.VALUE:"columns") col
    WHERE T.DIRECT_OBJECTS_ACCESSED <> T.BASE_OBJECTS_ACCESSED
)
SELECT
    OBJ_ACC.TYPE_OF_OBJECT, OBJ_ACC.TABLE_CATALOG, OBJ_ACC.TABLE_SCHEMA, OBJ_ACC.TABLE_NAME, OBJ_ACC.COLUMN_NAME,
    CO.IS_NULLABLE, CO.DATA_TYPE,
    TB.TABLE_OWNER, TB.TABLE_TYPE, TB.IS_TRANSIENT, TB.ROW_COUNT ROW_COUNT_BEFORE_DROP,
    OBJ_ACC.TYPE_ACCESS, OBJ_ACC.QUERY_TYPE, OBJ_ACC.QUERY_ID, OBJ_ACC.QUERY_TEXT, OBJ_ACC.LAST_ACCESS,
    OBJ_ACC.LAST_USER_ACCESS_BY, OBJ_ACC.LAST_ROLE_ACCESS_BY,
    OBJ_ACC.WAREHOUSE_NAME, OBJ_ACC.WAREHOUSE_SIZE, OBJ_ACC.WAREHOUSE_TYPE, OBJ_ACC.TOTAL_ELAPSED_TIME
FROM OBJECTS_ACCESS OBJ_ACC
-- Table metadata, matched against the table version alive at access time
LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE."TABLES" TB
    ON TB.TABLE_CATALOG = OBJ_ACC.TABLE_CATALOG
    AND TB.TABLE_SCHEMA = OBJ_ACC.TABLE_SCHEMA
    AND TB.TABLE_NAME = OBJ_ACC.TABLE_NAME
    AND OBJ_ACC.LAST_ACCESS BETWEEN TB.CREATED AND COALESCE(TB.DELETED, CURRENT_TIMESTAMP())
-- Column metadata, rebuilding CREATED from the DELETED history with LEAD
LEFT JOIN (
    SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, IS_NULLABLE, DATA_TYPE, DELETED,
           COALESCE(
               LEAD(DELETED, 1) OVER (PARTITION BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME ORDER BY DELETED DESC),
               TO_DATE('1900-01-01', 'YYYY-MM-DD')
           ) CREATED
    FROM SNOWFLAKE.ACCOUNT_USAGE."COLUMNS"
) CO
    ON CO.TABLE_CATALOG = OBJ_ACC.TABLE_CATALOG
    AND CO.TABLE_SCHEMA = OBJ_ACC.TABLE_SCHEMA
    AND CO.TABLE_NAME = OBJ_ACC.TABLE_NAME
    AND CO.COLUMN_NAME = OBJ_ACC.COLUMN_NAME
    AND OBJ_ACC.LAST_ACCESS BETWEEN CO.CREATED AND COALESCE(CO.DELETED, CURRENT_TIMESTAMP())
ORDER BY LAST_ACCESS DESC;

It’s time to rock!!

Now we are going to check what is happening in our data platform.

1- Which users/roles are accessing a specific database over time, and how many queries, tables and columns are involved?

SELECT 
TO_DATE(LAST_ACCESS) DT_ACCESS
, LAST_USER_ACCESS_BY
, LAST_ROLE_ACCESS_BY
, TABLE_CATALOG
, TABLE_SCHEMA
, TYPE_OF_OBJECT
, QUERY_TYPE
, COUNT(DISTINCT TABLE_CATALOG||TABLE_SCHEMA||TABLE_NAME) TABLES_ACCESSED
, COUNT(DISTINCT TABLE_CATALOG||TABLE_SCHEMA||TABLE_NAME||COLUMN_NAME) TABLE_COLUMNS_ACCESSED
, COUNT(DISTINCT QUERY_ID) QUERIES
FROM OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES
where TABLE_CATALOG='EDW'
GROUP BY ALL
ORDER BY 1 DESC,2,3;

An example of what is happening in my EDW database:

2- We want to know all the objects that have not been accessed in the last week, showing when each was last accessed, which query was executed, and the user/role who performed the last access. This query can be great for identifying all tables/views that are no longer accessed, in order to drop those objects and keep your database clean with some housekeeping.

SELECT 
TABLE_CATALOG,
TABLE_SCHEMA,
TABLE_NAME,
TYPE_OF_OBJECT,
MAX(LAST_ACCESS) LAST_ACCESS,
MAX_BY(QUERY_TYPE, LAST_ACCESS) LAST_QUERY_TYPE_ACCESS,
MAX_BY(LAST_USER_ACCESS_BY, LAST_ACCESS) LAST_USER_ACCESS_BY,
MAX_BY(LAST_ROLE_ACCESS_BY, LAST_ACCESS) LAST_ROLE_ACCESS_BY
FROM OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES
GROUP BY ALL
HAVING MAX(LAST_ACCESS) <= DATEADD(WEEK,-1,CURRENT_TIMESTAMP)
ORDER BY 5 ASC;
Query results retrieved, for the execution done on 27/4/2024

Finally, you can find other use cases, such as monitoring access to your sensitive information in order to detect suspicious access, filtering by the table/view and its columns.
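
For instance, a minimal sketch of such a check, assuming a hypothetical sensitive table MY_DB.HR.EMPLOYEES with a SALARY column (all the names are illustrative only):

SELECT LAST_ACCESS, LAST_USER_ACCESS_BY, LAST_ROLE_ACCESS_BY, QUERY_TYPE, QUERY_TEXT
FROM OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES
WHERE TABLE_CATALOG = 'MY_DB'        -- hypothetical database holding sensitive data
  AND TABLE_SCHEMA  = 'HR'           -- hypothetical schema
  AND TABLE_NAME    = 'EMPLOYEES'    -- hypothetical table
  AND COLUMN_NAME   = 'SALARY'       -- hypothetical sensitive column
ORDER BY LAST_ACCESS DESC;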

Now it’s your turn! You can practice and extract even more valuable metadata beyond what we have used in our view. This metadata is waiting for you; these insights are ready to be inspected!

Considerations

Take into consideration that we are using the ACCOUNT_USAGE schema. It has the particular characteristic of high latency, from minutes up to a couple of hours depending on the view you use. These figures can change over time, so I recommend checking each view in the Snowflake documentation for the latest latency reference.

Conclusions

You have seen how Snowflake provides us with different data objects we can use to customize our data monitoring. In this monitoring series we provide a guide, but you can improve it in multiple ways, for example by adding new information at each step, depending on your current requirements.

Some examples of improving our observability and monitoring:

  • Data Objects: We can combine the information about what is being accessed with the other observability views that cover DML effects on tables/views, or with the DDL insights provided in the other posts, in order to get a full list of DML / DDL / DQL statements.
  • Dashboard Observability: As mentioned in other posts, we can build a customized dashboard in our preferred BI tool in order to monitor all transactions that access, alter or modify any specific database, schema, table, view and/or column. This could also be done perfectly well with the Snowsight UI in Snowflake.

Remember, we could also load this information incrementally into our own tables using TASKs, in order to maintain our own observability tables and improve the efficiency of our SQL requests. This approach is the best option when we need to run heavy queries on our metadata, for example for ML and forecasting processes.
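
A minimal sketch of that idea, assuming a hypothetical history table and task (table name, warehouse and schedule are illustrative):

-- One-off: create an empty history table with the same structure as the view
CREATE TABLE IF NOT EXISTS OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES_HIST AS
SELECT * FROM OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES WHERE 1 = 0;

-- Task that periodically appends only the accesses newer than the last load
CREATE OR REPLACE TASK OBSERVABILITY_DB.MY_SCHEMA.TK_LOAD_DQL_ACCESS
    WAREHOUSE = MY_WH               -- hypothetical warehouse
    SCHEDULE  = '60 MINUTE'         -- illustrative schedule
AS
INSERT INTO OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES_HIST
SELECT V.*
FROM OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES V
WHERE V.LAST_ACCESS > (SELECT COALESCE(MAX(LAST_ACCESS), '1900-01-01'::TIMESTAMP_LTZ)
                       FROM OBSERVABILITY_DB.MY_SCHEMA.DQL_ACCESSED_COLUMN_TABLES_HIST);

ALTER TASK OBSERVABILITY_DB.MY_SCHEMA.TK_LOAD_DQL_ACCESS RESUME;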

About me

SME on different Data Technologies @SdgGroup, with 20+ years of experience in Data Adventures. Snowflake Squad Founder, Data Vault Certified Practitioner and Data Saiyan.

If you want to know more in detail about the aspects seen here, or other ones, you can follow me on Medium || LinkedIn here.

I hope you have enjoyed it and that it helps you!
