Snowflake: Observability on DML Transactions in Affected Columns (Part 2)

DML columns Observability in Snowflake. Image generated by AI (at https://pixlr.com/)

Let's start by highlighting why Observability is so important in our Data Platform. Different data architectures have a strong need to know what is happening in their systems: who is doing what, how and when? And finally we need comfortable dashboards to monitor it all, providing valuable metrics, KPIs and, above all, the capability to deliver this information in real time. This is where Observability comes into play, providing these features and more so that our professionals can take the right actions to improve, correct and design their systems and their data.

Snowflake as a platform offers a wide range of Observability features. Among them we can mention:

Snowsight Dashboard:

  • Managing and monitoring Costs and Performance
  • Traceability of transactions

Data Objects: views that provide valuable information about queries, logins, costs, timings, …

As part of this Observability work, we will see how to use these Data Objects to check all the DML transactions executed on each table, and on each of its columns, in our Snowflake databases. We want to list, in a structured way, all the tables and columns affected by each query, the query type, the rows affected and the timestamp of the transaction.

This is the second part of the Observability on DML Transactions series on Snowflake. If you haven't read the first part, don't worry: in this one we will review some aspects again, in order to explain our new goal step by step.

Hands On

I will start by checking the different objects that we have on Snowflake; they are located in the SNOWFLAKE.ACCOUNT_USAGE schema.

QUERY_HISTORY data object

Here we find the QUERY_HISTORY view, which provides useful information about all the queries executed on the platform.

An example of the output of this view could be the one below (note that I will only show the information relevant to this article):

A random extract from QUERY_HISTORY view

Let's check the meaning of these columns (a minimal sample query follows the list):

  • QUERY_ID: Unique identifier of each query.
  • QUERY_TEXT: The SQL query itself.
  • START_TIME / END_TIME: Start and end time of the query.
  • QUERY_TYPE: The type of statement executed; it covers a great variety of query types (procedures, DML, DDL, SELECT statements, …). For details, refer to the documentation.
  • USER_NAME: The user who executed the query.
  • ROWS_INSERTED / ROWS_UPDATED / ROWS_DELETED: The number of rows affected, when applicable, for the different types of DML queries.
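As a quick reference, here is a minimal sketch of how these columns can be queried (the 7-day window is just an illustrative choice):

-- Recent DML-style activity from QUERY_HISTORY (illustrative 7-day window)
SELECT QUERY_ID, QUERY_TEXT, QUERY_TYPE, USER_NAME,
       START_TIME, END_TIME,
       ROWS_INSERTED, ROWS_UPDATED, ROWS_DELETED
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY START_TIME DESC;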

You can see that this view contains neither the table name nor the columns (in a structured way) affected by the query. We will find that information in the next view.

ACCESS_HISTORY data object

Here we find the ACCESS_HISTORY view, which provides useful information about the relationship between queries and the objects modified on the platform.

An example of the output of this view could be the one below (again, I will only show the information relevant to this article):

Random extract from the ACCOUNT_USAGE.ACCESS_HISTORY view

Let's check the meaning of these columns:

QUERY_ID: Unique identifier of each query (same as above)

OBJECTS_MODIFIED: The list of objects modified by the query (in a detailed way): tables, views, columns, procedures, UDFs, etc. As you can imagine, this data is stored as semi-structured JSON.
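As before, a minimal sketch of how to peek at this view; the ARRAY_SIZE filter simply keeps queries that actually modified something:

-- Queries that modified at least one object, with the raw JSON detail
SELECT QUERY_ID, QUERY_START_TIME, USER_NAME, OBJECTS_MODIFIED
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY
WHERE ARRAY_SIZE(OBJECTS_MODIFIED) > 0
ORDER BY QUERY_START_TIME DESC
LIMIT 100;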

It’s time to query:

  1. Join both views on QUERY_ID.
  2. Filter the query types that interest us; in this case: 'UPDATE', 'MERGE', 'INSERT', 'MULTI_TABLE_INSERT', 'DML' (see the sketch right after the sample below).
  • UPDATE: UPDATE statements on columns of a table
  • MERGE: MERGE statements, which can include UPDATE, DELETE and INSERT
  • INSERT: INSERT statements
  • MULTI_TABLE_INSERT: multi-table INSERT statements
  • DML: some other DML statements
Sample of OBJECTS_MODIFIED column in ACCESS_HISTORY view
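A minimal sketch of steps 1 and 2; this is essentially the first CTE (T) of the full view definition shown at the end of the article:

-- Steps 1 and 2: join ACCESS_HISTORY to QUERY_HISTORY and keep only DML statements
SELECT AH.QUERY_ID, QH.QUERY_TEXT, QH.QUERY_TYPE, QH.END_TIME,
       QH.USER_NAME, QH.ROWS_INSERTED, QH.ROWS_UPDATED, QH.ROWS_DELETED,
       AH.OBJECTS_MODIFIED
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY AH
INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY QH
  ON AH.QUERY_ID = QH.QUERY_ID
WHERE QH.QUERY_TYPE IN ('UPDATE','MERGE','INSERT','MULTI_TABLE_INSERT','DML')
  AND QH.DATABASE_ID IS NOT NULL;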

Now that we have this information, the next step is to extract the table information in a structured way. This information lives in the OBJECTS_MODIFIED field, and the FLATTEN functionality will allow us to extract it.

In our case we will use: LATERAL FLATTEN(OBJECTS_MODIFIED)

In this approach, compared with the previous one that only tracked affected tables, we also have to flatten the columns, so this part is important:

LATERAL FLATTEN(INPUT=> F1.VALUE:"columns")

Now we can finally code our query and retrieve the list of tables and columns affected by each DML, the rows affected, the auditing information and the DML executed:
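A sketch of this step, building on the previous join (aliased as T, exactly as in the TABLES_DML CTE of the full view at the end of the article): each element of OBJECTS_MODIFIED becomes one table, and its "columns" array is flattened again into one row per affected column.

-- One row per (query, table, column) affected by a DML statement
SELECT T.QUERY_ID, T.QUERY_TYPE, T.END_TIME LAST_DML,
       F1.VALUE:"objectName"::VARCHAR  TABLE_NAME_FULL,
       col.VALUE:"columnName"::VARCHAR COLUMN_NAME,
       T.ROWS_INSERTED, T.ROWS_UPDATED, T.ROWS_DELETED
FROM T
, LATERAL FLATTEN(T.OBJECTS_MODIFIED) F1
, LATERAL FLATTEN(INPUT => F1.VALUE:"columns") col;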

In the last steps, we are also interested in getting some information about the tables themselves. This information will be extracted from the ACCOUNT_USAGE.TABLES view. The metadata we want to extract could be:

TABLE_OWNER: Owner of the table

TABLE_TYPE: Table type

ROW_COUNT: Number of rows prior to the table being dropped (aliased in the next image as ROW_COUNT_BEFORE_DROP)

We have to be careful when joining with this view: it returns the different states of each table as it is created, dropped and last updated. So, with all the previous queries gathered in a CTE named TDML, we will join our new TABLES Data Object view not only on the table name, but also on the time interval of each table's life cycle. Check this join:
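This is the relevant fragment of that join in the final view; the BETWEEN clause matches the DML timestamp against the lifetime of each table version recorded in ACCOUNT_USAGE.TABLES:

-- Match each DML against the table version that was alive at that moment
LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE."TABLES" TB
  ON  TB.TABLE_CATALOG = TDML.TABLE_CATALOG
  AND TB.TABLE_SCHEMA  = TDML.TABLE_SCHEMA
  AND TB.TABLE_NAME    = TDML.TABLE_NAME
  AND TDML.LAST_DML BETWEEN TB.CREATED AND COALESCE(TB.DELETED, CURRENT_TIMESTAMP())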

Finally, we are interested in getting some information about the columns as well. This information will be extracted from the ACCOUNT_USAGE.COLUMNS view. The metadata we want to extract could be:

IS_NULLABLE: Flag that indicates whether the column can contain NULL values (YES/NO)

DATA_TYPE: The data type of the column.

In this case there is a lot of additional information you can retrieve, but as an example we are retrieving only these columns.

We have to be careful when joining with this view too: it returns the different states of each table-column as it is created, dropped and last updated. So, as with the TDML CTE above, we will join our new COLUMNS Data Object view not only on the table-column, but also on the time interval of each column's life cycle. In this case we don't have a CREATED field, so we will have to build it from the history of changes in the DELETED field; we will use the LEAD function here. Check this part:
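Again, the relevant fragment from the final view: the subquery rebuilds a CREATED timestamp for each column version by taking the next DELETED value (ordered from newest to oldest) and falling back to 1900-01-01 for the oldest version:

-- Rebuild a CREATED timestamp per column version from the DELETED history
LEFT JOIN (
    SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
           IS_NULLABLE, DATA_TYPE, DELETED,
           COALESCE(
               LEAD(DELETED, 1) OVER (PARTITION BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
                                      ORDER BY DELETED DESC),
               TO_DATE('1900-01-01', 'YYYY-MM-DD')
           ) CREATED
    FROM SNOWFLAKE.ACCOUNT_USAGE."COLUMNS"
) CO
  ON  CO.TABLE_CATALOG = TDML.TABLE_CATALOG
  AND CO.TABLE_SCHEMA  = TDML.TABLE_SCHEMA
  AND CO.TABLE_NAME    = TDML.TABLE_NAME
  AND CO.COLUMN_NAME   = TDML.COLUMN_NAME
  AND TDML.LAST_DML BETWEEN CO.CREATED AND COALESCE(CO.DELETED, CURRENT_TIMESTAMP())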

At the end of our process, we get our monitoring table with the summarized example and structure below:

So we have finished! We only have to create a view in order to expose this new custom Data Observability object for our needs. Here is the full code:

CREATE OR REPLACE VIEW OBSERVABILITY_DB.MY_SCHEMA.DML_AFFECTED_COLUMN_TABLES
AS
WITH T AS ( -- DML queries from QUERY_HISTORY joined with the objects they modified in ACCESS_HISTORY
SELECT AH.QUERY_ID,QH.QUERY_TEXT,QH.START_TIME,QH.END_TIME,AH.OBJECTS_MODIFIED,QH.QUERY_TYPE,
QH.EXECUTION_STATUS,QH.ERROR_CODE, QH.ERROR_MESSAGE,
QH.USER_NAME LAST_DML_BY,QH.ROWS_INSERTED,QH.ROWS_UPDATED,QH.ROWS_DELETED
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY AH
INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY QH
ON AH.QUERY_ID =QH.QUERY_ID
WHERE QH.QUERY_TYPE IN ('UPDATE','MERGE','INSERT','MULTI_TABLE_INSERT','DML')
AND DATABASE_ID IS NOT null
),
TABLES_DML AS ( -- flatten OBJECTS_MODIFIED into one row per affected table column
SELECT END_TIME LAST_DML,
LAST_DML_BY,ROWS_INSERTED,ROWS_UPDATED,ROWS_DELETED,
F1.VALUE:"objectName"::VARCHAR TABLE_NAME_FULL,
T.QUERY_ID,T.QUERY_TYPE,T.QUERY_TEXT,
SPLIT_PART(TABLE_NAME_FULL,'.',1) TABLE_CATALOG,
SPLIT_PART(TABLE_NAME_FULL,'.',2) TABLE_SCHEMA,
SPLIT_PART(TABLE_NAME_FULL,'.',3) TABLE_NAME,
col.VALUE:"columnName"::VARCHAR COLUMN_NAME
FROM T
,LATERAL FLATTEN(T.OBJECTS_MODIFIED) F1
,LATERAL FLATTEN(INPUT=> F1.VALUE:"columns") col
)
SELECT
TDML.TABLE_CATALOG, TDML.TABLE_SCHEMA, TDML.TABLE_NAME,TDML.COLUMN_NAME,
CO.IS_NULLABLE,CO.DATA_TYPE,
TB.TABLE_OWNER,TB.TABLE_TYPE,TB.IS_TRANSIENT, TB.ROW_COUNT ROW_COUNT_BEFORE_DROP,
TDML.QUERY_ID,TDML.LAST_DML,TDML.QUERY_TYPE,TDML.QUERY_TEXT,
TDML.ROWS_INSERTED ROWS_INSERTED_DML,TDML.ROWS_UPDATED ROWS_UPDATED_DML,
TDML.ROWS_DELETED ROWS_DELETED_DML,TDML.LAST_DML_BY
FROM TABLES_DML TDML
LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE."TABLES" TB
ON TB.TABLE_CATALOG = TDML.TABLE_CATALOG
AND TB.TABLE_SCHEMA = TDML.TABLE_SCHEMA
AND TB.TABLE_NAME = TDML.TABLE_NAME
AND TDML.LAST_DML BETWEEN TB.CREATED AND COALESCE(TB.DELETED ,CURRENT_TIMESTAMP())
LEFT JOIN (
-- Rebuild a CREATED timestamp for each column version from the DELETED history
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, IS_NULLABLE, DATA_TYPE, DELETED ,
COALESCE(
LEAD(DELETED,1) OVER (PARTITION BY TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME ORDER BY DELETED DESC),
TO_DATE('1900-01-01','YYYY-MM-DD')
) CREATED
FROM SNOWFLAKE.ACCOUNT_USAGE."COLUMNS"
) CO
ON CO.TABLE_CATALOG = TDML.TABLE_CATALOG
AND CO.TABLE_SCHEMA = TDML.TABLE_SCHEMA
AND CO.TABLE_NAME = TDML.TABLE_NAME
AND CO.COLUMN_NAME = TDML.COLUMN_NAME
AND TDML.LAST_DML BETWEEN CO.CREATED AND COALESCE(CO.DELETED ,CURRENT_TIMESTAMP())
ORDER BY LAST_DML DESC;

It’s time to rock!!

Now we are going to check what is happening in our data platform.

1- Which DML statements have been run on our databases over time, and how many queries, tables, columns and rows have been affected?

SELECT 
TO_DATE(LAST_DML) DT_DML
, TABLE_CATALOG,QUERY_TYPE
, COUNT(DISTINCT TABLE_CATALOG||TABLE_SCHEMA||TABLE_NAME) TABLES_AFFECTED
, COUNT(DISTINCT TABLE_CATALOG||TABLE_SCHEMA||TABLE_NAME||COLUMN_NAME) TABLE_COLUMNS_AFFECTED
, COUNT(DISTINCT QUERY_ID) QUERIES
, SUM(ROWS_INSERTED_DML) ROWS_INSERTED -- ROWS INSERTED IN DIFFERENT COLUMNS IN A SINGLE DML
, SUM(ROWS_UPDATED_DML) ROWS_UPDATED -- ROWS UPDATED IN DIFFERENT COLUMNS IN A SINGLE DML
, SUM(ROWS_DELETED_DML) ROWS_DELETED -- ROWS DELETED IN DIFFERENT COLUMNS IN A SINGLE DML
FROM OBSERVABILITY_DB.MY_SCHEMA.DML_AFFECTED_COLUMN_TABLES
GROUP BY ALL
ORDER BY 1 DESC,2,3;

An example of what is happening in my account:

2- We want to know the history of UPDATE statements and the table-columns affected.

SELECT 
LAST_DML
, TABLE_CATALOG DB
, TABLE_SCHEMA SCHEMA
, QUERY_TYPE
, TABLE_NAME
, COLUMN_NAME
, QUERY_TEXT
, LAST_DML_BY
, ROWS_UPDATED_DML
FROM OBSERVABILITY_DB.MY_SCHEMA.DML_AFFECTED_COLUMN_TABLES
WHERE QUERY_TYPE IN ('UPDATE')
GROUP BY ALL
ORDER BY 1 DESC,2,3;

Now it's your turn! You can practice and extract even more valuable metadata than we have used in our view. This metadata is waiting for you; these insights are ready to be inspected!

Considerations

Take into consideration that we are using the ACCOUNT_USAGE schema, which has the special characteristic of high latency, from a few minutes up to 1–2 hours depending on the view you use. This can change over time, so I recommend checking each view in the Snowflake documentation to get the latest reference on latency times.

Conclusions

You have seen how Snowflake provides us with different Data Objects that we can use to customize our Data Monitoring. In this Monitoring series we provide a guide, but you can improve it in multiple ways, for example by adding new information at each step; it will depend on your current requirements.

For example, we could improve our observability in these areas:

Data Objects: We can UNION the DML insights provided here with the DDL insights we provided in the previous post, in order to get a full list of DML/DDL statements.

Dashboard Observability: We can build a customized dashboard in our preferred BI tool, in order to monitor all the transactions on a specific database, schema, table, view or column. This could be done perfectly with the Snowsight Snowflake UI.

Remember, we could also load this information incrementally into our own tables using TASKs, in order to maintain our own observability tables and improve the efficiency of our SQL queries. This approach is the best option when we need to run heavy queries on this metadata, for example for ML and forecasting processes.
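As a hedged sketch of that idea (the warehouse, schedule, target table and task names below are illustrative, not taken from this article), an incremental load could look like this:

-- Hypothetical hourly task loading new rows from the view into our own history table
CREATE OR REPLACE TASK OBSERVABILITY_DB.MY_SCHEMA.LOAD_DML_AFFECTED_COLUMN_TABLES
  WAREHOUSE = MY_WH          -- illustrative warehouse name
  SCHEDULE  = '60 MINUTE'
AS
INSERT INTO OBSERVABILITY_DB.MY_SCHEMA.DML_AFFECTED_COLUMN_TABLES_HIST
SELECT V.*
FROM OBSERVABILITY_DB.MY_SCHEMA.DML_AFFECTED_COLUMN_TABLES V
WHERE V.LAST_DML > (SELECT COALESCE(MAX(H.LAST_DML), '1900-01-01'::TIMESTAMP_LTZ)
                    FROM OBSERVABILITY_DB.MY_SCHEMA.DML_AFFECTED_COLUMN_TABLES_HIST H);

Keep in mind that a task is created in a suspended state, so it needs an ALTER TASK ... RESUME before it starts running.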

If people like this article, I will keep going deeper into the Observability and Monitoring features that Snowflake provides us. Give me some feedback about what would be interesting for you. Thanks for reading.


Cesar Segura
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

SME @ SDG Group || Snowflake Architect || Snowflake Squad Spotlight Member || CDVP Data Vault