Effortless Performance Analysis with Waii’s Text-to-SQL Platform
As a performance engineer, I often have to analyse metrics across various components to gain valuable insights on answering performance questions. In the context of Snowflake warehouse, a lot of performance related data is available in the form of views — from query_history
to warehouse_load_history
and access_history
. Documentation on accessing these views are well covered in snowflake community article.
I set up these snowflake views in my SNOWFLAKE_PERF_COPY database in one of my internal clusters. But instead of slogging through writing lines of SQL code to mine these invaluable views, I decided to get the help of Waii’s text-to-SQL platform. My goal was simple: ask simple to complex performance questions in plain, natural language and let the system come up with relevant SQL.
Snowflake has various aspects of performance with respect to data loading, transformation, and reporting queries. For this blog, I will primarily focus on dissecting non-data load queries within one of my internal clusters.
Slow query insights
My adventure begins with a fundamental question: which queries have been consuming the most amount of time in the data warehouse? Snowflake’s native “query_history” already has this covered with its comprehensive details and nifty filters.
However, from a performance perspective, I wanted to go beyond slow queries and dive into the tables they frequently accessed along with table sizes and cache utilisation. This information would be very useful, when there are lots of tables present in the system and we need to quickly nail down the few databases of interest. I posted this question to Waii:
“Over the past two weeks, identify the 200 queries with the longest runtime. For these queries, provide insights into the frequently accessed tables, including the number of times each table was accessed and the users who ran these queries. Additionally, report the average runtime of these queries rounded to one decimal place, along with the associated warehouses where they were executed. Include the size of these tables and the average amount scanned from cache. Present the top 200 rows sorted by average runtime and include a breakdown of successful and failed queries.”
WITH filtered_query_history AS (
SELECT *
FROM snowflake_perf_copy.perf.query_history
WHERE
start_time >= DATEADD(DAY, -14, CURRENT_DATE)
),
top_200_queries AS (
SELECT
query_id,
user_name,
warehouse_name,
total_elapsed_time,
execution_status,
percentage_scanned_from_cache
FROM filtered_query_history
ORDER BY
total_elapsed_time DESC
LIMIT 200
),
joined_access_history AS (
SELECT
t.query_id,
t.user_name,
t.warehouse_name,
t.total_elapsed_time,
a.base_objects_accessed,
execution_status,
percentage_scanned_from_cache
FROM top_200_queries AS t
INNER JOIN snowflake_perf_copy.perf.access_history AS a
ON t.query_id = a.query_id
),
flattened_access_history AS (
SELECT
j.user_name,
j.warehouse_name,
j.total_elapsed_time,
f.value['objectName'] AS table_name,
execution_status,
percentage_scanned_from_cache
FROM joined_access_history AS j
CROSS JOIN LATERAL FLATTEN(input => j.base_objects_accessed) AS f
)
SELECT
f.table_name,
COUNT(*) AS num_accesses,
ARRAY_AGG(DISTINCT f.user_name) AS users,
ROUND(AVG(f.total_elapsed_time), 1) AS avg_runtime,
COUNT(CASE WHEN f.execution_status = 'SUCCESS' THEN 1 END) AS successful_queries,
COUNT(CASE WHEN f.execution_status = 'FAIL' THEN 1 END) AS failed_queries,
AVG(f.percentage_scanned_from_cache) AS avg_percentage_scanned_from_cache,
ROUND(tsm.active_bytes / 1024 / 1024 / 1024, 1) AS table_size_gb
FROM flattened_access_history AS f
INNER JOIN snowflake_perf_copy.perf.table_storage_metrics AS tsm
ON CONCAT(tsm.table_catalog, '.', tsm.table_schema, '.', tsm.table_name) = f.table_name
GROUP BY
f.table_name,
tsm.active_bytes
ORDER BY
avg_runtime DESC
LIMIT 200
Initially, Waii overlooked table sizes and missed adding it, but a subtle hint, “Add table size from table_storage_metrics,” did the trick of including table size details. It was able to join different views and interestingly was able to handle complex data type like “base_objects_accessed” in access_history table.
Workload mysteries
Hourly query execution pattern
I wanted to analyse query execution patterns on an hourly basis throughout the week to identify workload anomalies. Snowflake provides a workload graph for warehouses.
It would be great to have a heapmap/gridmap to show the number of queries executed by hour and day.
Question to Waii
For the last 1 week, can you identify any patterns in query execution times, number of queries executed categorised by the hour of the day and day of the week? The report should present this data with each day of the week as columns, spanning from Sunday to Saturday.
WITH filtered_query_history AS (
SELECT *
FROM snowflake_perf_copy.perf.query_history
WHERE
start_time >= DATEADD(DAY, -7, CURRENT_DATE)
),
query_hour_day AS (
SELECT
EXTRACT(HOUR FROM start_time) AS hour_of_day,
EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
COUNT(query_id) AS num_queries
FROM filtered_query_history
GROUP BY
EXTRACT(HOUR FROM start_time),
EXTRACT(DAYOFWEEK FROM start_time)
)
SELECT
hour_of_day,
SUM(CASE WHEN day_of_week = 1 THEN num_queries ELSE 0 END) AS sunday,
SUM(CASE WHEN day_of_week = 2 THEN num_queries ELSE 0 END) AS monday,
SUM(CASE WHEN day_of_week = 3 THEN num_queries ELSE 0 END) AS tuesday,
SUM(CASE WHEN day_of_week = 4 THEN num_queries ELSE 0 END) AS wednesday,
SUM(CASE WHEN day_of_week = 5 THEN num_queries ELSE 0 END) AS thursday,
SUM(CASE WHEN day_of_week = 6 THEN num_queries ELSE 0 END) AS friday,
SUM(CASE WHEN day_of_week = 7 THEN num_queries ELSE 0 END) AS saturday
FROM query_hour_day
GROUP BY
hour_of_day
ORDER BY
hour_of_day
I used this data to generate a report using Google Chart. It is important to consider aspects such as response time and the amount of bytes scanned for further analysis. However, as a starting point, I focused on exploring the number of executed queries.
Queries executed on each hour of the day throughout the week:
Tracking database growth over time
It’s a commonly recognised fact that in production environments, certain databases have the potential to have unforeseen growth, leading to delays in query execution. This led me to ask Waii about analysing the database growth over time.
What is the database growth over time? How has storage usage grown week on week for the last 8 weeks, and which databases have contributed the most to this growth?
WITH weekly_storage_usage AS (
SELECT
database_name,
DATE_TRUNC('WEEK', usage_date) AS week,
SUM(average_database_bytes) / NULLIF((
1024 * 1024 * 1024
), 0) AS storage_gb
FROM snowflake_perf_copy.perf.database_storage_usage_history
GROUP BY
database_name,
DATE_TRUNC('WEEK', usage_date)
),
week_on_week_growth AS (
SELECT
database_name,
week,
storage_gb,
LAG(storage_gb) OVER (PARTITION BY database_name ORDER BY week) AS prev_storage_gb
FROM weekly_storage_usage
)
SELECT
database_name,
week,
ROUND((
storage_gb - prev_storage_gb
), 1) AS growth_gb
FROM week_on_week_growth
WHERE
week >= DATEADD(WEEK, -8, CURRENT_DATE)
ORDER BY
growth_gb DESC nulls last
In search of inefficient grouping operators
I wanted to analyse the efficiency of common group/aggregation/filters executed by those top 200 slow queries for fine-tuning them later. Snowflake provides additional statistics about each query through get_query_operator_stats
table function (Reference). This function accepts query_id as input. Since joining the table function with “query_history” was not straight forward, I wanted to have a small stored procedure to populate these statistics in a separate table. I experimented with Waii, and it generated the following stored procedure. Although there were a few minor compilation issues initially, they were easily resolved.
CREATE OR REPLACE TABLE snowflake_perf_copy.perf.q_op_stats AS
SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS('dummy_query_id')) WHERE 1 = 0;
CREATE OR REPLACE PROCEDURE ITERATE_QUERIES() RETURNS STRING LANGUAGE JAVASCRIPT AS
$$ var query = `WITH top_queries AS (SELECT query_id
FROM SNOWFLAKE_PERF_COPY.PERF.QUERY_HISTORY
WHERE
execution_status = 'SUCCESS' AND query_type = 'SELECT'
AND start_time >= DATEADD(day, -14, CURRENT_DATE)
ORDER BY total_elapsed_time DESC LIMIT 200) SELECT query_id FROM top_queries`;
var stmt = snowflake.createStatement({sqlText: query});
var result_set = stmt.execute();
while (result_set.next()) {
var query_id = result_set.getColumnValue(1);
var insert_query = `INSERT INTO snowflake_perf_copy.perf.q_op_stats SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS('${query_id}'))`;
var insert_stmt = snowflake.createStatement({sqlText: insert_query});
insert_stmt.execute();
}
return 'Done'; $$;
CALL ITERATE_QUERIES();
With this dataset, I wanted to identify high cardinality grouping operators. These are operators that emit 80% of the input rows to output — a potential clue for optimising query performance. 80% is not a hard number, it can vary from system to system. In certain cases, identifying these common groupings can lead to deeper understanding of disk spills in slow queries. Administrators can also explore creating materialised views in certain cases to avoid these expensive groupings across queries.
Question to Waii:
Can you display inefficient high cardinality grouping operators, specifically those where the output rows exceeded 80% of the input rows? Please provide the query text, operator attributes, and overall execution percentage, rounded to two decimal places. Additionally, indicate whether there were any local disk spills for these queries, and sort the results by overall execution percentage.
WITH high_cardinality_grouping_operators AS (
SELECT
q.query_id,
q.operator_attributes,
q.operator_statistics,
q.execution_time_breakdown,
qh.query_text
FROM snowflake_perf_copy.perf.q_op_stats AS q
INNER JOIN snowflake_perf_copy.perf.query_history AS qh
ON q.query_id = qh.query_id
WHERE
q.operator_type = 'Aggregate'
AND (
CAST(q.operator_statistics['output_rows'] AS FLOAT)
/ NULLIF(CAST(q.operator_statistics['input_rows'] AS FLOAT), 0)
) > 0.8
)
SELECT
query_text,
operator_attributes,
ROUND(CAST(execution_time_breakdown['overall_percentage'] AS FLOAT) * 100, 2) AS overall_percentage,
CASE
WHEN CAST(execution_time_breakdown['local_disk_io'] AS FLOAT) > 0
THEN 'Yes'
ELSE 'No'
END AS local_disk_spill
FROM high_cardinality_grouping_operators
ORDER BY
overall_percentage DESC
Conclusion:
Though there are lot of performance questions when looking at optimising a system, in this blog I explored whether few of these performance questions can be expressed via natural language to gain insights via Waii’s text-to-SQL platform. Overall, it effortlessly generated SQL queries for most questions and in few cases, required gentle nudges in the form of hints.