How to Build a Query Cost Monitoring Tool in Snowflake Notebooks
Are you struggling to manage and optimize your database compute costs? A detailed, per-query breakdown of those costs can provide invaluable insight into where your credits are going.
This blog introduces a solution that leverages Snowflake’s account usage data to monitor and visualize the compute costs associated with individual queries. Here, you’ll learn how to create an interactive dashboard for detailed cost analysis.
By the end of this blog, readers will be able to:
- Retrieve and merge query data from Snowflake.
- Convert this data into a usable format with Pandas.
- Implement interactive elements using Streamlit.
- Visualize data using Altair charts.
Getting Started
Before you can build your Query Cost Monitoring tool, ensure you have access to a Snowflake account. This tool requires basic knowledge of SQL and Python, plus familiarity with the Pandas and Streamlit libraries.
Let’s start by setting up our environment and retrieving the necessary data. First, download the required notebook from the query cost monitoring GitHub repository. Make sure your notebook environment includes the essential libraries: numpy, pandas, altair, and streamlit.
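As an optional first step, you can confirm those packages are importable from a notebook cell. This is a minimal sanity-check sketch, not part of the original notebook:

import numpy as np
import pandas as pd
import altair as alt
import streamlit as st

# Print versions to confirm the environment has everything we need
print(np.__version__, pd.__version__, alt.__version__, st.__version__)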
Retrieving Query Cost Data
Write the SQL Query
Our first task is to write a SQL query that fetches credits_used and other relevant columns from views in Snowflake’s account_usage schema. Note that this SQL cell is named sql_data, a name we’ll reference shortly.
One caveat worth knowing: metering_history reports warehouse credit consumption in (roughly) hourly windows, so the join below attributes a window’s credits to every query that ran within it. Treat the per-query credit figures as approximations rather than exact allocations. Also keep in mind that account_usage views are populated with some latency, so the most recent queries may not appear immediately.
SELECT
    query_history.query_id,
    query_history.query_text,
    query_history.start_time,
    query_history.end_time,
    query_history.user_name,
    query_history.database_name,
    query_history.schema_name,
    query_history.warehouse_name,
    query_history.warehouse_size,
    metering_history.credits_used,
    execution_time / 1000 AS execution_time_s
FROM
    snowflake.account_usage.query_history
    JOIN snowflake.account_usage.metering_history
        ON query_history.start_time >= metering_history.start_time
        AND query_history.end_time <= metering_history.end_time
WHERE
    query_history.start_time >= DATEADD(DAY, -7, CURRENT_TIMESTAMP())
ORDER BY
    query_history.query_id;
Convert Table to a DataFrame
Next, convert the SQL result into a Pandas DataFrame for further processing and analysis in the Python code that follows. In Snowflake Notebooks, a cell’s output can be referenced by the cell’s name in later cells; the Python cell below is named py_dataframe, which is why the dashboard code later begins with df = py_dataframe.copy().
sql_data.to_pandas()
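Optionally, a few quick checks on the resulting DataFrame can catch surprises early. A sketch, assuming the conversion cell above is indeed named py_dataframe:

# Optional sanity checks, run in a separate cell
df = py_dataframe.copy()
print(df.shape)                                    # number of rows and columns returned
print(df[['START_TIME', 'CREDITS_USED']].dtypes)   # confirm the expected dtypes
print(df['CREDITS_USED'].sum())                    # total credits over the 7-day window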
Create an Interactive Dashboard
Develop Interactive Components
Conceptually, we’ll build dynamic widgets using Streamlit to allow users to filter data based on different parameters such as time span, warehouse name, warehouse size, or user name.
First, we’ll import the data and lay out the structure of the data app via st.columns() into three columns, each occupied by an input widget: a slider and two drop-down select boxes.
import pandas as pd
import streamlit as st
import altair as alt

# Get data
df = py_dataframe.copy()

# Create date filter slider
st.subheader("Select time duration")

col = st.columns(3)

with col[0]:
    days = st.slider('Select number of days to analyze',
                     min_value=1,
                     max_value=7,
                     value=7,
                     step=1)
with col[1]:
    var = st.selectbox("Select a variable", ['WAREHOUSE_NAME', 'USER_NAME', 'WAREHOUSE_SIZE'])
with col[2]:
    metric = st.selectbox("Select a metric", ["COUNT", "TOTAL_CREDITS_USED"])
Next, we’ll filter the data using the widget values specified by the user.
# Filter data according to day duration
df['START_TIME'] = pd.to_datetime(df['START_TIME'])
latest_date = df['START_TIME'].max()
cutoff_date = latest_date - pd.Timedelta(days=days)
filtered_df = df[df['START_TIME'] > cutoff_date].copy()

# Prepare data for heatmap
filtered_df['HOUR_OF_DAY'] = filtered_df['START_TIME'].dt.hour
filtered_df['HOUR_DISPLAY'] = filtered_df['HOUR_OF_DAY'].apply(lambda x: f"{x:02d}:00")

# Calculate frequency count and sum of credits by hour and query
agg_df = (filtered_df.groupby(['QUERY_ID', 'HOUR_DISPLAY', var])
          .agg(
              COUNT=('QUERY_ID', 'size'),
              TOTAL_CREDITS_USED=('CREDITS_USED', 'sum')
          )
          .reset_index()
)

st.warning(f"Analyzing {var} data for the last {days} days!")
Here, we’re defining session state for the expand/collapse button, which holds the logic for what happens when the button is clicked. This will allow us to display the underlying DataFrames of the filtered and processed data.
# Initialize the button state in session state
if 'expanded_btn' not in st.session_state:
    st.session_state.expanded_btn = False

# Callback function to toggle the state
def toggle_expand():
    st.session_state.expanded_btn = not st.session_state.expanded_btn

# Create button with callback
st.button(
    '⊕ Expand DataFrames' if not st.session_state.expanded_btn else '⊖ Collapse DataFrames',
    on_click=toggle_expand,
    type='secondary' if st.session_state.expanded_btn else 'primary'
)

# State conditional
if st.session_state.expanded_btn:
    expand_value = True
else:
    expand_value = False

with st.expander("See Filtered DataFrame", expanded=expand_value):
    st.dataframe(filtered_df.head(100))
with st.expander("See Heatmap DataFrame", expanded=expand_value):
    st.dataframe(agg_df)
The above code snippets will generate the following interactive app right inside Snowflake Notebooks:
Implement Data Visualizations
Here, we’ll use Altair to create insightful visualizations such as heatmaps, stacked bar charts, and bubble plots to allow visual detection of trends.
## Heatmap
heatmap = alt.Chart(agg_df).mark_rect(stroke='black', strokeWidth=1).encode(
    x='HOUR_DISPLAY:O',
    y=alt.Y(f'{var}:N',
            title='',
            axis=alt.Axis(
                labels=True,
                labelLimit=250,
                tickMinStep=1,
                labelOverlap=False,
                labelPadding=10
            )),
    color=f'{metric}:Q',
    tooltip=['HOUR_DISPLAY', var, metric]
).properties(
    title=f'Query Activity Heatmap by Hour and {var}'
)

st.altair_chart(heatmap, use_container_width=True)

## Stacked bar chart with time series
bar_time = alt.Chart(agg_df).mark_bar().encode(
    x='HOUR_DISPLAY:O',
    y=f'{metric}:Q',
    color=alt.Color(f'{var}:N', legend=alt.Legend(orient='bottom')),
    tooltip=['HOUR_DISPLAY', var, metric]
).properties(
    title=f'Query Activity by Hour and {var}',
    height=400
)

st.altair_chart(bar_time, use_container_width=True)

## Bubble plot with size representing the metric
bubble = alt.Chart(agg_df).mark_circle().encode(
    x='HOUR_DISPLAY:O',
    y=alt.Y(f'{var}:N', title=''),
    size=alt.Size(f'{metric}:Q', legend=alt.Legend(title=metric)),  # legend title follows the selected metric
    color=alt.Color(f'{var}:N', legend=None),
    tooltip=['HOUR_DISPLAY', var, metric]
).properties(
    title=f'Query Distribution by Hour and {var}',
    height=550
)

st.altair_chart(bubble, use_container_width=True)
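As a small variation (not part of the original notebook), the three charts could also be rendered as one vertically concatenated Altair object, keeping them visually grouped:

# Hypothetical alternative: combine the three charts into one vertical stack.
# resolve_scale keeps each chart's color encoding independent.
combined = alt.vconcat(heatmap, bar_time, bubble).resolve_scale(color='independent')
st.altair_chart(combined, use_container_width=True)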
Let’s now see what the generated data visualizations look like.
Starting with the heatmap:
On to the stacked bar chart:
And finally the bubble plots:
Conclusion
Congratulations on successfully developing a query cost monitoring tool! The interactive dashboard we’ve just created allows users to analyze query patterns, identify cost drivers, and make data-driven decisions about optimization. You can further adapt this dashboard by adding more metrics, implementing additional visualizations, and more.
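For example, here’s a minimal sketch of how you might add average execution time as a third selectable metric. The AVG_EXECUTION_TIME_S name is hypothetical; EXECUTION_TIME_S comes from the execution_time_s alias in the SQL query:

# Hypothetical extension: add average execution time as a selectable metric
metric = st.selectbox("Select a metric",
                      ["COUNT", "TOTAL_CREDITS_USED", "AVG_EXECUTION_TIME_S"])

agg_df = (filtered_df.groupby(['QUERY_ID', 'HOUR_DISPLAY', var])
          .agg(
              COUNT=('QUERY_ID', 'size'),
              TOTAL_CREDITS_USED=('CREDITS_USED', 'sum'),
              AVG_EXECUTION_TIME_S=('EXECUTION_TIME_S', 'mean')  # new metric
          )
          .reset_index()
)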
Resources
Videos:
- How To Break Down Compute Costs By Individual Queries In Snowflake Notebooks
- YouTube playlist on Snowflake Notebooks
Happy coding!