Troubleshoot query latency changes using SPANNER_SYS.QUERY_PROFILES_TOP_* tables

Pablo Arrojo
Google Cloud - Community
6 min read · Mar 14, 2024

One of the best tools in Cloud Spanner for understanding why query latency has changed is Query plan samples.

Query plan samples allow you to view samples of historic query plans and compare the performance of a query over time:

Query plan samples illustration

This is very useful for detecting query latency degradation caused by changes in execution plans, and for understanding the root cause of those changes (such as schema changes, an optimizer version update, or new optimizer statistics).

However, there are situations where a query latency regression is not related to changes in execution plans. Sometimes we need a deeper level of introspection to understand what is causing these performance issues. For these cases, the built-in SPANNER_SYS.QUERY_PROFILES_TOP_* tables can provide a lot of insight.

These tables store every sampled plan together with a detailed profile of the query:

This is really interesting and useful, since it gives us access to query statistics that are not available from the visual tools:

"queryStats":{
"statistics_load_time":"0",
"remote_server_calls":"3/3",
"rows_scanned":"15",
"query_plan_creation_time":"2.08 msecs",
"optimizer_statistics_package":"auto_20240306_22_41_27UTC",
"optimizer_version":"6",
"cpu_time":"5.56 msecs",
"deleted_rows_scanned":"0",
"query_text":"select id, counter from user_operations where id in UNNEST(@ids)",
"rows_returned":"15",
"bytes_returned":"635",
"filesystem_delay_seconds":"0 msecs",
"locking_delay":"0 msecs",
"runtime_creation_time":"0.73 msecs",
"elapsed_time":"6.61 msecs",
"data_bytes_read":"218224"
}
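
For reference, this JSON can be pulled directly from the built-in table. The query below is a minimal sketch that lists the raw queryStats of the most recent sampled plans:

-- Minimal sketch: raw queryStats JSON for the latest sampled plans.
SELECT interval_end,
       text_fingerprint,
       JSON_QUERY(query_profile, '$.queryStats') AS query_stats
FROM SPANNER_SYS.QUERY_PROFILES_TOP_MINUTE
ORDER BY interval_end DESC
LIMIT 10;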

These query statistics give us deeper insight into sampled queries and help us understand the root cause of query latency changes.

Below are some examples of how we can troubleshoot query performance regressions using these statistics:

Locking Delay

Cloud Spanner provides nice tools to identify and understand locking issues in our database, such as the Lock insights visual tool or the built-in SPANNER_SYS.LOCK_STATS_TOP_* tables.

For example, here we are detecting locks and hot keys on the user_operations table:
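
To surface these lock waits directly from the built-in tables, a minimal sketch might look like this (column names per the LOCK_STATS_TOP_* documentation):

-- Minimal sketch: row ranges with the highest lock wait time, most recent first.
SELECT interval_end,
       row_range_start_key,
       lock_wait_seconds
FROM SPANNER_SYS.LOCK_STATS_TOP_MINUTE
ORDER BY interval_end DESC, lock_wait_seconds DESC
LIMIT 10;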

If the queries affected by these locks have been sampled, we can see how much time they spend waiting on locks just by looking at the locking_delay statistic:

spanner> SELECT interval_end,
-> text_fingerprint,
-> JSON_VALUE(STATS,'$.elapsed_time') AS elapsed_time,
-> JSON_VALUE(STATS,'$.locking_delay') AS locking_delay,
-> JSON_VALUE(STATS,'$.query_text') AS query_text,
-> from (
-> SELECT interval_end,
-> text_fingerprint,
-> JSON_QUERY(query_profile,'$.queryStats') AS STATS
-> from
-> SPANNER_SYS.QUERY_PROFILES_TOP_MINUTE
-> where text_fingerprint=785512920672022105);
+----------------------+--------------------+--------------+---------------+-------------------------------------------------------------------------------------+
| interval_end | text_fingerprint | elapsed_time | locking_delay | query_text |
+----------------------+--------------------+--------------+---------------+-------------------------------------------------------------------------------------+
| 2024-03-07T12:22:00Z | 785512920672022105 | 2.27 msecs | 0 msecs | @{LOCK_SCANNED_RANGES=Exclusive} SELECT counter FROM user_operations WHERE id = @id |
| 2024-03-07T12:21:00Z | 785512920672022105 | 100.88 msecs | 98.61 msecs | @{LOCK_SCANNED_RANGES=Exclusive} SELECT counter FROM user_operations WHERE id = @id |
| 2024-03-07T12:02:00Z | 785512920672022105 | 2.5 msecs | 0.01 msecs | @{LOCK_SCANNED_RANGES=Exclusive} SELECT counter FROM user_operations WHERE id = @id |
| 2024-03-07T12:01:00Z | 785512920672022105 | 2.32 msecs | 0.01 msecs | @{LOCK_SCANNED_RANGES=Exclusive} SELECT counter FROM user_operations WHERE id = @id |
| 2024-03-07T11:41:00Z | 785512920672022105 | 2.37 msecs | 0.01 msecs | @{LOCK_SCANNED_RANGES=Exclusive} SELECT counter FROM user_operations WHERE id = @id |
+----------------------+--------------------+--------------+---------------+-------------------------------------------------------------------------------------+
5 rows in set (9.47 msecs)
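
If we also want to know which columns and lock modes were involved in the conflict, the sample_lock_requests column of SPANNER_SYS.LOCK_STATS_TOP_* can be unnested. A sketch, with field names per the lock statistics documentation:

-- Sketch: sampled lock requests (lock mode and column) per row range.
SELECT interval_end,
       row_range_start_key,
       req.lock_mode,
       req.column AS locked_column
FROM SPANNER_SYS.LOCK_STATS_TOP_MINUTE,
     UNNEST(sample_lock_requests) AS req
ORDER BY interval_end DESC
LIMIT 20;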

Remote Server Calls

How data is distributed or co-located in our database can affect query performance, and identifying this behavior can be a hard task.

In this example, we see how the latency of a query increased by up to 3x:

If we compare the sampled plans and their metrics, we can see that there are no differences that explain this latency change (same execution plan, same number of scanned rows, and so on):

Lowest latency sample
Highest latency sample

Now, checking the remote_server_calls statistic, we can confirm the reason for this performance regression. Latency increased because our query now looks for rows distributed across 3 splits (remote_server_calls: 3/3), whereas before all of the rows searched were co-located in a single split (remote_server_calls: 0/0):

spanner> SELECT interval_end,
-> text_fingerprint,
-> JSON_VALUE(STATS,'$.elapsed_time') AS elapsed_time,
-> JSON_VALUE(STATS,'$.cpu_time') AS cpu_time,
-> JSON_VALUE(STATS,'$.rows_scanned') AS rows_scanned,
-> JSON_VALUE(STATS,'$.remote_server_calls') AS remote_server_calls,
-> JSON_VALUE(STATS,'$.query_text') AS query_text,
-> from (
-> SELECT interval_end,
-> text_fingerprint,
-> JSON_QUERY(query_profile,'$.queryStats') AS STATS
-> from
-> SPANNER_SYS.QUERY_PROFILES_TOP_MINUTE
-> where text_fingerprint=2960945325229107391);
+----------------------+---------------------+--------------+------------+--------------+---------------------+-------------------------------------------------------------------+
| interval_end | text_fingerprint | elapsed_time | cpu_time | rows_scanned | remote_server_calls | query_text |
+----------------------+---------------------+--------------+------------+--------------+---------------------+-------------------------------------------------------------------+
| 2024-03-07T17:30:00Z | 2960945325229107391 | 5.74 msecs | 4.83 msecs | 15 | 3/3 | select id, counter from user_operations where id in UNNEST(@ids) |
| 2024-03-07T17:10:00Z | 2960945325229107391 | 6.61 msecs | 5.56 msecs | 15 | 3/3 | select id, counter from user_operations where id in UNNEST(@ids) |
| 2024-03-07T16:50:00Z | 2960945325229107391 | 8.38 msecs | 4.8 msecs | 15 | 3/3 | select id, counter from user_operations where id in UNNEST(@ids) |
| 2024-03-07T16:30:00Z | 2960945325229107391 | 5.69 msecs | 4.86 msecs | 15 | 3/3 | select id, counter from user_operations where id in UNNEST(@ids) |
| 2024-03-07T16:10:00Z | 2960945325229107391 | 5.81 msecs | 5.12 msecs | 15 | 3/3 | select id, counter from user_operations where id in UNNEST(@ids) |
| 2024-03-07T15:50:00Z | 2960945325229107391 | 4.01 msecs | 3.29 msecs | 15 | 0/0 | select id, counter from user_operations where id in UNNEST(@ids) |
| 2024-03-07T15:30:00Z | 2960945325229107391 | 3.84 msecs | 3.13 msecs | 15 | 0/0 | select id, counter from user_operations where id in UNNEST(@ids) |
+----------------------+---------------------+--------------+------------+--------------+---------------------+-------------------------------------------------------------------+
7 rows in set (21.29 msecs)

Deleted rows

I’ve written in detail about how deleted rows could affect query latency here.

This scenario can also be detected in the query profile tables by checking the deleted_rows_scanned statistic:

spanner> SELECT interval_end,
-> text_fingerprint,
-> JSON_VALUE(STATS,'$.elapsed_time') AS elapsed_time,
-> JSON_VALUE(STATS,'$.cpu_time') AS cpu_time,
-> JSON_VALUE(STATS,'$.rows_scanned') AS rows_scanned,
-> JSON_VALUE(STATS,'$.deleted_rows_scanned') AS deleted_rows_scanned,
-> JSON_VALUE(STATS,'$.query_text') AS query_text,
-> from (
-> SELECT interval_end,
-> text_fingerprint,
-> JSON_QUERY(query_profile,'$.queryStats') AS STATS
-> from
-> SPANNER_SYS.QUERY_PROFILES_TOP_MINUTE
-> where text_fingerprint=-1544692717167823854);
+----------------------+----------------------+--------------+-------------+--------------+----------------------+------------------------------------------------------------+
| interval_end | text_fingerprint | elapsed_time | cpu_time | rows_scanned | deleted_rows_scanned | query_text |
+----------------------+----------------------+--------------+-------------+--------------+----------------------+------------------------------------------------------------+
| 2024-03-07T19:11:00Z | -1544692717167823854 | 50.95 msecs | 52.26 msecs | 3 | 54600 | select min(ts) from user_operations where user_id=@userid |
| 2024-03-07T19:10:00Z | -1544692717167823854 | 60.04 msecs | 59.74 msecs | 3 | 54600 | select min(ts) from user_operations where user_id=@userid |
| 2024-03-07T18:51:00Z | -1544692717167823854 | 70.66 msecs | 71.1 msecs | 3 | 54600 | select min(ts) from user_operations where user_id=@userid |
| 2024-03-07T18:50:00Z | -1544692717167823854 | 65.18 msecs | 65.12 msecs | 3 | 54600 | select min(ts) from user_operations where user_id=@userid |
| 2024-03-07T18:30:00Z | -1544692717167823854 | 7.53 msecs | 6.98 msecs | 3 | 0 | select min(ts) from user_operations where user_id=@userid |
| 2024-03-07T18:10:00Z | -1544692717167823854 | 7.14 msecs | 6.91 msecs | 3 | 0 | select min(ts) from user_operations where user_id=@userid |
+----------------------+----------------------+--------------+-------------+--------------+----------------------+------------------------------------------------------------+

Summary

Above, I’ve described just a few of the possible scenarios where the built-in SPANNER_SYS.QUERY_PROFILES_TOP_* tables are a good way to understand why our query latency has changed.

Unfortunately, not all queries have sampled query plans (according to the documentation, only queries that consume more CPU might be sampled), so this kind of analysis is available only for a limited number of queries.
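
To check upfront which queries have sampled profiles available, we can count samples per fingerprint. A quick sketch using the hourly table:

-- Sketch: which query fingerprints have sampled profiles, and how many.
SELECT text_fingerprint,
       COUNT(*) AS samples,
       MAX(interval_end) AS last_sample
FROM SPANNER_SYS.QUERY_PROFILES_TOP_HOUR
GROUP BY text_fingerprint
ORDER BY samples DESC;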

Here is a query template to obtain all of the query statistics for a sampled query plan:

SELECT interval_end,
text_fingerprint,
JSON_VALUE(STATS,'$.elapsed_time') AS elapsed_time,
JSON_VALUE(STATS,'$.cpu_time') AS cpu_time,
JSON_VALUE(STATS,'$.rows_scanned') AS rows_scanned,
JSON_VALUE(STATS,'$.deleted_rows_scanned') AS deleted_rows_scanned,
JSON_VALUE(STATS,'$.rows_returned') AS rows_returned,
JSON_VALUE(STATS,'$.data_bytes_read') AS data_bytes_read,
JSON_VALUE(STATS,'$.bytes_returned') AS bytes_returned,
JSON_VALUE(STATS,'$.remote_server_calls') AS remote_server_calls,
JSON_VALUE(STATS,'$.locking_delay') AS locking_delay,
JSON_VALUE(STATS,'$.filesystem_delay_seconds') AS filesystem_delay_seconds,
JSON_VALUE(STATS,'$.query_plan_creation_time') AS query_plan_creation_time,
JSON_VALUE(STATS,'$.runtime_creation_time') AS runtime_creation_time,
JSON_VALUE(STATS,'$.statistics_load_time') AS statistics_load_time,
JSON_VALUE(STATS,'$.optimizer_version') AS optimizer_version,
JSON_VALUE(STATS,'$.optimizer_statistics_package') AS optimizer_statistics_package,
JSON_VALUE(STATS,'$.query_text') AS query_text,
from (
SELECT
interval_end, text_fingerprint,JSON_QUERY(query_profile,'$.queryStats') AS STATS
FROM
SPANNER_SYS.QUERY_PROFILES_TOP_MINUTE
where text_fingerprint=<id>);
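
To find the text_fingerprint to plug into the template above, the SPANNER_SYS.QUERY_STATS_TOP_* tables expose the fingerprint next to the query text. A sketch (the LIKE filter is just an example):

-- Sketch: look up the fingerprint of the query we want to inspect.
SELECT interval_end,
       text_fingerprint,
       text
FROM SPANNER_SYS.QUERY_STATS_TOP_MINUTE
WHERE text LIKE '%user_operations%'
ORDER BY interval_end DESC
LIMIT 10;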
