Cloud Spanner: Read Statistics

Published in

Google Cloud - Community

5 min readJul 15, 2020

Cloud Spanner is Google’s fully managed scalable relational database service. We recently announced a new feature, Spanner read statistics, that lets you run SQL queries to retrieve read statistics for your database during one-, 10-, and 60-minute intervals. These read statistics allow you to see the most common and most resource-consuming reads (done through Spanner Read APIs) executed on your database.

In this post, you’ll see how to use these read statistics to identify which reads are involved in high CPU usage.

Read statistics

Read statistics provide insight into how an application is using the database, and are useful when investigating performance issues. For example, you can check which read shapes are running against a database and how frequently they run to help explain the performance characteristics of these read shapes. You can use the read statistics for your database to identify read shapes that result in high CPU usage. At a high level, read statistics will help you to understand the behavior of read traffic going into a database in terms of resource usage.

Spanner provides built-in tables that store statistics about reads. You can retrieve statistics from these SPANNER_SYS.READ_STATS* tables using SQL statements. See Read Statistics in our official Spanner introspection documentation for more details.

Note: we’ll use the term “read shape” in this document to refer to the set of columns read in a read request. Read statistics track properties of these read shapes.

Aggregated read statistics

Spanner captures aggregated read statistics in the following system tables:

SPANNER_SYS.READ_STATS_TOTAL_MINUTE: Read shape statistics aggregated across one-minute intervals.
SPANNER_SYS.READ_STATS_TOTAL_10MINUTE: Read shape statistics aggregated across 10-minute intervals.
SPANNER_SYS.READ_STATS_TOTAL_HOUR: Read shape statistics aggregated across 60-minute intervals.

Each row in the above tables contains aggregated statistics of all reads executed over the database during the specific time interval. So, the above tables contain only one row for any given time interval.

Troubleshooting high CPU usage with read statistics

Spanner read statistics come in handy in cases where you need to investigate high CPU usage on your Spanner database or when you are just trying to understand the CPU-heavy read shapes on your database. Inspection of read shapes that use significant amounts of database resources gives Spanner users a potential way to reduce operational costs and possibly improve general system latencies. Using the following steps, we’ll show you how to use read statistics to investigate high CPU usage in your database.

Step 1: Selecting a time period to investigate

Start your investigation by looking for a time when your application began to experience high CPU usage. For example, let’s say the issue started occurring around 5:20pm on May 28, 2020.

Step 2: Gathering read statistics for the selected time period

Having selected a time period to start our investigation, we’ll look at statistics gathered on the READ_STATS_TOTAL_10MINUTE table around that time. The results of this query might give us clues about how CPU and other read statistics changed over that period of time. The following query returns the aggregated read statistics from 4:30 pm to 7:30 pm (inclusive).

SELECT
  interval_end,
  ROUND(avg_cpu_seconds,4),
  execution_count,
  avg_locking_delay_seconds
FROM SPANNER_SYS.READ_STATS_TOTAL_10MINUTE
WHERE
  interval_end >= "2020-05-28T16:30:00"
  AND interval_end <= "2020-05-28T19:30:00"
ORDER BY interval_end;

Let’s take the following data as an example of the result we get back from our query.

Here we see that average CPU time, avg_cpu_seconds, is higher in the highlighted intervals. For instance, the interval_end with the value 2020–05–28 19:20:00 has a higher CPU time. Let’s choose that interval to investigate further in the next step.

Step 3: Finding which read shapes are causing high CPU usage

Digging a little deeper, we now query the READ_STATS_TOP_10MINUTE table for the interval that we picked based on the preceding step. The results of this query can help indicate which read shapes cause high CPU usage.

SELECT
  read_columns,
  ROUND(avg_cpu_seconds,4),
  execution_count,
  avg_rows
FROM SPANNER_SYS.READ_STATS_TOP_10MINUTE
WHERE
  interval_end = "2020-05-28T19:20:00"
ORDER BY avg_cpu_seconds DESC LIMIT 3;

The following data is an example of the result from the query, returning information about the top three read shapes ranked by avg_cpu_seconds. Note the use of ROUND in our query to restrict the output of avg_cpu_seconds to four decimal places.

One reason for high CPU usage might be that you start to execute a few read shapes more frequently (execution_count). Perhaps the average number of rows that the read returned has increased (avg_rows). If none of those properties of the read shape reveal anything interesting, you can examine other properties such as avg_locking_delay_seconds, avg_client_wait_seconds or avg_bytes.

Step 4: Applying best practices to reduce high CPU usage

When you have gone through the preceding steps, consider whether any of these best practices will help your situation.

The number of times Spanner executed read shapes during the interval is a good example of a metric that needs a baseline to tell you if a measurement is reasonable or a sign of a problem. Having established a baseline for the metric, you’ll be able to detect and investigate the cause of any unexpected deviations from normal behavior.
When you notice a spike in CPU usage, check if it can be correlated with a similar sudden spike in user requests or application behavior. If so, it might not indicate a problem.
Try the following query to find the top read shapes ranked by the number of times Spanner executed for each read shape:

SELECT interval_end, read_columns, execution_count
FROM SPANNER_SYS.READ_STATS_TOP_MINUTE
ORDER BY execution_count DESC
LIMIT 10;

If you are looking for the lowest possible read latencies, especially when using multi-region instance configurations, use stale reads instead of strong reads to reduce/remove the AVG_LEADER_REFRESH_DELAY_SECONDS component of read latency.
If you are only doing reads, and you can express your read using a single read method, you should use that single read method. Single reads do not lock, unlike read-write transactions, therefore you should use read-only transactions over more expensive read-write transactions when you are not writing data.

In summary, Spanner read statistics provide greater observability and insight into your database behaviors in terms of read traffic. This is in addition to already existing transaction statistics and query statistics.