Writing efficient queries for
Cloud Spanner interleaved tables

Published in

Google Cloud - Community

8 min readMay 17, 2021

In this blog, we discuss how to write efficient queries for Cloud Spanner interleaved tables.

TL;DR

You can write a query which selects rows from a parent table and interleaved tables at once with SELECT AS STRUCT subqueries. This allows you to run queries efficiently while consuming less CPU utilization on the Cloud Spanner database.

Introduction

Cloud Spanner is a fully-managed, horizontally scalable relational database service by Google. One of the unique capabilities of Cloud Spanner is table interleaving, which allows you to define parent-child relationships between tables. The data in interleaved tables are physically co-located and it can improve read and write performance. For more details of table interleaving, please check this blog post written by Christoph and Anand.

When you design the schema with table interleaving, you might have a situation where you want to select rows from the parent table and child tables at once. This would be a common use case where your application has an API which needs to aggregate the user’s information from multiple Cloud Spanner tables.

Query for interleaved tables

To discuss this topic further, let’s think with the following tables and data:

CREATE TABLE Singers (
  SingerId INT64 NOT NULL,
  FirstName STRING(1024) NOT NULL,
  LastName STRING(1024) NOT NULL,
) PRIMARY KEY (SingerId);CREATE TABLE Albums (
  SingerId INT64 NOT NULL,
  AlbumId INT64 NOT NULL,
  Title STRING(1024) NOT NULL,
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;CREATE TABLE Concerts (
  SingerId INT64 NOT NULL,
  ConcertId INT64 NOT NULL,
  Price INT64 NOT NULL,
) PRIMARY KEY (SingerId, ConcertId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;# Sample Data
INSERT INTO Singers (SingerId, FirstName, LastName) VALUES (1, "Nick", "Porter");
INSERT INTO Albums (SingerId, AlbumId, Title) VALUES (1, 1, "Total Junk"), (1, 2, "Nice Field");
INSERT INTO Concerts (SingerId, ConcertId, Price) VALUES (1, 1, 3800), (1, 2, 5000), (1, 3, 4000);

Singers is a topmost parent table. Albums and Concerts are the child tables below Singers.

Given that we have these tables, what is the best approach to select all rows from the tables where SingerId = 1?

Figure 1: Selecting all rows where SingerId = 1

The naive approach would be selecting rows from a parent table and child tables respectively.

SELECT * FROM Singers WHERE SingerId = 1;
SELECT * FROM Albums WHERE SingerId = 1;
SELECT * FROM Songs WHERE SingerId = 1;

This might be ok for some situations, but as the number of interleaved tables increases, the number of queries also increases proportionally and it could degrade the performance.

Another approach would be using a JOIN clause to join the parent and child tables.

SELECT * FROM Singers s
INNER JOIN Albums a ON s.SingerId = a.SingerId
INNER JOIN Concerts c ON s.SingerId = c.SingerId
WHERE s.SingerId = 1;

With this approach you can use a single query to select all rows, but notice that the above query uses the same SingerId as join conditions for multiple tables. This type of join becomes a Cartesian Product and it could produce a large result set.

For example, if we run the preceding query for the tables which store 2 rows in Albums and 3 rows in Concerts, the result set becomes 2 * 3 = 6 rows as shown below:

Figure 2: The result set of the query using JOIN

Therefore, using JOIN for dozens of interleaved tables would not be a good idea as it could produce a large data set and potentially degrade the query performance.

Solution: Using subqueries with SELECT AS STRUCT

The solution to select rows from multiple tables at once is using subqueries with SELECT AS STRUCT statement. The query with this approach looks as follows:

SELECT *,
  ARRAY(SELECT AS STRUCT * FROM Albums WHERE SingerId = 1) as Albums,
  ARRAY(SELECT AS STRUCT * FROM Concerts WHERE SingerId = 1) as Concerts,
FROM Singers WHERE SingerId = 1;

This query looks a little bit complicated, so let’s break it apart and understand each part of the query. At first, take a look at the following subquery:

SELECT AS STRUCT * FROM Albums WHERE SingerId = 1

This SELECT statement selects rows with SingerId is equal to 1 from the Albums table. SELECT AS STRUCT statement is a way to get the result set as a STRUCT type and it produces the result with a single column storing multiple fields as follows:

+------------------------------------------------+
| Value                                          |
+------------------------------------------------+
| {SingerId: 1, AlbumId: 1, Title: "Total Junk"} |
| {SingerId: 1, AlbumId: 2, Title: "Nice Field"} |
+------------------------------------------------+

Then, a single Singer could have a zero or multiple Albums, so we can wrap the above statement with ARRAY() to make ARRAY Subquery.

ARRAY(SELECT AS STRUCT * FROM Albums WHERE SingerId = 1)

Finally, you can write down ARRAY(SELECT AS STRUCT ...) subqueries as many as the number of interleaved tables, and get the query we have seen first.

SELECT *,
  ARRAY(SELECT AS STRUCT * FROM Albums WHERE SingerId = 1) as Albums,
  ARRAY(SELECT AS STRUCT * FROM Concerts WHERE SingerId = 1) as Concerts,
FROM Singers WHERE SingerId = 1;

When you run this query in the Cloud Console, you can get the following result.

Figure 3: The result of the query using SELECT AS STRUCT subqueries

Though this result seems to have lost the field name of Albums and Concerts columns, in reality both columns actually retain the name and the type of the STRUCT field when accessing it from the programming language.

For example, Go’s Row.ToStruct can convert the above result into a Go’s struct in the following way.

type Singer struct {
  SingerID  int64      `spanner:"SingerId"`
  FirstName string     `spanner:"FirstName"`
  LastName  string     `spanner:"LastName"`
  Albums    []*Album   `spanner:"Albums"`
  Concerts  []*Concert `spanner:"Concerts"`
}type Album struct {
  SingerID int64  `spanner:"SingerId"`
  AlbumID  int64  `spanner:"AlbumId"`
  Title    string `spanner:"Title"`
}type Concert struct {
  SingerID  int64 `spanner:"SingerId"`
  ConcertID int64 `spanner:"ConcertId"`
  Price     int64 `spanner:"Price"`
}func decodeRow(row *spanner.Row) (*Singer, error) {
  var singer Singer
  err := row.ToStruct(&singer)
  return &singer, err
}

Lastly, let’s compare the query execution plans between using JOIN and using subqueries.

The following figure shows an execution plan for the query using JOIN.

Figure 4: Execution plan for the query using JOIN

The query joins 3 tables, so you can notice that the Cross Apply operator is used two times to join tables.

By contrast, the following figure shows an execution plan for the query with subqueries.

Figure 5: Execution plan for the query with subqueries

You can see that ARRAY(SELECT AS STRUCT) is executed as a subquery with Array Subquery operator.

Benchmark

Let’s run a benchmark to understand how efficient the above approach is.

In this benchmark, we assume that we have 5 tables, a parent table and 4 child tables, and select rows from those 5 tables. The following figure illustrates this benchmark scenario:

Figure 6: The benchmark to select rows from 5 tables

For simplicity we use the tables with columns including only the primary key.

CREATE TABLE `Parent` (
  ParentId INT64 NOT NULL,
) PRIMARY KEY (ParentId);CREATE TABLE Child01 (
  ParentId INT64 NOT NULL,
  Child01Id INT64 NOT NULL,
) PRIMARY KEY (ParentId, Child01Id),
  INTERLEAVE IN PARENT `Parent` ON DELETE CASCADE;CREATE TABLE Child02 (
  ParentId INT64 NOT NULL,
  Child02Id INT64 NOT NULL,
) PRIMARY KEY (ParentId, Child02Id),
  INTERLEAVE IN PARENT `Parent` ON DELETE CASCADE;CREATE TABLE Child03 (
  ParentId INT64 NOT NULL,
  Child03Id INT64 NOT NULL,
) PRIMARY KEY (ParentId, Child03Id),
  INTERLEAVE IN PARENT `Parent` ON DELETE CASCADE;CREATE TABLE Child04 (
  ParentId INT64 NOT NULL,
  Child04Id INT64 NOT NULL,
) PRIMARY KEY (ParentId, Child04Id),
  INTERLEAVE IN PARENT `Parent` ON DELETE CASCADE;

In this benchmark, we will compare two transactions; The first one is composed of multiple queries to select rows from tables one by one.

SELECT * FROM Parent WHERE ParentId = @id;
SELECT * FROM Child01 WHERE ParentId = @id;
SELECT * FROM Child02 WHERE ParentId = @id;
SELECT * FROM Child03 WHERE ParentId = @id;
SELECT * FROM Child04 WHERE ParentId = @id;

The second one is composed of a single query which selects rows from all tables at once with subqueries.

SELECT 
  *,
  ARRAY(SELECT AS STRUCT * FROM Child01 WHERE ParentId = @id) as c01,
  ARRAY(SELECT AS STRUCT * FROM Child02 WHERE ParentId = @id) as c02,
  ARRAY(SELECT AS STRUCT * FROM Child03 WHERE ParentId = @id) as c03,
  ARRAY(SELECT AS STRUCT * FROM Child04 WHERE ParentId = @id) as c04,
FROM Parent p WHERE ParentId = @id;

Other condition of the benchmark is as follows:

Node size for Cloud Spanner is 1.
Each table has 1M rows.
Transactions to run the queries are read-only transactions.
Benchmark code randomly selects the row keys.
Client machine is 32 vCPU GCE running on the same location as Cloud Spanner instance.

You can access the benchmark code in GitHub.

Benchmark Result

Here is the benchmark result:

These graphs show Cloud Spanner’s CPU utilization, Transaction Per Second (TPS), and Latency per query, from top to bottom. The left side of the graphs shows the result of using multiple queries while the right side of the graphs shows the result of using a single query.

As you can see from the graphs, while CPU utilization is almost same in both approaches, using a single query could process 2.5 times more transactions (10,000 TPS) than using a multiple queries could (4,000 TPS). To put it another way, with a proposed approach Cloud Spanner consumes less CPU utilization to serve the same amount of TPS.

Looking at the latency, using a single query looks slower (1.8 ms) than using multiple queries (1.4 ms), but this graph shows the latency *per query*, so when we grasp the latency at a transactional level, using multiple queries is slower because there are 5 queries in a single transaction and the total latency becomes 1.4 ms * 5 queries = 7 ms.

Summary

We discussed how we can write an efficient query for Cloud Spanner’s interleaved tables by using ARRAY(SELECT AS STRUCT) subqueries. Note that the benchmark result could differ depending on various conditions such as schema, query, or the number of interleaved tables, so if you’re interested in the performance benefit with this approach, please try it with your application’s schema and query.

Acknowledgement

I would like to thank Mourad El Azhari for the thorough review and helpful comments to make this article more accurate and readable!

Writing efficient queries forCloud Spanner interleaved tables