Cloud Spanner’s Table Interleaving: A Query Optimization Feature

By Christoph Bussler and Anand Jain

Cloud Spanner’s use of distributed persistent data management

Schema feature: table interleaving

CREATE TABLE Singers (
SingerId INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
SELECT s.FirstName,
s.LastName,
a.AlbumTitle
FROM Singers s JOIN Albums a ON s.SingerId = a.SingerId;
SELECT s.FirstName,
s.LastName,
ARRAY(SELECT AS STRUCT a1.AlbumTitle
FROM Singers s1
JOIN Albums a1 ON s1.SingerId = a1.SingerId
WHERE s.SingerId = s1.SingerId) Albums
FROM Singers s
WHERE EXISTS(SELECT a2.AlbumTitle
FROM Albums a2
WHERE a2.SingerId = s.SingerId);

Alternative to table interleaving

CREATE TABLE SingersAndAlbums (
SingerId INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
Albums ARRAY<STRING(MAX)>,
) PRIMARY KEY (SingerId);
  • Storage collocation. Since the albums are stored as an array in a column they are automatically collocated on storage with the singer’s data.
  • Convenient querying. Selecting the albums of a singer is a non-join select statement.
  • Array elements are scalar. Array elements can only be scalar types. For the use case above, the album’s id and title would have to be stored in a single string (with some delimiter or as a JSON object).
  • Array manipulation. Array manipulation has specific functions and SQL syntax, as explained here.
  • Column value size limit 10 MB. The limit of a column value size is 10 MB. If the data in an array can exceed that, an error occurs and an alternate design has to be found.
  • Complete array unit of read access. Compared to interleaved rows which can be individually selected, an array is always read in its entirety.

Additional benefits of table interleaving

Performance optimization

Best practice design process

  1. Design the database schema without interleaved tables in an initial schema design phase
  2. During the schema design phase, keep in mind that tables representing 1:n relationships might have to be changed to interleaved tables later on for performance reasons. Design the primary keys appropriately if possible so that the primary key of the potential parent table is the leading part of the composite primary key of the potential child table
  3. Design all queries (to the extent possible) that will operate on the schema
  4. Review all joins and assess their nature. For the purpose of table interleaving, call out the joins that query 1:n relationships. Measure their execution performance and observe the execution frequency for those joins that query 1:n relationships
  5. If there are 1:n joins between two tables for about 80+% of the time consider putting those tables into an interleaving relationship. Measure the query execution performance in both schemas (with and without interleaved tables) and determine if table interleaving provides a performance benefit

Caveats and edge cases

  • When interleaving tables, there is a soft limit of about 8GB for the size of all child rows combined. Up to that point all child rows are collocated with the parent row on the same split. Any additional child rows are stored separately for the next 8GB in another split.
  • It is tempting to keep historical data for a parent row as child rows. For example, every change of a parent is added as a separate child row. This is a design that ensures an ever growing number of child rows and will run into the 8GB soft limit at some point in time possibly slowing down query performance. History data is in general best stored in a separate table.
  • Accessing an interleaved table without accessing the parent table will not reap the benefits of storage collocation. Instead a query accessing child rows only might incur a performance penalty as the query might have to access many different splits holding child rows that otherwise might be stored contiguously.
  • If only parent rows are accessed without at the same time joining child tables, and the query’s predicate not a primary key then the benefit of interleaving might not materialize either since the query will have to access many splits “jumping” over the child rows.
  • If after the schema design phase interleaved tables are used, and in the worst case all tables are interleaved with each other in one form or another, then step back and reevaluate if this is really the best design given that the queries are unknown or maybe not known completely at this point in time.

Data size matters

Summary

Next steps

Acknowledgements

Disclaimer

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store