Cloud Spanner — Understanding how the number of participants affects commit latency and instance’s CPU usage

Pablo Arrojo
Mar 23, 2023


Cloud Spanner is a distributed database, which means that as your database grows, Spanner divides your data into chunks called splits. Individual splits can move independently from each other and get assigned to different servers, which can be in different physical locations.

All splits involved in a transaction are known as participants. We can see how many splits participate in a transaction by checking the avg_participants metric in the SPANNER_SYS.TXN_STATS* tables or by using the Transaction Insights introspection tool.
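As a rough illustration, the metric can be read with a SQL query against one of the TXN_STATS tables. The sketch below only builds the query text in Python. The helper name build_participants_query is hypothetical, and the table suffixes and column names are assumptions based on how I understand the Spanner transaction statistics documentation, so check them against your own instance before relying on them.

```python
# Hypothetical helper: builds a query for the transaction statistics tables.
# The table suffixes and column names below are assumptions; verify them
# against the SPANNER_SYS schema of your instance.
VALID_SUFFIXES = ("MINUTE", "10MINUTE", "HOUR")


def build_participants_query(suffix: str = "MINUTE", limit: int = 10) -> str:
    """Return SQL listing the most recent intervals with their avg_participants."""
    if suffix not in VALID_SUFFIXES:
        raise ValueError(f"suffix must be one of {VALID_SUFFIXES}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT interval_end, fprint, avg_participants, "
        "avg_commit_latency_seconds, commit_attempt_count\n"
        f"FROM SPANNER_SYS.TXN_STATS_TOP_{suffix}\n"
        "ORDER BY interval_end DESC, avg_participants DESC\n"
        f"LIMIT {int(limit)}"
    )


print(build_participants_query())
```

The resulting text can be pasted into the query editor of the console or run through any Spanner client.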

So, why is this metric useful for troubleshooting performance issues?

Let’s run a simple lab to find out.

Lab description

The test will be very simple. I’ll use the YCSB tool to run a workload with a single session at about 2.5K requests per minute (RPM) against a table without secondary indexes.

This tool uses mutations to write data, and I’ve configured it to insert 6 rows per commit.

I’ll start with an empty table, so at the beginning the number of participants will be just one. As the table grows, the number of participants will also grow (up to six in the worst case, one per inserted row), and we will be able to see whether latency and other metrics change as the number of participants increases.

On the Spanner side, I’ll use a one-node regional instance.
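To get a feel for how the average could move between one and six, here is a minimal sketch under a strong assumption that is not part of the lab itself: each of the six rows lands on one of the table's splits uniformly at random and independently of the others. Real split boundaries and key distribution will differ, and the function name is only illustrative.

```python
def expected_participants(num_splits: int, rows_per_commit: int = 6) -> float:
    """Expected number of distinct splits touched by one commit.

    Assumes every row falls on one of `num_splits` equally likely splits,
    independently of the other rows. This is a simplification, not a
    description of how Spanner actually places data.
    """
    if num_splits < 1 or rows_per_commit < 1:
        raise ValueError("num_splits and rows_per_commit must be >= 1")
    # Probability that a given split receives none of the rows.
    miss = (1 - 1 / num_splits) ** rows_per_commit
    return num_splits * (1 - miss)


for splits in (1, 2, 5, 10, 50):
    print(splits, round(expected_participants(splits), 2))
```

Under this assumption the average starts at exactly one for a single split and approaches six only when the table has many splits.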

YCSB example

bin/ycsb load cloudspanner -P cloudspanner.properties -P workloads/workloada -p recordcount=40000000 -p cloudspanner.batchinserts=6 -threads 1 -target 250 -s
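As a sanity check, the flags in this command line up with the 2.5K RPM figure if we assume that every inserted row counts as one YCSB operation towards -target and that cloudspanner.batchinserts rows are grouped into a single commit. The function name below is only illustrative.

```python
def commits_per_minute(target_ops_per_sec: float, rows_per_commit: int) -> float:
    """Commit rate implied by a YCSB ops/sec target and a batch size.

    Assumes one operation per inserted row and one commit per batch.
    """
    if rows_per_commit < 1:
        raise ValueError("rows_per_commit must be >= 1")
    return target_ops_per_sec * 60 / rows_per_commit


# -target 250 and cloudspanner.batchinserts=6, as in the command above
print(commits_per_minute(250, 6))
```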

Let’s get started

Our job is already running and we can begin to check the metrics.

As we’re starting with an empty table, the number of participants is just one. We can use these metrics as a baseline:

  • avg_participants: 1
  • avg commit latency: 3.2ms
  • CPU: 6.37%

A few minutes later, the number of participants for our transaction increased to two. How did this affect the metrics?

That’s interesting: commit latency roughly doubled (from about 3ms to 6ms) at exactly the same time that the number of participants jumped from 1 to 2.

What about the CPU?

The CPU usage also roughly doubled, increasing to 13.66%.

Ok, let’s think about this for a minute.

When there is more than one participant for a transaction, Spanner uses 2PC as a commit protocol. What does this mean?

First, let’s review how a transaction commit works with a single participant:

  • The split leader acquires the locks, sends the commit decision to its replicas, and finally commits the transaction and applies the mutations.

Now, how does 2PC work?

  • There is a first phase where Spanner chooses one participant as the coordinator. The job of the coordinator is to make sure the transaction either commits or aborts atomically across all participants.
    The coordinator and the other participants acquire locks on their split leaders and replicas.
    Then the coordinator sends the commit decision to its replicas.
  • Finally, there is a second phase:
    The coordinator communicates the transaction outcome to the participants, and they replicate the commit decision to their replicas.
    The coordinator confirms the transaction, and then the mutations are applied on all the participants and their replicas.

Of course, both are very simplified explanations of how commits work, but I think they help us understand the reason for the increase in commit latency.

In summary, the commit is now composed of two sequential internal operations plus extra cross-split coordination, hence the higher latency compared to a commit with a single participant.
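The reasoning above can be written down as a toy model. This is only a sketch, not how Spanner is implemented: it assumes each lock-and-replicate round costs about the same time (round_ms, a made-up parameter set near the baseline commit latency) and that participants within a phase work in parallel, so a phase costs one round.

```python
def commit_latency_ms(participants: int, round_ms: float = 3.0) -> float:
    """Toy commit latency model.

    round_ms is an assumed cost for one lock-and-replicate round,
    not a measured Spanner internal.
    """
    if participants < 1:
        raise ValueError("a transaction has at least one participant")
    # One participant: a single round on the split leader and its replicas.
    # Several participants: two sequential 2PC phases, one round each.
    phases = 1 if participants == 1 else 2
    return phases * round_ms


for n in (1, 2):
    print(n, commit_latency_ms(n))
```

In this model the jump from one to two participants adds a whole extra round, which matches the doubling seen in the metrics.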

What happens if the number of participants keeps growing? Let’s see.

To save time, I loaded a couple of million rows to get a larger table (hoping this leads to a higher number of participants for our transaction) and then started our job again.

Now avg_participants has increased to 3.5, but there is no significant change in the commit latency compared with the last test. It is still about 6ms.

Let’s take a look at the CPU:

Unlike commit latency, the CPU usage was affected: it increased from 13.6% to 19.6%.

At this point we have two hypotheses:

  • A higher number of participants doesn’t affect commit latency once avg_participants is greater than or equal to 2.
  • CPU usage increases linearly with the number of participants.

I want to be sure before drawing a conclusion, so let’s do one more test.

Again, I loaded a few million rows to get a larger table and then restarted our job. Here are the results:

Despite the number of participants being more than twice as high (4.8 vs 2), there is no difference in the commit latency.

Let’s check the CPU usage:

It keeps increasing, and is now around 24%. Well, I guess my hypotheses were right.

So, why does CPU usage increase with the number of participants?

I think the simplest answer is that we get a higher total CPU time because Spanner is doing more internal operations (split coordination, lock acquisition, applying mutations, data replication and so on) per unit of time. Let’s explain this a bit more:

In our test we have a single session running 2.5K RPM. Let’s say that for each split involved in a transaction there is an operation that takes CPU time T.

  • With a single participant per transaction, 2.5kT of CPU time is used per minute.
  • With 4.8 participants (the highest number reached), 12kT of CPU time is used per minute.

Therefore, higher CPU usage is expected behavior.
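The same arithmetic as a short sketch. T is the abstract per-split CPU cost from the text, so the result is expressed in multiples of T rather than in real seconds, and the function name is only illustrative.

```python
def cpu_time_units(commits_per_minute: float, avg_participants: float) -> float:
    """CPU time used per minute, in multiples of the per-split cost T."""
    if commits_per_minute < 0 or avg_participants < 1:
        raise ValueError("need a non-negative rate and at least one participant")
    return commits_per_minute * avg_participants


for participants in (1, 2, 3.5, 4.8):
    units = cpu_time_units(2500, participants)
    print(f"{participants} participants: {units:.0f} T per minute")
```

The request rate never changes in this model; only the work done per request does.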

Summary

Here is a simple graph to show more clearly how the commit latency and CPU usage were affected as the number of participants increased.

We can see that CPU usage increases roughly linearly with the number of participants, while commit latency stays stable once the number of participants is greater than or equal to two.
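For readers who prefer numbers to a chart, the measurements reported in the tests above can be put side by side and compared against the baseline. The latency values for the last three rows are the approximate "about 6ms" readings, and the last CPU value is the approximate 24%.

```python
# (avg_participants, commit latency in ms, CPU %) as reported in the tests.
# Latencies of 6.0 and the final CPU value are approximate readings.
observations = [
    (1.0, 3.2, 6.37),
    (2.0, 6.0, 13.66),
    (3.5, 6.0, 19.6),
    (4.8, 6.0, 24.0),
]


def relative_to_baseline(rows):
    """Return (participants, latency ratio, CPU ratio) versus the first row."""
    _, base_latency, base_cpu = rows[0]
    return [(p, lat / base_latency, cpu / base_cpu) for p, lat, cpu in rows]


for p, latency_x, cpu_x in relative_to_baseline(observations):
    print(f"{p} participants: latency x{latency_x:.2f}, CPU x{cpu_x:.2f}")
```

The latency ratio settles just under 2x from two participants onwards, while the CPU ratio keeps climbing with every step.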

Conclusion

With this simple lab we now understand how the number of participants involved in our transactions can impact the performance of a Spanner instance.

Regarding latency, with more than one participant we should expect a commit latency up to 2x higher than with a single participant. Since this only affects commit latency, it does not necessarily have a significant impact on the total latency of the transaction. For example, for a transaction with a total latency of 23ms where the commit latency is about 3ms, doubling the commit latency adds 3ms, which is an increase of only about 13% in the total latency of the transaction.
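The example can be checked with a few lines, assuming the 23ms total already includes the 3ms spent in the commit. Under that reading the extra 3ms is a bit over 13% of the total; measured against the remaining 20ms it would be 15%. The function name is only illustrative.

```python
def total_latency_increase(total_ms: float, commit_ms: float, factor: float = 2.0) -> float:
    """Relative growth of total latency when the commit part is multiplied by `factor`.

    Assumes commit_ms is already included in total_ms.
    """
    if total_ms <= 0 or not 0 <= commit_ms <= total_ms:
        raise ValueError("need 0 <= commit_ms <= total_ms and total_ms > 0")
    extra_ms = commit_ms * (factor - 1)
    return extra_ms / total_ms


print(f"{total_latency_increase(23, 3):.1%}")
```

The smaller the share of the commit in the whole transaction, the less a doubled commit latency is noticed end to end.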

But what I find most significant is how the CPU usage is affected. In our lab we started with about 6% CPU usage and finished with 24%, which is 4x higher while processing exactly the same throughput and number of requests per minute. It is important to keep this in mind, because several scenarios can increase the number of participants in our transactions (such as table growth, or adding secondary indexes or change streams), which could push CPU utilization above the recommended threshold and affect the overall performance of the Spanner instance.
