Google Cloud Spanner: Stream Data from a Query to a Transaction in NodeJS
Google Cloud Spanner is a fully managed, scalable, relational database service for regional and global application data. It is the first scalable, enterprise-grade, globally-distributed, and strongly consistent database service built for the cloud specifically to combine the benefits of relational database structure with non-relational horizontal scale.
Queries in Cloud Spanner can be executed using a streaming RPC. The NodeJS client library uses this streaming API to return query results as standard NodeJS streams. These streams can be piped together with other streams to write the data to some other output, and NodeJS will automatically apply back pressure to the read stream if the write stream is slower than the query stream. This post shows how to stream results from a query into a read/write transaction on Cloud Spanner in order to write (some of) the query data back to Spanner.
The Database#runTransaction function should be used to write data to Cloud Spanner in a transaction. This function does not, however, return a stream, and NodeJS will only apply back pressure to the read stream if it is piped into another stream. We therefore need to implement a custom write stream that writes the data to Spanner. See the NodeJS documentation for more background information on how to implement custom write streams.
Step 1: Write one row per transaction
We'll start with a simple implementation that writes only one row per transaction. For this we need to create a subclass of Writable that implements the Writable._write method.
The above code example will select all rows from the Singers table and then write all of these back into the same table, effectively duplicating all the rows in the table. Each row is written in a separate transaction.
NodeJS will apply back pressure to the read stream and keep memory consumption low, as the write stream is a lot slower than the read stream. We do not need to implement any custom logic for this to happen.
The exact memory consumption will depend on the size of the table that is being read and how Cloud Spanner partitions the query result. The query result consists of a stream of PartialResultSets, and the memory usage will be at least as big as the size of one PartialResultSet.
Step 2: Write multiple rows per transaction
The above example works, but it is highly inefficient. It would be better to batch rows together and write them in a single transaction, instead of starting a separate transaction for each row. This can be achieved by implementing the Writable._writev method instead of Writable._write.
In addition, we can set a highWaterMark in the options that are passed to the stream constructor. The default highWaterMark for object streams is 16, which means that our stream would buffer at most 16 rows before creating a transaction and writing them together. This value can be increased, depending on the size of your rows and the available memory. A higher value gives higher throughput, at the cost of increased memory consumption.
This example uses a highWaterMark of 200 rows. This can be increased further, but it must remain within the transaction limits of Cloud Spanner: one transaction may contain at most 20,000 mutations, where the number of mutations is roughly equal to the number of rows multiplied by the number of columns per row. A batch of 200 rows with 5 columns each, for example, costs roughly 1,000 mutations.
Query results from Cloud Spanner can be returned as a NodeJS stream and piped into any other NodeJS writable stream. The above examples have shown:
- How to create a custom stream to write data to a Cloud Spanner transaction.
- How we can use the internal buffering and back-pressure mechanisms of NodeJS streams to increase the execution speed of streaming data from a query to a transaction, while keeping the memory usage under control.