Google Cloud Spanner Nodes
As we continue our exploration of Google Cloud Spanner, one concept stands out: nodes.
Typically in cloud computing, an instance is synonymous with a single virtual machine, and a node is the set of CPUs, memory, storage and networking underlying that VM. In Cloud Spanner the relationship still holds, but it is more complex due to the nature of instances. Nodes are also directly tied to your instance configuration, and therefore to your scaling, so it is important to understand them well; I explain them in more detail below.
Nodes
The Cloud Spanner documentation loosely defines a node as a collection of resources: CPU, RAM, and 2 TB of storage.
This is quite simple: if your database monitoring shows that you are using more resources than is optimal, or that you are running low on storage, you simply go into the Cloud Console and add a node. This, as per the definition, adds more compute and storage resources to your instance.
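You can also do this programmatically. Below is a minimal sketch using the google-cloud-spanner Python client to scale an instance up by one node. The instance ID "my-instance" is a placeholder, and details such as the node_count attribute may vary slightly between library versions, so treat this as illustrative rather than definitive.

```python
# Minimal sketch: add one node to an existing Cloud Spanner instance.
# Assumes Application Default Credentials and an instance named "my-instance".
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")

instance.reload()                            # fetch current metadata, including node_count
instance.node_count = instance.node_count + 1

operation = instance.update()                # returns a long-running operation
operation.result(timeout=300)                # block until the resize completes
print(f"Instance now has {instance.node_count} node(s)")
```

Note the reload() call before update(): it ensures the existing display name and current node count are read from the service before the new node count is written back.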
This is where the Cloud Spanner node differs from the norm: adding a node doesn’t just add a single set of compute resources to your instance; it adds a set of resources to each replica within your instance.
When adding a node to a traditional distributed or clustered database, you are only adding a single computing resource or server to your cluster. As an administrator, you have to manage how that node is used by the database: whether it becomes a new shard, a read-write replica, a read-only replica, a witness replica, or a warm failover cluster. Depending on the database, the options can be numerous and the administration overhead costly.
In Spanner, the administration is transparent, and the definition of a node therefore encompasses all the resources required to increase the capacity of your instance, regardless of whether it is regional or multi-regional, and whether it requires read-only replicas, witness replicas, or both.
That is quite a powerful concept, and it speaks to the “fully managed”, “unlimited scale”, and “99.999% availability” claims. Each node you add is automatically managed by the instance: it is replicated, sharded and used to increase scale within the highly available architecture of Cloud Spanner. In other databases, adding compute or storage necessitates configuration decisions: is it used for primary compute, failover, replication, backup, and so on? In Cloud Spanner, you simply click in the Cloud Console to add another node, everything happens in the background, and your application has more highly available resources as per your instance configuration.
The Cloud Spanner documentation on instances lists all the available instance configurations and explains how the number of replicas you get depends on your regional or multi-regional configuration.
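If you prefer to inspect the configurations programmatically, here is a small sketch that lists them for your project, again assuming the google-cloud-spanner Python client and default credentials; the exact shape of the returned configuration objects may vary by library version.

```python
# Minimal sketch: list the instance configurations available to a project.
from google.cloud import spanner

client = spanner.Client()

# Each configuration (regional or multi-region) determines the replica topology
# that every node you add will be spread across.
for config in client.list_instance_configs():
    print(config.name, "-", config.display_name)
```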
A quick note on replicas, nodes and instance configurations
Each regional instance has 3 read-write replicas, so each node added to your instance in this case results in additional compute and storage for each replica: 3 sets in total. This maintains 99.99% availability and the stated per-node performance of “up to 10,000 queries per second (QPS) of reads or 2,000 QPS of writes (writing single rows at 1 KB of data per row)”.
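To make those numbers concrete, here is a back-of-the-envelope sketch of how aggregate throughput scales with node count for a regional instance, using only the per-node figures quoted above. Real throughput depends on schema design, workload and hotspots, so this is illustrative only.

```python
# Back-of-the-envelope throughput estimate for a regional instance,
# based on the per-node figures quoted in the documentation excerpt above.
READ_QPS_PER_NODE = 10_000
WRITE_QPS_PER_NODE = 2_000

def estimated_throughput(node_count):
    """Return (max read QPS, max write QPS) for a given node count."""
    return node_count * READ_QPS_PER_NODE, node_count * WRITE_QPS_PER_NODE

reads, writes = estimated_throughput(3)
print(f"3 nodes: ~{reads:,} read QPS or ~{writes:,} write QPS")
```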
Most multi-region instances have 4 read-write replicas and one witness replica, so adding a node increases the total compute and storage across all 5 replicas. This allows Cloud Spanner to provide the 99.999% availability. Over and above that, the “nam6” configuration has 2 additional read-only replicas, and the “nam-eur-asia1” configuration spans 3 continents, with 4 additional read-only replicas.
This is quite astounding: adding a node to your “nam-eur-asia1” instance means that Google provisions 9 sets of CPUs and memory, and 9 sets of 2 TB of additional storage capacity (one for each replica), to support your highly available instance.
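The same arithmetic can be written down for each configuration mentioned above. This is illustrative only, following the replica counts described in this section and the 2 TB per-node storage figure from the definition earlier.

```python
# Illustrative arithmetic: sets of node resources provisioned per node added,
# using the replica counts described in this section.
REPLICAS_PER_CONFIG = {
    "regional": 3,                          # 3 read-write replicas
    "multi-region (typical)": 5,            # 4 read-write + 1 witness
    "nam6": 7,                              # plus 2 read-only replicas
    "nam-eur-asia1": 9,                     # plus 4 read-only replicas
}
STORAGE_TB_PER_NODE = 2                     # per-node storage figure from the definition above

def provisioned_sets(replicas, nodes):
    """Return (resource sets, total TB of storage capacity) for a node count."""
    sets = replicas * nodes
    return sets, sets * STORAGE_TB_PER_NODE

for name, replicas in REPLICAS_PER_CONFIG.items():
    sets, tb = provisioned_sets(replicas, nodes=1)
    print(f"{name}: {sets} resource sets, {tb} TB of storage per node added")
```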
These 9 sets of resources are managed in such a way that you not only get replication and failover for high availability, but also external consistency regardless of the global distribution of both your database and its users. It is worth reading the whitepapers and documentation on TrueTime and external consistency if you are interested in understanding how Google uses Paxos along with atomic and GPS clocks to manage all these resources AND provide the highest level of consistency available.