Get started with Google Cloud Spanner
Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.
Introduction:
Before we talk about Cloud Spanner, lets look at the broad range of database services available on GCP which caters various applications and use cases. These databases majorly fall in two categories. First, traditional databases which offers compatibility and easy of migration from on-prem to GCP. Second is cloud native databases who offers unique online unlimited scale and industry-leading availability.
- Cloud SQL provides ease of migration and compatibility for traditional databases such as SQL server, MySQL, and Postgres.
- Bare Metal Solution for Oracle workloads.
- Cloud Firestore to handle Document type storage.
- Cloud Bigtable supports Key-Value pair.
- BigQuery for data warehouse and analytics requirements.
- Caching service with Cloud Memory Store which supports both Redis and Memcached.
Why Cloud Spanner?
Cloud Spanner is a fully managed native cloud database solution built by Google on GCP platform. It is the first scalable, enterprise-grade, globally-distributed, and strongly consistent database service built for the cloud specifically to combine the benefits of relational database structure with non-relational horizontal scale.
Cloud Spanner provides all the traditional benefits of a SQL database: Strong consistency, high availability, schema changes without downtime, and SQL queries. But unlike many relational databases ,which only support vertical scaling, Cloud Spanner scales horizontally also — from one to thousands of servers.
With fully automatic data replication and server redundancy within and across regions delivers availability upto 99.999% depending on the instance type i.e. regional or multi-regional.
Google uses Spanner database internally to manage the workload for its various products and services for many years such as AdWords and Gmail.
In summary, Cloud spanner is no compromise database for modern applications that is globally distributed, strong consistent and combines the benefit of relational database with non-relational horizontal scale.
Lets get started with Cloud Spanner
To start with Cloud Spanner, lets provision a Cloud Spanner instance first using GCP Console. Login to GCP console and select Spanner. It will show following screen when clicked on provision Cloud Spanner instance.
During the creation of an instance, it’s important to know the instance configuration and the node count.
The instance configuration can be a regional or a multi-regional . Choose the regional configuration if the users and the services are co-located in one region. However, if the application often needs to read data from multiple geographic locations (for example, to serve data to users in both Europe and India), or if the writes originate from a different location than the reads (for example, if there are write workloads in Europe and read workloads in North America), then a regional configuration may not be optimal. In that scenario, multi-regional will be better setup.
The node count for an instance should be figured out based on the expected read and write throughput requirement and the size of each read and write. It also depends on the total data stored in the instance. Cloud Spanner needs 1 node for 2 TB of data stored while the throughput varies by data size in each row and by the region of the node. A good rule of thumb to use — Assuming each row has about 1 KB of data, each node supports about 7000 QPS read throughput or about 1800 QPS write throughput.
For demo purpose, lets select regional configuration and 1 node count.
Once an instance has been created, the next step is to create a database within that instance to store data in tables.
Cloud Spanner database supports two SQL language — Google Standard SQL or PostgresSQL. You can choose any of these two while creating the database inside Cloud Spanner Instance.
Once database is created, you are good to create your schema and tables. Lets understand some basic principles about schemas and data model for Cloud Spanner
Spanner Database
Spanner relies on below three concepts to maximise the efficiency and performance of a database.
- Primary keys
- Table interleaving
- Foreign keys
Primary keys help Spanner to know how to store rows that belong to multiple tables. Every table has a primary key, composed of one or more columns of that table. Please note- the primary key column(s) of the parent table must be the prefix of the primary key of the child table. This is how Spanner physically co-locate rows of related tables. Check out the GCP public documentation to know more about choosing a primary key
Table interleaving is a good choice for parent-child relationships where the child table’s primary key includes the parent table’s primary key columns. This co-location of child rows with their parent rows can enhance the performance significantly.
Foreign keys are a more general parent-child solution and address additional use cases. They are not limited to primary key columns, tables can have multiple foreign key relationships, and does not imply co-location of the tables in the storage layer.
The right combination of these tools can help avoid hotspots and spread the load across multiple servers.
Lets see some sample DDL statements using which you can create and drop objects in Spanner database
CREATE TABLE demo_table (
id INT64 NOT NULL,
name STRING(MAX)
) PRIMARY KEY(id);
ALTER TABLE demo_table
ADD COLUMN new_column STRING(1024);
DROP TABLE demo_table;
CREATE INDEX demo_table_index
ON demo_table (name);
DROP INDEX demo_table_index;
CREATE OR REPLACE VIEW demo_view
SQL SECURITY INVOKER AS
SELECT demo_table.name AS CustomerName FROM my_table;
DROP VIEW demo_view;
Lets look at some good practices while designing the schema and tables in Spanner
Primary Key — The primary key should be chosen such that it does not create hotspots in the database. One cause of hotspots is having a column whose value monotonically increases as the first key part, because this results in all inserts occurring at the end of your key space. Some common techniques to avoid hotspot and spreading the load across the servers
- Swap the order of the keys so that the column that contains the monotonically increasing or decreasing value is not the first key part.
- create a column that contains the hash of the actual unique key, then use the hash column (or the hash column and the unique key columns together) as the primary key.
- Universally Unique Identifier (UUID) as defined by RFC 4122 can also be used as the primary key. Version 4 UUID is recommended, because it uses random values in the bit sequence.
- For unique numerical primary keys, one way to have a roughly equal distribution over the entire number space is to generate sequential numbers by conventional means, then bit-reverse them to obtain the final values.
Row size — The size of a row should be less than 4 GB for best performance. The size of a row includes the top-level row and all of its interleaved child and index rows.
Interleaved tables — Cloud Spanner can only create splits along top-level rows. So it’s important to interleave the tables in such a way so that any of the child table rows do not become hotspots because Cloud Spanner cannot create splits at child table row level.
Timestamp-based keys — If the database has a table for history that’s keyed by timestamp, then it should use descending order for the key column(s). If this is not the case and there is a need to fetch the latest history (which is often the case with history tables), recent history will require a seek in the middle which can be highly inefficient.
Cloud Spanner Scaling
Cloud Spanner allows scaling the number of nodes in an instance up and down on demand without any maintenance window. However, it is important to know when to scale the number of nodes up or down. You can check the utilization of cloud spanner instance in Spanner Monitoring Console — It provides two charts which show CPU utilization
- 24 hour rolling average CPU utilization
- High Priority CPU utilization
If any of these go higher than recommended threshold (threshold is denoted by red dashed line in the chart) then you need to scale up the instance. Similary , if you see utilization way below then you can take the decision to scale down the Spanner instance.
Cloud Spanner scales linearly. So you can estimate the amount of scaling by assessing your current operating state (traffic and utilization) and the expected operating state. For example, if you have an instance with 15 nodes currently at 80% CPU utilization and you would like to reduce that to 60%, scaling your instance up by 5 nodes is a good starting point.
Although, scaling up and down in Spanner is quite easy and literally we just need to change the number of nodes based on the need but doing it every few minutes is not recommended as scaling up or scaling down an instance involves unloading database splits from existing servers that are serving them and moving them to different servers in your instance. Traffic which was previously issued to these servers will be redirected to the new servers that are hosting these database splits. Hence, we can expect to see an increase in tail latencies when a scale down event occurs or rebalancing in case of scaling up.
The cost of using Cloud Spanner is directly proportional to the number of nodes being used in instances. Also, the performance of Cloud Spanner depends on the number of nodes because the nodes determine the available throughput and storage for your Cloud Spanner instance. Hence, having the correct number of nodes is critical for customers who want to have the optimum mix of performance and cost.
Conclusion
Cloud Spanner is a powerful database service that offers a wide range of features and benefits. It is a good choice for applications that require high availability, strong consistency, and horizontal scalability. If you are looking for a database service that can meet the needs of your mission-critical applications, then Cloud Spanner is a good option to consider.