Unique integer generation in distributed systems
By João Reis, Senior Software Engineer — .NET
This post was originally published on our F-Tech Blog. Come check it out here :-)
The path to a geo-distributed platform
When we started designing a solution that would put us on the path to becoming a truly geo-distributed platform, one of the main blockers we encountered was the dependency of ‘IDENTITY’ columns in SQL Server databases. Several services with such dependency would eventually have to move its persistence layer to a multi-datacenter Apache Cassandra database instead of the current (at that time) single-datacenter SQL Server database.
Moving from SQL Server to Apache Cassandra
Replacing a database that follows a Master-Slave architecture with one that follows a masterless ring design poses several challenges. One of these challenges is the integer identifier generation process.
How can we generate an integer identifier using a masterless distributed NoSQL database like Cassandra? The most common approach to this problem is to either:
- Drop integer identifiers and just move to UUIDs.
- Use Twitter Snowflake or a similar algorithm (explained in more detail in the next section).
Replacing integers with UUIDs
UUIDs are a dream come true for anyone that needs to generate an identifier in a completely distributed way. Unfortunately, moving every single integer identifier to UUID would be a nightmare and would make us break the existing platform’s API contracts. Our clients would not be happy to know that we would simply stop supporting every endpoint in our API that depends on a generated integer identifier!
Another problem with UUIDs is related to the user experience. Eventually, the user will be exposed to some of these identifiers. Imagine that a customer calls Customer Service and is asked to provide the identifier of the product that she wants to buy. Having to spell a complete UUID by phone to the Customer Service employee is not a pleasant experience.
Keeping the integers: Twitter Snowflake
Twitter Snowflake and other similar algorithms assume that you can just allocate a significant number of bits of the generated value for metadata and not the actual sequence number. Identifiers generated by Twitter Snowflake are composed of a timestamp, worker number and sequence number. Worker number is chosen at startup via zookeeper and sequence number is a per-thread counter.
A standard implementation of twitter snowflake that generates 64-bit integers will allocate 41 bits just to the timestamp part. This means that using an algorithm like this one or similar ones to generate 32-bit integers would not work very well (probably wouldn’t work at all). These algorithms are always used to generate integers with 64-bits or more.
Moving from 32-bit integers to 64-bit?
We eventually started considering this move, but migrating to 64-bit integer identifiers would also mean a breaking change in our Platform API (even if it would be a more controlled move than just dropping integers and moving to UUIDs). Furthermore, changing the Domain models and database schemas of every service to support 64-bit integers would also consume some time and effort.
If we knew that we would be using up all the available 32-bit identifiers in the next couple of years, then we would probably have moved to 64-bit integers at the same time that we moved to Cassandra, but we weren’t (and still aren’t) close to reaching that stage.
We decided that we should develop a solution for our problem as a proof of concept and see how successful we would be. A solution would have the following requirements:
- A generated identifier must be a 32-bit integer.
- No collisions.
- Identifier generation must be possible even if the datacenter has lost connectivity with other datacenters temporarily (connectivity will be restored sooner or later).
This proof of concept was considered a success during internal testing, so we ended up adopting this custom made solution as the default strategy to generate 32-bit integer identifiers in any service that uses Cassandra as its database. I’ll explain how our solution was designed in the next sections.
This algorithm leverages a feature that Cassandra provides which lets us perform CAS (Compare and Set) operations. These operations are also named Lightweight Transactions and can be executed with consistency level SERIAL or LOCAL_SERIAL. A Lightweight Transaction with SERIAL consistency is basically a Paxos protocol with 50% + 1 nodes of the entire Cassandra cluster (all datacenters) as participants and a Lightweight Transaction with LOCAL_SERIAL consistency is the same procedure as SERIAL but considering the local datacenter only instead of the entire cluster.
In order to avoid the Twitter Snowflake style of algorithms to generate identifiers, we must have something in our database that lets us perform a CAS operation atomically, and Cassandra’s Lightweight Transactions fulfil that requirement.
Figure 1 shows a simplified sequence for the procedure. This procedure is explained in the following sections.
Figure 1: Simplified sequence diagram of the integer generation procedure.
Allocating integers to the local datacenter
To make sure that a specific datacenter only uses a certain range of integers, the process uses a CAS statement using SERIAL consistency. Since this is a slow and demanding operation, this range of values should be large enough so that they don’t have to be executed that often. Also, since SERIAL consistency cannot be guaranteed when a datacenter is isolated (failure in the connection between datacenters for example), the range of values used in this CAS statement should be large enough so that we don’t have downtime in the integer generation requests when multi-datacenter connections fail.
Let’s say the connection between two or more datacenters dies and stays dead for 24 hours. If we want to ensure no downtime exists in this scenario, then this CAS statement should allocate a range of integers that can fulfil the integer generation requests by that datacenter for at least 24 hours. Obviously, the specific value depends on the throughput for that specific identifier and datacenter.
Table 1 shows an example of a table that is used for the integer allocation process. One can see the schema that is required for this procedure to work. Here is a brief explanation of each column:
- key is the name of the integer sequence.
- dc is the name of the datacenter.
- last_integer is the last integer that was allocated to a specific process (relevant to the local process procedure that is explained in the next section).
- reserved_min is the lower bound of the range of integers that is allocated in an iteration of the process.
- reserved_max is the upper bound of the range of integers that is allocated in an iteration of the process.
If you take a look at the dc column, you will notice one of the rows contains the value GLOBAL. There is one such row for every integer sequence, and this is the row that represents the global state of the sequence, i.e. the state across all datacenters.
Allocating integers to the local datacenter essentially means “pulling” integers from the GLOBAL row to the one that is specific to the local datacenter, i.e. executing a conditional BATCH statement with SERIAL consistency that does two things:
- Move the reserved range (reserved_min and reserved_max) forward in the local datacenter row.
- Update the last_integer field of the GLOBAL row to match the new upper bound of the newly reserved range.
This BATCH statement must be conditional to ensure that there are no concurrency issues related to multiple allocation processes occurring at the same time.
key productID productID dc GLOBAL eu_west1 last_integer 102110 100267 reserved_min N/A 100110 reserved_max N/A 102110
Table 1: Simplified schema for the Cassandra table that is used by the algorithm.
Allocating integers to the local process
Each process that runs inside the datacenter eu_west1 allocates a range of integers for his own use, within the range of values allocated for eu_west1 following the process described earlier.
To do this, the process uses a CAS operation using LOCAL_SERIAL. This will ensure that each range of values will be used by only one process inside the datacenter eu_west1.
Looking at Table 1, it is a matter of simply adding X to the last_integer field of the local datacenter row, where X is the number of integers that the process wishes to allocate. Again, similarly to the procedure that was described in the previous section, we must use a conditional statement to ensure there are no concurrency issues. One huge difference from the previous procedure is that we only need to use LOCAL_SERIALconsistency here because processes only update the row that is specific to their local datacenter.
Before a datacenter or a process run out of integers to use, i.e. the generated identifier is getting closer to the upper limit of the allocated range (reserved_maxcolumn), a task will be executed asynchronously with the purpose of pre-allocating the next range of integers. This happens in both of the procedures that were described previously:
- When a datacenter needs more integers from the GLOBAL row.
- When a process needs more integers from the local datacenter row.
For this purpose, two new columns were added to the schema. Table 2 shows the new schema that supports this pre-allocation procedure. Instead of storing one range, the algorithm now stores two ranges and both ranges will be the same until the pre-allocation procedure kicks in. Table 2 shows an example of a scenario where the pre-allocation procedure hasn’t kicked in yet.
key productID productID dc GLOBAL eu_west1 last_integer 101110 100267 reserved_min N/A 100110 reserved_max N/A 102110 preloaded_min N/A 100110 preloaded_max N/A 102110
Table 2: Revised schema that supports the pre-allocating operation (2 new columns).
The threshold that triggers the pre-allocation procedure is configurable, but it is usually set at 50%, i.e. when 50% of the integers in the reserved range are used. The main advantage of this pre-allocation procedure is to avoid situations where a process/datacenter runs out of integers to use and has to wait synchronously until the allocation procedure is done, which leads to worse response times on our APIs.