Unique ID Generator in a Distributed System

Aarti
4 min readJul 19, 2023

--

Table of Content:

  • Why do we need unique ids in the distributed system?
  • Methods to generate unique IDs in a Distributed System
  • Conclusion

Why do we need unique ids in the distributed system?

Recently I was reading an article about Uber's system Design, and I came across one interesting fact, there are many actually but let’s focus on why we are here, so basically they have built a system in which they divide the whole earth into tiny cells and assign them a unique id. There must be millions of such tiny cells, I got curious here, how do they manage to assign unique ids to millions of cells or in general how does the system handle such a scenario when there are multiple nodes (databases or servers) involved?
For those who don’t know about distributed systems, a distributed system is a collection of multiple autonomous systems working on a common target and each node is connected through a network to establish a connection between each other, so it becomes important to have a unique identifier for a single entity to perform any sort of operation or to track progress of any request. That’s why it’s become important to have unique ids in such a distributed system.

Methods to generate unique IDs in a Distributed System

  1. AUTO_INCREMENT approach: Most basic one which comes to our mind when we think about the unique id or primary key. These are the integer ids that get incremented by one after each data insertion in the database, in a distributed system where we have multiple servers instead of incrementing these IDs by 1, the system increments it by ‘n’ (no. of nodes).

Small implementation of this approach:
CREATE TABLE demo_table ( id INT AUTO_INCREMENT PRIMARY KEY, name STRING, .....);

Let’s add a few rows to this table
INSERT INTO demo_table (name, ....) VALUES (your_name, .....); → ID - 1
INSERT INTO demo_table (name, ....) VALUES (your_name, .....); → ID - 2

In the distributed system, (here n=2)

However, this strategy has some drawbacks -

a. As shown in the above picture in the distributed system, each node has its own ID but there is a possibility of collision when there is a change in the number of available nodes, IDs do not scale well when new nodes have been added or removed.
b. This simple integer value does not provide any context related to the data it added to the node, this makes it more difficult to understand and analyze the data without extra metadata.

2. UUID (Universal Unique Identifiers): Another approach for generating unique ids is a 128-bit long alphanumeric string stored on each server, which has a very low probability of collision.

According to Wikipedia “After generating one billion UUIDs every second for approximately 100 years, the probability of collision reaches to 50%”

We noticed in the auto_increment approach, it becomes difficult to sync with all nodes in case of a change in the number of nodes but UUID can be generated independently by each node in a distributed system without any coordination with the rest of the system, which is another advantage of using this approach as it makes it very easier to integrate a node with any technologies and languages, allowing distributed system to scale up horizontally.

3. Snowflake Algorithm: Snowflake algorithm is an approach to generate unique ids, but the interesting part is it divides its ID into multiple sections (yes, Divide and Conquer) so what are these sections?

Let me explain to you each section one by one,

  • Timestamp — Pretty straightforward isn’t it? It stores the time of creation of that very ID, which is typically measured in milliseconds. Also, help in ordering IDs or identifying specific IDs based on the time of creation.
  • Machine ID — The machine identifier is given to each machine present in each datacenter. This identifier ensures that even if multiple machines are generating IDs simultaneously, each machine will have its own distinct identifier in the generated IDs.
  • Sequence Number — This section of the IDs stores unique counter values which keep on incrementing by 1 so even if a node generates multiple IDs at the same time from the same machine within the same datacenter we would be able to distinguish between them because of this sequence number.

So, after combining all these sections, the Snowflake algorithm guarantees the uniqueness of the IDs generated across the distributed system.
Twitter also follows this method to generate unique IDs, called Twitter Snowflake.
It’s not necessary that every snowflake implementation is exactly the same, it may vary according to requirements and it also depends on a specific system or framework.

Conclusion

In conclusion, it becomes important for us to know all possible ways and figure out the best one for our use case. Whether it’s for distributed systems, databases, or other scenarios, the choice of a suitable ID generation method can impact system performance, scalability, and data integrity.
Regarding Uber’s system design article that left me with so many questions, I started looking for answers and I stumbled upon these approaches for unique IDs.
Uber System Design: https://medium.com/@narengowda/uber-system-design-8b2bc95e2cfe
https://towardsdatascience.com/ace-the-system-design-interview-uber-lyft-7e4c212734b3

--

--