Why Did We Shift Away From Database-Generated Ids?

Tugberk Ugurlu · Published in Ingeniously Simple · Dec 21, 2018

Picture source: https://flic.kr/p/dei4BD (licensed under CC BY-ND 2.0)

Give us some context, will ya? 🔭

In our current journey of building a Data Catalog for SQL Server as a team, we are optimizing our engineering efforts for decoupling. There are some specific reasons why this is very important to us, and fundamentally it boils down to two core principles which I hope every software engineering professional would agree with:

  • We don’t want our complexity to grow linearly as we add more functionality to the system, as that would drastically slow us down as we grow, both in delivering business value and in our confidence in that value.
  • We would like to be in a position to accommodate evolving demands and needs, in view of various aspects such as new customer needs, performance, new query models, business model changes, etc. In other words, we would like to be able to swap out any component within our system for a more appropriate one that suits the needs of today, not the past.

Here is a nice quote from the protoactor-go open source project, which I found captures very well what I am trying to say here:

How does this affect persistence? 🤔

With this high-level principle in mind, we don’t want to couple ourselves to a particular database engine for our state persistence. In practical terms, this means that we won’t leak persistence-specific concerns into our domain layer. The main reason we want to achieve this is that our perception of truth today might lead us to rely on a certain database technology such as SQL Server, but there is no guarantee that this choice will satisfy our capability needs in the future.

A specific example of this comes up when we want to provide a truly auditable system. With that requirement, it makes more sense to persist the domain events occurring within the bounded context instead of storing only the current state (a.k.a. event sourcing). This has fundamentally different storage needs.
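As a rough sketch of the event-sourcing idea (all type names here are hypothetical, not our actual model): instead of updating a record in place, every change is appended to the aggregate’s event stream, and the current state is rebuilt by replaying it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical domain events for an asset in the catalog; the store
// only ever appends these, so the full audit trail is preserved.
public abstract record AssetEvent(long AssetId, DateTime OccurredAt);
public sealed record AssetRegistered(long AssetId, string Name, DateTime OccurredAt)
    : AssetEvent(AssetId, OccurredAt);
public sealed record AssetRenamed(long AssetId, string NewName, DateTime OccurredAt)
    : AssetEvent(AssetId, OccurredAt);

public sealed class Asset
{
    public long Id { get; private set; }
    public string Name { get; private set; } = "";

    // Current state is a pure fold over the event history.
    public static Asset FromHistory(IEnumerable<AssetEvent> history)
    {
        var asset = new Asset();
        foreach (var e in history) asset.Apply(e);
        return asset;
    }

    private void Apply(AssetEvent e)
    {
        switch (e)
        {
            case AssetRegistered r: Id = r.AssetId; Name = r.Name; break;
            case AssetRenamed n: Name = n.NewName; break;
        }
    }
}
```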

Luckily, there are widely-known, battle-tested patterns available to solve the problems associated with this, such as Aggregate design in Domain-Driven Design in combination with CQRS. Therefore, the assumption here is that achieving the desired state should be a low-effort exercise for us.

Id Generation on the Server vs. Database 🔬

Many data storage systems such as SQL Server have ways to generate unique identifiers per record (e.g. row, document, etc.). An auto-increment key allows a unique number to be generated when a new record is inserted into a table. So, every time we want to create a new record, the database engine automatically creates a primary key, and it is the engine’s responsibility to ensure that this key is unique within the table.
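To make that division of responsibility concrete, here is a small sketch against SQL Server (the Assets table and connection string are made up for the example, and I’m assuming the Microsoft.Data.SqlClient package): the caller never chooses a key and can only learn it after the insert.

```csharp
using Microsoft.Data.SqlClient;

// Placeholder connection string; substitute your own.
var connectionString =
    "Server=localhost;Database=Catalog;Integrated Security=true;TrustServerCertificate=true";

// Hypothetical table for the example:
//   CREATE TABLE Assets (Id INT IDENTITY(1,1) PRIMARY KEY,
//                        Name NVARCHAR(200) NOT NULL);
using var connection = new SqlConnection(connectionString);
connection.Open();

// OUTPUT INSERTED.Id hands back the key the engine just generated;
// the caller has no say in what it will be.
using var command = new SqlCommand(
    "INSERT INTO Assets (Name) OUTPUT INSERTED.Id VALUES (@name);",
    connection);
command.Parameters.AddWithValue("@name", "Sales database");

var generatedId = (int)command.ExecuteScalar()!;
```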

However, the assumption that the database technology can generate the identifiers for our domain aggregates again ties us to the data storage system. What if we want to swap the data storage system for one which doesn’t have a feature for generating auto-increment primary keys? We can’t. Also, each data storage system has a different way of generating these identifiers, which may leave us with a different primitive type. Besides this, these types of keys may not be suitable for a distributed system. For instance, when we have a table in our SQL Server database with generated primary keys, we don’t have an easy way to scale that table horizontally in a distributed environment.

These concerns can be overcome by letting the consumer of the domain layer (i.e. the transport layer, which communicates with the domain layer through commands and queries) generate the unique aggregate identifiers. This reduces the environment dependency, which in turn means we don’t have to rely on the database for id generation. This approach also has another benefit: it can support distribution. For instance, we can have a table partitioned onto two physical SQL Servers and split the cost of lookups; with an auto-increment key, this wouldn’t work in SQL Server.
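In code, this simply means the aggregate’s constructor demands an identifier instead of leaving a default to be filled in by the database; a minimal sketch with made-up names:

```csharp
using System;

// The aggregate never generates its own identity; whoever drives the
// domain layer must supply one, so no particular database feature is
// assumed and the id is valid before the first round-trip to storage.
public sealed class CatalogItem
{
    public long Id { get; }
    public string Name { get; private set; }

    public CatalogItem(long id, string name)
    {
        if (id <= 0) throw new ArgumentOutOfRangeException(nameof(id));
        Id = id;
        Name = name ?? throw new ArgumentNullException(nameof(name));
    }
}

// e.g. from the transport layer:
//   var item = new CatalogItem(idGenerator.CreateId(), "Sales database");
```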

What have we decided to do? 🚀

Based on these facts, we decided to let the consumer of the domain layer generate the identifiers for the domain aggregates. We represent these identifiers as 64-bit unsigned integers within the domain layer. The domain consumer is free to decide what the representation should be based on its own context (e.g. ASP.NET Core MVC can serialize the identifiers as strings in order to make it easy for its clients to consume the resource objects).
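To illustrate that last point, here is a minimal sketch of how a transport layer could serialize the ids as strings, assuming System.Text.Json as the serializer (the converter name is made up). Serializing as a string matters because JavaScript’s Number type only guarantees 53 bits of integer precision, so raw 64-bit ids could be silently rounded by browser clients.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Writes 64-bit ids as JSON strings so that JavaScript clients,
// whose Number type only holds 53 bits of integer precision, don't
// silently corrupt large ids.
public sealed class IdAsStringConverter : JsonConverter<ulong>
{
    public override ulong Read(
        ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
        => ulong.Parse(reader.GetString()!);

    public override void Write(
        Utf8JsonWriter writer, ulong value, JsonSerializerOptions options)
        => writer.WriteStringValue(value.ToString());
}
```

The converter can then be registered on JsonSerializerOptions (or via a [JsonConverter] attribute on the DTO property), so the domain layer keeps working with plain integers.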

Why 64-bit integer and why not UUID?

Lastly, you might be wondering why 64-bit integers. The main aim here is to be able to generate unique identifiers even across aggregate roots. UUIDs are a very cheap way to do this, considering nearly every platform has a static API for it (e.g. Guid.NewGuid() in .NET). The biggest pain with UUIDs is their cost on the storage system, in terms of both storage size and indexing. These are not big issues for us, though. However, there are already established approaches to generating unique identifiers as more efficient primitive types, such as 64-bit integers, within a distributed system. The Twitter Snowflake algorithm is one of them, and that is largely what led us to choose 64-bit integers over UUIDs. We are using the open source IdGen library from Rob Janssen, which is a Twitter Snowflake-like ID generator for .NET.
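For completeness, basic usage of IdGen looks roughly like this (the generator id of 0 is arbitrary; in a real deployment each node would be configured with its own):

```csharp
using IdGen;

// Each node in the cluster gets a unique generator id (0-1023 with
// the default bit layout) so that ids never collide across machines;
// 0 here is purely for illustration.
var generator = new IdGenerator(0);

// Ids are 64-bit integers composed of a timestamp, the generator id
// and a sequence number, so they are also roughly time-ordered.
long id = generator.CreateId();
```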

Further Resources 📚
