Unique Identifiers for Fast Retrieval of Data

Rohit Saboo
Engineering@Nauto
Published in
5 min readJun 10, 2020

Think beyond traditional unique identifiers as keys for fast retrieval of data
by Adam Sowiński and Rohit Saboo

Photo by Twitter: @jankolario on Unsplash

When we store data in tables, we keep a column as a key to allow for fast retrieval of data. Often, one or more attributes of the data are combined to create this key. For instance, take a trip taken by a driver in a Nauto-equipped vehicle. It has the following key attributes: a Nauto device identifier, the start/end time of the trip started, the route driven, and driver information.

As data is collected by the Nauto device and processed, Nauto’s edge-to-cloud platforms continuously updates the trip to which the data belongs. If all of the trip related data arrived in order, a combination of device and trip start time can uniquely identify a trip and can be used as a key.

Unique identifiers

In many cases, we are able to find some set of unchanging attributes in the data that could serve as a unique identifier. On occasion, though, it’s difficult to find such a set. In the case of trips, for example, we assumed that trips-related data arrived in order and therefore used trip start time along with the Nauto device identifier as part of the key. In reality, trips related data rarely arrives in order:

╔═════════════════════════════════════════╦════════════════════════╗
║ Location data arriving at ║ Percentage ║
╠═════════════════════════════════════════╬════════════════════════╣
║ end of trip ║ 46% ║
║ in-between a trip ║ 34% ║
║ beginning of trip ║ 19% ║
║ between two trips causing them to merge ║ 1% ║
╚═════════════════════════════════════════╩════════════════════════╝

As a result, we cannot rely on using trip start time as an identifier as the trip is very likely to grow at the beginning, too, invalidating the existing trip start time.

In such a case we can use GUIDs (Globally Unique Identifiers) or UUIDs (Universally Unique Identifiers) as unique identifiers. These identifiers are automatically generated numbers by the underlying system that are either guaranteed to be unique or have an astronomically low probability of being the same for two different records. In such a case, this new unique identifier serves as the key and we can lookup using these identifiers.

Added cost of unique identifiers

In addition to looking up a record via its identifier, we may also want to lookup through some of its attributes. Therefore, we would want to add a “secondary index” for each of these attributes or combination of attributes that we want to use for a fast lookup. In the case of trips, we often want to find trips that overlap a certain time range.

In some database systems (SQL), there is only a small storage and compute cost related to the identifier column and secondary indexes though the cost associated with updates is not negligible. In some other databases (e.g. Cassandra and Scylla) materialized views may be used, which are entirely replicated tables and add to storage costs, or global indexes may be used, which add a level of cross-node indirection adding to lookup costs. In certain other databases such as bigtable, there’s no provision for secondary indexes or materialized views. Here, one would have to manage these indexes or views oneself.

A specialized identifier

At Nauto, we use Scylla, a Cassandra compatible NoSQL database, as the storage solution for trips for various performance reasons. We often need to look up trips that overlap a given time point. To make such lookups fast, we use the following schema:

CREATE TABLE trips (
// partition keys
device_id text,
// trip start time truncated to day-long buckets
bucket timestamp,
// clustering keys
start_time timestamp,
end_time timestamp,
// non-keyed data
driver text,
path blob,
...
PRIMARY KEY ((device_id, bucket), end_time, start_time)
) WITH CLUSTERING ORDER BY (end_time DESC, start_time DESC)

Using this schema, to query for a trip that overlaps a particular time, we query for a single trip that ends after the specified time (with the appropriate device identifier and bucket). If the trip start of the returned trip happens to be before the specified time, we have a hit; otherwise we know that no such trip exists. This query is fast as it takes advantage of both the partition key and the clustering key.

We already saw that no combination of existing attributes was usable as an identifier for trips, and maintaining UUIDs or GUIDs adds quite a bit of expense. Is there something else that can be done? In order to find such an identifier, we would need to find something that stays constant throughout the time a trip is being built and is unique for the trip.

One way to find such a constant is to find some logical monotonicity in the events involved in building a trip. Note this introduces a second clock, which is not related to the time of the location samples that form the trip, but when these samples were actually used to build the trip.

Let’s look at a sample series of events. The x-axis corresponds to the device time at which the location sample was taken. The y-axis corresponds to the server time (the second clock) when the sample was received by the trip builder and used for starting or extending a trip.

The first location received by the trip builder while building a trip never changes and is a constant. The time of arrival itself is not usable because it is from a different clock (server time) compared to the device clock. However, the device time corresponding to the earliest arriving location sample is from the device clock. This time stamp, let’s call it t*, does not change whether new location points come at the beginning of a trip, end of a trip, or in the middle of a trip. It stays constant for the trip. If we use <device identifier, t*> as an identifier, we can use the same schema and efficient lookup mechanism described above for finding a trip that overlaps a given time point.

When a location point arrives causing a merge between two trips, we pick the earliest of the two time points for the two sub trips to form part of the identifier of the new trip. Thus, at least one of the identifiers continues to stay the same (in a deterministic fashion) and the other identifier is not used anymore. Further, given that the identifier-based lookup uses the mechanism of finding a trip that overlaps the specified time point, and that a merged trip will cover all the time points from the existing two sub trips, using either of the two identifiers for a lookup will retrieve the merged trip.

Conclusion

In conclusion, sometimes even rethinking the most basic of concepts such as a unique identifier, can yield large benefits.

--

--