The Right Amount of Randomness: Creating Unique Identifiers
At Image.wiki we need to be able to quickly assign a unique identifier to each image that users upload to our system. Reconciling identical images is an important task but not the subject of this blog — we must treat every image we receive like a separate entity even if the file already exists in our database, as it may be accompanied by unique metadata. This seems like a trivial topic — just find an algorithm to spit out some random numbers! But there are many considerations and no pre-baked solutions out there. And once we start issuing these IDs they’ll become a vital part of our system, referenced in many places and influencing the way we structure our data. This is not something we want to re-do in the future.
So here are the basic requirements of our use-case, roughly ordered by importance.
This is by far the most important feature. If we create the same identifier for more than one image, we can’t return a reliable URL to the user or API client. This is a dreadful situation so we need a solution with a very low probability of collisions for a billion or more records in our use-case. Note: what we’re talking about here is not exactly an issue of randomness. We could choose an algorithm that picks each character in a totally random fashion (i.e.; it utilizes a sufficient entropy source) but if we were to use too few characters (say less than 5), the available permutations would be sure to cause collisions! What we’re talking about here is having uniformity, that is, outputting strings as evenly as possible across the range of possible combinations. In other words we need enough randomness combined with enough character permutations. This is one reason why what works best for us is not a Hash Function. Uniformity is difficult to achieve in a traditional hash function (see this excellent stackoverflow thread) and I didn’t want to worry about avoiding inputs that could lead to a collision. In fact, I didn’t want to worry about inputs at all, which is a good lead into the next requirement.
Maybe it’s my fear of commitment but I couldn’t live with the idea that our ID creation would be inextricably tied to a particular storage solution. Using an auto-incremented primary key or anything from our database would certainly create such a situation. Creating IDs in the application layer completely independently of the database reduces points of failure and allows multiple application threads to exist without being bottlenecked by a single storage system that is tasked with dishing out primary keys. What if we switch away from an RDBMS to a NoSql solution? What if we decide to write to a cache first instead of writing directly to persistent storage? There’s a great blog from Instagram Engineering that covers this topic; they ultimately do use PostgreSQL in their solution in an eloquent way.
We considered a great open source library for encoding integers called Hashids that produces some nice, url-friendly identifiers like those you might find with Bit.ly, Tinyurl or other url-shortening services. But the only way to avoid collisions with this tool would be to use constantly changing integer inputs (such as an auto-incremented primary key). Hashids also has an upper limit: it can’t take integers in the BIGINT range (greater than a billion), at least in this php implementation. Clearly, this wouldn’t work for our use-case.
Fixed and short length
There are plenty of libraries for creating UUIDs, which are 128-bit identifiers that have a higher chance of me spontaneously combusting than producing a collision. But since this ID is likely to be a common lookup parameter, it’s important to keep the length short so the values fit nicely in a cache memory and have the shortest possible indexes. This takes UUIDs or GUUIDs out of the running. They also aren’t easily human readable which is not ideal.
Opacity (safety from “Url hacking”)
If our IDs had any discernible sequential property, it would be an invitation for someone to build a scraper to pull down data en masse (another reason you shouldn’t use auto-incremented IDs without at least obfuscating them). This issue was low on the priority list since we protect against abusive requests in other ways and it’s ultimately impossible to have complete opacity anyways (see Birthday attack). But it was still a consideration.
Simple implementation, fast and not compute-intensive
These things go without saying.
The best solution we found is a randomly generated, base64-encoded string. The ruby implementation we use is SecureRandom.urlsafe_base64 from the native ruby class SecureRandom (previously ActiveSupport::SecureRandom).
It randomly generates the number of bytes of your choice and base64 encodes the results with only url-safe characters.
The code is beautifully concise:
uniqueid = SecureRandom.urlsafe_base64(9).chop
We specify 9 bytes and it produces a 12 character string which we then chop down to 11 characters. Why not just specify 8 bytes to get 11 characters? Well, the Base64 algorithm takes the byte-incremented input and merges all the bits, then it dices up the sequence into 6-bit segments and each of those segments is translated from 64 possible characters. So if the total input bits is not evenly divisible by 6 (which would be the case with 8 bytes) the end of the last segment is padded with the char ‘=’ in order to return full characters. So, we simply create more chars than we need and chop it down so that the 11th char is fully random, otherwise, the permutations of the final string would be cut in half since it would always be ‘==’.
All of the chars it produces are in the ASCII set and should, therefore, be 1 byte each if encoded in UTF-8, keeping it at 88 bits or a nice varchar 11 in a database!
I happen to notice that YouTube’s video IDs are 11 characters that are url-friendly base64 values (A-Z, a-z, 0–9, ‘/’ and ‘+’ replaced by ‘_’ and ‘-‘). I suspect this is how they generate their unique IDs.
Now, what are the chances of a collision here? I’m not awesome at calculating probabilities but here’s a shot at it. First get the possible combinations:
64 possible characters with a string of length 11 so we do 64length:
6411 = 7.3786976e+19 possible combinations.*
So let’s say there are going to be 1 billion records. We need to compute the probability of a repeat event given ~ 73,786,976,000,000,000,000 possible combinations and n=1 billion random choices. This is known as the birthday problem, and with no ambition to get out the scratch paper, I found a calculator online do this for me. The result is:
p = 0.0067533565004800344
So there is a less than 1 percent chance of a collision when we get to 1 billion records. That’s something I can live with. But I think I’ll still have a unique index on this column in the database just in case!
At first, I thought it would be nice if the IDs were an obfuscated composite of multiple fields (like timestamp + some image property + some machine ID) so that I could recreate the IDs if I ever had to. But I couldn’t think of a way to do this without sacrificing some of the important features above. Plus, in the event that I needed to reconstruct my IDs, I would have probably lost some of that composite data as well. More realistically, I thought it would be nice to have some time-sortable component to the IDs so that we could sort by newest (users are more likely to request newer images) without having to access any other data points. But I can always create a cache based on the time stamp in the future and the simplicity I gain right now makes up for any future work I may have to do in this regard.
I want to hear your thoughts.
*Note: 6411 is not really the possible random combination because the randomness comes from the 9 bytes originally created by SecureRandom (before base64 encoding) while the permutations come from the final 11 characters. So one might say we should instead calculate 2569 since each of the bytes has 28 or 256 possibilities. But during base64 encoding the total created bits are merged, diced up into eleven 6 bit segments and each of those segments is translated from 64 possible characters. If anyone can help quantify this into the equation, please let me know.