Random Strings: The Terrible Cost of Friendliness

Richard Marmorstein
Engineering@Livestream
4 min readAug 21, 2017

When you are making random alphanumeric strings, you probably want them to be unique. But sometimes you might also want them to be friendly: maybe a human will have to transcribe them from a mobile device, or maybe a customer will have to dictate them over the phone to a support agent. These goals are opposed. The longer your string, the stronger the guarantee of uniqueness, but the angrier your customers. Similarly, the longer your alphabet, the stronger the uniqueness guarantee, but the more likely a human will mix up the letters. The Arabic numeral ‘0’ is easy to mix up with the letter ‘O’, for instance. Over the phone, the letter ‘C’ sounds an awful lot like the American pronunciation of the letter ‘Z’, and you don’t want your customers to have to resort to ‘alpha bravo charlie’. Eliminating easily confused characters from your alphabet is something to consider.

The longer and more complicated your strings, the angrier your customers.

But at what cost?

If we get rid of zero, capital letters, ‘c’, ‘z’, ‘p’, ‘b’, ‘m’ and ’n’, that leaves 9 digits and 20 letters, for an alphabet of size 29. If we choose to generate random identifiers with a length of 8. That results in a key space of 29⁸, or about 500 trillion. Which means, if you generate 2 identifiers, the probability of them being the same is one in 500 trillion.

Happy Birthday

Usually, though, you don’t generate only 2 identifiers — you generate *n* identifiers. And as *n* increases, the probability that at least one of those identifiers is the same as another grows quite quickly. This observation is called the “birthday problem” (the chance of you having the same birthday as me is tiny — 0.27% — but the chance of two people having the same birthday in a room of 23 people is greater; than 50%). There’s an equation to estimate this probability, but the math to derive it involves, like, Taylor series, and the only thing I remember from Calculus about Taylor series is that I hate them. So, I stole the equation from Wikipedia, and here it is expressed in Javascript:

/**
* @param k — size of alphabet
* @param l — length of identifiers
* @param n — number of generated identifiers
*/
function probCollision (k, l, n) {
// d — total number of possible identifiers
const d = Math.pow(k, l)
return Math.pow((d — 1) / d, n * (n — 1) / 2)
}

Now let’s graph this function for k=29, l=8 (8-letter words from an alphabet of 29 characters)

Yikes! It looks like once you’ve generated a million of these identifiers, it’s more likely than not that you’ve had a collision.

Do you care?

It all depends on your application. If you’re a web service, and a duplicate identifier means that a client gets a 500 error once in a blue moon and has to try again — well, maybe that’s a reasonable price to pay for making the identifiers a little more tractable. But maybe you’re a bank or something, and a duplicate identifier means the collapse of the world economy. Then you might judge this to be an unacceptable risk.

It also depends on the size of your working set. If you are never going to come close to a million actively stored entities, then identifiers with the example parameters are fine. But if your working set is going to be much larger than that, collision will occur more and more commonly under these parameters.

Of course, if you are generating random strings for cryptographic purposes, you will want your strings to be as unique and as unfriendly as possible and this discussion is moot. No sense being friendly to the people trying to hack you! You also want to make sure your random identifiers are actually appropriate for this use case.

Find what’s right for you

Here’s an interactive tool, made by yours truly, to help you find the character set and identifier length that addresses the trade-off in the way that’s best for you and your application. You have no excuse to guess or pick randomly. Understand what’s at stake, and make an informed decision.

Have fun with your random strings!

Are you a unique software engineer? Or a friendly software engineer? Some compromise between the two? Come join our team of friendly, unique engineers at Livestream.

Follow me on Twitter.

--

--

Richard Marmorstein
Engineering@Livestream

An autonomous generalized problem-solving and entertainment system, implemented chiefly in deoxyribonucleic acid, featuring a robust natural language interface.