Generating UUIDs at scale on the Web

Matthieu Wipliez
Published in Teads Engineering
12 min read · Jul 10, 2020

TL;DR: Can you trust every browser to generate globally unique identifiers at scale? At Teads, we have tried, and the answer is yes, with a few caveats. This article describes the experiments we’ve run and the discoveries we made along the way.

Why we need client-side unique identifiers

Generating unique identifiers is a common need that third-party scripts integrated on Web pages and e-commerce sites have for analytics, marketing purposes, or advertising.

Whenever they are used at a big enough scale, these scripts are almost always loaded from a CDN (Content Delivery Network) to get optimal response times and to reduce the load on origin servers.

This means that scripts cannot be generated on-the-fly. A workaround could be (or used to be) to have the CDN generate a unique identifier and store it in a cookie, except that user privacy legislation like GDPR and ePrivacy directives in Europe or the CCPA in the USA prevents cookies from being set until the user has given their unambiguous consent.

Uniquely identifying advertising experiences

As an online advertising company, Teads gathers and stores data about each ad experience. An ad experience consists of all the events that occur when a user visits a Web page and loads an ad script, starting with initializing a player for the ad, and including requests to the ad server and user actions such as clicks. To classify a set of events as referring to the same experience, we need to be able to identify this experience uniquely, and do so from the very beginning, i.e. before calling an ad server.

Until now, the ad server was generating a unique identifier and was sending it as part of the ad response. This was problematic because events prior to the response did not have an identifier, so you needed to cross-reference pieces of data to find the events that belonged together. Server-side generated identifiers are pretty much guaranteed to be unique, and before touching production systems, we had to make sure that browsers, too, could generate identifiers that are universally unique.

Universally Unique IDentifiers

A UUID (Universally Unique IDentifier, also known as GUID — Globally Unique IDentifier) is a 128-bit value that can be generated by a computer independently, i.e. without communicating with other computers, and is expected to be unique with a very high probability. UUIDs are written as a sequence of hexadecimal digits separated by dashes.

Below is an example of a version 4 UUID as defined by RFC 4122:

Initially conceived for distributed computing as part of the Network Computing System (NCS), UUIDs have been used in many different cases where their properties are useful. On Windows, the use of UUIDs is pervasive as they identify all COM classes (CLSID) and interfaces, and therefore are used by all COM-based Windows APIs and applications, as well as many OS objects such as users, security policies, etc.

As a matter of fact, of the four variants that can be specified, apart from the RFC-compliant variant shown above and the reserved variant, the other two are 1) NCS backward compatibility (most significant bit is 0, digits 0 to 7) and 2) Microsoft Corporation backward compatibility (most significant bits are 110, digits C and D).
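To make the variant and version fields concrete, here is a minimal sketch that reads them from the text form of a UUID. The function name describeUuid is hypothetical. The version is the first digit of the third group, and the variant is encoded in the leading bits of the first digit of the fourth group.

```javascript
// Read the version and variant of a UUID from its canonical text form.
// describeUuid is a hypothetical helper written for illustration.
function describeUuid(uuid) {
  const hex = uuid.replace(/-/g, '').toLowerCase();
  if (!/^[0-9a-f]{32}$/.test(hex)) {
    throw new Error('not a UUID: ' + uuid);
  }
  // The version is the first digit of the third group (13th hex digit).
  const version = parseInt(hex[12], 16);
  // The variant lives in the top bits of the first digit of the fourth group (17th hex digit).
  const digit = parseInt(hex[16], 16);
  let variant;
  if ((digit & 0b1000) === 0) {
    variant = 'NCS'; // 0xxx: digits 0 to 7
  } else if ((digit & 0b1100) === 0b1000) {
    variant = 'RFC 4122'; // 10xx: digits 8 to b
  } else if ((digit & 0b1110) === 0b1100) {
    variant = 'Microsoft'; // 110x: digits c and d
  } else {
    variant = 'reserved'; // 111x: digits e and f
  }
  return { version, variant };
}
```

The version digit is only meaningful when the variant is the RFC 4122 one.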

Other applications of UUIDs include filesystems, for instance in the GUID Partition Table (part of UEFI), or databases where they can be used instead of traditional integers as the primary keys of records. In the context of online advertising, they are frequently used to uniquely identify a user viewing an ad on the Web. For instance, the Interactive Advertising Bureau (IAB) recommends the use of UUID for the IDFA (Identifier for Advertising) / AAID (Google Advertising ID for Android), which uniquely identifies a user on mobile.

Pick your version

UUIDs version 1 and 2 use a combination of the MAC address of the computer that generates the identifier, a timestamp equal to the current time UTC with a 100-nanosecond precision, and a “clock sequence” to disambiguate identifiers within a 100ns period, which can be monotonically incremented or random.

Every device with a network controller is supposed to have a unique 48-bit MAC address, which makes it impossible to have two devices generating the same UUID. However, this is also the weakness of these versions, because it means that such UUIDs can be used to uniquely identify a user in a personal way. Note that this is an issue when UUIDs are generated on a user’s device, but not on a server; for instance, MySQL uses UUID v1.

UUIDs version 3 and 5 are produced by hashing a string (using MD5 for v3 and SHA-1 for v5), and because hashing is deterministic, the output is as unique as the input. This can be useful if you want to use URLs as a unique identifier, but they are not suitable for our purpose.

Finally, in the case of version 4, all bits except the variant and version bits are random, which amounts to 122 random bits. This guarantees that there is no personally identifiable information being carried by these UUIDs. The caveat is that to benefit from the uniqueness and unpredictability guarantees offered by UUID, one should use a cryptographically secure random number generator (CSRNG).

Let’s generate a UUID in the browser

As we have just seen, version 4 UUIDs are the best in our case, provided we have a CSRNG. This immediately rules out good old Math.random, whose implementation is browser-dependent and offers no guarantee with respect to being safe for cryptographic use. In practice, the major browsers use a variant of Xorshift pseudo-random number generators, which are pretty good as pseudorandom number generators (PRNGs) go.

The difference between a CSRNG and a PRNG is that PRNGs use a single seed and are therefore fully deterministic, whereas it is not possible to predict the output of a CSRNG based on previously generated numbers.
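The determinism of a seeded PRNG can be illustrated with a textbook xorshift32 generator. This is a sketch for illustration only: browsers use larger Xorshift variants, and the name makeXorshift32 and the seed value are assumptions.

```javascript
// Textbook xorshift32 generator (shifts 13, 17, 5), for illustration only.
// This is not the exact algorithm behind any browser's Math.random.
function makeXorshift32(seed) {
  let state = seed >>> 0;
  if (state === 0) throw new Error('seed must be non-zero');
  return function next() {
    state ^= state << 13;
    state ^= state >>> 17;
    state ^= state << 5;
    state >>>= 0; // keep the state as an unsigned 32-bit integer
    return state;
  };
}

// Two generators created with the same seed produce the same sequence.
const first = makeXorshift32(2020);
const second = makeXorshift32(2020);
const sequenceA = [first(), first(), first()];
const sequenceB = [second(), second(), second()];
```

Anyone who learns the internal state of such a generator can reproduce every subsequent value, which is why it is unsuitable for generating identifiers that must be unpredictable.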

The Web Cryptography API, or Crypto API, published in 2017, defines a getRandomValues function. According to caniuse, 96.6% of users have a browser that supports Crypto; among our own users, we found that support is close to 99.9%. In other words, the Crypto API is available virtually everywhere (even on fringe devices such as the PS Vita). This is an important consideration: we have 1.5B unique users representing more than a million different OS x browser x browser version x device combinations, so we need to be confident that all users can run our code without any issues.

Generating a 128-bit (16 bytes) random number with the Crypto API is as simple as:

crypto.getRandomValues(new Uint8Array(16))

To turn these random bytes into an RFC-compliant version 4 UUID, one needs to set the variant and version bits, and then convert the bytes to hexadecimal digits separated by dashes.
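A minimal sketch of that conversion could look as follows. The function name bytesToUuidV4 is an assumption, not the name used in our production code; it takes the 16 bytes as an argument so that it can be used with any source of randomness.

```javascript
// Turn 16 bytes into an RFC 4122 version 4 UUID string.
// bytesToUuidV4 is a hypothetical helper written for illustration.
function bytesToUuidV4(bytes) {
  if (bytes.length !== 16) throw new Error('expected 16 bytes');
  const b = Uint8Array.from(bytes); // work on a copy, leave the input untouched
  b[6] = (b[6] & 0x0f) | 0x40; // version: top 4 bits of byte 6 set to 0100
  b[8] = (b[8] & 0x3f) | 0x80; // variant: top 2 bits of byte 8 set to 10
  const hex = Array.from(b, (byte) => byte.toString(16).padStart(2, '0')).join('');
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20)
  ].join('-');
}
```

In a browser, it would be called as bytesToUuidV4(crypto.getRandomValues(new Uint8Array(16))).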

Another possibility is to use the File API in combination with the URL.createObjectURL function to obtain a Blob URL containing a UUID. Support for URL.createObjectURL is similar to Crypto at 99.9%.

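// The Blob URL ends with a UUID, e.g. blob:<origin>/<uuid>
const url = URL.createObjectURL(new Blob())
const uuid = url.substring(url.lastIndexOf('/') + 1)
URL.revokeObjectURL(url) // release the Blob URL once the UUID has been extracted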

The File API does not specify which version of UUID should be used or how it should be generated. In practice, Chromium-based browsers (Chrome and Edge) and WebKit reuse their Crypto implementation to generate random bytes, and then set/clear bits to create a v4 UUID. Firefox calls OS-level functions when they exist (CoCreateGuid on Windows, CFUUIDCreate on macOS), and otherwise falls back to using Crypto like Chromium and WebKit.

Finally, browsers implement Crypto.getRandomValues by relying on the OS either to provide random numbers directly or to gather entropy and then regularly feed it to a PRNG, making it cryptographically secure (CSPRNG).

A word of caution

Our script is integrated on thousands of Web sites that very often include other third-party scripts, and each script has the possibility to redefine/overload most JavaScript functions. We’ve found that some scripts were overloading the Math.random function to always return the same value, and some others were redefining the window.URL property to return the URL of the current page.
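As an illustration, the two symptoms described above can be probed with simple heuristics. This is a sketch with hypothetical helper names, not a reliable defense: a determined script can spoof the text form of a function, and bound functions also report native code.

```javascript
// Heuristic: built-in functions stringify to "function name() { [native code] }".
// This can be spoofed, so treat a positive result as a hint only.
function looksNative(fn) {
  return typeof fn === 'function' &&
    /\{\s*\[native code\]\s*\}\s*$/.test(Function.prototype.toString.call(fn));
}

// Heuristic: a random function that returns the same value several times
// in a row has most likely been replaced by a constant stub.
function returnsConstant(randomFn, samples = 8) {
  const firstValue = randomFn();
  for (let i = 1; i < samples; i++) {
    if (randomFn() !== firstValue) return false;
  }
  return true;
}
```

For example, looksNative(Math.random) is expected to be true on an untouched page, and returnsConstant(Math.random) is expected to be false.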

There are two ways to have a script run in a context that cannot be affected by third-party scripts: iframes and Web Workers. Web Workers are more interesting because they are faster to instantiate, since they only create a new JavaScript execution context, not a full DOM.

Experiments for UUID generation

We implemented a feature to generate a UUID with Crypto (and fall back on Math.random) and send it to our servers, and set up an A/B test. This allowed us to check that Crypto was indeed supported by the majority of browsers and that there were no issues with our code, without impacting the majority of users. We did an A/B test of the feature running in the current frame and the feature running in a Web Worker if possible.

For users that had the “uuid worker” feature activated, we measured that 50% of them had a device that was taking more than 200ms to instantiate a worker. In our case, because we want to generate a UUID first thing in the process, it was not acceptable to introduce such a delay. We then switched to a File API based implementation, using Crypto as a fallback and Math.random as a last resort.
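The resulting fallback chain can be sketched as follows. This is an illustrative reconstruction, not our production code: the names generateUuid and formatV4, the env parameter, and the check of the Blob URL suffix against a UUID pattern are all assumptions.

```javascript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Set the version and variant bits, then format the 16 bytes as a UUID string.
function formatV4(bytes) {
  bytes[6] = (bytes[6] & 0x0f) | 0x40;
  bytes[8] = (bytes[8] & 0x3f) | 0x80;
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
  return hex.replace(/^(.{8})(.{4})(.{4})(.{4})(.{12})$/, '$1-$2-$3-$4-$5');
}

// env is a hypothetical parameter exposing URL, Blob, crypto and Math,
// so that the function can be exercised outside of a browser.
function generateUuid(env = globalThis) {
  // 1. File API: the Blob URL ends with a UUID.
  try {
    const url = env.URL.createObjectURL(new env.Blob());
    env.URL.revokeObjectURL(url);
    const candidate = url.substring(url.lastIndexOf('/') + 1);
    // Guard against an overloaded URL object returning something else.
    if (UUID_PATTERN.test(candidate)) return candidate;
  } catch (e) {
    // File API missing or broken: fall through to Crypto.
  }
  const bytes = new Uint8Array(16);
  try {
    // 2. Crypto API.
    env.crypto.getRandomValues(bytes);
  } catch (e) {
    // 3. Last resort: Math.random, which is not cryptographically secure.
    for (let i = 0; i < 16; i++) {
      bytes[i] = Math.floor(env.Math.random() * 256);
    }
  }
  return formatV4(bytes);
}
```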

Analysis of generated UUIDs

What we found initially was that close to 2 requests per thousand carried a duplicate UUID. This is sobering, to say the least.

The theory says that there’s a 50% chance of having one collision if you generate 1 billion UUIDs per second for 85 years. In our case, we will be generating about 1 billion UUIDs per day, so we should be safe for about 7 million years.
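These figures can be checked with the usual birthday approximation, n ≈ sqrt(2 × N × ln(1 / (1 − p))), where N = 2^122 is the number of possible version 4 UUIDs and p the collision probability. The sketch below uses hypothetical names and a 365.25-day year.

```javascript
// Number of random bits in a version 4 UUID (128 minus 4 version bits and 2 variant bits).
const RANDOM_BITS = 122;
const SECONDS_PER_YEAR = 365.25 * 24 * 3600;

// Birthday approximation: how many identifiers must be drawn at random
// to reach a probability p of at least one collision.
function idsForCollisionProbability(p, bits = RANDOM_BITS) {
  const space = Math.pow(2, bits);
  return Math.sqrt(2 * space * Math.log(1 / (1 - p)));
}

const idsForEvenOdds = idsForCollisionProbability(0.5); // about 2.7e18 identifiers
const yearsAtOneBillionPerSecond = idsForEvenOdds / 1e9 / SECONDS_PER_YEAR; // in the mid 80s
const yearsAtOneBillionPerDay = idsForEvenOdds / 1e9 / 365.25; // a bit over 7 million
```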

Where is this difference coming from?

The difference is that we were looking at duplicated requests instead of colliding identifiers. A duplicate request comes from the same client and is sent one or more times to the server as illustrated below. There can be several reasons for this; we found that the vast majority of those duplicate requests were simply caused by a bug in a third-party script.

Collisions, on the other hand, occur when a given identifier is used by more than one client. In the diagram below, there is a collision between clients 1 and 3 that have both generated the same (red) UUID starting with “0a87341d…”. Remember that theoretically, this is the “once every 7 million years” event when you generate one billion UUIDs per day.

Collisions

After we removed duplicate requests (coming from the same User-Agent, IP address hash, referrer, etc) the number of requests with a colliding UUID was equal to about 2 in 10,000 requests. This is not the whole story though. When looking at the number of identifiers instead, we obtain around 5 non-unique identifiers per million.
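The distinction between duplicate requests and colliding identifiers can be sketched as a grouping of requests by UUID and then by client. The record fields (uuid, userAgent, ipHash, referrer) and the function name analyzeRequests are hypothetical, and our actual analysis was run on a data pipeline rather than in a script like this one.

```javascript
// Each request is a hypothetical record: { uuid, userAgent, ipHash, referrer }.
function analyzeRequests(requests) {
  const clientsByUuid = new Map(); // uuid -> Map(client key -> number of requests)
  for (const r of requests) {
    const clientKey = [r.userAgent, r.ipHash, r.referrer].join('|');
    if (!clientsByUuid.has(r.uuid)) clientsByUuid.set(r.uuid, new Map());
    const clients = clientsByUuid.get(r.uuid);
    clients.set(clientKey, (clients.get(clientKey) || 0) + 1);
  }

  let duplicateRequests = 0; // extra requests sent by the same client
  let collidingUuids = 0;    // identifiers used by more than one client
  let collidingRequests = 0; // deduplicated requests carrying such an identifier
  for (const clients of clientsByUuid.values()) {
    for (const count of clients.values()) duplicateRequests += count - 1;
    if (clients.size > 1) {
      collidingUuids += 1;
      collidingRequests += clients.size;
    }
  }
  return {
    totalUuids: clientsByUuid.size,
    duplicateRequests,
    collidingUuids,
    collidingRequests
  };
}
```

A single identifier shared by thousands of clients counts once in collidingUuids but thousands of times in collidingRequests, which is why the request-level rate is so much higher than the identifier-level rate.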

This is 40 times smaller. This is highly unexpected: when you think of a collision, you picture that 2 very unlucky users have generated the same identifier, but here what we had was, on a single day, hundreds and even thousands of different clients all over the world generating the same UUID. Remember that browsers provide a CSPRNG that is essentially as good as what you can use on a server. What is going on here?

If we take all requests with a colliding UUID, and zoom in on the User-Agent of browsers, this is what we get:

Almost a third of all those requests are generated by Chrome Mobile 41.0. This is quite surprising, as Chrome Mobile 41 is over 5 years old. Another thing these requests have in common is the city from which they originated, based on their IP: almost two thirds are coming from Mountain View. All requests (100%) made by Chrome Mobile 41.0 originate from Mountain View. Can you think of one company with its headquarters there?

We are not alone in observing this: in a question on StackOverflow about UUID generation in the browser, one of the answers mentions Googlebot as the main source of collisions. Googlebot is also mentioned in this issue for having a “fake” Math.random and “new Date()” implementation, or in that issue about duplicate event identifiers. Chrome Mobile 41, hosted in Mountain View, is actually either Googlebot or some other Google service, even if it does not say so. This should no longer be the case, as Google announced in December, 2019 that they would start updating Googlebot to use the latest version of Chrome on desktop and mobile.

But that’s not all. Requests linked to an identifier that was generated in Mountain View represent a whopping 92% of requests with a colliding UUID. The User-Agent breakdown of the browsers generating the remaining 8% of requests looks like this:

EvoPdf, WnvPdf, and HiQPdf are HTML to PDF conversion libraries for .NET, and it is likely that they simply reuse the same identifiers several times when crawling pages with our script on them. Collisions of UUIDs generated by the PS Vita browser seem legitimate (not associated with fraudulent activity) and are likely due to a poor Crypto implementation: no other browser generates a UUID that collides with a UUID generated on PS Vita. It is possible that their Crypto implementation is just a weak PRNG.

Finally, the case of Internet Explorer looks less like it has a poor Crypto implementation, and more like it is being (ab)used by malicious scripts. 75% of requests with a colliding UUID come from 3 ISPs:

  • Nobis Technology Group,
  • PSINet Inc.,
  • and “m247 europe srl” (apparently mislabeled, should be “PrivateInternetAccess”).

A quick search points out that these ISPs provide VPNs or public proxies. Something feels off, and indeed these three ISPs represent only 0.1% of our global traffic, far from the 75% we’re seeing here.

Looking deeper, of the 30,000 times our script is loaded, in 32% of cases, the script cannot contact the ad server because of a network error, and when it can, the server blocks more than 98% of requests for fraud suspicion (checked by DoubleVerify).

Conclusion

The vast majority of browsers (99.9%) provide the APIs needed to generate random (version 4) UUIDs, either with URL.createObjectURL or crypto.getRandomValues. From what we have seen in the source code of major browsers, the implementation of these functions is of a similar quality to what can be found on servers. It is therefore highly surprising that they generate a significant number of collisions with 5 non-unique identifiers per million.

Upon closer look, the APIs are not at fault; rather, these collisions seem to be mainly (92%) due to Googlebot and some other Google-related services. The rest of the collisions (8%) are either coming from a fringe browser (PS Vita), automated browser agents (HTML to PDF converters) or are associated with fraudulent activity, most likely because of man-in-the-middle agents/proxies.

The collision rate of 5 non-unique identifiers per million is acceptable for our use case, especially since we have analyzed its causes. To avoid this “noise” in our system, we are setting up a filter to maintain a set of duplicate UUIDs that are added to a blocklist against which incoming requests are checked.
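Such a filter could be sketched as follows. This is an in-memory illustration under assumptions: the name createUuidFilter, the clientKey argument and the threshold are hypothetical, and a production version would need bounded, shared storage.

```javascript
// Track which clients have used each UUID and blocklist a UUID
// as soon as more than maxClientsPerUuid distinct clients send it.
function createUuidFilter(maxClientsPerUuid = 1) {
  const clientsByUuid = new Map(); // uuid -> Set of client keys seen so far
  const blocklist = new Set();     // UUIDs known to be non-unique
  return {
    // Returns true if the request should be kept, false if it should be dropped.
    accept(uuid, clientKey) {
      if (blocklist.has(uuid)) return false;
      let clients = clientsByUuid.get(uuid);
      if (!clients) {
        clients = new Set();
        clientsByUuid.set(uuid, clients);
      }
      clients.add(clientKey);
      if (clients.size > maxClientsPerUuid) {
        blocklist.add(uuid);
        clientsByUuid.delete(uuid); // no need to track it any longer
        return false;
      }
      return true;
    },
    isBlocked(uuid) {
      return blocklist.has(uuid);
    }
  };
}
```

With the default threshold, repeated requests from the same client are still accepted, while the first request from a second client using the same UUID puts that UUID on the blocklist.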

Acknowledgments

Thanks to everyone who contributed to this article and to the work it describes! First and foremost, Nicolas Crovatti for believing that we could generate unique identifiers in the browser, trusting me to get to the bottom of this, and encouraging me to write this article; Thomas Azemard for helping me analyze the data (especially Chrome Mobile 41 and PS Vita!); my colleagues of the Format team for reviewing my code (special thanks to Benoit Ruiz who reviewed the numerous iterations of it!) and the article; my colleagues in the SSP and Analytics teams for their help to implement this in production (and sorry for all the non-hydrated macros!); and finally Benjamin Davy without whom there would be no article.
