When unique IDs are not…unique
Amongst data experts (who often have differing ideas about the naming and structuring of things) there is one thing that I think is almost universally acknowledged: unique identifiers for entities and things are A Good Thing. They’re good for helping people and services who rely on that data keep track of changes over time, and they’re good for helping handle some of the ambiguity and inconsistency that comes from the complexity of our fast-moving modern world.
Before we throw a Hooray For Broad Consensus party, I’m afraid there is a catch. In my continuing attempts to help government build a core infrastructure of authoritative data, I’ve seen a lot of identifiers which are widely considered to be unique but then turn out to be…not unique.
In all of the below, I’ve used totally made up examples. They’re for illustration only, because talking about data concepts in the abstract isn’t helpful for anyone. If there is a real-life example of a pitfall that I know of, I’ve deliberately chosen illustrative examples in an entirely different subject domain (turns out the examples are mostly food based, because I’m hungry as I write this).
I think this is important to point out because one of the strongest consistent themes throughout my time working in this area is that the public sector is full of people doing really great work with data, in really challenging circumstances. One of the things that makes this work challenging is that making some of the big changes often needed comes with an associated risk of exposing the ‘before’, as well as the shiny ‘after’. That can make organisations understandably nervous about making any changes to their data management at all. Undermining the work of others damages us all. I’m a very firm believer in the retrospective prime directive, and it applies here (it applies everywhere, anytime, imo).
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”
— Norm Kerth, Project Retrospectives: A Handbook for Team Review
So, let’s get started. Here are a few common pitfalls that can make unique identifiers a little less useful than we might all hope, and a bit squidgier than we might like for building infrastructure. For each one of these pitfalls, there are multiple real examples of really good practice — I’ll try and collect some of those for a later blog post.
Inconsistency about which changes should result in a brand new identifier
Say you create unique identifiers for ice cream shops. IC317 identifies Scoopy Doopy Doo. After a year of trading, Scoopy Doopy Doo changes its name to Cone with the Whip, but it keeps the same proprietor and occupies the same premises. No real issues there (apart from my terrible puns).
If, later, the proprietor changes — say the shop is bought out by an existing chain — does it keep the same ID, or does it get a new one?
If the proprietor stays the same but the premises change, is that a change that justifies a new ID, or not?
In this example, the inconsistency arises because the ID is implicitly tied to a combination of things: the proprietor or company, the name, and the premises. When only one of those changes, there is no obvious rule for whether the ID should change with it.
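One way to take the guesswork out of this is to write the rule down. The sketch below is a minimal illustration in Python, not a description of any real register: the `Shop` fields, the made-up proprietors and addresses, and above all the choice of `IDENTITY_FIELDS` are assumptions invented for this example. A different dataset could reasonably pick a different policy. The point is only that, once the policy is explicit, 'does this change need a new ID?' gets the same answer every time.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Shop:
    name: str
    proprietor: str
    premises: str


# Hypothetical policy: only a change of proprietor mints a new ID.
# Name and premises are treated as attributes that can change freely.
IDENTITY_FIELDS = {"proprietor"}


def changed_fields(before: Shop, after: Shop) -> set[str]:
    """Return the names of the fields that differ between two versions."""
    return {
        f.name
        for f in fields(Shop)
        if getattr(before, f.name) != getattr(after, f.name)
    }


def needs_new_id(before: Shop, after: Shop) -> bool:
    """Apply the written-down policy to a proposed change."""
    return bool(changed_fields(before, after) & IDENTITY_FIELDS)


original = Shop("Scoopy Doopy Doo", "A. Proprietor", "1 Example Street")
renamed = Shop("Cone with the Whip", "A. Proprietor", "1 Example Street")
bought_out = Shop("Cone with the Whip", "Hypothetical Chain Ltd", "1 Example Street")

print(needs_new_id(original, renamed))    # False: only the name changed
print(needs_new_id(renamed, bought_out))  # True: the proprietor changed
```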
There are other types of inconsistency which come about when unique IDs are an incidental part of the data, rather than a fundamental structural part.
For example, if the company that runs Cone with the Whip folds, and sells the premises to a DIY shop, but keeps their trademark in that name so nobody else can use it later (why wouldn’t they? it’s a clear winner), what happens to IC317?
As much as I’d love to say that in cases like this the ID is always retired, we’ve found that isn’t always the case. Some systems generate IDs in the background and actually rely on the names of things to tell them apart. Systems like that often simply put orphaned IDs back into the pot, to be assigned to the next new thing added to the dataset.
That would mean that a new shop, Whip It!, owned by an unrelated company in an entirely different premises, could end up being ‘uniquely’ identified as IC317 too.
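For contrast, here is a minimal sketch of the alternative: an issuer that retires IDs instead of recycling them. `IdRegistry` and its methods are hypothetical names made up for this illustration, and the simple counter is an assumption, not how any particular system works. The only behaviour that matters is that a retired ID stays on record and is never handed out again.

```python
from __future__ import annotations

import itertools


class IdRegistry:
    """Issues identifiers and never hands the same one out twice."""

    def __init__(self, prefix: str, start: int = 1) -> None:
        self._prefix = prefix
        self._counter = itertools.count(start)
        # Every ID ever issued stays here, mapped to "active" or "retired".
        self._status: dict[str, str] = {}

    def issue(self) -> str:
        """Mint a brand new ID; retired IDs are never reused."""
        new_id = f"{self._prefix}{next(self._counter)}"
        self._status[new_id] = "active"
        return new_id

    def retire(self, identifier: str) -> None:
        """Mark an ID as retired but keep it on record."""
        if identifier not in self._status:
            raise KeyError(f"unknown identifier: {identifier}")
        self._status[identifier] = "retired"

    def status(self, identifier: str) -> str:
        return self._status[identifier]


registry = IdRegistry("IC", start=317)

cone_with_the_whip = registry.issue()  # IC317
registry.retire(cone_with_the_whip)    # the shop folds

whip_it = registry.issue()             # IC318, not a recycled IC317
print(cone_with_the_whip, registry.status(cone_with_the_whip))
print(whip_it, registry.status(whip_it))
```

Anyone still holding IC317 in their own records can look it up and find out it was retired, rather than silently being pointed at an unrelated shop.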
Identifiers which imply a meaning which *could* be there, but isn’t
This is one of my favourites, and I think the best example of it is everybody’s favourite: customer reference numbers.
How often have you submitted a form or a request or a complaint and been issued with a ‘unique reference’? How often have you then dutifully quoted that reference in any further correspondence or communication about that thing, assuming that it has meaning to the person on the other end?
Anecdotally, it often does not. Often, it’s automatically generated by whichever system you unwittingly interacted with during your initial form submission, and is attached (with varying consistency and relevance) to other bits of baggage your request amasses as it moves through various systems and processes.
This one is a shame because these references *could* be meaningful, and I know I have often assumed (as a customer) that they are. I might assume something like:
WEB- to tell them your original communication came through a web form
WEB-171114- to tell them the date of your first contact
WEB-171114-3- to show that you chose option 3 from the list of common issues presented as part of the form submission
WEB-171114-3-A-V-A-K to build up over time to show which teams (indicated by each letter) your query has passed through, so they’ve got a decent gauge of how peeved you’re likely to be if they’re ‘can I just transfer you’ number five.
This is appealing as a user of countless online query forms and similar things, but also as a product manager who likes evidence to help inform decisions. Something like this would tell the team operating the service useful things like the percentage of their queries which come through each of their routes, how long it takes them to solve a query from source, and things like ‘queries about [thing 3] typically pass through 4 teams, including more than 2 visits to team A’.
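To make the appeal concrete, here is a rough sketch of how a reference like that could be read back. Everything in it is an assumption for illustration: the `Reference` class and `parse_reference` function are made up, the parts are taken to be separated by plain hyphens, and the date part is taken to be in YYMMDD order.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Reference:
    channel: str            # e.g. "WEB" for a web form
    first_contact: date     # date of first contact
    issue_option: int       # option chosen from the list of common issues
    teams: tuple[str, ...]  # teams the query has passed through, in order


def parse_reference(ref: str) -> Reference:
    """Split a reference such as WEB-171114-3-A-V-A-K into its parts."""
    channel, raw_date, option, *teams = ref.split("-")
    return Reference(
        channel=channel,
        # Assumed YYMMDD; "%y" maps 17 to the year 2017.
        first_contact=datetime.strptime(raw_date, "%y%m%d").date(),
        issue_option=int(option),
        teams=tuple(teams),
    )


ref = parse_reference("WEB-171114-3-A-V-A-K")
print(ref.channel)              # WEB
print(ref.first_contact)        # 2017-11-14
print(len(ref.teams))           # 4
print(Counter(ref.teams)["A"])  # 2
```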
As a Product Manager, meaningful IDs with a free side of context sound great. But then…
Unique identifiers that do more than uniquely identify things
Oh. I should have known it sounded too good to be true. This is something that happens quite a lot, it turns out, and I can see why given how appealing it sounds.
Unfortunately, it causes other problems. Using something meaningful or laden with context as your ‘unique ID’ makes it really hard to evolve or expand the dataset that the unique ID is part of.
Imagine you have a classification of something, and it changes in big wholesale chunks, once every 10 years. It has more than one level, so there are parent codes and child codes.
10 years is a long time, and you might want to shuffle the groups for the next release so that some of the ‘child’ codes move to new parents.
For example, in 2007, veganism wasn’t nearly as popular as it is now, so you might have had something like:
Parent: FF01 -> vegan products
>> Child: FF01-1 -> milk alternative
>> Child: FF01-2 -> meat alternative
Where FF might even indicate ‘free from’, too. But now it’s 2017, and veganism has increased 360% in 10 years, and the breadth of products available has increased drastically in response.
So now it makes more sense to do something more granular:
Parent: FF01 -> milk alternatives
>> Child: FF01-1 -> soya milk
>> Child: FF01-2 -> oat milk
>> Child: FF01-3 -> hemp milk
… and so on
Parent: FF02 -> meat alternatives
>> Child: FF02-1 -> soya protein ingredient
>> Child: FF02-2 -> soya protein product (inc sausages, chunks, strips)
>> Child: FF02-3 -> gluten protein ingredient
>> Child: FF02-4 -> gluten protein product (inc patties, chunks, strips)
… and so on
Evolving the IDs like this makes it complex for people relying on the old identifiers in their system to map to the new values, and difficult to benefit from the kind of history modelling that can be enabled with more independent identifiers.
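Here is a small sketch of what 'more independent identifiers' could look like for the example above. It is an illustration under assumptions, not a recommended schema: the opaque IDs (`c1` to `c4`), the `CodeAssignment` class and the helper functions are all made up, the codes are written with plain hyphens, and I have assumed that 'milk alternative' in 2007 and 'milk alternatives' in 2017 are the same concept. Each concept keeps one meaningless ID forever, while its human-facing code and its parent are just attributes recorded per release.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeAssignment:
    release: int              # year of the classification release
    code: str                 # human-facing code in that release
    parent_id: Optional[str]  # opaque ID of the parent concept, if any


# Opaque concept ID -> the codes that concept has carried over time.
ASSIGNMENTS: dict[str, list[CodeAssignment]] = {
    # vegan products (no 2017 code appears in the example)
    "c1": [CodeAssignment(2007, "FF01", None)],
    # milk alternatives
    "c2": [CodeAssignment(2007, "FF01-1", "c1"), CodeAssignment(2017, "FF01", None)],
    # meat alternatives
    "c3": [CodeAssignment(2007, "FF01-2", "c1"), CodeAssignment(2017, "FF02", None)],
    # soya milk
    "c4": [CodeAssignment(2017, "FF01-1", "c2")],
}


def concept_for(code: str, release: int) -> str:
    """Resolve a code to a concept; the release is needed to disambiguate."""
    for concept_id, history in ASSIGNMENTS.items():
        for assignment in history:
            if assignment.code == code and assignment.release == release:
                return concept_id
    raise KeyError(f"no concept has code {code} in release {release}")


def code_history(concept_id: str) -> dict[int, str]:
    """Show how one concept's code has changed across releases."""
    return {a.release: a.code for a in ASSIGNMENTS[concept_id]}


print(concept_for("FF01-1", 2007))  # c2 (milk alternatives)
print(concept_for("FF01-1", 2017))  # c4 (soya milk)
print(code_history("c3"))           # {2007: 'FF01-2', 2017: 'FF02'}
```

The same code, FF01-1, points at two different things depending on the release, which is exactly the mapping headache. The opaque ID for meat alternatives never changes, so its history can be followed straight through the reshuffle.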
I know it sounds counter-intuitive to make your unique identifiers less meaningful, which is why I think it’s easier to think of it from the other side: unique identification is a big responsibility with hefty implications. Don’t give your unique identifiers the additional burden of conveying meaning, or context. I’ll give Ron Swanson the last word on this one:
These are just a couple of the things that have struck me during the last 18 months working with data as infrastructure, and I’m sure I’ve missed some (I’d love to hear your suggestions or examples in the comments, but please suitably anonymise if they aren’t your examples).
I’m aware that this might be one of the more contentious areas of data design, so I want to reiterate that the opinions here are mine alone and therefore subject to my favourite product management definition: a strong opinion, softly held. I always have time for constructively and respectfully expressed rebuttals*, so have at it if you’re so inclined.
*I absolutely reserve the right not to respond to destructive or disrespectful challenge, by the way. I’m only human.