Identifying people — in data

First off: This is *not* an article about how to identify individuals in anonymous or anonymized data. It is about the standard codes and identifiers you can use in data sets or databases that deal with people as an entity type. It is a follow-up to a previous post about the importance of unique identifiers and the associated “Unique identifiers’ cheat sheet”.

Unique identifiers can be important when processing data

There is no global unique id for humans. And until humans evolve to a significantly higher level of civilization, there are many good reasons for not implementing such a scheme. Then again, until humans evolve to a significantly higher level of civilization, there are also many good reasons to assign unique ids and make it hard for people’s identities to be mixed up.

Whatever your position may be on the above, we data nerds often need to work with data on people, and need ways to address the identifier issue.

Some governments issue national identification numbers that are publicly available and can almost without limitation be used for purposes of uniquely identifying people in any database. Here’s an example. If you’re a developer or data analyst working locally in such a market, this can be a blessing. Other countries have similar schemes but are quite rigorous about how they can be used, such as with the Social Security Number in the US. And in yet other countries no such scheme exists at all. You can find a lot of information about the different local national identification number schemes on Wikipedia.

When it comes to international travel the combination of three data points is usually the basis for identification:

  • Full name
  • Date of birth
  • Place of birth

Each of these is somewhat consistent. The name is the only one that logically can change over the course of a lifetime, but there are still many different ways to write both dates and place names, so those are not without problems either. Nevertheless, the idea is that the combination of all three provides a somewhat unique identifier.

But this is far from a safe assumption. Take a person with a common name born in a big city in a populous country, and rest assured that they are not alone.
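
To make that concrete, here is a minimal Python sketch of how such a three-part key might be built. The function names (`normalize_text`, `person_key`) and the sample person are made up for illustration, and the normalization rules (strip accents, ignore case, collapse whitespace) are assumptions, not a standard.

```python
import unicodedata
from datetime import date


def normalize_text(value: str) -> str:
    """Strip accents, ignore case and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def person_key(full_name: str, born: date, birthplace: str) -> tuple[str, str, str]:
    """Build a comparable (name, date of birth, place of birth) key."""
    # Using a date object and ISO 8601 output sidesteps the many ways
    # of writing a date as text.
    return (normalize_text(full_name), born.isoformat(), normalize_text(birthplace))


# Two spellings of the same (made-up) person produce the same key.
a = person_key("José  García", date(1980, 5, 17), "México City")
b = person_key("jose garcia", date(1980, 5, 17), "Mexico  City")
same_key = a == b
```

Note that the key only smooths over differences in spelling. Two different people who share a name, a birthday and a birthplace would still end up with the same key, which is exactly the problem described above.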

In the Philippines parents go to great lengths to come up with unique names for their children to save them from trouble later in life — which apparently is how Philippine Senator Joker Arroyo got his name. Or former Vice President Jejomar Binay, whose name is composed of the first few letters of Jesus, Joseph and Mary.

The Philippine name story is fascinating on many levels. From the data standpoint, it shows the value of unique identifiers, how unreliable methods without unique ids can be, and how ingenious people can become when it comes to data hacks. On that last point, just think about dear Bobby Tables. (If you got this one without clicking the link, you already know too much about data to be reading this blog :)

Anyways. If your work is government related, you might run into a project where this combination of name, date of birth and place of birth is involved as the main identifier. Much more common are cases where you have to rely on something else. Email addresses are probably the most common unique identifiers used for people in data work, followed by username, phone number and full name.

These all have their shortcomings.

  • Email addresses usually belong to a single person, but a single person may have multiple email addresses, so you may have multiple entries in your data about the same person without knowing it. People may also deliberately put in unique or straight-out phony addresses in fear of spam, which amplifies the duplication problem and makes it harder to cross reference with other databases.
  • Usernames are unique only to your system, are easily forgotten and cannot be relied on to create linkage to data from other systems.
  • Phone numbers may be office numbers or home numbers that belong to multiple people, and a person may have multiple phone numbers or change numbers over time.
  • Full names are very rarely unique identifiers and may both change (although not very frequently) and be given in different ways: Middle name excluded, middle initial, last name first, with or without special characters, etc.
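
The email duplication problem can be illustrated with a small Python sketch. The record layout and the helper names (`normalize_email`, `group_by_email`) are hypothetical, and treating addresses as case-insensitive is an assumption that holds for most providers in practice rather than a guarantee.

```python
from collections import defaultdict


def normalize_email(address: str) -> str:
    """Trim whitespace and ignore case (an assumption, see above)."""
    return address.strip().casefold()


def group_by_email(records: list[dict]) -> dict[str, list[dict]]:
    """Group records that share the same normalized email address."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_email(record["email"])].append(record)
    return dict(groups)


# Made-up sample data: one person, three entries.
records = [
    {"name": "Ann Example", "email": "Ann@Example.com "},
    {"name": "Example, Ann", "email": "ann@example.com"},
    {"name": "Ann Example", "email": "ann.private@example.org"},
]
groups = group_by_email(records)
```

The first two entries collapse into one group despite the different name formats, but the third stays separate: nothing in the data says that the second address belongs to the same person.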

All things considered, email addresses are usually the best bet if you have a choice, and they have the added benefit of being verifiable: you can confirm that an address exists and is accessible to the person who entered it.

Social media handles may in some cases be a good choice, with many of the same benefits as email addresses.

Personally I am a fan of approaches that allow people to identify themselves in multiple ways, i.e. register more than one email address belonging to them as well as a username and common social media handles. They may then use any of these to identify themselves when they log in or otherwise need to be identified. The benefits of this approach are that you have multiple hooks to link to other data, and that users are less likely to forget how to identify themselves than if they relied on, say, a username alone.
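
One way such a scheme could be sketched in Python is shown below. The `IdentityRegistry` class, its methods and the identifier kinds are all hypothetical names chosen for illustration; a real system would also need verification, persistence and a way to add or remove identifiers later.

```python
import itertools


class IdentityRegistry:
    """Maps any number of (kind, value) identifiers to one internal person id."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._by_identifier = {}

    @staticmethod
    def _key(kind, value):
        # Assumption: identifiers are compared ignoring case and outer whitespace.
        return (kind, value.strip().casefold())

    def register(self, identifiers):
        """Create a person with several identifiers and return the internal id."""
        keys = [self._key(kind, value) for kind, value in identifiers]
        taken = [key for key in keys if key in self._by_identifier]
        if taken:
            raise ValueError(f"identifier already in use: {taken[0]}")
        person_id = next(self._next_id)
        for key in keys:
            self._by_identifier[key] = person_id
        return person_id

    def lookup(self, kind, value):
        """Return the person id for an identifier, or None if it is unknown."""
        return self._by_identifier.get(self._key(kind, value))


registry = IdentityRegistry()
ann = registry.register([
    ("email", "ann@example.com"),
    ("email", "ann.private@example.org"),
    ("username", "ann_e"),
])
found = registry.lookup("email", " Ann.Private@Example.org")
```

Every identifier points at the same internal id, so any of them can be used at login, and each one is a potential hook for linking to other data sets.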

In any case. Short of a biometric id — which is a very expensive and (rightfully) heavily regulated area — no perfect scheme exists for identifying people in data. For each project you will have to pick a scheme that works well for your intents and purposes.

Be careful with data about people in general. In many parts of the world there are strict limitations as to how you can store and what you can do with such data, especially combinations of data from multiple sources.

And most importantly: Missteps and mistakes can cause real people, real pain.
