Record linkage made easy

Iain @routineactivity
2 min read · Mar 20, 2024

Merging multiple datasets and building a view of unique persons in administrative data is a regular obstacle for data professionals. With Splink, that process can be fast, accurate and scalable.

person link network analysis graph, Generated with AI ∙ 20 March 2024

In criminal justice and law enforcement, we often need to collate a single-person view. It may be to understand a person's journey through the system, identify a trajectory or help understand overall risk and harm.

With legacy systems, a person's records may be spread across multiple databases (arrests, intelligence, crimes, stops), and there may be no unique ID attached to that person. Names, dates of birth and other identifiers may be affected by inconsistent data entry (different spellings, exclusion or inclusion of middle names, a miskeyed digit in a date of birth).

Newer systems that claim a ‘golden thread’ by the presence of a unique identifier across databases are not immune from duplicate creation either. Moving addresses, changing names, and providing incorrect information (knowingly or mistakenly) can lead to duplication. Providing false particulars to obfuscate and evade law enforcement sanctions can also lead to problematic data.

I’d dealt with this previously using different string matching techniques: cosine, Jaccard, Soundex and Levenshtein. You can see a short notebook I created here, comparing these methods on Premier League footballer names and ID numbers.

Different string match techniques using PL footballers dataset
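To give a feel for how those four techniques differ, here is a minimal standard-library sketch. It is not the code from the notebook: the function names, the choice of character bigrams for Jaccard and cosine, and the example names are all my own assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def bigrams(s: str) -> list[str]:
    """Lower-case character bigrams, e.g. 'ann' -> ['an', 'nn']."""
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: shared bigrams / all distinct bigrams."""
    set_a, set_b = set(bigrams(a)), set(bigrams(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the two bigram count vectors."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Consonant groups used by American Soundex
_SOUNDEX_CODES = {
    ch: digit
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6"))
    for ch in letters
}


def soundex(name: str) -> str:
    """Phonetic code: first letter plus three digits, zero-padded."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (result + "000")[:4]


# Hypothetical name pairs, not taken from the footballers dataset
for left, right in [("Smith", "Smyth"), ("Jon", "John")]:
    print(left, right,
          round(jaccard(left, right), 2), round(cosine(left, right), 2),
          levenshtein(left, right), soundex(left) == soundex(right))
```

Each measure has a blind spot: Soundex ignores vowels entirely, Levenshtein penalises a missing middle name heavily, and the bigram measures are noisy on very short names. That is part of why a single string metric on its own struggles with messy person data.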

I wasn’t aware of Splink until last year when a colleague introduced it to me. It is nothing short of amazing. The documentation is great and there are a variety of examples that you can walk through over at the Splink GitHub page. It is designed to be fast enough to link 100 million records, and I was pleased to find I could de-dupe and assign unique identifiers to a sample dataset of over 400,000 records in less than 20 seconds.

I’ve made available a short worked example using Splink’s ‘Quick and Dirty Persons’ model which you can find at this link (data used at this link). Apologies, that is a lot of links so far…

Overview

  • 415,987 rows of data on individual jail bookings
  • 266,656 unique persons when matched on full name and date of birth (the data has no unique ID)
  • 239,166 unique persons matched by Splink, each given a unique ID
  • 27,490 duplicates found (266,656 - 239,166)
  • Run time: less than 20 seconds
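The gap between the two person counts can be illustrated with a toy sketch. This is not Splink and not the worked example: the five records and the list of matched pairs are made up, and I am assuming that a linkage model hands back pairs of rows it believes are the same person. The sketch groups those pairs into persons with a small union-find, then compares the result with a count based on full name and date of birth.

```python
# Hypothetical booking rows; the first four are meant to be one person
records = [
    {"row": 0, "name": "john smith", "dob": "1990-01-02"},
    {"row": 1, "name": "john smith", "dob": "1990-01-02"},
    {"row": 2, "name": "jon smith", "dob": "1990-01-02"},     # spelling variant
    {"row": 3, "name": "john a smith", "dob": "1990-01-02"},  # middle initial
    {"row": 4, "name": "mary jones", "dob": "1985-07-30"},
]

# Count of "unique persons" using full name + dob as the key
exact_keys = {(r["name"], r["dob"]) for r in records}

# Assumed output of a linkage model: row pairs scored above some threshold
matched_pairs = [(0, 1), (0, 2), (0, 3)]


def assign_person_ids(n_rows: int, pairs: list[tuple[int, int]]) -> list[int]:
    """Give every row the smallest row number in its linked group (union-find)."""
    parent = list(range(n_rows))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return [find(i) for i in range(n_rows)]


person_ids = assign_person_ids(len(records), matched_pairs)
n_exact = len(exact_keys)          # persons by name + dob key
n_linked = len(set(person_ids))    # persons after linking
duplicates_found = n_exact - n_linked

print(person_ids, n_exact, n_linked, duplicates_found)

# The same subtraction with the figures from the overview
assert 266_656 - 239_166 == 27_490
```

In the toy data the name and date of birth key sees four people where linking finds two, which is the same kind of gap as the 27,490 duplicates above, just at a much smaller scale.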

Iain @routineactivity

Geo specialist. Former community safety, crime & intelligence analyst.