An Александр by any other name

‘Synonames’ helps us investigate people across languages and alphabets

The OCCRP Team
OCCRP: Unreported
Apr 27, 2020

By Aparna Surendra

A single name can have many equivalents when transliterated across writing systems or represented across cultures. A Russian named Александр might open a U.K. bank account as Aleksandr, while a German Friedrich might introduce himself to Americans as “Fred.”

An Aleph search for ‘alexander’ produces a few synonames.

These variations arise naturally with cross-cultural exchange, without any malicious intent, and the human brain can usually toggle between them effortlessly. But they pose a bigger problem in data-driven research.

Due to the cross-border nature of financial crimes, a person of interest will often travel across countries and cultural contexts, while leaks and source documents routinely span multiple languages. To better support journalists who use Aleph, OCCRP’s suite of data tools, we expanded search results for names to include variants in over 40 languages and five alphabets, drawing on name data from Wikipedia. The resulting data set is called “Synonames,” and it has become an integral part of how we do searches.

In the past, typing “Alexander” into Aleph would not have returned results for the Greek “Αλέξανδρος,” but now it routinely does. Synonames also helps us account for common variants of English names, so we can turn up results for “Thomas” when searching for “Tom,” and “Brad” for “Bradley.”

Crucially, a synoname is not an alias chosen to conceal a true identity, nor is it a nickname specific to an individual (“El Chapo” is not a synoname for “Joaquin,” and “Bo-Jo” is not a synoname for “Boris Johnson”). It’s a name variant that arises naturally and follows prescribed rules of language: Joseph for Giuseppe, or Carlotta for Charlotte.

In the first iteration of the project, we identified the most common names with variant spellings in Wikipedia’s data, which includes 41 languages using Latin, Cyrillic, Greek, Armenian, and Georgian scripts. A future version of Synonames will support additional languages and alphabets.

Like many OCCRP projects, Synonames is an open-source component available to others who are thinking of building a cross-language search engine.

You can see the full list of Synonames here, or simply try searching common names if you already have an Aleph account.

How We Did It

Not many Aleph users are seeking information about Alexander von Sachsen, the 16th-century elector of Saxony, but Wikipedia’s data for his name has helped us connect searches for the Ukraine-born businessman Alexander Rovt with documents that render his name in Ukrainian, “Олександр.”

Here’s how. We started by downloading a full copy of Wikidata, the structured data version of Wikipedia. In Wikidata, the articles from every language version of Wikipedia link together to form entities reflecting the subject of each article. For each of these entities there is name data in every language, entered by Wikipedia contributors. We used this data to extract 29 million names (5.8 million unique words) from the article titles that had been linked to 2.3 million entities about people.

Wikipedia source data for the name Argir

An immediate challenge was that Wikidata stores most names as single strings. The raw data for Alexander von Sachsen (URI Q100737), for example, appeared as “Alexander von Sachsen,” “Alexandre de Saxe,” without splitting the name up into its components. We needed a system that grouped “Alexander” as a possible variant of “Alexandre,” recognizing it as the same name across languages — and not with “Sachsen.”

To do this, we relied on a simple assumption: Synonames sound similar. We converted all 5.8 million name parts to phonetic symbols based on their approximate pronunciation, using a Python implementation of the Metaphone algorithm. Metaphone was designed to work with English words, but it also begrudgingly encodes names transliterated from non-Roman scripts.

Once we had encoded all the Wikidata names phonetically, we could easily group similar-sounding name parts together using what is known as a Levenshtein distance metric.

In this case, “Alexander,” “Alexandre,” “Alessandro,” and “Александр” fell neatly into the same group, while “Sachsen” and “Saxe” fell into another.

Names grouped by phonetics, per entity.
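
For readers who want to see the grouping step in code, here is a minimal sketch. It is not the code we ran: the function names are made up for this example, the grouping is a simple greedy pass, and the phonetic codes are hand-written placeholders rather than real Metaphone output. Only the Levenshtein distance itself is the standard algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def group_by_sound(name_parts, phonetic_key, max_distance=1):
    """Greedily put each name part into the first group that sounds close."""
    groups = []
    for part in name_parts:
        key = phonetic_key(part)
        for group in groups:
            if any(levenshtein(key, phonetic_key(other)) <= max_distance
                   for other in group):
                group.append(part)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([part])
    return groups


# ASSUMPTION: hand-written placeholder codes, not real Metaphone output.
ILLUSTRATIVE_KEYS = {
    "Alexander": "ALKSNTR",
    "Alexandre": "ALKSNTR",
    "Alessandro": "ALSNTR",
    "Sachsen": "SKSN",
    "Saxe": "SKS",
}

parts = ["Alexander", "Sachsen", "Alexandre", "Saxe", "Alessandro"]
groups = group_by_sound(parts, ILLUSTRATIVE_KEYS.get)
print(groups)
```

In a real pipeline the `phonetic_key` argument would be a phonetic encoder instead of a lookup table, and the `max_distance` value would need tuning against real data.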

There are advantages and disadvantages to this phonetic approach. It helps with common compound names like Jean-Pierre, whose parts would otherwise get classified as synonames of each other, and improves our ability to navigate non-standardized name data, like entries for the same person that include different titles or suffixes. And it allows us to avoid over-relying on rules about the structure of names that might not hold up well across cultures, or even within them. (Consider the Wiki name data for Japan’s prime minister, which includes both “Abe Shinzo” and “Shinzo Abe,” since in Japanese culture the family name traditionally comes first. We don’t want “Shinzo” and “Abe” to be considered synonames for each other. Encoding the raw data phonetically ensures that all the variants of “Shinzo” will be collected in one group and all the variants of “Abe” in another.)

The trade-off is that it also discards some synonames that don’t phonetically match: William and Bill, Richard and Dick. (We’re working on this.) On OCCRP’s data team, we’re still debating whether using the phonetics-matching step makes the most sense, or whether a comparison of recurring names across Wiki pages might get us most of the way to the same result.

Anyway, once we had our groupings, we ran a simple calculation to identify co-occurring name pairs (names assigned to the same Wiki entry) and create a master list of Synonames.

From the initial 29 million article titles, we ended up with roughly 20,000 pairs of names. To reduce the risk of false positives, we removed matches that didn’t co-occur on at least 20 different entities.

For example, “Abdulla” and “Abdullah” are identified as a probable pair of synonames in our dataset, but Wikidata does not have 20 “Abdullahs” who are also labelled “Abdulla.” To be conservative, we pruned Abdulla-Abdullah from the final list of synoname pairs. But Alejandro-Alessandro, with 37 matches, made the cut.

Co-occurring names with frequency counts
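
The counting and pruning described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: the input layout (one list of phonetic groups per entity), the function names, and the toy data are all invented for the example, and the demo uses a threshold of 2 so that the tiny sample has something to keep.

```python
from collections import Counter
from itertools import combinations


def count_cooccurring_pairs(entities):
    """Count, for each pair of names, how many entities carry both.

    `entities` is a list with one item per entity; each item is a list of
    phonetic groups, and each group is a list of name parts.
    """
    counts = Counter()
    for groups in entities:
        pairs_on_entity = set()
        for group in groups:
            # Sort so (a, b) and (b, a) are counted as the same pair.
            for pair in combinations(sorted(set(group)), 2):
                pairs_on_entity.add(pair)
        # Each entity contributes at most one count per pair.
        counts.update(pairs_on_entity)
    return counts


def prune_rare_pairs(counts, min_entities=20):
    """Keep only the pairs seen together on at least `min_entities` entities."""
    return {pair: n for pair, n in counts.items() if n >= min_entities}


# ASSUMPTION: toy data invented for this example.
sample_entities = [
    [["Alejandro", "Alessandro"], ["Rossi"]],
    [["Alessandro", "Alejandro"]],
    [["Abdulla", "Abdullah"]],
]

pair_counts = count_cooccurring_pairs(sample_entities)
kept = prune_rare_pairs(pair_counts, min_entities=2)
print(kept)
```

Counting each pair once per entity, rather than once per label, matches the idea of requiring a pair to co-occur on at least 20 different entities.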

Using this threshold allowed us to trim the list of name pairs to around 1,800, yielding a data set with high precision (very few false positives) but low recall (more false negatives). In other words, we rarely identify names as being the same when they are not, but plenty of name pairs that would qualify as Synonames have not yet been identified as such within our data set. We kept the threshold for finding true pairs high to avoid drowning our users in irrelevant search results created by over-eager synonyms.

What’s Next:

An Aleph search for a name on our Synonames list will now quietly expand results with all other known forms of the name, using the synonym token filter in Elasticsearch. The next phase of our work might include the following features:

  • More nicknames: Synonames includes a few common nicknames, but we want to add more — and expand beyond phonetically-similar nicknames to include variants such as “Sasha,” a common nickname for “Alexander” in Slavic countries, or the aforementioned “Dick” for “Richard.”
  • Better prioritization: We hope to incorporate probabilistic methods to predict “good” synonames and rank them over less likely alternatives, depending on context.
  • Smarter threshold parameter: We need to learn more about how widely synonames should be defined.
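
For anyone building something similar, the query expansion mentioned at the start of this section can be sketched as index settings for Elasticsearch's synonym token filter. This is a hedged example, not Aleph's actual configuration: the filter name `synonames`, the analyzer name `name_search`, and the helper function are hypothetical, and the sample group is just the one used earlier in this post.

```python
import json


def build_synonym_settings(name_groups):
    """Turn groups of equivalent names into Elasticsearch analysis settings.

    Each group becomes one comma-separated rule, which the synonym token
    filter treats as a set of equivalent terms.
    """
    rules = [
        ", ".join(sorted(name.lower() for name in group))
        for group in name_groups
    ]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name.
                    "synonames": {"type": "synonym", "synonyms": rules},
                },
                "analyzer": {
                    # Hypothetical analyzer name. Lowercasing runs first,
                    # so the rules are stored in lowercase too.
                    "name_search": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "synonames"],
                    },
                },
            }
        }
    }


settings = build_synonym_settings([["Alexander", "Aleksandr", "Александр"]])
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The resulting dictionary could be sent as the body of an index-creation request. Whether to apply the analyzer at index time or only at search time is a design choice this sketch leaves open.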

Aparna Surendra was a data science fellow with OCCRP’s data team from winter 2019 to spring 2020. Besides working on our investigative projects, she also worked on the ‘Synonames’ project to bring more machine smarts to the Aleph data platform.
