Affinity Propagation Hybrid Clustering Approach for Named-Entity Recognition

Jocelyn Lu
Intuit Engineering
Mar 29, 2019 · 4 min read

Named-entity recognition (NER) is a common problem with multiple possible solutions, such as Amazon’s AWS Comprehend and the Stanford Named Entity Tagger. However, these solutions are generally best suited for situations in which natural language is used. In this post, I’ll describe a novel method using affinity propagation and clustering to identify people’s names where standard NER may fail.

Challenges in identifying people’s names

In the space of financial data, users are deeply concerned about personal privacy, data security, and compliance. Although most identifiable information is not found in bank transactions, names of individual account holders or payment recipients are often supplied in the description. While a common first and last name is not particularly identifying on its own, it is understandable that users would want this data to be treated carefully.

Out-of-the-box NER solutions don’t perform well on financial transactions due to a multitude of reasons:

  • The grammar of financial transaction descriptions differs from that of everyday English.
  • The order of terms (first name, last name, or address tokens) in a financial transaction can be scrambled or incomplete.
  • Individual names, street names, cities, and state names can be abbreviated, misspelled, or truncated.
  • Proper punctuation and capitalization are often missing.

Out-of-the-box NER solutions may also fail on entities they don’t recognize, such as ethnic names.

Common NER solutions also may not account for evolutions in spelling. Much like a game of telephone, small changes in a sent message can accumulate into large overall error after a few generations. Consider the example below, using the common name “Caitlyn.”

[Figure: Evolution of the name “Caitlyn”]

The name starts reasonably and eventually devolves into “Kviiilyn” (or K-roman numeral eight-lyn). This is obviously an extreme example, but it is a real-life illustration of the evolution of names.
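To make this drift concrete, here is a minimal sketch that quantifies each generation’s change with Levenshtein edit distance, the same notion of similarity used later in this post. The chain of spellings is illustrative:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Each generation is only a small edit away from its parent,
# but the cumulative distance from the original name keeps growing.
chain = ["caitlyn", "kaitlyn", "kaitlynn", "kviiilyn"]
for parent, child in zip(chain, chain[1:]):
    print(parent, "->", child, "step:", edit_distance(parent, child))
print("total drift:", edit_distance(chain[0], chain[-1]))
```

The per-step distances stay small even as the final spelling drifts far from the original, which is exactly why a clustering approach that tolerates incremental variation is useful here.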

Intuit’s solution

It is especially important in the financial and banking space to be able to correctly identify people’s names. The information can be used to more accurately categorize banking transactions, or it can be scrubbed for security purposes.

We take a different approach to NER. Instead of concentrating on tokens or classifier types like conditional random fields, we focus on actively incorporating variation in our named entities, using affinity propagation to identify the names of people within a financial transaction.

We split our methodology into two phases: training and prediction.

Training phase (offline)

  • Create a graph with known entities.
  • Calculate similarity across clusters for each entity (responsibility matrix R). Responsibility updates are sent as
    r(i, k) ← s(i, k) − max over k′ ≠ k of [ a(i, k′) + s(i, k′) ],
    where r(i, k) scores how well candidate exemplar k represents entity i. This uses edit distance as the similarity function s.
  • Determine available clusters for each entity (availability matrix A). Availability updates are sent as
    a(i, k) ← min( 0, r(k, k) + Σ over i′ ∉ {i, k} of max(0, r(i′, k)) ) for i ≠ k, and
    a(k, k) ← Σ over i′ ≠ k of max(0, r(i′, k)).
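A minimal sketch of the training phase, substituting scikit-learn’s `AffinityPropagation` (with a precomputed similarity matrix of negative edit distances) for the message-passing updates above. The name list and damping value are illustrative, not Intuit’s actual data or settings:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

names = ["caitlyn", "caitlin", "katelyn", "kaitlynn", "john", "jon", "jonathan"]

# Similarity = negative edit distance (higher means more alike).
S = np.array([[-edit_distance(a, b) for b in names] for a in names], dtype=float)

ap = AffinityPropagation(affinity="precomputed", damping=0.9, random_state=0).fit(S)
for label, center in enumerate(ap.cluster_centers_indices_):
    members = [n for n, l in zip(names, ap.labels_) if l == label]
    print(f"exemplar {names[center]!r}: {members}")
```

Each resulting cluster is summarized by an exemplar name, which is what makes the later prediction phase a cheap nearest-exemplar lookup.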

Prediction phase (online)

  • Given each unknown entity, find the best available cluster if it exists.
  • Add each entity to its cluster and recalculate the best available cluster for the remaining unknown entities.
  • Add all entities to relevant clusters as next generation graph.
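The prediction steps above can be sketched as a nearest-exemplar lookup: assign an unknown token to the closest cluster exemplar, but only if it is close enough. The exemplar list and the `MAX_DIST` acceptance threshold below are hypothetical:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

exemplars = ["caitlyn", "john", "maria"]  # cluster exemplars from training
MAX_DIST = 2                              # hypothetical acceptance threshold

def best_cluster(token: str, exemplars: list, max_dist: int = MAX_DIST):
    """Return the closest exemplar, or None if no cluster is close enough."""
    dist, exemplar = min((edit_distance(token, e), e) for e in exemplars)
    return exemplar if dist <= max_dist else None

print(best_cluster("kaitlyn", exemplars))  # one edit from "caitlyn": accepted
print(best_cluster("walmart", exemplars))  # no cluster nearby: rejected
```

Tokens that land in a cluster are treated as names; everything else is rejected, and the accepted tokens become the next generation’s graph.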

The results of the prediction phase are fed back into the training phase, so we can accurately predict even future generations of names.

Results

Initial clustered names (training phase)

[Figure: Generation 1 clusters of known names]

Clustered names (prediction phase)

Given a new list of potential entities, we predict which ones are valid names. These are added to the clusters based on their affinity to both the base names and the previously clustered names.

[Figure: Generation 2 clusters, with new predicted names added in bold]

This process can repeat for multiple generations. In addition, once a critical mass of accurate known names is reached, we can retrain from the offline training phase to maintain accuracy and re-optimize the similarity of names within the clusters.

Evaluation of our method

We compared our method against a labeled dataset of 2,000 potential entities, split between 1,500 true individual names and 500 non-names:

[Table: F1-score comparison with the Stanford NER Library]

Our method beats the industry standard!

Note: The majority of names in our dataset were Anglo-Saxon. We believe we can achieve an even greater lift in F1 score over the Stanford NER Library when testing on ethnic and unconventionally spelled names.

This method is not limited to English. Any language with an alphabet over which we can calculate an edit distance can be used to generate our propagation graph.
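Since edit distance operates on characters rather than English words, the similarity function carries over to other alphabets unchanged. A quick illustration with two Cyrillic spellings (the example names are illustrative):

```python
# Python strings are Unicode, so the same edit-distance similarity applies
# to non-Latin alphabets with no changes to the algorithm.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Two common variants of the same Russian name, one substitution apart.
print(edit_distance("наталья", "наталия"))
```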
