Named entity recognition using spatial clustering

Martin Ostrovsky
May 24, 2020


Using spatial clustering algorithms can help us infer the meaning of a query with minimal context


Here’s a simple task: given the sentence “Any good restaurants near Ballard or Fremont?”, figure out what the person is referring to. Where are Ballard and Fremont? What are Ballard and Fremont?

This (relatively) simple task illustrates the challenges of AI and machine learning. How do we train a language model to understand the intent of the query and resolve any entities as accurately as a person would but much faster?

Step by step process

The first step is to identify candidate entities. There are a variety of ways of doing this, from analyzing the grammar of the sentence with a part-of-speech tagger to training a neural network to detect entity candidates based on a pre-tagged corpus. In our simple example, the two entities we’re most concerned with are “Ballard” and “Fremont”. A part-of-speech tagger would tag both of these as proper nouns, which are usually great candidates for named entity recognition.
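To make the candidate-identification step concrete, here is a toy heuristic sketch: treat capitalized words that don’t start the sentence as proper-noun candidates. This is an illustration only; a real system would use a part-of-speech tagger (spaCy and NLTK are common choices) rather than a capitalization rule.

```python
import re

def entity_candidates(sentence):
    """Toy heuristic: capitalized words that don't open the sentence
    are treated as proper-noun candidates. A real pipeline would use
    a part-of-speech tagger here instead."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return [t for i, t in enumerate(tokens) if i > 0 and t[0].isupper()]

print(entity_candidates("Any good restaurants near Ballard or Fremont?"))
# ['Ballard', 'Fremont']
```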

OK, now that we have our candidates — what (or where?) is a Ballard and a Fremont? The use of the word “near” is a hint that Ballard and Fremont are probably locations and not people (e.g. Glen Ballard). If we look up all locations called Ballard and Fremont, we get candidates ranging from Ballard, Utah to Ballard, Ireland and from Fremont, Nebraska to Fremont, Ohio. How do we know which are the right ones?

Spatial clustering to the rescue

If we plotted all of the locations named Ballard and Fremont on a map, we’d see a bunch of dots all over the place. Here is where we have to apply a bit of common sense and some heuristics: assume the query isn’t asking for restaurant recommendations in Utah or Ireland, but about two locations called Ballard and Fremont that are close to one another.

Fortunately, there is an algorithm called DBSCAN which does just that. Given a list of points, it identifies which points are clustered closest to one another. Here’s a good rundown of DBSCAN in practice if Wikipedia is a bit too academic for you. A key difference between DBSCAN and other popular clustering algorithms like k-means is that you don’t have to specify the number of clusters ahead of time. DBSCAN will find however many clusters there happen to be, subject to the constraints you supply. This is tailor-made for our problem.
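Here is a minimal pure-Python sketch of DBSCAN over latitude/longitude points using haversine distance; in practice you’d likely reach for scikit-learn’s `DBSCAN` with `metric='haversine'`. The coordinates below are rough approximations I’ve added for illustration, not figures from the article.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def dbscan(points, eps_km, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, -1 for noise."""
    labels = [None] * len(points)  # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if haversine_km(points[i], points[j]) <= eps_km]
        if len(neighbors) < min_pts:
            labels[i] = -1  # noise (may later be claimed as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:        # noise point becomes a border point
                labels[j] = cluster
            if labels[j] is not None:  # already processed
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(len(points))
                           if haversine_km(points[j], points[k]) <= eps_km]
            if len(j_neighbors) >= min_pts:  # j is a core point: expand
                queue.extend(j_neighbors)
    return labels

# Approximate coordinates for a few Ballards and Fremonts (illustrative):
places = {
    "Ballard, Seattle": (47.668, -122.384),
    "Fremont, Seattle": (47.650, -122.350),
    "Ballard, UT":      (40.289, -109.945),
    "Fremont, NE":      (41.433, -96.498),
    "Fremont, OH":      (41.350, -83.122),
    "Fremont, CA":      (37.548, -121.989),
    "Ballard, CA":      (34.643, -120.141),
}
labels = dbscan(list(places.values()), eps_km=10, min_pts=2)
print(dict(zip(places, labels)))
# Only the two Seattle neighbourhoods share a cluster; the rest are noise.
```

With a 10 km radius and a minimum cluster size of two, only the Seattle pair survives — exactly the “find however many clusters there happen to be” behaviour we want.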

Once you have your list of clusters, sort them by the smallest distance between points within each cluster, and the tightest cluster is your answer.
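The ranking step can be sketched like so: for each cluster, find its tightest pair of points, then pick the cluster whose tightest pair is closest. The cluster labels and coordinates below are hypothetical values I’ve made up for illustration.

```python
import math
from itertools import combinations

def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def best_cluster(clusters):
    """Rank clusters by their tightest pair of points and return the
    label of the winner. `clusters` maps a label to (lat, lon) points."""
    def tightest(points):
        return min(haversine_km(p, q) for p, q in combinations(points, 2))
    return min(clusters, key=lambda label: tightest(clusters[label]))

# Hypothetical clusters: the Seattle neighbourhoods vs the two
# California towns, which are hundreds of kilometres apart.
clusters = {
    "Seattle":    [(47.668, -122.384), (47.650, -122.350)],
    "California": [(34.643, -120.141), (37.548, -121.989)],
}
print(best_cluster(clusters))  # the Seattle cluster wins
```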

Ballards and Fremonts mapped out. Image by author.

To illustrate this in practice: Ballard, California is 423 km from Fremont, California. Ballard the neighbourhood in Seattle is 4.9 km from Fremont, the neighbourhood in Seattle. Our algorithm thus identifies the Seattle neighbourhoods Ballard & Fremont as the most likely entities given the limited context and would return a list of restaurants in the Seattle area.

This relatively simple exercise demonstrates the hidden complexity of machine learning and the great challenges we face in trying to emulate, and surpass, human knowledge.


CEO & Founder at Repustate. Text analytics, machine learning and everything in between.