Entity Linking: A primary NLP task for Information Extraction

Sundar V · Published in Analytics Vidhya · 6 min read · Sep 14, 2019

I am sure most of you have come across Named Entity Recognition (NER). NER is a fundamental Natural Language Processing (NLP) task with a wide range of use cases. This article is not about NER, but about an NLP task that is closely related to it.

Do you know what Named Entity Linking (NEL) is? How does it help in Information Extraction, the Semantic Web, and many other tasks? If not, don't worry. This article will answer those questions and walk through a basic implementation of NEL.

Before looking into NEL, we will first understand information extraction. According to Wikipedia,

“Information extraction is a task of automatically extracting structured information from unstructured and/or semi-structured documents. In most of the cases, this activity concerns processing human language texts by means of NLP.”

In the information extraction example below, unstructured text data is converted into a structured semantic graph. The broad goal of information extraction is to extract knowledge from unstructured data and use it for various other tasks.

Information Extraction Example [src]

What is Named Entity Linking?

Information extraction comprises multiple sub-tasks. In most cases, the following sub-tasks are performed in order to extract information from unstructured data.

  1. Named Entity Recognition (NER)
  2. Named Entity Linking (NEL)
  3. Relation Extraction

A named entity is a real-world object, such as a person, location, or organization. NER identifies named entity occurrences in text and classifies them into pre-defined categories. NER is typically modeled as the task of assigning a tag to each word in a sentence. Below is an example result from an NER system.

NER Example [src]

NER tells us which words are entities and what their types are. In the above example, NER locates “Sebastian Thrun” and labels it as a person. But we still don't know exactly which “Sebastian Thrun” the text is talking about. NEL is the next sub-task, and it answers this question.
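As a quick illustration, here is a minimal NER sketch using spaCy (one convenient NER library, not something this article depends on; it assumes the en_core_web_sm model has been downloaded). It labels the entity spans and their types, but it says nothing about which real-world Sebastian Thrun or which Google is meant.

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("When Sebastian Thrun started working on self-driving cars "
          "at Google in 2007, few people took him seriously.")

# NER gives us entity spans and their types, but not their identities.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)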

NEL assigns a unique identity to the entities mentioned in the text. In other words, NEL is the task of linking entity mentions in text to their corresponding entries in a knowledge base [1]. The target knowledge base depends on the application, but for open-domain text we can use knowledge bases derived from Wikipedia. In our example, we can find out exactly which “Sebastian Thrun” is meant by linking the entities to DBpedia, a structured knowledge base extracted from Wikipedia. This process of linking entities to Wikipedia is also called Wikification.
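To make the goal concrete, here is a purely illustrative sketch of the kind of mapping an entity linker produces for our example; the exact URIs are assumptions and depend on the knowledge base version.

# Purely illustrative: each mention is resolved to a unique DBpedia URI.
expected_links = {
    "Sebastian Thrun": "http://dbpedia.org/resource/Sebastian_Thrun",
    "Google": "http://dbpedia.org/resource/Google",
}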

NEL Example

NEL is also referred to as Entity Linking, Named Entity Disambiguation (NED), Named Entity Recognition and Disambiguation (NERD), or Named Entity Normalization (NEN). NEL has a wide range of applications beyond Information Extraction: it is used in Information Retrieval, Content Analysis, Intelligent Tagging, Question Answering Systems, Recommender Systems, and more.

NEL also plays a significant role in the Semantic Web, a term coined by Tim Berners-Lee for a web of data that can be processed by machines [5]. A vital issue for the Semantic Web is automatically populating and enriching existing knowledge bases with newly extracted facts, and NEL is considered an essential subtask of knowledge base population [1].

NEL using DBpedia Spotlight

There are many libraries available for NEL, but here we are going to use DBpedia Spotlight, so the target knowledge base is DBpedia. DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs, was developed as a step towards interconnecting the Web of Documents with the Web of Data [3].

DBpedia Spotlight is deployed as a web service, and we can use the provided Spotlight API to perform NEL. You can even check the status of the DBpedia Spotlight server here. Below is a sample Python client that uses the Spotlight API to do NEL.

import requests
from IPython.display import display, HTML

# An API error exception
class APIError(Exception):
    def __init__(self, status):
        self.status = status

    def __str__(self):
        return "APIError: status={}".format(self.status)

# Base URL for the Spotlight API
base_url = "http://api.dbpedia-spotlight.org/en/annotate"
# Parameters
# 'text' - text to be annotated
# 'confidence' - confidence score for linking
params = {"text": "My name is Sundar. I am currently doing Master's in Artificial Intelligence at NUS. I love Natural Language Processing.", "confidence": 0.35}
# Response content type
headers = {'accept': 'text/html'}
# GET request
res = requests.get(base_url, params=params, headers=headers)
if res.status_code != 200:
    # Something went wrong
    raise APIError(res.status_code)
# Display the result as HTML in a Jupyter Notebook
display(HTML(res.text))

Output:

My name is Sundar. I am currently doing Master’s in Artificial Intelligence at NUS. I love Natural Language Processing.

As you can see in the above example, DBpedia Spotlight links the located entities to the DBpedia knowledge base, and we get the annotated text back. Spotlight supports many languages and multiple response content types, including HTML, JSON, XML, and N-Triples. If you are not comfortable with the raw Spotlight API, you can use one of the publicly available wrappers around DBpedia Spotlight's REST interface, such as pyspotlight. For any significant Spotlight usage, it is strongly recommended to run your own server; please follow the installation instructions for running Spotlight on your own server.
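If you want to post-process the links rather than display them, requesting JSON is more convenient. Below is a minimal sketch of the same request with a JSON accept header; the response keys shown here ("Resources", "@surfaceForm", "@URI") reflect the Spotlight response format at the time of writing and may differ across versions.

import requests

base_url = "http://api.dbpedia-spotlight.org/en/annotate"
params = {"text": "My name is Sundar. I am currently doing Master's in Artificial Intelligence at NUS.",
          "confidence": 0.35}
# Ask for JSON instead of HTML
headers = {"accept": "application/json"}

res = requests.get(base_url, params=params, headers=headers)
res.raise_for_status()

# The linked entities are listed under "Resources"; each entry carries the
# matched surface form and the DBpedia URI it was linked to.
for resource in res.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])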

General Approach

NEL is not a trivial task because of the name variation and ambiguity problems. Name variation means that an entity can be mentioned in different ways; for example, the entity Michael Jeffrey Jordan can be referred to by numerous names, such as Michael Jordan, MJ, and Jordan. The ambiguity problem refers to the fact that a name may refer to different entities depending on the context; for example, the name Bulls can apply to more than one entity in Wikipedia, such as the NBA team Chicago Bulls and the football team Belfast Bulls [4].
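To make these two problems concrete, here is a tiny, purely illustrative alias table of the kind an entity linking system might consult: several surface forms point to the same entity (name variation), and one surface form points to several candidate entities (ambiguity).

# Purely illustrative alias table: surface form -> candidate entities.
alias_table = {
    "Michael Jeffrey Jordan": ["Michael_Jordan"],  # name variation:
    "Michael Jordan": ["Michael_Jordan"],          # many names, one entity
    "MJ": ["Michael_Jordan"],
    "Bulls": ["Chicago_Bulls", "Belfast_Bulls"],   # ambiguity: one name, many entities
}

print(alias_table["Bulls"])  # ['Chicago_Bulls', 'Belfast_Bulls']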

In general, a typical entity linking system consists of three modules, namely Candidate Entity Generation, Candidate Entity Ranking, and Unlinkable Mention Prediction [1]. A brief description of each module is given below.

  1. Candidate Entity Generation — In this module, the NEL system aims to retrieve a set of candidate entities by filtering out the irrelevant entities in the knowledge base. The retrieved set contains possible entities that may refer to an entity mention.
  2. Candidate Entity Ranking — Here, different kinds of evidence are leveraged to rank the candidate entities to find the most likely entity for the mention.
  3. Unlinkable Mention Prediction — This module validates whether the top-ranked entity from the previous module is indeed the target entity for the given mention. If not, it returns NIL for the mention; in essence, this module deals with unlinkable mentions.

To know more about each module in detail, please read [1].
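To give a feel for how the three modules fit together, below is a deliberately simplified sketch in Python. The alias table, popularity priors, and context descriptions are made-up toy data, and the scoring is far cruder than what the systems surveyed in [1] use; it only illustrates the generate, rank, and NIL-check flow.

# Toy entity linking pipeline illustrating the three modules.
# All data and scores below are made up for illustration only.

ALIAS_TABLE = {  # surface form -> candidate entities in the knowledge base
    "Bulls": ["Chicago_Bulls", "Belfast_Bulls"],
}
ENTITY_PRIOR = {"Chicago_Bulls": 0.9, "Belfast_Bulls": 0.1}  # popularity prior
ENTITY_CONTEXT = {  # toy bag-of-words descriptions from the knowledge base
    "Chicago_Bulls": {"nba", "basketball", "chicago"},
    "Belfast_Bulls": {"american", "football", "belfast"},
}

def generate_candidates(mention):
    """Module 1: retrieve candidate entities for a mention."""
    return ALIAS_TABLE.get(mention, [])

def rank_candidates(candidates, context_words):
    """Module 2: score candidates by popularity prior plus context overlap."""
    scored = [(ENTITY_PRIOR[e] + len(ENTITY_CONTEXT[e] & context_words), e)
              for e in candidates]
    return sorted(scored, reverse=True)

def link(mention, context_words, nil_threshold=1.0):
    """Module 3: keep the top candidate only if its score clears a threshold."""
    candidates = generate_candidates(mention)
    if not candidates:
        return "NIL"
    best_score, best_entity = rank_candidates(candidates, context_words)[0]
    return best_entity if best_score >= nil_threshold else "NIL"

print(link("Bulls", {"nba", "basketball", "game"}))    # Chicago_Bulls
print(link("Bulls", {"no", "helpful", "context"}))     # NIL (score below threshold)
print(link("Tigers", {"nba", "basketball", "game"}))   # NIL (no candidates)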

Coming back to Spotlight: DBpedia Spotlight uses Apache OpenNLP to identify entity mentions, and disambiguation is performed using the generative probabilistic model from [4]. Please read [2] and [3] to learn more about DBpedia Spotlight's implementation.

NEL is an essential NLP task that deserves more attention. Recently, researchers have started using deep learning techniques to improve the performance of NEL systems on standard datasets [6][7]. I believe the massive amount of Linked Open Data available today provides an incredible opportunity for tomorrow's Artificial Intelligence. Given NEL's role in Information Extraction and the Semantic Web, we need to work more on topics like these.

I hope this article was informative and thought-provoking. Thank you :)

References

[1] Wei Shen, Jianyong Wang, and Jiawei Han, Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions (2014), IEEE Transactions on Knowledge and Data Engineering.

[2] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes, Improving Efficiency and Accuracy in Multilingual Entity Extraction (2013), 9th International Conference on Semantic Systems.

[3] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer, DBpedia spotlight: shedding light on the web of documents (2011), 7th International Conference on Semantic Systems.

[4] Xianpei Han, and Le Sun, A Generative Entity-Mention Model for Linking Entities with Knowledge Base (2011), 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

[5] https://www.scientificamerican.com/article/the-semantic-web/

[6] Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann, End-to-End Neural Entity Linking (2018), CoNLL.

[7] Jonathan Raiman, and Olivier Raiman, DeepType: Multilingual Entity Linking by Neural Type System Evolution (2018), AAAI.
