Who is mentioned where? What is Entity Linking?
Entity linking is the task of automatically identifying which named entities, such as persons, locations, organizations or products, are mentioned in text. For instance, in the sentence “President Obama is married to Michelle” the idea is to automatically understand that the fragment “Obama” refers to the US president Barack Obama and that “Michelle” refers to the First Lady Michelle Obama, and not to any other Obama or Michelle. The task is far from trivial, since a name may refer to multiple entities. Imagine, for example, how many people share the name Michelle.
As a quick immersion into the topic, before reading further, we recommend trying our demo to gain an initial intuition about entity linking. Try our example, or just input a sentence of your own and let the machine discover which entities are mentioned.
Since named entities are likely the most important component in any text, entity linking is at the core of Natural Language Understanding. If machines are to read natural language as humans do, they must unambiguously recognize which named entities are mentioned in any text. After all, there is no story without a named entity. Tweaking a thought by a famous Spanish philosopher a bit (sorry!), it could be said that everything we care about is just entities and their circumstances.
Natural Language Understanding is one of the hottest AI fields at the moment. After many decades of research, entity linking recently graduated from research institutes, evolving into a usable and reliable industrial component. It is being increasingly exploited to provide better applications both to businesses and consumers. Searching, advertising, tagging, data analytics, archiving, translation, digital writing, and digital reading are among everyday applications requiring some sort of entity linking technology.
Both companies and individuals are increasingly aware of the potential of analyzing textual data. Nowadays the great majority of the data generated by humans is in the form of text: news articles, blog posts, research papers, internal company documents, etc. Entity linking is probably the main component for transforming that text into measurable and easily accessible data. However, the task is still not widely known among technologists outside academia, and even less so among the general public. In what follows, we attempt to provide intuitive explanations of the main components of entity linking. We will go over some of its applications in a future blog post.
Entity Linking: Recognize and Disambiguate
Entity linking requires the machine to perform two crucial steps. The first one (called Named Entity Recognition) attempts to recognize which pieces of text correspond to a named entity. The second step (disambiguation) aims at determining exactly which named entity each of those pieces of text refers to. In our sentence “President Obama is married to Michelle”, the first task would be to assert that “Obama” and “Michelle” correspond to named entities and that “President”, “married”, “to” and “is” do not. In the second, the machine needs to determine that “Obama” refers to Barack Obama instead of, for instance, Mount Obama, and that “Michelle” refers to Michelle Obama and not to Michelle Pfeiffer.
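The two steps can be sketched in a few lines of Python. This is only a toy illustration, not how our system or any production system works: the alias table `ALIASES`, the scores in it, and the function names are all made up for the example, and real recognition relies on statistical models rather than a dictionary lookup.

```python
import re

# Hypothetical alias table: name in text -> candidate entities.
# The numbers are invented toy scores, not real statistics.
ALIASES = {
    "Obama": {"Barack Obama": 0.9, "Mount Obama": 0.1},
    "Michelle": {"Michelle Obama": 0.6, "Michelle Pfeiffer": 0.4},
}


def recognize(text):
    """Step 1 (recognition): find the words that may name an entity."""
    mentions = []
    for match in re.finditer(r"\w+", text):
        if match.group() in ALIASES:
            mentions.append((match.group(), match.start(), match.end()))
    return mentions


def disambiguate(mention):
    """Step 2 (disambiguation): pick one candidate for a mention.

    Here we simply take the candidate with the highest toy score.
    """
    candidates = ALIASES[mention[0]]
    return max(candidates, key=candidates.get)


def link(text):
    """Run both steps and return (name in text, chosen entity) pairs."""
    return [(mention[0], disambiguate(mention)) for mention in recognize(text)]


print(link("President Obama is married to Michelle"))
```

Note that “President”, “married”, “to” and “is” never reach the second step, because the first step does not recognize them as names.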
To solve these challenges, the machine needs to understand, first, what a named entity is and, second, how to determine its correct identity.
Named entities can be quickly defined as objects or things that bear a name. People, locations, organizations or products are typically understood as named entities. A named entity always refers to a specific object. Barack Obama, Greece, United Nations, Batman or iPhone 7 are all examples of named entities.
Strictly speaking, a named entity is an entity that can be uniquely identified. In fact, all existing objects, either physical or abstract, can be regarded as named entities. In our sentence, “President Obama is married to Michelle”, “Obama” and “Michelle” are named entities because they refer to specific objects (Barack Obama and Michelle Obama), while “President” is not because it can refer to many (Barack Obama, François Hollande, and all other presidents in the world).
Following this logic, one could argue that the product iPhone 7 is not a named entity because it refers to many distinct iPhone 7 phones. However, for most applications it is useful to treat products as named entities, and most entity linking systems (like ours) do. We will leave the deeper discussion to philosophers.
One important characteristic of named entities is that they are usually categorizable. For instance, Barack Obama or Lionel Messi are persons, United Nations and Manchester United are organizations, iPhone 7 and Ford Focus are products, and Batman and Sherlock Holmes are fictional characters.
Each category may be further subcategorized. For instance, a person can be a president like Barack Obama or a soccer player like Messi, or a product can be a smartphone (iPhone 7) or a car (Ford Focus).
This means that the schema of categories has the form of a tree, starting from a single universal category to which all named entities belong. This category is divided into subcategories, which are in turn divided further. A named entity can be an organization, a product, or a person; a person can be a politician or a sportsman; a sportsman a rugby player or a soccer player; a soccer player a goalkeeper or a midfielder; and so on. This aspect becomes particularly relevant when textual data is analyzed. For example, one could find out which type of person is most prominent in a specific news article: are they soccer players? Are they politicians? Which kind of politicians?
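A category tree like this one is easy to represent in code. The sketch below is a minimal illustration in Python that stores, for each category, its parent; the dictionary `PARENT` and the helper names are our own choices for the example, and a real schema would be far larger.

```python
# Each category points to its parent; the root has no parent.
PARENT = {
    "named entity": None,
    "person": "named entity",
    "organization": "named entity",
    "product": "named entity",
    "politician": "person",
    "sportsman": "person",
    "rugby player": "sportsman",
    "soccer player": "sportsman",
    "goalkeeper": "soccer player",
    "midfielder": "soccer player",
}


def path_to_root(category):
    """Return the category followed by all its ancestors, up to the root."""
    path = []
    while category is not None:
        path.append(category)
        category = PARENT[category]
    return path


def is_a(category, ancestor):
    """True if `category` is `ancestor` or lies somewhere below it."""
    return ancestor in path_to_root(category)


print(path_to_root("goalkeeper"))
print(is_a("midfielder", "person"))
```

With this structure, a question such as “how many of the people in this article are sportsmen?” becomes a matter of calling `is_a` on the category of each linked entity.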
The key difficulty in identifying named entities in text (for humans as for machines) is that a single name may refer to multiple named entities. Entity names tend to be highly ambiguous. Think of common names like Smith, Kim or Müller. Even place names can be highly ambiguous. There is the city of Paris in France, but there are also multiple places named Paris in the United States and Canada. Paris can also refer to persons like Paris Hilton or even to a mythological figure like Paris, prince of Troy (have a look at how many entities can be referred to as Paris in Wikipedia). In “Paris is the most beautiful city in the world”, which Paris is it?
Ambiguity is a crucial aspect of entity linking. In fact, the main difficulty lies in understanding which of the many possible entities should be assigned to a given name in text. Consider the sentence “Jimmy played Kashmir on the stage”. More than 1,800 entities in Wikipedia can be referred to by the name “Jimmy”. This means that the main task of the entity linking system is to decide which one of the many candidate entities should be assigned to a given piece of text. For this reason, entity linking is also referred to as named entity disambiguation.
As you might have already guessed, the key to resolving this ambiguity is context. As a human would do, the machine needs to work out which of the multiple candidates best fits a given context. Consider again the sentence “Jimmy played Kashmir on the stage”; there are multiple cues pointing to the right answer. The words “played” and “stage” suggest that “Jimmy” may refer to a performer (musician, actor, etc.). This already significantly restricts the set of candidates: we now know that “Jimmy” probably does not refer to the former US president Jimmy Carter, but it could still refer to the comedian Jimmy Fallon.
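One simple way to picture this use of context is to count how many words of the sentence also appear in a list of words typically associated with each candidate. The Python sketch below does exactly that. The word lists in `CONTEXT_WORDS` are invented for the example, and plain word overlap is only a stand-in for the much richer context models used in practice.

```python
import re

# Hypothetical words associated with each candidate for "Jimmy".
CONTEXT_WORDS = {
    "Jimmy Carter": {"president", "election", "georgia", "white", "house"},
    "Jimmy Fallon": {"comedian", "show", "stage", "played", "host"},
    "Jimmy Page": {"guitarist", "band", "stage", "played", "song"},
}


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def context_score(entity, sentence):
    """Count the sentence words that are typical for the entity."""
    return len(CONTEXT_WORDS[entity] & tokenize(sentence))


sentence = "Jimmy played Kashmir on the stage"
scores = {entity: context_score(entity, sentence) for entity in CONTEXT_WORDS}
print(scores)
```

In this toy setup Jimmy Carter gets no support from the sentence, while the two performers tie: the context narrows the set of candidates but does not yet single out one of them.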
Another important aspect to exploit is the fact that some entities are strongly related, so the probability of them appearing together is high. In fact, named entities are in practice related to only a few other named entities (think how many friends you have on Facebook compared to the total number of Facebook users). Going back to our example sentence, the word “Kashmir” is also quite ambiguous: it can refer, for instance, to the region of Kashmir in Asia or to Kashmir, the Led Zeppelin song. However, since we know that “Jimmy” could potentially refer to the musician Jimmy Page, guitarist of Led Zeppelin, it is very likely that “Jimmy” in our example sentence refers to Jimmy Page and “Kashmir” to the Led Zeppelin song Kashmir. It makes more sense for Jimmy Page and the song Kashmir to appear together in a sentence than for Jimmy Fallon and the Asian region Kashmir.
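This idea of choosing entities that fit well together can also be sketched in code. The example below tries every combination of candidates for the names in the sentence and keeps the combination with the most related pairs. The candidate lists, the `RELATED` set and the function names are assumptions made for illustration; trying all combinations is only feasible for a handful of names, so real systems rely on smarter search.

```python
from itertools import product

# Hypothetical candidates for each name in the sentence.
CANDIDATES = {
    "Jimmy": ["Jimmy Fallon", "Jimmy Page"],
    "Kashmir": ["Kashmir (region)", "Kashmir (song)"],
}

# Hypothetical pairs of entities known to be strongly related.
RELATED = {
    frozenset({"Jimmy Page", "Kashmir (song)"}),
}


def coherence(assignment):
    """Count how many pairs of chosen entities are related."""
    count = 0
    for i, first in enumerate(assignment):
        for second in assignment[i + 1:]:
            if frozenset({first, second}) in RELATED:
                count += 1
    return count


def best_joint_assignment(names):
    """Try every combination of candidates and keep the most coherent one."""
    options = [CANDIDATES[name] for name in names]
    best = max(product(*options), key=coherence)
    return dict(zip(names, best))


print(best_joint_assignment(["Jimmy", "Kashmir"]))
```

Here the two names are resolved together rather than one at a time, which is what lets “Kashmir” help decide who “Jimmy” is, and the other way around.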
This means that the two main questions the machine needs to answer are how the entities relate to the context and how they relate to each other. To make the correct decisions, the machine needs access to high-quality data about the entities.
A knowledge graph represents the knowledge of the machine, and the quality of this knowledge is crucial. The knowledge graph can be thought of as a computer-based encyclopedia that machines can access to get information about the entities. A knowledge graph is simply a collection of named entities and information about them. Each named entity is assigned a set of names, a set of categories and a set of characteristics that determine its context.
For instance, John Lennon will have “John Lennon”, “John”, “Lennon” and “Lenny” as possible names and will most probably be assigned categories like British musician and pacifist. He will also be linked to other entities such as The Beatles, Liverpool, Yoko Ono, Paul McCartney, George Harrison, Ringo Starr and even Mark Chapman, his murderer. The goal is for the machine to have access to the appropriate information about John Lennon so that it can properly recognize him in text. This information, for instance, should suffice to correctly process the sentence “John talks to Yoko about his music”. The machine knows that “John” is a possible name for John Lennon and it knows that “music” is a good context for him. It also knows that Yoko Ono is strongly related to Lennon. So even though “John” is a very ambiguous name (think how many Johns there are), the information available to the machine is enough to reach the correct answer. As this information gets richer and more accurate, the machine will tend to make better decisions.
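To make this more concrete, here is a minimal Python sketch of what such knowledge graph entries could look like and how they could be used on the sentence above. Everything in it is an assumption made for illustration: the `Entity` fields, the identifiers, the context words, and the John F. Kennedy entry, which we add only so that “John” has a competing candidate.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str
    names: set
    categories: set
    related: set = field(default_factory=set)  # ids of related entities
    context: set = field(default_factory=set)  # words typical for the entity


# A tiny hypothetical knowledge graph, indexed by entity id.
KG = {
    "John_Lennon": Entity(
        "John_Lennon",
        names={"John Lennon", "John", "Lennon", "Lenny"},
        categories={"British musician", "pacifist"},
        related={"The_Beatles", "Yoko_Ono", "Paul_McCartney"},
        context={"music", "song", "band"},
    ),
    "Yoko_Ono": Entity(
        "Yoko_Ono",
        names={"Yoko Ono", "Yoko"},
        categories={"artist"},
        related={"John_Lennon"},
        context={"art", "music"},
    ),
    "John_F_Kennedy": Entity(
        "John_F_Kennedy",
        names={"John F. Kennedy", "John", "JFK"},
        categories={"US president"},
        related={"Jacqueline_Kennedy"},
        context={"president", "politics"},
    ),
}


def score(entity, words):
    """Add up context words and related entities found in the sentence."""
    context_hits = len(entity.context & words)
    relation_hits = sum(
        1 for rid in entity.related if rid in KG and KG[rid].names & words
    )
    return context_hits + relation_hits


def link_name(name, sentence):
    """Return the id of the best-scoring entity that can be called `name`."""
    words = set(sentence.replace(".", "").split())
    candidates = [e for e in KG.values() if name in e.names]
    return max(candidates, key=lambda e: score(e, words)).entity_id


print(link_name("John", "John talks to Yoko about his music"))
```

John Lennon wins because the sentence contains both a context word (“music”) and the name of a related entity (“Yoko”), while the competing candidate finds no support in the sentence.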
It is important to understand that each named entity is assigned a unique identifier in the knowledge graph. This is crucial since two entities may share the same name. The father of president Barack Obama is also named Barack Obama, so to distinguish one from the other each should be assigned a different identifier (e.g., Barack_Obama_2635, Barack_Obama_857). This means that in the sentence “President Obama lives in Washington”, the fragment “Obama” should be linked to the identifier of the US president and not to that of any other Barack Obama.
If you would like to read technical work about entity linking, here are some suggestions:
- Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., and Weikum, G. Robust disambiguation of named entities in text. EMNLP 2011.
- Ferragina, P. and Scaiella, U. TAGME: on-the-fly annotation of short text fragments. CIKM 2010.
- Shen, W., Wang, J., and Han, J. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Trans. Knowl. Data Eng., 27(2), 2015.
If you have not tried our demo yet, we encourage you to do so. Play with the named entities, with the context and with the ambiguity to get a better understanding of entity linking. Let us know if you have any feedback, and enjoy!