How to enhance automatic text analysis with entity linking

… and why you should consider Wikipedia as a knowledge base

Frison Clément
asgard.ai
6 min read · Feb 22, 2017


Entity linking is the task of automatically extracting the underlying concepts mentioned in a text and linking them to a knowledge base.

An example of generic entity linking using Wikipedia as a knowledge base
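To make this concrete, here is a toy sketch in Python of what a linker’s output might look like. The `KNOWLEDGE_BASE` mapping and the `link_entities` function are hypothetical stand-ins for illustration; a real system detects mention boundaries and uses context to disambiguate, rather than matching surface strings.

```python
# A toy illustration of entity linking output. KNOWLEDGE_BASE and
# link_entities are hypothetical stand-ins, not a real linker.

KNOWLEDGE_BASE = {
    "yann lecun": "https://en.wikipedia.org/wiki/Yann_LeCun",
    "deep learning": "https://en.wikipedia.org/wiki/Deep_learning",
    "cnn": "https://en.wikipedia.org/wiki/Convolutional_neural_network",
}

def link_entities(text):
    """Return (mention, Wikipedia URL) pairs found in the text.

    A real system detects mention boundaries and disambiguates with
    context; this toy version only scans for known surface forms.
    """
    lowered = text.lower()
    return [(m, url) for m, url in KNOWLEDGE_BASE.items() if m in lowered]

for mention, url in link_entities("Yann LeCun pioneered deep learning with CNNs."):
    print(f"{mention!r} -> {url}")
```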

Google has been doing it for a few years now with ‘named entities’

Google has already embraced this kind of technology. It has been using its homemade Knowledge Graph since 2012 to ‘enhance its search engine’s search results with semantic-search information’.

For example, if you google ‘jackson’, you will get a disambiguation card that helps you refine your search and get better results.

Google disambiguation card for ambiguous keywords search

And if you google ‘jackson albums’, you will first get a list of cards with information about Michael Jackson’s albums, and below that a card about Michael Jackson himself.

Google structured data results for an explicit query

More generally, if your query is explicit enough and the underlying concepts are in its Knowledge Graph, Google displays results as links and structured data. That was one of the first applications to make intensive use of entity linking.

The rise of ‘named entities’ linking APIs

More and more people want to integrate this type of technology into their applications. And with the rise of AI, cloud computing and large-scale data processing tools, it is no longer affordable only to tech giants. That is why commercial entity linking APIs have emerged recently.

However, they mainly focus on extracting ‘named entities’ and miss other concepts. Named entities are typically understood as people, locations, organizations or products.

Let’s go back to our example to better understand the limitations. Existing APIs would only extract the mentions ‘Yann LeCun’ and ‘CNN’. They would miss ‘Deep learning’ and ‘Artificial neural network’. They also would not disambiguate ‘CNN’ to ‘Convolutional neural network’, because it is not a named entity, and they might even link it to the TV channel instead.

Named entities are only a fraction of the concepts that can be extracted from a text
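To see the gap in practice, here is a small example using spaCy, an open-source NLP library with generic pre-trained NER models. The choice of library is illustrative (commercial APIs behave similarly), and the exact labels depend on the model version.

```python
# Illustrating the named-entity-only limitation with a generic NER model.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Yann LeCun pioneered deep learning and popularized the CNN "
          "architecture for image recognition.")

# Typical output: 'Yann LeCun' (PERSON) and 'CNN' (ORG) are detected,
# while 'deep learning' is missed entirely, and nothing tells us whether
# 'CNN' means the network architecture or the TV channel.
for ent in doc.ents:
    print(ent.text, ent.label_)
```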

These commercial APIs are going in the right direction, as they already help people build better applications. For example, they are a perfect tool if you want to build an application that monitors what people are saying about a politician or a specific brand. But they remain limited for more advanced text analysis and more ambitious applications.

For example, at asgard.ai, we built our own expanded entity linking system. Our main goal is to help people better discover and track technologies and their ecosystems. We enhance this experience by structuring information with the latest breakthroughs in machine learning and natural language processing, presented through intuitive interfaces. With this objective in mind, we are building an AI-powered knowledge graph about technologies and their ecosystems by analyzing a lot of unstructured data (scientific publications, patents, company descriptions, …). When we automatically analyze a text, we want to extract all the underlying concepts and find the relationships between them, in order to extract knowledge from it. That is why we expanded entity linking to all concepts, not just ‘named entities’.

Why expanded entity linking is crucial for relevant text analysis

Human language is not exact. In particular, if you try to analyze and understand texts automatically, you will face a major challenge: how to handle variability and ambiguity.

  • Variability
    People don’t always use the same words to refer to the same concept or entity. Yet it is often useful to retrieve all the documents in a corpus that refer to a specific underlying concept, regardless of the wording people chose.
  • Ambiguity
    Most of the time, people use ambiguous mentions to refer to concepts or entities. They can do so because the surrounding context of a word or phrase contains enough information for the reader to disambiguate it and understand the underlying concept. In linguistics, these hidden pieces of information are called contextual cues. (The toy sketch below illustrates both problems.)
How generic entity linking can help to fix variability and ambiguity in text analysis
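Here is a toy Python sketch of both problems. Every table and the cue-overlap heuristic are invented for illustration; real linkers learn surface forms and contextual cues from data.

```python
# Variability: many surface forms map to one canonical concept.
# (All tables below are made up for illustration.)
SURFACE_FORMS = {
    "convnet": "Convolutional neural network",
    "convolutional neural network": "Convolutional neural network",
    "cnn": None,  # ambiguous: needs context to resolve
}

# Ambiguity: one surface form, several candidate concepts, each with
# words that tend to co-occur with it (contextual cues).
CANDIDATES = {
    "cnn": {
        "Convolutional neural network": {"layer", "network", "image", "training"},
        "CNN (TV channel)": {"news", "anchor", "broadcast", "channel"},
    }
}

def resolve(mention, context):
    """Pick the candidate whose cue words overlap most with the context."""
    concept = SURFACE_FORMS.get(mention.lower())
    if concept is not None:
        return concept
    context_words = set(context.lower().split())
    candidates = CANDIDATES[mention.lower()]
    return max(candidates, key=lambda c: len(candidates[c] & context_words))

print(resolve("ConvNet", "we trained a ConvNet on images"))
print(resolve("CNN", "the network has three convolutional layer blocks"))
print(resolve("CNN", "the news anchor appeared on the channel"))
```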

Classic approaches based on keyword or phrase extraction don’t handle these issues directly. To cope, they rely on statistical models called distributional semantic models, which look at the contexts words appear in to compute word and document similarities. There are two main approaches:

  • Context-counting models
    These methods ‘count’ word co-occurrences across the documents of a corpus. This is a way to model latent topics, i.e. distributions of words that are likely to co-occur (e.g. LSA or more advanced topic models like LDA). It is also a way to compute vector representations of words, also called word embeddings (e.g. GloVe). A minimal sketch follows this list.
  • Context-predicting models
    These methods learn a word embedding by training a model to predict a word’s context, i.e. its surrounding words (e.g. word2vec).
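As a concrete illustration of the context-counting family, here is a minimal sketch that builds a word co-occurrence matrix from a four-sentence corpus and compares words by cosine similarity. Real systems work at corpus scale and add weighting or dimensionality reduction (as in LSA or GloVe).

```python
# A minimal context-counting sketch on a tiny made-up corpus.
from collections import Counter
from itertools import combinations

import numpy as np

corpus = [
    "deep learning uses neural networks",
    "neural networks learn representations",
    "the tv channel broadcasts news",
    "news anchors work at the channel",
]

# Count how often two words appear in the same sentence (the 'context').
cooc = Counter()
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}
for sent in corpus:
    for a, b in combinations(set(sent.split()), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

# Each row of the matrix is a crude 'count' vector for one word.
M = np.zeros((len(vocab), len(vocab)))
for (a, b), n in cooc.items():
    M[index[a], index[b]] = n

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# 'neural' ends up closer to 'learning' than to 'news'.
print(cosine(M[index["neural"]], M[index["learning"]]))
print(cosine(M[index["neural"]], M[index["news"]]))
```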

Both approaches work well for ranking results, computing document similarities or performing document classification. But they are not sufficient for advanced use cases such as computing statistics on a very specific data set or extracting knowledge from texts. That is why adding an entity linking component to these approaches is essential: it handles both the variability and the ambiguity of mentions of underlying concepts.

PS: there is no strict dichotomy, as most advanced entity linking components use distributional semantic models in some part of the system.

Why Wikipedia is a good reference to consider

Wikipedia can be an excellent knowledge base for entity linking, as it is:

  • up-to-date
    It is constantly maintained, with ‘10 edits per second, performed by editors from all over the world’. Just look at this page if you are not convinced.
  • accessible
    It is meant to represent general knowledge, and every article begins with a lead section that ‘should be written in a clear, accessible style with a neutral point of view’.
  • structured
    Wikipedia articles belong to categories and portals and are related to other articles through links, which together constitute a graph of articles. Wikipedia has also been used to create the DBpedia project, an open, free and comprehensive knowledge base. (See the sketch after this list.)
  • a key reference
    It is the most popular general reference work on the Internet and it has broad coverage.
  • reliable
    It has proven to be reliable, at least for non-controversial subjects, notably thanks to the identification of reputable third-party sources as citations. If this issue concerns you, I recommend this interesting article from Wired.
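This structure is easy to explore programmatically. The short sketch below uses the public MediaWiki API to fetch a few categories and outgoing links of one article; the article title is just an example.

```python
# Fetch categories and outgoing links of one Wikipedia article via the
# public MediaWiki API. Requires: pip install requests
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "titles": "Convolutional neural network",  # example article
    "prop": "categories|links",
    "cllimit": 10,   # up to 10 categories
    "pllimit": 10,   # up to 10 outgoing article links
    "format": "json",
}

# The response is keyed by page id; take the single page returned.
page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
print("Categories:", [c["title"] for c in page.get("categories", [])])
print("Links to:", [l["title"] for l in page.get("links", [])])
```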

Of course, if your application only deals with a specific domain, you can replace or complement Wikipedia with another, more relevant knowledge base. For example, MeSH is widely used in the health and medical fields.

Example of the concept ‘Dopamine’ in MeSH

Conclusion

At asgard.ai, we believe expanded entity linking is a key component of advanced text analysis. That is why we developed our own system, and in the next articles we will explain how such a system works under the hood.

In the meantime, don’t hesitate to play with our ‘wikification’ demo tool and to give us feedback :)
