A core component of the Strise platform is the Entity Linking (EL) module. In this blog post, you will get an idea of how we utilize our knowledge graph to tackle the problem of connecting entities to text.
Let’s begin by defining the problem. First, we need to know what an entity is. In this context, it refers to things such as companies (Apple, Inc.), persons (Michael Jordan), locations (Trondheim), or abstract concepts (bankruptcy). Think of entities as anything that may have a Wikipedia page.
An EL task can be defined as discovering the entities that are mentioned in a text document. The document contains a set of phrases (sub-parts of the document), each of which has a set of candidate entities. The candidate entities compete to be predicted as the correct entity for a phrase. The phrase “Michael Jordan” could have a dozen different candidate entities, each referring to a different person. The EL system’s job is to decide which Michael Jordan is the correct entity for the phrase given the context of the document, or none if they are all incorrect. This process is called disambiguation, as it removes ambiguity from the document’s phrases.
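To make the terminology concrete, here is a minimal sketch of how a document, its phrases, and their candidate entities could be represented. The class names, fields, and entity identifiers are hypothetical and chosen for illustration only; they are not the Strise data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entity:
    entity_id: str  # hypothetical identifier, not a real knowledge graph ID
    label: str


@dataclass
class Phrase:
    text: str
    start: int  # character offset where the phrase begins in the document
    candidates: List[Entity] = field(default_factory=list)


@dataclass
class Document:
    text: str
    phrases: List[Phrase] = field(default_factory=list)


text = "Michael Jordan won six championships with the Bulls."
candidates = [
    Entity("mj_basketball", "Michael Jordan (basketball player)"),
    Entity("mj_professor", "Michael Jordan (professor)"),
]
doc = Document(text, [Phrase("Michael Jordan", text.index("Michael Jordan"), candidates)])

# The output of EL: one entity, or None, per phrase (keyed here by phrase offset).
links: Dict[int, Optional[Entity]] = {doc.phrases[0].start: candidates[0]}
```

Mapping a phrase to None covers the case where every candidate is incorrect.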
To be able to disambiguate a document, we first need to identify the phrases that should be connected to entities. Then, we need to retrieve candidate entities for these phrases. In most EL datasets, however, the phrases and their corresponding candidate entities are given. These steps deserve their own blog posts, so I won’t go into more detail about how that’s done here.
Natural language is complex: there are many ways to express the same thing, and one word can refer to many things, depending on the context. That is, natural languages are both redundant and ambiguous. These traits make text understanding a difficult task for computers. While machines are great at absolute truths, humans are much better at understanding nuances and resolving ambiguity.
Queries like “give me documents about bankruptcies in Norwegian retail companies” would be extremely hard to run against a database filled with raw text. If you chose to do it anyway, you would end up with a wall of if-statements (trust me, I’ve been there). In other words, not a scalable approach. Luckily, EL makes it possible to also store the entities that are mentioned. Thus, we can translate our query to “give me documents mentioning bankruptcy and an entity e, where e has a relationship to the
The Strise Knowledge Graph
As a basis for linking entities, we need a set of entities we want to link to documents. For this purpose, Strise has a knowledge graph consisting of over 40 million entities stored in a flexible format. The entities come from a variety of data sources, such as Wikidata (which can be thought of as a structured version of Wikipedia), company databases, and industry taxonomies. To describe how different entities relate, we connect them with relationships. For instance, Michael Jordan has relations to his spouse, country, and sports teams. The entities act as nodes, and the relations between them are represented as edges.
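As a rough illustration of entities as nodes and relations as edges, the sketch below stores a tiny graph in plain dictionaries. The `KnowledgeGraph` class, the entity IDs, and the relation names are all assumptions made for this example and say nothing about how the Strise knowledge graph is actually stored.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A toy in-memory graph: nodes are entities, edges are labeled relations."""

    def __init__(self):
        self.nodes = {}                # entity id -> properties
        self.edges = defaultdict(set)  # entity id -> {(relation, target id)}

    def add_entity(self, entity_id, **properties):
        self.nodes[entity_id] = properties

    def add_relation(self, source, relation, target):
        # Both endpoints must already exist as nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both entities must be added before relating them")
        self.edges[source].add((relation, target))

    def neighbors(self, entity_id):
        # Targets of all outgoing relations, regardless of relation type.
        return {target for _, target in self.edges.get(entity_id, set())}


kg = KnowledgeGraph()
kg.add_entity("michael_jordan", label="Michael Jordan")
kg.add_entity("united_states", label="United States")
kg.add_entity("chicago_bulls", label="Chicago Bulls")
kg.add_relation("michael_jordan", "country", "united_states")  # hypothetical relation name
kg.add_relation("michael_jordan", "team", "chicago_bulls")      # hypothetical relation name
```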
A Graph-Based Entity Linker
Since our entities are represented as nodes in a graph, it makes sense to utilize this structure during disambiguation. The hypothesis is that related entities tend to occur in the same document: if Apple, Inc. is mentioned in a document, related entities, such as Microsoft, are often mentioned as well.
Inspired by Pershina et al., all candidate entities in a document are put together in a graph structure as nodes. Let’s call this graph a document graph, as it represents only a single document. An undirected edge is added between any pair of nodes that are related in the main knowledge graph. In other words, the document graph is a subgraph of the main knowledge graph, containing only the entities that are candidate entities in the corresponding document.
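The construction of a document graph can be sketched as follows. This is a minimal example under stated assumptions: the function name, the entity IDs, and the representation of the main knowledge graph as a set of related pairs are hypothetical.

```python
from itertools import combinations


def build_document_graph(candidates, related_pairs):
    """Return an undirected adjacency map over the candidate entities only.

    candidates:    iterable of entity ids that are candidates in one document
    related_pairs: set of frozenset({a, b}) pairs related in the main graph
    """
    graph = {entity: set() for entity in candidates}
    for a, b in combinations(sorted(graph), 2):
        if frozenset((a, b)) in related_pairs:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy relations from the main knowledge graph (illustrative only).
related = {
    frozenset(("apple_inc", "microsoft")),
    frozenset(("apple_inc", "steve_jobs")),
}
doc_graph = build_document_graph(["apple_inc", "apple_fruit", "microsoft"], related)
```

Here `steve_jobs` is related to `apple_inc` in the main graph but is left out, since it is not a candidate entity in this document.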
Based on the document graph, we would like to score the entities by how well they “fit in” with the rest. To do this, an algorithm based on PageRank is executed. So-called random walks are initiated N times from every node in the document graph. Each walk is a graph traversal of M steps starting from a node, where a random neighbor is selected in each step. Finally, by looking at all walks, a score is calculated for each entity based on how many times it was visited. If an entity is poorly connected to the rest of the entities in the document graph, it will be visited rarely, and thus receive a low score. On the other hand, an entity with many neighbors in the document graph will be visited often and receive a higher score.
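The random-walk scoring described above can be sketched roughly like this. It is a simplified illustration, not the exact algorithm from Pershina et al. or the Strise implementation: the function name, the default values for N and M, the choice to count only nodes that a walk steps onto (not the start node), and the normalization are all assumptions.

```python
import random
from collections import Counter


def random_walk_scores(graph, walks_per_node=100, steps=5, seed=0):
    """Score each node by its share of visits across all random walks.

    graph: dict mapping a node to the set of its neighbors (undirected).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    visits = Counter()
    for start in sorted(graph):
        for _ in range(walks_per_node):       # N walks from every node
            current = start
            for _ in range(steps):            # M steps per walk
                neighbors = sorted(graph[current])
                if not neighbors:             # isolated node: the walk ends
                    break
                current = rng.choice(neighbors)
                visits[current] += 1
    total = sum(visits.values())
    return {node: (visits[node] / total if total else 0.0) for node in graph}


toy_graph = {
    "apple_inc": {"microsoft", "steve_jobs"},
    "microsoft": {"apple_inc"},
    "steve_jobs": {"apple_inc"},
    "apple_fruit": set(),
}
scores = random_walk_scores(toy_graph)
```

In this toy graph the isolated `apple_fruit` node is never visited and scores zero, while the well-connected `apple_inc` node collects the largest share of visits.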
To densify the document graphs, i.e., add more edges between the entities, we utilize Wikipedia articles: hyperlinks are used to create more edges between the entities. For instance, if basketball’s Wikipedia article has a hyperlink to Michael Jordan (and both entities are contained in the document graph), we’ll put an undirected edge between them.
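A minimal sketch of this densification step, assuming the hyperlinks are available as (source article, target article) pairs of entity IDs. The function name and the input format are hypothetical.

```python
def densify(graph, hyperlinks):
    """Return a copy of the document graph with extra edges from hyperlinks.

    graph:      dict mapping a node to the set of its neighbors (undirected)
    hyperlinks: iterable of (source, target) pairs, meaning the source
                entity's article links to the target entity's article
    """
    dense = {node: set(neighbors) for node, neighbors in graph.items()}
    for source, target in hyperlinks:
        # Only add an edge when both entities are in the document graph.
        if source in dense and target in dense and source != target:
            dense[source].add(target)
            dense[target].add(source)
    return dense


sparse = {"basketball": set(), "michael_jordan": set(), "trondheim": set()}
links = [("basketball", "michael_jordan"), ("basketball", "nba")]
dense = densify(sparse, links)
```

The link to `nba` is ignored because that entity is not part of the document graph, and the input graph is left unmodified.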
Sprinkled with Machine Learning
The graph traversal provides a nice metric for the “relatedness” of an entity. However, we still have some information that can be utilized to better predict the correct candidate entity for a phrase. For instance, we can calculate the probability of a phrase p referring to an entity e, P(e|p), by analyzing Wikipedia links. By counting hyperlinks in Wikipedia articles, we know that the phrase “Apple” refers to the company most of the time.
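One simple way to estimate P(e|p) is to divide the number of links with anchor text p that point to e by the total number of links with anchor text p. The sketch below does exactly that; the function name, the input format, and the link counts are made up for illustration and are not real Wikipedia statistics.

```python
from collections import Counter, defaultdict


def link_probabilities(anchor_links):
    """Estimate P(entity | phrase) from (anchor text, linked entity) pairs."""
    counts = defaultdict(Counter)
    for phrase, entity in anchor_links:
        counts[phrase][entity] += 1
    return {
        phrase: {
            entity: n / sum(entity_counts.values())
            for entity, n in entity_counts.items()
        }
        for phrase, entity_counts in counts.items()
    }


# Made-up link counts: 8 links to the company, 2 to the fruit.
observed = [("Apple", "apple_inc")] * 8 + [("Apple", "apple_fruit")] * 2
p_entity_given_phrase = link_probabilities(observed)
```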
Additionally, we can estimate the probability of an arbitrary entity e being referred to, P(e), by utilizing the article visit counts from Wikipedia. While the river called “Apple” has about 300 monthly page visits, the company has over a million. This tells us that the company is more popular than the river, and thus more likely to be referred to.
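As a sketch, a popularity prior can be obtained by normalizing page visit counts over a set of entities. The function name and entity IDs are hypothetical, the two numbers are the rounded figures from the paragraph above, and normalizing over only these two entities is a simplification made for the example.

```python
def entity_priors(page_views):
    """Turn monthly page visit counts into a prior P(entity) that sums to 1."""
    total = sum(page_views.values())
    if total == 0:
        return {entity: 0.0 for entity in page_views}
    return {entity: views / total for entity, views in page_views.items()}


# Rounded, illustrative monthly visit counts.
priors = entity_priors({"apple_inc": 1_000_000, "apple_river": 300})
```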
These, and some other similar features, are fed into a machine learning model. During disambiguation, the score from the model is multiplied by the score from the graph traversal, and a combined score for each candidate is obtained. Finally, for each phrase, we select the candidate entity with the highest score. It should be noted that in some cases, we don’t select any candidate entity at all. In practice, we are not always able to retrieve the correct entity, either because we lack the entity in our knowledge graph, or because we are not able to find it using the provided phrase.
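The final selection step might look something like the sketch below. The model itself is left out and replaced by precomputed scores. The function name and the `min_score` threshold for returning no entity are assumptions; the post does not say how the "no entity" decision is actually made.

```python
from typing import Dict, Optional


def select_entity(
    model_scores: Dict[str, float],
    graph_scores: Dict[str, float],
    min_score: float = 0.0,  # hypothetical cutoff for selecting no entity
) -> Optional[str]:
    """Pick the candidate with the highest combined score, or None."""
    combined = {
        entity: model_scores[entity] * graph_scores.get(entity, 0.0)
        for entity in model_scores
    }
    if not combined:
        return None
    best = max(combined, key=combined.get)
    return best if combined[best] > min_score else None


model = {"apple_inc": 0.9, "apple_fruit": 0.6}   # illustrative model scores
walks = {"apple_inc": 0.5, "apple_fruit": 0.0}   # illustrative graph scores
chosen = select_entity(model, walks)
rejected = select_entity(model, walks, min_score=0.5)
```

With the default threshold the company is selected; with a stricter threshold of 0.5 its combined score of 0.45 is too low and no entity is returned.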
While EL in a research context typically uses Wikipedia articles as the basis for entities, Strise takes it to the next level. Our customers need to be able to track anything, not just the things that happen to have a Wikipedia page. The problem of resolving ambiguity becomes a lot harder, as there are more entities that share the same name. If you’ve been around rural Norway, you’re probably going to need all your body parts to count the “Pizza Milano”s you’ve come across.
Additionally, we run the same EL pipeline on documents in many different languages. This, combined with the fact that we need to be able to process tens of documents per second, imposes some constraints on our system. Nevertheless, the system is on par with state-of-the-art scores on the most popular EL dataset, CoNLL. This shows that the way in which entities are represented is of great importance, and is yet another reason to go for knowledge graphs.
Want to help us improve Entity Linking? We’re hiring! Send us an email at email@example.com