A case study: How to use a knowledge graph for word sense disambiguation using a Wikipedia dump, Neo4j and Elasticsearch - with code
This article describes a set of guidelines for performing word sense disambiguation with a graph. The project is still in progress: in its current state it cannot classify concepts represented by more than one word. The purpose of this article is to show how a search engine and a graph database can be used to study the relationships that hold among objects whose meaning has not yet been identified, but manifests itself in relation to other objects. I have released the code used for my study (Python microservices) on GitHub.
Please fork my project and help me improve the app. I have tested it only with the Italian and English Wikipedia versions; I hope someone can add support for other languages too. If you already know what disambiguation is, you can start reading from the second section. Enjoy the reading!
1. Introduction: What word sense disambiguation is, and why studying the relationships among the words in a sentence is the best approach to understanding their meanings
Word sense disambiguation is the procedure that allows us to identify the meaning of a word that has more than one meaning (a polysemic word) in a sentence. Disambiguating a term is an operation that we speakers perform every day without difficulty. Take a polysemic word like duck:
- my son threw bread crumbs in the pond to attract the ducks
- my son ducks down to dodge a tennis ball
Depending on the words alongside which “ducks” appears in the sentence, we are able to discern the exact meaning that the written form of the word (the grapheme) assumes at that moment. We do this immediately, without even realizing it; yet the same operation is really difficult for a machine to replicate.
This is because the ambiguity of language is not only syntactic, as with a word that has several meanings, nor only morphological, as with the grammatical person in the conjugation of a verb: there is also a properly semantic problem, a problem of connection among the words. A classic example is metonymy, a shift in the meaning of words. For example:
- during the Roman holidays, I ate a special Italian dish
- I drink a glass of milk every day
I didn’t eat a dish or drink a glass, but the contents of the dish and the glass. We therefore often cannot identify the specific meaning of a sentence through a direct conversion of the terms it contains: the sentence’s meaning changes according to the relationships among those terms.
Certainly the ambiguity of language is not something we always manage to resolve: some sentences, some turns of phrase, remain ambiguous. Without disturbing the poets, the prophets and the oracles, we could take as an example the sentence
- I saw a jaguar cross the road
This sentence assumes different meanings according to the context in which it is immersed: jaguar could refer to both the animal and the car, and we are unable to make the meaning of the sentence unambiguous, because the ability to move falls within the semantic area of both entities. We are therefore unable to identify the specific semantic field of the sentence.
To interpret which of the two meanings the sentence intends, I would have to observe the sentences that precede or follow it, that is, link my sentence to other sentences. The best strategy therefore always remains the same: study the relationships among the words that make up a sentence in order to identify a semantic field, a common domain of meaning to which the graphemes refer.
2. Methodology and Resolution Strategy
To find a semantic field we need a dataset that relates entities, and a knowledge graph is the right choice for this kind of research. Once relationships among entities are defined, it becomes possible to understand whether two concepts are related simply by querying the semantic network, suitably modeled as a graph.
Although linked open data projects like DBpedia or Wikidata provide richer information about the relationship between two entities, their completeness is also their limitation. To take advantage of it, we would need to map the relationships among the categories of their ontologies into the libraries that query the graph. Furthermore, relating the interactions among entities to the semantics of written language leads to a problem that cannot be easily solved. For instance:
- the English guardian
English as an adjective could express very different relationships to the grapheme guardian: citizenship, country of production, geographical origin, version of a product, et cetera. Mapping a linguistic expression to all the possible relationships among concepts that it could express can become a heavy task to carry out.
For our purposes the best choice is to create a graph with a single, agnostic relationship among entities. Here Wikipedia comes in handy, allowing the creation of a has_link relationship that simply records an outgoing link from one entity to another. Given the maturity of the Wikimedia Foundation project, active for almost twenty years, we will see that the cross-references among pages are accurate and punctual. Thanks to its dump we do not even have to extract the links from the Wikipedia pages ourselves, as they are already present as a field in the schema of the database.
Since 2014, the Wikimedia Foundation has used Elasticsearch as the search engine of its production back-end infrastructure. A new dump is released every three days for each foundation project in each covered language; the dumps in Elasticsearch’s native format fall under the cirrussearch project. Elasticsearch deploys its full power as a search engine when searching for terms of a natural language, since it provides language-specific analyzers that transform the input string to improve the quality of the search, for example by deleting stopwords or by mapping related words to the same stem (stemming). In this way we will be able to run full-text searches on the dataset.
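To make this concrete, here is a minimal sketch of what such an analyzer does, using the official Python client against a local cluster (the connection URL is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The built-in "english" analyzer drops stopwords ("my", "to", "the")
# and stems the remaining tokens, so "ducks" is normalized to "duck".
result = es.indices.analyze(body={
    "analyzer": "english",
    "text": "my son threw bread crumbs to attract the ducks",
})
print([token["token"] for token in result["tokens"]])
# e.g. ['son', 'threw', 'bread', 'crumb', 'attract', 'duck']
```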
We will use Elasticsearch to retrieve all the entities that are possible candidates to represent a grapheme in a sentence whose meaning needs to be disambiguated.
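A minimal sketch of this lookup, assuming the cirrussearch dump has been indexed locally under the name enwiki (the exact query used by the microservices may differ; title and redirect.title are fields of the cirrussearch schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def candidate_entities(grapheme, index="enwiki", size=40):
    """Full-text search over page titles and redirect titles,
    returning the pages that could represent the grapheme."""
    response = es.search(index=index, body={
        "size": size,
        "query": {
            "multi_match": {
                "query": grapheme,
                "fields": ["title", "redirect.title"],
            }
        },
    })
    return [hit["_source"]["title"] for hit in response["hits"]["hits"]]

print(candidate_entities("python"))
```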
I said grapheme, and not term, deliberately: as I wrote at the beginning of the article, this feature is not provided in this first release of the disambiguator. The ability to identify an entity expressed by multiple terms depends on the logical analysis of the sentence and on the logical dependencies among its words. For example, the sentence:
- In Detroit tigers escaped from the zoo
does not refer to the Major League Baseball team, the Detroit Tigers, but to two distinct entities, Detroit and tiger. To better identify the syntactic structure of the sentence we will need to implement a POS tagger and a syntactic dependency parser in the libraries. In this first release I want to focus on the use of the graph for studying the semantic fields of concepts/entities. If you can’t wait for the next release, you can fork my repository on GitHub (you can find it at the beginning of the page) and develop a solution; if you do, let me know how it runs!
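As a possible starting point for that future release (not part of the current codebase), spaCy’s tagger and dependency parser already expose the information needed to group multi-word graphemes:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("In Detroit tigers escaped from the zoo")
for token in doc:
    # part-of-speech tag, dependency label and syntactic head of each word
    print(token.text, token.pos_, token.dep_, token.head.text)

# Noun chunks group syntactically linked words, which could help decide
# whether "Detroit tigers" is one multi-word entity or two separate ones.
print([chunk.text for chunk in doc.noun_chunks])
```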
First we will compute a cartesian product of the entities extracted for each grapheme. Since an indefinite number of entities could represent each grapheme, we need to build all the combinations that are candidates to represent the concepts of the sentence: in each combination, every grapheme is represented by one and only one of its possible entities (see the sketch after the example below). For example, in the sentence:
- The mouse is fond of cheese
We find three graphemes in the knowledge base and extract the following entities:
- Mouse: [‘Computer mouse’, ‘Mouse’, ‘Mouse (programming language)’, ‘Mouse (The matrix)’, […] ]
- Fond: [‘Stock (food)’]
- Cheese: [‘Cheese (recreational drug)’, ‘Cheese’, ‘Cheese (software)’, ‘Cheese (speedrunner)’, […] ]
In this scenario we can determine the possible combinations with a cartesian product, and therefore obtain:
- [‘Computer mouse’, ‘Stock (food)’, ‘Cheese (software)’]
- [‘Mouse (programming language)’, ‘Stock (food)’, ‘Cheese (recreational drug)’]
- [‘Mouse’, ‘Stock (food)’, ‘Cheese’]
- […]
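The sketch announced above: in Python this step is a one-liner with itertools.product (the candidate lists here are truncated versions of the ones extracted above):

```python
from itertools import product

# candidate entities per grapheme, as extracted from the search engine
candidates = {
    "mouse": ["Computer mouse", "Mouse", "Mouse (programming language)"],
    "fond": ["Stock (food)"],
    "cheese": ["Cheese (recreational drug)", "Cheese", "Cheese (software)"],
}

# each combination picks exactly one candidate entity per grapheme
for combination in product(*candidates.values()):
    print(combination)
```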
Only one of these combinations represents the true concepts behind this sentence. To understand which is the right one we must identify the semantic context of the sentence.
To achieve this, we need to study the relationships among the entities of all the candidate combinations. This can be done efficiently with a graph database such as Neo4j. I chose Neo4j because it adopts native graph storage and processing in its architecture: the study is based on applying an algorithm (shortestPath) to the Wikipedia graph, and a native graph architecture ensures better performance.
Each Wikipedia page and each redirect is represented as a node of the graph, and there is only one relationship type, has_link, which represents the hypertext links among the pages. A has_link relationship therefore stands for an outgoing link from one Wikipedia page to another, and it is a directed edge inside the graph.
Redirects between pages are thus represented as if they were entities of their own. This slightly distorts the initial model, because in the graph a redirected entity and the redirecting entity appear as two different pages, whereas in Wikipedia they belong to the same entity. However, the extra hop from an entity to its redirect is minimal: the distance is somewhat increased, but the computational complexity benefits from this choice. To achieve full agreement between my model and the native structure of Wikipedia it would be necessary to introduce redirection as a property of a node, but this would be expensive in terms of resources during the query phase. Maybe a future version of the work will change the representation model.
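As a sketch, this is how such a model could be populated with the official Neo4j Python driver. The :Page label, the title property and the connection settings are assumptions of mine, not necessarily the schema used by the project:

```python
from neo4j import GraphDatabase

# placeholder connection settings
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_link(tx, source, target):
    # one node label, one relationship type; a redirect becomes a node
    # of its own, linked to its target like any other page
    tx.run(
        "MERGE (a:Page {title: $source}) "
        "MERGE (b:Page {title: $target}) "
        "MERGE (a)-[:has_link]->(b)",
        source=source, target=target,
    )

with driver.session() as session:
    session.write_transaction(add_link, "Python (programming language)", "Guido van Rossum")
```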
The postulate on which the analysis is based is that, in the knowledge graph, the distance between related entities in the same semantic field is shorter than the distance between unrelated entities: starting from an initial entity, we need to walk a longer path inside the graph to reach an unrelated entity. So, for example, the entities:
- Dog (as Canis lupus familiaris)
- Animal
will be closer in the graph than unrelated entities, such as:
- Dog (as Canis lupus familiaris)
- Nuclear fusion
While this concept may appear obvious at first glance, it actually depends on two factors:
- the completeness of the entity definitions in Wikipedia, of course
- the punctual presence of outgoing links on the pages, so that for most of the concepts mentioned on a page there is a link leading back to that specific concept
Only with these two prerequisites will the graph be semantically rich enough to demonstrate the initial postulate.
A further point could hinder our purposes: it is impossible to identify morphemes that do not correspond to the base form of a word. For example, the suffix “s” of the third person singular of a verb makes the form divergent from the form written in Wikipedia. This could easily be solved with a lemmatizer, which groups together the inflected forms of a word, but at the moment it has not been implemented yet.
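A minimal sketch of such a lemmatization step, using NLTK’s WordNet lemmatizer (one possible choice among many):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # WordNet data is needed once
lemmatizer = WordNetLemmatizer()

# map inflected forms back to the base form used in Wikipedia titles
print(lemmatizer.lemmatize("ducks"))          # -> duck
print(lemmatizer.lemmatize("eats", pos="v"))  # -> eat
```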
To verify the shortest path among a set of nodes, Neo4j comes to our aid with ready-made implementations of the algorithm we are looking for: the shortestPath and allShortestPaths functions, available directly in Cypher. In this way, given a set of initial nodes/entities, you can easily identify the degree of distance among them. For this study we will consider relationships up to the third degree of distance. Given a set of entities such as:
[‘Pen’, ‘Pencil’, ‘Paper’]
With a specific Cypher query we can draw the subgraph corresponding to this set of entities and identify how many connections, and which ones, exist among them up to the third degree of distance. For this set we find four first-degree relationships, that is, four direct links among the entities; we can therefore infer that they share the same semantic field. The figure shows a graphical representation of their subgraph.
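Here is a sketch of that kind of query, reconstructed from the shape of the result tables in the demo below; the :Page label, the title property and the connection settings are my assumptions:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# cross every entity of the set with every other one and count the
# shortest paths found for each length, up to the third degree
QUERY = """
UNWIND $titles AS source
UNWIND $titles AS target
MATCH (a:Page {title: source}), (b:Page {title: target})
WHERE a <> b
MATCH path = allShortestPaths((a)-[:has_link*..3]->(b))
RETURN length(path), count(*)
ORDER BY length(path)
"""

with driver.session() as session:
    for record in session.run(QUERY, titles=["Pen", "Pencil", "Paper"]):
        print(record["length(path)"], record["count(*)"])
```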
At the moment, to compare all the candidate combinations of entities, we do not use any specific algorithm: we simply count the number of 1st-, 2nd- and 3rd-degree relationships among them, which already express the distance in the graph. In the future I plan to also weigh the relationships by the PageRank of the pages: the greater the PageRank of a page, and therefore its centrality in the Wikipedia graph, the less important the relationships it establishes with the other pages. This feature is still under study.
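One plausible reading of this comparison, consistent with the demo results in the next section, is a lexicographic ranking: more first-degree connections always win, and ties are broken by the lower degrees. A sketch, using the counts reported below for the first demo sentence:

```python
def rank_key(path_counts):
    """More first-degree paths always wins; ties are broken by the
    second-degree count, then by the third-degree count."""
    return (path_counts.get(1, 0), path_counts.get(2, 0), path_counts.get(3, 0))

# shortest-path counts per length, taken from the tables in the demo
combinations = {
    ("Python (film)", "Apollo (spacecraft)", "Delphi, County Mayo"): {3: 570},
    ("Python (programming language)", "Apollo (cable system)", "Delphi (software)"): {2: 64, 3: 451},
    ("Python (mythology)", "Apollo", "Delphi"): {1: 6},
}

best = max(combinations, key=lambda combo: rank_key(combinations[combo]))
print(best)  # ('Python (mythology)', 'Apollo', 'Delphi')
```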
3. Demo
The time has now come to verify by induction the initial postulate on the semantic context, namely that entities sharing the same semantic field have a shorter distance in the graph. To do this we will take the python grapheme and formulate three different small sentences in which it corresponds to as many different entities:
- Apollo killed Python in Delphi
- I coded python microservices for this app
- A python is eating a mouse
For each of these sentences, as discussed above, we will look up the possible candidate entities. In the first sentence we find three graphemes in Wikipedia: Apollo, Python and Delphi. We extract these entities:
‘apollo’: [‘Apollo (storeship)’, ‘Apollo (System Copernicus)’, ‘Apollo’, ‘Apollo (1962 automobile)’, ‘Apollo Crews’, ‘List of Saint Seiya antagonists’, ‘Apollo (1812 EIC ship)’, ‘Adobe AIR’, ‘Olympians (Marvel Comics)’, ‘Apollo (spacecraft)’, ‘Apollo (journal)’, ‘Apollo (Timebelle song)’, ‘Apollo (Nebula album)’, ‘Apollo (crater)’, ‘List of Star Trek characters (A–F)’, “Jim Henson’s Pajanimals”, ‘Apollo (quintet)’, ‘University of Cambridge’, ‘Apollo (horse)’, ‘Apollo (Hardwell song)’, ‘Banco de Gaia’, ‘Apollo (magazine)’, ‘Apollo (Paris)’, ‘Apollo (butterfly)’, ‘Apollo (comics)’, ‘Apollo (1798 ship)’, ‘Apollo (1910 automobile)’, ‘Apollo (ship)’, ‘Adrian Frutiger’, ‘Apollo (cable system)’, ‘Apollo (ballet)’, ‘Apollo (1906 automobile)’, ‘Apollo (Battlestar Galactica)’, ‘Porno Graffitti discography’, ‘Apollo (Michelangelo)’]
‘python’: [‘Python (mythology)’, ‘Monty Python’, ‘Launched Loop (Arrow Dynamics)’, ‘Python (Ford prototype)’, ‘Python (film)’, ‘Python (automobile maker)’, ‘Python molurus’, ‘Python (Efteling)’, ‘Python (programming language)’, ‘PYTHON’, ‘Python (Monty) Pictures’, ‘Miguel discography’, ‘Python (missile)’, ‘Python (painter)’, ‘Python’, ‘Python (genus)’, ‘Python (Coney Island, Cincinnati, Ohio)’]
‘delphi’: [‘Delphi’, ‘Delphi, Indiana’, ‘DELPHI experiment’, ‘Object Pascal’, ‘Pantheon (Marvel Comics)’, ‘Morlocks (comics)’, ‘Delphi (comics)’, ‘Delphi (modern town)’, ‘Delphi (online service)’, ‘Delphi (software)’, ‘Delphi, County Mayo’, ‘DelPhi’, ‘Aptiv’]
From these extractions we compose the candidate combinations and check the relationships among them. Below is a sample:
[‘Python (film)’, ‘Apollo (spacecraft)’, ‘Delphi, County Mayo’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│3 │570 │
└──────────────┴──────────┘
[‘Python (programming language)’, ‘Apollo (cable system)’, ‘Delphi (software)’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│3 │451 │
├──────────────┼──────────┤
│2 │64 │
└──────────────┴──────────┘
[‘Python (mythology)’, ‘Apollo’, ‘Delphi’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│1 │6 │
└──────────────┴──────────┘
As you can see, even unrelated entities still have a third-degree connection: as you probably already know from the theory of six degrees of separation and the small-world experiment, in a connected network a few hops are enough to link any two nodes. It can also be noted that, in this case, each of the three correct entities has a direct connection, incoming and outgoing, with the other two. This is the maximum possible connection among three entities in a directed graph with a single relationship type, and it therefore allows us to assert without any doubt that these three concepts are related and share the same semantic field.
For the second sentence it is easier to identify the semantic context, as the microservices grapheme has only one possible candidate, so fewer candidate combinations need to be built to ascertain the result. Here are a few examples:
[‘Microservices’, ‘Python (painter)’, ‘App (film)’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│3 │88 │
└──────────────┴──────────┘
[‘Microservices’, ‘Python (genus)’, ‘Amyloid precursor protein’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│2 │6 │
└──────────────┴──────────┘
[‘Microservices’, ‘Python (programming language)’, ‘Application software’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│2 │198 │
└──────────────┴──────────┘
As you can see, in this case there are no direct links among the entities, and we must rely on the number of second-degree links to identify the correct combination. Again there is no doubt about the correct combination, which has an extremely high number of second-level connections. This suggests that the entities are not unrelated fragments of different concepts, but related concepts within a given area.
For the third sentence, the same reasoning as for the previous one applies, since the grapheme eating has only three possible candidates. Let’s see some combinations for this sentence:
[‘Competitive eating’, ‘Mouse (set theory)’, ‘Python (film)’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│2 │3 │
├──────────────┼──────────┤
│3 │46 │
└──────────────┴──────────┘
[‘Eating (film)’, ‘Computer mouse’, ‘Monty Python’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│2 │12 │
├──────────────┼──────────┤
│3 │80 │
└──────────────┴──────────┘
[‘Eating’, ‘Mouse’, ‘Python (genus)’]
╒══════════════╤══════════╕
│”length(path)”│”count(*)”│
╞══════════════╪══════════╡
│2 │30 │
└──────────────┴──────────┘
In this case too, the correct combination is the one with the greatest number of second-degree connections. We can therefore say that we have inductively demonstrated that the starting postulate is correct, and that the graph approach can be a valid strategy for word sense disambiguation. If you want, you can test my algorithm with the code released on GitHub. Please contact me if you improve it, perhaps by adding some of the linguistic components that my solution currently lacks.