WHICH JAGUAR DID YOU MEAN?

Label unstructured data using Enterprise Knowledge Graphs, Part 2

How to correctly link entities from text to your Knowledge Graph

This is the second part of the series about word sense induction and disambiguation (WSID) with knowledge graphs (KGs).

In this part we design a more robust approach that can work with a small corpus and induce senses quite quickly. The approach uses deep learning models, in particular pre-trained language models. In this article you will find a description of the method with illustrative examples, some analysis, and code samples to reproduce the results and quickly get started with your own task, if you have one. There are also a few interesting references and a link to our challenge!

We start with a quick recap of the problem statement from Part 1. If you are not interested in the problem statement, you may proceed to the analysis of the method introduced in Part 1. If you are not interested in the analysis either, you can jump straight to the new method.

If you are new to the topic of (enterprise) knowledge graphs and would like to start building a KG for your organization I highly recommend the new book “The Knowledge Graph Cookbook” by our CEO Andreas Blumauer and our COO Helmut Nagy. You will get guidance and advice based on 20 years of experience implementing KGs in enterprises. Follow the link and get your free copy!

Acknowledgement

This work is part of the Prêt-à-LLOD project, supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 825182.

Table of Contents

- Introduction: Looking for concepts behind words
  - Running example — “Jaguars”
- Analysis of the WSID with co-occurrence graph
- New method
  - Clustering
  - Related work
- Code
- Conclusion
  - To be continued
  - References

Looking for concepts behind words

Simple examples like the one below demonstrate that string matching, even with linguistic extensions, is not enough to decide whether a word represents a resource from the Knowledge Graph. We need to disambiguate words, that is, to discover which concepts stand behind them.

Given (1) a text, (2) a word of interest (or target word), and (3) a Knowledge Graph, decide which resource from the Knowledge Graph the word of interest represents. Here is an example:

BMW has designed a car that is going to drive Jaguar X1 out of the market.

This is what the typical formulation of the disambiguation against Knowledge Graph task, also called entity linking, would look like:

Typical Entity Linking task statement

However, this formulation is only suitable for large Knowledge Graphs like DBpedia or Wikidata, and only when one is interested in disambiguating between the senses already represented in the Knowledge Graph. For enterprise Knowledge Graphs the task should be posed differently. Enterprise Knowledge Graphs are smaller than DBpedia and usually highly specific to their domains. They often do not contain resources that can be found in public Knowledge Graphs, or they contain general-purpose words in domain-specific meanings, for example as the name of a new game. Therefore, a more suitable formulation of the problem statement is:

Disambiguation with Enterprise Knowledge Graphs

Running example — “Jaguars”

[{1: “The jaguar’s present range extends from Southwestern United States and Mexico in North America, across much of Central America, and south to Paraguay and northern Argentina in South America.”},
{2: “Overall, the jaguar is the largest native cat species of the New World and the third largest in the world.”},
{3: “Given its historical distribution, the jaguar has featured prominently in the mythology of numerous indigenous American cultures, including those of the Maya and Aztec.”},
{4: “The jaguar is a compact and well-muscled animal.”},
{5: “Melanistic jaguars are informally known as black panthers, but as with all forms of polymorphism they do not form a separate species.”},
{6: “The jaguar uses scrape marks, urine, and feces to mark its territory.”},
{7: “The word ‘jaguar’ is thought to derive from the Tupian word yaguara, meaning ‘beast of prey’.”},
{8: “Jaguar’s business was founded as the Swallow Sidecar Company in 1922, originally making motorcycle sidecars before developing bodies for passenger cars.”},
{9: “In 1990 Ford acquired Jaguar Cars and it remained in their ownership, joined in 2000 by Land Rover, till 2008.”},
{10: “Two of the proudest moments in Jaguar’s long history in motor sport involved winning the Le Mans 24 hours race, firstly in 1951 and again in 1953.”},
{11: “He therefore accepted BMC’s offer to merge with Jaguar to form British Motor (Holdings) Limited.”},
{12: “The Jaguar E-Pace is a compact SUV, officially revealed on 13 July 2017.”}]

The example contains twelve contexts featuring the target word “jaguar” in different senses. The first six contexts speak about the jaguar as an animal, the last five mention Jaguar as a car manufacturer. The seventh context refers to both senses, as it describes the etymology of the word. Consequently, our desired output would be two senses: the first expressed in the first six contexts and the second expressed in the last five. How the senses are represented depends on the method and on the word sense disambiguation procedure; for example, a sense could be represented as a (weighted) list of words describing it, such as synonyms.
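To make the desired output concrete, here is a purely illustrative sketch of such a representation; the descriptor words below are hypothetical placeholders, not the output of any system:

# Hypothetical sketch of the desired output: one entry per sense, with
# the contexts it covers and a few descriptor words (synonyms, hypernyms).
# The descriptor words are made-up placeholders.
desired_senses = [
    {"contexts": [1, 2, 3, 4, 5, 6],       # the animal sense
     "descriptors": ["big cat", "panther", "predator"]},
    {"contexts": [8, 9, 10, 11, 12],       # the car manufacturer sense
     "descriptors": ["carmaker", "automobile", "brand"]},
]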

Analysis of the WSID with co-occurrence graph

The method introduced in Part 1 relies on constructing the co-occurrence graph and finding clusters in that graph. In order to create such a graph we need a corpus of texts.

Technical limitations

  1. If we pre-compute the co-occurrence graph of some large corpus, for example the whole of Wikipedia, we will only be able to induce global senses, i.e. those senses that are well represented in the whole corpus. As the method is not perfect, it will miss senses that are less represented, such as domain-specific senses. If we want to induce domain-specific senses, we need to compute the co-occurrence graph of a domain-specific corpus.
  2. We need all the co-occurrences of the target sense in order to disambiguate. It does not suffice to have just a few, because we might not find them in the context. Hence we need to compute the whole co-occurrence graph in order to induce and then disambiguate. Moreover, though hypernyms and/or synonyms do influence the induction process, we cannot take them into account directly, without performing the induction step.

In short, co-occurrence graph clustering can only be performed when enough time can be spent on preparation: collecting a corpus that fits the intended usage, then computing the co-occurrence graph and clustering it. We are not satisfied with this solution, because in industrial settings such preparation seriously limits the range of applications.

We would like to be able to induce senses with a smaller corpus that does not require specific preparation. Moreover, we would like to relax the requirement of having very rich information about the sense, such as the weighted list of all co-occurrences. It would be better to use the information that is already provided (synonyms, hypernyms) without needing to induce the given sense. However, if this information is not provided or is noisy, we should be able to induce the sense. The disambiguation method should then work equally well with induced senses and with already provided hypernyms/synonyms, no matter where the information comes from.

New method: clustering predictions of language models

Recently, pre-trained language models have become a very popular tool for a wide range of NLP applications. One can get a quick overview of the tasks solved with the help of such models from the General Language Understanding Evaluation benchmark; a look at the leaderboard shows that almost all of the top-ranked models are based on transformers.

We observe that pre-trained language models are a natural fit for the Word Sense Induction task. Let us take our example, “BMW has designed a car that is going to drive Jaguar X1 out of the car market.”, and check the predictions of the model; we can even do this online using a great demo from AllenNLP.
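If you prefer to reproduce this locally, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline; bert-base-uncased is an illustrative model choice, not necessarily the model behind the experiments reported below:

# Minimal sketch: top-k substitutes for a target word from a pre-trained
# masked language model; bert-base-uncased is an illustrative choice.
import re
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def top_substitutes(context, target, top_k=20):
    # Mask the first occurrence of the target word (case-insensitive,
    # tolerating a plural or a possessive 's) and return the model's
    # top-k single-token substitutes.
    masked = re.sub(rf"\b{target}\w*(['’]s)?",
                    fill_mask.tokenizer.mask_token,
                    context, count=1, flags=re.IGNORECASE)
    return [p["token_str"] for p in fill_mask(masked, top_k=top_k)]

print(top_substitutes("BMW has designed a car that is going to drive "
                      "Jaguar X1 out of the car market.", "jaguar", top_k=5))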

Table 1. Example of predictions obtained from pre-trained language model

As we can see from Table 1, the model predicts quite plausible substitutes. For the second input the model predicts “animal” within its top 3 results; this prediction is clearly a hypernym of the intended sense of “jaguar”. For the first input, however, we find neither a hypernym nor a synonym, though “BMW” and “Audi” give us a good clue. It is curious that the model also “guesses” the correct substitute, i.e. “Jaguar”. Unfortunately, this does not help us with induction.

In the next tiny experiment we include a few more contexts about cars.

Table 2. Predictions of language model for the “car” sense of Jaguar. Notice that top substitutes repeat over different contexts.

As the number of contexts grows, we see some emerging patterns: “BMW”, “Ford”, “Ferrari”. In the online demo we only see the top 5 predictions; to induce clean senses and reduce noise in the predictions, we take a substantial number of predictions (between 20 and 100) for each context and then cluster these, i.e. we look for patterns in the predictions across different contexts.
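In code, the collection step might look like this; the sketch assumes top_substitutes() from above and a dict contexts holding the twelve numbered sentences of the running example:

# Collect the top-25 substitutes for every context of the running example.
# Assumes top_substitutes() from the previous sketch and `contexts`, the
# dict of twelve numbered sentences listed earlier.
substitutes = {num: top_substitutes(text, "jaguar", top_k=25)
               for num, text in contexts.items()}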

Clustering

Now we need to choose a clustering algorithm, and the choice is broad. After several experiments we settled on an algorithm from Formal Concept Analysis. We outline the main ideas here and refer the reader to [3] for further details.

The predictions of the language model are represented as a binary matrix: the columns are the different contexts, the rows are the different predictions, and a cross marks that a prediction is given for a particular context. As predictions repeat over different contexts, at least some rows have several crosses; in practice, many do.

Table 3. The binary matrix of substitutes. The columns are different contexts from the running example, numbering is preserved. The rows are the different substitutes as predicted by our language model. A cross means that a substitute is predicted by our language model for the given context. For example, “leopard” is a predicted substitute for contexts 2, 4 and 5. The number of substitutes is manually limited to fit the page.
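In code, a matrix like the one in Table 3 could be assembled as follows; the sketch assumes the substitutes dict collected above:

# Assemble the binary matrix: columns are contexts, rows are substitutes,
# and True marks a cross. Assumes the `substitutes` dict from above.
import numpy as np

all_subs = sorted({s for subs in substitutes.values() for s in subs})
context_ids = sorted(substitutes)   # fixes the column order
matrix = np.array([[s in substitutes[c] for c in context_ids]
                   for s in all_subs])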

Next we try to identify large (maximal) rectangles that are full of crosses, up to a reordering of rows and columns. The task is known to be computationally hard; luckily, we do not need all such rectangles. We only aim at a minimal number of rectangles that together cover 60–80% of the crosses in the whole table. This problem is well investigated, and we implement an efficient algorithm; see [3].
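The algorithm of [3] is considerably more refined than what fits here, but its core idea can be sketched as a greedy cover: repeatedly grow the all-ones rectangle that covers the most not-yet-covered crosses, and stop once the target coverage is reached. A simplified sketch:

# Simplified greedy sketch of the idea behind the decomposition in [3]:
# repeatedly grow an all-ones rectangle (a "formal concept") covering as
# many still-uncovered crosses as possible, until the target coverage.
import numpy as np

def induce_senses(matrix, coverage=0.8):
    # matrix: boolean array, rows = substitutes, columns = contexts.
    # Returns a list of (row indices, column indices) rectangles;
    # each rectangle is one induced sense.
    uncovered = matrix.copy()
    remainder = (1 - coverage) * matrix.sum()
    senses = []
    while uncovered.sum() > remainder:
        cols, gain = [], 0
        while True:  # grow the rectangle one context (column) at a time
            best_j, best_gain = None, gain
            for j in range(matrix.shape[1]):
                if j in cols:
                    continue
                rows = matrix[:, cols + [j]].all(axis=1)
                g = uncovered[np.ix_(rows, cols + [j])].sum()
                if g > best_gain:
                    best_j, best_gain = j, g
            if best_j is None:  # no column enlarges the covered area
                break
            cols, gain = cols + [best_j], best_gain
        rows = matrix[:, cols].all(axis=1)
        uncovered[np.ix_(rows, cols)] = False
        senses.append((np.flatnonzero(rows), cols))
    return senses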

The clusters induced for our example are:

Contexts = [2, 4, 6]
Substitutes = ['frog', 'lizard', 'monkey', 'mouse', 'turtle']

Contexts = [10, 12]
Substitutes = ['ford', 'honda', 'renault']
The clusters contain few contexts and are therefore quite fragile. Moreover, the second factor gets only 3 substitutes. These shortcomings are due to the fact that the example is quite small and the number of substitutes for the analysis is limited to 25 in order to fit the page. In practice, if the number of contexts is small, one would choose a larger number of substitutes, and vice versa.

As our tiny example shows, we are able to induce nice senses even from a very small dataset, taking only short contexts into account (as opposed to the co-occurrence graph computation, where we needed the whole texts).

Related work

I do not provide a full overview of related work, but I would like to mention a few remarkable efforts that are closely related to the method described here. In [1] the authors use a very similar approach; the differences are that we do not use any language patterns and we use a different clustering algorithm. The authors of [1] report the best results for SemEval 2013 Task 13 that I have seen: ~25.5, measured as the geometric mean of Fuzzy NMI and Fuzzy B-Cubed. For comparison, the score of our method is around 24.5.

The work [2] is interesting because it achieved probably the best results among all systems in the original SemEval challenge. Moreover, it uses an interesting clustering approach. I also tried an implementation of this clustering method, but did not get good results. I attribute this to the fact that the clustering method of [2] takes the similarity between substitutes into account; however, these similarities are already used implicitly when the predictions are generated, and it might be undesirable to account for the same factor twice.

Code

This section covers an example of using the method to produce the results from the previous section.
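Concretely, the pieces above chain together like this. The listing is a sketch: it assumes top_substitutes() and induce_senses() from the earlier sketches and, to stay short, spells out only four of the twelve contexts; the experiment above uses all twelve:

# End-to-end sketch: predict substitutes with the masked LM, build the
# binary matrix, and factor it into senses. Assumes top_substitutes()
# and induce_senses() from the sketches above.
import numpy as np

contexts = {
    1: "The jaguar's present range extends from Southwestern United "
       "States and Mexico in North America, across much of Central "
       "America, and south to Paraguay and northern Argentina in "
       "South America.",
    2: "Overall, the jaguar is the largest native cat species of the "
       "New World and the third largest in the world.",
    8: "Jaguar's business was founded as the Swallow Sidecar Company "
       "in 1922, originally making motorcycle sidecars before "
       "developing bodies for passenger cars.",
    9: "In 1990 Ford acquired Jaguar Cars and it remained in their "
       "ownership, joined in 2000 by Land Rover, till 2008.",
    # ... the remaining contexts of the running example go here.
}

substitutes = {num: top_substitutes(text, "jaguar", top_k=25)
               for num, text in contexts.items()}

all_subs = sorted({s for subs in substitutes.values() for s in subs})
context_ids = sorted(substitutes)
matrix = np.array([[s in substitutes[c] for c in context_ids]
                   for s in all_subs])

for row_idx, col_idx in induce_senses(matrix, coverage=0.8):
    print("Contexts    =", sorted(context_ids[c] for c in col_idx))
    print("Substitutes =", sorted(all_subs[r] for r in row_idx))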

Conclusion

With pre-trained language models we are able to induce senses from a small number of contexts containing the target word, which is even better than our initial goal of merely needing a smaller corpus. The senses are represented by the predictions of the language model and are often a mixture of synonyms and hypernyms. Sometimes, however, a sense is described by merely “similar” words, such as the competitors predicted for Jaguar.

The disambiguation works quite well with the induced senses. However, we cannot easily take multi-token synonyms into account, because we cannot easily compare one-word and multi-word predictions of the language model. In practice the tokenizer uses word pieces, so we also cannot use single-token words that are out of the vocabulary.

To be continued

We still see room for improvement, for better approaches that would allow us to disambiguate senses in an even broader range of applications. Namely, with the current approach we still need to perform the induction step in order to disambiguate afterwards. If we are not interested in the senses per se, the induction step is an additional burden.

In the third part we put a special focus on disambiguation to make it even more flexible. We want to find models that can disambiguate quickly and reliably without the need to induce at all, even if the sense inventory is incomplete, i.e. if only a single sense is known. Together with our great collaborators Jose Camacho-Collados, Mohammad Taher Pilehvar and Kiamehr Rezaee, Anna Breit and I have launched the Target Sense Verification for Words in Context challenge.

The prepared task differs from conventional WSD and EL benchmarks in that it is independent of any general sense inventory, which makes it highly flexible for evaluating a diverse set of models and systems in different domains.

Following the link you will find a detailed description, the dataset and the paper describing the challenge. We invite you to submit your results!

References

  1. Amrami, A., & Goldberg, Y. (2018). Word Sense Induction with Neural biLM and Symmetric Patterns. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4860–4867.
  2. Baskaya, O., Sert, E., Cirik, V., & Yuret, D. (2013). AI-KU: Using substitute vectors and co-occurrence modeling for word sense induction and disambiguation. *SEM 2013: 2nd Joint Conference on Lexical and Computational Semantics, 2, 300–306.
  3. Belohlavek, R., & Vychodil, V. (2010). Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences, 76(1), 3–20.
