Is Word Sense Disambiguation outdated?
How Target Sense Verification improves domain-specific and enterprise disambiguation settings
Dealing with ambiguous words has a long tradition in Natural Language Processing (NLP). However, in this fast-moving field, the original formulations of this task no longer meet the requirements of modern domain-specific and enterprise settings.
In this post, we will take a deeper look at
- the shortcomings of current Word Sense Disambiguation (WSD) and Entity Disambiguation (ED) task formulations
- how Target Sense Verification (TSV) can improve this situation
- a real-world setting of the TSV task
- what a disambiguation model solving this task could look like
Disclaimer: This post is based on a recent publication at EACL 2021 titled “WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context”, published by Artem Revenko, Jose Camacho Collados, Taher Pilehvar, Kiamehr Rezaee, and — well — my name equivalence with the first author of that paper is no coincidence :)
Ambiguous Words
Words in natural languages mostly consist of three different aspects: their spelling, their pronunciation, and their meaning. Ideally, for each distinct meaning, we would have a distinct written surface form and a unique pronunciation. However, this would result in an incredible vocabulary size we would have to remember in order to communicate with each other! To make things simpler for us (and keep some brain space for non-language stuff), we reuse spellings and pronunciations, resulting in Homographs and Homophones.
Homographs describe words that have the same spelling but mean something different, like the metal lead, and the lead that detectives search for in order to solve a case. Homophones are words that have the same pronunciation but mean something different, like breakfast cereal and serial killer.
Now, as our focus will be on written text rather than on speech, we will take a closer look at homographs, and how they influence the performance of downstream NLP tasks.
For example, let’s take a look at word embeddings:
The aim of word embeddings is to create vectors for words such that similar words are close together in vector space. If we ignore the presence of homographs, this means that for every string, exactly one vector will be produced.
If we take a look at the vector for mouse, it will be close to other animals such as rat and rabbit, which makes sense, as a mouse is a rodent. The vector for mouse will also be close to computer terms such as keyboard or computer, which also makes sense, as mouse refers to a peripheral device. However, following this paradigm, rat (a rodent) will now also be very close to keyboard (a peripheral device), which doesn't make any sense anymore. These skewed placements in the vector space naturally have a great influence on downstream tasks, e.g., finding synonyms and analogies, as animals now seem similar to computer terms.
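To make this concrete, below is a minimal sketch of the one-vector-per-string problem. It assumes gensim and one of its downloadable GloVe models; the exact similarity scores will depend on the vectors used.

```python
# Minimal sketch: a static embedding model has exactly one vector per string,
# so all senses of "mouse" are conflated in that single vector.
# Assumes gensim is installed; the model name is one of gensim's downloadable sets.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # one vector per surface form

# "mouse" can be similar to both an animal term and a computer term,
# because both senses are mixed into the same vector.
print(vectors.similarity("mouse", "rat"))
print(vectors.similarity("mouse", "keyboard"))

# As a side effect, unrelated terms like "rat" and "keyboard" can end up
# closer together in the vector space than they should be.
print(vectors.similarity("rat", "keyboard"))
```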
If you need an easily digestible introduction to word embeddings, I suggest this post, whereas if you want to know more about Word, Sense, and Contextualized Embeddings this post gives you a nice overview.
Word Sense Disambiguation and Entity Disambiguation
Now that we have seen how ambiguous words can influence NLP tasks, we want to find a way to correctly disambiguate them.
The original formulation of this task is called Word Sense Disambiguation (WSD). In WSD, the task is to find out which sense of a word is used in a given sentence. To do so, the word in a given context is typically compared to all possible senses in a so-called sense inventory, from which the most suitable sense is picked. The most prominent example of such a sense inventory is WordNet.
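To illustrate this inventory-based setup, here is a minimal sketch using NLTK's WordNet interface and the classic Lesk algorithm. It is only meant to show that the system has to choose among all senses listed in the inventory, not to represent a state-of-the-art WSD system.

```python
# Minimal sketch of inventory-based WSD: pick the best-matching WordNet sense.
# Assumes NLTK is installed and its WordNet and tokenizer data are downloaded.
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "The detective followed every lead to solve the case."

# All candidate senses come from the sense inventory (WordNet) ...
for synset in wn.synsets("lead"):
    print(synset.name(), "-", synset.definition())

# ... and the disambiguation step picks the most suitable one.
predicted_sense = lesk(word_tokenize(sentence), "lead")
print(predicted_sense.name(), "-", predicted_sense.definition())
```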
WordNet covers a great number of general domain terms, including all their senses, synonyms, and hypernyms (i.e., “parent terms”). However, WordNet does not cover any Named Entities, such as specific persons, locations, or organisations. Such Named Entities on the other hand are stored as entries in a Knowledge Base.
Accordingly, Entity Disambiguation (ED) describes the task of linking entities that are mentioned in a text to their corresponding entry in a Knowledge Base. General domain Knowledge Bases such as DBpedia cover a huge span of entities and are therefore quite suitable for serving as “sense inventories” in different disambiguation settings.
However, both these task formulations also come with disadvantages.
As we always try to find the most suitable sense or entry in a given sense inventory, systems need to model the senses according to this underlying inventory.
First, this reduces the flexibility of the models that try to solve the presented task. More specifically, the models depend on the sense inventory, as changes to the inventory require changes to the model.
Secondly, this formulation assumes the availability of all senses. However, as it takes a lot of effort to maintain huge general sense inventories such as WordNet, they often lag behind in being up-to-date, leading to the absence of novel terms and term usages. Furthermore, the coverage of domain-specific terms in general sense inventories is quite limited, while domain-specific sense inventories are rare and in most cases incomplete.
Last but not least, when modeling entire sense inventories, all senses have to be taken into account, while in reality only a small number of senses is relevant for a specific use case.
Therefore, the current disambiguation task formulations are not suitable for many domain-specific and enterprise settings.
All the letters in ALPHABET ….
To showcase the limitations of the current disambiguation task formulations, let’s take a look at a concrete example.
Let’s assume we want to know more about information technology. Therefore, we are crawling different web pages, and want to find mentions of technology companies, e.g., of Alphabet, the parent company of Google.
And during this search we stumble upon the following sentence:
“I had a very interesting interview with Terno Schwab, the CEO of Alphabet.”
For some context, Terno Schwab is not the CEO of Alphabet, the technology company, but of Alphabet, the vehicle fleet management company.
So how would we correctly disambiguate this mention of Alphabet? The term will not appear in WordNet or other general sense inventories, as it is a Named Entity. The entity will also not appear in a general domain Knowledge Base such as DBpedia, as it is too domain-specific. Furthermore, it will also not appear in any domain-specific resources that we might have used, as our use case revolves around information technology, and this specific Alphabet clearly has nothing to do with technology companies. So, even if we model all available senses both from general domain and domain-specific resources, we will still not be able to match this specific mention of Alphabet to its corresponding sense, as we will still be missing the correct sense.
And before someone comes up with the idea of “Well, then just add the missing sense to one of the resources”, I would like to ask: Do you have any idea how many Alphabets are out there in the world? We would also have to add this Alphabet, and this Alphabet, and this Alphabet, and this Alphabet, and this Alphabet, and this Alphabet, and this Alphabet, and this Alphabet, and this Alphabet, and … you get the point ;)
To summarise:
- Current disambiguation task formulations aim at finding the most suitable sense of a word in a given sentence.
- This formulation requires systems to model the entire sense inventory, which makes the system inflexible and assumes the availability of all senses, which is not always given.
- Therefore, these task formulations are not suitable for many modern domain-specific and enterprise settings.
But how can the disambiguation task formulation be improved?
Target Sense Verification to the rescue
From the aforementioned shortcomings, we can extract two aspects that need to be improved.
Independence of sense inventories: In order to get rid of the dependencies on sense inventories, we have to rethink the way of defining a sense. More concretely, a sense should be able to exist on its own, without the need for any inventory. One way to achieve this is to define sense indicators that describe the sense, e.g., a definition and hypernyms.
Number of required available senses: As we have seen that it is not realistic to always have all senses available, it is necessary to minimise the number of senses that are required / assumed to be present. Ideally, we would only need to know one single sense.
With these two simple enhancements, a new task formulation can be formed: Instead of comparing a word in a context against all possible senses, it is only verified against one target sense. The target sense is the sense we are interested in (in our use case) and is represented by generalisable sense indicators. Simplified, the task formulation is translatable to the binary question “Is this word in that sentence used in this specific sense?”.
This new task is called Target Sense Verification (TSV).
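To make the formulation tangible, a single TSV instance could be represented roughly as in the following sketch. The field names, definition, and hypernyms are illustrative and do not correspond to any official benchmark format.

```python
# Illustrative sketch of a single Target Sense Verification instance.
# Field names, definition, and hypernyms are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class TSVInstance:
    context: str          # sentence containing the target word
    target_word: str      # the ambiguous word to verify
    definition: str       # sense indicator 1: a textual definition of the target sense
    hypernyms: List[str]  # sense indicator 2: "parent terms" of the target sense
    label: bool           # True if the target word is used in the target sense

# Our use-case target sense is Alphabet, the technology company, so the
# fleet-management mention from the example sentence gets the label False.
example = TSVInstance(
    context="I had a very interesting interview with Terno Schwab, the CEO of Alphabet.",
    target_word="Alphabet",
    definition="American technology conglomerate and parent company of Google",
    hypernyms=["technology company", "holding company"],
    label=False,
)
```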
The advantages of this formulation in domain-specific and enterprise settings are obvious:
- Existing enterprise and domain-specific senses can be easily used as target senses for this task, regardless of their current representation in the use case environment. As sense indicators are quite generic, they usually can be easily generated from all kinds of knowledge representations.
- When creating a resource for your domain-specific senses, there is no need to take care of out-of-domain senses. Just focus on the things you are interested in!
- Pre-training and domain adaptation can be more easily exploited, as the requirements within your domain are minimal.
A real-world Use Case!
Of course, the sole definition of the task will not solve any problems. We need trained models applied to real-world use-cases! In this regard, let’s take a look at the recently published WiC-TSV benchmark, which provides us with a concrete setup.
The Data
This benchmark is suitable for demonstrations in two ways. First, it provides a training and a test set, on which the generalisation capabilities in a concrete use case can be estimated. Second, while the training set only consists of general domain instances, the test set also contains domain-specific instances from three different domains (computer science, cocktails, and the medical domain). This way, the domain adaptation capabilities can be tested.
The examples of this benchmark follow the TSV task description: a context with a marked target word provides information about the intended sense, while the target sense is represented by its definition and hypernyms. Each example is labelled either True or False.
Below, you can see some example instances from the dataset, colour-coded to show whether they are positive (green) or negative (red) examples. (By the way, in the original dataset, the domain information is not provided.)
The Model
Now that we know what the task looks like, what could a model that solves it look like?
As many studies have shown that transformer-based BERT models are very capable of interpreting contextualised meaning, we will build our example disambiguation system with BERT-large.
To build this model, we concatenate all the textual input that we have (word in context, definition, and hypernyms) and feed it into our BERT model. From the output embeddings, we select certain features (specifically, the [CLS] token representation, the representation of the target word, and the average representation of the tokens in the definition and hypernyms), which form the input to a simple binary classification layer, predicting the final label.
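The following sketch shows one way such a model could be wired up with the Hugging Face transformers library. The input formatting and feature pooling are assumptions that loosely follow the description above, not the exact reference implementation.

```python
# Sketch of a BERT-based TSV classifier: concatenate context, definition and
# hypernyms, run them through BERT, and classify on selected output features.
# Illustrative assumption of the architecture described above.
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast


def masked_average(hidden_states, mask):
    """Average the hidden states at the positions marked by the binary mask."""
    mask = mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


class TSVClassifier(nn.Module):
    def __init__(self, model_name="bert-large-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # features: [CLS] + target-word tokens + definition/hypernym tokens
        self.classifier = nn.Linear(3 * hidden, 1)

    def forward(self, input_ids, attention_mask, target_mask, sense_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = out.last_hidden_state                     # (batch, seq, dim)
        cls_vec = hidden_states[:, 0]                             # [CLS] token
        target_vec = masked_average(hidden_states, target_mask)   # target word
        sense_vec = masked_average(hidden_states, sense_mask)     # definition + hypernyms
        features = torch.cat([cls_vec, target_vec, sense_vec], dim=-1)
        return self.classifier(features).squeeze(-1)              # logit for True/False


tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
# The context and the sense indicators can be fed in as a sentence pair, e.g.:
#   tokenizer("I use the mouse daily.", "small rodent; animal, rodent",
#             return_tensors="pt")
# target_mask and sense_mask are binary tensors marking the token positions of
# the target word and of the definition/hypernyms, respectively.
```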
The Performance
Finally, we can take a look at how well our model is able to generalise, and also, how well it can adapt to new domains!
To get a better feeling for what is possible, we will compare the accuracy of the model to the human performance on the same task, which serves as an approximate upper bound. In the figure below, the left-hand side shows the human performance, and the right-hand side shows the performance of our model (with an overlay of the human performance). In grey, the overall performance is shown, while the coloured bars represent the accuracy on the general domain instances (green) as well as on the domain-specific instances (blue=cocktails, yellow=medical, red=computer science).
For the general domain instances (green), the performance of the BERT-L model is about 5 percentage points worse than the human performance, which is not bad, though there is of course room for improvement.
When we want to estimate the abilities to adapt to new domains, we have to evaluate the performance on the domain-specific instances. We can see that, at least for humans, the domain-specific instances are easier to solve than the general domain ones. For our prediction model, the gap to human performance is quite big on these instances. However, when comparing the model's accuracy on general domain instances (the domain our model was trained on) to its accuracy on domain-specific instances (to which the model had to adapt), the performance seems quite stable. This is a good sign, as it indicates that this model has the potential to be trained on a general domain dataset and then used in a domain-specific or enterprise setting!
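For reference, a per-domain accuracy breakdown like the one discussed above can be computed with a few lines of code, assuming gold labels, predictions, and domain tags are available as parallel lists; the values below are made up purely for illustration.

```python
# Sketch: compute overall and per-domain accuracy from parallel lists of
# gold labels, predictions, and domain tags (all values are illustrative).
from collections import defaultdict

def accuracy_by_domain(gold, predicted, domains):
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p, d in zip(gold, predicted, domains):
        total[d] += 1
        correct[d] += int(g == p)
    overall = sum(correct.values()) / sum(total.values())
    per_domain = {d: correct[d] / total[d] for d in total}
    return overall, per_domain

overall, per_domain = accuracy_by_domain(
    gold=[True, False, True, True],
    predicted=[True, False, False, True],
    domains=["general", "general", "medical", "cocktails"],
)
print(overall, per_domain)
```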
Of course, the presented model just serves as an example implementation of a disambiguation model that can efficiently tackle the TSV task. Hyperparameter optimisation and an improved model architecture can lead to even higher performance. The currently best performing models can be found on the codalab competition page as well as on paperswithcode.
Closing and Further reading
If you want to know more about Target Sense Verification and the WiC-TSV benchmark, I recommend reading the original paper of the benchmark. If you want to get some inspiration on more performant TSV-model architectures, I can suggest roaming around in the proceedings of the SemDeep2020 workshop, especially the papers from Moreno et al, and Vandenbussche et al. If you want to try out a TSV-model for your own use-case you can find an implementation on Github. And finally, if you feel like you have an idea for a way better model, you can try to establish a new state-of-the-art model on the WiC-TSV benchmark at codalab.
TL;DR: We have seen how the traditional disambiguation task formulations are not suitable for many domain-specific and enterprise settings, as they restrict the flexibility of disambiguation systems and assume the availability of all senses. Target Sense Verification (TSV) is a sense-inventory independent reformulation of this task, where only one sense needs to be known. We have seen an example implementation of a TSV-model and tested it against the WiC-TSV benchmark.
Remark: all unreferenced graphics are created by the author.