NLU with disambiguation

Word Sense Disambiguation (WSD) is widely known as an unsolved problem in Natural Language Processing (NLP), and yet it is a problem we have a solution for in our Natural Language Understanding (NLU) system.
While many define the problem as ‘computational’, today’s computational solutions tend to be statistical guesses, which are often wrong in context.
By avoiding statistics, a true solution that uses context becomes possible. Our approach is to treat ambiguity as another pattern to be matched and to resolve it at the correct level (syntax, meaning or context), as shown in the Role and Reference Grammar (RRG) model.
As usual, you can follow our companion video on YouTube at: https://youtu.be/zNt2rYRHMz4.
In NLU, WSD is the selection of valid dictionary definitions in context. While words are ambiguous out of context, in context they aren’t. In human discourse, ambiguity is clarified by the participants through questioning. Equally, speakers avoid ambiguity in practice.

With NLU, ambiguity can often be resolved by providing a single word that completes a question sentence, leveraging the concept of narrow focus from discourse pragmatics: the answer fits into the narrow-focus slot. In the example above, where the possible answers are in red, the word ‘flying’ confirms that the intended phrase means ‘the flying of planes’, not ‘planes that fly’.
Why? Because ‘the flying of planes’ means a kind of ‘flying’, not a kind of plane.
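To make the mechanism concrete, here is a toy sketch (in Python, and emphatically not the engine described in this series) of how a one-word answer in the narrow-focus slot selects a reading; the glosses and the fallback message are assumptions for illustration:

```python
# Toy sketch only: a one-word answer in the narrow-focus slot selects a
# reading of the ambiguous phrase 'flying planes'. The glosses and the
# fallback message are assumptions for illustration, not the engine.
READINGS = {
    "flying": "the flying of planes (a kind of flying, i.e. an activity)",
    "planes": "planes that fly (a kind of plane, i.e. an object)",
}

def resolve(narrow_focus_answer: str) -> str:
    """Return the reading selected by the answer word, if any."""
    return READINGS.get(narrow_focus_answer.lower(),
                        "unresolved: keep both readings until context decides")

print(resolve("flying"))  # confirms the activity reading
print(resolve("planes"))  # confirms the object reading
```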
WSD requires context for its complete resolution. Any time resolution is attempted out of context, the result is a failure to resolve meaning, as seen in today’s mainstream engineering.
While sentences can usually be interpreted in many ways, in context they can be resolved unambiguously. Current techniques tend to look to statistics to resolve meaning based on words used in another context. Real NLU is impossible with that model: not only do unknown words have no solution, but context relates to the current topic, not one from another conversation!
Using “context” from another conversation is the definition of out-of-context!
Computer scientists tend to confuse the issues of NLU by reusing common words in their problems. ‘Context’ in computer science has embraced the distributional hypothesis, while in linguistics it means what people ordinarily expect it to mean. In the diagram below, real context relates to who, what, where, when, how and why something is communicated.

This model is required to understand like a person. Even a sentence from a movie, or a line from a song, retains the influence of the source context. Remember “I love you”, answered with “I know!” (Han Solo). Or “…Just shut up. You had me at hello.” (Jerry Maguire)
Will data science and its algorithms refocus on meaning in the future?
The proposed solutions behind most of today’s deep learning/machine learning implementations are statistical, manipulating meaningless words (strings of characters). Tools like word2vec or GloVe use statistics from data sources like annotated corpora or plain text files to form the core of formal NLP systems. This isn’t real context as we know it.
The arbitrary symbol that makes up a word has no meaning by itself, of course, so the ‘reading’ of documents by machine learning cannot acquire the meaning either. That technology has gathered that kings are like queens, but it has also introduced the bias that doctors are men and nurses are women. Experience introduces opinion, while meaning is independent of context.
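The king/queen association and the doctor/nurse bias are easy to reproduce with off-the-shelf vectors. Here is a minimal sketch using gensim’s downloader; the model name is one publicly available option, and the exact neighbours vary by model:

```python
# Minimal sketch: distributional vectors capture co-occurrence statistics,
# not grounded meaning. Requires: pip install gensim (downloads the model).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pretrained GloVe vectors

# The celebrated analogy: king - man + woman ~ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The same arithmetic reproduces corpus bias rather than definition:
# doctor - man + woman tends to surface 'nurse' in many models.
print(vectors.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))

# Nothing here encodes that a doctor is, by definition, a person of either
# sex; the vectors only reflect the text they were trained on.
```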
A man is male and a woman is female. Those definitions are independent of context. A king is male and a queen is female (in the sense I mean, not chess etc.). But a doctor is a person and so is a nurse. In the end, generalization is needed based on meaning, not based on probability from experience.
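By contrast, here is a minimal sketch of definition-based generalization; the tiny lexicon is invented for illustration:

```python
# Tiny sketch: generalization from definitions, not corpus frequency.
# The lexicon entries are invented for illustration.
LEXICON = {
    "king":   {"isa": "head of state", "sex": "male"},
    "queen":  {"isa": "head of state", "sex": "female"},
    "doctor": {"isa": "person", "sex": None},  # sex is not part of the meaning
    "nurse":  {"isa": "person", "sex": None},
}

def sex_of(word):
    """Answer from the definition; None means the definition is silent."""
    return LEXICON[word]["sex"]

print(sex_of("queen"))   # female, by definition
print(sex_of("doctor"))  # None: the context-independent meaning carries no sex
```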
Machine learning systems that rely on a corpus as their context are fundamentally not working with context as people understand it. When you think about it, the effort to create a dictionary for a language is vastly less work than that needed to create annotated corpora for each and every conceivable real context.
How much data is needed?
But don’t all the big IT companies tell us that they just need ‘a little more data’? Let’s look at what they are asking for to see how that approach is astronomically expensive. It requires a vast volume of data, but worse, it still lacks the meaning that humans use to generalize in language learning.
In my experience completing the Facebook AI Research “bAbI tasks”, adding a few words of vocabulary and dealing with new considerations for context provides a solution that is easier to scale than one that constantly needs new data sources constructed. Syntax alone, without semantic involvement, is quite simple. And as meaningless words don’t directly reveal their meaning, no matter how many sample sentences are provided, the need and cost for human annotations will not go away.
A king is a male head of state. A queen is a female head of state. Those two sentences provide very accurate definitions. Without your knowledge of English, imagine how many sentences containing those words would be needed. How many are needed for ‘head of state’? If John is a king, and the king is the head of his class, does that mean he is a head of state? If he has a big head, is that why he is king? And don’t all kings develop a big head?
Now let’s look at what data is needed to cover a simplified, single predicate in a sentence.
In the diagram below, you can see that 1.15 x 10^14 sentences are needed to cover the permutations just once for the semantic expression of “mammals kiss mammals”. Just once. And the order of magnitude for real English would be much larger once the full range of phrases for each actor/undergoer is added, with the myriad of modifiers allowed (pace, like ‘slowly’; manner, like ‘sloppily’; evidentials, like ‘apparently’).

And while a corpus of that volume would provide the samples that could be used for future sentence recognition with machine learning, spoken human language doesn’t have the luxury of proofreading and correction. What happens when a slip of the tongue is introduced? Humans easily understand speech errors, and so should our machines.
Given the sentence “The man missed, no kissed, the donkey”, the number of permutations to say this is estimated at 8.6 x 10^21! Again, that is a conservative figure for providing one sample of each, and again, even with one sample, meaning is impossible, as a word’s meaning doesn’t come from the meaningless words, but from associated semantics.
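The arithmetic behind estimates like these is easy to reproduce. Here is a back-of-envelope sketch; the vocabulary sizes are hypothetical placeholders chosen to show the shape of the calculation, not the figures behind the diagram’s estimates:

```python
# Back-of-envelope: how fast sentence permutations explode. The vocabulary
# sizes below are hypothetical placeholders, not the figures behind
# 1.15 x 10^14 or 8.6 x 10^21.
mammal_nouns   = 5_000   # candidate actors
kiss_verbforms = 10      # kiss, kisses, kissed, was kissing, ...
mammal_objects = 5_000   # candidate undergoers
modifiers      = 50      # pace ('slowly'), manner ('sloppily'), ...

# One sample sentence per combination, with the modifier slot optional:
permutations = mammal_nouns * kiss_verbforms * mammal_objects * (modifiers + 1)
print(f"{permutations:.2e} sentences for one sample each")  # about 1.3e+10

# Every extra independent slot (speech errors, embedded phrases, discourse
# variation) multiplies the total again, which is how real-English estimates
# climb past 10^14 and beyond.
```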
The meaning of a word in a sentence is determined by the meanings of the other words in the sentence. We’ve seen previously how words can combine as names into strings of words, but even then, the meaning of the string, not just its constituent words, determines its application to a sentence.

Are the IT companies right? Do they just need ‘a little more’ data?
Well, no. There are many other variations in word order to address. In the next diagram, you can see some of the repackaging of phrases to convey the same message, using the Qualia property of human language, where the meaning of a word comes from the associations of one of the other arguments in the sentence.
For such sentences, it is unclear what solution is possible, even in theory, for machine learning systems. The meaning comes from words left unspecified, because a native speaker already knows what referents are for and how they are created.
Qualia — the killer observation supporting meaning
Professor James Pustejovsky wrote about the use of Qualia in The Generative Lexicon[i], a theory he had been developing since 1988.

You can see that the elements relate the meaning of the referent to other referents and predicates. The constitutive role broadly[ii] uses the relation (predicate) have’ (as in John has a body: have’(John, body)). The formal connects the referent with predicates (like attributives, previously known as adjectives, and location, which was split between the syntactic prepositional phrase and adverb categories). The telic connects to a predicate’s actor, undergoer or other semantic role. And the agentive connects to the achievement predicate exist’.
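As an illustration only, the four roles can be written down as a simple record; the slot names follow Pustejovsky, while the sample entry for ‘novel’ and its predicates are my own assumptions:

```python
# Illustrative only: the four qualia roles written as a simple record.
# The slot names follow Pustejovsky; the sample entry is an assumption.
from dataclasses import dataclass

@dataclass
class Qualia:
    constitutive: list  # parts/material, broadly via have'(x, part)
    formal: list        # classifying predicates (attributives, location)
    telic: list         # the purpose: the predicate role the referent serves
    agentive: list      # how it comes to exist (the achievement exist')

# A hypothetical entry for 'novel' in the spirit of the theory:
novel = Qualia(
    constitutive=["have'(novel, pages)", "have'(novel, narrative)"],
    formal=["book", "physical-object", "information"],
    telic=["read'(reader, novel)"],       # what novels are for
    agentive=["write'(author, novel)"],   # how novels come about
)
print(novel.telic)
```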
The question for data scientists is what process can create these associations. Without knowing these general, and possibly inherited, properties of qualia, NLU by machine learning is fundamentally limited.
Worse for data scientists working on NLU is that these concepts also apply to a wide range of related phrases. In Patom theory’s NLU engine, these different elements are consolidated into sets prior to applying the RRG linking algorithm, so the different forms are resolved with a common method. Data science would need to treat each form differently (or lose the accurate meaning conveyed). Examples of the range of forms are shown below.

Further Reading on WSD
I like the explanation of WSD in the 1999 Ide and Véronis paper[iii] because it isn’t skewed towards machine learning as the solution to the degree that more modern analyses are. The 2009 survey of WSD by Navigli[iv] (ten years later) takes the perspective that WSD is a computational task to solve an AI-complete problem. In the companion video, we look at the subsequent machine-learning-only model from 2013 by Nasiruddin[v].

In the extract here, WSD has been converted to consider only computational models, without any possibility of learning a new word in real time (as machine learning loads and processes words in a separate training phase in order to incorporate corpora).
We see that “The boy leapt from the bank into the cold water” leads to the conclusion that “the edge of the river is intended” (for bank). Context is more sophisticated than that, as can be seen in the slide below.

Ambiguity must be addressed in context, not assumed. The solution is always context to resolve meaning, not word proximity. Here’s another trivial story to reinforce my point: “When John was a boy, he went to the bank to deposit money. A fire broke out on the first floor, trapping him above the icy river below the open window. Thinking quickly, ‘The boy leapt from the bank into the cold water’.”
Clearly my intended meaning of bank relates to the building, not the bank of the river.
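Here is a toy sketch of that difference; the referent store and the proximity fallback are invented for illustration and are certainly not the full model:

```python
# Toy sketch: a definite 'the bank' binds to the referent the discourse has
# already introduced, regardless of nearby words like 'river' or 'water'.
# The referent store and the fallback heuristic are invented for illustration.

def resolve_bank(discourse_referents, sentence):
    if "bank" in discourse_referents:  # context wins over proximity
        return discourse_referents["bank"]
    # Only without context would we fall back to word-proximity guessing:
    return "riverbank" if "water" in sentence else "financial institution"

story = {"bank": "the building John entered (financial institution)"}
sentence = "The boy leapt from the bank into the cold water"

print(resolve_bank(story, sentence))  # the building, despite 'water' nearby
print(resolve_bank({}, sentence))     # out of context: proximity guesses riverbank
```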
As our target is to understand language, we can’t compromise and resolve meaning out of context. Meaning must remain ambiguous when we don’t have a better answer, and be resolved when an answer is needed.
The modern practice of forcing computations onto NLP problems is like treating every problem as a nail because the only tool available is a hammer.
The last thing we will look at regarding WSD is that word relations should be determined along with the word senses. Who cares what the meaning of a word is if we don’t know which phrase it belongs to?
In ‘Beth picks the milk the milkman dropped there up’, there is no relationship between Beth and the milkman across the two phrases, meaning ‘the milkman dropped the milk there’ and ‘Beth picks the milk up’. Any statistics relating those words would seem odd, other than the context link that it is the same milk.
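A sketch of the decomposition makes the point; the predicate names and the shared-referent representation are assumptions for illustration, not the engine’s internal format:

```python
# Illustrative decomposition of
#   'Beth picks the milk the milkman dropped there up'
# into two predications that share only the referent 'milk'.
# Predicate names and structure are assumptions for this sketch.

milk = {"id": "milk#1", "type": "milk"}  # one shared discourse referent

predications = [
    {"predicate": "drop'", "actor": "milkman", "undergoer": milk, "location": "there"},
    {"predicate": "pick-up'", "actor": "Beth", "undergoer": milk},
]

# Beth and the milkman never appear in the same predication; the only
# link between the clauses is the shared undergoer referent.
shared = predications[0]["undergoer"] is predications[1]["undergoer"]
print(f"shared referent: {shared}")  # True
```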
Most recently, the idea that statistics, or its cousin, artificial neural networks, is the only answer for NLU seems to be embraced by the mainstream. Worse, NLU is sometimes being simplified to mean ‘get the right answer’ when clearly no understanding has taken place in a human, NLU-like way.
NLU recognizes the meaning of words in a sentence based on the meanings of the other words in a sentence. ‘Blah blah blah blah weather blah Alexa’ doesn’t always mean “What is the weather here today, Alexa?”
Aiming at the real target
We get back to the question of focusing on the problem we want to solve, not an intermediate problem that we can solve but that doesn’t help us get to the target.
The goal of NLU is to recognize the semantic representations for the language. Where there are embedded clauses and pro-forms, those should be resolved as well.
Next time, we will look at how the meaning of a sentence is found, using the approach discussed today.
[i] James Pustejovsky, The Generative Lexicon, The MIT Press, 1995, p. 76.
[ii] The constitutive role includes its material and weight which are better modelled with a different predicate to have’.
[iii] Nancy Ide & Jean Véronis, Word Sense Disambiguation: The State of the Art, 1999.
[iv] Roberto Navigli, Word Sense Disambiguation: A Survey, ACM Computing Surveys, Feb. 2009.
[v] Mohammad Nasiruddin, https://arxiv.org/abs/1310.1425, 2013.
