ML is not Relevant to NLU: Replies to some ‘worthy’ Objections

Walid Saba, PhD
Published in ONTOLOGIK
Aug 19, 2021

Right after my article “Machine Learning Won’t Solve Natural Language Understanding” appeared in The Gradient [1], I started receiving private messages and have since seen ongoing discussions online about the article. Many of those messages and comments were positive, but there were also many that objected to my argument; I didn’t know the ML/DL runaway train was so wild!

Although my argument was not an appeal to intuition but a scientific one, with one of the arguments even being an outright proof, there were still many who rejected the thesis that machine learning (ML) cannot solve (and, in my opinion, is not even relevant to) natural language understanding (NLU). There were many objections that, as one might expect, are not worthy of a reply. But there were also many that are worthy of one, so here I summarize those into buckets and try to answer all of them.

The first worthy objection concerns my remark that language is not just data. The objection was, essentially, that language is data, since everything that is an input to the NLU system is, of course, data. But that objection misses the highlighted keyword ‘just’: yes, language is data, but not just data.

Perhaps a better phrase than “language is not just data” would’ve been “a linguistic corpus does not contain all the data needed for that corpus to be properly understood”. That is, to understand linguistic communication we need access to the missing and implicitly assumed information that we tend to leave out of our everyday spoken language. This point was elaborated in detail when I discussed the Missing Text Phenomenon (MTP). So while every input to the system is data, in linguistic communication the textual data is not enough: it does not contain the common knowledge needed to fill in all the missing information. You cannot find something that is not there (in the data), and the common knowledge needed to understand linguistic communication is never uttered — so it is not in the data; it is in our heads. Until the machine has access to that body of common knowledge we cannot have NLU.
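
To make the point concrete, here is a minimal sketch (the utterance and the tiny “knowledge base” below are hypothetical, chosen only for illustration): the information needed to complete an utterance is not in the utterance itself, and therefore not in any corpus of such utterances.

```python
# The utterance itself -- all the 'data' a text-only system ever sees.
utterance = "John asked the waiter for the check"

# None of the following is ever uttered, so none of it appears in any corpus
# of such utterances -- yet all of it is needed to understand what was said.
assumed_common_knowledge = [
    "John is in a restaurant",
    "John has finished (or is about to finish) a meal",
    "John intends to pay and leave",
]

# A text-only learner can model the string; the completed interpretation also
# needs the items above, which must come from outside the text.
interpretation = [utterance] + assumed_common_knowledge
print(interpretation)
```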

Now, the nature of that common knowledge needed to understand language is another question, one that I leave for another place and another time; but it is certainly not the gigantic body of knowledge that would be needed for general AI. To appreciate this point, think of a 4-year-old who can converse with adults indefinitely, although the child does not possess any specific or domain knowledge to speak of.

Here I will not answer simplistic comments such as the observation that “John saw the man with a telescope” is ambiguous. Of course that sentence can be read as saying that John had the telescope with which he saw the man, or that the man John saw had the telescope. Without a full context it is a genuinely ambiguous sentence, and in such situations we usually seek clarification. There is nothing new here. Even a simpler statement like “John bought that online” needs clarification, since “that” must be resolved in some context. These arguments are trivial and are not worthy of a reply. The more serious objection is that even when the context is fully defined, language can still be ambiguous; that is, a linguistic communication can still be interpreted differently by different people (or differently by the same person in different contexts). This is very true, but let me explain the subtle point behind my claim that “understanding” what a linguistic communication means is, in the end, a binary decision. The explanation requires quite a bit of careful pondering.

Let us say someone uttered U, which could be a question, a directive (a command) or a declarative sentence. Usually, U is uttered either to solicit an answer to some question, or to have something done, or to convey a thought. For the sake of simplicity let us assume that there are only two listeners, Alice and Carlos. The picture we now have can be depicted as shown in figure 1 below.

So someone makes an utterance, which could be a question, a command, or just a thought being conveyed. Accordingly, this translates into either asking Alice or Carlos some question, instructing Alice or Carlos to fulfil some command (e.g., “open the door, please” or “please pass the salt”), or conveying to Alice or Carlos some thought the speaker has in mind. In turn, this means, respectively, that Alice or Carlos will fetch some answer (if any, and from ‘somewhere’) that will later be conveyed back to the speaker; or, if the utterance was a command, that Alice or Carlos will try to fulfil the command as they understood it; or, if the utterance was a thought being conveyed by the speaker, that Alice or Carlos will add the new information to the “memory” of that conversation.

What matters here is this:

even if Alice and Carlos understood the utterance differently, in the end, they must decide on a single interpretation so that they can fetch some answer or act on some command or add the information implied by the thought behind the utterance.

What all of this means is this: while it seems that language is ambiguous, in the end, we must decide on some interpretation, and that decision is binary. We do not interpret a command as many commands, nor do we interpret a question as many questions and fetch many answers, etc. We decide, in the end, on one interpretation, even if it may turn out later that our understanding was wrong! So different people might arrive at different interpretations. Sure. Even the same person might, at different times, arrive at different interpretations of the same utterance. Also true. But, in the end, a decision is made on one final interpretation, not a probably approximately correct (PAC) interpretation. Thus, while linguistic communication might ‘seem’ ambiguous, understanding is not. Ergo, PAC learnability is inconsistent with NLU.
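
As a rough sketch of this point (the candidate readings and their scores below are made up), even when several interpretations are entertained, the listener must, in the end, commit to exactly one of them in order to act:

```python
# Hypothetical candidate readings of "please pass the salt", with made-up scores.
candidates = {
    "hand over the salt shaker": 0.61,
    "hand over the salt packet": 0.27,
    "hand over the pepper mill": 0.12,
}

# Whatever uncertainty there was, the listener commits to a single reading...
chosen = max(candidates, key=candidates.get)

# ...and acts on that one reading (possibly wrongly, as noted above). One
# command is carried out -- not 0.61 of one and 0.27 of another.
print(chosen)   # -> 'hand over the salt shaker'
```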

One of the most repeated arguments I keep hearing is the one that dismisses the importance of the fact that language is an infinite object (or, equivalently, that the number of thoughts we can have and attempt to express in language is infinite). I find this misunderstanding puzzling, especially when it comes from scientists and engineers.

In my argument against a machine learning system ever being able to learn the infinite object we call natural language, the “infinity” is not about the number of utterances one makes (or hears/reads) in a lifetime. It goes without saying that we have a finite life and that we hear or utter a finite number of statements in our lifetime (I think I read somewhere that we hear/utter around 20,000,000 sentences in a lifetime). But that is not where “infinity” comes in.

Here’s where infinity comes in: how many unique Python programs is a Python compiler ready to interpret? I owe an ice cream to everyone who answered “a Python compiler is ready to interpret and execute an infinite number of Python programs”. The point? Given a valid Python program, a Python compiler will never come back saying “sorry, but I cannot interpret this specific program”. Anyone with basic training in logic, formal languages, and automata theory knows that recursion is one powerful way to represent an infinite object in a finite specification. Like formal languages (such as Python, Java, etc.), natural language, which is much richer still, is infinite. The infinite is important here not because we mortals will ever utter or hear an infinite number of sentences, but because the natural language compiler in our heads must be ready for an infinite number of sentences and cannot say “sorry, this specific sentence is not in my hash table”. We are ready for “any”, and “any” is where the infinite is!!!
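
To see how a finite specification can be “ready” for infinitely many inputs, here is a minimal sketch (the two-rule toy grammar is made up purely for illustration):

```python
# S -> "it" "rains"
# S -> "john" "thinks" "that" S
# Two rules, finitely specified, yet they generate infinitely many sentences.

def is_sentence(words):
    if words == ["it", "rains"]:
        return True
    if words[:3] == ["john", "thinks", "that"]:
        return is_sentence(words[3:])
    return False

print(is_sentence("it rains".split()))                                    # True
print(is_sentence("john thinks that it rains".split()))                   # True
print(is_sentence("john thinks that john thinks that it rains".split()))  # True
# ...and so on without bound: the recognizer is ready for any sentence the
# rules generate, yet it stores (memorizes) none of them.
```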

I hope the argument of “infinity” is now clear. If it is, then, again, ML is not relevant to NLU because no matter how much data an ML system digests (a billion, a trillion, or, like my daughter used to say, a gazillion examples), that amount divided by infinity is ZERO. This is basic math, basic logic, and even basic common sense. If you want a slogan, here it is: sorry, but you cannot memorize an infinite object.
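
In symbols, the “basic math” is simply that any finite sample, however large, is a vanishing fraction of an infinite object:

```latex
\frac{N}{\infty} \;=\; \lim_{n \to \infty} \frac{N}{n} \;=\; 0
\qquad \text{for any finite } N \text{ (a billion, a trillion, a gazillion).}
```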

Another puzzling objection was made against my argument that many linguistic utterances cannot be resolved (or even approximated) by statistical significance. Two points were made here that show a complete misunderstanding (or a lack of understanding) of the issue at hand. The argument was made against the famous example used in discussions of the Winograd Schema challenge:

(1a) The trophy did not fit in the suitcase because it was too small.
(1b) The trophy did not fit in the suitcase because it was too big.

The point here is that (1a) and (1b) are statistically equivalent, since ‘small’ and ‘big’ occur in the same contexts with equal probabilities (as do all antonyms/opposites). Thus, while what ‘it’ refers to in (1) is quite obvious, even to a 4-year-old, in data-driven statistical approaches there is no difference between them. I then went on to entertain the ‘machine learning’ solution that could possibly be offered, but when you work out the numbers, it turns out that for a data-driven/ML system to establish statistical significance (that is, to “learn” how to resolve references such as those in (1)), it would need to see about 40,000,000 examples, and that is just for structures like the one in (1). Why? Because ‘trophy’ could be replaced by ‘ball’ or ‘laptop’ or ‘camera’, etc. Similarly, ‘suitcase’ could be replaced by ‘briefcase’ or ‘bag’, etc. Also, ‘did not’ could be replaced by ‘did’, and ‘because’ by ‘although’, and so on. If you work out the combinatorics, you will easily conclude that a person would have to live 2 or 3 lifetimes before they “learn” how to resolve references such as those in (1).
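
Here is a back-of-the-envelope sketch of that combinatorics. The individual counts below are assumptions of mine, chosen only to illustrate how the substitutions multiply; they are not figures from the original article.

```python
# Assumed (illustrative) counts of the substitutions mentioned above.
object_nouns    = 1000   # 'trophy', 'ball', 'laptop', 'camera', ...
container_nouns = 1000   # 'suitcase', 'briefcase', 'bag', ...
polarities      = 2      # 'did not fit' vs. 'did fit'
connectives     = 2      # 'because' vs. 'although'
adjectives      = 10     # 'big', 'small', 'heavy', 'light', ...

variants = object_nouns * container_nouns * polarities * connectives * adjectives
print(f"{variants:,}")   # -> 40,000,000 surface variants of this single pattern
# A data-driven learner would need to see (enough of) these to establish
# statistical significance -- and that is for just this one structure.
```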

One argument made was that ML systems can easily establish the relatedness between ‘suitcase’, ‘briefcase’, ‘bag’, etc. This, too, is well known: simple word embeddings might indeed assign a high similarity to these words. Sure. But that is the catch. Vectorized words come with a similarity measure, but they cannot handle directed relations: similarity is symmetric (undirected), whereas the “IsA” and “PartOf” relationships are directed (transitive, but not symmetric). The similarity between ‘bag’ and ‘suitcase’ is perhaps as high as the similarity between ‘bag’ and ‘shop’, but you cannot decide from it which of IsA(‘bag’, ‘container’) or IsA(‘container’, ‘bag’) is true. Vector similarities are helpless in the resolution of references such as those in (1). To prune the huge multi-dimensional space, you need structural (and directed) relationships that vectors/tensors do not capture. The gestalt similarity measure is useful in pattern recognition, but it cannot comb your hair.
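
A minimal sketch of the symmetry point (the toy vectors below are made up; real embeddings would come from a trained model, but the argument does not depend on the particular numbers):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity is symmetric by construction: cosine(u, v) == cosine(v, u).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

bag       = np.array([0.8, 0.3, 0.1])   # toy 'embedding' of 'bag'
container = np.array([0.7, 0.4, 0.2])   # toy 'embedding' of 'container'

print(cosine(bag, container))   # high similarity (~0.98 for these toy vectors)
print(cosine(container, bag))   # exactly the same number

# The single symmetric score cannot tell us which of IsA(bag, container) or
# IsA(container, bag) holds: 'IsA' is directed (asymmetric, transitive), while
# the similarity measure carries no direction at all.
```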

I was also puzzled by the fact that very few even tried to answer the argument from intension and the fact that data-driven approaches cannot model intensions. Without symbolic structures one cannot represent intensions, and so neural networks (NNs) cannot model intension, which is rampant in natural language. Lacking any symbolic structure, NNs’ only representations are tensors. But it has been argued, back in 1988, that the linear composition of tensors (using ‘*’, ‘+’, etc.) is not reversible; in fact, the decomposition of a tensor into its original components is undecidable.
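
A minimal illustration of the reversibility problem, using plain vector addition as the composition operator (the vectors are made up; the richer tensor encodings in that 1988 debate face the same issue):

```python
import numpy as np

# Two different pairs of 'component' vectors...
a1, b1 = np.array([2.0, 1.0]), np.array([1.0, 3.0])
a2, b2 = np.array([0.5, 2.5]), np.array([2.5, 1.5])

print(a1 + b1)   # [3. 4.]
print(a2 + b2)   # [3. 4.]  -- the very same composed vector

# Given only the composed vector [3., 4.], there is no way to recover which
# components produced it: infinitely many decompositions are consistent with
# it, so the symbolic structure that went into the composition is lost.
```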

Beyond ignoring the technical problem of intension, I was also really puzzled by the fact that several people tried to deny a mathematical proof that establishes the equivalence between learnability and compressibility. I guess the reason for this denial is that this is the strongest technical (theoretical) argument showing that ML is not even relevant to NLU. Just for the record, here is the argument again: the equivalence between learnability and compressibility has been mathematically established. Language understanding, on the other hand, is about decompression: it is about uncovering all the missing information that we usually leave out and safely assume is available to the listener, since it is part of our “common” knowledge. Simply put, ML and NLU are trying to accomplish contradictory tasks.
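
One way to put the tension in symbols (this is a sketch of the argument as stated above, not a formal statement of either result): learning seeks a hypothesis that is a compressed summary of the data, while understanding must produce an interpretation that contains strictly more than the utterance it starts from.

```latex
\underbrace{|h| \;\ll\; |D|}_{\text{learning compresses the data } D}
\qquad \text{vs.} \qquad
\underbrace{\mathrm{interpretation}(U) \;=\; U \cup M,\ \; M \not\subseteq U}_{\text{understanding decompresses the utterance } U}
```

Here $M$ stands for the missing, implicitly assumed common knowledge discussed under the Missing Text Phenomenon above.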

Sorry, DL’ers, but in arguing that ML is not even relevant to NLU, I only stated scientific facts.

References

[1] Walid Saba, “Machine Learning Won’t Solve Natural Language Understanding”, The Gradient, 2021.
