Memorizing vs. Understanding (read: Data vs. Knowledge)

Walid Saba, PhD
Published in ONTOLOGIK · 7 min read · Jun 7, 2020


So how can I get the result of the arithmetic expression, e?

e = 3 * (5 + 2)

Well, there are two ways: (i) if I’m lucky, and lazy (think: efficiency), I could have the value of e stored (as data) in some hashtable (a data dictionary), where I can use a key to pick up the value of e anytime I need it (Figure 1);

Figure 1. A data dictionary with arithmetic expressions as keys and their values as data.
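
A minimal sketch of this first method, using a plain Python dict as the data dictionary (the expressions and values below are just illustrative):

# A 'data dictionary': expressions memorized as keys, their values stored as data.
# Looking up a value requires no knowledge of how addition or multiplication work.
memorized = {
    "3 * (5 + 2)": 21,
    "2 * (4 + 1)": 10,
    "7 * (3 + 3)": 42,
}

print(memorized["3 * (5 + 2)"])       # 21 -- fast, but only for expressions already stored
print(memorized.get("6 * (1 + 2)"))   # None -- an unseen expression has no entry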

(ii) if I do not have that option, then the only alternative for getting the value of e is to actually compute the arithmetic expression and obtain the corresponding value. The first method, let’s call it the data/memorization method, does not require us to know how to compute e. In using the second method, however, we (or the computer!) must know the details of the procedures of addition and multiplication, shown in Figure 2 below (where Succ is the ‘successor’ function that returns the next natural number).

Figure 2. Theoretical definition of the procedures/functions of addition and multiplication

That is, if the value of e is not memorized (and stored in some data storage), then the only way to get the value of e is to know that adding m to n is essentially adding n 1’s to m, and that multiplying m by n is adding m to itself n times (and thus ‘multiplication’ can be defined only after the more primitive function ‘addition’ is defined).
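
In code, a sketch of this ‘knowing how’ method might look as follows (a Python rendering of the Succ-style definitions of Figure 2; the function names are mine):

def succ(n):
    # Successor: the next natural number.
    return n + 1

def add(m, n):
    # m + n: add n 1's to m, one succ at a time.
    return m if n == 0 else succ(add(m, n - 1))

def mult(m, n):
    # m * n: add m to itself n times (defined only in terms of the more primitive add).
    return 0 if n == 0 else add(mult(m, n - 1), m)

# Once the procedures are known, any expression can be computed, not just memorized ones.
e = mult(3, add(5, 2))   # 3 * (5 + 2)
print(e)                 # 21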

Crucially, then, the first method is limited to the data I have seen and memorized (i.e., stored in memory), while the second method does not have this limitation — in fact, once I know the procedures of addition and multiplication (and other operations), I’m ready for an infinite number of expressions. So we could, at this early juncture, describe the first method as “knowing what (the value is)” and the second method as “knowing how (to compute the value)” — the first is fast (not to mention easy) but limited to the data I have seen and memorized (stored), while the second is not limited to the data I have seen, but requires detailed knowledge (knowing how) of the procedures. The first, if you like, is data-driven, and the second is knowledge-based (do these terms sound familiar?).

Note another crucial difference between “knowing what” and “knowing how”: if I know how then I know what — that is, if I know how to compute the value of e then I can always store/save the value of e — but knowing the value that is stored somewhere does not mean I know how to compute it!

Having the knowledge (a procedure, an algorithm) to perform some computation implies (or subsumes) having access to the (data) result of that computation, but the reverse is clearly not true.
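
A trivially small sketch of this asymmetry (again, just illustrative, reusing the definitions from the sketch above):

def succ(n): return n + 1
def add(m, n): return m if n == 0 else succ(add(m, n - 1))       # as in the sketch above
def mult(m, n): return 0 if n == 0 else add(mult(m, n - 1), m)   # as in the sketch above

cache = {}
cache["3 * (5 + 2)"] = mult(3, add(5, 2))   # knowing how lets me fill the table myself
print(cache["3 * (5 + 2)"])                 # 21
# The reverse fails: being handed the stored 21 teaches me nothing about add or mult.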

If this were the end of our story, then there would be no debate in AI as to which approach (the data-driven or the knowledge-based) is the right one, since the data-driven approach — thus far — seems to be limited to the data we have seen and would require us to memorize and store a potentially infinite number of values, which is of course implausible. But the story is, of course, not that simple.

But What if I Saw Lots and Lots of Examples?

If the data-driven (‘memorize and store’) paradigm were that simple it would not have gained such attention — even becoming the dominant computing paradigm! On the contrary. The data-driven approach has a slick, coherent and seemingly intuitive story to tell, and it goes like this: if I saw lots and lots of arithmetic expressions, coupled with their values, I would, over time, and using a ‘smart’ (pattern recognition) algorithm, figure out the pattern of computation, and I would thus be able to get the values of unseen expressions from there on. In short, I would essentially learn (from the data) how to do addition and multiplication without explicit knowledge of the details of the corresponding procedure/function — or rule (by the way, the functions given above are essentially rules; for example, the definition of ‘addition’ could’ve been written in a rule form as shown below, where (a => b) is read as ‘if a then b’).
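
A sketch of that rule form, consistent with the Succ-based definition above (the exact notation here is only illustrative):

(n = 0)  =>  add(m, n) = m
(n > 0)  =>  add(m, n) = Succ(add(m, n - 1))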

But can this be achieved? Can we ‘learn’ (or discover) procedures by seeing enough sample data paired with their expected outputs? Well, yes and no, depending on the type of that data and on the precision required.

Data-Driven Approaches Learn how to APPROXIMATE functions

Yes, we could learn a procedure/function from seeing lots of examples, but this ‘yes’ comes with a couple of important qualifications. First, if the object under consideration is an infinite object, then the best we can hope for is an approximation, and not the exact function. Second, the only functions we can approximate are continuous functions, because ‘learning’ is essentially finding a hyperplane in an n-dimensional space that would cover all (most!) of the data points we have seen — a hyperplane that (hopefully) would also cover new and unseen data points. In simple terms, what we essentially ‘learn’ is an approximation of an infinite hashtable (data dictionary) like the one we showed in the introduction (that’s what a function is, by the way — an infinite table/relation). That virtual and infinite data dictionary (that ‘knowledge’) would essentially be stored in the weights of our optimized network (optimized, perhaps, by Backprop/SGD). But, again, what we learn is an approximation of our functions, and not the exact functions themselves.
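
A toy illustration of this point, using plain least squares instead of a neural network (the setup and numbers below are made up just to show the behaviour): fit a linear model to examples of m * n drawn from a bounded range, and what you get is an approximation that degrades badly outside the data it has seen.

import numpy as np

# Training data: pairs (m, n) with their product, sampled from a bounded range.
rng = np.random.default_rng(0)
m = rng.integers(0, 10, size=500)
n = rng.integers(0, 10, size=500)
X = np.column_stack([m, n, np.ones_like(m)]).astype(float)   # linear features only
y = (m * n).astype(float)

# 'Learn' multiplication by fitting weights to the examples (least squares).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predicted_product(a, b):
    return float(np.array([a, b, 1.0]) @ w)

print(predicted_product(5, 2))     # roughly near the true value of 10, inside the training range
print(predicted_product(50, 20))   # nowhere near 1000: the approximation does not extrapolate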

So, Where is the Problem?

There are two problems. The data-driven approach essentially memorized (let’s say, learned) how to compute the value of new expressions based on the many examples it saw, and herein lies the limitation of this paradigm:

when dealing with infinite domains, we could never memorize (and store in the optimized weights) an infinite function, and thus all of our computations are approximations. While approximations are accepted in some domains — image or speech recognition, for example — in many other domains we cannot tolerate this noise and thus the values required can only be obtained by knowing how (i.e., by explicit knowledge of the procedures/rules).

So, yes, infinity is the main villain in data-driven approaches (elsewhere I have labelled the data-driven approach to natural language understanding the “chasing infinity” paradigm).

Now how is this Relevant to AI and Understanding?

Now here’s the implication of all of the above for AI. The main conclusion is that there can only be approximations, but no real understanding, in the data-driven approach, and here’s why, by way of an example. I claim

no student can really understand how 30 * (21 + 5) comes out to 780, for example, without knowing what the answer to 3 * (4 + 60) is.

In other words, if you know how to compute 30 * (21 + 5) then you know how addition and multiplication work, and if you know how addition and multiplication work then you would know how to compute 3 * (4 + 60). The opposite of this argument is this: if you know that 30 * (21 + 5) comes out to 780 but you don’t know what the value of 3 * (4 + 60) is, then you don’t really know what the value of 30 * (21 + 5) is — you perhaps just have the value memorized/stored somewhere — that is, there is no real knowledge or understanding. The logic is quite simple, and I expect it is not controversial.
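
To make the ‘knowing how’ direction concrete, here is a small sketch (repeating the add and mult procedures from the earlier sketch so it stands alone): one and the same evaluation procedure handles 30 * (21 + 5), 3 * (4 + 60), and any other expression built from the same kinds of parts.

def succ(n): return n + 1
def add(m, n): return m if n == 0 else succ(add(m, n - 1))
def mult(m, n): return 0 if n == 0 else add(mult(m, n - 1), m)

# One recursive procedure evaluates any expression built from numbers, '+' and '*'.
# An expression is either an int or a tuple: (operator, left, right).
def evaluate(expr):
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    if op == '+':
        return add(evaluate(left), evaluate(right))
    if op == '*':
        return mult(evaluate(left), evaluate(right))
    raise ValueError(f"unknown operator: {op}")

print(evaluate(('*', 30, ('+', 21, 5))))   # 780
print(evaluate(('*', 3, ('+', 4, 60))))    # 192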

Now let us extend this argument to other domains — say to natural language understanding (NLU). Prominent cognitive scientists, logicians and philosophers of science have long used this obvious observation to make the case that real understanding of natural language works the same way — this is often referred to as systematicity of language/thought and that property is related to compositionality. Here’s the corresponding argument in natural language:

no one can entertain the thought (or equivalently, truly understand the meaning of) “John loves Mary” without being able to entertain the thought (or equivalently, understand the meaning of) “Mary loves John”, “John loves John”, “Someone loves John”, “Mary loves someone”, etc.

Thus, much like the case of arithmetic expressions, if one truly understands how the composite thought (or the composite meaning) of John loves Mary was determined from its components, then they can apply the same procedure to another construction that has the same (types of) components, ad infinitum. And, conversely, and like the case of arithmetic expressions, if one (claims to) know the meaning of John loves Mary but does not understand what Mary loves John means, then they do not really understand what John loves Mary means either; they must simply have that meaning stored somewhere (i.e., memorized — perhaps because it was seen in previously encountered data).
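
As a toy sketch only (not a rendering of any particular semantic theory): if the meaning of ‘loves’ is a procedure that composes with its two arguments, then one and the same composition step yields the ‘thought’ for every sentence of that form.

# Toy compositionality: the meaning of a transitive verb is a function of two arguments,
# and sentence meaning is built by one and the same composition step every time.
def loves(subject, obj):
    return ('loves', subject, obj)          # a composite 'thought' built from its parts

def sentence_meaning(subject, verb, obj):
    return verb(subject, obj)               # same procedure for every subject/verb/object

print(sentence_meaning('John', loves, 'Mary'))      # ('loves', 'John', 'Mary')
print(sentence_meaning('Mary', loves, 'John'))      # ('loves', 'Mary', 'John')
print(sentence_meaning('someone', loves, 'John'))   # a new combination, no new machinery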

Basically, it was proven a long time ago that understanding is systematic: if I know/understand what some expression/syntactic pattern means, then I must also know what infinitely many similar expressions/syntactic patterns mean — and if I “know” one but not the other, then I have simply memorized one but not the other, and thus there is no real understanding.

In short, if the paradigm in NLU is not understanding how X, but observing and memorizing X, then we can never hope to really ‘understand’ language this way. Understanding language means knowing how, and knowing how in infinite domains requires knowing the details of a number of procedures/rules.

Incidentally, GPT-3 just came out, and it has now scaled to 175 billion parameters (weights that store/memorize knowledge). I suspect that by the time GPT-2000 is out we will still, of course, not have succeeded in chasing and memorizing infinity, but we might very well have a major ML-induced carbon-emission problem :)

So, where do I stand? Well, I never liked to memorize data — I always preferred knowing how to get it!

___________________
https://medium.com/ontologik
