Are NNs Just Fuzzy Hashtables? A Revealing Experiment on MNIST Data

Walid Saba, PhD
Published in ONTOLOGIK · Mar 4, 2021

Are neural networks just large lookup tables with fuzzy/approximate keys, as many cognitive scientists and analytic philosophers (of mind and language) have long concluded? If so, then on small domains there should be a simple scheme that performs as well as any neural network. In this post we report on testing this hypothesis.

Neural Networks as Large Lookup Tables: Small Domains

In scenarios where the number of weights (erroneously called parameters in the DL literature) is much larger than the number of training data points, the network will simply converge to an acceptable error rate by memorizing all the training samples. In small domains, therefore, the network should (almost) perfectly classify any new input, since it has (nearly) memorized all possible input-output pairs. To test this hypothesis I ran a simple experiment, starting with the MNIST data (hand-written decimal digits, available here). In this small domain there should be some scheme that employs a hashtable with fuzzy/approximate keys: we construct a key for each of the 10 digits, along with a ‘matching’ function that matches a new input against these keys. In theory, the performance should be equivalent to that of any neural network, regardless of its architecture. Indeed, that was the case, at least on this simple domain. Below I discuss what these fuzzy keys are, and what the possible lookup functions look like.
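
To make the lookup-table hypothesis concrete, here is a minimal sketch in Python of what such a scheme looks like. The cosine similarity used as the default matching function is just one illustrative choice, and the keys themselves (one or more key images per digit) are built later in the post:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened images."""
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def classify(image, keys, match=cosine_similarity):
    """Fuzzy 'hashtable lookup': return the digit whose key images best match `image`.

    `keys` maps each digit 0-9 to a list of key images (the fuzzy keys).
    """
    scores = {digit: max(match(image, k) for k in key_images)
              for digit, key_images in keys.items()}
    return max(scores, key=scores.get)
```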

Fuzzy Keys of MNIST data

There are several methods whereby one can create fuzzy keys for the 10 MNIST digits. In a sense, for every digit we are trying to create some ideal archetype (the Platonic ideal, if you like). But what is an ideal ‘6’, a ‘6’ that resembles all the other ways of writing a ‘6’, or, more accurately, a ‘6’ that any new ‘6’ would match much more closely than it would match any other ideal? It was not as simple as I initially thought, but also not that difficult. Below are some sample ‘6’ images from the MNIST data, and one way to create an ‘ideal’ that represents all of them:

One way to create an ideal ‘6’: overlay a number of transparent instances on top of each other

The above fuzzy ‘6’ can be thought of as an ideal ‘6’ that would, hopefully, match well with any unseen ‘6’; at the very least, an unseen ‘6’ will match it much better than it will match any other ideal. Unfortunately, that alone did not produce good classification accuracy on the set of 10,000 test instances (a single overlay ideal produced only 58% accuracy!). The fix was to create a number of ideals per digit, using different methods of merging: overlay, union, intersection, shift and rotate, and then also fuzzy union, fuzzy intersection, etc. (scipy’s ndimage module and numpy have very good image/array functions for this: shift, rotate, zoom, etc.). Here are some of the ideals created for ‘6’:
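
For concreteness, here is a minimal sketch of how such a set of ideals can be built. The post does not spell out the exact recipe, so the particular merge operators below (mean as the overlay, pixelwise max as the union, pixelwise min as the intersection) and the shift/rotate jitter are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

def make_ideals(instances, n_variants=10, seed=0):
    """Build a set of fuzzy 'ideal' images from a stack of same-digit instances.

    `instances` is an array of shape (n, 28, 28) with values in [0, 1].
    The merge operators and the jitter ranges are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    ideals = [
        instances.mean(axis=0),   # overlay: average of transparent layers
        instances.max(axis=0),    # union: pixel is on if it is on in any instance
        instances.min(axis=0),    # intersection: pixel is on in every instance
    ]
    for _ in range(n_variants - len(ideals)):
        # Jitter the overlay with a small random shift and rotation.
        jittered = ndimage.shift(ideals[0], shift=rng.integers(-2, 3, size=2))
        jittered = ndimage.rotate(jittered, angle=rng.uniform(-10, 10), reshape=False)
        ideals.append(np.clip(jittered, 0, 1))
    return ideals
```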

For each digit, 10 such fuzzy ideals were created from 100 randomly picked instances (thus you can say that our method used 100 training examples to create the ideals; or, our child saw one hundred ‘6’s before she learned what a ‘6’ looks like!). This was not the complete key, however, although it alone jumped the accuracy to 86%. The final addition to the key did the trick: besides the 10 ideals for each digit n, 10 randomly picked instances of n were added to the key. The key for each digit n was therefore its 10 fuzzy ideals together with these 10 random instances, as sketched below.
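
Continuing the sketch above, the key for each digit might be assembled like this (build_keys and the dictionary layout are hypothetical names for illustration, not the author’s actual code; make_ideals is the helper from the previous sketch):

```python
import numpy as np

def build_keys(train_images, train_labels, n_train=100, n_random=10, seed=0):
    """Assemble the fuzzy key for each digit: 10 ideals plus 10 random instances.

    Relies on make_ideals() from the previous sketch.
    """
    rng = np.random.default_rng(seed)
    keys = {}
    for digit in range(10):
        pool = train_images[train_labels == digit]
        sample = pool[rng.choice(len(pool), size=n_train, replace=False)]
        keys[digit] = {
            "ideals": make_ideals(sample, n_variants=10),
            "instances": list(sample[rng.choice(n_train, size=n_random, replace=False)]),
        }
    return keys
```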

A new instance was classified as n if it matched best against the set of ideals of n and its corresponding random instances. Several matching functions were used: cosine similarity, Jaccard similarity, Pearson correlation, etc. A few runs made it obvious that a weighted combination of several matching/similarity functions produced better results (some similarity measures did better on some ideals than others). In the end, a few hundred lines of code (about 200 to produce the ideals, and about 400 for all the matching functions and the final weighted score function) produced an accuracy of 92%. Not bad for a small program, and without using regression to optimize the weights.
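
Here is a minimal sketch of what such a weighted score might look like, reusing cosine_similarity from the first sketch and the key layout from the previous one. The particular similarity measures, thresholds, and weights below are illustrative guesses, not the values used in the experiment:

```python
import numpy as np
from scipy.stats import pearsonr

def jaccard_similarity(a, b, threshold=0.2):
    """Jaccard similarity on binarized images (the threshold is an arbitrary choice)."""
    a, b = a.ravel() > threshold, b.ravel() > threshold
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def pearson_similarity(a, b):
    """Pearson correlation between pixel intensities."""
    return pearsonr(a.ravel(), b.ravel())[0]

# Hand-picked weights; the post notes they were not optimized by regression.
SIMILARITIES = [(cosine_similarity, 0.5), (jaccard_similarity, 0.3), (pearson_similarity, 0.2)]

def weighted_score(image, key):
    """Weighted match of an image against one digit's key (its ideals plus random instances)."""
    references = key["ideals"] + key["instances"]
    return sum(w * max(sim(image, ref) for ref in references) for sim, w in SIMILARITIES)

def classify_weighted(image, keys):
    """Classify as the digit whose key yields the highest weighted score."""
    return max(keys, key=lambda d: weighted_score(image, keys[d]))
```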

What About Other Domains?

The small experiment on MNIST data was encouraging: it seems that one can model the fuzzy lookup table a neural network creates with simple algebraic methods, by creating fuzzy keys for the classes along with fuzzy matching functions to classify new instances. But is it that simple? Unfortunately not!

First, I tried to extend this simple approach to hand-written characters. Below are sample instances of ‘p’ and six ideals created using the same methods used to create the digit ideals (the characters dataset is the one available from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine).

Unfortunately, the simple code and simple logic that worked on MNIST digits did not do as well, although it did produce 73% accuracy. I’m sure this can be pushed much further. What is interesting is that when I tried to extend this to other real-life images, I started to really appreciate the power of neural networks. Consider the ideals of a bus that I created, again using the simple method I used on the MNIST data:

A quick run showed how complex the problem quickly gets. What is an ideal bus image? And what is the matching function? A neural network discovers all these features automatically and creates (admittedly unexplainable) keys, namely the weights scattered all over the network, that in the end classify unseen examples with amazing accuracy. However, hand-crafting the fuzzy ideals (which correspond to the weights in the network) revealed where that complexity goes. For the neural network to cover buses seen from different vantage points, at different angles, etc., and to indirectly create fuzzy keys that are the exact weights on thousands (or even millions) of nodes, it also ends up covering (in the same manifold) unwanted and irrelevant instances. This is what has recently been labeled the ‘underspecification’ problem in NNs, and it is also what gives rise to adversarial examples. How did I conclude this? I could have relaxed the creation of the ideals and the matching functions, maintaining good recall, but then I would classify a ‘p’ as a ‘9’ (does the term ‘adversarial examples’ ring a bell?).

What NNs do automatically is amazing. However, they seem to do a lot more than they need to (decent recall, bad precision). What NNs need (at least for pattern-recognition problems) are methods whereby semantic constraints can be added to the model, in such a way as to exclude the vast majority of irrelevant data points that would otherwise be covered by the highly underspecified predictor. Semantic constraints, as research has confirmed, are relevant not only in language but also in image understanding.

In any case, if you want to appreciate what a neural network can do automatically, I suggest you try this experiment. That will certainly make you appreciate the magic that backpropagation and SGD are doing. So I, for one, will never try to classify images using hand-crafted algebraic methods.

Having said that, and as I have always maintained, not only are these approaches (NNs) good for pattern recognition problems, I cannot even imagine any other method that is as effective. But I also maintain that they are irrelevant to problems that involve high-level reasoning (e.g., language understanding, planning, etc.). That is why animals, which are also good at pattern recognition, will never reason.

Just to be clear :)
