Analogy as Functors
What does intelligence mean? I’ve for some time been thinking about a framework within which one could treat questions on intelligence — whether of human or AI or corporation or evolution — in the spirit of Leibniz. Instead of rambling discussions where everyone brings lots of extravagant baggage, I wish to some day be able to think through (even if not calculate) answers under the framework and ask, do you have something better? (This is not rhetorical, if you do then I can stop wasting my time right away).
The framework starts from intuition, guided by experience of what framing is most fruitful when the task is principled generation of hypotheses. The framework is also grounded mathematically — I achieve this by starting from computational principles and working backwards to what basic math adequately expresses the concepts of interest. Although the framework is broad; based on computing theory and the link between information theory/thermodynamics, I’ll zoom in and focus this article on analogy because I was able to perform an experiment around it.
One of the most important aspects of intelligence is analogical reasoning. Analogical reasoning is when concepts from one space are mapped to concepts in another space. For example, I might ask: what is the biological construct most similar to this physics concept? Or, I might say: what do I know that is like this new thing I am seeing?
By noticing how one thing is similar to another thing, I accomplish two things: first is that I pave the way towards being able to code both concepts in terms of something more general and thus achieving a better compression of both. Second is that by leveraging knowledge of the known structure, I can more quickly learn, probe and design experiments to better understand the new structure. The most creative individuals are really good at exploiting connections between things. Analogies do have the downside that, if the structure is not actually shared across the full space, you end up with a worse understanding of the new space than if you had not tried to accelerate by leveraging analogy.
Analogies as Functors
Thinking about analogy, the most obvious representation is that it’s a structure preserving map from one space to another (although in reality, either in the brain or practically in an AI, I doubt it is ever so direct). In category theoretic language, an analogy can be thought of as a Functor from one category to another. If you know A and wish to learn about B, then you must first learn F, a functor from A→B
and then using knowledge about A, and your guesses about the structure of F, you design experiments with F(A) to further learn about F and B (and how B diverges from A).
From this inspiration, I considered a simpler scenario for a computable experiment. Instead of functors and categories, which I have only a rudimentary understanding of, I use vector spaces and linear transformations. Then we can consider, suppose you have a vector space model A, of one topic and you learn a model B for another topic, then what you want is a some X such that AX=B. In this model scenario, the linear transformation X is the analogical mapping/functor and can be solved for easily enough using QR factorization. The problem now is how to construct said spaces and then testing if X really is an analogical mapping.
This too is easily done, there are a multitude of vector space models: Latent semantic indexing, random indexing, GloVe, Holographic Reduced Representations and word2vec. For my own purposes; across the requirements of speed, memory use, data requirements, online (learning cleanly on the go) and accuracy, random indexing dominates all other choices. The curious thing about random projections is that they are “anti-fragile”, they actually turn the curse of dimensionality into a boon. There are even some neuroscience models that hypothesize the brain makes extensive use of random projections. In our case, because random vectors in high dimensions are almost orthogonal to each other, even small nudges are enough to get useful results. In the case of small data this is useful.
Digression: When word2vec analogies are not useful
You might have heard of word2vec, a shallow neural network based method that approximately “reduces/factorizes” point wise mutual information on words, using co-occurrence stats; its ability to handle analogies using vector math seems impressive at first but upon closer inspection, is found out as ultimately a narrow gimmick with limited applicability.
The word2vec analogies fail quite often beyond well represented examples e.g. king:man :: woman:… vs baseball : sabermetrics::basketball:…. They work best for obvious i.e. 1 step analogies on single words which in general, is not useful. Those analogies are too simple and limted to serve as a demonstration of analogies as mappings across (sub)spaces.
(Aside: All vector space models based on context cannot learn to distinguish antonyms from synonyms.)
Digression 2: When Big Data is worse than Small Data
Big data is not always best — for example, when I wish to narrowly focus on two topics, I do not care for vectors that have been trained on the entirety of Wikipedia. They will fail to surface the connections I am most interested in. Consider the case when I want to find analogies between learning and evolution, even when I simplify the problem to cute SAT style analogies, the results I get are completely useless.
The analogies I build are based on mapping from one space to another space. The source of the vectors can be anything (including word2vec, my criticism is for the analogy demos not the quality of generated “concept vectors”) but I use random indexing vectors since, as previously stated, they are pareto-optimal for my requirements (e.g. speed). For this demonstration, I proceed by extracting and building vector space models of the wikipedia page for evolution and reinforcement learning separately. The vectors have dimensionality 250. After this, I now ‘solve’ for the analogy F from reinforcement learning to evolution which is represented by a 250x250 matrix.
Because the vector dimensionality is 250, I need at least 250 sample vectors in order that the equation not be underspecified. Then, if I want to look at concepts in evolution in terms of reinforcement learning or vice versa, I simply multiply the vectors by the appropriate solved for Matrix. To my surprise, it worked (why was I surprised? because I generated a matrix randomly filled with 1,0 and -1s, performed a slipshod approximation of sparse matrix multiplication against co-occurrence frequencies, ran QR factorization on that, multiplied the solved for matrix with my random matrix and it worked out interesting analogies! As I will emphasize again and again, “Human unique” Intelligence is over-estimated). You can scroll to the bottom to see some examples or check the full results here and here — it’s almost all interesting.
There are a couple impressive examples I’d like to highlight. Selection (from natural selection) is mapped to search, iteration, improvement, gradient and evaluation. Those all are extremely on point! Survival is mapped to states, actions, regret and moves. One of those is a non-trivial insight as it was a link only recently suggested (more in part 2). It also maps survival to non-episodic, this is correct and a link I’d never made before. Genes too are well mapped, with history a low scoring (appropriate that it should be a low) match.
In the other direction, from reinforcement to evolution, the map is also has some stand out samples. Policy in particular is another observation I’ve never considered before. Policies are what an RL agent learns in order to behave optimally at each particular state. What’s the most appropriate match? The genome and the phenotype of course — alleles, survival and gene are also all linked and pretty good matches. Reinforcement is matched with breeding, existence and elimination; learning to adapted and surviving; exploration to species, individuals, population and mutation.
Function too, is a curious one and highlights what I mean by advantages of small data. F(function) is judged as very close to hypothesis, idea and observations. In reinforcement learning function is in the sense of value function or action value function, and indeed value functions are much closer to hypothesis and observations than to purpose, graph or chart — which is what something based on more data would have reckoned.
These are all rather impressive and together, especially when one considers cosine similarity values and vector dimensionality, virtually impossible to have all occurred by chance.
So what am I to make of this? Haphazardly multiplying co-occurrence counts from a couple wiki pages with a random matrix is enough to make associations that would give many humans trouble. Why? I hypothesize two reasons working in conjunction. Text is not random, and is structured for the purpose passing along information — I wonder if a concept as Potential Intelligence inspired by potential energy is appropriate. The other, I think, is that this is another instance of the moravec paradox. If you list the things that you intuitively would think require the most computational power and intelligence, you’d probably have written that list upside down with respect to the correct order. If you want to understand intelligence — in animals, in humans and in AI, you would do well to take a look at Moravec’s Paradox. After you’ve done that tell me if you think the conventional presentation of the “Clever Hans Effect” makes even a tiny bit of sense.
In this post I explicitly computed an “analogy” as a map between two spaces. But although — to my surprise — the method worked, I doubt that such a map is ever explicitly made in the brain. In Project Int.Aug, this method is not used, it’s too rigid, it works only on two concepts when, usually I’m considering multiple concepts at a time. Secondly, although the map works alright, especially for well represented words, it outputs nonsense in many instances. In a future post I’ll show the practical method in use, including various ways of constructing subspaces and also composition, that go beyond single words and two concepts at a time while also possessing increased accuracy.
In my next post I talk about how AI and intelligence amplification differ in their treatment of A,F and B. I’ll go over my choice of these two topics (and why not others? Well, one aspect is that a recurring theme in my writing is evolution is as learning and so I’m usually trying to amplify my understanding of related topics by e.g., mapping between reinforcement learning <=> regret <=> evolution).
Meta: I hope that if before, you had not a clear idea of reinforcement learning but had knowledge of evolution (which is more widely known), the analogies (not mine!) have put some of the concepts in context.