If Humans Spoke in Vectors…
Would we be as successful as we are now? I’d say no.
What Are Semantic Vectors?
But first, what does it even mean to communicate with vectors? Communication can be defined as the transfer of an idea from one person to another, and a semantic vector essentially tries to capture an idea in a numerical representation. Vectors have grown vastly in popularity in Natural Language Processing in the past decade, and now virtually all research in the field revolves around a basic assumption that vectors can effectively convey ideas.
The first commonly used semantic vector model, generally known as word embedding (as in embedding a word in a vector space), was word2vec in 2013. Word2vec learns vector representations of words by encouraging words that occur in similar contexts to have similar vector representations. So, for example, if we have a corpus with examples like “the dog chased the cat”, “the canine chased the cat”, and “the rat feared the cat”, word2vec would learn to represent “dog” and “canine” with similar embeddings. The generated vectors generally have hundreds of dimensions, but the embeddings for “dog”, “canine”, and “rat” in our example might look something like [0.3, 0.1, 0.8], [0.3, 0.2, 0.8], and [0.4, 0.6, 0.7], respectively. Notice the similarity of “dog” and “canine”, which comes from their occurrence in similar contexts in our corpus.
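One common way to put a number on that similarity is cosine similarity, the cosine of the angle between two vectors. The sketch below applies it to the three toy embeddings from the example; they are illustrative values, not the output of a trained word2vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings from the example (illustrative, not trained).
embeddings = {
    "dog": [0.3, 0.1, 0.8],
    "canine": [0.3, 0.2, 0.8],
    "rat": [0.4, 0.6, 0.7],
}

dog_canine = cosine_similarity(embeddings["dog"], embeddings["canine"])
dog_rat = cosine_similarity(embeddings["dog"], embeddings["rat"])
print(f"dog vs canine: {dog_canine:.2f}")  # 0.99
print(f"dog vs rat:    {dog_rat:.2f}")     # 0.86
```

“dog” and “canine” come out nearly parallel, while “rat” points in a noticeably different direction.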
These are often called “distributed representations”, because they were found to spread semantic categories across many vector dimensions rather than dedicating one dimension to each. A commonly cited example points out that if you subtract the word embedding for “man” from “king”, and then add “woman”, you land very close to “queen”. So these high-dimensional vectors that learn word representations just from context are able to capture the relationships between words pretty accurately. Sure, they might not know what a king looks like, what he does, or his historic relevance, but they do know that he’s similar to a man in some ways and similar to a queen in other ways.
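The “king” minus “man” plus “woman” arithmetic can be sketched with a handful of hand-built vectors. Everything here is an assumption made for illustration: the three dimensions loosely stand for royalty, maleness, and femaleness, and the numbers were chosen so the analogy works. Real embeddings have hundreds of dimensions with no such tidy labels, and the result is only approximately “queen”.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-built toy vectors: (royalty, maleness, femaleness). Not from a trained model.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.8, 0.1],
    "woman":  [0.1, 0.1, 0.8],
    "prince": [0.8, 0.8, 0.1],
    "apple":  [0.0, 0.1, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Excluding the three input words from the candidates is a common convention in analogy evaluations, since the nearest vector to the result is often one of the inputs themselves.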
Word Relations and Différance
These relations are what many of us will resort to when asked about the structure of language. A dictionary, after all, just refers us to other words when asked for any definition. The implicit conclusion here, that without grounding in the real world all words are purely relational, bears a striking resemblance to Derrida’s Différance. The core of Différance, which became fundamental to Deconstruction and the Postmodernism that now dominates the humanities, is the idea that words gain meaning only through their difference from (and deference to) others. Only in the real world, with speech, outside of the realm of paper, would Derrida acknowledge that an uttered symbol can present real meaning.
But virtually all Machine Learning research in Natural Language Understanding focuses on written text, so can any real meaning be derived? Again, I’d say no. All the colossal Deep Learning architectures that seem to dominate NLP these days don’t actually understand language; they’ve just learned correlations between words in ways ultimately quite similar to the original word2vec. This is quite obvious with OpenAI’s GPT-3, which is able to produce text that sounds like it was written by a human, but doesn’t form any coherent meaning. The model has essentially memorized all the relationships between words from a huge corpus of internet text, but it can’t know what those words mean without some kind of grounding in the physical world.
Some meanings can be largely summed up in these simple vector relationships, but others are quite a bit more complex. The relationship between “predator” and “prey”, for example, is abstract enough that I’d pretty confidently say their meanings can be captured in some kind of vector representation. But the relationship between “knife” and “onion”? You as a reader probably know how those words are related, but I can’t even begin to capture the nature of their relationship in language. Yes, we can “cut an onion with a knife” (and GPT-2 is actually capable of predicting “knife” here), but how does this action actually work? What is the purpose it serves? What do the words really mean?
But we might be confusing two things here. Though these language models learn the meanings of words through their relationships and co-occurrence with one another, we can think about definitions on their own as well. A classic example from Indian philosophy is that of the pot. What does it mean to be a pot? As humans, we’re great at generating Platonic Forms from our real-world experiences. Once we’ve seen a few pots, our brains are able to construct a pretty accurate generalized Form for pots, and when we see one again, we’re easily able to recognize that it’s another instance of this Form. The representation our brain learns is so much more information-dense than what can be gleaned from language. It’s imbued with an innate understanding of visual and physical attributes.
Google’s BERT might read an article about pots and then tell us that pots are often made of clay, can be used as cooking vessels, and have been found in China from 20,000 BC. But ask BERT whether a pot with holes would be able to hold water and it’d have no idea. This could certainly be construed as a language issue instead of a grounding issue. Maybe we could find some more textual data that relates pots to holes and holes to water so that BERT can learn better semantic representations for them. The real solution, to me, would be to ground the language model in the physical world, perhaps with some kind of Reinforcement Learning strategy. If a picture is worth a thousand words and a video is worth a million, the ability to interact with a physical environment must carry a tremendous amount of semantic information.
Composing and Comprehending Meaning
While grounding is vital for learning accurate word representations, something even more elementary to me is the compositionality of language. Compositionality is the idea that the meaning of a sentence is a unique, synergistic result of combining the constituent words within it. It’s closely related to Noam Chomsky’s Recursion, which he asserts is the fundamental element underlying all human language. Recursion is the ability for us to infinitely nest expressions in language, much like this very sentence, where I can just continue chaining on clauses, again and again, until I desire to stop, at which point I may place a period in writing, or a pause in speech, and then continue on to present yet another idea.
And upon the conclusion of that sentence, there is a moment of understanding, where the meaning that I intend to express bursts forth in its entire form in your mind. Bhartrhari, an Indian linguistic philosopher of the 5th century, termed this “bursting forth” as “sphoṭa”. The symbols, whether as letters on paper or sounds in speech, agglomerate together to create a unified meaning that is entirely comprehended in a single moment. Later on, the Mimamsakas built on this idea to emphasize that a sphoṭa is not indivisible as originally conceptualized, but rather is the result of the hierarchical composition of sounds, words, and clauses through grammatical structures.
The ability to compose words to generate coherent moments of understanding is vital to language in my opinion, and I’m skeptical of the ability of word vector embeddings to do this effectively. There has certainly been progress on this front in ML with Transformer architectures. Previously, RNNs dominated the NLP landscape. These neural nets linearly process language one word at a time, feeding the collective representation of the sentence so far forward until an entire representation is outputted at the end.
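That one-word-at-a-time loop can be sketched as a minimal recurrent cell. The word vectors and both weight matrices below are hand-picked assumptions rather than trained parameters, and a practical RNN would add biases, gating (as in an LSTM or GRU), and far larger dimensions.

```python
import math

# Toy 2-dimensional word vectors and weights, hand-picked for illustration (untrained).
EMBED = {"cat": [1.0, 0.0], "chased": [0.5, 0.5], "mouse": [0.0, 1.0]}
W_X = [[0.5, -0.3], [0.2, 0.8]]   # input-to-hidden weights
W_H = [[0.6, 0.1], [-0.2, 0.4]]   # hidden-to-hidden weights

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_encode(tokens):
    """Read tokens left to right, carrying a single hidden state forward."""
    hidden = [0.0, 0.0]
    for token in tokens:
        from_input = matvec(W_X, EMBED[token])
        from_memory = matvec(W_H, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_memory)]
    return hidden  # the representation of the whole sentence

state_a = rnn_encode(["cat", "chased", "mouse"])
state_b = rnn_encode(["mouse", "chased", "cat"])
print(state_a)
print(state_b)
```

The two word orders end in different hidden states, but everything the model knows about the sentence has to squeeze through that one small vector at every step.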
The problem here, of course, is the severe loss of information due to the lack of syntax trees. While you may not actively think about parsing syntax trees of sentences, it’s an essential prior for proper understanding of any language. Some languages like Japanese put their verbs at the end of the sentence (“cat mouse chased”), whereas others like English put their verbs between the subject and object (“cat chased mouse”). It might seem trivial, but syntax has wide-reaching consequences for language understanding. If you were reading a text in an Object-Verb-Subject language (“mouse chased cat”) without knowledge of the syntax and the consequent relations constructed in the sentence, you’d be very confused.
An interesting side-note here is that speakers of left-branching languages (which put the verb at the end of a clause) have been shown to have better short-term memories than right-branching speakers. This is theorized to be because speakers must maintain multiple elements in their memory for longer before they’re related together by the verb. In a right-branching language, the sphoṭa builds from a subject, to the subject’s relationship, to the subject’s relationship to the object. But with a left-branching language, one first perceives the sphoṭa of the subject, remembers it while perceiving the sphoṭa of the object, then composes those sphoṭas in the final verbal relation.
The true nature of a sentence is more like a graph/network or a tree, and the fact that we communicate it linearly is more of an evolutionary hack we’ve developed to speak out of our single mouth. So when attentional models and Transformers came along, they blew RNNs out of the water because they were implicitly capable of understanding tree structures. The attention mechanism of a Transformer is much what it sounds like: it learns to pay attention to the right things at the right time. When given a sentence “the cat chased the mouse”, for the first “the” it might attend to “cat”, and for “chased” it might attend to “cat” and “mouse” (the verb’s subject and object) in unique ways. Attention learns how to correctly identify the syntax relationships for sentences, thereby arriving at better semantic representations.
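A minimal sketch of that mechanism is scaled dot-product attention for a single query: score each token against the query, turn the scores into weights with a softmax, and take the weighted average of the value vectors. The key vectors and the query for “chased” below are made-up toy numbers, chosen so that “cat” and “mouse” score highest. A real Transformer learns these through projection matrices and runs many attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

tokens = ["the", "cat", "chased", "the", "mouse"]
# Hand-picked toy key vectors, one per token (assumptions, not learned).
keys = [[0.1, 0.0], [1.0, 0.2], [0.3, 0.3], [0.0, 0.1], [0.2, 1.0]]
values = keys          # reuse the keys as values to keep the sketch small
query = [1.0, 1.0]     # hypothetical query vector for "chased"

weights, output = attend(query, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>6}: {weight:.2f}")
```

With these numbers, “chased” puts most of its weight on “cat” and “mouse” and comparatively little on either “the”.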
Logic or Statistics
Transformers, though, are very clearly not just syntactic models. They have a strongly statistical element that allows them to say that “chased” attends to “cat” with a score of 0.7 and to “mouse” with a score of 0.9. We often think that this ability to compute statistically on floating points gives the model greater power, but might this actually be a hindrance?
There’s an elegant duality here between logic and statistics, algebra and linear algebra, one discrete and one fluid. Logic can tell you for certain whether something is true or not, whereas statistics will give you a more nuanced answer. For much of history, philosophy was almost exclusively conducted in terms of black-and-white logic.
For Indian philosophers, the knowledge one can attain from the perception of smoke was quite contentious. Most agreed that one could use logical inference to arrive at the conclusion that there is a fire. But the Carvaka school rejected inference as an epistemological basis because we can’t know for certain that smoke is always caused by fire. The statistical parallel of inference, correlation, would quite readily clear up this dispute though. We can admit that yes, there might sometimes be other reasons smoke appears and that we obviously can’t account completely for that which we have yet to perceive. We then empirically calculate something like a 0.95 rate of correlation between smoke and fire, and substitute this for the dispute-causing boolean. I’d be quite interested in discussing this kind of statistical inference with a Carvakin, but the school went extinct about 900 years ago.
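As a rough sketch of that substitution, one simple reading of the “rate” is the conditional frequency of fire given smoke, counted over past observations. The observation log below is entirely made up, sized so that it reproduces the 0.95 figure.

```python
# Made-up observation log: each entry is (saw_smoke, found_fire).
observations = (
    [(True, True)] * 19      # smoke, and a fire behind it
    + [(True, False)] * 1    # smoke with some other cause
    + [(False, False)] * 30  # neither smoke nor fire
)

# Keep only the cases where smoke was perceived, then count the fires.
fires_given_smoke = [fire for smoke, fire in observations if smoke]
rate = sum(fires_given_smoke) / len(fires_given_smoke)
print(f"P(fire | smoke) = {rate:.2f}")  # 0.95
```

The boolean “smoke implies fire” becomes a number that can be revised as new observations arrive.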
Thinking statistically is also often a useful tool for sorting ideas. When confronted with a new duality, I’ll often try finding where it fits in my immaterial-material duality. Rather than trying to immediately jump to conclusions, I plot the traits of each side on the immaterial-material axis in my head. Depending on how linearly separable the new duality is on this axis, I’m able to determine the degree of correlation that it has with immaterial-material. Another way to think about it is performing PCA on the traits of the new duality to determine whether one of the principal components is the immaterial-material axis.
The Power of Symbolic Structures
We’re often told that it’s better to think statistically, which could totally be true. But some recent findings in Graph Neural Nets suggest that symbolic models, which discard much of the statistical nuance in exchange for simple algebra, are far superior at generalization while also improving explainability by eliminating the black-box neural net. In the paper, the authors use symbolic regression to generate simple symbolic formulas (algebra trees) from their neural network (linear algebra vectors). When applied to astrophysics, the approach generated accurate physical formulas that were capable of explaining interstellar phenomena.
And what more is language than formulas for explaining phenomena? With these symbols of words, and structure of syntax, we communicate theories for how we think things work. Combined with our human ability to learn Forms for real-world objects, language is a powerful tool for generalization that allows us to express ideas comprehensible by almost anyone else. Simultaneously, and just like with the aforementioned paper, language provides improved explainability of our theories of the world. You might have even noticed this in yourself, with ideas in your mind sometimes being vague and confusing, but becoming clear and insightful upon explicit articulation in language (either internally or spoken).
Our own thoughts, and perhaps many animals’ thoughts, might be internally computed with a vector-based mechanism similar to modern neural nets. But there must be a system somewhere that then goes through the process of discretizing those ambiguous meanings into generalizable words, constructing them into a compositional syntax tree, and then linearizing it for output as language. On the receiving end, the mind might first construct a Laplacian matrix or some kind of graph embedding from the predicted syntax tree, fetch vector representations for individual words, then feed them sequentially into the matrix to produce comprehension.
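For readers unfamiliar with the term, the Laplacian of a graph is its degree matrix minus its adjacency matrix. The sketch below builds one for the tiny dependency tree of “cat chases mouse”, with the verb linked to its subject and object. Whether the mind does anything like this is pure speculation; the code only shows what such a matrix looks like.

```python
def graph_laplacian(nodes, edges):
    """Build L = D - A for an undirected graph given as a node list and edge list."""
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    laplacian = [[0] * size for _ in range(size)]
    for a, b in edges:
        i, j = index[a], index[b]
        laplacian[i][j] -= 1   # off-diagonal: -1 where two words are linked
        laplacian[j][i] -= 1
        laplacian[i][i] += 1   # diagonal: number of links each word has
        laplacian[j][j] += 1
    return laplacian

nodes = ["cat", "chases", "mouse"]
edges = [("chases", "cat"), ("chases", "mouse")]  # verb-subject, verb-object
laplacian_matrix = graph_laplacian(nodes, edges)
for row in laplacian_matrix:
    print(row)
# [1, -1, 0]
# [-1, 2, -1]
# [0, -1, 1]
```

The middle row belongs to “chases”: its diagonal entry of 2 records that the verb is the hub connecting the other two words.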
A bit more concretely, syntax can be thought of as a way to compose ideas and perhaps their semantic vectors in compositionally effective ways. When you hear “cat chases mouse”, your mind recognizes “chases” as the verb, “cat” as its subject, and “mouse” as its object. It can then compose these symbols to produce a distinct meaning from “mouse chases cat”, much like the attention mechanism of a Transformer.
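To see why the syntactic roles matter, compare two ways of combining the same three word vectors. Simply summing them ignores order, so both sentences collapse to the same result. Giving each role its own slot keeps them apart. The slot-by-concatenation scheme and the toy vectors are assumptions chosen for simplicity, not how a Transformer or the brain actually composes meaning.

```python
# Toy 2-dimensional word vectors (hand-picked for illustration).
WORDS = {"cat": [1.0, 0.0], "chases": [0.5, 0.5], "mouse": [0.0, 1.0]}

def bag_of_words(tokens):
    """Order-blind composition: element-wise sum of the word vectors."""
    return [sum(WORDS[token][i] for token in tokens) for i in range(2)]

def role_composition(subject, verb, obj):
    """Order-aware composition: each syntactic role gets its own slot."""
    return WORDS[subject] + WORDS[verb] + WORDS[obj]  # list concatenation

bag_a = bag_of_words(["cat", "chases", "mouse"])
bag_b = bag_of_words(["mouse", "chases", "cat"])
role_a = role_composition("cat", "chases", "mouse")
role_b = role_composition("mouse", "chases", "cat")

print(bag_a == bag_b)    # True: the sum cannot tell who chases whom
print(role_a == role_b)  # False: the role slots can
```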
The verdict seems to be in, then. If humans were to communicate entirely in vectors, barring the physiological concerns, our ability to compose ideas, generalize, and gain deeper understanding of our thoughts would likely be quite diminished. Even though we might be able to articulate very specific meanings with vectors, like that exact smell you remember from 3rd grade, it’d become difficult to combine ideas in novel ways, communicate ideas that are broadly true, and dissect our own and others’ opinions. Next time you’re thinking something, speak it.