Language processing in humans and computers: Part 3

Semantics: The Meaning of Language

How can I ever know what you mean?

Dusko Pavlovic

--

1. What does it mean to mean?

1.1. Meaning as organizing

You arrive from a faraway place (be it as a baby or as an alien) and you ask:
“What is a hammer?” Someone shows you a hammer. “This is a hammer”, they say. But you have never seen one before. “What is that?”, you ask. The question of the meaning of a word has become the question of the meaning of an object. It would be the same if you first asked “What is this?” and they said “This is a hammer”. Great, thank you very much. Holding a hammer in your hand, you guess that it should be tasted and that “hammer” probably means tasty. Then they show you how to hammer nails in the wall…

The meaning of a hammer as an object and of “hammer” as a word

Many years later, you read this article, and there is a picture of a hammer as the meaning of the word “hammer” and a picture of hammering as the meaning of the hammer. The figure itself refers to the concept of meaning. It abstracts a general concept from concrete instances. Then there is a paragraph which explains this general process of assigning meanings by referring to concrete instances of that general process…

Everything means something. We organize the world by using signs to assign meanings. Meaning is the process where signs assign meanings to each other. The process is recurrent: signs refer to signs, meanings to meanings.

Language is said to be the characterizing property of our species¹. Another proposed characterizing property is that we use tools. Both properties refer to the same practice.

Many animals use signs and tools. Meerkats have a word for a snake and a word for an eagle, and the difference becomes a matter of life and death when it comes to seeking shelter up in a tree or down in a burrow. Crows use sticks to poke grubs out of holes. But crows throw away their tools when they are done, whereas humans put their tools in tool boxes. They reuse and combine tools. The same tool means different things in different contexts. Tools are signs, signs are tools. They refer to each other and both obey syntax. The instrument is an extension of the musician, the musician of the instrument. Our bodies and our tools evolve together, our mind and our language. They cannot perform without each other. The driver is one with the car, the surfer with the board, the lover with the lover. Language is the process of becoming one through meaning.

1.2. Meaning as abstraction

What do you mean when you say “cow”?

“Pizza”?

“Warm”?

“Love”?

The simplest concept of concept is that it is an abstraction of some concrete things. Your concept of cow would then be the memory of all times when you observed a cow, faded by time and mixed together. By “observed a cow” I mean: you saw or heard or touched or milked a cow, or saw a picture of a cow, or sketched her portrait. A cow’s shapes, smells, and sounds mix with the shapes and the sounds of the word “cow”, with the contexts where it occurs, with the menu choices in a video game or in a food court. The ongoing process of abstraction decomposes concepts and projects away some of their aspects. Here are two ways to abstract a cow:

Picasso, Toros y toreros (Museu Picasso, Barcelona). Roy Lichtenstein, Bull profile (Roy Lichtenstein Foundation)

And what you now read is an instance of abstraction of abstraction. Meaning evolves through abstraction.

So from which direction should we then approach meaning to make sense of it?

1.3. From vectors to streams

Towards noncommutative geometry of meaning. Every science is characterized by its space. Classical physics is mostly developed in the 3-dimensional Euclidean space. Special relativity made it into spacetime by adding the 4th dimension of time. General relativity lives in the curved space of Riemannian geometry. Computer science lives in the space of bitstrings. Biology lives in the space of genes. Static semantics is mostly developed in multi-dimensional Euclidean spaces. Dynamic semantics adds the dimension of time, but differently from special relativity. Static semantics is commutative in the sense that Alice and Bob are the same people as Bob and Alice. Or to say the same thing in terms of cows: the meaning of the word “cow”, the meaning of the corresponding animal, the meaning of that meaning, etc., may each evolve on their own, but they remain related in the same way, statically.

Dynamic semantics, on the other hand, is noncommutative because if Alice speaks before Bob, the terms of reference may be initiated differently from the terms of reference that Bob would establish if he were to speak first. In terms of cows, the reason would be that Alice and Bob may view cows from different angles.

Plan. We first spell out some basic ideas and methods for presenting the static meanings of words, phrases, objects, as vectors. Then we review the basic ideas and methods for presenting the dynamic processes of meaning, threading through sentences, messages, protocols, channels…

2. Static semantics: words in space

When you enter unfamiliar space, you first note where things are, how many there are, how they are related, how far from each other. Then you try to understand what is what. Static semantics works something like that. In the first part of this section, we retrace the Vector Space Model (VSM), which interprets words as tuples of numbers and uses them as coordinates to place words in space, where their meanings are related as angles or distances. In the second part of this section, we review the basic methods of semantic analysis: how to extract concepts from words.

2.1. Vector Space Model: counting words

The Vector Space Model is not a language model. The methods of vector space modeling were introduced in computer engineering by Gerard Salton, in the 1970s, to address the task of information retrieval from databases. The problem of information retrieval was described by Vannevar Bush in 1946, in the Memex Memo. Writing at the time when the new world order and the first computers were being built, Bush described the idea of a World Wide Web, some 50 years before it arose from behind everyone’s back and changed the wide world. While information retrieval was the hard part of keeping the Web and other large datasets navigable, it is the easy part of the process of information supply, which is the main purpose of natural languages. The critical discrepancy is that information retrieval is an operation, whereas information supply is a process. This is why information is in principle retrieved on unordered carriers, and the traditional information measures are invariant under permutations of the carrier elements, whereas it is supplied along channels, where the information flows are ordered in time, and by code, and syntax. We look at that in the next section. First the unordered part.

Vector representations. Although the vector representations were formalized and standardized as data architectures for computer-based information retrieval, they have been around for a long time, as data architectures evolved through natural-language-based information exchange. E.g., a cooking recipe is a vector representation.

Pizzas are viewed as vectors of their ingredients. A real cooking recipe, of course, also describes the preparation steps, but a bigger matrix would accommodate that. A slightly bigger example is provided in Appendix A. Although cooking is a dynamic process, the recipes are presented as static lists because the cooking culture interpolates the operational details: how to cut a sausage, on which side to lay the pizza in the oven, etc. For centuries, all kinds of tacit vector representations have been parsed according to unspoken cultural grammars, using lexicons of common knowledge.

Once the vectors are noticed and placed in space, a geometry of concepts emerges, including the geometry of the pizza concepts. E.g., what is the angle between the White and the Hawaiian pizza? The answer depends on the frame of reference in the relativistic world of pizza. If we just take into account two types of ingredients, vegetables and meats, we get something like this:

Pizzas in space: as vectors of their vegetable and meat ingredients

If we write the pizza vectors in the form

then we can calculate the length of the projection of each of the vectors on the other² one using their inner product

Comparing the inner products of different pizza pairs provides a geometric insight into the pizza semantics. The way such inner products are related with the lengths of the vectors’ projections on one another is explained in Appendix B. To factor out the impact of the vector length, which corresponds to the pizza weight, and to take into account only the angle between the vectors, which measures the similarity of the ratios of the pizza ingredients, we divide the vectors by their lengths and compute their similarity as the cosine of their angle, which is the inner product of the obtained unit vectors:
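As a quick numerical companion to the formulas, here is a minimal sketch in Python. The ingredient amounts are made up for illustration (only the (30, 170) vector echoes Appendix B); the actual pizza vectors live in the figures above.

```python
import numpy as np

# Invented pizza vectors over the (veg, meat) basis, in grams; only the
# (30, 170) vector echoes Appendix B, the rest is made up for illustration.
white    = np.array([ 30.0, 170.0])
hawaiian = np.array([120.0,  80.0])

inner = np.dot(white, hawaiian)                       # inner product <white | hawaiian>
cosine = inner / (np.linalg.norm(white) * np.linalg.norm(hawaiian))
angle = np.degrees(np.arccos(cosine))                 # angle between the two pizzas

print(f"inner product = {inner:.0f}")
print(f"cosine = {cosine:.3f}, angle = {angle:.1f} degrees")
```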

Beyond the cooking recipes. The upshot of our pizza investigations is that the vector representations arose around natural language, as data exchange formats, just like numbers and arithmetic arose as tools for counting and trading and bookkeeping. People developed the basic algorithms to count livestock, measure harvest, and exchange their goods. They developed the vector representations and tables and matrices to transmit cooking recipes and building plans, to coordinate their projects. Each generation probably thought vector semantics was invented yesterday, by the inventors of the most advanced technologies of their time, like we do. But counting, weighing, and the vector semantics evolved with nearly every culture and every natural language.

The next table lists some of the application domains where vector representations are in standard use nowadays.

In all cases, the items i of type 𝓘 are presented as tuples of features u of type 𝓤. An 𝓘-tuple of 𝓤-tuples is presented as an 𝓘×𝓤-matrix. An 𝓘×𝓤-matrix can thus be viewed as an 𝓘-dimensional vector of 𝓤-dimensional vectors, or equivalently as a 𝓤-dimensional vector of 𝓘-dimensional vectors. Either way, it is an assignment C:𝓘×𝓤⟶𝖵𝖺𝗅, written equivalently as a matrix or a linear operator mapping 𝓤-vectors to 𝓘-vectors:

where each value ⟨i | C | u⟩ = Ciu quantifies the feature u in the item i. Appendix B explains how a linear operator C maps the 𝓤-vectors, as mixtures of features, to the 𝓘-vectors, as mixtures of items.
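As a small illustration of this reading of C as an operator, here is a sketch with a hypothetical 3×2 item-by-feature matrix; the items, features, and numbers are invented.

```python
import numpy as np

# An invented 3x2 item-by-feature matrix C: items are pizzas, features are
# (veg, meat), and each entry <i|C|u> quantifies the feature u in the item i.
items = ["Margherita", "White", "Hawaiian"]
C = np.array([[150.0,  40.0],
              [ 30.0, 170.0],
              [120.0,  80.0]])

# C acts as a linear operator: a mixture of features (a U-vector) is mapped
# to a mixture of items (an I-vector).
feature_mix = np.array([0.5, 0.5])       # equal parts veg and meat
item_mix = C @ feature_mix
print(dict(zip(items, item_mix)))        # how much of that mixture each item contains
```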

Among the above application domains, text analysis is the simplest, while concept analysis and recommender systems show how the static vector semantics really works. We will get to them in a moment, after the shortest of summaries of text analysis.

Counting and weighing words. Text analysis starts from a set of documents 𝓘, usually called a corpus. Each document i∊𝓘 is viewed as a bag of words u∊𝓤 and represented as a vector

of occurrences of each word u in the document i. Hence the matrix C again. The frequencies of each u in the corpus 𝓘 on the one hand and in the document i on the other are the ratios

where #𝓘 is the number of documents in the corpus 𝓘, #|Ci⟩ is the number of words in the document i, and

is the indicator of the presence of the word u in the document i. The popular word weight measure TF*IDF is the product

of the logarithmic versions derived from the above frequencies:

where 𝖨𝖣𝖥 is the Inverse-Document-Frequency and 𝖳𝖥 is called the Term-Frequency, although it is not normalized into a frequency. The logarithms are used to mitigate the heavy biases of word distributions. If the terms were normally distributed through documents, then dividing the counts C by 𝖣𝖥 above would give a realistic weight measure. But the word frequencies famously obey the power law, as demonstrated by George Zipf in his 1930s analyses. The power law is numerically close to, and often indistinguishable from, the log-normal distribution. By replacing frequencies with their logarithms, the information retrieval systems even out the expectations and simplify calculations.
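Here is a minimal sketch of one common TF*IDF variant, assuming TF(u,i) = log(1 + Cᵢᵤ) and IDF(u) = log(#𝓘 / DF(u)); the displayed formulas above may use a slightly different normalization, and the toy corpus is invented.

```python
import math
from collections import Counter

# A tiny invented corpus: each document is already a bag of words.
corpus = {
    "doc1": "the cow jumped over the moon".split(),
    "doc2": "the pizza oven is warm".split(),
    "doc3": "the cow likes warm pizza".split(),
}

counts = {i: Counter(words) for i, words in corpus.items()}      # C_iu
n_docs = len(corpus)                                             # #I
df = Counter(u for c in counts.values() for u in c)              # document frequency DF(u)

def tf_idf(word, doc):
    # One common variant: TF = log(1 + count), IDF = log(#I / DF).
    tf  = math.log(1 + counts[doc][word])
    idf = math.log(n_docs / df[word])
    return tf * idf

print(tf_idf("cow", "doc3"))   # appears in only two documents, so weighted higher
print(tf_idf("the", "doc3"))   # appears in every document: IDF = log(1) = 0
```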

Words, terms, tokens? Transforming a document into a bag-of-words is easier said than done. Ignoring the order is easy enough. But what should we count? It is not always easy to decide what is a word. Should we count commas, question marks, and colons as words? Should we count “Cow”, “cow”, “Cows”, and “cows” as the same word? The word “earth” may or may not mean the same thing as “Earth”. Many different approaches to such questions have been proposed. One of the solutions was to call whatever seems of interest for counting a term, and to count terms instead of words. The perfect solution, adopted in large language models, is to count tokens instead of words or terms. The solution is perfect because a token is whatever you might want to count. A bit like a term, but much better.
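A toy illustration of how much the counts depend on what we decide to count; the normalization choices in the sketch (lowercasing, stripping punctuation) are exactly the decisions debated above.

```python
from collections import Counter

text = 'The cow said: "Cows are cows." Earth, or earth?'

raw = text.split()                                     # punctuation and case stick to the words
terms = [w.strip('.,:?!"').lower() for w in raw if w.strip('.,:?!"')]

print(Counter(raw))     # '"Cows', 'cows."' and plain 'cows' count as different words
print(Counter(terms))   # one normalized term per word form; "Earth" and "earth" now coincide
```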

2.2. Concept Analysis: mining for meaning

Looking at the vector representations of words as tuples of numbers might suggest that we didn’t make much progress towards understanding meaning. Words denote concepts, not numbers.

The view of a word as a vector reduces its meaning to a combination of the meanings of its vector components. A pizza recipe is a tuple of numbers, provided that we know which numbers correspond to which ingredients. A pizza is a mixture of its ingredients, and the pizza vector is a mixture of the basis vectors.

The meaning of a word is a concept. A concept is a mixture of basic concepts. The vector representing a word captures its meaning as a mixture of the basis vectors corresponding to the basic concepts. — But if a word is a linear combination of concepts, then the concepts are linear combinations of words.

Concept mining. The goal of concept mining is to extract from a corpus a concept basis: a minimal complete set of concepts. A set of concepts is complete for a corpus when all corpus words can be expressed as its linear combinations. A trivial complete set can be obtained by taking all words as concepts and letting every word refer to itself. The goal is to find a minimal complete set.

The gold vein of concept mining is that every corpus has a canonical concept base. To explain this, we step away from vectors for a moment and sketch the idea of Formal Concept Analysis (FCA) of relational data. Then we return to the vector representation and sketch the idea of Latent Semantic Analysis (LSA) of numeric data. For concreteness, all is presented in terms of a familiar application.

Recommender systems. The task of a recommender system is to predict and recommend the items that Alice will like but has not yet tried. The predictions are derived by aligning Alice’s and Bob’s feedback about the items that they both used, and then extrapolating Alice’s likely preferences from Bob’s feedback. Asking Alice and Bob what they like is of little use, since they do not assign the same meanings to the same words, and the alignment would be too crude. They might both say that they like action movies, but Alice’s concept of action movie may be different from Bob’s, and there may be no movies which they both like. The quest is for an objective concept basis, independent of individual semantic assignments.

Plato’s Problem. How can Alice and Bob communicate in natural language at all if every word means one thing for Alice and a slightly different thing for Bob? Where is the common ground of meaning where people stand together? Where do the shared concepts come from? This is what Chomsky called Plato’s Problem. Plato’s answer was that the shared concepts come from a “place beyond heaven” (topos hyperuranios), where ideas reside before they are projected into our minds. Chomsky’s version was that our semantic, concept-forming capability was innate, just like our syntactic, sentence-forming capability.

Netflix Problem was the topic of a developer competition that ran 2006–2009. The movie rental company Netflix offered $1,000,000 for improving their recommendation algorithm by 10%. The goal was reached by several teams around the same time, with the winning submission followed by the first contender within 20 minutes. The input was an 𝓘×𝓤-matrix again, this time with a set of movies 𝓘 and a set of users 𝓤. The entries of the matrix were the ratings assigned by each user to the movies that they watched. Just like the pizzas were reduced to vectors of ingredients and documents to bags of words, the movies were reduced to tuples of ratings. A toy example of raw input with 5 users and 4 movies would be something like:

Counting the stars would give a table of integers. Replacing stars by 1 and the empty fields by 0 would give an interaction relation. Averaging and normalizing many integer tables, to factor out users’ rating habits (that some mostly give 1 or 2 stars, some mostly 4 or 5) would extract a numeric table with a lot more information, but always sparse, since most users have only rated a few movies.

Question. Which concepts of user taste and which concepts of movie style induced the given ratings?

Formal Concept Analysis (FCA) is the lattice-based mining approach due to Rudolf Wille. The ★-table is first reduced to the interaction relation

where 1 means that the user has assigned a rating to the movie, 0 that the rating is empty. This relation can also be viewed as the bipartite graph.

Alice is a, Juno is j and the relation R(Alice, Juno)=1 is represented by the graph link a➛j. And so on, for all users and movies. The idea is to define

  • a user taste as a set of users who like the same movies; and
  • a movie style as a set of movies liked by the same users.

But a set X⊆𝓤 of all users who like the same movies and the set Y⊆𝓘 of all movies that they all like determine each other. There is a one-to-one correspondence between the user tastes and the movie styles. The interaction relation R induced by a data matrix determines a single family of concepts, interpreted in two isomorphic ways as user tastes and movie styles.

A pair ⟨X,Y⟩ of sets X⊆𝓤 and Y⊆𝓘 where all u∊X like all j∊Y spans a complete subgraph of the bipartite graph, with every node of X linked with every node of Y and vice versa. The user tastes and the movie styles can be identified as the complete subgraphs of the bipartite interaction graph. A concept induced by the interaction relation R is thus of the form 𝛾=⟨X,Y⟩, where X and Y are the nodes spanning a complete subgraph. All complete subgraphs of the above graph are easy to spell out:

They form the concept lattice induced by the given matrix of ratings. It is ordered by inclusion of the sets X of users, and by the reverse inclusion of the sets Y of movies. The concepts higher up in the figure above have bigger sets of users and smaller sets of movies.

The concepts explain the relations between the users and the movies, in the sense that each link in the bipartite graph factors through an element of the concept lattice:

The four diamonds correspond to the four concepts 𝛾=⟨X,Y⟩ spelled out above. A user-movie couple is R-related if and only if there is a concept which links them:

where uRi abbreviates R(u,i)=1 in the interaction relation table. The user feedback u➛i is thus justified by a concept decomposition u➛𝛾➛i relating u’s taste and i’s style.

This mining idea behind the Formal Concept Analysis (FCA) is closely related to the one behind the Latent Semantic Analysis (LSA) that we will need to understand in the next subsection. To connect them, I now briefly formalize the FCA construction that you just saw, and do the same with the LSA one later, so you see where they both get the concepts from. If you are not interested in formal details, this is the moment to skip to the next section, and the bigger pictures :)

The formal construction of the formal concepts is an instance of a Galois connection between the lattices of sets of users on one hand and of movies on the other:

For consistency with the next section, the subsets X⊆𝓤 and Y⊆𝓘 are viewed as characteristic functions (predicates) X:𝓤⟶{0,1} and Y:𝓘⟶{0,1}, respectively. The components X and Y are obtained as the fixpoints of the closure operators derived from the Galois connection (R*, R∗) as follows:

The one-to-one correspondence between the user tastes and the movie styles is now witnessed by the equivalence
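For readers who like to see the construction run, here is a minimal sketch of the two derivation operators and of the concepts as their fixpoints. The interaction relation below is invented (only Alice and Juno are taken from the text); the article’s toy table is not reproduced here.

```python
from itertools import chain, combinations

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# An invented interaction relation R between users and movies:
users  = {"Alice", "Bob", "Carol"}
movies = {"Juno", "Ran", "Alien"}
R = {("Alice", "Juno"), ("Alice", "Ran"),
     ("Bob", "Juno"), ("Bob", "Alien"),
     ("Carol", "Juno"), ("Carol", "Ran"), ("Carol", "Alien")}

def movies_of(X):
    """One derivation operator: the movies liked by all users in X."""
    return {m for m in movies if all((u, m) in R for u in X)}

def users_of(Y):
    """The adjoint derivation operator: the users who like all movies in Y."""
    return {u for u in users if all((u, m) in R for m in Y)}

# A formal concept <X, Y> is a fixpoint of the two derivations:
# X = users_of(Y) and Y = movies_of(X). Closing every subset of users finds them all.
concepts = {(frozenset(users_of(movies_of(set(s)))), frozenset(movies_of(set(s))))
            for s in powerset(users)}

for X, Y in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(X), "like", sorted(Y))
```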

Latent Semantic Analysis (LSA). While the FCA approach is easy to understand and implement, it has the obvious drawback that it ignores the actual ratings. To provide useful information, the raw ★-ratings, like those in the toy table above, need to be preprocessed to filter out the irrelevant user habits, e.g., that some users never use the ratings below 3 stars, whereas others never use the ratings above 3 stars, and so on. Depending on the type of information that is of interest, the raw data are normalized, multiple samples are averaged and preprocessed, leading to a fractional rating table, something like this:

The task is now to predict the missing ratings, which will allow us to provide useful recommendations. The matrix now leads to a bipartite graph weighted by the ratings.

The Latent Semantic Analysis is the concept mining method based on linear algebra, introduced in a series of papers³ by Deerwester, Dumais, Landauer and collaborators. The idea, instantiated to the Netflix Problem again, is to define

  • a user taste as a linear combination of users who like the same movie style; and
  • a movie style as a linear combination of movies liked by the user tastes.

There is again the obvious correspondence between the two, this time established by linear operators. The missing ratings will be derived from the concepts, and the concepts will be derived from each of the fully inhabited minors of the given matrix. These full submatrices induce the linear operators which map the user tastes to the movie styles, and vice versa, like before:

The concept factoring on the right this time corresponds to the Singular Value Decomposition (SVD) familiar from linear algebra. Applied to the upper left minor L of the given rating matrix, the SVD would give

where the left factor is E, the right factor is P, and the diagonal matrix packs the singular values, which are the square roots of the eigenvalues of each of the hermitians obtained by composing L with its transpose. The hermitians play in the LSA the role of the closure operators in the FCA. In general, the LSA concept mining process is analogous to the FCA process:

The matrix L thus first determines an adjoint pair of linear operators; their composites determine a pair of hermitians with shared eigenvalues 𝜆. The adjoint operators establish a one-to-one mapping between the corresponding eigenspaces. Hence the SVD. The concepts are this time the pairs of eigenspaces |x,y|. For the above L, there are two such pairs, corresponding to the eigenvalues 3 and 1. They determine the two latent concepts, which can be interpreted as basic tastes or styles. One is 3 times stronger than the other.
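Here is a minimal sketch of this step, using a made-up 2×2 fully inhabited block chosen so that the singular values come out as 3 and 1, echoing the numbers in the text; the article’s actual minor L lives in the figure above.

```python
import numpy as np

# An invented fully inhabited 2x2 block of fractional ratings.
L = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Singular Value Decomposition: L = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(L)
print("singular values:", S)                  # [3., 1.]: one concept 3x stronger

# The singular values are the square roots of the eigenvalues of the hermitian L @ L.T:
print("eigenvalues of L L^T:", sorted(np.linalg.eigvalsh(L @ L.T), reverse=True))  # [9., 1.]

# Each pair of singular vectors (a column of U, a row of Vt) is a latent concept:
# a mixture of users matched with a mixture of movies.
for k in range(len(S)):
    print(f"concept {k}: strength {S[k]:.1f}, user mix {U[:, k]}, movie mix {Vt[k]}")
```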

Intuition. If you have a moment, it may be worth spending a thought on the idea of concepts-as-eigenspaces. It is better if you think on your own, but here is how I think of it. The linear operator induced by the rating matrix maps every vector of users to a vector of movies. Each user’s ratings determine a mixture of movies. A linear mixture of users determines a linear mixture of movies. A user vector to a movie vector. The same rating matrix also induces a linear operator the other way around, in the same way. Each movie is mapped to a mixture of users, according to their ratings, and a linear mixture of movies goes to a linear mixture of users. A movie vector to a user vector. The composite of the two linear operators maps a user vector to a movie vector and back to a user vector. The eigenspace of a hermitian is a space where each vector is mapped into itself, multiplied by the corresponding eigenvalue. A taste concept is thus a mixture of users which is mapped to a mixture of movies, a corresponding style concept — and back to itself. It is invariant under the linear transformations. Such eigenspaces form a canonical basis of the vector space of users on one hand and of movies on the other. Every user is a unique mixture of the basic taste concepts; every movie is a unique mixture of the basic style concepts. Concepts are invariants. The solution of Plato’s problem.

Particles and waves of meaning. While the FCA relates a user and an item through some concept, the LSA measures how related they are by adding up their relationships through all concepts. If a user and an item are thought of as nodes in a road network, then

  • the FCA views the traffic from the standpoint of a driver, picks a route, usually one of many, and follows it, whereas
  • the LSA views the traffic from the standpoint of an urban designer or an engineer, and measures the capacity of all routes to transmit the waves of traffic.

Meaning also seems to travel in particles and in waves.

3. Dynamic semantics: language production

The moment of truth. So far, we pursued static semantics as an assignment of words to objects, reduced to points in space. The word meanings were presented as relations between vectors. But meaning is not an assignment of words to objects.

Right: “The Treachery of Images” by René Magritte (Courtesy of LA Museum of Art). Left: Dall-E’s response to the prompt: “Please attach labels in the style of Magritte to a cow, a chair, and a glass of beer.”

It is a process where words assign meaning to each other, objects to each other, words to objects, objects to words, everything to everything. We discussed this in the Introduction (“Who are chatbots?”). Meaning is not a hammer that nails signs to things, but a process that makes everything into a sign. But making everything into anything sounds a little chaotic, doesn’t it? Let us try to make sense of it.

The most useful aspect of static semantics is that it provides the vectors that carry the dynamics of dynamic semantics. Since meaning is a process, dynamics is the essence of semantics. The notion of “static semantics” is almost a contradiction in terms. Language as the process of meaning, communication, information transmission, is the paradigm of dynamics. Just like the spaces of physical phenomena are metric and the spaces of social phenomena are networks, the space of language is time. If “history is one damn thing after another”, language is one word after another, one sentence after another, one context after another: the history of meaning.

Language originates from communication and the theory of communication is first of all a theory of channels. Dynamic semantics is clearly concerned with channels but they are barely mentioned in linguistics. Why is that? It turns out that the theory of language and the theory of channels met early on, and parted in confusion. It’s a funny story, and may be useful later, so here it goes.

3.1. Language production as predictive inference

What is the next word that I am going to…

What was the next word that I was going to say? If the possible answers to such questions are written as conditionals

then their respective conditional probabilities can be written

In the usual notation, the first line would be written

and the other two similarly. Here we won’t write them like that. Changing standard notations is seldom a good idea, but they are seldom this bad.

Frequencies and conditional probabilities. Conditional probability is defined as the ratio

of the frequencies

where #D is the size of the document and # {blah} is the number of occurrences of the phrase “blah” in it. The probabilities of what I am saying then boil down to products of the probabilities of what is the next thing that I am going to say:

and so on. To see why this is true, we need a couple of generalities about chances, which may not be easy for everyone to remember (since they go back to the XVIII century) but they are certainly easy to understand if you give them a chance.

Conditional generalities. We just need the general definition of conditional probability and one of its properties. We are given a family of observables S, a space Ω = {a, b, c, …} of events a, b, c, … ⊆ S, and a frequency or probability distribution [−] : Ω ⟶ [0, 1]. An event is a measurable set of observables, and the frequency distribution tells how likely it is that an event will happen. The basic properties of events and their distributions are listed in Appendix C. The conditional probability is defined⁴

where [ab] abbreviates [a ∩ b], as we do whenever confusion seems unlikely. It follows directly from the definition that conditional probability is transitive:

Setting a = S gives the probabilistic modus ponens as a special case:
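Since the displayed formulas are given as figures, here is a toy numerical check of both properties, using the definition [a ⊢ b] = [a ∩ b] / [a] on an invented four-point space of two fair coin flips.

```python
from fractions import Fraction
from itertools import product

# A tiny probability space: two fair coin flips, uniform distribution.
omega = list(product("HT", repeat=2))
P = {w: Fraction(1, 4) for w in omega}

def prob(event):                    # [a] = sum of the frequencies of the outcomes in a
    return sum(P[w] for w in event)

def cond(b, a):                     # [a |- b] = [a n b] / [a]
    return prob(a & b) / prob(a)

a = {w for w in omega if w[0] == "H"}              # first flip is heads
b = {w for w in omega if w.count("H") >= 1}        # at least one head
c = {w for w in omega if w[1] == "H"}              # second flip is heads

# Transitivity: [a |- b] * [a n b |- c] == [a |- b n c]
assert cond(b, a) * cond(c, a & b) == cond(b & c, a)

# Probabilistic modus ponens (a = S): [b] * [b |- c] == [b n c]
assert prob(b) * cond(c, b) == prob(b & c)
print("transitivity and modus ponens check out on this toy space")
```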

Word chain predictions. Getting back to the next thing that I was going to say, the probabilistic modus ponens and the transitivity say that the chance that I was going to say precisely that is the product of the chance that I start the way I started and the conditional probabilities that I go on the way I went on:

In general, the chance of a sequence of, say, 4 events is

This is the idea of the general rule of text generation:

Chain rule. The chance that a chain of N events will occur in order is

N-grams. Although the chain rule is nice and simple, the context conditions get long when the phrase is long, and lots of probabilities need to be multiplied. On the other hand, the impact of the words far down the context is usually much smaller⁵ than the impact of the words closer to the one that needs to be predicted. So truncating the context makes the prediction a little less precise but much simpler.

A context of length N is called an N-gram. If the contexts of what I was going to say are truncated to 2-grams, then it would be generated like this:

Note the approximation symbol ≈. Since the remote contexts are ignored, the chance of the phrase on the left is only approximated by the product on the right, and not equal to it, as it would be if the full chain rule were used.
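Here is a minimal sketch of such truncated generation: a 2-gram model that samples the next word from the conditional frequencies given only the preceding word. The toy corpus is just the two phrases from the beginning of this section.

```python
import random
from collections import Counter, defaultdict

# A toy 2-gram generator: the next word is sampled from the conditional
# frequencies given only the single preceding word.
text = ("what is the next word that i am going to say "
        "what was the next word that i was going to say").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    bigrams[prev][nxt] += 1                 # counts #{prev nxt} per context

def generate(start, length=10, seed=0):
    random.seed(seed)
    out = [start]
    for _ in range(length):
        counts = bigrams[out[-1]]
        if not counts:                      # no recorded continuation: stop
            break
        words, weights = zip(*counts.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

print(generate("what"))
```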

Origins of information theory. The N-gram approximations were discovered by Claude Shannon and described in his 1948 paper “A Mathematical Theory of Communication”. This paper laid the foundations of information theory. Here is the page with the N-grams:

A page from “A Mathematical Theory of Communication” and its author, Claude Shannon (Photo courtesy of the MIT Museum)

The demonstration of how increasing N generates text that increasingly resembles English left a deep impression. It also brought to the surface a striking semantic phenomenon: the more likely phrases are more likely to appear meaningful. Beyond sampling the N-grams, Shannon proceeded to conditional text generation, which he presented as Markov chains on the next couple of pages.

Subsequent pages from Shannon’s “A Mathematical Theory of Communication” (Public domain)

Shannon proposed this as a novel application of a well-known formalism to language production. What he did not know was that Andrey A. Markov⁶ originally introduced his chains for that very application: to analyze the frequencies of chains of morphemes. Here is a page from Markov’s 1913 study of Pushkin’s epic poem “Eugene Onegin”, a classic of Russian literature.

A page from “Example of Statistical Investigation of the Text of Eugene Onegin Concerning the Connection of Samples in Chains” and the author A.A. Markov (Public domain)

The crucial idea of Markov chains was that the conditional probabilities, construed informally since the earliest Pascal-Fermat discussions about probability and formally since Bayes, should be viewed as state changes. Shannon’s drawings above suggest that. This idea is familiar in games of chance. In the simplest games of dice, illustrated on the left, the dice are thrown repeatedly, and the payoffs are always the same.

Illustrations by Dall-E

On the right is gentle Dall-E’s view of a game of dice where the payoffs vary with the state, represented by the position of the figures on the board. Getting a 6 at one position sends you to jail, at another wins the game. In a Markovian model, the board positions are abstract states, and the chances are summarized by a probabilistic state machine. Remarkably, Markov’s probabilistic machines preceded the deterministic machines (used by Turing in 1936 and by McCulloch and Pitts in 1943) by nearly as many years as Markov’s application to language preceded Shannon’s.
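A probabilistic state machine can be sketched in a few lines; the board, the states, and the transition probabilities below are invented for illustration, with absorbing states standing in for the payoffs.

```python
import random
from collections import Counter

# A toy probabilistic state machine in the spirit of the board game above.
# Board positions are abstract states; "jail" and "win" are absorbing.
transitions = {
    "start": {"safe": 0.5, "risky": 0.5},
    "safe":  {"safe": 0.7, "win": 0.3},
    "risky": {"jail": 0.4, "win": 0.6},
}

def play(state="start", steps=20):
    for _ in range(steps):
        options = transitions.get(state)
        if options is None:              # jail or win reached: the game is over
            break
        state = random.choices(list(options), weights=list(options.values()))[0]
    return state

random.seed(1)
print(Counter(play() for _ in range(1000)))   # long-run chances of jail vs win
```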

But Shannon introduced information theory as a study of a much more general concept: the channel.

3.2. Language as a communication channel

Shannon’s mathematical theory of communication has been used for analyzing information processing and synthesizing communication channels for almost 80 years. Language is our main tool of communication and information processing. Why is the theory of language not a part of the theory of communication? Technically, the two barely intersect and most researchers view them as unrelated. Why is that?

Noam Chomsky argues that one is a science, the other a branch of engineering. He argues that linguistics strives to explain a natural process in the human mind, whereas information theory builds codes to optimize the functioning of antennas and networks. Such optimizations use statistics, whereas the human mind “does not play dice”⁷. Peter Norvig⁸ argued that it does, and that empirical sciences use statistics in any case, justifying statistical learning and predictive inference.

However, the empty space between the theories of language and of communication may not be a consequence of such principled standpoints as much as of some honest mistakes. People lose not only their dice, but also their marbles. Nothing wrong or unusual about that, but retracing some steps may be worthwhile.

The last batch of basics :) Given a frequency distribution [−] : Ω⟶[0, 1] as above, a random variable X : Ω is a sample from Ω according to the given distribution. A source is a sequence of conditional probabilities generating a sequence of random variables:

At each step, the next sample is thus conditioned by all the previous samples from the same source. A channel is a sequence of conditional probabilities over a sequence of input variables X and output variables Y conditioning each other:

The intuition is that the inputs X are produced by the Environment, the outputs Y by the System. In information theory, the inputs are messages, the outputs their encodings. In discourse, the two can be thought of as questions and answers. In semantics, the Xs are concepts, the Ys the phrases that express them; or the signified items and the signifying tokens in language production.

Feedback and feedforward. Using the transitivity of conditional probability from the preceding section, a channel is decomposed as

Using the chain rule, it can be further decomposed into the product of the pair of sequences in column (0) of the following table.

Column (-1) contains the sequence of conditional probabilities where the outputs Y do not depend on the previous Ys: the channel outputs are not fed forward and the channel is feedforward-free. The column marked by (1), on the other hand, contains the sequence of conditional probabilities where the Xs do not depend on the Ys: the outputs are not fed back and the channel is feedback-free. The mnemonic symbols under (-1) and (1) are meant to suggest this, and the marking (↔) under (0) says that both the feedback and the feedforward are present. In columns (-2) and (2), the dependencies are restricted to the immediate predecessor inputs. In the former case the channel is memoryless, in the latter the inputs form a Markovian source.

Memoryless feedback. Column (-1) suggests that the outputs Y of a feedforward-free channel depend on the input source X alone. For memoryless channels, this boils down to the predecessors, and in one of the earliest information theory textbooks, R. Ash⁹ provided an elegant, convenient, and very useful characterization of memoryless channels:

However, the channel dependency cannot be restricted to the source X by condition (⥇) alone if there is feedback from Y to X, and X is therefore not a source. That detail was missed in a good part of information-theoretic research for a good number of years. Feedback was thus tacitly precluded without anyone noticing. Even the careful analysis of the issue in James Massey’s 1990 paper on directed information¹⁰ barely dented the authority of the early source.

Memoryless channels do satisfy Ash’s formula if they are feedback-free. Using the chain rule and transitivity, the distribution of any channel can be reduced to the products

On one hand, memoryless channels satisfy

On the other hand, for feedback-free channels we have

Substituting into the above products yields

which is equivalent to Ash’s formula, by definition of conditional probability.

Memoryless channels with feedback may not satisfy Ash’s formula. Consider the following channel:

The diagram suggests that the transition probabilities are

The two sides of Ash’s formula have different values:
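The concrete channel and its transition probabilities are given in the figures and are not reproduced here, but the phenomenon is easy to recreate with an invented example in the same spirit: a noisy memoryless binary channel whose second input feeds back the first output. Conditioning on both inputs then reveals the first output, and the conditional probability of the output pair differs from the product of the per-symbol channel laws that Ash’s formula would predict.

```python
from itertools import product
from collections import defaultdict

# An invented memoryless binary channel with feedback: Y_t = X_t XOR Z_t with
# P(Z_t = 1) = 0.1, first input X_1 a fair bit, and the feedback X_2 = Y_1.
p_noise = 0.1

joint = defaultdict(float)               # joint distribution over (x1, x2, y1, y2)
for x1, z1, z2 in product((0, 1), repeat=3):
    pr = (0.5
          * (p_noise if z1 else 1 - p_noise)
          * (p_noise if z2 else 1 - p_noise))
    y1 = x1 ^ z1
    x2 = y1                              # the feedback step
    y2 = x2 ^ z2
    joint[(x1, x2, y1, y2)] += pr

def cond(y1, y2, x1, x2):
    """P(Y1=y1, Y2=y2 | X1=x1, X2=x2), computed from the joint distribution."""
    den = sum(joint[(x1, x2, v1, v2)] for v1, v2 in product((0, 1), repeat=2))
    return joint[(x1, x2, y1, y2)] / den

def channel(y, x):
    """The per-symbol channel law p(y | x): flip with probability p_noise."""
    return 1 - p_noise if y == x else p_noise

# Knowing X2 = 1 reveals that Y1 = 1, so the conditional probability is much
# bigger than the memoryless product that Ash's formula would predict:
print(cond(1, 1, 0, 1))                  # 0.9
print(channel(1, 0) * channel(1, 1))     # 0.1 * 0.9 = 0.09
```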

Upshot. The process of language production is a communication channel. It is a feedforward channel because the next word that I output depends on the outputs that I previously produced. It is a feedback channel because the concepts that I am still trying to convey vary with what I have said already. In terms of Xs as questions and Ys as answers, in a feedforward-free conversation, the answers are oblivious to the previous answers. In a feedback-free conversation, the questions are oblivious to the answers already given. Eliminating the feedback from a channel concept limits its applicability to language.

Chatbots are channels with feedback. In the next part of this series, we will see how they use the feedback to learn. But let us first summarize.

4. Scaling up

The Markov-Shannon view of language presents N-grams as states and language production steps as state transitions. When I speak, what I am going to say next is determined by my state of mind. After I say it, my state of mind progresses to the next state and it determines what I will say after that. The process surely proceeds something like this — for a sufficiently flexible notion of state of mind.

The problem is that the notion of N-gram is the least flexible notion of state imaginable. A state of mind reduced to a register of length N doesn’t scale up. If I were to speak my mind from a tiny lexicon of 1,000 words¹¹ and if my attention span is so short that I can only ever remember the last 10 words that I said, then just storing the transition probabilities would require 1000¹⁰×1000 = 10³³ memory cells, far more than the number of stars in the observable universe. So I am certainly not storing an N-gram table in my head. What might be the flexible version of state that overcomes this scalability problem and supports the feedback and the feedforward references? In the next lecture, we will see how the state dependencies are realized as type dependencies. I close this one with an example from the previous lecture. Here is a bird’s eye view of a channel production of Groucho’s elephant sentence, as a descent down a tower of dependent types.

Attributions

Unless stated otherwise, graphics and illustrations were created by the author.

Appendix A: Bigger pizza vectors

Pizza recipes including not only the ingredients but also the crust preparation steps, separated from the topping additions, and the baking temperatures.

Appendix B: Basic notions and notations for vectors

Vectors are tuples of numbers. In a basic algebra course, you would write the veg-meat pizza representations from the Pizzas in space figure as column vectors:

If you remember how to multiply matrices, then you could compute the inner product by writing one of the vectors as a row and calculate

Transposing the column w as the row w* = (30 170) reinterprets the vector as the operation that inputs vectors and outputs numbers, i.e. scalars. They will turn out to be the squares of the lengths of the projections of the input vectors on w. In the VSM context, the equivalent notation for the same vectors is often more convenient:

The transpose w* = (30 170) is now simply

where ⟨veg | and ⟨meat | are, respectively, the projections on the basis vectors | veg ⟩ and | meat ⟩. The assumption that vectors | i ⟩ and | j⟩ form a basis means that they are mutually orthogonal and of unit length, i.e. that the lengths of their projections are

The assumption that | veg ⟩ and | meat ⟩ form a basis thus means that ⟨veg | veg⟩ = 1 = ⟨meat | meat⟩ and ⟨veg | meat⟩ = 0 = ⟨meat | veg⟩. The inner products are calculated for vectors expressed in a given basis, but the values remain unchanged when the basis is changed.

When the vectors | i ⟩ form a basis, the inner product boils down to ⟨x | y⟩ = ∑ᵢ xᵢ yᵢ. The algebra of inner products encodes the Euclidean geometry, from Pythagoras’ Theorem to basic trigonometry. The ket-notation | v ⟩ for a column vector v and the bra-notation ⟨v | for the corresponding row vector were introduced by Paul Dirac, by decomposing the earlier bracket-notation for the inner product ⟨v | w⟩ = ⟨v | · | w ⟩.
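A small numerical companion to the notation, with the (30, 170) vector from the text as |w⟩ and an invented second pizza |v⟩:

```python
import numpy as np

# The (30, 170) pizza vector from the text as a ket (column) and its bra (row);
# |v> is an invented second vector, just to have something to project.
w_ket = np.array([[30.0], [170.0]])      # |w>
w_bra = w_ket.T                          # <w| = (30 170)
v_ket = np.array([[120.0], [80.0]])      # |v>

print((w_bra @ v_ket).item())            # <w|v> = 30*120 + 170*80 = 17200.0
print((w_bra @ w_ket).item())            # <w|w> = 30**2 + 170**2: squared length of |w>
```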

Appendix C: Basic notions and notations for probabilities

The events in Ω are assumed to be closed under intersections and complements:

where ¬a means that a does not happen, and a ∩ b means that both a and b do happen. The obvious consequences are that

Intuitively, a ∪ b means that either a or b happens, ∅ is the impossible event and S the certain event. The frequency distribution [−] : Ω⟶[0, 1] is usually assumed to be at least finitely additive:

which is equivalent to requiring

The basic properties of conditional probability, as defined in the text, can be proved as easy exercises:

Notes

¹René Descartes points to language as the main distinction between humans and animals in his Discourse on Method. That part can be read as a preamble to Turing’s test for distinguishing humans from computers.

²For any pair of vectors x, y, the inner product is symmetric: it equals the length of the projection of x onto y multiplied by the length of y, which is the same as the length of the projection of y onto x multiplied by the length of x.

³One of the papers was called “A Solution to Plato’s Problem”.

⁴The usual definition leaves conditional probability undefined when [a] = 0. The assumption [∅ ⊢ b] = 1 means that after the impossible event ∅, every event b is almost certain. This may be questionable intuitively, but it preempts many irrelevant side-conditions.

⁵There are exceptions. We saw one at the end of the Syntax part, where the word “barge” at the beginning of the paragraph determined how a key could be tipped into a lock by a foot at the end of the paragraph. A sentence at the beginning of a novel may be a key of the turn of events 800 pages later. Remote semantical impacts play a crucial role in large language models.

⁶Andrey A. Markov Sr. lived 1856–1922. His son Andrey A. Markov Jr., also a prominent mathematician, lived 1903–1979.

⁷Just in case, let me clarify that this is not Chomsky, but a paraphrase of Einstein’s “God does not play dice” objection to the statistical foundation of quantum mechanics. Interpreting this statement in the context of black hole theory, Hawking quipped: “Not only does God play dice, but he sometimes throws them where they cannot be seen”. That is at least as true of the human mind. A recent summary of Chomsky’s standpoint is in his NYT article with Roberts and Watumull.

⁸Norvig is a prominent AI researcher. In the golden era of Google, he was their director of research. See his blog post for his side of the argument.

⁹Robert B. Ash, Information Theory. University of Michigan and Interscience Publishers (1963–65). Sec. 3.1.

¹⁰J. Massey, Causality, feedback, and directed information. Proc. of Intl. Symp. on Inf. Theory and Applications held in Honolulu, HI (November 1990)

¹¹An average English speaker uses about 10,000 words, and understands about 20,000.

--


Dusko Pavlovic

Prof at University of Hawaii. Home page: https://dusko.org/. Book: "Programs as diagrams: From categorical computability to computable categories".