The Semiotic Logic of the Perceiver-IO Architecture

Carlos E. Perez
Published in Intuition Machine
Aug 4, 2021

It’s not widely known that the universality of NAND and NOR gates for logic was first articulated by C.S. Peirce. Peirce, however, always contended that logic was a subset of what he called semiotics (see: https://en.wikipedia.org/wiki/Semiotic_theory_of_Charles_Sanders_Peirce ).

In recent days, DeepMind has unveiled a new neural network architecture called Perceiver-IO (see: https://deepmind.com/research/open-source/perceiver-IO ) that is a general architecture for any kind of input and output. In other words, it is an artificial manifestation of a sign-processing or semiotic engine.

What I would like to do here is provide a semiotic interpretation of how Perceiver-IO works. A problem with many explanations of Transformer-based architectures is that it is easy to fall into the trap of believing that understanding the mathematics amounts to understanding the architecture. But as von Neumann said to a student complaining about the inscrutability of quantum mechanics:

Young man, in mathematics you don’t understand things. You just get used to them.

But if we are going to conjure up better neural network architectures, we have to work at the right level of explanation so we can better reason about the architectures we build. Every field has an appropriate level of explanation, and semiotics is that level for semiotic engines like transformers.

So here’s a sketch of a block of the Perceiver-IO architecture.

It’s based on a previous architecture, known as Perceiver, that also exhibited the capability of handling any kind of input.

Note how the input array (in green) is fed back into multiple layers as a ‘cross attention’ in the previous diagram. That cross attention is similar to how you would tie an encoder to a decoder in the standard transformer model:

Note that in a QKV transformer setup, K and V are routed together as a pair. K and V are in fact computed from the same array and so have the same number of elements, while Q may come from an array of a different size. A good mnemonic is to think of Q as a query that is usually smaller than the set of values it queries. The effect of a cross-attention layer is to provide input information to a decoder so that it can incrementally generate its sequence; it supplies information that may be relevant in the decoding process.
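Here is a minimal sketch of cross attention in NumPy (the names and shapes are purely illustrative, not DeepMind’s code). Notice that K and V are computed from the same array and therefore have the same number of elements, while the query array can be much smaller:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # q: [M, d]  queries (e.g. the latent array), M can be small
    # k: [N, d]  keys, routed in a pair with the values below
    # v: [N, d]  values, same length N as the keys
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # [M, N] query-key correlation
    return scores @ v                                  # [M, d] output length follows the queries

inputs = np.random.randn(512, 64)   # 512 input elements
latents = np.random.randn(32, 64)   # 32 queries
out = cross_attention(latents, inputs, inputs)  # K and V are the same array, routed as a pair
print(out.shape)  # (32, 64)
```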

The Perceiver-IO block differs from the Perceiver block in that there is an additional “output query array” that is merged in through another cross-attention block. The authors argue that the innovation here is that this additional block preserves the richness of the outputs. The original Perceiver architecture was confined to classification tasks; note the right side of the architecture diagram, where the network converges to a narrow set of logits.
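A sketch of what this output cross-attention block does (again illustrative, not the actual implementation): the output query array has one query per desired output element, so the output can take whatever shape the task needs rather than a narrow vector of logits:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

latents = np.random.randn(256, 128)          # processed latent array
output_queries = np.random.randn(2048, 128)  # one query per desired output element (e.g. per pixel)

scores = softmax(output_queries @ latents.T / np.sqrt(128))  # [2048, 256]
outputs = scores @ latents                                    # [2048, 128]: a structured output, not just logits
print(outputs.shape)
```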

But let me explain what is happening here in semiotic terms. In basic semiotics, there are three kinds of signs (i.e. icons, indexes, and symbols). In a conventional feedforward network, similarity is computed between the inputs and the weights; that is, an object is compared to its icon. Conventional feedforward networks, or multi-layer perceptrons, are adequate for tasks that involve recognizing the similarity of signs to objects.
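As a rough illustration (a sketch, not any particular library’s API), the basic feedforward operation is just this iconic comparison of an input against stored weight vectors:

```python
import numpy as np

x = np.random.randn(64)      # the object presented to the network
W = np.random.randn(10, 64)  # ten stored "icons" (weight vectors)
similarity = W @ x           # similarity of the object to each icon
print(similarity.argmax())   # the icon the object most resembles
```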

A transformer block, however, is much more complex. It requires at least three similarity operations. The key and query inputs each undergo an iconic transformation. This is followed by a correlation between these two iconic signs. It is this correlation, itself a sign, that undergoes a similarity transform against the internal weights to produce the final output. That final transformation is an index: sequences are signs that are tied together in an indexical manner.
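Written out as code (a sketch with random weights; the semiotic labels live only in the comments), the three similarity operations look like this:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d = 64
x_q = np.random.randn(16, d)    # the query-side sign
x_kv = np.random.randn(100, d)  # the key/value-side sign
W_q, W_k, W_v = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d, d)

# 1. Iconic transformation: queries and keys are projected into a shared space of icons.
Q, K = x_q @ W_q, x_kv @ W_k

# 2. Correlation: the two iconic signs are compared; the correlation is itself a sign.
A = softmax(Q @ K.T / np.sqrt(d))

# 3. Indexical transformation: the correlation selects and mixes the values, tying the
#    elements of the sequence together in an indexical manner.
out = A @ (x_kv @ W_v)
print(out.shape)  # (16, 64)
```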

In semiotic logic, a transformer block is an implementation of a natural proposition (i.e. a Dicisign that does not depend on human cognition). A proposition, as you may recall, involves a predicate relating two objects (i.e. subject, object, predicate). A multi-head transformer block captures a collection of parallel natural propositions that leads to a final argument.

The implementation advantage of the original Perceiver architecture is its lower-dimensional latent space. This allows for deeper pipelines and hence more semiotic transformations, analogous to longer proofs. The network maintains its integrity by syncing back, via cross attention, with the original object. The consequence is that each layer of the Perceiver can attend to different aspects of the original object without having to carry features across the entire semiotic pipeline.
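A shape-level sketch of this pipeline (illustrative sizes only): the latent array is small, so many layers stay cheap, and each layer syncs back with the original inputs via cross attention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    scores = softmax(q @ kv.T / np.sqrt(q.shape[-1]))
    return scores @ kv

inputs = np.random.randn(10000, 128)  # large input array (e.g. pixels or tokens)
latents = np.random.randn(256, 128)   # small latent array: deep processing stays cheap

for _ in range(8):                      # depth operates on the latents, not the inputs
    latents = attend(latents, inputs)   # cross attention: sync back with the original object
    latents = attend(latents, latents)  # latent self-attention: the semiotic transformation
print(latents.shape)  # (256, 128)
```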

Expressed differently, a Perceiver network attends back to the original object to capture information it may not have captured in earlier layers. It is an interactive form of perception, quite reminiscent of saccades. It makes sense for brains to perceive only what is relevant and then fetch back what is needed in subsequent calculations.

The utility of the Perceiver network is that it works universally across all kinds of sensory input. Unlike fine-tuned architectures like CNNs, it does not need a hardwired network architecture to exploit invariances in the input data. CNNs assume that adjacent inputs are relevant to one another, but this doesn’t hold for inputs other than images. Language has greater complexity. In fact, it is likely that images have complexities that CNNs are blind to (see: adversarial examples).

The Perceiver-IO network inherits all these features but adds a twist: instead of only iteratively syncing back to an external object, it also syncs with an internal sign. We’ve seen previously the advantage of having a world model to drive neural networks. This is a related idea, but perhaps more like having an internal language model that can stand in for the world. One can think of it as the network being influenced by its own output (action, in the biological sense). One consequence of this output block is that Perceiver-IO can be used recurrently.

To summarize, Perceiver-IO executes a recurrent semiotic process that is checkpointed by symbols in the Peircean sense.

*Note: A Dicisign is a sign whose truth value involves both a description and a denotation of the same object.
