Code Duality in Diffusion and Transformers

Carlos E. Perez
Published in Intuition Machine
12 min read · Oct 15, 2022

Image: Stable Diffusion

It’s been a few years since I was introduced to Hoffmeyer’s “code duality” in Biosemiotics. It’s frustrating to me that this metaphor is never used to explain deep learning networks’ astonishing generative capabilities. (see: Why Stylistic GANs are So Deceptive)

I believe a historic bias favors dynamical explanations over explanations that embed linguistic elements. This is odd when we know that biology requires DNA to maintain its long-term stability. Yet researchers persist with methods that lack language-like features.

Biology and computers share a common trait: both are rate-independent. Present-day computers are essentially devoid of time constraints; their logic does not depend on how fast it runs. Computers are decoupled from the physical world by actuators and sensors that translate between the digital and continuous domains.

Biology, however, doesn’t have the luxury of a mind that crisply delineates the discrete from the continuous. As a consequence of evolution, biology is gratuitously more complex in its organization. But biology moves in discrete fashion as a consequence of molecular transitions.

Environments, however, feel continuous. Minds evolved to interact with this world through a medium of continuous movement. The complexity of reality is a consequence of alternating layers of discreteness and continuousness. But it is continuousness that leads to predictability.

We can arbitrarily partition perception into two buckets, one of reducibility and another of irreducibility. Irreducibility implies an inability to perceive shortcuts that lead to predictability. Curiously, Newton invented calculus to reason about continuity.

The benefit of a continuous reality is that it’s predictable with analytic mathematics. This predictability has its limits: physics has n-body problems with no analytic solutions, and reality is obviously an n-body system.

Humans invented simple machines to overcome their physical limitations.

Have you observed your world recently and realized that there is a multitude of thingamajigs that make events more likely without expending energy?

There are things in this world that curve the probability landscape such that certain kinds of events become likely. They lower the energy requirements for an event, thus making it more likely to happen.

These things do it in a way that is repeatable and requires no energy loss on their part. They shape probability by their mere existence. They are static things, and static things do not expend energy.

It does sound like a riddle: what are things that do work but don’t require energy? In the biological world there are proteins called enzymes. They are reusable catalysts that make the conversion of molecules more likely.

It sounds like a violation of the 2nd law of thermodynamics; however, there are plenty of analogues of these things in the macroscopic world. These things are known as simple machines, and humans have used them for eons.

There are 6 classic simple machines, identified since the Renaissance:
1. Lever
2. Wheel and axle
3. Pulley
4. Inclined plane
5. Wedge
6. Screw

Simple machine — Wikipedia

Simple machines change the direction and magnitude of force. They allow the same work to be performed with less applied force (not less energy). A biological enzyme doesn’t just rearrange the application of force; it reconfigures molecules.
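As a quick illustration (not from the article) of how a simple machine trades force for distance while conserving work, consider the ideal lever law:

```latex
% Ideal lever: the input end moves a longer distance d_in with a smaller
% force F_in; the work on each side is the same, so no energy is saved,
% only the required force is reduced.
F_{\mathrm{in}}\, d_{\mathrm{in}} = F_{\mathrm{out}}\, d_{\mathrm{out}}
\qquad\Longrightarrow\qquad
\frac{F_{\mathrm{out}}}{F_{\mathrm{in}}} = \frac{d_{\mathrm{in}}}{d_{\mathrm{out}}}
```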

Enzymes are not mere force-translation machines; they are symbol-translation machines. They are the basis of a term-rewriting system. Hence they are part of the building blocks of a biological computer.

Humans have always used machines as metaphors for living things. They imagined that organisms were built of simple machines (i.e. levers, pulleys and wheels). The loom that mechanized textile production was an early inspiration for computers.

Programming patterns: the story of the Jacquard loom | Science and Industry Museum

As technology progressed, humans began using clocks as metaphors for biological systems. Charles Babbage’s mechanical computers were based on the same technologies you would use to construct a clock.

It’s unfortunate: had Babbage known of electrical relays (invented in his time), he could have built an economically viable computer, and humanity would have had electrical computers a century earlier.

Physics studies things that transform energy. What physics does not study are things that transform information. This is despite energy and information being intrinsically intertwined.

Simple machines and enzymes gain their functionality as a consequence of their shape. They change the world because they redirect energy. Said differently, they change the world because they introduce constraints into this world.

What enzymes do is employ a physical constraint as a means of applying a symbolic constraint. They are the bridge that biology employs to achieve computation. Biology’s means of information storage is molecular combination.

Humans, in their creativity, eventually began creating machines that “seem to think”:

Machines which seem to think (1954)

The commonality among these machines is that they couple discrete action or sensing with continuous machinery. Only through the invention of digital computers could we scale “programmability” to higher complexities.

Reproducibility and repeatability, a reason behind the utility of computers, are a consequence of their discrete nature. Thus if we seek robust systems, they must be discrete in nature, but their perception must be continuous.

The Church-Turing conception of computation implies that discrete systems may be unpredictable. The irreducibility of reality itself is a consequence of its discrete nature. We have predictability because of the shortcuts that arise out of the emergence of continuous behavior.

Evolution selected solutions that combined the robustness of digital computation with the interpretability of continuous sensing. At the core of living things is what Peirce identified as the interpretant.

The dynamics of life, and consequently of mind, is the interplay of Peirce’s triadic relationship between objects, their representations (i.e., signs), and their interpretations. Biology distinguishes itself from physics because it involves signs and interpreters. Yet science has physics envy, so many stubbornly cling to formulations that are physics-like (e.g., IIT, FEP) rather than something in the domain of biology.

Biology is dependent on a cycle that involves Von Neumann’s universal constructor. That is, DNA is copied and interpreted by the cell to construct another cell. It involves a process that copies the DNA and a process that interprets the DNA. It’s a constraint closure.

Apoptosis, or “cell suicide,” is perhaps the reverse of this cycle. Biology balances mitosis and apoptosis to keep organisms healthy. A duality between cycles drives both growth and destruction.

Cell suicide for beginners — Nature

So it’s as if there’s an anti-matter-like principle here: for every constraint-closure-driven process, there is an anti-process that drives it in reverse. The Krebs cycle, run in its familiar oxidative direction, drives cellular respiration; run in reverse (the reductive TCA cycle), it generates the building blocks of life.

Thus there’s a dual utility of cycles. The presence of a constraint closure implies a propensity for preservation. If this closure is reversible, then this reversibility may drive another emergent phenomenon.

The universal constructor cycle is possible because the interpretation of DNA is repeatable. But I would like to explore the “reverse” of this process in a higher information regime where mechanical interpretation is flawed.

Hoffmeyer introduced the idea of “semiotic freedom” to depict the creativity that arises from the alternative interpretations of signs. The lossy interpretation of instruction leads to greater creativity. Life is diverse because its interpretive mechanisms lack absolute precision.

Hence there is always value in re-reading your own writing. The same consciousness will interpret the same words differently in time. No man ever steps into the same river twice, for it’s not the same river and he’s not the same man. (see: The Cognitive Flywheel that is Writing)

Cycles are a kind of reversibility. A cycle implies that one can return to a previous state. But if every process is never the same, how can reversibility be possible?

Reversibility, reducibility, predictability, and symmetry reflect the same thing. But it’s interesting that reversibility is rarely associated with the other three. The notion of equivalence in Category Theory is precisely this characteristic: the reversibility of mappings.

Reversibility is fundamental. It’s fundamental even in the laws of physics at the elementary-particle level. At this level of reductionism, the complementarity principle is also fundamental, a principle that is often associated with observer coupling.

Diffusion models are as big a breakthrough as transformer models. It’s a rare development when an architecture requires fewer compute resources than previous proposals. (see: What are Diffusion Models?)

The intriguing bit about diffusion models is how they employ a numerical solver to calculate the reverse flow. It’s rare that you have this level of computational control (see: https://yang-song.net/blog/2021/score/).
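As a rough sketch of what “solving the reverse flow” looks like in the score-based framing linked above: the sampler starts from pure noise and numerically integrates a reverse-time SDE whose drift is the learned score. Everything here (the variance-exploding noise schedule, the hypothetical `score_model` network, the step count) is an illustrative assumption, not the implementation behind any particular model.

```python
# Minimal sketch of reverse-time sampling for a score-based diffusion model.
# `score_model(x, t)` is a hypothetical network approximating the score
# grad_x log p_t(x); the variance-exploding schedule is an assumption.
import math
import torch

def reverse_diffusion_sample(score_model, shape, n_steps=1000, sigma_max=50.0):
    """Euler-Maruyama integration of the reverse-time SDE from t=1 down to t=0."""
    x = torch.randn(shape) * sigma_max            # start from pure noise
    dt = -1.0 / n_steps                           # negative: we integrate backwards in time
    for i in range(n_steps):
        t = 1.0 - i / n_steps
        sigma_t = sigma_max ** t                  # noise level at time t
        g2 = 2.0 * math.log(sigma_max) * sigma_t**2   # g(t)^2 = d(sigma_t^2)/dt for this schedule
        score = score_model(x, t)                 # learned approximation of grad_x log p_t(x)
        x = x - g2 * score * dt                   # reverse drift pushes samples toward high density
        x = x + math.sqrt(g2 * abs(dt)) * torch.randn_like(x)  # injected reverse-time noise
    return x
```

The point relevant to this essay is that the forward (noising) and reverse (generative) flows are two sides of one process, and the reverse direction is computed numerically rather than analytically.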

The Stable Diffusion process takes this to a new level by employing ideas from StyleGAN to control each layer of the reconstruction process.

High-Resolution Image Synthesis with Latent Diffusion Models — Machine Vision & Learning Group

This is next-level because the control variables are not non-parametric distributions but rather raw text. The richness of semantics is immensely greater in textual models. It’s mind-boggling that we can control these massively parallel systems.

It’s serendipitous that the immense capability of diffusion models was revealed through their fusion with transformer models. It appears that the utility of deep learning hinges on its ability to employ language models as input.

Transformer models have been around since 2017, while the current wave of diffusion models is barely a couple of years old. Their uptake will be much faster, given that they can leverage all the computing and software innovations of the last five years (see: Attention Is All You Need).

But are diffusion models something that brains do? Is this how brains retrieve context for subsequent cognition?

Diffusion models are more effective than Generative Adversarial Networks. They can be applied to any kind of modality and can transform any space into any other space.

It’s the ease of reversibility that is shocking. It seems to defy the 2nd law of thermodynamics; it’s Maxwell’s demon instantiated in software.

Maxwell’s demon

This reversibility has been perplexing to me since it was demonstrated in 2015 by @jaschasd and @SuryaGanguli (see: Deep Unsupervised Learning using Nonequilibrium Thermodynamics). In fact, this method highlights the flaw of employing non-parametric distributions as the mechanism for latent representation (see: VAE). Latent representations can be anything! autodesk.com/research/publi…

The mind-bending insight behind all of this is that it’s an error to believe that latent representations can objectively mean something. It is the process that renders meaning to a representation, not the other way around!

It is analogous to biology. How is it that DNA evolved to its present encoding to render life? Is there an objective universal machine code, or is this code a consequence of a subjective process? Said differently, it’s the evolutionary process that renders an interpretation.

Diffusion models like #stablediffusion can recreate an image from just a seed (and its original prompt); DNA functions in much the same way. The only requirement for repeatability is that the seed can be stored in a robust form.
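As a hedged illustration of that seed-based repeatability, here is roughly what it looks like with the Hugging Face diffusers library; the model id, prompt, and seed are arbitrary example choices, and exact reproducibility also assumes the same weights, scheduler, and hardware setup.

```python
# Sketch: regenerating the "same" image from a stored seed and prompt with
# the Hugging Face `diffusers` library (model id, prompt, seed are examples).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor lighthouse at dusk, high detail"   # the "code" half
seed = 1234                                               # the compact, robustly storable half

generator = torch.Generator("cuda").manual_seed(seed)
image = pipe(prompt, generator=generator).images[0]       # same (seed, prompt) -> same image
image.save("lighthouse.png")
```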

Ever since DNA was discovered, it has been perplexing how biology could render its code into DNA. Yet here, with the diffusion process, we have a reproducible demonstration of how that may come about. Is this not, in fact, revolutionary from an abstract framing?

In this framing, let’s review deep learning history again. DL is, at its core, curve-fitting. Backpropagation (i.e. the chain rule) solved the problem of constructing covariant representations across many layers of related representations.
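A minimal, illustrative sketch of that claim: backpropagation is literally the chain rule applied through stacked representations. The toy two-layer network and data below are assumptions for illustration, not anything from the article.

```python
# Sketch of backpropagation as the chain rule, using PyTorch autograd.
import torch

x = torch.randn(8, 4)                      # toy inputs
y = torch.randn(8, 1)                      # toy targets
w1 = torch.randn(4, 16, requires_grad=True)
w2 = torch.randn(16, 1, requires_grad=True)

h = torch.tanh(x @ w1)                     # layer 1: an intermediate representation
pred = h @ w2                              # layer 2: prediction built on that representation
loss = ((pred - y) ** 2).mean()            # curve-fitting objective

loss.backward()                            # chain rule: dloss/dw1 = dloss/dpred · dpred/dh · dh/dw1
print(w1.grad.shape, w2.grad.shape)        # gradients flow to every layer's parameters
```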

Covariant representations make complex iconicity possible. This sets the stage for the recognition of complex indexical relationships, which, oddly enough, was discovered through research in the symbolic space. Transformers were the breakthrough needed for indexicality.

In parallel, through the development of GANs, deep learning was discovered to be extremely competent in recreating images with uncanny precision. This is known not because we can measure it mathematically but because we can see the images. (see: Why Stylistic GANs are So Deceptive)

From here, the combination of the two (i.e., Transformers and StyleGANs) with scalable diffusion models led to today’s image generators controlled by natural-language text prompts. We know this works because we see the results.

Progress is being made not because we have good mathematical measures to tell us how much better one network is than another; it is good because we see with our eyes that it is better. In the old days, it was believed that rendering detail out of VAEs meant lowering the variance.

In today’s image renderers, you simply write a prompt that asks for more detail (e.g., “high detail, 4K”). The encoding in the latent space is simply irrelevant. This reveals that the latent encoding and the outputs do not carry any discernible objective (or mathematical) meaning.

This is a strong argument for the anti-representation stance of enactivist psychology. The intuition is correct, but it is incomplete. It is the semiotic process that is critical. (see: Deep Learning, Semiotics and Why Not Symbols)

I’m particularly fond of Hoffmeyer’s biosemiotics formulation of code-duality. Not many are aware of the concept. But it’s biologically inspired and serves as a good explanation of the recent use of language models to condition diffusion models.

However, Hoffmeyer’s formulation was a hypothesis requiring empirical validation. Diffusion models are a reproducible and repeatable experiment that reveals the validity of code-duality. But it will take a while to show this happening in the messy realm of biology.

Deep Learning is, in its essence, a set of computational tools that evolved out of differential and integral calculus. But unlike calculus, which employs infinitesimal objects, DL employs approximations whose errors are compensated for at massive scale.

Calculus enabled the mechanical calculation of properties of non-straight objects (i.e. curves). It leverages the precision of discrete logic to derive symbolic relations between continuous systems. DL leverages computation to do the same thing.

But it does so without the limitations of purely analytic solutions. We know in physics that analytic solutions are usually impossible to find for real problems. We forget that physics models are driven by reductionism, whose formulations are ideally disentangled from complexity.

Reality, of course, is messy and complex, so just as we discretize space using finite-element numerical models to approximate physical simulations, we do the same in DL to approximate the constraint satisfaction of predictions.
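To make the discretization analogy concrete, here is a small, assumed example of the finite-difference approach: a continuous PDE (the 1-D heat equation) is approximated by repeated discrete updates on a grid, in the same spirit that DL approximates a continuous fitting problem with many discrete parameter updates. Grid size, time step, and initial condition are illustrative choices.

```python
# Finite-difference approximation of the 1-D heat equation du/dt = alpha * d2u/dx2.
import numpy as np

n, alpha, dx, dt = 100, 1.0, 0.01, 0.00004   # dt satisfies stability: dt <= dx^2 / (2 * alpha)
u = np.zeros(n)
u[n // 2] = 1.0                              # a spike of heat in the middle

for _ in range(500):                         # explicit Euler time stepping
    lap = (np.roll(u, 1) - 2 * u + np.roll(u, -1)) / dx**2   # discrete Laplacian (periodic ends)
    u = u + dt * alpha * lap                 # the continuous PDE becomes repeated discrete updates

print(u.max())                               # the spike has diffused and flattened
```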

Thus one should interpret Deep Learning in the same spirit as Fermat and Leibniz (incidentally, both not formally trained in mathematics) rather than from the perspective of Descartes and Newton.

Thus calculus and deep learning are useful in our reality because it is a reality that permits processes to converge toward stability. This idea is a superset of the anthropic principle (see: Anthropic principle — Wikipedia).

But Euler’s identity has this convergent character too: it is an exact equality built from numbers defined by limits, and it holds independently of any particular universe. So it should not be a surprise that the analytic equations of physics employ these convergent numbers!

e and π are numbers defined by convergent limits: e is the limit of compounding growth, and π is the limit relating a circle’s circumference to its diameter. i simply relates rotational symmetry to linearity; it is an artifact of using complex numbers and is not strictly needed: https://geometry-of-relativity.net/rotations-in-space/eulers-formula-and-geometric-algebra/
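As a small restatement of those limits (the Leibniz series is just one of many equivalent convergent definitions of π), together with Euler’s formula and identity:

```latex
e = \lim_{n \to \infty}\left(1 + \tfrac{1}{n}\right)^{n},
\qquad
\frac{\pi}{4} = \sum_{k=0}^{\infty} \frac{(-1)^{k}}{2k+1},
\qquad
e^{i\theta} = \cos\theta + i\sin\theta
\;\Rightarrow\;
e^{i\pi} + 1 = 0
```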

Nature is somewhat predictable because processes converge either linearly or periodically. The utility of calculus is a consequence of the ability to express change in terms of formulas that converge. That is, computations that halt.

Deep Learning computations are evaluated by observing their convergence. That is, a non-convergent network (often a consequence of bad initialization) isn’t a useful network. So both the observation of reality and reality itself require convergent mechanisms.

This leads to the question of how biological processes come to have convergent properties. Life maintains its robustness as a consequence of DNA. Certainly, there can be other forms of life that don’t require DNA, but none that we know of lead to the complexity we find on Earth.

This leads us to the inevitable conclusion about Deep Learning systems and discrete information. The surprising capabilities of artificially fluent systems are a direct consequence of code duality. Transformers and diffusion models are useful because they exploit this architecture.
