Diffusion Models Are Like Episodic Memories

Carlos E. Perez
Published in Intuition Machine
9 min read · Oct 16, 2022

Generated in MidJourney

Before humans invented computers, memory was already a metaphor. Memory is a process that is deeply embedded in our sense of being. It is not a thing. Yet we think of it as a thing because of computers (see: “Nothing Erased But Much Submerged”).

Is our use of memory in cognitive science a problematic construct? If so, is there an alternative, and will it be better? (see: Memory metaphors in cognitive psychology)

Memory, in brains, is the product of the brain’s function, unlike in computers, where we’ve engineered specialized memory circuits. Memory is an emergent phenomenon and is not a cause of cognition.

The brain is full of self-referential cycles like this, so it’s all too easy to mistake an effect for a cause. A useful way to think of memory is as a constraint. Brains create constraint closures, and these constraints are memory-like.

Memory-likeness is pervasive in the brain. Intuition is a memory-like kind of learning. You can’t think of learning in the absence of memory. But it is accurate to frame learning as memory-like and how the brain preserves experience through a memory-like process.

The brain is composed of memory-like processes, each kind serving a role in overall cognition. The difficulty with memory-like processes is that they are non-linear. In physics, processes that have memory are typically modeled as non-linear (e.g., non-Newtonian fluids).

It remains a mysterious thing how memories are stored in brains. But perhaps Diffusion models reveal a clue. There is a certain kind of holographic robustness in billions of trained parameters.

If memories are sparsely distributed across the brain, then memory is more process-like than thing-like. Although we know how to store memories robustly (see: digital memories and DNA), we do not know how brains do so.

When we think of memories as things, we imagine impossible ideas, such as transferring our consciousness into other minds. Memories as processes make such an idea completely preposterous.

When we undergo anesthesia, we have no memories of being under. If we think of memory as a thing, how can the brain be shut off and then return to its normal memories? Anesthesia does not shut off the brain; it just disrupts the flow of interactions between neurons.

The brain is a process that is in constant interaction with itself. However, it is heterogeneous, so damage in parts of the brain affects different capabilities and hence our access to our memory-like processes. We recall because we can interpret these processes.

Thus we have few memories of our infanthood. The memory-like processes of that time still exist, but they have evolved so considerably that any remnants are difficult to interpret.

Unfortunately, similar to deep learning diffusion models, the encoding is noise-like and cannot be reconstructed without a corresponding interpreter. This is analogous to a cell not being able to interpret foreign DNA. We remember based on cues. Humans use symbols as a tool to cue (or queue) their recall. This explains why there is no locus for memory and often we rely on external triggers to get us going.

Two beams of light create a hologram: one bouncing off an object and another that serves as a reference beam. The interference pattern between the two is recorded on film. To recreate the object, a reference beam is shined through the film.

But unlike photography, a portion of the film can recreate the original 3D render. Holography is an example of a physical distributed representation. What is not to like about this metaphor? What other kind of recording has this kind of distributed encoding? That said, it’s perhaps not a good enough metaphor because you cannot capture multiple images in the same film. Holographic storage is needed, but I don’t think it is common enough to be a good metaphor.

Do people who have practiced how to see the world (e.g., photographers) have richer episodic memories? Do people who have musical training have richer memories of performed music? Do people who are fluent in a language have a more accurate memory of what was said than a person who isn’t fluent? In short, do people remember things that they do not understand?

Why do we forget our memories of our infanthood?

As we grow, each distinct self outsources its cognition to the unified collective self. In so doing, these selves become habituated to behavior that assumes the unified self is present. Like driving a car on the highway, you often forget that you are driving.

There are two kinds of forgetting: the forgetting that happens because you can no longer frame a memory in a long-lost language, and the forgetting that habits introduce. But what are habits other than frozen expressions?

To recall something very far into the past, we must continuously re-interpret our memories in our changing neuro-language. Some can remember some infanthood memories, but they continually remind themselves of these memories.

This framing reveals a problem. A mind that learns is discovering new languages of expression. This implies that the old memories expressed differently are more likely to be forgotten. This is why experts forget how it is to be a beginner.

This is also why the language of experts is difficult to grasp unless it is expressed in the language of beginners. This is also why one cannot understand a new concept unless one derives it in one’s own internal language. To understand is to discover new languages of expression.

But old memories may not be gone forever. Alternative states of mind can lead to mental framings that can interpret these lost memories and translate them to our current neuro-language. Language is like a prism that requires the right orientation.

The most vivid of memories are those that are most relevant to us at the time of experience. These are the kinds that most strongly map to our past identities. To access these, we must enter a reference frame of that past self. It is not obvious how this change in reference can be consciously controlled. The key is knowing that humans change reference frames all the time when they change *who* they are conversing with. We can tell who someone is speaking to on the phone by their change in mannerisms.

Humans maintain multiple selves, and they emerge from the background when we have conversations. That’s because conversations are empathic activities that adjust your framing to include the thoughts of another. This framing also influences your perception and thus your self.

We are who we are because of memories of conversations. Conversations that have occurred while you were in the womb of your mother. Your mind is more than the encoding in DNA. It’s a consequence of your development process. We are who we are as a consequence of the conversations that we’ve experienced.

Our memories are grounded in the illusion of our self. To see the illusion of a self demands the capability of seeing the unity or wholeness of things. That is to integrate information into a single whole. An illusion is cognitive trickery to believe that something indeed exists when in fact it does not. When our minds fill our blindspot with imagery, we believe that imagery represents what we can’t see. The illusion is reinforced when tested against reality.

A multitude of cognitive processes that permit us to recognize wholes or see discrepancies are recruited to perceive the illusion of a self. The self does not come into being absent from processes of perception. You cannot be fooled if you cannot see. We see the illusions of the self because our brains are wired to perceive and act on the world. Without this wiring, we would lack the capability to perceive; hence there would be no illusion of self and therefore no consciousness.

There are different kinds of consciousness (i.e. illusions of selves). What it is to be a bat is different from what it is to be a dolphin. Thus the construction of their identity will differ and therefore their illusions of self will differ.

Thus when we speak of consciousness, it’s not a spectrum that projects onto a single dimension like the colors of the rainbow. Rather, it’s a multidimensional object that is informed by a being’s umwelt.

Umwelt — Wikipedia

Human consciousness is different from dolphin consciousness because our umwelt is radically different. In the same way, human intelligence is different from dolphin intelligence. We cannot possibly understand a dolphin’s ability to use sonar to recognize shapes.

Fundamentally, living things are selves. Perhaps not all are complex selves, but they all exist to propagate their selves. The selfish-gene idea is perhaps an oxymoron because genes by themselves do not have bodies.

But what about machines that do not have selves? How can a machine gain consciousness if it lacks a concept of a self? What does it mean to have an illusion of a self if an intelligence has no experience of living as a self? To have a self requires developing from being a self. To have consciousness you need an illusion of a self. To have an illusion of self, you have to have a self. To have a self requires first being a self. We are who we are because we grew to be who we are.

But what about the tools that make us who we are? Surely we aren’t who we are absent from the technologies of our civilization. Growing is an inside-out process that involves interaction with the world.

This realization gives you a hint of how conscious AI may be constructed. You begin with a primordial self that learns its way to mastering existing technologies. The point is that you always begin with something that has the abstract characteristics of living.

It’s often overlooked that neural networks do not predict the effect of a cause. Induction does not predict the effect; rather, it predicts the cause of the effect. This is known as anti-causal learning (see: Generalization in anti-causal learning).

Diffusion models are curious new architectures that appear to box well above their weight class. Pound for pound, they are the most effective neural networks in existence! Let’s examine their training from an anti-causal perspective. The idea of diffusion models: “slowly destroy the structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data” (see: https://arxiv.org/abs/1503.03585).
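To make the forward half of that quote tangible, here is a minimal sketch in plain Python. It is not the paper’s code. It assumes a Gaussian noising step and a linear noise schedule from 1e-4 to 0.02 over 1000 steps (a common DDPM-style choice, used here only as an assumption), and the names forward_diffusion and correlation are made up for illustration. The "data" is a square wave, and the sketch tracks how much of that structure survives as the steps accumulate.

```python
import math
import random


def forward_diffusion(x0, betas, rng):
    """Return the list [x_0, x_1, ..., x_T] of progressively noisier copies.

    Each step applies x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise,
    a Gaussian forward step (assumed here; other corruptions are possible).
    """
    trajectory = [list(x0)]
    x = list(x0)
    for beta in betas:
        keep = math.sqrt(1.0 - beta)
        mix = math.sqrt(beta)
        x = [keep * v + mix * rng.gauss(0.0, 1.0) for v in x]
        trajectory.append(x)
    return trajectory


def correlation(a, b):
    """Pearson correlation, used as a crude measure of surviving structure."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)

# Structured "data": a square wave with period 20 (hypothetical toy signal).
x0 = [1.0 if (i // 10) % 2 == 0 else -1.0 for i in range(200)]

# Assumed linear schedule over 1000 steps.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]

trajectory = forward_diffusion(x0, betas, rng)

for step in (0, 10, 250, 1000):
    print(step, round(correlation(x0, trajectory[step]), 3))
```

The printed correlation starts at 1 and decays toward 0: after enough steps the square wave is statistically indistinguishable from noise, which is the "destroyed structure" that the reverse process has to undo.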

What is learned is the reverse process of starting from a destroyed structure (i.e., the effect) and discovering the original structure (i.e., the cause). It’s anti-causal inference, but across a multitude of iterations. It's reversible computation.

But what’s clearly odd is that it transitions from a state of greater entropy to a state of lesser entropy. It’s even odder when you realize that you don’t have to destroy structure randomly; you can employ a deterministic method.

This process differs from conventional neural networks because diffusion models are not trained to be classifiers. That is, they are not trained to predict a label. Analogous to self-supervised methods, they are trained to predict another distribution.
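One way to see "no labels" in practice is the noise-prediction setup popularized by DDPM-style models, which I am swapping in here as an assumption (the text above does not commit to a specific objective). The sketch below is a toy: the data is a 1D Gaussian, the "network" is just a line fitted by least squares, and every name (noisy_sample, fit_linear_noise_predictor, ALPHA_BAR) is hypothetical. The only training target is the noise that was mixed in, which the process generates itself.

```python
import math
import random


def noisy_sample(x0, alpha_bar, rng):
    """Mix a clean value with Gaussian noise; return (noisy value, the noise)."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps


def fit_linear_noise_predictor(pairs):
    """Least-squares fit of eps ~ w * x_t + c (a stand-in for a neural net)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_e = sum(e for _, e in pairs) / n
    cov = sum((x - mean_x) * (e - mean_e) for x, e in pairs) / n
    var = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    w = cov / var
    c = mean_e - w * mean_x
    return w, c


rng = random.Random(0)

# Assumed toy setup: clean data ~ N(3, 0.5^2), one fixed noise level.
DATA_MEAN, DATA_STD, ALPHA_BAR = 3.0, 0.5, 0.5

pairs = [
    noisy_sample(rng.gauss(DATA_MEAN, DATA_STD), ALPHA_BAR, rng)
    for _ in range(5000)
]
w, c = fit_linear_noise_predictor(pairs)

# For Gaussian data the best linear predictor is known in closed form,
# which gives something to compare the fit against.
w_star = math.sqrt(1.0 - ALPHA_BAR) / (ALPHA_BAR * DATA_STD ** 2 + 1.0 - ALPHA_BAR)
c_star = -w_star * math.sqrt(ALPHA_BAR) * DATA_MEAN

print("fitted:", w, c)
print("closed form:", w_star, c_star)
```

Nothing in the training pairs is a human-provided label. The target is manufactured by the corruption process, which is what makes the setup resemble self-supervised learning.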

But why are diffusion models so efficient? It is because they use numerical methods from calculus to compute the reversal process! There are many algorithms (i.e., “samplers”) for this, developed over decades (see: https://yang-song.net/blog/2021/score/…)
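As a rough illustration of what such a sampler does, here is a sketch of the simplest one, Euler’s method, applied to the probability flow ODE from the score-based formulation. The assumptions are heavy and worth stating: the data is a 1D Gaussian so the score is known exactly (a real model would use a trained network instead), the forward process is a variance-preserving diffusion with constant BETA, and all names and constants are hypothetical.

```python
import math
import random

# Assumed toy setup: data ~ N(DATA_MEAN, DATA_STD^2), constant noise rate BETA.
BETA = 1.0
DATA_MEAN, DATA_STD = 3.0, 0.5
T_END = 10.0  # long enough that the noised data is close to N(0, 1)


def marginal(t):
    """Mean and variance of the noised data at time t (closed form here)."""
    a = math.exp(-0.5 * BETA * t)
    mean = DATA_MEAN * a
    var = (DATA_STD * a) ** 2 + 1.0 - a * a
    return mean, var


def score(x, t):
    """Gradient of the log density at time t; analytic for a Gaussian."""
    mean, var = marginal(t)
    return -(x - mean) / var


def probability_flow_drift(x, t):
    """dx/dt of the deterministic ODE that shares the diffusion's marginals."""
    return -0.5 * BETA * (x + score(x, t))


def euler_sample(x_T, n_steps):
    """Integrate the ODE backwards from t = T_END to t = 0 with Euler steps."""
    dt = T_END / n_steps
    x = x_T
    for i in range(n_steps):
        t = T_END - i * dt
        x = x - dt * probability_flow_drift(x, t)
    return x


rng = random.Random(0)

# Start from pure noise and run the sampler on each draw.
samples = [euler_sample(rng.gauss(0.0, 1.0), 500) for _ in range(1000)]

sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(
    sum((s - sample_mean) ** 2 for s in samples) / len(samples)
)
print("mean:", sample_mean, "std:", sample_std)
```

Starting from standard normal noise, the samples should land close to the data distribution (mean near 3, spread near 0.5). Swapping Euler for a higher-order ODE solver is the kind of decades-old numerical tooling the paragraph above refers to.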

The method is so efficient that diffusion models are now used not only to reconstruct images but also to predict temporal change across subsequent images (i.e., video), resulting in text-to-video generation (see: https://imagen.research.google).

This development hints at a way to do next-sequence prediction, the bread and butter of much larger and more resource-intensive transformer models. Coincidentally, the Phenaki project employs transformer models to predict its time sequence: https://phenaki.github.io/#interactive

Here’s the difference between the newly revealed text-to-video networks: Make-A-Video, Imagen Video and Phenaki.

For future reference, here are the different text-to-image models. Take note of the modularity of these systems.

Twitter: @savvyRL

It’s an eerie coincidence that this coordination of diffusion and transformer models appears to map to how we conceive brains to work. Two kinds of memory, episodic and procedural, are coordinated via navigation.

https://www.researchgate.net/publication/281022346_Towards_the_synthetic_self_making_others_perceive_me_as_an_other
