Memory In Various Contexts

What do an OR-Latch, a game bot and ChatGPT have in common?

Yamac Eren Ay
The Quantastic Journal
12 min read · Jul 22, 2024


Weirdest Explanation of Memory
Illustration of memory. By DALL-E.

This time, I decided to fit everything memory-related into a single article so that we can explore the hidden connections between concepts across different fields. Be ready for some surprising revelations and a-ha moments!

Memory is an observable property or function of a system that stores data in a reusable way. Ideally, a memory should persist some information as requested and respond with the “same” information whenever it is queried. (Note that we will relax this “sameness” criterion a bit later.) Which property of a memory unit justifies this functionality?

There are four candidates worth considering: recurrence, recursion, reoccurrence and repetition. They are all so similar that I bet you use them interchangeably on a daily basis. In fact, each term has its own context and nuance, and each is appropriate for different situations:

Table of definitions. By Yamac Eren Ay.

In my humble opinion, recurrence best explains how memory should operate. Now that we’re done with the necessary evil of terminology, let’s start with the most primitive memory models.

Memory in Electric Circuits

If you didn’t have to take some sort of mandatory “Introduction to Computer Architecture” class during your studies, here are some extra resources to get you started:

Now, here is my humble contribution to this topic: everything starts with a simple recurrent connection. Simply put, if there were no recurrence, there would be no memory. Take the following OR-Latch for example: if you set A to true at any point, the output YES stays true forever. Building on the OR-Latch, the SR-Latch also allows resetting the memory, by defining two parallel gates with mutual recurrent connections.

Left: OR-Latch, Right: SR-Latch. By Yamac Eren Ay.
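To make the recurrence explicit, here is a minimal Python sketch of both latches (my own toy simulation, not hardware description code): each gate’s output is fed back into its input, which is exactly what keeps the bit alive.

```python
def or_latch(inputs_a):
    """Simulate an OR-Latch: output = A OR previous output (recurrent feedback)."""
    out = False
    history = []
    for a in inputs_a:
        out = a or out          # once True, the feedback keeps it True forever
        history.append(out)
    return history


def sr_latch(inputs_s, inputs_r):
    """Simulate an SR-Latch built from two cross-coupled NOR gates."""
    q, q_bar = False, True      # assume a defined start state
    history = []
    for s, r in zip(inputs_s, inputs_r):
        # let the cross-coupled gates settle: each output feeds the other gate
        for _ in range(3):
            q = not (r or q_bar)
            q_bar = not (s or q)
        history.append(q)
    return history


print(or_latch([False, True, False, False]))       # [False, True, True, True]
print(sr_latch([0, 1, 0, 0, 0], [0, 0, 0, 1, 0]))  # [False, True, True, False, False]: set, hold, reset, hold
```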

On top of the SR-Latch, one can create an even more sophisticated circuit called the D-Latch, which can be regarded as a wrapper for the two most basic memory operations, namely saving and reading data. All these binary building blocks lay the groundwork for Random Access Memory (RAM) and make computers able to remember something.

Left: D-Latch, Right: RAM (with D-Latch as unit).
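Continuing the toy simulation above (again my own sketch, not real circuitry), a D-Latch can be modeled as a wrapper that stores the data bit D whenever the enable signal is high, and a row of such latches already behaves like a one-word RAM:

```python
class DLatch:
    """A D-Latch: stores the data bit D whenever enable is high, holds it otherwise."""
    def __init__(self):
        self.q = False

    def step(self, d, enable):
        # internally, a D-Latch drives an SR-Latch with S = D AND enable, R = (NOT D) AND enable
        if enable:
            self.q = bool(d)
        return self.q


class TinyRAM:
    """A toy 'RAM' word made of independent D-Latches, one per bit."""
    def __init__(self, width=4):
        self.bits = [DLatch() for _ in range(width)]

    def write(self, word):
        for latch, bit in zip(self.bits, word):
            latch.step(bit, enable=True)

    def read(self):
        return [int(latch.q) for latch in self.bits]


ram = TinyRAM(width=4)
ram.write([1, 0, 1, 1])
print(ram.read())   # [1, 0, 1, 1] -- the recurrent latches keep the word until overwritten
```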

Remind yourself that this is just one logical implementation of memory, and possibly one of endlessly many. Such models are useful for understanding how memory in our brain works, but NOT sufficient. For that, let’s consider how information might be stored and evaluated in our brain, without going too deep into neuroscientific details.

Memory in Neural Circuits

In my previous article series (linked below), I briefly mentioned how neural spike data can be learned using non-parametric density estimation, and briefly explained that neural spikes correspond to 1s and the rest to 0s. What I was describing was, in fact, temporal coding.

However, it is challenging to match two spike samples due to the apparent randomness and sparsity of the process. It might not be economically viable for the brain to encode all information with such fine granularity and precision.

A more plausible approach would be to encode information using the expected instantaneous spike rate ⟨r⟩, i.e. the average number of spikes per unit time. Below are two simple examples of how to estimate such an instantaneous spike rate (a small code sketch follows the list):

  1. Fire together, wire together: averaging over similar neurons firing at the same time.
  2. History repeats itself: averaging a single neuron’s activity over different time points.
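Here is the promised sketch (toy NumPy code with made-up numbers, not real recordings): it generates Bernoulli spike trains at 20 Hz and estimates the instantaneous rate once by averaging across neurons and once by averaging a single neuron over a time window.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spike trains: rows = similar neurons, columns = 1 ms time bins,
# 1 = spike, 0 = no spike, generated as a Bernoulli process at ~20 Hz.
rate_hz, dt = 20.0, 0.001
spikes = (rng.random((50, 2000)) < rate_hz * dt).astype(int)

# 1) "Fire together, wire together": average across similar neurons at each time bin.
population_rate = spikes.mean(axis=0) / dt        # instantaneous rate in Hz, per time bin

# 2) "History repeats itself": average one neuron's activity over a sliding 100 ms window.
window = 100
kernel = np.ones(window) / (window * dt)
temporal_rate = np.convolve(spikes[0], kernel, mode="same")   # smoothed rate in Hz

print(population_rate.mean(), temporal_rate.mean())  # both hover around the true 20 Hz rate
```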

Rate coding is generally more relevant because it allows our brain to learn stable neural states called “attractors”. In the plot below, read the state as the spike rate and the energy as (roughly) the likelihood of transitioning to another state. It is hardest to switch to another state when the system is stuck in one of the two attractors. In this sense, a neuron corresponds to a D-Latch encoding binary states.

Visualization of attractors. Source.

Technically, if you put many such binary neurons together to form a fully connected neural network, you can achieve behavior similar to RAM.

Enter Hopfield Networks: recurrent neural networks consisting of neurons that are all connected to each other, where the strength of the connection between two neurons is given by the pairwise weight w_{ij}.

Given that the high states are encoded as +1 and low states as -1 (and not 0), the energy function E of the Hopfield network looks like:

E = −½ · Σ_{i,j} w_{ij} · s_i · s_j

Energy function E of the Hopfield network.

Using this function, we can define the stable states (attractors) of the network. The system evolves by updating the state of each neuron so as to minimize the energy. If two strongly connected neurons share the same state, their contribution to the energy is at its lowest. Do you now see what “fire together, wire together” means?

The update rule (recurrence relation) for the state of neuron i is given by:

s_i ← sign( Σ_j w_{ij} · s_j )

Recurrent update rule for the state s_i.

where sign(x) is the sign function that returns +1 if x is positive and -1 if x is negative. By iteratively applying this update rule, the network can recall stored patterns from partial / noisy inputs. So, associative short-term memory in its purest form!
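To make this concrete, here is a minimal NumPy sketch of a Hopfield network (a single made-up pattern, Hebbian weights, asynchronous sign updates) that recalls the stored pattern from a corrupted copy:

```python
import numpy as np

rng = np.random.default_rng(42)

# Store one pattern of +1/-1 states using the Hebbian ("fire together, wire together") rule.
pattern = rng.choice([-1, 1], size=32)
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0)                      # no self-connections

def energy(s, W):
    return -0.5 * s @ W @ s                 # E = -1/2 * sum_ij w_ij s_i s_j

def recall(s, W, sweeps=5):
    s = s.copy()
    for _ in range(sweeps):                 # asynchronous updates: s_i <- sign(sum_j w_ij s_j)
        for i in range(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Corrupt 6 of the 32 bits and let the network settle back into the stored attractor.
noisy = pattern.copy()
flip = rng.choice(len(pattern), size=6, replace=False)
noisy[flip] *= -1

recovered = recall(noisy, W)
print(energy(noisy, W), "->", energy(recovered, W))   # energy decreases
print(np.array_equal(recovered, pattern))             # True: the pattern is recalled from noise
```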

Memory in Reinforcement Learning

https://towardsdatascience.com/crystal-clear-reinforcement-learning-7e6c1541365e
A Reinforcement Learning environment. Source.

Now we proceed to a completely different field: Reinforcement Learning! In simple words, an agent (e.g. a person) observes the current state (e.g. belly fat), performs an action (e.g. working out) with the intention of maximizing its future rewards (e.g. a healthy life), and transitions to the next state (e.g. less belly fat).

In theory, there are cases (which fully qualify as Markov decision processes) where an agent can observe its exact state, and memory is not necessary. Take chess for example: given any game position, you already have all the information you need to make the most well-informed move, and you can simply forget what exactly happened in the opening.

https://www.scaler.com/topics/artificial-intelligence-tutorial/pomdp/
Partially observable Markov decision processes. Source.

However, in practice, it is often not possible to observe the state entirely, but only a small subset of features roughly describing it, and the quality of such observations directly affects how the agent perceives its environment. In the same chess example, if you can only observe the last move (and not the whole board), then you definitely need all previous observations to reconstruct a full picture of the board. So, can we conclude that retaining the entire history of observations yields the best outcomes?

https://lod2.eu/stack/
Visualization of a database. Source.

Accuracy-wise, yes, but it introduces a huge performance drawback: storing the full history requires significant memory and becomes computationally expensive, especially when the number of observations grows fast. One proposal would be to simply accept the trade-off and store a limited number of previous observations instead of the full history, as sketched below. Can we do better?
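For illustration, a fixed-size observation window can be as simple as a bounded deque (a toy sketch, not tied to any particular RL library):

```python
from collections import deque

WINDOW = 4                                   # keep only the last 4 observations
history = deque(maxlen=WINDOW)               # older observations are dropped automatically

def observe(obs):
    history.append(obs)
    return list(history)                     # the agent sees this stacked window, not the full history

for t, obs in enumerate(["e4", "e5", "Nf3", "Nc6", "Bb5"]):   # e.g. the last chess moves
    state = observe(obs)
    print(t, state)
# at t=4 the window is ['e5', 'Nf3', 'Nc6', 'Bb5'] -- 'e4' has been forgotten
```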

A possibly more practical approach to persisting previous states in memory is to maintain a belief system, which I tried to explain in the article below:

The key takeaway from my article: more often than not, it suffices to learn a few characteristics, and not all states, to be able to make well-informed decisions. It is the same as trying to find the average weight of a group of people after someone joins the group: instead of summing all weights and dividing by the number of people each time someone joins, you can just use the previous average to compute the new one on the fly.
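That running-average trick fits in a few lines (a minimal sketch of the incremental mean update, with made-up weights):

```python
def update_mean(mean, count, new_value):
    """Incrementally update an average without storing the full history."""
    count += 1
    mean += (new_value - mean) / count       # new mean = old mean + correction term
    return mean, count

mean, count = 0.0, 0
for weight in [70, 82, 65, 90, 74]:          # people joining the group one by one
    mean, count = update_mean(mean, count, weight)
print(mean)                                   # 76.2, same as sum([70, 82, 65, 90, 74]) / 5
```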

https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator
Fitting a curve at an arbitrary precision. Source.

However, remind yourself that all models are wrong, even if some are useful. A naive belief system lacks the ability to generalize well to real-life use cases; for example, the weight distribution of a group of people need not be normally distributed. This is a drawback of naive belief systems, where prior assumptions are made about the underlying ground truth, so we need a more data-driven approach. From now on, keep in mind that the memory won’t remain exactly the “same” anymore (recall the sameness criterion from above).

Memory in Neural Networks

The challenge now is how to estimate an arbitrary distribution in the most general way possible. Enter Deep Learning models, universal function approximators. I’ve explained this concept in simple terms in another article of mine, so feel free to check it out:

It’s crucial to recognize that Deep Learning is not merely a single algorithm or model architecture. Rather, it is a broad framework spanning a wide variety of architectures, best practices, challenges and learning components; see below for examples:

https://medium.com/r/?url=https%3A%2F%2Fwww.labellerr.com%2Fblog%2Fall-about-deep-learning-models-that-you-should-know%2F
Different (artificial) neural network architectures. Source.

Typically, in a classical problem, the model is expected to return an answer based on the input alone, without considering previous answers. However, in tasks where the sequence of inputs is key, such as Natural Language Processing (NLP), we need a way to remember previous information, or, as mentioned often enough in this article, recurrence in neural networks! Enter Recurrent Neural Networks (RNNs), as shown in (e).

Luckily, I came across an article on geeksforgeeks.org which covers them very well; absolutely recommended:

Despite the fancy wording, RNNs are actually quite straightforward: they are ML models that, in addition to taking the standard input x_t, also consider the previous output h_{t-1} when predicting the current output h_t. This allows the model to learn and maintain an internal state h, which effectively represents its belief based on the incoming updates x. Unlike traditional belief systems that rely on strict assumptions, this internal state is an approximate, data-driven representation.
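To make the recurrence concrete, here is a minimal vanilla RNN cell in NumPy (random, untrained weights and made-up sizes; a sketch of the idea, not a usable model): the hidden state h_t is computed from the current input x_t and the previous state h_{t-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5

# Randomly initialized parameters (in practice learned by backpropagation through time).
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b   = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b): the previous output feeds back in
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)                     # initial internal state / belief
for x_t in rng.normal(size=(7, input_dim)):  # a sequence of 7 input vectors
    h = rnn_step(x_t, h)
print(h.shape)                               # (5,) -- a compressed summary of the whole sequence
```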

https://medium.com/analytics-vidhya/natural-language-processing-from-basics-to-using-rnn-and-lstm-ef6779e4ae66
A simple generative task performed by RNNs. Source.

In theory, this enables RNNs to achieve short-term memory in its most fundamental form. However, in practice, RNNs struggle to retain information from far earlier in the sequence due to a problem known as vanishing gradients. This issue causes the influence of earlier data points to diminish as the time distance increases, which makes it difficult to learn long-term dependencies.

https://medium.com/r/?url=https%3A%2F%2Fwww.linkedin.com%2Fpulse%2Frnn-lstm-gru-why-do-we-need-them-suvankar-maity-joegc
Comparison of different sequential model architectures. Left: RNN, Middle: LSTM, Right: GRU. Source.

To address this, Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) were developed. These are advanced types of RNNs designed to remember long-term information more effectively. LSTMs introduce gating mechanisms to better control the flow of information, allowing them to maintain and retrieve relevant information over longer sequences. It has been experimentally shown that LSTMs can remember on the order of 1,000 previous states, making them significantly more powerful than standard RNNs for tasks requiring long-term memory.
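As a rough sketch of what “controlling the flow of information” means, here is a GRU-style gated update in NumPy (random, untrained weights; LSTMs additionally keep a separate cell state and an output gate): the gates decide how much of the old state to keep and how much of the new candidate to let in.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 3, 4
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Randomly initialized GRU parameters (learned in practice).
W_z, U_z = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))   # update gate
W_r, U_r = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))   # reset gate
W_c, U_c = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))   # candidate state

def gru_step(x, h):
    z = sigmoid(W_z @ x + U_z @ h)          # how much of the state to overwrite
    r = sigmoid(W_r @ x + U_r @ h)          # how much of the past to use for the candidate
    h_cand = np.tanh(W_c @ x + U_c @ (r * h))
    return (1 - z) * h + z * h_cand         # gated blend of old state and new candidate

h = np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):
    h = gru_step(x, h)
print(h)                                     # the retained longer-term state
```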

However, LSTMs still treat all pieces of information with roughly equal importance, which can be inefficient and quite noisy. This limitation is tackled by Transformer models, which introduce a mechanism called Self-Attention: it dynamically weights previous states depending on their relevance to the task at hand.

https://medium.com/carbon-consulting/transformer-architecture-how-transformer-models-work-46fc70b4ea59
Encoder-decoder transformer architecture. Source.
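A bare-bones version of that weighting is scaled dot-product self-attention; the NumPy sketch below uses random projections and omits multiple heads, masking and positional encodings.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
X = rng.normal(size=(seq_len, d_model))          # one embedding per previous state/token

# Random projection matrices (learned in a real model).
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)              # how relevant is each position to each other one
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1

output = weights @ V                             # each position = weighted mix of all values
print(weights.round(2))                          # large entries = "pay attention here"
print(output.shape)                              # (6, 8)
```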

This means that the model can prioritize key information from the past, similar to how human memory works: humans do not process all information equally, but rather focus on the most relevant details in a noisy environment. Transformers mimic this kind of selective attention, which makes them far more powerful than their predecessors.

https://research.google/blog/transformer-a-novel-neural-network-architecture-for-language-understanding/
Selective attention achieved by transformers while performing text generation. Source.

From this point on, you’ll encounter a wide variety of algorithms and countless hybrid or novel approaches, so don’t be scared by the complex landscape; keep the bigger picture in mind. There are numerous articles that explain the transformer architecture in detail; here are some you might find interesting:

Memory in Large Language Models

As a bonus topic, I want to briefly point out the form of memory that is currently most researched: memory in Large Language Models (LLMs). Nowadays, we see LLMs everywhere, from language translation to content creation and chat programs. Take ChatGPT for example, a chat program from OpenAI which has become an integral part of our daily lives thanks to its accessibility and high performance.

https://www.gruender.de/kuenstliche-intelligenz/gpt-4-kostenlos-nutzen-239152/
GPT-4 from OpenAI. Source.

It’s built on top of GPT (Generative Pre-trained Transformer) models with hundreds of billions (or even trillions) of learned parameters, trained on high-quality data such as Wikipedia articles. This huge amount of parametric information allows the model to understand context and determine which tokens in a sequence to focus on. Basically, the model has already learnt what and how to “memorize”, which makes it the perfect candidate for a chatbot.

ChatGPT is just a fancy wrapper around these sequential models, and nothing fundamentally new. A very cheesy but concrete example: open a fresh conversation and type a few basic facts about yourself, e.g. that your age is 21. Even after a 50-message-long conversation about music tastes (which has nothing to do with your age), if you ask the bot again how old you are, it will most probably remember correctly, as you can see below. Seemingly the closest to how we memorize and remember in real life!

ChatGPT conversation. By Yamac Eren Ay
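Under the hood, that kind of conversational “memory” is mostly the full message history being re-sent to the model on every turn. The sketch below illustrates this; llm_generate is a hypothetical stand-in for whatever chat-completion API you use, not a real library call.

```python
# Hypothetical helper: replace with your actual chat-completion call (e.g. an LLM SDK).
def llm_generate(messages):
    return "You told me you are 21."          # placeholder response for illustration

conversation = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message):
    conversation.append({"role": "user", "content": user_message})
    # The model itself is stateless: its only "memory" is the history we re-send every turn.
    reply = llm_generate(conversation)
    conversation.append({"role": "assistant", "content": reply})
    return reply

chat("Hi, I'm 21 years old.")
# ... 50 messages about music tastes later ...
print(chat("By the way, how old am I?"))      # the age is still in the re-sent history
```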

Finally, I don’t want it to sound like I’m solely advertising the most modern approaches. We stand on the shoulders of giants, and it’s often easier to build on top of a system than to create one from scratch. ChatGPT is fancy indeed, but not as fancy as its predecessors.

Final Thoughts

As we’ve explored throughout this series, memory is a fundamental aspect of both biological and artificial systems, and it seems to be more complicated than just “remembering”. From the basic mechanisms in electric circuits to the sophisticated models in neural networks, understanding how memory works can provide deep insights into the nature of intelligence and learning.

To keep this article as simple as possible, I left the biological aspects untouched, but if you’re interested, I recommend checking this out, which answers how the brain stores memories:

Whether you’re a student, a researcher, or simply an enthusiast, I hope this series has provided you with valuable insights and sparked your curiosity to further explore the fascinating world of memory in various domains. I’m happy to receive your feedback!
