Fight Bias with Bayes — The Experimental Connection Between Machine Learning and the Biological Brain
This blog post is mainly based on a fascinating paper by Hideaki Shimazaki, whose article I will explain and discuss in the context of other research papers.
TL;DR.
The Bayesian Brain Theory illustrates the interplay between biological and artificial intelligence, showing how machine learning mirrors the workings of biological brains. Both gather information through Bayesian inference, where information is represented as probability distributions.
As animals grow and age, the brain activity associated with spontaneous and stimulus-evoked responses becomes increasingly similar, a process seen consistently throughout the lifespan.
The thermodynamic model of the brain builds on the first law (conservation of energy) and the second law (increasing entropy). Conservation acts as a constraint that makes neural activity more efficient, while increasing entropy serves as a mechanism for learning and interpreting stimuli from the external world.
The Bayesian framework can also be employed in many other contexts, giving insight into machine learning algorithms and human cognition and biases alike.
What is Bayesian neurology, and how does it relate to Machine Learning?
On the surface, the mechanisms of how our minds work may seem elementary and straightforward.
But first, we have to go back to the very crude beginnings of organisms in order to understand these evolutionary objectives: What is the point of having a brain? And how can information be so important?
“Information, as originally defined by Shannon, is a reduction of uncertainty. Selection means the elimination of a number of possible variants or options, and therefore a reduction in uncertainty. Natural selection therefore by definition creates information: the selected system now “knows” which of its variants are fit, and which are unfit” (Gershenson & Heylighen, 2003).
In other words, there is an evolutionary reward for creating and managing information, and in particular for reducing uncertainty. In one of my favourite research papers, H. Shimazaki demonstrates how even something as simple as the retina's reaction to different light intensities requires a non-linear response. Somehow, the primary visual cortex already “knows” and responds in a way that requires more complex inference than a linear response can provide. That the retina and primary visual cortex “know” this can be understood as an evolutionary adaptation. Furthermore, we cannot consciously decide to adjust our retinae to particular thresholds for different light intensities. Instead, we interpret this response mechanism as an unconscious inference (more on that later).
How machine learning can help us understand the brain better
The frameworks of machine learning can be applied to investigate the biological brain in a variety of ways. Such models can be employed in several ways as means of identifying patterns in brain activity. This is where Bayesian inference comes in, as a simple and elegant approach to explaining the brain's intrinsic complexity.
Bayesian inference in relation to brain models can be illustrated with an example of two events, A and B, and the crucial probabilities between them that have to be considered:
Posterior: the probability of event A happening, given that B has already happened. This is the probability you do not know and intend to find out.
Prior: what you know beforehand about event A and its probability.
Likelihood: the probability of event B given event A, a known probability which you have already experienced before.
It is the relationship between these different probabilities that is the core essence of Bayesian inference.
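As a minimal sketch of how these three quantities combine through Bayes' rule, with numbers invented purely for illustration:

```python
# A minimal sketch of Bayes' rule for two events A and B.
# All numbers here are made up purely for illustration.
prior = 0.3        # p(A): what we know about A beforehand
likelihood = 0.8   # p(B|A): probability of B given A
p_b = 0.5          # p(B): overall probability of the evidence B

posterior = likelihood * prior / p_b   # p(A|B) by Bayes' rule
print(posterior)   # 0.48
```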
To employ this Bayesian inference framework further, we have to divide its mechanisms into parameters, following Shimazaki's concept.
x = neural population
w = brain structure
y = external stimuli
Put in simpler terms: x is how the neurons behave, indicating the properties of the neural population's activity. W is the very structure of how the biological brain is built up, and how it holds previous information (in the form of probability distributions). Y is the symbol for our external world, here manifested as stimuli. The external world provides us with samples that we can interpret as external stimuli. The letter p indicates the probability of a parameter being true.
It must be noted that the brain can only interpret a single sample at a time through stimuli from the “true” external world. A sample of the external world in the form of stimuli is not the same as the complete external world itself!
What the dynamics of the brain tell us is that the model distribution p (y|w) will converge towards the true distribution of the external world. In other words, the inner workings of the brain will gradually come to resemble the external world more closely: this is the very phenomenon of learning. The model also treats x as a vector, where the i-th element represents the activity of the i-th neuron in the population.
The brain is effectively building up data in the form of countless old and proven probability distributions, which are then cross-checked, and the most likely match is selected for what we are experiencing at the present moment.
In order to compare the probability distribution of what you are experiencing now (stimulus Y and likelihood) with what your brain has experienced before (brain structure and prior W), we apply the Kullback-Leibler divergence as a means of measuring the statistical distance between the two distributions (here noted “Y” and “W” instead of the traditional “P” and “Q”).
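A small sketch of how such a divergence can be computed for discrete distributions; the two distributions below are invented stand-ins for the brain's model and the external world.

```python
import numpy as np

# Kullback-Leibler divergence D_KL(P || Q) between two discrete
# distributions; the values here are illustrative only.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# "What the brain has experienced" vs "what is experienced now":
brain_model    = np.array([0.1, 0.4, 0.5])   # hypothetical p(y|w)
external_world = np.array([0.2, 0.3, 0.5])   # hypothetical true distribution

print(kl_divergence(external_world, brain_model))  # > 0; zero only if identical
```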
By that, the parameter of the brain W will seek to optimise its marginal likelihood. The log function is used because it is numerically more convenient and makes computing small decimal numbers easier. The likelihood is called marginal because the internal neural activity x is averaged (marginalised) out of the joint model.
Using an argmax function, just as in machine learning, W is optimised by choosing the distribution with the highest likelihood from a set of candidates; the result is noted W*. The argmax function can be viewed as a mechanism from mathematics and computer science that simply picks the argument giving the highest (or lowest) value from an array of numbers.
The model for learning: W* = argmax_w log p (Y_1:n | w)
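As a toy sketch of this learning rule, the snippet below scores a handful of invented candidate parameters w against observed binary stimuli and picks the one with the highest log-likelihood:

```python
import numpy as np

# Toy sketch of W* = argmax_w log p(Y_1:n | w): pick the parameter w
# under which the observed samples are most probable. Here w indexes
# candidate Bernoulli models of a binary stimulus; all values are invented.
candidate_w = np.array([0.2, 0.5, 0.8])    # candidate "brain structures"
Y = np.array([1, 1, 0, 1, 1, 1, 0, 1])     # observed stimulus samples Y_1:n

# Log-likelihood of the samples under each candidate w:
log_lik = np.array([np.sum(Y * np.log(w) + (1 - Y) * np.log(1 - w))
                    for w in candidate_w])

w_star = candidate_w[np.argmax(log_lik)]   # the argmax step
print(w_star)   # 0.8, the model closest to the empirical rate 6/8
```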
In this way, Shimazaki demonstrates a generative model (preceding the current prevalence of Generative AI) in the form of a joint probability function that describes the connection between neural activity and stimuli.
p (y|x,w) is noted as the observation model.
p (x|w) is referred to as spontaneous activity, because no stimulus (y) is present.
p (y|w) is the sensory stimulus model: the distribution of stimuli expected given the brain structure.
p (y,x|w) = p (y|x,w) p (x|w)
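To see the factorisation in action, here is a hypothetical two-state sketch: spontaneous activity x is drawn from the prior p (x|w), and a stimulus y is then drawn from the observation model p (y|x,w). All the tables are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch of p(y, x | w) = p(y | x, w) p(x | w):
# first draw spontaneous activity x from the prior, then a stimulus y
# conditioned on x. This two-state world is entirely made up.
p_x_given_w = np.array([0.7, 0.3])        # p(x|w): spontaneous activity
p_y_given_xw = np.array([[0.9, 0.1],      # p(y|x=0, w): observation model
                         [0.2, 0.8]])     # p(y|x=1, w)

x = rng.choice(2, p=p_x_given_w)          # sample a neural state x
y = rng.choice(2, p=p_y_given_xw[x])      # sample a stimulus y given x
print(x, y)
```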
To make it more comprehensible, we can contemplate this calculation in a more tangible way:
The probability of the stimuli and the neural activity occurring in combination, given the brain structure holding previous information. This factorises into:
The probability of the stimuli being true, given the neural activity and the brain structure, multiplied by the probability of the neural activity, given the brain structure. Quite a mouthful! Shimazaki illustrates this further as the concept of the generative model: it generates models of the outside world (Y). Generative model = activity in the observation model × spontaneous activity. It compares how the brain behaves during stimuli with how it behaves without stimuli.
The neural activity initiated by a stimulus is considered a sample from the posterior distribution. This means we obtain a posterior probability density of the neural activity x given the observation Y.
We can demonstrate this by putting in some easy numbers.
y = 0.35. Here we have a large degree of uncertainty about the stimulus from the outside world, just like asking whether what we are experiencing now is true.
x = 0.60. We have some degree of certainty about how the neurons behave.
w = 0.80. We have a high degree of certainty about our prior knowledge, or should we call it prior experience, which resides in the brain.
As demonstrated above, we had greater uncertainty about the stimuli from the outside world (Y), but ended up with greater certainty in the posterior distribution. By combining the numbers from the respective probabilities, we actually reduced the amount of uncertainty. However, it must be added that the calculation becomes more precise if we use a distribution function rather than a single number representing the probability. A distribution function provides more data through its bell shape: a mean (location parameter) and a standard deviation (scale parameter) (Kruschke).
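To see how distributions reduce uncertainty better than single numbers, here is a minimal Gaussian sketch of a prior-times-likelihood update. The means and standard deviations are invented; the posterior always comes out narrower than either input.

```python
# Minimal Gaussian sketch of uncertainty reduction: combining a prior
# (what the brain already knows) with a likelihood (the current stimulus)
# yields a posterior narrower than either. Numbers are invented.
prior_mean, prior_sd = 0.0, 2.0   # broad prior: uncertain beforehand
lik_mean, lik_sd = 1.0, 1.5       # noisy evidence from the stimulus

# Standard conjugate-Gaussian update (precision = 1 / variance):
prior_prec, lik_prec = 1 / prior_sd**2, 1 / lik_sd**2
post_prec = prior_prec + lik_prec
post_mean = (prior_prec * prior_mean + lik_prec * lik_mean) / post_prec
post_sd = post_prec ** -0.5

print(post_mean, post_sd)   # post_sd = 1.2 < min(prior_sd, lik_sd)
```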
In addition, Shimazaki points out that a perfect inference would be very unlikely, so the posterior distribution for the stimulus-evoked neural activity is expressed as an approximation, q (x|y) ≈ p (x|Y,w). This approximation is called the recognition model: recognition happens when the stimulus (Y) is combined with the structure of the brain (w), which holds prior knowledge.
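As a toy illustration of such an approximation, the sketch below computes the exact posterior by Bayes' rule and then picks the best match from a restricted one-parameter family by minimising the KL divergence. All numbers are invented, and in this tiny example the family can match the posterior exactly, which a real brain's recognition model typically cannot.

```python
import numpy as np

# Toy sketch of a recognition model q(x|y) ≈ p(x|Y, w): the exact
# posterior via Bayes' rule, then the closest member of a restricted
# one-parameter family by minimising KL divergence. Numbers invented.
p_x = np.array([0.7, 0.3])            # prior p(x|w)
p_y_given_x = np.array([0.1, 0.8])    # p(y=1 | x, w) for x = 0, 1

# Exact posterior p(x | y=1, w):
joint = p_x * p_y_given_x
posterior = joint / joint.sum()

# Restricted family q(x) = [1 - theta, theta]; minimise KL(q || posterior):
def kl(q, p):
    return np.sum(q * np.log(q / p))

thetas = np.linspace(0.01, 0.99, 99)
best = min(thetas, key=lambda t: kl(np.array([1 - t, t]), posterior))
print(posterior, best)   # best theta sits close to posterior[1]
```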
All this happens inside our faculties, and most of these processes are actually unconscious inference. Our likelihoods and priors are both conscious and unconscious, and we live and learn by these two parameters throughout life.
The thermodynamic properties of neural entropy and its constraints in the Bayesian brain
The researcher also presents another fascinating concept: the thermodynamic mechanism of the brain. How does the process of learning behave in itself? How do the changes of state inside the brain manifest themselves through neural spike dynamics?
The dynamics of neural activity are expressed through the thermodynamic law of conservation, via the state of spontaneous activity, and through the second law, which states that entropy increases, here manifested as a process of learning. These dynamics are employed by modulating the gain of the interplay between feedforward signals from more primitive parts of the brain and feedback from higher cortical areas. As this back-and-forth communication requires a time delay, it can be measured and detected. For the sake of this blog post, we could view this as loosely similar to how computers work, with feedback and feedforward streams between components: in broad terms, short-term storage in our hippocampus resembles Random Access Memory (RAM), while long-term memory resembles a solid state drive or hard disk drive.
“Similarly to the gain control in engineering systems, neural systems can realize the gain control by either feedforward or feedback connections…We show that the delayed gain control of the stimulus response via recurrent feedback connections is modelled as a dynamic process of the Bayesian inference that combines the observation and top-down prior with time-delay” (H. Shimazaki).
The thermodynamic costs of storing and creating information
In order to retain and generate knowledge through new stimuli, there must be an energy requirement involved!
This is because the recognition model q (x|Y) requires energy: it must first be activated, which means the brain must change from its initial, lower state of spontaneous activity p (x|w), where prior information is stored. The same holds true for the observation model p (Y|x,w). This implies a change of state, and therefore energy is required to initiate the observation process. On the other hand, the first law of thermodynamics formulates the conservation of energy: the total energy in a closed system can neither be created nor destroyed. This gives rise to several limiting factors, such as the refractory period, which limits the number of action potentials a given nerve cell can produce per unit of time, the limited number of neural configurations, and the metabolic “costs” of the firing rates themselves. All of these put a constraint on the neural activity as a whole.
Shimazaki expresses the entropy of the stimulus-evoked neural activity as the standard entropy of the recognition model, H[q] = − Σ q (x|Y) log q (x|Y), maximised under constraints.
The constraints on this entropy are weighted biases: the neural firing rate α and the gain control β between feedback and feedforward. The recognition model q (x|Y) is thereby constrained by the negative log of the prior and the negative log of the observation model. The author solves this with the Lagrange multiplier method, in order to find the minimum of the free energy.
The result is a probability density function for which the free energy is as close to zero as possible. Under the constraints, the probability density function of the approximated recognition model takes the weighted form q (x|Y) ∝ p (x|w)^α p (Y|x,w)^β, the same structure as the decision equation later in this post.
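A minimal sketch of that weighted density, assuming the tempered form above; the prior and observation tables are invented. Setting α = β = 1 recovers the exact Bayesian posterior, while other weights bias the recognition model.

```python
import numpy as np

# Sketch of a recognition model under weighted constraints, assuming
# the tempered form q(x|Y) ∝ p(x|w)**alpha * p(Y|x, w)**beta
# (the same weighted structure as the decision equation below).
p_x = np.array([0.7, 0.3])          # prior p(x|w), invented numbers
p_y_given_x = np.array([0.1, 0.8])  # observation model p(Y|x, w)

def recognition(alpha, beta):
    unnorm = p_x**alpha * p_y_given_x**beta
    return unnorm / unnorm.sum()    # normalise to a probability density

print(recognition(1.0, 1.0))  # alpha = beta = 1: the exact posterior
print(recognition(2.0, 0.5))  # heavier prior weight drags q towards p(x|w)
```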
The second law of thermodynamics states that entropy in closed systems always increases. In the brain, this can be seen in how neural activity, under the constraints of controlled activity trajectories and firing rates, assembles itself into a state that creates and stores information. It has been demonstrated that the development of such information (hence, priors) is evident in the neural activity of animals: the internal spontaneous activity optimises itself by becoming more similar to the external stimulus-evoked activity, converging as the animal grows.
Bayesian framework applied to decisions
Some other, similar papers demonstrate how cue combination can create new likelihoods and priors. When different stimuli happen simultaneously, such as smell, vision, sensations, and coinciding events, they build up into new probability distributions. Hypothetically, say we have previously experienced that events A, B, C, and D happened at the same time. When we then encounter a similar event where only A, B, and C occur, we will automatically make associations and therefore expect event D to happen as well.
Cue combination
Our senses can act together as cues to create a grouping of simultaneous stimuli. They therefore represent joint probabilities. For example, a certain smell can act as a reminder of a haptic experience or a sense of place: the smell of freshly baked food may remind someone of their childhood. Similarly, the sight of a particular mountain may recall memories of a past vacation. Through this combination of several senses into one experience (hence, cue), memories and feelings become more vivid and powerful. The sound of waves crashing on the beach, combined with the smell of salty air and the warmth of the sun, can create a powerful nostalgic feeling. Cue combination as a phenomenon can be particularly persuasive and, dare I say, misleading, something that politicians and marketers alike are fully aware of, as we will get back to later.
Bayesian decision-making
Let's apply this mode of inference to the context of decision theory. We can easily demonstrate this both in contexts of biological and machine intelligence.
In this way, we simply apply the same Bayesian inference to the context of decision-making: we replace the neural activity probabilities with a hypothesis as the prior and data as the evidence. Biases are applied as weights α and β, changing the decision outcome. This works through the equation below:
log P(H|D) = β log P(D|H) + α log P(H) + const.
We can illustrate this by asking more relatable questions (a small sketch follows the list below):
Are we forgetful? Then the weighted bias α of the prior is weaker.
Are we stereotyping? Then the weighted bias α of the prior is stronger.
Are we being too flexible? Then the weighted bias β of the likelihood is stronger.
Are we being too rigid? Then the weighted bias β of the likelihood is weaker.
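Here is a minimal sketch of that weighted update for two competing hypotheses; the priors and likelihoods are invented. Turning the α knob up mimics stereotyping, turning it down mimics forgetfulness.

```python
import numpy as np

# Sketch of the biased decision rule
#   log P(H|D) = beta * log P(D|H) + alpha * log P(H) + const
# for two competing hypotheses. Priors and likelihoods are invented.
prior = np.array([0.9, 0.1])        # P(H): hypothesis 0 is the stereotype
likelihood = np.array([0.2, 0.8])   # P(D|H): the data favour hypothesis 1

def biased_posterior(alpha, beta):
    log_post = beta * np.log(likelihood) + alpha * np.log(prior)
    post = np.exp(log_post - log_post.max())   # subtract max for stability
    return post / post.sum()                   # "const" absorbs normalisation

print(biased_posterior(1.0, 1.0))  # unbiased Bayesian update
print(biased_posterior(3.0, 1.0))  # stereotyping: the prior dominates the data
print(biased_posterior(0.2, 1.0))  # forgetful: the prior barely matters
```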
If we were calibrating computer algorithms, the same types of questions concerning inference could be asked.
How Artificial Intelligence can exploit the Achilles' heel of our biological brain
We can also put this theoretical decision model into relevant and pressing present-day matters. Let's say the topics of choice are politics and elections. As an example we could ask: What is the likelihood that this politician's statement is true, given the specific circumstances? What previous evidence, or priors, do we have? As we all know by now, if the aim is to create confusion and bewilderment, digital algorithms can be used with great effect to manipulate the likelihoods and priors in our minds. There is of course nothing new about newspapers and media deliberately pandering to our prejudices. It is just that nowadays it can be done more efficiently by means of digital technology, such as flooding opponents or targeted readers with malevolently constructed or skewed news content. To that end, it has become a common way of bypassing our sense of logic and tampering with the very same mental concepts, the forgetfulness, rigidness and stereotyping among readers or voters, just as illustrated in the equation above. Just like the retina reacting to light, brains simply cannot turn off the old reflexes awakened by these unconscious inferences from past experiences, be it sensory impressions or consumed digital content.
Applying this concept to political science gives it even more credibility, along with fascinating new approaches. Will a political party change its political cause and orientation away from X, if X is refuted? Or will the party disregard new evidence and continue to treat X as a valid political cause and orientation? (Update 21/08/2023, further relevant findings: https://psycnet.apa.org/record/2023-92406-003.) Given that this concept encapsulates the core of cognitive perception, and can portray how amplification of existing beliefs works, it is difficult to find a context where this Bayesian inference framework does not hold relevance or applicability.
Bayes Beats Bias
On the flip side, this idea can also be used to our advantage. We can use it as a mental tool for relatable everyday events, in order to make more informed decisions.
Simply put, remember to consciously reverse the question for any given context: “What is the probability that B will happen, given A (the likelihood)?” This means that you can use the likelihood as a simple and quick tool against bias. Even if it does not take the other elements of the equation into account, it will still improve your estimate, as shown below.
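An illustration with invented numbers shows why the reversal matters: a high likelihood P(B|A) can coexist with a low posterior P(A|B) when A is rare.

```python
# Why reversing the question matters: the likelihood P(B|A) can be high
# while the posterior P(A|B) stays low. Classic rare-event setup;
# all numbers are invented for illustration.
p_a = 0.01             # P(A): the event is rare (the prior)
p_b_given_a = 0.95     # P(B|A): B almost always accompanies A (likelihood)
p_b_given_not_a = 0.05 # P(B|not A): but B also occurs without A

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)     # ~0.16: far lower than the 0.95 likelihood suggests
```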
Summary: looking into the essence of Shimazaki’s probabilities
Putting this into more relatable words, we are essentially just analysing three elements: What you are thinking without stimuli, what you are thinking during stimuli, and what your brain recalls from previous stimuli/experiences.
If we consider these three elements, they can be arranged in different mathematical relations to each other. One of the biggest questions is how the brain activity changes from experiencing stimuli to not experiencing stimuli. Gain modulation then mediates between these, through “cross-checking” with higher brain regions that are more intelligent than the simple regions that just take in stimuli, hence a top-down mechanism. For instance, a delayed response (in milliseconds) can imply feedback from the higher cortices.
If we go back and look at it again in simple mathematical terms, we can entertain and distinguish ideas of mental processes such as:
p (x|w) = introspection, thinking: Spontaneous brain activity separated from stimuli.
p (y, x|w) = the probability that you are thinking something (x) combined with what you are experiencing (y), given what you have previously learned (w, the brain structure).
p (y|w) = sensory stimulus, and neurons activated and occupied with sensory input (not structural) = What your sensory brain is doing when experiencing stimuli.
p (y|x,w) = observation model, where the active neural populations (x) are combined with the brain faculties/structures holding established probability distributions (w) to calculate the Bayesian inference.
The natural environment is not stationary but continuously changing, and living organisms need to adapt and learn. Hence, organisms need to infer which properties and phenomena are the most stationary, and build up the complexity from there.
On the artificial side of things, machine learning has been gaining a lot of speed recently. We are beginning to see a sort of race between humans and AI over who is most effective and efficient at reducing uncertainty. Machines are leveraging their speed, accuracy, and data-crunching abilities, while what we should definitely make use of is our creativity, intuition, and, for now, superior abductive problem-solving skills.
“If the human brain were so simple that we could understand it, we would be so simple that we couldn't.” (Emerson M. Pugh)
I did it Bayes' way
Understanding the inner workings of biological and artificial intelligence is a powerful and future-proof tool to possess, as well as an incredibly interesting subject in itself. There is no doubt that this knowledge can be proactively employed in almost any circumstance that demands sound decisions. Once you have seen this mind-boggling concept, you cannot unsee it.
My greatest respect and admiration for the brilliant and inspiring scientific work done by Hideaki Shimazaki.
https://medium.com/@monadsblog/the-kullback-leibler-divergence-5071c707a4a6
https://machinelearningmastery.com/argmax-in-machine-learning/
https://www.elsevier.com/books/doing-bayesian-data-analysis/kruschke/978-0-12-405888-0
https://www.science.org/doi/abs/10.1126/science.1195870
https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/wcs.1540
https://psycnet.apa.org/record/2023-92406-003
https://www.frontiersin.org/articles/10.3389/fnins.2018.00734/full