Simulating Brain Regions in CNNs

Niranjan Rajesh
Published in Bits and Neurons
10 min read · Aug 3, 2023

Convolutional Neural Networks (CNNs) were inspired by animal visual systems as a way to artificially perform vision-based tasks like image classification and object recognition. These CNNs have evolved a lot over time, optimising for each problem they encounter. How similar are today’s networks to the visual cortex of animal brains? Well, the short and easy answer is that they are not very alike at all. In this article, I discuss the longer answer and recent attempts to make these CNNs more ‘brain-like’.

Some Motivation

CNNs and other computer vision models have been making huge strides in recent years as they approach (and sometimes surpass) human performance on certain visual tasks (click here for my previous article, which offers more background on CNNs). However, getting the answer (mostly) right to the question “What do you see here?” when presented with an image with a clear subject is not all that humans can do. We can perform a much wider range of tasks in the visual domain. We can understand visual scenes containing many stimuli in our receptive field and quickly grasp the intricate relationships between the objects in the scene. We also generalise very well. For example, if a person has never seen a certain breed of dog before and encounters it for the first time, they can confidently label it as a dog thanks to their prior understanding of what a dog should look like. The same cannot be said about our neural network equivalents. Besides the problem of generalisation, there is the small matter that these networks only learn after being exposed to A LOT more data points than humans do (orders of magnitude more). And finally, CNNs are very susceptible to adversarial attacks.

Adversarial attacks are a way to fool a CNN by minimally modifying the input. Take the diagram below, for example. When an image of a panda is modified ever so slightly with a carefully crafted noise pattern, the model is unable to recognise it, whereas we can tell that both are images of a panda without any drop in our confidence.
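To make this concrete, here is a minimal sketch of how such a perturbation can be crafted with the fast gradient sign method (FGSM), one common attack. The model, image tensor and epsilon here are placeholders for illustration, not details taken from any of the papers discussed below.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.007):
    """Fast Gradient Sign Method: nudge every pixel in the direction
    that increases the classification loss the most."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # For a small epsilon the perturbation is imperceptible to us,
    # yet it is often enough to flip the model's prediction.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```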

The classic panda adversarial example; source

All of the abovementioned factors tell us that CNNs are learning visual features and representations that are quite different from our own. In other words, these models are ‘learning to see’ in ways that differ significantly from how evolution taught primate brains to see. This is the long answer to the question I posed at the beginning: we are still quite far away from brain-like CNNs. Now, how can we change this? For the remainder of this article, I will go over very recent attempts in AI and neuroscience research to address this matter. Most of the work I will discuss is from the talented people at the DiCarlo Lab at MIT.

Brain-Score

The first paper I want to talk about is the one that presents Brain-Score. The paper’s main contribution is a benchmark for assessing how ‘brain-like’ a vision model is: Brain-Score. This work is worth mentioning because it gives us a way to quantitatively assess how similar the workings of a neural network are to those of the primate brain. The assessment is done by evaluating a model in terms of neural and behavioural predictivity. In simpler words, the model is assessed on how well it is able to model, or ‘predict’, the neural and behavioural responses of a primate brain (the paper uses neural data from macaque monkeys) when looking at a specific stimulus. The greater the predictivity, the more alike the model is to the primate brain in that specific way.

An overview of the components of the Brain-Score benchmark; source

Neural predictivity establishes how well the model’s internal representations match the internal representations of a primate brain. This is achieved with a simple linear regression outlined in the paper. Neural data from the monkey V4 and inferior temporal (IT) cortex regions are used for this metric.
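As a rough illustration of what ‘predictivity’ means here, the sketch below fits a regularised linear map from model activations to recorded neural responses and scores its predictions on held-out images. The actual Brain-Score pipeline (PLS regression, cross-validation splits, noise-ceiling normalisation) is more involved, so treat this only as the general idea; the array names are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

def neural_predictivity(model_features, neural_responses, train_idx, test_idx):
    """Fit a linear map from model activations (images x units) to recorded
    neural responses (images x neurons), then score held-out predictions."""
    reg = Ridge(alpha=1.0).fit(model_features[train_idx], neural_responses[train_idx])
    predictions = reg.predict(model_features[test_idx])
    # The median correlation across neurons is one common summary of
    # how well the model's features can 'predict' the neural data.
    scores = [pearsonr(predictions[:, n], neural_responses[test_idx, n])[0]
              for n in range(neural_responses.shape[1])]
    return float(np.median(scores))
```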

Behavioural predictivity computes how similar the behavioural responses of a model are to those of a human or a monkey presented with the same visual task (for example, where they make errors and misclassifications). The metric used to compute this similarity is called I2n, which captures image-by-image patterns of difficulty broken down by the object choice alternatives. For more detail on this metric, please refer to the paper, which explains its complexities well. For now we can just hand-wave this and say that we get a score that tells us how similar the output of a model is to that of a human when both are asked to perform the same task on the same data. The final Brain-Score is the mean of the V4 neural predictivity, the IT neural predictivity and the behavioural I2n predictivity.
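In other words, assuming the three sub-scores have already been computed (and normalised against their respective ceilings), the composite is just their unweighted average:

```python
def brain_score(v4_predictivity, it_predictivity, behavioural_i2n):
    # Composite Brain-Score: unweighted mean of the three sub-scores.
    return (v4_predictivity + it_predictivity + behavioural_i2n) / 3
```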

Figure that shows how models compare in terms of their classification performance and their Brain-Score. source

The above figure shows how popular and recent CNNs perform on Brain-Score relative to their classification accuracies. The main trend it reveals is that as classification performance has increased over time, the models have also exhibited greater brain-similarity. However, the trend starts to fall off at the high-performance end, where the most recent, very large networks sit. This suggests that recent models are learning representations that grow increasingly distant from the representations in primate brains despite their improved performance.

You can view the current leading Brain-like models here at brain-score.org. Brain-Score is an important innovation in this area as we need a universal metric to decide how ‘brain-like’ a model is before we can start comparing ways to make a model more ‘brain-like’.

CORnet-S

The paper ‘Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs’ introduces CORnet-S (repo for model), a shallow neural network with a key feature that makes it stand out from the rest of the CNNs we see today: it is made up of four anatomically mapped areas with recurrent connectivity.

The model’s brain-like ‘regions’; source

CORnet-S is split into blocks that are analogous to the cortical areas that visual neuroscience believes to be crucial for visual processing: V1, V2, V4 and IT (inferior temporal cortex). The circuitry in each block performs the usual CNN computations such as convolution, addition, activation functions, normalisation and pooling. However, the size of each block is proportional to the estimated neural population of the analogous brain area. Traditional CNNs are feedforward, meaning the output of a layer only flows to the next layer. The brain’s neural circuits, however, are often recurrent: connections can go backward in the network, as when a layer takes its own output as an input. CORnet-S implements this kind of recurrence within each block.
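A loose sketch of that idea is below. It is not the actual CORnet-S block (which uses bottleneck convolutions, gating and a particular unrolling scheme), just an illustration of a standard convolutional block that re-processes its own output for a few time steps.

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """Sketch of a CORnet-S-style 'area': ordinary CNN operations,
    but the block feeds its output back into itself."""
    def __init__(self, channels, steps=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()
        self.steps = steps

    def forward(self, x):
        out = x
        for _ in range(self.steps):
            # Recurrence: the previous output is combined with the block's
            # input, mimicking within-area recurrent connections in cortex.
            out = self.act(self.norm(self.conv(out + x)))
        return out
```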

CORnet-S is the closest approximation to the anatomy of the ventral visual system we have that remains comparable in performance to current state-of-the-art neural networks for visual processing. For this reason, the model already performs very well on Brain-Score. An additional benefit of mapping brain areas to blocks in the network will become apparent in the later papers I discuss, as it allows for brain-area-specific tinkering.

CORnet-S achieves the highest Brain-Score (at the time of publication) and manages to be competitive in terms of performance; source

Simulating the Visual Cortex at the front of CNNs

In an effort to bridge the gap between the primate primary visual cortex and CNNs, Dapello et al. developed VOneNet, a biologically constrained neural network that simulates the V1 brain region at the front end of a CNN. This biologically inspired front end is called the VOneBlock, and it can be plugged in in place of the early layers of any modern CNN vision model. Since the previously established CORnet-S has a dedicated V1 block, it is easy to decide where to plug the VOneBlock in.

The VOneBlock is based on a popular neuroscientific model of the V1 region: the Linear-Nonlinear-Poisson (LNP) model (Wikipedia). The block is constructed in three stages: a convolution, a nonlinearity and a stochasticity generator, which is reminiscent of most CNN blocks (except for the randomness).

The components of VOneBlock and the performance of VOneNets on adversarial attacks; source

The convolutional layer is a Gabor filter bank tuned to approximate empirical primate V1 neural data. This, in essence, tries to capture the same low-level features that primate V1 captures. The second layer applies a traditional nonlinearity corresponding to one of two possible cell types (simple or complex). Finally, the stochastic layer adds the characteristic randomness of neural spikes to the network. This randomness was established when repeated measurements of a neuron responding to identical visual inputs produced different spike trains across trials in neural experiments. Empirical data suggest that the spike train for each trial can be approximated by a Poisson distribution.
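Putting the three stages together, a heavily simplified VOneBlock might look like the sketch below. The `gabor_weights` argument stands in for a fixed Gabor filter bank fitted to V1 data, only the simple-cell branch is shown, and the real VOneNet uses a differentiable approximation to the Poisson noise rather than direct sampling, so this is only a rough illustration.

```python
import torch
import torch.nn as nn

class VOneBlockSketch(nn.Module):
    """Simplified sketch of the VOneBlock's three stages:
    fixed Gabor convolution -> nonlinearity -> stochasticity."""
    def __init__(self, gabor_weights):
        super().__init__()
        out_ch, in_ch, k, _ = gabor_weights.shape
        self.gfb = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2, bias=False)
        self.gfb.weight.data = gabor_weights
        self.gfb.weight.requires_grad = False  # the filter bank is fixed, not learned

    def forward(self, x):
        v1 = self.gfb(x)
        # Simple-cell nonlinearity: rectification (complex cells would instead
        # combine quadrature pairs of filters into a phase-invariant response).
        v1 = torch.relu(v1)
        # Stochasticity: spike-like trial-to-trial variability, approximated
        # here by Poisson sampling of the non-negative responses.
        if self.training:
            v1 = torch.poisson(v1)
        return v1
```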

With these three components making up the VOneBlock, once it is fitted onto a CNN the resulting VOneNet performs considerably better in a key way that is reminiscent of humans: robustness to adversarial attacks. The graph above shows the gain in performance when these networks are subjected to adversarial attacks of varying strength.

This paper provides us with a key insight: forcing relevant biological constraints onto CNNs leads to behaviour that is similar to that found in primate vision, in this case robustness to adversarial attacks. The addition of the VOneBlock was enough to drive the downstream layers of the network to learn representations that are more robust to these attacks. Traditionally, in order to make a CNN resistant to adversarial attacks, you need to include ‘attacked’ data points in its training set, which introduces an additional overhead during training. VOneNets, however, are able to generalise to these attacks without any specific additional training and are more robust from the get-go, similar to humans!

Aligning Neural Representations in IT Regions

Another approach to bridging the gap between the primate visual cortex and CNNs, taken by Dapello et al. (a different paper, similar authors), was to align the neural representations of layers within a region rather than introduce constraints at the front of the network. In other words, the constraint in this case is imposed by forcing the neural units in a CNN layer to have activations similar to those of the neurons in the analogous area of the primate visual cortex.

The study cleverly uses empirical IT neural recordings from primates exposed to certain visual stimuli and tries to force the CNN’s activations to converge toward them. Of course, neural recordings and CNN unit activations are not directly comparable, which is why they use the CKA loss function. Centered Kernel Alignment (CKA) is a measure of linear subspace alignment, and it lets us understand how close or far apart the subspaces generated by the neural recording data and the CNN unit activations are. The paper trains a CNN with a multi-loss formulation that includes a standard cross-entropy loss to optimise the model’s image recognition capabilities and a CKA loss to optimise neural predictivity (the ability of the CNN to have activations similar to the neural recordings of primates given the same data).
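Below is a minimal sketch of linear CKA and the kind of two-term objective described above. The `alpha` weighting and the exact arrangement of the losses are assumptions for illustration, not the paper’s precise formulation.

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices
    (stimuli x units); 1 means perfectly aligned subspaces, 0 means unrelated."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = (X.T @ Y).norm() ** 2            # squared Frobenius norm of X^T Y
    return cross / ((X.T @ X).norm() * (Y.T @ Y).norm())

def combined_loss(logits, labels, it_activations, neural_recordings, alpha=1.0):
    # Task loss keeps classification accuracy; the CKA term pulls the model's
    # IT-layer representation toward the recorded primate IT responses.
    task = torch.nn.functional.cross_entropy(logits, labels)
    alignment = 1.0 - linear_cka(it_activations, neural_recordings)
    return task + alpha * alignment
```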

Using CORnet-S allows the authors to isolate the IT block and force it to have representations similar to those found in the primate brain. Using the Brain-Score machinery, they were able to confirm that once the IT alignment had taken place, IT neural similarity did indeed increase over training. Interestingly, they also noticed that adversarial robustness showed a correlated increase with increasing IT neural similarity.

Figures showing the positive relationship between IT neural similarity and accuracy on adversarial datasets; source

These results may not seem like much of a revelation: “the more brain-like a CNN is, the more brain-like it behaves, i.e. it performs better against adversarial attacks”. However, they are an important piece of evidence that warrants deeper dives into making CNNs more brain-like using methods similar to the novel ones presented in this paper and its predecessors. Current CNNs struggle on adversarial datasets like these and behave very differently from humans when tasked with similar problems. Representationally aligning models seems to increase human behavioural alignment, as presented in this paper.

Some parting thoughts

The DiCarlo Lab has put out considerable work in the area of simulating brain regions in CNNs through various means, such as neuro-physiological constraints as well as representational alignment. The results show superior performance in some specific domains, namely adversarial robustness, where these neuro-inspired models outperform today’s vanilla state-of-the-art CNNs. Currently, models need to be trained explicitly on datasets with adversarial attacks in order to gain robustness, which is extremely computationally taxing. However, the introduction of these neurological constraints, one way or another, leads to robustness without the extra training, a trait that is more in line with human visual behaviour! As a budding computational cognitive scientist, I am excited to see how the DiCarlo Lab comes up with new ways to introduce these constraints. Furthermore, I am also thrilled at the prospect of all the upcoming work that explores the concept of more brain-like neural networks.

Niranjan Rajesh
Bits and Neurons

Hey! I am a student at Ashoka interested in the intersection of computation and cognition. I write my thoughts on cool concepts and papers from this field.