Combining principles from Cognitive Psychology with Deep Learning — Part 1

Kaneel Senevirathne · Analytics Vidhya · Nov 3, 2021

We often hear that deep neural networks are inspired by the brain and that they are “mimicking brain function”. While it is true that neural networks are inspired by the brain, “mimicking brain function” is far from true. Take machine vision, for example. State-of-the-art computer vision models rely mostly on vast amounts of labeled data. The human visual system, however, is far more complex than that: it is more generalizable and has a much richer object representation. As a self-taught deep learning and cognitive science enthusiast, I strongly believe that these two fields complement each other and that it is important to find ways to combine them further to advance deep learning.

Taking this as motivation, I took one cognitive psychology concept I learnt during my master’s and applied it to a deep learning problem. I used principles derived from computational models of multisensory integration and combined them with deep learning to classify gender from a video dataset (the VoxCeleb dataset) containing utterances of words and sentences by celebrities. In this article, I will introduce you to multisensory integration, what psychophysical studies of multisensory integration tell us, and how I applied one of its principles to my deep learning model. If you would like to check out the code first, you can visit my GitHub repository.

What is multisensory integration?

Our brain combines multisensory signals coming from various sources in the outside world to form a three-dimensional, panoramic, fully immersive movie. For example, when you are talking to a friend, the brain gets information from both vision and audition. By using both of these senses, the brain can not only hear what she is saying but also see her facial expressions, which helps it better interpret what she is saying.

Figure 1: The brain combines both visual and auditory cues

So it is advantageous for the brain, or any prediction system, to use multiple sources of information to make decisions. In the worst-case scenario, when one of the information sources is unreliable, the system can weight its decision toward the more reliable source.

Experimental studies in cognitive psychology further refine these principles. One study that stands out in the literature is the audio-visual study by Alais and Burr¹. This study revealed three key ideas.

  • The brain combines audio and visual data optimally — First, they showed that combining auditory and visual information produces a better (statistically optimal) estimate than using either sense alone. In other words, using multiple senses results in more accurate predictions.
  • When auditory information is uncertain, visual information takes charge — They also showed that, given uncertain auditory information, the brain relies more on visual information. A famous example is the ancient art of ventriloquism (Figure 2). Ventriloquism makes one’s voice appear to come from elsewhere (often from a puppet). Since the puppet’s mouth is moving, the brain tends to think that the voice is coming from the puppet.
  • When visual information is uncertain, auditory information takes charge — One of the significant findings of this study was that, given uncertain visual information, the brain relies more on auditory information. This suggests a reverse ventriloquist effect.
Figure 2: Ventriloquism is the art of making one’s voice come from elsewhere.

Mathematically, we can compute the combined estimate by weighting the prediction of each individual source by its precision (precision, the inverse of the variance, measures the reliability of an information source).

S_av = w_v·S_v + w_a·S_a, where w_v = (1/σ_v²) / (1/σ_v² + 1/σ_a²) and w_a = (1/σ_a²) / (1/σ_v² + 1/σ_a²)

Equation 1: vision (v), audition (a); S is each source’s prediction and 1/σ² its precision

According to Equation 1, the prediction of each source is weighted by its precision. Going back to the ventriloquist example, because the sound seems to come from a different location, the precision of the auditory information is low. So, based on Equation 1, the final prediction is weighted more heavily by visual information, and the brain is tricked into thinking that the sound is coming from the puppet.
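For illustration, here is a minimal NumPy sketch of this precision-weighted combination; the function name and the example numbers are my own and are not taken from the original notebook.

```python
import numpy as np

def combine_precision_weighted(pred_v, pred_a, prec_v=1.0, prec_a=1.0):
    """Combine a visual and an auditory prediction as in Equation 1.

    Each unimodal prediction is weighted by its precision (1 / variance),
    with the weights normalized to sum to one.
    """
    w_v = prec_v / (prec_v + prec_a)
    w_a = prec_a / (prec_v + prec_a)
    return w_v * np.asarray(pred_v) + w_a * np.asarray(pred_a)

# Ventriloquism-like case: the auditory estimate is unreliable (low precision),
# so the combined estimate is pulled toward the visual one.
print(combine_precision_weighted(pred_v=0.9, pred_a=0.2, prec_v=4.0, prec_a=0.5))  # ~0.82
```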

This was one of the first journal articles I read and it is still one of my favorite papers. I have cited it in the references if you’d like to read it, and I highly encourage you to because it is a classic! If you would also like to check out the code for the cognitive model, you can go to this Jupyter Notebook.

Using these principles in deep learning

As I mentioned earlier, in this project I used the “VoxCeleb” video dataset to classify the gender of speakers. In the first part of the project, I tested the first concept of multisensory integration, which states that combining multiple sources of information produces an optimal result. To do this, I first created two models, a visual-only model and an audio-only model, to predict the gender of speakers. Then I compared the performance of these two models to that of a combined model.

Visual model — The first model I used for the classification task was a vision-only model. I extracted frames from the videos, created a 2D convolutional neural network (CNN) and trained the model to predict the gender of the speaker in the video. The CNN consisted of three convolutional layers, each followed by batch normalization and max pooling. The output of the convolutional layers was then flattened and sent through a dropout layer followed by three dense layers. You can check the colab notebook here.
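To make this architecture concrete, here is a minimal Keras sketch of a model in this style. The framework, input shape, filter counts and dense-layer sizes are my own assumptions, so please refer to the linked notebook for the exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_visual_model(input_shape=(128, 128, 3)):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Three convolutional blocks, each followed by batch normalization and max pooling
    for filters in (32, 64, 128):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.5))
    # Three dense layers; a sigmoid output for the binary gender label
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    return model
```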

I also created a 3D CNN model (spatial and temporal) which I didn’t end up using. You can check this model in this colab notebook.

Auditory model — For the audio-only model, I first extracted 3 seconds of audio from each video file and then created mel spectrograms. (A mel spectrogram is a time-versus-frequency plot in which the frequencies are converted to the mel scale. Studies have shown that humans are better at detecting differences between lower frequencies than between higher frequencies, so converting frequencies to the mel scale is preferred when dealing with human audio samples.)
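As an illustration, here is a small librosa sketch of this preprocessing step; the sample rate and the number of mel bands are assumptions rather than the exact settings used in the project.

```python
import librosa
import numpy as np

def audio_to_mel(path, sr=16000, duration=3.0, n_mels=128):
    # Load the first 3 seconds of audio at a fixed sample rate
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Convert power to decibels so quieter components remain visible
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, time_frames)
```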

The architecture of the audio-only model was similar to that of the visual-only model: a 2D CNN with three convolutional layers, each followed by batch normalization and max pooling. The output of the convolutional layers was then flattened and sent through two dropout layers and three dense layers. You can check the colab notebook here.

Audio-visual model — For the audio-visual model, I used the trained visual and auditory models to generate unimodal predictions and then used Equation 1 to calculate the bimodal prediction. I assumed the precision of each model to be 1 (100% precise), so in this scenario the audio-visual model simply averages the outputs of the visual-only and audio-only models to produce the final prediction. You can see a diagram of the audio-visual model in Figure 3.
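Here is a minimal sketch of that combination step, assuming Keras-style models whose sigmoid outputs give the probability of one gender class; the 0.5 decision threshold and the argument names are my own assumptions.

```python
def bimodal_predict(visual_model, audio_model, frame_batch, mel_batch,
                    prec_v=1.0, prec_a=1.0):
    # Unimodal predictions (probabilities between 0 and 1)
    p_v = visual_model.predict(frame_batch)
    p_a = audio_model.predict(mel_batch)
    # Equation 1; with both precisions equal to 1 this is a simple average
    w_v = prec_v / (prec_v + prec_a)
    w_a = prec_a / (prec_v + prec_a)
    p_av = w_v * p_v + w_a * p_a
    return (p_av > 0.5).astype(int)  # final gender label per video
```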

Figure 3: Audio-visual model architecture

Results

Both the audio and visual models were trained for 50 epochs with a batch size of 8, and training was stopped early once the training loss had stopped improving for five consecutive epochs. Both models were trained with the Adam optimizer, and the model weights were saved separately. The audio-visual model was then built from the saved weights of the audio and visual models with a precision of 1, as mentioned above.
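A minimal Keras sketch of this training setup, reusing the visual model from the earlier sketch; the monitored quantity, the file name and the training arrays (train_frames, train_labels) are hypothetical.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model = build_visual_model()  # from the sketch above
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Stop once the training loss has not improved for five consecutive epochs
    EarlyStopping(monitor="loss", patience=5, restore_best_weights=True),
    # Save the best weights so they can be reused by the audio-visual model
    ModelCheckpoint("visual_model.h5", monitor="loss", save_best_only=True),
]

model.fit(train_frames, train_labels, epochs=50, batch_size=8, callbacks=callbacks)
```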

Then, I picked 10 new data batches (each containing 100 videos) from the “VoxCeleb” dataset and tested the three models. The testing criterion was each model’s accuracy on each batch. The results are presented in Figure 4 below. You can go to this colab notebook to check out the results.
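For reference, the testing procedure can be sketched as a simple loop over the batches; the data-loading helper below is hypothetical and only illustrates the per-batch accuracy calculation.

```python
import numpy as np

accuracies = []
for frames, mels, labels in load_test_batches(n_batches=10, batch_size=100):  # hypothetical loader
    preds = bimodal_predict(visual_model, audio_model, frames, mels)
    accuracies.append(np.mean(preds.ravel() == labels))

print(f"mean accuracy: {np.mean(accuracies):.2%} +/- {np.std(accuracies):.2%}")
```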

Figure 4: The model accuracies for the three models. The models were tested by finding the accuracies of 10 batches (each batch containing 100 videos). The bar represents the mean value and the error bar represents the standard deviation of each model. The scattered points are the individual accuracies of each batch.

The results show that the mean accuracies of the video, audio and audio-visual (bimodal) models are 87%, 86% and 92%, respectively. Consistent with the principle of optimality mentioned above, the mean prediction accuracy of the audio-visual model is higher than the means of both the audio and the visual model. This shows that using multiple sources of information can be advantageous to artificial systems as well.

The results also show that, of the two unimodal models (video and audio), the visual accuracy is higher than the audio accuracy. This is interesting because psychophysical studies in humans have shown that our visual system is more accurate than our other senses¹. The reason could be that we perceive most information about our surroundings through sight. Although there is no substantial evidence that this carries over to artificial systems, I thought it was an interesting fact to point out!

Future work

Here we assumed perfect precision for both the audio and the visual source. What if one of the sources is corrupted? In that case, we can use a lower precision for the corrupted source and a higher precision for the more reliable one. For example, if rain noise in the background makes the audio source unreliable, the model can rely more on the visual information to make its prediction. In the future, I plan to inject noise into the unimodal networks (for example, mixing rain or white noise into the audio track) and have each network predict both the outcome and the precision of its source (the more noise you inject, the less precise the source becomes). We can then create an audio-visual model that combines the outcomes based on the precision of each source. According to the principles of multisensory integration, the combined network should still perform better than the unimodal networks.
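As a first step in that direction, a simple way to corrupt the auditory stream is to mix white noise into the waveform at a chosen signal-to-noise ratio; this sketch is my own and is not part of the current notebooks.

```python
import numpy as np

def add_white_noise(audio, snr_db=10.0, rng=None):
    # Lower SNR means a noisier, and therefore less precise, auditory source
    if rng is None:
        rng = np.random.default_rng()
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise
```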

Conclusion

Here I talked about how I used principles from multisensory integration and combined them with deep learning. I used an open-source video dataset and combined visual and auditory information to classify the gender of the speakers in the videos. I compared the unimodal networks with the combined network to see if the latter produces an optimal result. In agreement with the principles from the multisensory integration literature, the combined model did in fact produce the best result. There is also further evidence in the literature²,³ suggesting that multimodal data could be very influential in artificial intelligence and potentially very useful in solving real-world problems in autonomous driving, healthcare and various other sectors.

This concludes my first blog article about using cognitive psychology concepts in deep learning. In the future, I hope to research more cognitive psychology concepts and combine them with deep learning. You can follow my journey on Medium. Thank you very much for reading!

References

[1] David Alais and David Burr, “The Ventriloquist Effect Results from Near-Optimal Bimodal Integration,” Current Biology 14, no. 3 (February 2004): 257–62.

[2] Lorijn Zaadnoordijk, Tarek R. Besold, and Rhodri Cusack, “The Next Big Thing(s) in Unsupervised Machine Learning: Five Lessons from Infant Learning,” arXiv:2009.08497 [cs], September 17, 2020, http://arxiv.org/abs/2009.08497.

[3] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng, “Multimodal Deep Learning,” in Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
