Why Deep Learning Ensembles Outperform Bayesian Neural Networks

Don’t they do the same thing??

Devansh
The Startup
7 min read · Aug 10, 2020


Join 31K+ AI People keeping in touch with the most important ideas in Machine Learning through my free newsletter over here

Recently I came across an interesting paper, "Deep Ensembles: A Loss Landscape Perspective" by Lakshminarayanan et al. In this article, I will break down the paper, summarise its findings, and go over some of the techniques and strategies the authors used that are useful for understanding models and their learning process. It will also cover some possible extensions to the paper. You can find my annotations on the paper down below.

If you're busy, here's a quick video that explains the crux of the paper, including its hypothesis and conclusion. Feel free to check it out instead. This article goes into more detail, and watching the video is not necessary to understand it. Any feedback on the video would be much appreciated though. Don't forget to like and sub :)

Still working on production quality. Leave your thoughts in the comments

The Theory

The authors conjectured (correctly) that Deep Ensembles (an ensemble of Deep learning models) outperform Bayesian Neural Networks because “popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space.”

In simple words, a Bayesian neural network run from a single initialization will reach one of the modes (peaks) and stop there. Deep ensembles will explore several different modes, therefore reducing error when put into practice. In picture form:

Taken from the paper

Depending on its hyperparameters, a single run of a Bayesian neural network will find one of the paths (colors) and its mode. Therefore it won't explore the rest of the parameter space. On the other hand, a deep ensemble will explore all the paths, and therefore get a better understanding of the weight space (and its solutions). To understand why this translates to better performance, consider the following illustration.

3 Possible Solutions. The area colored red is what it gets wrong

In the diagram, we have 3 possible solution spaces, corresponding to each of the trajectories. The optimized mode for each gives us a score of 90% (for example). Each mode is unable to solve a certain kind of problem (highlighted in red). A Bayesian neural network will get to either A, B, or C in a run, while a deep ensemble will be able to train over all 3.
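To make the idea concrete, here is a minimal sketch (in PyTorch, not the paper's actual code) of what a deep ensemble looks like in practice: the same architecture trained from several different random initializations, with predictions averaged in probability space. The architecture, seeds, and training loop are placeholders.

```python
# Minimal deep-ensemble sketch (illustrative only).
# Each member is the same architecture started from a different random initialization,
# so each member tends to settle into its own mode of the loss landscape.
import torch
import torch.nn as nn

def make_member(seed: int) -> nn.Module:
    torch.manual_seed(seed)  # different seed -> different initialization
    return nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

members = [make_member(seed) for seed in range(5)]

# ... train each member independently on the same data (standard training loop omitted) ...

def ensemble_predict(x: torch.Tensor) -> torch.Tensor:
    """Average the members' softmax outputs over the ensemble."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in members])
    return probs.mean(dim=0)
```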

The Techniques

They supported their hypothesis using various strategies, which allowed them to approach the problem from multiple perspectives. I will go through the details of each of them.

Cosine Similarity:

Cosine similarity is defined as the "measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1." It is derived from the dot product between vectors. Imagine 3 texts: A, B, and C. A and C are large documents on a similar topic, and B is a very short summary of A. A and C might end up having a low Euclidean distance because they have a lot of overlapping words and phrases, while A and B will have a larger distance because of the difference in size. The cosine similarities would paint a different picture, however, since A and B would have a small angle (and thus high similarity) between them.
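As a quick numeric illustration of that contrast, here is a toy sketch with made-up word-count vectors (the numbers are purely illustrative):

```python
# Toy contrast between Euclidean distance and cosine similarity (made-up vectors).
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Dot product of the two vectors after normalizing each to unit length.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([10.0, 20.0, 30.0])   # long document A (word counts, say)
b = np.array([1.0, 2.0, 3.0])      # short summary B with the same proportions as A
c = np.array([30.0, 18.0, 11.0])   # long document C on a related topic

print(np.linalg.norm(a - b))       # ~33.7: A and B are far apart in Euclidean terms (size difference)
print(np.linalg.norm(a - c))       # ~27.7: A and C are closer despite differing in content
print(cosine_similarity(a, b))     # 1.0: A and B point in the same direction (high similarity)
print(cosine_similarity(a, c))     # ~0.72: A and C point in different directions
```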

This diagram is used to show that "checkpoints along a trajectory are largely similar both in the weight space and the function space." Checkpoint 30 and checkpoint 25 have high similarity (red), while checkpoints 30 and 5 have relatively low similarity (grey). An interesting thing to note is that the lowest labeled value is a similarity of 0.68. This goes to show how quickly models along a single trajectory settle into one mode and stay similar to one another.
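A weight-space heatmap like this can be reproduced in a few lines: flatten each checkpoint's parameters into one long vector and take pairwise cosine similarities. This is only a sketch; the checkpoint file names and the assumption that each file holds a plain state_dict are mine, not the paper's pipeline.

```python
# Sketch: pairwise cosine similarity between training checkpoints in weight space.
import numpy as np
import torch

def flatten_checkpoint(path: str) -> np.ndarray:
    # Assumes the file holds a plain state_dict of tensors (hypothetical file layout).
    state = torch.load(path, map_location="cpu")
    return np.concatenate([p.numpy().ravel() for p in state.values()])

paths = [f"checkpoint_{i:02d}.pt" for i in range(5, 35, 5)]  # e.g. checkpoints 5, 10, ..., 30
weights = [flatten_checkpoint(p) for p in paths]

n = len(weights)
similarity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        similarity[i, j] = np.dot(weights[i], weights[j]) / (
            np.linalg.norm(weights[i]) * np.linalg.norm(weights[j])
        )
print(similarity)  # nearby checkpoints (e.g. 25 vs 30) should score higher than distant ones
```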

Disagreement in Function Space:

The opposite of the similarity

Defined as "the fraction of points the checkpoints disagree on", or, over N inputs:

$\frac{1}{N}\sum_{n=1}^{N}\left[f(x_n; \theta_1) \neq f(x_n; \theta_2)\right]$

where $f(x; \theta)$ denotes the class label predicted by the network for input $x$.

Disagreement is like the complement of the similarity scores and showcases the differences between checkpoints along the same trajectory in a more direct manner. Both similarity and disagreement are calculated over a single run with a single model. The highest bucket for disagreement starts at 0.45, which shows that there is relatively low disagreement, consistent with the findings of the similarity map.
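Computing that disagreement fraction is straightforward. Here is a hedged sketch, where model_a, model_b, and loader are placeholders for two checkpoints and an evaluation data loader:

```python
# Sketch: disagreement between two checkpoints = fraction of inputs whose
# predicted class labels differ. model_a, model_b, and loader are placeholders.
import torch

def disagreement(model_a, model_b, loader) -> float:
    differ, total = 0, 0
    with torch.no_grad():
        for x, _ in loader:                       # true labels are not needed here
            pred_a = model_a(x).argmax(dim=-1)    # f(x; theta_1)
            pred_b = model_b(x).argmax(dim=-1)    # f(x; theta_2)
            differ += (pred_a != pred_b).sum().item()
            total += x.shape[0]
    return differ / total
```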

The use of disagreement and similarity shows that points along the same trajectory make very similar predictions. The third technique is used to show that separate trajectories can take very different paths, and thus end up unable to solve certain kinds of problems.

Plotting Different Random Initializations Using t-SNE

This makes a comeback

As stated before, this diagram is used to show how different random initializations differ in function space. t-SNE is used to project the high-dimensional data down into lower (human-understandable) dimensions. It is an alternative to PCA. Unlike PCA, t-SNE is non-linear and probabilistic in nature. It is also far more computationally expensive (especially with lots of samples and high dimensionality). In this context, t-SNE makes more sense than PCA because the functions learned by deep networks are not linear, so a non-linear embedding suits them better. The researchers applied some preprocessing to keep the costs down. The steps they took were, "for each checkpoint we take the softmax output for a set of examples, flatten the vector and use it to represent the model's predictions. The t-SNE algorithm is then used to reduce it to a 2D point in the t-SNE plot. Figure 2(c) shows that the functions explored by different trajectories (denoted by circles with different colors) are far away, while functions explored within a single trajectory (circles with the same color) tend to be much more similar."
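Here is a rough sketch of the preprocessing described in that quote, using scikit-learn's t-SNE. `models` and `x_eval` are placeholders for the trained checkpoints and a fixed batch of examples, and the perplexity value is arbitrary:

```python
# Sketch of the quoted preprocessing: represent each checkpoint by its flattened
# softmax outputs on a fixed set of examples, then reduce to 2D with t-SNE.
import numpy as np
import torch
from sklearn.manifold import TSNE

def function_space_embedding(models, x_eval: torch.Tensor) -> np.ndarray:
    reps = []
    with torch.no_grad():
        for model in models:
            probs = torch.softmax(model(x_eval), dim=-1)   # (num_examples, num_classes)
            reps.append(probs.numpy().ravel())             # flatten into one long vector per checkpoint
    reps = np.stack(reps)                                  # (num_checkpoints, num_examples * num_classes)
    # perplexity is a placeholder; it must be smaller than the number of checkpoints
    return TSNE(n_components=2, perplexity=5).fit_transform(reps)  # one 2D point per checkpoint
```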

Subspace Sampling and Diversity Measurement

The last two things the paper implemented were subspace sampling and a diversity measurement. Subspace sampling generates additional solutions by sampling weights in a neighborhood of a trained model rather than training new models from scratch; these extra learners, when combined, can perform better than the original. The details of the sampling methods are in the paper. The sampled solutions validated the results of the t-SNE plot, with different random initializations going along different paths.

3 different subspace sampling methods, all lead to distinct neighborhoods

The diversity score quantifies the difference between two functions (a base solution and a sampled one) by measuring the fraction of data points on which their predictions differ. This simple approach is enough to validate the premise.
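As an illustration, here is a hedged sketch of one simple way to sample a nearby solution (a Gaussian perturbation of the base weights, used here purely for illustration) together with the diversity score, which is the same fraction-of-differing-predictions computation as the disagreement above. The model and loader names are placeholders.

```python
# Sketch: sample a solution near the base model by perturbing its weights, then
# measure diversity as the fraction of points whose predictions differ from the base.
import copy
import torch

def sample_nearby(base_model, scale: float = 0.01):
    sampled = copy.deepcopy(base_model)
    with torch.no_grad():
        for p in sampled.parameters():
            p.add_(scale * torch.randn_like(p))   # small random perturbation around the base solution
    return sampled

def diversity(base_model, sampled_model, loader) -> float:
    differ, total = 0, 0
    with torch.no_grad():
        for x, _ in loader:
            differ += (base_model(x).argmax(-1) != sampled_model(x).argmax(-1)).sum().item()
            total += x.shape[0]
    return differ / total
```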

Both the sampling and accuracy-diversity plots are further proof of the hypothesis.

Combined, these techniques demonstrate two things:

  1. Different initializations of a network will cause a model to reach and get stuck at different modes (peaks). This lets it solve certain problems and not others. Points along the trajectory leading to a mode will have very similar predictions.
  2. Ensembles are able to learn from multiple modes, thus improving their performance.

Extensions

This paper did a great job of popping the hood on the learning processes of Deep Ensembles and Bayesian Neural Networks. An analysis of the learning curves and validation curves would have been interesting. Furthermore, it would be interesting to see how deep learning ensembles stack up against Random Forests or other ensembles. Performing a similar analysis on the learning processes of these would allow us to create mixed ensembles that might be good for solving complex problems.

Reach Out To Me

Please leave your feedback on this article below. If this was useful to you, please share it and follow me here.

Free weekly summary of the important updates in Machine Learning (sponsored): https://lnkd.in/gCFTuivn

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube. It’s a work in progress haha: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Twitter: https://twitter.com/Machine01776819

My Substack: https://devanshacc.substack.com/

If you would like to work with me email me: devanshverma425@gmail.com

Live conversations at twitch here: https://rb.gy/zlhk9y

To get updates on my content- Instagram: https://rb.gy/gmvuy9

Get a free stock on Robinhood: https://join.robinhood.com/fnud75

Highlighted Paper

Below is the paper. I have highlighted what I thought was important and added definitions to some important concepts. Hope it helps
