Neural Networks without Backpropagation: Direct Feedback Alignment
Here’s a quick summary of Arild Nøkland’s 2016 paper “Direct Feedback Alignment”, which is both clearly written and interesting. Both Lillicrap et al. (2016) and Nøkland (2016) were able to train a Neural Network (NN) without Backpropagation.
First, create a simple NN like this:
with cross-entropy as its loss function.
So here is our implementation so far:
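As a minimal NumPy sketch of such a network, assuming two tanh hidden layers, a softmax output, and one example per row of `x`. The layer sizes, the tanh activation, and the names `softmax`, `cross_entropy`, and `forward` are illustrative assumptions, not taken from the original listing:

```python
import numpy as np

def softmax(a):
    # Subtract the row-wise max for numerical stability.
    a = a - a.max(axis=1, keepdims=True)
    exp_a = np.exp(a)
    return exp_a / exp_a.sum(axis=1, keepdims=True)

def cross_entropy(y_hat, y):
    # Mean cross-entropy over the batch; y is one-hot.
    return -np.mean(np.sum(y * np.log(y_hat + 1e-12), axis=1))

def forward(x, params):
    # Two hidden layers (tanh is an assumption) followed by softmax.
    W1, b1, W2, b2, W3, b3 = params
    a1 = x @ W1 + b1
    h1 = np.tanh(a1)
    a2 = h1 @ W2 + b2
    h2 = np.tanh(a2)
    a_y = h2 @ W3 + b3
    y_hat = softmax(a_y)
    return a1, h1, a2, h2, y_hat

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 4, 8, 6, 3  # hypothetical layer sizes
params = (
    rng.normal(0.0, 0.1, (n_in, n_h1)), np.zeros(n_h1),
    rng.normal(0.0, 0.1, (n_h1, n_h2)), np.zeros(n_h2),
    rng.normal(0.0, 0.1, (n_h2, n_out)), np.zeros(n_out),
)

x = rng.normal(size=(5, n_in))                      # 5 random inputs
y = np.eye(n_out)[rng.integers(0, n_out, size=5)]   # random one-hot targets
a1, h1, a2, h2, y_hat = forward(x, params)
loss = cross_entropy(y_hat, y)
```

The pre-activations `a1` and `a2` are returned as well, because the backward pass needs them to evaluate the derivative of the activation function.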
To train a NN, we need the derivative of the loss function w.r.t. the input of the softmax function. Let’s take a look at Equation 5 from Nøkland’s paper.
which simply corresponds to:
I am not going to explain the calculus behind this here. Please refer to Sadowski’s Notes on Backpropagation if you want the explanation.
This is our implementation of Backpropagation:
e will be used later for updating the weights.
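For softmax combined with cross-entropy, the output error reduces to the prediction minus the target, e = ŷ − y. The following sketch checks that identity numerically with central finite differences; the helper name `loss_from_logits` and the example values are assumptions made for illustration:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    exp_a = np.exp(a)
    return exp_a / exp_a.sum(axis=1, keepdims=True)

def loss_from_logits(a_y, y):
    # Cross-entropy of a single example as a function of the softmax input.
    return -np.sum(y * np.log(softmax(a_y)))

rng = np.random.default_rng(0)
a_y = rng.normal(size=(1, 3))      # softmax input for one example
y = np.array([[0.0, 1.0, 0.0]])    # one-hot target

# Analytic output error: prediction minus target.
e = softmax(a_y) - y

# Numerical derivative of the loss w.r.t. each softmax input.
eps = 1e-6
numeric = np.zeros_like(a_y)
for k in range(a_y.shape[1]):
    step = np.zeros_like(a_y)
    step[0, k] = eps
    numeric[0, k] = (
        loss_from_logits(a_y + step, y) - loss_from_logits(a_y - step, y)
    ) / (2 * eps)
```

Comparing `e` with `numeric` (for example with `np.allclose`) is a quick sanity check before wiring `e` into any of the learning methods.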
So, how can we learn without Backpropagation?
Lillicrap et al. proposed an algorithm named Feedback Alignment. They argued that symmetric weights in Backpropagation aren’t required for learning.
Which means: normally, with Backpropagation, you have to use the transpose of the forward weight matrices to propagate the error backward through the layers. But it turns out you don’t have to. You can train your weights with fixed random matrices just fine.
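To make the difference concrete, here is a small sketch of a single backward step into one hidden layer. It assumes a tanh activation and one example per row; `W3`, `B2`, `e`, and `a2` are illustrative names and random placeholder values, not the original code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h2, n_out = 6, 3  # hypothetical layer sizes

W3 = rng.normal(0.0, 0.1, (n_h2, n_out))    # forward weights, hidden -> output
B2 = rng.uniform(-0.5, 0.5, (n_out, n_h2))  # fixed random feedback, same shape as W3.T

e = rng.normal(size=(5, n_out))    # stand-in for the output error of 5 examples
a2 = rng.normal(size=(5, n_h2))    # stand-in for the hidden pre-activations
dtanh = 1.0 - np.tanh(a2) ** 2     # derivative of tanh at a2

d_a2_bp = (e @ W3.T) * dtanh   # Backpropagation: transpose of the forward weights
d_a2_fa = (e @ B2) * dtanh     # Feedback Alignment: fixed random matrix instead
```

The only change is the matrix that carries the error backward; `B2` is drawn once and never updated.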
In 2016, Nøkland built further on this idea.
We will take a look at four methods. They are Backpropagation (BP), Feedback Alignment (FA), and two methods proposed by Nøkland himself: Direct Feedback Alignment (DFA) and Indirect Feedback Alignment (IFA).
The illustration below shows how they work. Note that the grey arrows are the forward propagation, while the black arrows are the backward propagation, i.e. the learning process. You can just focus on the black arrows.
Below, you can see the detailed explanation of those learning methods:
You can see that the paper is written clearly; it even explains Backpropagation in a simple way. Next, let’s continue the code so that it corresponds to those equations.
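One possible way to express the hidden-layer error signals of the four methods side by side is sketched below. It assumes two tanh hidden layers and one example per row, so the matrix products are transposed relative to the column-vector notation in the paper; the function name `hidden_deltas` and all sizes are assumptions for illustration:

```python
import numpy as np

def dtanh(a):
    return 1.0 - np.tanh(a) ** 2

def hidden_deltas(method, e, a1, a2, W2, W3, B1=None, B2=None):
    """Return (d_a1, d_a2), the error signals for the two hidden layers."""
    if method == "bp":
        d_a2 = (e @ W3.T) * dtanh(a2)      # transpose of forward weights
        d_a1 = (d_a2 @ W2.T) * dtanh(a1)
    elif method == "fa":
        d_a2 = (e @ B2) * dtanh(a2)        # B2: (n_out, n_h2)
        d_a1 = (d_a2 @ B1) * dtanh(a1)     # B1: (n_h2, n_h1)
    elif method == "dfa":
        d_a2 = (e @ B2) * dtanh(a2)        # B2: (n_out, n_h2)
        d_a1 = (e @ B1) * dtanh(a1)        # B1: (n_out, n_h1), fed by e directly
    elif method == "ifa":
        d_a1 = (e @ B1) * dtanh(a1)        # B1: (n_out, n_h1)
        d_a2 = (d_a1 @ W2) * dtanh(a2)     # passed on through forward weights W2
    else:
        raise ValueError(f"unknown method: {method}")
    return d_a1, d_a2

rng = np.random.default_rng(0)
n, n_h1, n_h2, n_out = 5, 8, 6, 3  # hypothetical sizes
a1 = rng.normal(size=(n, n_h1))
a2 = rng.normal(size=(n, n_h2))
e = rng.normal(size=(n, n_out))
W2 = rng.normal(0.0, 0.1, (n_h1, n_h2))
W3 = rng.normal(0.0, 0.1, (n_h2, n_out))

B_fa = {"B1": rng.uniform(-0.5, 0.5, (n_h2, n_h1)),
        "B2": rng.uniform(-0.5, 0.5, (n_out, n_h2))}
B_direct = {"B1": rng.uniform(-0.5, 0.5, (n_out, n_h1)),
            "B2": rng.uniform(-0.5, 0.5, (n_out, n_h2))}

deltas = {
    "bp": hidden_deltas("bp", e, a1, a2, W2, W3),
    "fa": hidden_deltas("fa", e, a1, a2, W2, W3, **B_fa),
    "dfa": hidden_deltas("dfa", e, a1, a2, W2, W3, **B_direct),
    "ifa": hidden_deltas("ifa", e, a1, a2, W2, W3, B1=B_direct["B1"]),
}
```

In every method the output layer itself is still updated from `e` directly; only the way the error reaches the hidden layers changes.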
So what’s the point of all this? They look similar.
Yes, all of them use derivatives. But…
While the Feedback Alignment implementation looks almost identical to Backpropagation, it uses a fixed random matrix. It shows that Neural Networks can learn just fine using random feedback matrices instead of the transposed weight matrices.
With DFA, you can use just the error from the output layer to train all layers of the Neural Network. Each layer no longer needs to depend on the gradient from the layers after it. So the backward pass doesn’t have to progress layer by layer anymore.
IFA is also interesting: you can train a layer using feedback that arrives through the layer before it.
But, how do we initialize B, the fixed-random matrix?
Let me summarize both initializations for you:
- Lillicrap’s B were sampled from a uniform distribution in the range of -0.5 to 0.5.
- Nøkland’s B were sampled from a uniform distribution in the range of
Implementation-wise, you might want to know that BP and FA’s matrices have the same shape, while DFA and IFA’s matrices have a different shape from BP and FA’s. So you shouldn’t be surprised when you run into an error message about matrix shapes later.
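A sketch of how the fixed random matrices could be created with the right shapes for each method, assuming two hidden layers and one example per row. The helper `make_feedback` is hypothetical, and the default range of -0.5 to 0.5 is the one quoted above for Lillicrap’s B; pass other `low` and `high` values for a different range:

```python
import numpy as np

def make_feedback(method, n_h1, n_h2, n_out, rng, low=-0.5, high=0.5):
    """Sample the fixed random feedback matrices for one method."""
    if method == "fa":
        # Same shapes as the transposed forward weights used by BP.
        shapes = {"B1": (n_h2, n_h1), "B2": (n_out, n_h2)}
    elif method == "dfa":
        # Every hidden layer is fed directly from the output error.
        shapes = {"B1": (n_out, n_h1), "B2": (n_out, n_h2)}
    elif method == "ifa":
        # Only the first hidden layer gets a random feedback matrix.
        shapes = {"B1": (n_out, n_h1)}
    else:
        raise ValueError(f"unknown method: {method}")
    return {name: rng.uniform(low, high, shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
n_h1, n_h2, n_out = 8, 6, 3  # hypothetical layer sizes
feedback = {m: make_feedback(m, n_h1, n_h2, n_out, rng) for m in ("fa", "dfa", "ifa")}
```

With an output error `e` of shape `(batch, n_out)`, `e @ feedback["dfa"]["B1"]` works, whereas `e @ feedback["fa"]["B1"]` raises a shape error because FA’s `B1` expects the error signal of the second hidden layer.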
That’s all. I leave the rest of the implementation as an exercise for the reader :)
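As a starting point for that exercise, here is a rough sketch of a full DFA training loop on a synthetic toy problem. Everything in it is an assumption made for illustration: the random dataset, the layer sizes, the tanh activation, the learning rate, and the small random initialization of the forward weights. It is not Nøkland’s reference implementation.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    exp_a = np.exp(a)
    return exp_a / exp_a.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, n_in, n_h1, n_h2, n_out = 300, 4, 16, 16, 3  # hypothetical sizes

# Synthetic toy data: labels come from a random linear map of the inputs.
X = rng.normal(size=(n, n_in))
labels = np.argmax(X @ rng.normal(size=(n_in, n_out)), axis=1)
Y = np.eye(n_out)[labels]

# Forward weights (trained) and fixed random feedback matrices (never updated).
W1 = rng.normal(0.0, 0.1, (n_in, n_h1)); b1 = np.zeros(n_h1)
W2 = rng.normal(0.0, 0.1, (n_h1, n_h2)); b2 = np.zeros(n_h2)
W3 = rng.normal(0.0, 0.1, (n_h2, n_out)); b3 = np.zeros(n_out)
B1 = rng.uniform(-0.5, 0.5, (n_out, n_h1))
B2 = rng.uniform(-0.5, 0.5, (n_out, n_h2))

lr = 0.1
losses = []
for step in range(1000):
    # Forward pass.
    h1 = np.tanh(X @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    y_hat = softmax(h2 @ W3 + b3)
    losses.append(-np.mean(np.sum(Y * np.log(y_hat + 1e-12), axis=1)))

    # Output error, averaged over the batch.
    e = (y_hat - Y) / n

    # DFA: both hidden layers receive e directly through fixed random matrices.
    d_a2 = (e @ B2) * (1.0 - h2 ** 2)
    d_a1 = (e @ B1) * (1.0 - h1 ** 2)

    # Weight updates: layer input (transposed) times the layer's error signal.
    W3 -= lr * h2.T @ e;    b3 -= lr * e.sum(axis=0)
    W2 -= lr * h1.T @ d_a2; b2 -= lr * d_a2.sum(axis=0)
    W1 -= lr * X.T @ d_a1;  b1 -= lr * d_a1.sum(axis=0)
```

Tracking `losses` over the steps gives a quick indication of whether the network is learning; swapping the two `d_a` lines for the BP, FA, or IFA variants turns the same loop into the other methods.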
I’ve tried FA/DFA/IFA on a small dataset such as this Kaggle competition dataset, and the NN did learn.
You should try it on a deeper architecture and a more serious dataset such as CIFAR, for which results are available in Nøkland’s paper. Nøkland’s original implementation is available at https://github.com/anokland/dfa-torch