G. Hinton’s Forward-Forward algorithm makes training neural networks biologically plausible

Fabrizio Maria Aymone
Polimi Data Scientists
Dec 22, 2022
Source: utoronto.ca

A few weeks ago, at NeurIPS 2022, the leading machine learning and computational neuroscience conference, the godfather of AI Geoffrey Hinton presented “Forward-Forward”¹ (FF), a revolutionary algorithm for training neural networks in a biologically plausible way. This new approach is meant to compete with backpropagation, currently the most widely used training method. The latter is, in fact, responsible for the outstanding results deep neural networks have obtained over the last decade. So, why even bother finding a new algorithm?

The problems with backpropagation

If neural networks are meant to mimic the learning process of the human brain, backpropagation is certainly not how the brain works. There are four fundamental problems².

Summary: the equations of backpropagation
Source: neuralnetworksanddeeplearning.com
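
For reference, in the notation of neuralnetworksanddeeplearning.com, the four backpropagation equations read:

\delta^L = \nabla_a C \odot \sigma'(z^L) \qquad (BP1)
\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l) \qquad (BP2)
\partial C / \partial b^l_j = \delta^l_j \qquad (BP3)
\partial C / \partial w^l_{jk} = a^{l-1}_k \, \delta^l_j \qquad (BP4)

where C is the cost, z^l the weighted inputs, a^l the activations, \sigma the activation function and \delta^l the error of layer l.
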
  1. When performing backpropagation, as you can see in (BP2), the gradient is computed using the very same weights that were used in the forward pass. Biologically speaking, it is implausible for the network to know, during the backward pass, the weights it used in the forward pass³. This is known as the “weight transport” or “weight symmetry” problem.
  2. Error backpropagation leaves the activity states of the neurons untouched, i.e. neural activity is frozen during the backward pass. In the cortex, instead, it has been shown⁴ that feedback connections do affect the neural activity generated during the forward pass.
  3. The update of a given weight in the network depends on the computations of downstream neurons. Biological synapses, on the other hand, modify their strength according to local signals, i.e. the activity of their neighboring neurons⁵.
  4. Lastly, in order to update the weights of a layer it is necessary to complete the forward pass and then wait for the backward pass to reach that layer. The brain, instead, is capable of processing external stimuli in an online fashion. This is known as the “update locking” problem⁶.

The aforementioned downsides of backpropagation have prompted researchers to look for alternative ways of performing credit assignment in neural networks by modifying the algorithm. Nevertheless, none of the proposed solutions has been able to tackle all four problems at once. What was needed was a complete paradigm shift…

The Forward-Forward Algorithm

The Forward-Forward algorithm replaces the forward and backward passes of backpropagation with two forward passes, a positive and a negative one. The positive pass takes real data as input and updates the weights so as to increase a “goodness” function in every hidden layer. The negative pass, instead, uses “negative data” and adjusts the weights to decrease the goodness in every hidden layer. Negative data consists of samples that have been intentionally manipulated or generated so that they contain non-existent features or, in the case of supervised learning, a wrong feature-label association.

Generation of non-existent digit images¹

The exclusive use of forward passes avoids all the problems of backpropagation. Furthermore, since the weights are updated at each layer, solving the update locking problem, it is possible to massively parallelize the computations, substantially reducing training time. For instance, we could do the following: at time step one we compute the activations of the first hidden layer on the first sample and update its weights; at time step two we simultaneously update the weights of the second hidden layer based on the first sample and the weights of the first hidden layer based on the second sample; and so on, as sketched below.
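
As a toy illustration of this pipelined schedule (illustrative code of mine, not from the paper), the following snippet prints which layer processes which sample at every time step, assuming one pipeline stage per hidden layer:

NUM_LAYERS = 4   # pipeline stages, one per hidden layer
NUM_SAMPLES = 3  # samples streaming through the network

# At time step t, layer l can work on sample t - l, so once the pipeline is
# full every layer is busy on a different sample at the same time.
for t in range(NUM_SAMPLES + NUM_LAYERS - 1):
    updates = [(layer, t - layer) for layer in range(NUM_LAYERS)
               if 0 <= t - layer < NUM_SAMPLES]
    print(f"time step {t}: " +
          ", ".join(f"layer {l} updates on sample {s}" for l, s in updates))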

There are different goodness functions that could be used, but the one adopted in the original paper is the sum of the squares of the activities in the layer, where the activation function is ReLU. In particular, the goal of learning is to push the goodness above a certain threshold for real data and below it for negative data. The probability that a layer classifies its input as positive is obtained by applying the sigmoid function to the goodness minus the threshold.
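
In symbols, writing y_j for the activity of hidden unit j in a layer and \theta for the threshold (notation mine, following the paper):

G = \sum_j y_j^2, \qquad p(\text{positive}) = \sigma(G - \theta) = \frac{1}{1 + e^{-(G - \theta)}}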

We therefore update the weights so as to increase this probability in every hidden layer for positive samples and decrease it for negative ones. More specifically, we compute the gradient of the log-probability (don’t worry about the log, it is only there to make differentiation easier) of the layer with respect to its own weights. We then update the weights by adding this gradient for real data and subtracting it for negative data. For a single-layer network it is fairly obvious that this method converges to a good set of weights.

However, with multi-layer architectures there is a problem. The second hidden layer takes the activities of the first hidden layer as input and could therefore use the information contained in the length of the first layer’s activity vector to distinguish between positive and negative data. This information leakage prevents the model from learning the best weights, as weights in later layers would rely on the job already done by weights in earlier layers. To work around this issue, we normalize the activity vector of a layer before computing the activities of the next layer. In this way, the only information passed on to the next layer is the relative activities of its neurons, as in the sketch below.
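
A minimal sketch of such a layer in PyTorch (my illustrative code, not Hinton’s reference implementation; the class name, the threshold of 2.0, the learning rate and the use of Adam are arbitrary choices). Each layer length-normalizes its input, owns its optimizer and learns only from its local goodness signal:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, in_features, out_features, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.linear.parameters(), lr=lr)

    def forward(self, x):
        # Normalize the incoming activity vector so that only the relative
        # activities of the previous layer are passed on, then apply ReLU.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # Goodness = sum of squared activities in this layer.
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        # Push goodness above the threshold for positive data and below it for
        # negative data; softplus(x) = log(1 + exp(x)) is the corresponding
        # negative log-probability of the sigmoid.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach the outputs so no gradient flows between layers:
        # each layer learns from its own, local objective only.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()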

Now let’s see a practical example.

FF predicts MNIST digits

The MNIST dataset consists of black-and-white images of handwritten digits. Each image is 28x28 pixels and is associated with a label indicating the digit depicted. It is not trivial to implement a supervised learning model with the Forward-Forward architecture, as we no longer have an output layer. Moreover, to be fair, it is not yet clear how we could even make a prediction with such an architecture!

To begin with, we encapsulate the label within the image, taking advantage of the fact that each image has a black border on all its sides. We substitute the first 10 pixels of each image with a one-hot encoding of its label. Then, in order to generate negative data, we associate images with a wrong label. Lastly, we define the size of our layers. The original paper suggests four hidden layers of 2000 ReLU neurons each, but you can experiment with whatever architecture seems most promising to you. Now we are ready to train our model!
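
One possible way to do this in PyTorch (the function names and the use of 1.0 as the “on” pixel value are my own choices) is to overwrite the first 10 pixels of the flattened 784-pixel image with a one-hot encoding of the label, and to build negative data by shifting each label to a guaranteed-wrong one:

import torch

def embed_label(images, labels):
    # images: (batch, 784) tensor of flattened MNIST digits in [0, 1]
    # labels: (batch,) tensor of digits 0-9
    x = images.clone()
    x[:, :10] = 0.0
    x[torch.arange(x.shape[0]), labels] = 1.0  # one-hot label in the first 10 pixels
    return x

def make_negative(images, labels, num_classes=10):
    # Add a random non-zero offset so the resulting label is always wrong.
    offset = torch.randint(1, num_classes, labels.shape)
    wrong = (labels + offset) % num_classes
    return embed_label(images, wrong)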

Weight matrices relative to 100 neurons from the first hidden layer¹

After training, we want to assess our model based on its predictions on the test set. Given an unlabeled image, in order to predict its label we first generate 10 images, each corresponding to the original image with a different label encapsulated. We then feed each of these labeled images to the neural network, compute the sum of the goodness of all layers except the first, and store it as the “score” of that label. Once all 10 labeled images have been evaluated, we choose as our prediction the label with the highest score, as in the sketch below.
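
With the FFLayer and embed_label sketches above, inference could look like this (again illustrative code, assuming the network is a plain Python list of trained layers):

import torch

def predict(layers, images, num_classes=10):
    # Score every candidate label by the total goodness it produces,
    # skipping the first hidden layer as suggested in the paper.
    scores = []
    with torch.no_grad():
        for label in range(num_classes):
            labels = torch.full((images.shape[0],), label, dtype=torch.long)
            h = embed_label(images, labels)
            goodness = torch.zeros(images.shape[0])
            for i, layer in enumerate(layers):
                h = layer(h)
                if i > 0:
                    goodness = goodness + h.pow(2).sum(dim=1)
            scores.append(goodness)
    # Predict the label whose image obtained the highest total score.
    return torch.stack(scores, dim=1).argmax(dim=1)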

A neural net with this architecture reaches a 1.36% test error after 60 epochs, compared to the 20 epochs backpropagation needs to obtain a similar result.

After this brief tutorial, you know everything you need to implement your first FF neural net!

Final considerations: Analog Computing

We have seen how the Forward-Forward algorithm could revolutionize AI by bringing to the table a biologically plausible method for updating weights, while still obtaining results comparable to backpropagation. Yet, as an Electronic Engineering student, I would like to highlight the considerations Hinton makes at the end of his paper about analog hardware and “mortal computation”.

Analog computers have always been known for their computational speed and energy efficiency. The Forward-Forward algorithm can fully exploit the advantages of analog computing, whereas backpropagation would instead require A-to-D converters. The most frequently performed operation during model training is, in fact, the matrix multiplication between the activity vector and the weight matrix. Analog hardware can perform this operation quickly and energy-efficiently by treating activities as voltages and weights as conductances (the inverse of resistance). This translates into watts, and therefore money, saved, while obtaining even faster computation.

Ohm’s Law
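
Concretely (notation mine): if the activities are applied as voltages V_i and the weights are stored as conductances G_{ji}, Ohm’s law together with Kirchhoff’s current law gives the current collected at output j as

I_j = \sum_i G_{ji} V_i

which is exactly the matrix-vector product needed for the forward pass, computed in a single analog step.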

A long-standing pillar of computer science is that software should be separable from hardware: hardware should be able to execute any piece of software indistinctly. This principle, unfortunately, is violated by analog computers, whose computation is tied to the specific physical device and which are therefore effectively single-purpose. In this sense, analog computers can be defined as “mortal”, in contrast with immortal digital computers that can fulfill many different purposes. Hinton advocates for the unification of hardware and software and the use of mortal computers in order to get the best performance from neural networks in terms of energy and speed. His recently proposed algorithm, Forward-Forward, makes the case for analog hardware more compelling than ever.

[1]: Hinton, G. (2022). The Forward-Forward Algorithm: Some Preliminary Investigations. arXiv preprint arXiv:2212.13345.

[2]: Dellaferrera, G., & Kreiman, G. (2022). Error-driven Input Modulation: Solving the Credit Assignment Problem without a Backward Pass. arXiv preprint arXiv:2201.11665.

[3]: Liao, Q., Leibo, J., & Poggio, T. (2016, February). How important is weight symmetry in backpropagation? In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1).

[4]: Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35(4), 773–782.

[5]: Whittington, J. C., & Bogacz, R. (2019). Theories of error back-propagation in the brain. Trends in cognitive sciences, 23(3), 235–250.

[6]: Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., & Kavukcuoglu, K. (2017, July). Decoupled neural interfaces using synthetic gradients. In International conference on machine learning (pp. 1627–1635). PMLR.


Fabrizio Maria Aymone
Polimi Data Scientists

Electronic Engineering student at Politecnico di Milano passionate about machine learning and neuromorphic computing.