Back-Propagation No More? Breaking Down the Forward-Forward Algorithm

The Forward-Forward Algorithm: Some Preliminary Investigations

Dev Shah
deMISTify

--

This article discusses a new learning method for neural networks and walks through some high-level examples suggesting that it works well enough to be extrapolated to larger applications.

Artificial intelligence and machine learning have already revolutionized many industries and areas of research, but behind all this success is one key subset of AI: Neural networks. A neural network is a type of machine learning algorithm modeled after the structure and function of the human brain. It is made up of layers of interconnected “neurons,” which process and transmit information.

If you want to learn more about neural networks and the different types of neural networks, check my article out here.

These neural networks have many components to them, but one of the key processes within them is backpropagation.

Visual of a Neural Network.

Backpropagation is the algorithm used to train neural networks. The fundamental idea behind a neural network is that each neuron receives inputs from the neurons in the previous layer, uses them to compute an output, and transmits that output to the neurons in the next layer. Training the network comes down to adjusting the weights and biases of the connections between neurons, and this is where backpropagation comes in. Backpropagation is an efficient algorithm for learning these weights and biases: it uses the chain rule of calculus to compute the gradient of the error function with respect to the network’s weights and biases. The gradient is a vector that points in the direction of steepest increase of the error function, so stepping the weights and biases in the opposite direction reduces the error.

To compute the gradient, backpropagation uses a “backward pass”: starting at the output layer, it works backwards through the network, propagating the gradient of the error function layer by layer. The backward pass is computationally efficient because it reuses the intermediate results computed during the forward pass, which computes the network’s output for a given input.
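To make this concrete, here is a minimal sketch of a forward pass and a hand-written backward pass for a tiny two-layer network. The layer sizes, data, and learning rate are arbitrary illustrative choices; the point is how the chain rule reuses the values cached during the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # 4 examples, 8 input features
y = rng.normal(size=(4, 1))           # regression targets
W1 = rng.normal(size=(8, 16)) * 0.1   # first-layer weights
W2 = rng.normal(size=(16, 1)) * 0.1   # second-layer weights

# Forward pass: compute the output and cache the intermediate results.
h_pre = x @ W1
h = np.maximum(h_pre, 0)              # ReLU activation
y_hat = h @ W2
loss = np.mean((y_hat - y) ** 2)

# Backward pass: apply the chain rule from the output back towards the input,
# reusing the cached forward-pass values (h, h_pre, x).
d_yhat = 2 * (y_hat - y) / y.size     # dLoss/dy_hat
dW2 = h.T @ d_yhat                    # dLoss/dW2
d_h = d_yhat @ W2.T                   # dLoss/dh
d_hpre = d_h * (h_pre > 0)            # gradient through the ReLU
dW1 = x.T @ d_hpre                    # dLoss/dW1

# Step against the gradient to reduce the error.
lr = 0.01
W1 -= lr * dW1
W2 -= lr * dW2
print(float(loss))
```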

Overall, backpropagation is a key algorithm for training artificial neural networks, and has been widely used in a variety of applications, including image recognition, natural language processing, and speech recognition.

However, backpropagation isn’t a perfect algorithm; it has real problems. Let’s jump into what exactly is wrong with backpropagation.

What’s Wrong With Backpropagation?

Without a doubt, the heights that deep learning has reached over the last decade have been astonishing. Much of this boils down to the success of gradient descent with huge numbers of parameters and huge amounts of data, and those gradients are computed with backpropagation, which speaks to how well the algorithm has served us over the years.

Backpropagation Visualized.

However, despite many efforts over the years to find a way to implement backpropagation with real neurons, it remains an implausible model of how the cortex of the brain learns.

For context, the cortex is the outer layer of the brain and its main functionalities are consciousness, thought, emotion, reasoning, language and memory.

Despite those efforts, there is no strong evidence that the cortex propagates error derivatives or stores neural activities for use in a backward pass. Moreover, the top-down connections in the cortex do not mirror the bottom-up connections in the way we would expect if backpropagation were being employed. On top of this, backpropagation requires perfect knowledge of the computation performed in the forward pass in order to compute the correct derivatives: if a black box is inserted into the forward pass, backpropagation is impossible unless we learn a differentiable model of that black box. Reinforcement learning offers an alternative way to train such a network, but it is expensive, slow, and often unstable when the number of parameters and the size of the dataset are large.

To get around these issues, Hinton introduces a new algorithm: the Forward-Forward algorithm. It is comparable in speed to backpropagation, but its key advantage is that it can be used when we don’t have a perfect model of the forward computation. Let’s jump right into the Forward-Forward algorithm.

Forward-Forward Algorithm

The Forward-Forward algorithm is heavily inspired by Boltzmann machines and noise contrastive estimation. The core idea is to replace the backward pass of backpropagation with a second forward pass: the two forward passes work in the same way but operate on different data and have opposite objectives. The positive pass operates on real data and adjusts the weights to increase the ‘goodness’ in every hidden layer; the negative pass operates on negative data and adjusts the weights to decrease the ‘goodness’ in every hidden layer. In the paper, the goodness of a layer is simply the sum of the squared activities of its neurons. Now that we have a rough idea of how the Forward-Forward algorithm works, let’s go over some examples of where it has been applied.
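As a rough illustration, here is a minimal sketch of a single-layer Forward-Forward update in NumPy. It assumes the paper’s definition of goodness as the sum of squared activities and a simple logistic objective σ(goodness − θ); the layer size, threshold, learning rate, and random stand-in data are illustrative choices, not the paper’s exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 500)) * 0.05   # one hidden layer of 500 ReLUs
theta, lr = 2.0, 0.03                    # goodness threshold and learning rate

def goodness(h):
    return np.sum(h ** 2, axis=1)        # per-example sum of squared activities

def ff_update(W, x, positive):
    h = np.maximum(x @ W, 0)             # forward pass through the layer (ReLU)
    p = 1.0 / (1.0 + np.exp(-(goodness(h) - theta)))   # sigma(goodness - theta)
    # Gradient of log p (positive pass) or log(1 - p) (negative pass) w.r.t. goodness.
    dg = (1.0 - p) if positive else -p
    # Chain rule: d(goodness)/dW = x^T (2h); ReLU zeros contribute nothing since h = 0 there.
    dW = x.T @ (2.0 * h * dg[:, None])
    return W + lr * dW                   # gradient ascent on the layer-local objective

x_pos = rng.normal(size=(32, 784))       # stand-in for a batch of real (positive) data
x_neg = rng.normal(size=(32, 784))       # stand-in for a batch of negative data
W = ff_update(W, x_pos, positive=True)   # push goodness up on positive data
W = ff_update(W, x_neg, positive=False)  # push goodness down on negative data
```

In a multi-layer network, the paper also normalizes the length of each layer’s activity vector before passing it to the next layer, so that a later layer cannot judge goodness simply from the length of the vector it receives.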

Experiments with FF

Let’s walk through some of the basic experiments that were run with the Forward-Forward algorithm to validate that it could potentially replace backpropagation.

To give some context, the experiments in the paper used the MNIST dataset of handwritten digits, which provides 50,000 training images and a test set of 10,000 images. MNIST is easy to handle with simple neural networks, which makes it very convenient for testing out new algorithms.

MNIST Dataset.

Unsupervised Example of FF

There are 2 key questions about the Forward-Forward algorithm that need to be addressed.

First, if we have a source of negative data, can the algorithm learn effective multi-layer representations that capture the structure in the data? Second, where does this negative data come from?

Contrastive learning is commonly used for this kind of task. It is a self-supervised technique that lets a model learn about data without labels. The first step is to transform the input vectors into representation vectors without using any label information (this is the unsupervised part). The second step is to learn a simple linear transformation of these representation vectors into vectors of logits, which are fed into a softmax, the activation function typically used in the output layer of a neural network, to produce a probability distribution over the labels. Learning this linear transformation is supervised, but it involves no hidden layers, so no backpropagation is required. The Forward-Forward algorithm can learn the representations by using real data vectors for the positive pass and corrupted data vectors for the negative pass.
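As a sketch of the second step, here is what learning the linear transformation to logits might look like, assuming we already have representation vectors from FF-trained hidden layers. The array names, sizes, and random stand-in data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 2000))      # stand-in for learned representation vectors
labels = rng.integers(0, 10, size=1000)   # stand-in for the digit labels
W_cls = np.zeros((2000, 10))              # the only weights being learned here
lr = 0.1

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(100):
    logits = reps @ W_cls                 # simple linear transformation to logits
    grad = softmax(logits)                # cross-entropy gradient w.r.t. the logits...
    grad[np.arange(len(labels)), labels] -= 1.0
    W_cls -= lr * (reps.T @ grad) / len(labels)   # ...applied to the linear layer only
```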

There are numerous ways to corrupt the data. If we want FF to focus on the long range correlations in images, the ones that characterize shapes, the negative data should have different long range correlations but very similar short range correlations. One way to do this is to create a mask with large regions of ones and zeros and then build a hybrid image for the negative data by ‘adding together one digit image times the mask and a different digit image times the reverse of the mask’. That was a lot of content, so let’s look at a visual example down below.

Masks like this are created by starting with a random bit image, repeatedly blurring it with a filter of the form [¼, ½, ¼] both horizontally and vertically, and then thresholding the blurred image at 0.5.
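Here is a minimal sketch of that mask-and-hybrid procedure in NumPy. The filter [¼, ½, ¼] and the 0.5 threshold follow the description above; the image size, the number of blurring passes, and the random stand-in digit images are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mask(size=28, blur_passes=5):
    mask = rng.integers(0, 2, size=(size, size)).astype(float)   # random bit image
    kernel = np.array([0.25, 0.5, 0.25])                         # [1/4, 1/2, 1/4] filter
    blur = lambda v: np.convolve(v, kernel, mode="same")
    for _ in range(blur_passes):
        mask = np.apply_along_axis(blur, 1, mask)                # blur horizontally
        mask = np.apply_along_axis(blur, 0, mask)                # blur vertically
    return (mask > 0.5).astype(float)                            # threshold at 0.5

def make_hybrid(digit_a, digit_b):
    mask = make_mask()
    # One digit image times the mask plus a different digit times the reverse of the mask.
    return digit_a * mask + digit_b * (1.0 - mask)

digit_a = rng.random((28, 28))             # stand-ins for two real digit images
digit_b = rng.random((28, 28))
negative_image = make_hybrid(digit_a, digit_b)
```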

Here are the results from the paper:

After the network was trained with 4 hidden layers of 2000 ReLUs each for 100 epochs, the error rate was 1.37%.

Supervised Example of FF

On the other hand, we have a supervised example of the Forward-Forward algorithm in action. Learning without labels makes the most sense for large models that must perform many different tasks; for a smaller model dedicated to a single task, it makes more sense to use supervised learning. To use supervised learning with the Forward-Forward algorithm, the label is included in the input: the positive pass uses the image paired with the correct label, and the negative pass uses the same image paired with an incorrect label. Since the label is the only thing that differentiates positive from negative data, the algorithm learns to ignore the features of the image that don’t correlate with the label.
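As a sketch, here is one way the label can be embedded in the input, following the paper’s approach of overwriting the first 10 pixels of the flattened image with a one-of-N representation of the label. The variable names and the way an incorrect label is sampled are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_label(image_flat, label, num_classes=10):
    x = image_flat.copy()
    x[:num_classes] = 0.0       # clear the first 10 positions
    x[label] = 1.0              # one-of-N representation of the label
    return x

image = rng.random(784)         # stand-in for a flattened 28 x 28 digit image
true_label = 3

positive_input = embed_label(image, true_label)   # image paired with the correct label
wrong_label = int(rng.choice([c for c in range(10) if c != true_label]))
negative_input = embed_label(image, wrong_label)  # same image paired with a wrong label
```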

Let’s take a closer look at the results on MNIST (whose images contain a black border that makes life easier for convolutional neural networks). A network with 4 hidden layers of 2000 ReLUs each and full connectivity between layers, as in the previous example, gets a 1.36% test error after 60 epochs. Backpropagation can achieve similar results in about 20 epochs, but that matches the initial hypothesis: the FF algorithm is somewhat slower, but it should apply in settings where backpropagation falls short.

With a network trained this way, it is possible to classify a handwritten digit with just a single forward pass through the network. The pass starts from an input that consists of the test digit and a neutral label made up of ten entries of 0.1. The activities of the later hidden layers (the paper uses all but the first hidden layer) are then used as inputs to a softmax that was learned during training, and the softmax’s output gives the predicted class.
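Here is a minimal sketch of that single-pass classification procedure, assuming the hidden-layer and softmax weights have already been trained with FF (random stand-ins are used here). It also length-normalizes the activities between layers, as the paper does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for weights that would have been learned with FF: 4 hidden layers of 2000
# ReLUs, plus a softmax that reads the later hidden layers.
W_layers = [rng.normal(size=(784, 2000)) * 0.05] + \
           [rng.normal(size=(2000, 2000)) * 0.05 for _ in range(3)]
W_softmax = rng.normal(size=(3 * 2000, 10)) * 0.05

def normalize(h, eps=1e-8):
    return h / (np.linalg.norm(h) + eps)       # length-normalize between layers

def classify(image_flat):
    x = image_flat.copy()
    x[:10] = 0.1                               # neutral label: ten entries of 0.1
    h, activities = x, []
    for W in W_layers:
        h = np.maximum(h @ W, 0)               # single forward pass through each layer
        activities.append(h)
        h = normalize(h)
    features = np.concatenate(activities[1:])  # all but the first hidden layer
    return int(np.argmax(features @ W_softmax))

predicted_digit = classify(rng.random(784))
```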

Moreover, the training data can be augmented by jittering the images by up to 2 pixels in each direction, which gives 25 different shifts of every image. Training the same network on this augmented data for 500 epochs brings the test error down to 0.64%, very similar to what backpropagation achieves. Here’s a visualization of it:
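As a complement to that visualization, here is a minimal sketch of how the 25 shifted copies of an image can be generated. np.roll is used for brevity; the article does not specify how borders are handled, so that detail is an assumption.

```python
import numpy as np

def jitter_shifts(image, max_shift=2):
    shifts = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Shift by (dy, dx) pixels; the 5 x 5 combinations give 25 copies per image.
            shifts.append(np.roll(np.roll(image, dy, axis=0), dx, axis=1))
    return np.stack(shifts)

rng = np.random.default_rng(0)
augmented = jitter_shifts(rng.random((28, 28)))
print(augmented.shape)    # (25, 28, 28)
```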

Using FF to model top-down effects in perception

The examples I went over so far focused on image classification with feed-forward neural networks, where learning happens one layer at a time. In other words, what is learned in later layers has no impact on what is learned in earlier layers. This is a significant weakness of the Forward-Forward algorithm as described so far.

To get past this limitation, the static image must be treated as a video that is processed by a multi-layer recurrent neural network.

To check that this method works, a ‘video’ consisting of a static MNIST image repeated at every frame was used. The bottom layer is the pixel image and the top layer is the one-of-N representation of the digit class; between them sit 2 or 3 intermediate layers of 2000 neurons each. In the experiment, the RNN was run for 10 time-steps, and at each time-step the even layers were updated based on the normalized activities in the odd layers and the odd layers were updated based on the normalized activities in the even layers.
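Here is a minimal sketch of that alternating update schedule, with the image clamped at the bottom and a label vector at the top. The combination rule used here (a ReLU of the bottom-up plus top-down input) and the random stand-in weights are simplifications; the paper’s exact update includes details not described in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 2000, 2000, 10]    # pixel image, two hidden layers, one-of-N label
W_up = [rng.normal(size=(sizes[i], sizes[i + 1])) * 0.05 for i in range(3)]
W_down = [rng.normal(size=(sizes[i + 1], sizes[i])) * 0.05 for i in range(3)]

def normalize(h, eps=1e-8):
    return h / (np.linalg.norm(h) + eps)

image, label = rng.random(784), np.full(10, 0.1)
h = [image, np.zeros(2000), np.zeros(2000), label]   # current activity of each layer

for t in range(10):                                  # run for 10 time-steps
    for l in (1, 2):                                 # only the hidden layers get updated
        if (t + l) % 2 == 0:                         # alternate even/odd layer updates
            bottom_up = normalize(h[l - 1]) @ W_up[l - 1]
            top_down = normalize(h[l + 1]) @ W_down[l]
            h[l] = np.maximum(bottom_up + top_down, 0)
```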

In the image below, the network was trained on MNIST for 60 epochs. For each image, the hidden layers were initialized with a single bottom-up pass. Performance was then evaluated by running the network for 8 iterations with each of the 10 labels and picking the label with the highest goodness averaged over iterations 3 to 5. Overall, it had a test error of 1.31%.

Experiments with CIFAR-10 Dataset

The FF algorithm was tested on the CIFAR-10 dataset as well. For context, this dataset consists of 50,000 training images of size 32 x 32 with 3 colour channels per pixel. The images have highly variable, complicated backgrounds, which makes them difficult to model well given the limited amount of training data. Fully connected networks with 2 or 3 hidden layers overfit badly when trained with backpropagation, so almost all reported results for this dataset use convolutional neural networks.

The goal here is to show that the FF algorithm can deliver similar performance to backpropagation for images with high variability in their backgrounds.

The network tested here contains 2 or 3 hidden layers of 3072 ReLUs each. Each hidden layer is organized as a 32 x 32 topographic map with 3 hidden units at each location, and each hidden unit has an 11 x 11 receptive field in the layer below, giving it 11 x 11 x 3 = 363 bottom-up inputs. Although the performance of FF is not as good as that of backpropagation, it is only marginally worse; the table below compares the two.

Sequence Learning with Forward-Forward Algorithm

In this section, we’ll look at sequences of discrete symbols. Using the task of predicting the next character in a sequence, we’ll see that a network trained with the FF algorithm can generate its own negative data.

Let’s take the task of learning to predict the next character in a string from the previous 10 characters. The hidden layers extract higher-order features of the 10-character context, and the activities of the hidden units are then used as inputs to a softmax that predicts the probability distribution over all possible next characters. The most common way to train such a model is with backpropagation, but the FF algorithm provides a more biologically plausible alternative.

To test this, 248 strings of 100 characters each were extracted from Aesop’s fables. All upper case letters were converted to lower case, and characters outside the 30-symbol alphabet were deleted. The first 10 characters of each string were used as the context, and the network was trained to predict the remaining 90 characters. Each of the hidden layers had 2000 ReLUs. Here’s a visualization:

The hidden layers are trained using the 10-character strings from the real data as positive data; strings in which the last character is replaced by one of the network’s own predictions serve as negative data. Here’s a visual that goes over FF and its results in predicting the next character.
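To complement that visual, here is a minimal sketch of how the network can generate its own negative data for this task: the real 10-character string is the positive example, and the same string with its last character replaced by a character sampled from the model’s predictive distribution is the negative example. The alphabet, the example string, and the stand-in prediction function are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,;"   # illustrative 30-symbol alphabet
V = len(ALPHABET)

def one_hot_string(s):
    x = np.zeros((len(s), V))
    x[np.arange(len(s)), [ALPHABET.index(c) for c in s]] = 1.0
    return x.ravel()                          # flatten 10 x 30 characters -> 300 inputs

def predict_last_char(context):
    # Stand-in for the trained hidden layers + softmax over next characters.
    logits = rng.normal(size=V)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return ALPHABET[rng.choice(V, p=probs)]

real_string = "the lion a"                               # 10 characters of real data
positive_example = one_hot_string(real_string)           # positive data: the real string
sampled = predict_last_char(real_string[:9])
negative_example = one_hot_string(real_string[:9] + sampled)   # last character replaced
```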

Future Work/Thoughts

The Forward-Forward algorithm holds a lot of promise, as it addresses key areas where backpropagation falls short while delivering results comparable to backpropagation. It will be interesting to see how the Forward-Forward algorithm is extended beyond the examples discussed in this article.

If you enjoyed this article, feel free to give it claps and share it! You can catch me on LinkedIn and if you want to check out some of my other work, here’s my personal website :)

References

  1. Dickson, B. (2022, December 19). What is the “forward-forward” algorithm, Geoffrey Hinton’s new AI technique? TechTalks. https://bdtechtalks.com/2022/12/19/forward-forward-algorithm-geoffrey-hinton/
  2. Hinton, G. (2022). The Forward-Forward Algorithm: Some Preliminary Investigations. Retrieved December 27, 2022, from https://www.cs.toronto.edu/~hinton/FFA13.pdf
  3. Simeon Kostadinov. (2019, August 8). Understanding Backpropagation Algorithm. Medium; Towards Data Science. https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd
  4. Wood, T. (2019, May 17). Softmax Layer. DeepAI. https://deepai.org/machine-learning-glossary-and-terms/softmax-layer
