The evolution of Neural Network’s learning

From Perceptron’s learning procedure to Backpropagation

Farouk El Baha
Nov 13, 2021

In this article we will see how the first generation of neural networks used to learn weights and biases and how Backpropagation made the learning possible for bigger networks.

Perceptron model (Minsky-Papert, 1969)

Perceptrons

What is a Perceptron ?

It’s a simple neuron that takes as input a set of features ‘xi’ and outputs ‘1’ if the weighted sum of every feature ‘xi’ by its weight ‘ωi’ is greater than a specific threshold ‘θ’, and ‘0’ otherwise (a threshold is equivalent to a weight on an extra input that is always 1; this extra weight is called the bias).
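To make the decision rule concrete, here is a minimal Python sketch of a binary threshold unit (the function name and the example weights are illustrative assumptions, not from the original):

import numpy as np

def perceptron_output(w, x, theta=0.0):
    """Binary threshold unit: output 1 if the weighted sum exceeds theta."""
    return 1 if np.dot(w, x) > theta else 0

print(perceptron_output(np.array([0.5, -0.3]), np.array([1, 1])))  # 0.2 > 0 -> 1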

Weight-space

It has one dimension per weight, so every point in the space represents a particular setting of all the weights “ω=(ωi,ωj)”. For simplicity, let’s assume that “θ=0”; then each training case can be represented as a hyperplane (the grey plane) through the origin, as shown below:

Weight-space when the correct answer is 1 (left) and when it’s 0 (right)

To get the correct answer, the weight vector must lie on the green side: on the left of the hyperplane for the training case on the left, and on the right of the hyperplane for the case on the right. Let’s see why!

Explanation:

For each training case (x, y), x is the input and y ∈ {0,1} is the label. If y = 1, then ω = (ωi, ωj) is a solution if and only if:

ω·x > 0

And knowing that:

ω·x = ‖ω‖ ‖x‖ cos(α), where α is the angle between ω and x,

we conclude that α must lie in the interval (−π/2, π/2) (mod 2π), and thus ω = (ωi, ωj) must be on the green side, as shown above. In the same way, we can show that if y = 0, then α must lie in the interval [π/2, 3π/2] (mod 2π).

To get all training cases right, we need to find a point on the right side of all the planes (there may be no such point). If any weight vectors get the right answer for all cases, they certainly lie in a hyper-cone with its apex at the origin (because this is a convex problem, i.e. the average of two good weight vectors is itself a good weight vector).

How does it learn?

  • If the output unit is correct, don’t touch its weights.
  • If the unit incorrectly outputs a 0, add the input vector to the weight vector: ω = ω + x.
  • If the unit incorrectly outputs a 1, subtract the input vector from the weight vector: ω = ω − x.
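The whole procedure fits in a few lines of Python. A minimal sketch (perceptron_step is a hypothetical helper name; the decision rule matches the threshold unit sketched earlier):

import numpy as np

def perceptron_step(w, x, y_true, theta=0.0):
    """One application of the Perceptron learning procedure."""
    y_pred = 1 if np.dot(w, x) > theta else 0
    if y_pred == y_true:
        return w      # correct output: don't touch the weights
    elif y_true == 1:
        return w + x  # incorrectly output 0: add the input vector
    else:
        return w - x  # incorrectly output 1: subtract the input vector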

Why does the learning procedure work?

Let’s take, for example, a case where the output unit incorrectly outputs a 0.

Weight space when the output is 0 but the correct answer is 1

After this mistake, the weight vector is updated to ω = ω + x, which moves it toward the input vector:

Weight space after different steps of training

So after a finite number of mistakes, the weight vector must lie in the green region (if this region exists). You may have noticed that the weight vector does not necessarily get closer to the feasible vector ω′; this is why the formal convergence proof shows instead that the weights get closer to every “generously feasible” vector, one that classifies every training case correctly by at least some margin.
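As a quick illustration, here is a hypothetical run of perceptron_step (from the sketch above) on two linearly separable training cases; the weight vector stabilizes after a couple of sweeps:

import numpy as np

cases = [(np.array([1.0, 2.0]), 1), (np.array([-2.0, 1.0]), 0)]
w = np.zeros(2)
for sweep in range(10):    # more sweeps than needed; w stops changing early
    for x, y in cases:
        w = perceptron_step(w, x, y)
print(w)  # [1. 2.], which classifies both training cases correctly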

Perceptron’s weaknesses

  • It cannot easily handle binary input patterns (xi ∈ {0,1}): to discriminate arbitrary binary patterns we would need an enormous number of hand-crafted feature units, because each feature unit gets activated by just one binary input pattern.
  • A binary threshold output unit cannot even tell if two single-bit features are the same!

(+) cases: (1,1) → 1, (0,0) → 1
(−) cases: (1,0) → 0, (0,1) → 0

As you can see, we cannot separate the positive and negative cases with a plane. Indeed, the positive cases require ω1 + ω2 > θ and 0 > θ, while the negative cases require ω1 ≤ θ and ω2 ≤ θ; adding the last two gives ω1 + ω2 ≤ 2θ < θ (since θ < 0), which contradicts the first inequality.

  • Group Invariance Theorem: the part of a Perceptron that learns cannot learn to recognize patterns under transformations that form a group (e.g. wrap-around translations of pixel patterns).
  • The Perceptron learning procedure cannot be generalized to hidden layers (the average of two good solutions is no longer guaranteed to be a good solution, i.e. the problem becomes non-convex).

Learning with delta rule

What is the delta rule?

It’s an update formula that uses gradient descent to minimize a loss function: we move along the direction of steepest descent on the error surface to find a local minimum. It uses the derivative of a loss function (here, the quadratic loss).
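As a quick sanity check (assuming the quadratic loss E = ½(y_target − y_output)² and a linear neuron y_output = Σi wi*xi): differentiating gives dE/dwi = −(y_target − y_output)*xi, so the gradient-descent step wi ← wi − ε*dE/dwi is exactly Δwi = ε*xi*(y_target − y_output), the delta rule used below.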

A way to improve learning

Instead of showing that the weights get closer to a good set of weights, we can show that the actual output values get closer to the target values, i.e. |y_output − y_target| becomes as small as we want. Let’s take the example of a linear neuron:

  • It outputs a real value.
  • It learns by minimizing the error summed over all training cases.

A simple example

To illustrate this iterative method, let’s take the example of a person who goes every day to a restaurant near his workplace to buy lunch. His diet consists of fish, chips and ketchup, and he buys several portions of each. The cashier only tells him the total price of the meal: Price = x_fish*w_fish + x_chips*w_chips + x_ketchup*w_ketchup.

The portion prices are like the weights of a linear neuron, so we have to find the w = (w_fish, w_chips, w_ketchup) that reproduces the correct total price every time. Let’s say the prices on the restaurant’s menu are w_fish = 150 $, w_chips = 50 $ and w_ketchup = 100 $; now let’s recover those prices with an iterative method!

PS: analytically, this amounts to solving a set of linear equations, but we want a method that can be generalized to multi-layer, non-linear neural networks.
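For contrast, the analytic solution is a one-liner with least squares (a sketch, assuming the numpy arrays x_train and y_train built in the script further below):

import numpy as np

# solves the linear system x_train @ w = y_train in the least-squares sense
w, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)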

  • We start with random initial weights, then compute the error: E = y_target − y_output.
  • We update the weights by Δwi = ε*xi*E (this is the “delta rule”), with ε the learning rate.
  • We repeat this iterative process until the error is as close to 0 as we want (a worked step is shown below).
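For instance, suppose the true order is x = (2, 5, 3) portions, so the true price is 2*150 + 5*50 + 3*100 = 850 $. Starting from the guess w = (50, 50, 50), the predicted price is 500 $, so E = 350. With an illustrative learning rate ε = 1/35, the delta rule gives Δw = (1/35)*350*(2, 5, 3) = (20, 50, 30); the new weights are w = (70, 100, 80), the new prediction is 880 $, and the error has shrunk from 350 to −30.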

Try it for yourself

Here is a simple Python script for the example above:

import numpy as np

class Perceptron():
    def __init__(self, w0, lr):
        self.w = w0.astype(float)  # initial weights
        self.lr = lr               # learning rate

    @staticmethod
    def activation(w, x, b=0.0):
        """Linear unit output: weighted sum of the inputs plus bias."""
        return np.dot(w, x) + b

    @staticmethod
    def loss(y_hat, y_train):
        """Derivative of the quadratic loss w.r.t. the output (the residual)."""
        return y_train - y_hat

    def delta_rule(self, x, y_hat, y_train):
        """Gradient-descent update: w <- w + lr * (y_train - y_hat) * x."""
        return self.w + self.lr * self.loss(y_hat, y_train) * x

    def train(self, x_train, y_train, tol=1.0):
        """Sweep through the training cases until every prediction is within tol."""
        while True:
            for x, t in zip(x_train, y_train):
                y_hat = self.activation(self.w, x)
                self.w = self.delta_rule(x, y_hat, t)
            errors = y_train - x_train @ self.w
            if np.abs(errors).max() < tol:
                return self.w

# creating training data
x_train = np.random.randint(25, size=(200, 3))  # 200 training inputs
w_true = np.array([150, 50, 100])               # the correct prices to be found
y_train = x_train @ w_true                      # 200 training outputs

# start training
w0 = np.zeros(3)
lr = 0.000001
perceptron = Perceptron(w0, lr)
print(perceptron.train(x_train, y_train))       # converges to ~[150, 50, 100]

The Backpropagation Algorithm

Multi-layer Neural Network

We saw that the delta rule works well for training neural networks with a single layer of weights, but how can we train multi-layer architectures with hidden units?

Learning with hidden units

Instead of randomly perturbing the weights, we could randomly perturb the activities (i.e. the outputs) of the hidden units and keep the changes that help, since there are far fewer activities than weights. Backpropagation does much better than random perturbation.

The idea behind Backpropagation

We don’t know what the hidden units ought to do, but with the “chain rule” we can compute how fast the error changes as we change a hidden activity.

  • Use error derivatives with respect to hidden activities to train hidden units
  • Each hidden activity can affect many output units and can therefore have many separate effects on the error; these effects must be combined.
  • Once we have the error derivatives for the hidden activities it’s easy to get the error derivatives for the weights going into that hidden unit.

The backpropagation equations

1: E = ½ Σj (tj − yj)² : the squared-error of an “n-layer” network, summed over its output units j (tj is the target and yj the actual output).

2: y_2^n = f(z_2^n) : relationship between the input z and the activity y of the 2nd hidden unit in the layer “n” (f is the activation function).

3: z_2^n = Σj w_j2 * y_j^(n−1) : relationship between the input of the 2nd unit of the layer “n” and the activities of the units of the layer “n−1”.

4: dE/dy_2^n = −(t2 − y_2^n) : derivative of the error w.r.t. the activity of the 2nd unit of the layer “n” (taken here as an output unit).

5: dE/dz_2^n = f′(z_2^n) * dE/dy_2^n : derivative of the error w.r.t. the input of the 2nd unit of the layer “n” (computed using the “chain rule” and equation “4”).

6: dE/dw_j2 = y_j^(n−1) * dE/dz_2^n : derivative of the error w.r.t. a certain weight “j” between the layer “n−1” and the layer “n” (computed using the “chain rule” and equation “5”).
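To make these equations concrete, here is a minimal numpy sketch of backpropagation for a tiny 2-2-1 network with sigmoid units (the sizes, names and learning rate are illustrative assumptions, not from the original):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))          # weights: inputs -> hidden layer
W2 = rng.normal(size=(2, 1))          # weights: hidden layer -> output
x = np.array([1.0, 0.0])              # one training case
t = np.array([1.0])                   # its target
lr = 0.5

for step in range(1000):
    # forward pass (equations 2 and 3)
    z1 = x @ W1;  y1 = sigmoid(z1)    # hidden inputs and activities
    z2 = y1 @ W2; y2 = sigmoid(z2)    # output input and activity
    # backward pass
    dE_dy2 = -(t - y2)                # equation 4
    dE_dz2 = y2 * (1 - y2) * dE_dy2   # equation 5 (f'(z) = y(1-y) for a sigmoid)
    dE_dW2 = np.outer(y1, dE_dz2)     # equation 6
    dE_dy1 = W2 @ dE_dz2              # error derivative for the hidden activities
    dE_dz1 = y1 * (1 - y1) * dE_dy1
    dE_dW1 = np.outer(x, dE_dz1)
    # gradient-descent step
    W1 -= lr * dE_dW1
    W2 -= lr * dE_dW2

print(sigmoid(sigmoid(x @ W1) @ W2))  # output approaches the target 1.0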

Conclusion

The backpropagation algorithm is an efficient way of computing the error derivatives dE/dw for every weight from a single training case. Once we have them, we can minimize the error by updating every weight as follows: Δw = −ε * dE/dw, with ε the learning rate.

With that, we take a step proportional to the negative of the gradient at the current point, which is the direction of steepest descent toward a minimum (if such a point exists).

References

  • Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
  • LeCun, Y., Bengio, Y., and Hinton, G. E. (2015). Deep learning. Nature, 521, 436–444.
  • Hinton, G. E. Neural Networks for Machine Learning. Coursera.
