Backpropagation

  1. Gradient descent
  2. Backpropagation
  3. Vanishing gradients
  4. Code

1. Gradient descent

Let's look at a simple neural network demo:

  • y = sigmoid(w * x)
  • x = np.array([1,2])
  • target = np.array(0.5)
  • w = np.array([0.5,-0.5])
  • error(SSE) = 1/2 Σ(target-y)²
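
To see these numbers concretely, here is a quick forward-pass sketch in numpy (just the definitions above, nothing new):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1, 2])
w = np.array([0.5, -0.5])
target = np.array(0.5)

y = sigmoid(np.dot(w, x))                  # w*x = 0.5*1 + (-0.5)*2 = -0.5, so y ≈ 0.3775
error = 0.5 * np.sum((target - y) ** 2)    # SSE ≈ 0.0075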

OK, now, how do we update the weight w to minimize the error?

We take the derivative of the error with respect to the weights to get Δw, then update with w += Δw.

  • d(error)/dw = (target − y) * f′(h) = (target − y) * f′(Σ wᵢxᵢ)
  • f′(Σ wᵢxᵢ) = y′ = sigmoid′(w * x) = sigmoid(w*x) * (1 - sigmoid(w*x)) * x

=> d(error)/dw = (target − y) * sigmoid(w*x) * (1 - sigmoid(w*x)) * x

  • Δw = η * d(error)/dw
  • η is the learning rate, which controls the size of each descent step; typical values are 0.1, 0.01, 0.001, 0.0001.

The code is:

import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

# The neural network output (y-hat)
nn_output = sigmoid(x[0] * weights[0] + x[1] * weights[1])
# or nn_output = sigmoid(np.dot(x, weights))

# Output error (y - y-hat)
error = y - nn_output

# Error term (lowercase delta)
error_term = error * sigmoid_prime(np.dot(x, weights))

# Gradient descent step
del_w = [learnrate * error_term * x[0],
         learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
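
The snippet above stops at computing del_w; to actually apply the update from w += Δw, one more step suffices (a sketch reusing the variables defined above):

# Apply the gradient descent step (w += Δw)
weights = weights + np.array(del_w)

print('Neural network output:', nn_output)
print('Error:', error)
print('Change in weights:', del_w)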

2. Backpropagation

Backpropagation is an extension of gradient descent.

The calculation is almost the same, but when the neural network has a hidden layer between the input layer and the output layer, we need to update multiple sets of weights.

[Figure: a neural network with one hidden layer between the input and output layers]

Now, let's walk through the process of updating the weights.

It builds on gradient descent: first take the derivative of the error with respect to wH (the hidden-to-output weights), then propagate that back to get the derivative with respect to wX (the input-to-hidden weights).

  • h = sigmoid(wX * x)
  • output = sigmoid(wH * h)
  • error = y - output
  • output error term: δo = error * sigmoid(wH*h) * (1 - sigmoid(wH*h)), so d(error)/dwH = δo * h
  • hidden error term: δh = δo * wH * sigmoid(wX*x) * (1 - sigmoid(wX*x)), so d(error)/dwX = δh * x
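
Below is a minimal numpy sketch of one forward/backward pass with a single hidden layer, following these formulas; the values for x, y, wX, and wH are made up purely for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative values: 2 inputs, 2 hidden units, 1 output
x = np.array([0.5, 0.1])               # input
y = 0.6                                # target
wX = np.array([[0.1, -0.2],
               [0.4,  0.3]])           # input -> hidden weights
wH = np.array([0.3, -0.1])             # hidden -> output weights
learnrate = 0.5

# Forward pass
h = sigmoid(np.dot(x, wX))             # hidden layer output
output = sigmoid(np.dot(h, wH))        # network output

# Backward pass
error = y - output
delta_o = error * output * (1 - output)          # output error term
delta_h = delta_o * wH * h * (1 - h)             # hidden error term, scaled back through wH

# Gradient descent steps
del_wH = learnrate * delta_o * h                 # Δw for hidden -> output
del_wX = learnrate * delta_h * x[:, None]        # Δw for input -> hidden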

3. Vanishing gradients

Why does the sigmoid gradient vanish?

Because the derivative of the sigmoid is sigmoid * (1 - sigmoid), which is at most 0.25, and backpropagation multiplies in another such factor for every layer it passes through. The more hidden layers there are, the faster this product shrinks, so the weight update for the first layer becomes very small:

Schematically, the gradient at each layer is a product of sigmoid derivatives:

sig′ (output layer)

sig′ * sig′ (layer n−1)

…

sig′ * sig′ * sig′ * … * sig′ (layer 1)
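
A rough numeric illustration of this shrinkage (the sigmoid derivative peaks at 0.25 when its input is 0, so 0.25ⁿ is the best-case scaling after n layers):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# Even in the best case (input 0, where sigmoid' = 0.25), the gradient reaching
# layer 1 of an n-layer network is scaled by at most 0.25 ** n.
for n_layers in [1, 2, 5, 10]:
    factor = sigmoid_prime(0.0) ** n_layers
    print(f"{n_layers:2d} layers: gradient scaled by at most {factor:.3g}")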

4. Code

Here is the full code for gradient descent and backpropagation; you can run it yourself.