Backpropagation
- Gradient descent
- Backpropagation
- Vanishing gradient
- Code
1. Gradient descent
Let's look at a simple neural network demo:
- y = sigmoid(w * x)
- x = np.array([1,2])
- target = np.array(0.5)
- w = np.array([0.5,-0.5])
- error(SSE) = 1/2 Σ(target-y)²
Now, how do we update the weights w to minimize the error?
We differentiate the error with respect to each weight to get the step Δw, then update with w += Δw. Because we step against the gradient, the minus sign that comes from differentiating (target − y)² cancels out.
- h = Σ w_i x_i, y = f(h) = sigmoid(h)
- ∂error/∂w_i = −(target − y) * f′(h) * x_i
- f′(h) = sigmoid′(h) = sigmoid(h) * (1 − sigmoid(h))
=> Δw_i = −η * ∂error/∂w_i = η * (target − y) * sigmoid(h) * (1 − sigmoid(h)) * x_i
- η: learning rate, which controls the size of each descent step; typical values are 0.1, 0.01, 0.001, 0.0001
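One way to sanity-check the weight step is to compare it with a finite-difference estimate of the error gradient. The sketch below does this for the demo values listed earlier (x = [1, 2], target = 0.5, w = [0.5, -0.5]). The helper names `sse`, `weight_step`, and `numerical_gradient`, and the choice η = 0.1, are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def sse(w, x, target):
    """Error 1/2 * (target - y)^2 for a single sigmoid unit."""
    y = sigmoid(np.dot(w, x))
    return 0.5 * (target - y) ** 2

def weight_step(w, x, target, eta):
    """Analytic step: eta * (target - y) * y * (1 - y) * x."""
    y = sigmoid(np.dot(w, x))
    return eta * (target - y) * y * (1.0 - y) * x

def numerical_gradient(w, x, target, eps=1e-6):
    """Central finite-difference estimate of d(error)/dw."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        bump = np.zeros_like(w)
        bump[i] = eps
        grad[i] = (sse(w + bump, x, target) - sse(w - bump, x, target)) / (2 * eps)
    return grad

# Demo values from the text; eta is an assumed learning rate.
x = np.array([1.0, 2.0])
target = 0.5
w = np.array([0.5, -0.5])
eta = 0.1

step = weight_step(w, x, target, eta)
grad = numerical_gradient(w, x, target)
w_new = w + step

print("analytic step:            ", step)
print("-eta * numerical gradient:", -eta * grad)
print("error before:", sse(w, x, target), "after:", sse(w_new, x, target))
```

If the derivation is right, the two printed vectors should agree closely, and the error after the update should be smaller than before.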
The code is:
import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])
# The learning rate, eta in the weight step equation
learnrate = 0.5
# The neural network output (y-hat)
nn_output = sigmoid(x[0]*weights[0] + x[1]*weights[1])
# or nn_output = sigmoid(np.dot(x, weights))
# output error (y - y-hat)
error = y - nn_output
# error term (lowercase delta)
error_term = error * sigmoid_prime(np.dot(x,weights))
# Gradient descent step
del_w = [learnrate * error_term * x[0],
         learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
2. Backpropagation
Backpropagation is an extension of gradient descent.
The calculation is almost the same, but when the network has a hidden layer between the input layer and the output layer, we need to update more than one set of weights.

Now let's look at how the weights are updated, building on gradient descent.
First compute the weight step for wH, then propagate the error back through wH to get the weight step for wX. wH holds the weights of the hidden layer (hidden to output), and wX holds the weights of the input layer (input to hidden).
- h = sigmoid(wX * x)
- output = sigmoid(wH * h)
- error = y - output
- δ_output = error * output * (1 - output) (error term of the output layer)
- ΔwH = η * δ_output * h
- δ_hidden = δ_output * wH * h * (1 - h) (error term passed back through wH)
- ΔwX = η * δ_hidden * x
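The steps above can be sketched in NumPy for a network with two inputs, two hidden units, and one output. This is only an illustration under assumed values: the contents of `wX`, the layer sizes, and the helper names `forward`, `backprop_steps`, and `loss` are not from the original post. The finite-difference check at the end is there to confirm that the error term reaching the input weights is computed correctly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(wX, wH, x):
    h = sigmoid(wX @ x)        # hidden activations, shape (n_hidden,)
    output = sigmoid(wH @ h)   # single scalar output
    return h, output

def loss(wX, wH, x, y):
    _, output = forward(wX, wH, x)
    return 0.5 * (y - output) ** 2

def backprop_steps(wX, wH, x, y, eta):
    h, output = forward(wX, wH, x)
    error = y - output
    # Error term of the output unit
    delta_out = error * output * (1.0 - output)
    # Error term of each hidden unit: pass delta_out back through wH
    delta_hidden = delta_out * wH * h * (1.0 - h)
    step_wH = eta * delta_out * h
    step_wX = eta * np.outer(delta_hidden, x)  # shape (n_hidden, n_inputs)
    return step_wX, step_wH

# Assumed illustrative values
x = np.array([0.1, 0.3])
y = 0.2
wX = np.array([[0.4, -0.2],
               [0.1, 0.7]])    # input -> hidden weights
wH = np.array([-0.8, 0.5])     # hidden -> output weights
eta = 0.5

step_wX, step_wH = backprop_steps(wX, wH, x, y, eta)

# Finite-difference check on one input-layer weight, wX[0, 1]
eps = 1e-6
up, down = wX.copy(), wX.copy()
up[0, 1] += eps
down[0, 1] -= eps
numeric = (loss(up, wH, x, y) - loss(down, wH, x, y)) / (2 * eps)

print("backprop step for wX[0, 1]:", step_wX[0, 1])
print("-eta * numerical gradient: ", -eta * numeric)
print("loss before:", loss(wX, wH, x, y),
      "after:", loss(wX + step_wX, wH + step_wH, x, y))
```

Note that the hidden error term uses the weight `wH`, not the hidden activation a second time; the activation `h` only enters through the sigmoid derivative `h * (1 - h)`.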
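3. Vanishing gradient
Why does the gradient vanish with sigmoid?
The derivative of the sigmoid is sigmoid * (1 - sigmoid). Since sigmoid values lie in (0, 1), this derivative is never larger than 0.25. Backpropagation multiplies in one more such factor for every layer it passes through, so the more hidden layers there are, the faster the gradient shrinks, and the weight updates that reach the first layer become very small.
As a diagram, the factors accumulate like this: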
sig (output layer)
sig * sig (layer n-1)
...
sig * sig * sig * ... * sig (layer 1)
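To make the shrinking product concrete, the sketch below chains single-unit sigmoid layers and multiplies the per-layer factors that backpropagation would accumulate. It is a toy setup, not a real network: the function `gradient_factor`, the input value 0.5, and the weight of 1.0 on every layer are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_factor(n_layers, x=0.5, weight=1.0):
    """Product of weight * sigmoid'(z) over a chain of n_layers
    single-unit sigmoid layers, i.e. d(output)/d(input)."""
    a = x
    factor = 1.0
    for _ in range(n_layers):
        z = weight * a
        factor *= weight * sigmoid_prime(z)
        a = sigmoid(z)
    return factor

# Compare the accumulated factor with the upper bound 0.25 ** n
for n in (1, 2, 5, 10):
    print(n, gradient_factor(n), 0.25 ** n)
```

With a weight of 1.0 the factor can never exceed 0.25 to the power of the number of layers, so it drops by several orders of magnitude after only a handful of layers.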
4. Code
Here is the full code for gradient descent and backpropagation; you can run it yourself.
