Published in jun-devpBlog

# 1. RNN Intro

The networks covered in the previous chapters do not allow cycles in their layers. The recurrent neural network (RNN) relaxes this constraint of feedforward networks in order to handle sequential, time-series data.

## What an RNN Is Capable Of

• Unlike a classic feedforward network, which can only map the current input to an output, an RNN can generate outputs that reflect the history of previous inputs. In other words, the cycle in an RNN lets it remember previous inputs so that they influence the output.

## Recursive Computation

The figure above illustrates the unfolded graph of a recursive system. Each node represents the state at some time step t, and the function f maps a state to the next state. The same parameter θ is used at every step.

This recursion is what the RNN introduces into its layer(s); it is the cycle mentioned in the previous section.
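To make the unfolding concrete, here is a minimal sketch of the recurrence s(t) = f(s(t-1); θ). The transition function `f` (a tanh of a matrix product), the values in `theta`, and the helper name `unfold` are all my own assumptions for illustration; the only point is that the same parameter is reused at every step.

```python
import numpy as np

def f(state, theta):
    """Hypothetical transition function: maps a state to the next state."""
    return np.tanh(theta @ state)

def unfold(s0, theta, steps):
    """Apply f repeatedly, reusing the same theta at every time step."""
    states = [s0]
    for _ in range(steps):
        states.append(f(states[-1], theta))
    return states

# Made-up parameter and initial state, for illustration only.
theta = np.array([[0.5, -0.3],
                  [0.2, 0.8]])
s0 = np.array([1.0, 0.0])

states = unfold(s0, theta, steps=3)

# The third state is just f applied three times with the same theta.
s3_manual = f(f(f(s0, theta), theta), theta)
```

Unfolding the loop and writing the nested calls by hand give the same state, which is exactly what the unfolded graph expresses.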

# 2. RNN Forward Propagation

All figures and equations in this section are from Goodfellow [2].

The activation functions in figure 6 can be replaced by other activation functions, except for the softmax at the end of the network, which is needed to produce a probability distribution.
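As a rough sketch of the forward pass, assume the network uses a tanh hidden layer and a softmax output, with input-to-hidden weights `U`, hidden-to-hidden weights `W`, hidden-to-output weights `V`, and biases `b` and `c`. These names, the layer sizes, and the random weights below are my assumptions for illustration, not values taken from the figure.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c, h0):
    """Run the RNN over a sequence, returning hidden states and output distributions."""
    h = h0
    hs, y_hats = [], []
    for x in xs:
        a = b + W @ h + U @ x       # pre-activation of the hidden unit
        h = np.tanh(a)              # hidden state (tanh could be swapped out)
        o = c + V @ h               # unnormalized output scores
        y_hats.append(softmax(o))   # softmax stays: it yields a distribution
        hs.append(h)
    return np.array(hs), np.array(y_hats)

# Made-up sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 5
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(n_out)
xs = rng.normal(size=(T, n_in))

hs, y_hats = rnn_forward(xs, U, W, V, b, c, h0=np.zeros(n_hidden))
```

Each row of `y_hats` sums to one, which is why the softmax cannot simply be replaced by an arbitrary activation.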

The loss function for the configuration in figure 6 is shown below.
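Written out, and assuming the configuration uses a softmax output trained with the negative log-likelihood, the total loss is the sum of the per-step losses. Here ŷ(t) denotes the softmax output at time step t and y(t) the index of the target class; this notation is my assumption and may differ from the figure.

$$
L = \sum_{t} L^{(t)} = -\sum_{t} \log \hat{y}^{(t)}_{y^{(t)}}
$$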

The figure below shows my computation of the gradients for some of the parameters. The gradients for the other parameters are given in Goodfellow.

For how I computed the derivative of the cross entropy, and why I transposed some of the partial derivatives, please refer to these links (1) and (2).
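One way to sanity-check the cross-entropy derivative is to compare it with a finite-difference estimate. The sketch below assumes the loss at a single time step is the negative log of the softmax probability of the target class, in which case the gradient with respect to the output scores o is the softmax output minus the one-hot target. The scores and target index are made up.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def cross_entropy(o, target):
    """Negative log-probability of the target class under softmax(o)."""
    return -np.log(softmax(o)[target])

def analytic_grad(o, target):
    """Gradient of the loss w.r.t. o: softmax(o) minus the one-hot target."""
    grad = softmax(o)
    grad[target] -= 1.0
    return grad

def numeric_grad(o, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(o)
    for i in range(o.size):
        step = np.zeros_like(o)
        step[i] = eps
        grad[i] = (cross_entropy(o + step, target)
                   - cross_entropy(o - step, target)) / (2 * eps)
    return grad

# Made-up output scores and target class, for illustration only.
o = np.array([0.2, -1.0, 0.7])
target = 2

g_analytic = analytic_grad(o, target)
g_numeric = numeric_grad(o, target)
```

If the two gradients agree to within numerical error, the hand derivation for the output layer is likely correct; the same check can be extended to the weight matrices.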

Another way to calculate the gradients, which I find more intuitive, is explained well in this video.

# 3. Teacher Forcing

Teacher forcing is a technique for training an RNN that has a connection from the target value y at time step t-1 to the hidden unit at time step t. Intuitively, with this structure the network converges faster because the hidden unit receives the correct target value of the previous step. In a classic RNN without teacher forcing, the hidden states are likely to be updated with a sequence of wrong predictions, and the accumulated error delays convergence and makes training difficult.

However, in the test phase the target value y is not available, so the network has to be fed its own predictions instead. This discrepancy between training and testing can lead to poor performance and instability.
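The difference between the two phases can be sketched as follows. This is a minimal illustration, assuming a network whose hidden unit at step t receives the previous output through a hypothetical weight matrix `R`; the function name `run`, the sizes, and the random weights are my own assumptions. During training the ground-truth y(t-1) is fed in, while at test time the model's own prediction is fed back.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def one_hot(index, size):
    """One-hot vector of the given size."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def run(xs, U, R, V, b, c, targets=None):
    """Output-to-hidden RNN.

    With targets (training): the hidden unit sees the true y(t-1).
    Without targets (testing): it sees the model's own previous prediction.
    """
    n_out = V.shape[0]
    prev_y = np.zeros(n_out)          # no previous output at the first step
    preds = []
    for t, x in enumerate(xs):
        h = np.tanh(b + U @ x + R @ prev_y)
        y_hat = softmax(c + V @ h)
        pred = int(np.argmax(y_hat))
        preds.append(pred)
        if targets is not None:
            prev_y = one_hot(targets[t], n_out)   # teacher forcing
        else:
            prev_y = one_hot(pred, n_out)         # free-running
    return preds

# Made-up sizes, weights, and data, for illustration only.
rng = np.random.default_rng(1)
n_in, n_hidden, n_out, T = 3, 4, 3, 6
U = rng.normal(size=(n_hidden, n_in))
R = rng.normal(size=(n_hidden, n_out))
V = rng.normal(size=(n_out, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(n_out)
xs = rng.normal(size=(T, n_in))
targets = [0, 2, 1, 1, 0, 2]

forced = run(xs, U, R, V, b, c, targets=targets)   # training-style pass
free = run(xs, U, R, V, b, c)                      # test-style pass
```

Whenever a free-running prediction differs from the target, the next hidden state is computed from a different input than the one seen during training, which is the source of the discrepancy described above.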

# 4. References

[1] Bishop. Pattern Recognition and Machine Learning. Springer, 2006

[2] Goodfellow

[3] Graves

Any corrections, suggestions, and comments are welcome.