AWD-LSTM

Yuanrui Dong
AI³ | Theory, Practice, Business
Sep 9, 2019


AWD-LSTM (ASGD Weight-Dropped LSTM) is one of the most popular language models. It has been used in many top papers, and it also performs well as a character-level model. It combines DropConnect, averaged stochastic gradient descent (ASGD), and several other regularization strategies.

Long Short-Term Memory (LSTM) is a kind of recurrent neural network (RNN) suited to processing and predicting sequences, and it is especially common in natural language processing.

Rather than doing just a matrix multiply like a plain RNN layer, an LSTM cell consists of four components: a forget gate, an input gate, an output gate, and a cell state. The architecture is shown in Fig 1, where σ represents the sigmoid function, tanh the tanh function, X(t) the input, h(t−1) the previous hidden state, and C(t) the cell state.

Fig 1. LSTM cell structure

(1) Forget gate controls what to forget. In LSTM, it decides, with a certain probability, whether to forget the cell state from the previous step. Its inputs in the figure are the hidden state h(t−1) of the previous step and the current input X(t). Passing them through an activation function, usually sigmoid, gives the forget gate output f(t). Since the sigmoid output f(t) lies in [0,1], it controls how much of the previous cell state is kept (values near 0 forget it, values near 1 keep it).
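
Written out with assumed parameter names (W_f, U_f and b_f are illustrative weight matrices and a bias, not symbols from the figure): f(t) = σ(W_f X(t) + U_f h(t−1) + b_f).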

(2) Input gate handles the input at the current sequence position. It is composed of two parts: the first is processed with the sigmoid activation function, and the second is processed with the tanh activation function. The two results are then multiplied together to update the cell state.
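
As a sketch with assumed parameter names: the sigmoid part is i(t) = σ(W_i X(t) + U_i h(t−1) + b_i), and the tanh part is the candidate state a(t) = tanh(W_a X(t) + U_a h(t−1) + b_a).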

(3) Cell state of LSTM. The results of both the forget gate and the input gate act on the cell state C(t). Let's see how we get C(t) from the previous cell state C(t−1). C(t) consists of two parts: the first is the product of C(t−1) and the forget gate output f(t); the second is the product of the input gate's sigmoid output and the tanh-processed candidate computed from the previous hidden state h(t−1) and the current input X(t).
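
In the same sketched notation, C(t) = f(t) ⊙ C(t−1) + i(t) ⊙ a(t), where ⊙ denotes element-wise multiplication.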

(4) Output gate. The update of the hidden state h(t) consists of two parts. The first part is the output gate o(t), obtained from the previous hidden state h(t−1), the current input X(t), and the sigmoid activation function. The second part applies the tanh activation function to the cell state C(t). Multiplying the two parts gives h(t).
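
Again with assumed parameter names, o(t) = σ(W_o X(t) + U_o h(t−1) + b_o) and h(t) = o(t) ⊙ tanh(C(t)).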

An implementation of the LSTM cell in PyTorch is shown in Fig 2.

Fig 2. LSTM cell implementation
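
For reference, a minimal PyTorch sketch of such a cell might look like the following (the class and argument names are illustrative, and this is not the exact code shown in the figure):

```
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    # A minimal LSTM cell sketch; LSTMCellSketch, input_size and hidden_size are illustrative names.
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map per source produces the pre-activations of all four components at once.
        self.ih = nn.Linear(input_size, 4 * hidden_size)
        self.hh = nn.Linear(hidden_size, 4 * hidden_size)

    def forward(self, x, state):
        h_prev, c_prev = state
        gates = self.ih(x) + self.hh(h_prev)
        i, f, o, a = gates.chunk(4, dim=-1)
        i = torch.sigmoid(i)            # input gate
        f = torch.sigmoid(f)            # forget gate
        o = torch.sigmoid(o)            # output gate
        a = torch.tanh(a)               # candidate cell state
        c = f * c_prev + i * a          # new cell state C(t)
        h = o * torch.tanh(c)           # new hidden state h(t)
        return h, c
```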

AWD-LSTM adopts a variant of SGD called ASGD. The ASGD algorithm takes the same gradient update step as SGD, but instead of returning the weights computed at the current iteration, it also considers the weights from previous iterations and returns their average.
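
PyTorch ships a torch.optim.ASGD optimizer; as a plain illustration of the averaging idea, a toy loop might look like this (train_asgd, grad_fn, lr, steps and trigger are illustrative names):

```
import torch

def train_asgd(w, grad_fn, lr=0.1, steps=100, trigger=50):
    # Take ordinary SGD steps, but after a trigger point keep a running average
    # of the iterates and return that average instead of the final weights.
    avg, n = None, 0
    for t in range(steps):
        w = w - lr * grad_fn(w)          # same gradient step as SGD
        if t >= trigger:
            avg = w.clone() if avg is None else avg + (w - avg) / (n + 1)
            n += 1
    return avg if avg is not None else w

# Toy usage on f(w) = w^2, whose gradient is 2w:
w_avg = train_asgd(torch.tensor([1.0]), grad_fn=lambda w: 2 * w)
```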

LSTM recurrent connections may result in overfitting. To address this, WeightDropout is proposed. WeightDropout is known as DropConnect in computer vision. It is applied to the weight matrices between hidden states. Traditional dropout zeroes a randomly selected subset of activations at each layer. WeightDropout instead zeroes a randomly selected subset of weights, so each unit receives input from a random subset of the units in the previous layer. Overfitting of the LSTM's recurrent connections can be prevented by discarding part of the hidden-to-hidden weight matrix.
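
As a minimal sketch of the idea (weight_dropped_step and its arguments are illustrative names, not the paper's code), dropout is applied to the hidden-to-hidden weight matrix itself before it is used:

```
import torch
import torch.nn.functional as F

def weight_dropped_step(W_hh, h_prev, p=0.5, training=True):
    # Zero a random subset of the recurrent weights (not the activations),
    # so each unit sees only a random subset of the previous hidden state.
    dropped_W = F.dropout(W_hh, p=p, training=training)
    return h_prev @ dropped_W.t()

# Toy usage: a 5-unit recurrent weight matrix with half of its entries dropped.
out = weight_dropped_step(torch.randn(5, 5), torch.randn(2, 5), p=0.5)
```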

L2 regularization is often applied to weights to alleviate overfitting, but it can also be applied to individual unit activations. Activation regularization penalizes activations that grow significantly too large.
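
A sketch of such a penalty added to the loss (activation_regularization and alpha are illustrative names):

```
import torch

def activation_regularization(hidden, alpha=2.0):
    # L2 penalty on the LSTM output activations, scaled by a coefficient alpha.
    return alpha * hidden.pow(2).mean()

# The penalty is simply added to the language-modelling loss, e.g.:
# loss = cross_entropy + activation_regularization(lstm_output)
```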

The main model of AWD-LSTM is a regular multi-layer LSTM, but with several kinds of dropout applied. The implementation of AWD-LSTM in PyTorch is shown in Fig 3.

Fig 3. AWD LSTM implementation
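
A very rough sketch of that shape (the class name and hyper-parameters are illustrative, and this is not the fastai or Salesforce implementation) might be:

```
import torch
import torch.nn as nn

class AWDLSTMSketch(nn.Module):
    # Embedding, several stacked LSTM layers, and dropout applied in several places.
    def __init__(self, vocab_size, emb_size=400, hidden_size=1150, n_layers=3, dropout=0.4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.input_drop = nn.Dropout(dropout)
        # The last LSTM layer projects back to the embedding size.
        self.lstms = nn.ModuleList([
            nn.LSTM(emb_size if l == 0 else hidden_size,
                    hidden_size if l < n_layers - 1 else emb_size,
                    batch_first=True)
            for l in range(n_layers)
        ])
        self.hidden_drop = nn.Dropout(dropout)
        self.decoder = nn.Linear(emb_size, vocab_size)

    def forward(self, tokens):
        x = self.input_drop(self.embedding(tokens))
        for lstm in self.lstms:
            x, _ = lstm(x)
            x = self.hidden_drop(x)   # in the full model, WeightDropout also acts on each LSTM's recurrent weights
        return self.decoder(x)
```

The WeightDropout and ASGD pieces sketched above would then be applied on top of this skeleton during training.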
