Introduction to recurrent neural networks (RNNs)
Neural networks with memory
Introduction
A typical feed-forward neural network maps inputs to outputs with no consideration of previous computations or of where the current input falls in a sequence. The network applies the same function to each input regardless of order. This is fine for many applications, but often the context of an input is relevant to the target output. One way to address this problem is to use a recurrent neural network (RNN). An RNN is a network with ‘memory’: it maintains information about previous inputs, allowing its current output to be generated with the past in mind. RNNs have a wide range of applications including language, music, weather forecasting, and stock market prediction. In this article, we’ll take a look at basic RNN structure, building an RNN cell from its components as well as using the RNN module in PyTorch.
RNN structure
An RNN cell takes an input vector at some time t, x(t), and maps it to an output vector ŷ(t) while also considering the network hidden state h(t), which depends on the previous inputs to the network. At every time step the network computes the output vector while also computing the hidden state as a function of the input and the previous hidden state, passing the computed hidden state as an input to the next RNN cell. (The functions f and g are just the operations conducted on the input and previous hidden state to generate the new hidden state and output at time t. We’ll go into more detail later.)
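As a toy sketch of the idea in plain Python (the scalar weights 0.5 and 2.0 are arbitrary stand-ins for f and g, not trained parameters), the same update is applied at every time step and the hidden state carries information forward:

```python
# Toy scalar recurrence: the same update rule is applied at every step,
# and the hidden state h carries information about past inputs forward.
h = 0.0                      # initial hidden state
outputs = []
for x in [0.1, 0.2, 0.3]:    # an input sequence
    h = 0.5 * x + 0.5 * h    # f: new hidden state from current input and previous state
    y = 2.0 * h              # g: output computed from the hidden state
    outputs.append(y)
print(outputs)
```

Note that the last output depends on every earlier input, not just the most recent one.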
These RNN cells can be strung together to form multiple layers of cells and mixed with other NN structures such as feed-forward layers to form the final model.
The actual computation of the RNN cell is shown in the equations below, where W_h, W_x, and W_y are the weight matrices corresponding to the hidden state, input vector, and output vector respectively, and b_h and b_y are the bias vectors. The weight and bias operations will be familiar to anyone who has worked with feed-forward neural networks. The difference is the consideration of the hidden state generated by previous computations. Different activation functions can be used for different network applications.
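Written out (using the tanh and softmax activations of the cell we’ll build next), the update equations are:

```latex
h(t) = \tanh\big(b_h + W_x\,x(t) + W_h\,h(t-1)\big)
\hat{y}(t) = \mathrm{softmax}\big(b_y + W_y\,h(t)\big)
```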
Here’s one way to construct an RNN cell in PyTorch using tanh and softmax activations. The hidden state output can be used as an input to the next RNN cell.
import torch
import torch.nn as nn

class RNNCell(nn.Module):
    def __init__(self, inputSize, hiddenSize, outputSize):
        super(RNNCell, self).__init__()
        self.Wx = torch.randn(hiddenSize, inputSize)   # input weights
        self.Wh = torch.randn(hiddenSize, hiddenSize)  # hidden weights
        self.Wy = torch.randn(outputSize, hiddenSize)  # output weights
        self.h = torch.zeros(hiddenSize, 1)            # initial hidden state
        self.bh = torch.zeros(hiddenSize, 1)           # hidden state bias
        self.by = torch.zeros(outputSize, 1)           # output bias

    def forward(self, x):
        self.h = torch.tanh(self.bh + torch.matmul(self.Wx, x) + torch.matmul(self.Wh, self.h))
        output = torch.softmax(self.by + torch.matmul(self.Wy, self.h), dim=0)
        return output, self.h
But PyTorch comes with RNN and RNNCell classes that can create a single RNN cell or a multilayer RNN network, allowing you to use RNNs without constructing all of the parameters shown above. (Note that PyTorch uses a hidden state bias and an input bias vector instead of a single bias vector.) The equivalent call to our cell above is just RNNCell(inputSize, hiddenSize, bias=True, nonlinearity='tanh').
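As a quick sketch of driving PyTorch’s built-in cell step by step (the sizes here are arbitrary, chosen just for illustration):

```python
import torch
import torch.nn as nn

# Arbitrary sizes for illustration
cell = nn.RNNCell(input_size=3, hidden_size=4, bias=True, nonlinearity='tanh')

x = torch.randn(5, 3)   # a batch of 5 input vectors
h = torch.zeros(5, 4)   # initial hidden state for each element of the batch
h = cell(x, h)          # one time step: returns the new hidden state
print(h.shape)          # torch.Size([5, 4])
```

Calling the cell in a loop over time steps, feeding each new hidden state back in, unrolls the recurrence by hand; the RNN class does this looping for you.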
Back-propagation
The network parameters are updated in the same way as in a normal feed-forward network, by taking the gradient of the loss with respect to each parameter in the network; in RNNs this is known as back-propagation through time (BPTT). PyTorch’s standard loss.backward() call accomplishes this for you. To summarize, the gradient of the loss L with respect to the hidden state at the final time step τ is shown below, where o is the output. For a more detailed explanation, see the Deep Learning text.
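In the notation used here (the Deep Learning text writes the output weight matrix as V, which corresponds to our W_y), that final-step gradient is:

```latex
\nabla_{h^{(\tau)}} L = W_y^{\top}\,\nabla_{o^{(\tau)}} L
```

At earlier time steps the gradient also picks up a contribution flowing back through the next hidden state, which is what makes it back-propagation *through time*.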
A simple example
To see the RNN in action, let’s create a basic RNN in PyTorch that will learn to predict the next value in a sine curve given a preceding sequence. First, create the training data.
import torch
import matplotlib.pyplot as plt

X = torch.sin(torch.linspace(0, 100, 100000))
plt.plot(X)
plt.ylabel('Sin x')
plt.xlabel('x')
Next, use PyTorch’s Dataset/DataLoader classes to create a dataset that will randomly return a sequence from our training data as well as the next data point in the sequence. The Dataset/DataLoader functions in PyTorch are a very powerful way to let PyTorch handle the feeding of inputs and targets to your network for training. The __getitem__ method below will return a vector of data points in sequence and the target next point y.
from torch.utils.data import Dataset, DataLoader

class RNNData(Dataset):
    def __init__(self, X, sequenceLength):
        'Initialization'
        self.X = X
        self.sequenceLength = sequenceLength

    def __len__(self):
        'Denotes the total number of samples'
        return int(torch.floor(torch.tensor(len(self.X)/self.sequenceLength)))

    def __getitem__(self, index):
        sequence = self.X[index:index+self.sequenceLength]
        y = self.X[index+self.sequenceLength+1]
        return sequence, y
We’ll use the following parameters for the network and training.
#hyperparameters
batchSize = 100
sequenceLength = 50
numLayers = 1
hiddenSize = 4
learningRate = 0.01
epochs = 100
Let’s check that the data loader returns what we expect:
data = RNNData(X, sequenceLength)
dataLoader = DataLoader(data, batch_size=batchSize, shuffle=True)

for x, y in dataLoader:
    print(x)
    print(y)
    break
Now we’ll create the RNN using PyTorch’s RNN class followed by a linear layer. Since batch_first=True, the inputs and outputs of the RNN are (batchSize, sequenceLength, features). Since we’re just predicting a scalar value, our features size is just 1. We’ll use MSELoss as the loss function and an Adam optimizer.
# create our RNN based network with an RNN followed by a linear layer
class RNN(nn.Module):
    def __init__(self, inputSize, hiddenSize, numLayers):
        super().__init__()
        self.RNN = nn.RNN(input_size=inputSize,
                          hidden_size=hiddenSize,
                          num_layers=numLayers,
                          nonlinearity='tanh',
                          batch_first=True) # inputs and outputs are (batch, seq, feature)
        self.linear = nn.Linear(hiddenSize, 1)

    def forward(self, x, hState):
        x, h = self.RNN(x, hState)
        out = self.linear(x[:,-1,:]) # gets last output
        return out

# create our network instance, pick loss function and optimizer
model = RNN(1, hiddenSize, numLayers)
lossFn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learningRate)
A quick check that everything is working as we expect. We should get a vector of size batchSize x features as an output.
ytest = model(torch.randn(batchSize, sequenceLength, 1), torch.zeros([numLayers, batchSize, hiddenSize]))
ytest.shape

torch.Size([100, 1])
Now we can train the model and graph the loss for each of the 100 epochs in our training. We initialize the hidden state as an array of zeros at the beginning of each sequence.
model.train()
lossHistory = []
for epoch in range(epochs):
    lossTotal = 0
    for x, y in dataLoader:
        hState = torch.zeros([numLayers, batchSize, hiddenSize]) # zero hidden state for each batch
        yhat = model(x.reshape([batchSize, sequenceLength, 1]), hState)
        loss = lossFn(yhat.view(-1), y)
        model.zero_grad()
        loss.backward()
        optimizer.step()
        lossTotal += loss.item() # accumulate as a float so no computation graph is kept
    lossHistory.append(lossTotal)
    print(lossTotal)
plt.plot(lossHistory)
plt.title('Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
This is just a simple example so there’s no test data but we can take a look at the network’s prediction for one sequence in the training data. First print the sequence to be fed to the network as well as the correct next element, then call the model on the sequence and see what the predicted output is.
print(X[:sequenceLength])
print(X[sequenceLength+1])

tensor([0.0000, 0.0010, 0.0020, 0.0030, 0.0040, 0.0050, 0.0060, 0.0070, 0.0080,
        0.0090, 0.0100, 0.0110, 0.0120, 0.0130, 0.0140, 0.0150, 0.0160, 0.0170,
        0.0180, 0.0190, 0.0200, 0.0210, 0.0220, 0.0230, 0.0240, 0.0250, 0.0260,
        0.0270, 0.0280, 0.0290, 0.0300, 0.0310, 0.0320, 0.0330, 0.0340, 0.0350,
        0.0360, 0.0370, 0.0380, 0.0390, 0.0400, 0.0410, 0.0420, 0.0430, 0.0440,
        0.0450, 0.0460, 0.0470, 0.0480, 0.0490])
tensor(0.0510)
The correct output should be 0.0510. Let’s see how the model does.
model.eval()
model(X[:sequenceLength].reshape(1, sequenceLength, 1), torch.zeros([numLayers, 1, hiddenSize]))

tensor([[0.0523]], grad_fn=<AddmmBackward>)
The model predicted 0.0523, not too bad for a simple network!
This covers the basics of an RNN. Next, we’ll look at Long Short-term Memory networks (LSTMs) and why they’re better suited than vanilla RNNs for certain applications. If you want to run the code yourself, I included the gist below.
If you have any questions, leave them in the comments below.
Thanks for reading and you can find the rest of my articles here.