RNNs

Rishab Das
Published in The Deep Hub · 21 min read · Jun 28, 2024

Recurrent Neural Networks (RNNs) are something that sounds really cool to me, so I'm writing an article about them, with, of course, a small project inside the article. I want to do this with audio. I'm thinking classification between two, maybe three or four, audio recordings, using, of course, an RNN. My goals for this project:

  • Re-learn what RNNs are
  • Be able to explain RNNs to a five year old
  • Figure out how to use audio
  • Train on a mini dataset of me saying yes and no (for my own personal practice)

I realize it’ll be hard for you readers to hear the audio, but, what the heck, I’ll take you along for the ride as best I possibly can. I have studied this before, but that was a while back, when I was naive; now I’m serious. So without further ado, let us begin.

What is an RNN?

Like I said before, I have studied this, but I don’t really remember much. I will write down what I remember, then I will write down what I learn after reading about it.

What I remember:

So I remember that these things remember things, and that they are just like a normal neural network except that they recall past inputs and then use them to predict. I know LSTMs and GRUs are types of RNNs, or cells used in RNNs, but that’s all I can recall right now.

What I learned:

So I was right, it does remember things, and it is good at predicting precisely because it remembers. It uses what it remembers to interpret whatever gets fed into it. Unlike a normal NN, which processes each input independently, an RNN processes inputs sequentially, and with each input it updates something called the hidden state. The hidden state is where the old inputs are remembered. I will dive deeper into hidden states later, but for now, think of it as the network’s memory bank. It’s also where associations between words get made, so the network learns how a string of words fits together and can then logically conclude the next word in the sentence.

Now, just to simplify everything, here is a clearer explanation. Take a sentence, let’s say “I love you”. The words have a relationship with each other: the word “I” is usually followed by something, and if the network is trained on enough sentences it will learn that “I” tends to be followed by words like “love”, “hate”, “want”, “need”, and other words our human brains can easily guess. This is what RNNs are good at: they figure out the relationships between words and are then able to put them together. That’s it; it basically mimics how the human mind generates words.

Now, of course, there are some downsides, the major one being that it can’t really remember things from a long time ago, or rather, from a long way back in the sequence. This is where LSTMs and GRUs come in: they solve that problem. Next, I want to dive into the math and the fascinating world behind RNNs. I will explain it to the best of my abilities, and please correct me if I’m wrong.

The Math Behind RNNs

Now I will be explaining the math behind RNNs. I don’t really know much about it, but I will try my best to explain it, after I read it of course. Let us begin.

A main part of an RNN is its ability to keep and understand information, and this is stored in something called its hidden state. It’s like the input history of the model. The hidden state “encapsulates a summary of the input sequences until time t”. And here is where one problem with RNNs arises. You can see that if t were, say, 10,000 words or something astronomically high, that would be extremely expensive to compute: the quality of the connections between the individual tokens (words, in the case of a sentence), the speed of the network, and the overall accuracy of the network would all drop.

The hidden state is very good at feature representation because it acts as a compressed representation of the sequence. It captures the relevant information in the input sequence and uses it to predict future sequences. It keeps the important things from the sequence, kind of like how we humans remember the things we find important from a series of events, or from a series of words said to us. If I called my brother stupid by saying “I think you are stupid”, an RNN would probably latch onto the word “stupid”, put more (mathematical) weight on it, and then look at the words around it like “you” and “are”.

Something else the hidden state does: because the same update is applied at every timestep, it keeps folding each newly inputted step into its internal summary, which lets the network handle inputs of different lengths. But of course, as mentioned before, it has its limitations. The hidden state also allows for transfer learning. Not the type of transfer learning where you take a whole pre-trained model, but where you take the hidden representation it has learned (very similar to taking a pre-trained model and training it on a different dataset) and feed it to another network, like another RNN or a normal feed-forward NN. It just kind of gives it superpowers.

Hidden State Math

h_t = f(W_h · h_{t-1} + W_x · x_t + b_h)

Equation for hidden state

So h_t means the hidden state at time t; let’s get the easy part out of the way. We now have the rest of the jumble of letters to dissect, so let us begin.

We have f. This f is the activation function; you will see it in the code, but it is a function that transforms the values generated by a layer into something the next layer can handle. It makes it so that the next layer can take the previous layer’s output (the previous layer here being the hidden state). We then have W_h, the weight matrix applied to the previous hidden state. Next is h_{t-1}, the previous hidden state itself, and you can see why it is multiplied by that weight matrix. To that we add the input weight matrix W_x multiplied by the current input x_t, and finally we add a bias vector for the hidden state, b_h. So, to translate the gibberish to less gibberish English:

New hidden state = activation function( previous-state weights × previous hidden state + input weights × current input + hidden bias )

Less Gibberish (just read carefully)
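To make that less-gibberish version concrete, here is a tiny NumPy sketch of a single hidden-state update. The sizes (3 input features, 4 hidden units) and the choice of tanh for f are just assumptions for illustration.

import numpy as np

# One hidden-state update: h_t = f(W_h @ h_prev + W_x @ x_t + b_h).
# All sizes here are made up for illustration: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4))   # weights applied to the previous hidden state
W_x = rng.normal(size=(4, 3))   # weights applied to the current input
b_h = np.zeros(4)               # bias vector for the hidden state

h_prev = np.zeros(4)            # previous hidden state (all zeros at the very start)
x_t = rng.normal(size=3)        # the input arriving at time t

h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b_h)   # f is tanh in this sketch
print(h_t)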

Output Math

Every model has an output, so, as usual, we need to talk about the output math. It’s crucial. There will only be a little bit more math, but if you are getting bored at this point, why are you even reading this? This is supposed to be interesting to you, right!?!?! Sorry, just keep reading; it’s easy to understand. I hope.

Anyway, for most tasks you will need to calculate an output, and this output calculation is very, very similar to the hidden state calculation. You have time t, which is basically your place in the sequence, and then you have the usual matrices, activation function, and bias vector. See if you can understand it without me explaining it, so don’t scroll down too far.

y_t = g(W_y · h_t + b_y)

Equation for output calculation

I believe this is very, very similar to the hidden state equation. If you think so too, just skip this part; you have probably already figured it out. For those who haven’t, here is the explanation. We have y_t, which is the output at time t. We then have g, the activation function. The letter is different because output layers usually need a different kind of activation function, like softmax, and which one you pick depends on the number of outputs, or the shape of the output to be more precise. Each activation function squashes the values into a certain range so they fit the problem you are dealing with, but enough on that. We then have W_y, which is, you guessed it, the output weight matrix; not too hard to remember. Then h_t is, you guessed it again, the hidden state at time t. And the lonely b_y is the bias vector for the output. So, again, to translate this to less gibberish:

Output = output activation function( output weights × current hidden state + output bias )

Less gibberish

See, that wasn’t too hard. So now let’s think about this a little. Exactly how does this whole thing work together as a unit? Here is my answer. You feed an input (let’s name it J) into the network. J is folded into the hidden state, where connections are made with the equation we saw earlier. Then the result is run through to the output layer; the activation function (ReLU or tanh) is what makes the hidden state “edible” for the output node. I think this is making sense; correct me if I am wrong. Then the output layer produces its output, and during training the network keeps doing this, step after step, until its metrics are good, or until you stop it. I hope that made sense.
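Here is a rough sketch, in plain NumPy, of that whole forward pass over a sequence: the same weights are reused at every step, the hidden state carries the memory forward, and the output is read off the hidden state each time. Every name and size in it is an assumption for illustration (and g is just the identity here).

import numpy as np

def rnn_forward(xs, W_h, W_x, b_h, W_y, b_y):
    """Run a plain RNN over a sequence xs (a list of input vectors)."""
    h = np.zeros(W_h.shape[0])                  # hidden state starts at zero
    outputs = []
    for x_t in xs:                              # process the sequence one step at a time
        h = np.tanh(W_h @ h + W_x @ x_t + b_h)  # hidden-state equation
        y_t = W_y @ h + b_y                     # output equation (no activation here)
        outputs.append(y_t)
    return outputs, h

# Toy sizes: 3 input features, 4 hidden units, 2 outputs, a sequence of length 5.
rng = np.random.default_rng(1)
W_h, W_x, b_h = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
W_y, b_y = rng.normal(size=(2, 4)), np.zeros(2)
xs = [rng.normal(size=3) for _ in range(5)]

outputs, final_h = rnn_forward(xs, W_h, W_x, b_h, W_y, b_y)
print(len(outputs), final_h.shape)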

Training RNNs

I will be doing this section in two parts: before I read about it (and gain knowledge points) and after I have read about it (and have gained said knowledge points).

Before reading about it: Training an RNN seems to me like training any other type of network, but let’s see if I am wrong. I probably am, since you need to do something about this hidden state. But I feel like you just pass your inputs in, which of course need to be in a sequence, and then, once you’re done, you just let it train. That’s it.

After Reading about it:

Simply put, I think we could have guessed what I am about to say just from the hidden state: the input goes in, and then it loops through h, where h is (I believe) the hidden state that gets carried from step to step. Now there is something special compared to other neural networks, called backpropagation through time (BPTT). Sounds cool? No? Well, it’s going to be cool once you think about it. We begin with the hidden state (I believe a lot of an RNN revolves around the fact that it has a hidden state), and the same weights and biases get used at every loop (every step of the input sequence); after the forward pass, the network computes a gradient. This word gradient was really confusing to me, in fact this whole thing was, so please let me know if anything I am saying is wrong, but what I think it means is a kind of score of improvement: the direction in which each weight should change to make the loss function (the thing that shows you how much error you have) go down. While doing this for every timestep, the gradients get accumulated, so that at time T (the final, total time….. I think) the weights can be updated using all of those accumulated gradients. So I will draw a diagram to show you how this works, and then explain it a bit better.

I believe this is the best that drawing in Google Docs could do, but you get the idea. This whole loop happens with h. h encompasses all of these hidden layers (hence the name hidden state), and its weights and biases (the W and b we saw in the equations) keep getting updated. BPTT is, simply put, the process of computing those updates through time: if you were to unroll h through time, this is what would be happening. And the gradient, the quantity that tells you how to reduce the loss (the actual performance measure of the model), is just part of this whole BPTT. That is the best I can explain it; it’s a little complex. Please tell me if I am wrong.
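If you would rather see BPTT happen than do the calculus, a framework will unroll the loop and push gradients back through every timestep for you. This is only a toy sketch on random data (the shapes and the SimpleRNN layer are my choices, not anything special):

import tensorflow as tf

# A tiny RNN on random data, just to show where BPTT lives.
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(8, input_shape=(20, 1)),  # 20 timesteps, 1 feature each
    tf.keras.layers.Dense(1),
])
x = tf.random.normal((4, 20, 1))   # a batch of 4 toy sequences
y = tf.random.normal((4, 1))       # toy targets

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))   # a simple squared-error loss

# This one call is backpropagation through time: the gradient of the loss flows
# back through all 20 timesteps to the weights that are shared across them.
grads = tape.gradient(loss, model.trainable_variables)
print([g.shape for g in grads])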

With this gradient thing comes a problem called the vanishing gradient, and that is what we will talk about next, along with the solutions people have created for it. After that, we shall begin coding.

Vanishing Gradient Problem

This problem is simply the gradient getting too small and destabilizing training; the reverse side of it is the exploding gradient problem, where the gradient gets way too big, also destabilizing the model. So, of course, we have something to fix, or rather the scientists whose massive brains can actually comprehend this have something to fix. They fixed it by designing new RNN cells and also by using a clever little trick. There are plenty of other approaches, but I will talk about these three:

  • Gradient Clipping
  • LSTM
  • GRU

Gradient Clipping

Unbelievably easy to understand: you simply cap the gradient value while you are training. I had no idea how to do it in code, but you can for sure do it, and it’s a clever little trick people use; I just have a feeling it can only take you so far. You can clip the gradient, sure, but by how much, and when? A bunch of other factors come into play, and I think this is where LSTMs and GRUs come in.
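It turns out you usually don’t clip by hand; in Keras you can just ask the optimizer to do it. A minimal sketch (the threshold of 1.0 is an arbitrary choice, not a recommendation):

import tensorflow as tf

# clipnorm rescales the whole gradient whenever its norm exceeds 1.0;
# clipvalue would instead cap each individual gradient entry.
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# then: model.compile(optimizer=optimizer, loss=..., metrics=...)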

LSTM

Long Short-Term Memory cell; that’s what it stands for. I will go a little faster here, but essentially it is designed to address the vanishing gradient problem, which, as we said before, is when the gradient gets too small. Okay, how does it fix the problem? LSTMs contain a memory cell, hence the name, and the memory cell maintains information over long periods of time. Then there are three gates: the input gate, the forget gate, and the output gate. These decide, respectively, what information is added to the cell, what information is removed (forgotten) from the cell, and what information is exposed as output. The mathematics behind each gate is simple, and they are all very similar: each one uses the previous hidden state, the current input, an activation function (sigmoid or tanh), and the weights and bias of that gate. Below are the equations.

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Equation for Forget Gate

As you can see, everything I explained before is there, and the little o-looking symbol (σ) is the sigmoid activation function.

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

Equation for Input Gate

As you can see, it is basically the exact same. Now, after the input there is always a potential update, because part of the input gate’s job is to choose how much of it to apply. This potential update is called the candidate cell state, and below is its equation.

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

Equation for Candidate Cell State

This is very similar to all the equations above; in fact, it is so similar that I won’t walk through it, except to say that tanh is just another activation function. Now the cell needs to decide how much of the candidate to add, and this is done in tandem with the forget gate and the input gate, as you can see below.

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

Equation for how much to add from Candidate Cell State

Intuitively, this equation makes sense: f_t is the forget gate, and it multiplies the previous cell state C_{t-1} (the C without the symbol on top). Added to that is the candidate C̃_t (the C with the symbol on top) multiplied by the input gate. Both the input and forget gates land between zero and one, because the sigmoid squashes them into that range.

And finally, we have the output gate equation. After the output gate, the hidden state of the cell must be updated; that update is simply the output gate activation multiplied by the tanh of the current cell state.

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

Output Gate Equation

h_t = o_t ⊙ tanh(C_t)

New Hidden State Equation

The advantage here makes sense: it is basically a memory bank inside of a memory bank, and the LSTM has much better control over the flow of memory, because it can decide what goes in and what doesn’t.
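Before moving on, here is one LSTM step written out in plain NumPy, just to tie the gate equations together. The toy sizes and the trick of stacking [h_{t-1}, x_t] into one vector are my assumptions for illustration (keeping separate matrices for h and x is mathematically the same thing).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step, following the gate equations above."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to drop from the cell
    i_t = sigmoid(W_i @ z + b_i)          # input gate: how much new stuff to let in
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # mix old memory with the candidate
    o_t = sigmoid(W_o @ z + b_o)          # output gate: what to expose
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t

# Toy sizes: 3 input features, 4 units, so each W acts on a vector of length 7.
rng = np.random.default_rng(2)
W = {k: rng.normal(size=(4, 7)) for k in "fico"}
b = {k: np.zeros(4) for k in "fico"}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4),
                 W["f"], W["i"], W["c"], W["o"], b["f"], b["i"], b["c"], b["o"])
print(h.shape, c.shape)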

GRU

Gated Recurrent Unit. That’s what the name of the main character of Despicable Me 4, in all caps, stands for in machine learning terms. It is simply a simpler version of the LSTM cell: it merges the input and forget gates into one gate, and merges the cell state into the hidden state. Its two gates are called the update gate and the reset gate. The equations are extremely similar.

z_t = σ(W_z · [h_{t-1}, x_t] + b_z)

Equation for Update Gate

r_t = σ(W_r · [h_{t-1}, x_t] + b_r)

Equation for Reset Gate

We also have equations for the candidate hidden state and the new hidden state. Remember, just like the candidate cell state in the LSTM, the candidate hidden state is just a potential update to the hidden state, and the new hidden state is then calculated from it. Below are the equations for this bunch of computations.

h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t] + b)

Candidate Hidden State Equation

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

Hidden state calculation

The (1 - z_t) term and the z_t term are complementary: since z_t lands between zero and one, (1 - z_t) decides how much of the old hidden state to keep while z_t decides how much of the candidate to mix in, and the two always add up to one. Now, there are advantages to this, and the obvious one is that there is less computing. Fewer equations mean less computing, which makes for a faster, more efficient model; that makes sense, right? There are also fewer parameters, meaning there is less internal number jumble. Make sense? Well, as I promised, if you are here for some coding, let us begin. (If anything I explained is wrong, please write a comment. I don’t care if it’s mean, I want to learn more.)
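Before the real coding, you can actually see the “fewer parameters” claim by counting weights in Keras. A quick sketch (the 32-dimensional input and 64 units are arbitrary choices):

import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 32))   # variable-length sequences of 32-dim features
lstm_model = tf.keras.Model(inputs, tf.keras.layers.LSTM(64)(inputs))
gru_model = tf.keras.Model(inputs, tf.keras.layers.GRU(64)(inputs))

print(lstm_model.count_params())   # four gate transformations' worth of weights
print(gru_model.count_params())    # only three, so noticeably fewer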

Coding

Now my idea for this project is simple. I want to use an RNN to classify between different audio files, and there are a few things I need. I need an audio dataset (I will be using the Audio MNIST dataset from Kaggle), and I will be using TensorFlow (I know PyTorch may be better for this project, but I feel more comfortable with TensorFlow). And then, the final step: I need to know how to begin. I will probably need to load in the dataset, but I have no idea how neural networks take in audio, let alone RNNs.

So, from my research and my looking at articles both similar and not similar to this one, we have to extract features from the audio. Makes sense. This can be done through a spectrogram (that’s the right word; a spectrograph is the instrument that produces one). Now that I know that, we need a library to get these audio features, and that library is librosa. I just found out today that it exists, and it makes a lot of sense to use, since its job is exactly what I need right now: extracting features from audio. So let us begin; we need to load in the dataset.

import os

root = "/Users/rish/Desktop/data"
n = 60  # the number of different speakers
folders = [os.path.join(root, str(i).zfill(2)) for i in range(1, n+1)]  # "01", "02", ..., "60"
print(folders)

We build the list of speaker folders here (I took this code from Kaggle because I wasn’t sure how to deal with the data).

files = []

for folder in folders:
    files += os.listdir(folder)  # collect every file name from every speaker folder

print(files[:20])

This code goes through each speaker folder and collects every file name into one list so that we can access them later.

X = []
Y = []

for file in files:
    label = file.split("_")[0]   # the spoken digit, e.g. "7" from "7_01_0.wav"
    human = file.split("_")[1]   # the speaker id, which is also the folder name
    X.append(os.path.join(root, human, file))
    Y.append(label)

print(len(X))
print(len(Y))
print(X[:20])
print(Y[:20])

This code builds the full path of each audio file and puts it in X, and appends the corresponding label to Y. So now we have an X and a Y. We can extract features from each file in X, probably add them to another list, then use train_test_split, and then we can use our RNN. But first, I want to display an audio file using librosa.

import librosa
import librosa.display

y, sr = librosa.load(X[5])          # y is the waveform, sr is the sampling rate
librosa.display.waveshow(y, sr=sr)

The load function gives us two values: y, the waveform, and sr, which (I found this out today) means sampling rate, the number of samples per second. Judging from how the graph looks, I don’t think these recordings are long. They are very, very short, about half a second.

Plot for waveform of Audio sample
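To double-check how short these clips really are, the number of samples divided by the sampling rate gives the duration in seconds (this uses the y and sr we just loaded):

print(sr)            # samples per second; librosa.load resamples to 22050 by default
print(len(y) / sr)   # duration of this clip in seconds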

You can see clearly in the waveform when someone begins to say the number and when they finish. It’s pretty cool, in my opinion. So now that we have that, our next step (I don’t know what step number we are on) is to extract a feature from each audio file and probably append it to a new list. We will need librosa for that too. But the question is: what feature of the audio do we extract, and which features are good for RNNs, if it even depends on the model?

Features of Audio

Before we try to train an RNN on audio, I (and you readers) need to learn what features of audio are important, which ones matter for ML, and, specifically, which ones matter for RNNs. Now, I could probably just look up the question “What feature should I extract from audio to build an RNN?” and the answer would be there. But I don’t want to; I want to type more (haha). So, what’s the answer? Let us begin.

Audio feature extraction basically just means grabbing different features, or unique aspects, of the audio and inputting them into the model; these aspects are often statistical measurements of the signal. Features are usually categorized along a few dimensions: level of abstraction, temporal scope, musical aspect, signal domain, and, the one we’ve all been waiting for, the ML approach. Looking at the level of abstraction: some features are high-level (things we humans can easily understand), some are mid-level (understandable, but mostly to musicians; this is coming from a musician), and then there are low-level features, the ones that don’t make sense to humans, only to machines. For example (please tell me if you can hear any of these when you listen to audio; if you can, you’re not human (this should be a test for whether someone is human)): amplitude envelope, energy, spectral centroid, spectral flux, and zero-crossing rate, just to name a few.

So I now know what type of audio features to extract. I didn’t know the name, but I wanted mathy ones, and now we have it: the mathematical features called low-level audio features. There are many, many types of audio features for machine learning, but for deep learning, according to the article I am reading right now, the spectrogram has become the most popular low-level feature. There is also something called the mel spectrogram, which is a spectrogram whose frequency axis is mapped onto the mel scale. I had never heard of this scale before, but it is a scale where perceived pitches are an equal distance from each other (to my musician ears, a descending chromatic mel scale sounds roughly like half steps). The scale is based on the fact that we (some genius figured this out) hear pitch logarithmically and can tell lower pitches apart more easily than higher ones; the mel scale puts everything at an equal perceptual distance.
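Just to see one, here is a minimal sketch of computing a mel spectrogram with librosa, using the y and sr we loaded earlier (the 64 mel bands are an arbitrary choice):

import numpy as np
import librosa

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)   # power spectrogram mapped onto the mel scale
mel_db = librosa.power_to_db(mel, ref=np.max)                 # convert power to decibels for readability
print(mel_db.shape)                                           # (mel bands, number of frames)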

Then there are the Mel-Frequency Cepstral Coefficients, or MFCCs for short. I think this is the one, the feature we will use. But what is it? This is going to be complicated, I warn you; I don’t fully understand it, but I will try to explain. The cepstrum is the “spectrum of the log of the spectrum of the time signal”: take the spectrum over a given window of the signal, take its log, then take the spectrum of that, and you get the cepstrum. It describes the rate of change of the spectral bands of a signal (when you speak and the little audio graph shows up, that’s the waveform; the spectrum tells you which frequencies are in it, and the cepstrum is roughly about how those frequency bands change). “The Mel-Frequency Cepstral Coefficients (MFCCs) are nothing but the coefficients that make up the mel-frequency cepstrum” (I will link the article here). Basically, these numbers capture characteristics of the audio, including its timbre (pronounced “tamber”), meaning the style or color of the sound; those are the best words I can use to describe it. So let us go use it. There is a bunch more information out there that is really, confusingly cool; I encourage you to look at it, and you will never think of audio the same way again.

Audio Feature Extraction

hop_length = 512
n_fft = 255

# makeshift time start and end variables
t_start = 0
t_end = 0.6

y_cut = y[int(round(t_start*sr)):int(round(t_end*sr))]  # slice out the first 0.6 seconds of the waveform

MFCCs = librosa.feature.mfcc(y=y_cut, n_fft=n_fft, hop_length=hop_length, n_mfcc=128)

And then…..

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20, 7))
img = librosa.display.specshow(MFCCs, sr=sr, cmap='cool', hop_length=hop_length, ax=ax)
ax.set_xlabel('Time', fontsize=15)
ax.set_title('MFCC', size=20)
fig.colorbar(img, ax=ax)   # attach the colorbar to the MFCC image explicitly
plt.show()
MFCC plot for the audio sample

I am not too sure what this graph means. I can see there is some variation in the sound, like in the middle it gets less pink; I think that means the coefficients change while the digit is being spoken, but I am not too sure. Remember, I first heard of this about 20 minutes ago. So, let us take this information and feed it into our RNN.

RNN Time

Before we create our RNN, we need to make the data usable, so I used a very inefficient for loop that goes through all 30,000 files and gets them into a form the RNN can actually use.

import numpy as np

features = []

hop_length = 512
n_fft = 200

for i in range(0, len(X)):
    y, sr = librosa.load(X[i])
    # makeshift time start and end variables
    # (note: t_end = 0 makes y_cut an empty slice -- possibly meant to be 0.6 like before)
    t_start = 0
    t_end = 0

    y_cut = y[int(round(t_start*sr)):int(round(t_end*sr))]
    data = np.array([librosa.feature.mfcc(y=y_cut,
                                          n_fft=n_fft, hop_length=hop_length, n_mfcc=128)])
    features.append(data)

We can now do a train test split. I am not sure the data looks correct; I assume the values sit between zero and some high number and don’t go too low (relative to that high number), if that makes sense. Note that we want to split the extracted features and their labels, not the file paths.

from sklearn.model_selection import train_test_split

X, y = np.array(features), np.array(Y)   # stack the extracted MFCCs and their labels into arrays
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(X_train.shape), print(y_train.shape), print(X_test.shape), print(y_test.shape)

We get this shape

(21000, 1, 128, 1) (21000,) (9000, 1, 128, 1) (9000,)

What I think this shape means is that we have an extra dimension somewhere (that extra 1 comes from wrapping each MFCC array in np.array([...]) inside the loop). I don’t know if it matters, as long as the input dimensions the model expects and the ones it receives are the same. But before that, let’s just squeeze the extra dimensions out of here; we should have 3D, not 4D, so here we go.

X_train = np.squeeze(X_train)
X_test = np.squeeze(X_test)

(21000, 128) (9000, 128)

Okay, so that squeezed off one dimension too many; the LSTM wants 3D input (samples, timesteps, features), so let’s add a dimension back using np.expand_dims.

X_train = np.expand_dims(np.squeeze(X_train), axis = -1)
X_test = np.expand_dims(np.squeeze(X_test), axis = -1)

print(X_train.shape), print(y_train.shape), print(X_test.shape), print(y_test.shape)

Okay, we are almost to the training and the end of this article. Let’s create and train our model.

import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Dropout

input_shape = (128, 1)   # 128 MFCC "timesteps", 1 feature each
model = tf.keras.Sequential()
model.add(LSTM(128, input_shape=input_shape))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(48, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(24, activation='softmax'))
model.summary()

Train. Okay, so I went to train and a few things happened, one of which I am still stuck on, but the other was a simple fix: our y values are strings. I don’t know how I didn’t notice that, so let’s fix it real quick.

y_train = np.array([int(i) for i in y_train])   # the labels were strings like "7"; make them integers
y_test = np.array([int(i) for i in y_test])

Training!!!!

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])  # fit needs a compiled model; this optimizer/loss choice is my assumption
history = model.fit(X_train, y_train, epochs=10, shuffle=False)

So we get an accuracy of 0.0993, with a high loss. I am not going to dig into why, but for the record, with ten digit classes that is about what random guessing would give, so something in the pipeline (probably the feature extraction) deserves a second look. Anyway, I will tell you what I learned. I did promise myself that I would build a thing that classifies between audio of me saying yes and no, but I don’t need to do that anymore. I think I understand the steps, so here they are:

  • Put all the audio file paths in a list
  • Then extract a feature from each file and put those in a list
  • Then just go about it as normal; the first two steps were what I needed help with the most, since I didn’t know how to start

What I learned

Files, files, files. I’m not good with them, but boy did I need to know how to use them. I had to figure out how to put them in a list (I didn’t; a random Kaggle notebook did), but I now understand how to do it, and more importantly, where to start on a project like this. I also learned a lot about audio, the mel scale, and MFCCs (I might have spelled the acronym wrong somewhere in here). And I learned that you should check whether your labels are strings before you train. But at the end of the week (or two weeks, I don’t know) it took me to write this, I am happy, and I hope you enjoyed reading as much as I enjoyed writing. And, as always, if anything I did is wrong, please just tell me in the comments, or email me, or something. I don’t know, just tell me. Thanks!!!
