Word Embeddings, LSTMs and CNNs Explained

Paul Hurlocker
May 1, 2022


TL;DR

This is a supplement to my Predicting Price with and without Sentiment post. It provides a brief overview of word embeddings, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs).

Word Embeddings

Traditional frequency-based methods for creating text representations, such as TF-IDF and count vectorization, produce sparse representations that don’t effectively capture the relationships between words. Sparse representations can lead to very high-dimensional spaces in which computing similarity is difficult. Word embeddings are multi-dimensional vectors that can capture semantic and syntactic relationships between words. An example of a semantic relationship is the one between “teacher” and “student” or between “stop” and “go”. An example of a syntactic relationship is knowing that questions start with a question word, as in “Where is the dog?”, or that adjectives come before nouns, as in “blue car”. Pre-trained word embeddings, such as Word2Vec and GloVe, can be used instead of training an embedding layer along with a model (Mikolov et al., 2013; Pennington et al., 2014). Pre-trained embeddings can be fine-tuned or used as-is. The Bitcoin experiment I performed does not use a pre-trained embedding layer.

In a word embedding, each word is represented by a fixed-length vector. Embedding layers work like lookup tables where the words are the keys and the dense word vectors are the values. As configuration, an embedding layer takes the size of the vocabulary to be encoded, the desired size of each word vector, and the length of each sequence passed into the layer. The output of an embedding layer is a two-dimensional array with one embedding vector for each word in the input sequence. The weights of the embedding layer are the vector representations of the words in the vocabulary that can be looked up. When an embedding layer is trained in conjunction with a task such as predicting sentiment, its weights are adjusted so that words associated with positive sentiment move closer together in the embedding space and, conversely, words associated with negative sentiment cluster together.
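As a concrete illustration, here is a minimal sketch of an embedding layer used as a lookup table, written with Keras. The vocabulary size, embedding dimension, and sequence length are illustrative values, not the ones used in the Bitcoin experiment.

```python
import numpy as np
import tensorflow as tf

vocab_size = 10_000    # number of distinct tokens the layer can encode (illustrative)
embedding_dim = 100    # desired length of each word vector (illustrative)
sequence_length = 50   # number of tokens in each input sequence (illustrative)

# The embedding layer acts as a lookup table: integer word indices in,
# dense word vectors out.
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)

# A batch containing one sequence of random integer word indices.
token_ids = np.random.randint(0, vocab_size, size=(1, sequence_length))

# Output shape is (batch, sequence_length, embedding_dim): one dense
# vector looked up for each token in the input sequence.
vectors = embedding(token_ids)
print(vectors.shape)  # (1, 50, 100)
```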

Long Short-Term Memory (LSTM)

A long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) is a type of recurrent neural network (RNN) that can model time- or sequence-dependent behavior by feeding the output of a hidden layer at time step t back into the input of the same layer at time step t + 1. The figure below shows the inner workings of an individual hidden cell in an LSTM. Unrolled over time, an LSTM layer has one hidden cell per time step. At time step t, an input vector xt of size f is fed into the hidden cell, passes through the cell’s gates, and the hidden state ht and cell state ct are output. This process repeats for each of the t time steps, with the hidden state ht and cell state ct produced at one step fed as inputs to the next step along with the next input vector xt+1.
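Before looking inside the cell in the figure below, here is a minimal sketch of how a sequence flows through an LSTM layer in Keras. The batch size, number of time steps, feature size, and number of units are illustrative values, not those from the Bitcoin experiment.

```python
import numpy as np
import tensorflow as tf

time_steps = 50     # t: number of time steps in each sequence (illustrative)
feature_size = 100  # f: size of each input vector xt (illustrative)
units = 64          # size of the hidden state ht and cell state ct (illustrative)

lstm = tf.keras.layers.LSTM(
    units,
    return_sequences=True,  # return the hidden state ht at every time step
    return_state=True,      # also return the final hidden and cell states
)

# One batch with a single random sequence of shape (time_steps, feature_size).
x = np.random.rand(1, time_steps, feature_size).astype("float32")
all_h, final_h, final_c = lstm(x)

print(all_h.shape)    # (1, 50, 64): ht for each of the 50 time steps
print(final_h.shape)  # (1, 64): hidden state ht after the last step
print(final_c.shape)  # (1, 64): cell state ct after the last step
```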

LSTM hidden cell

The LSTM hidden cell has several gate functions: the input gate it, the forget gate ft, and the output gate ot. W and U are the weights applied to the input and to the previous hidden state, respectively, and b is the bias. The following calculations are performed inside each hidden cell:
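In the standard formulation (Hochreiter and Schmidhuber, 1997), with σ denoting the sigmoid function and ⊙ element-wise multiplication, the gate computations can be written as:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```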

Convolutional Neural Network (CNN)

A convolutional neural network (CNN) is a type of neural network most often used for computer vision tasks, but it can be applied to many other regression and classification tasks as well. CNN-based models are feed-forward neural networks that, like other neural networks, are optimized with backpropagation. They have at least one convolutional layer, which is not densely connected the way a layer in a standard feed-forward neural network is.

A convolutional layer employs a mathematical operation called a convolution, as demonstrated in the figure below: a small matrix called a kernel is strided over the input data, and at each position the element-wise products of the kernel and the overlapping region of the input are summed to produce one value of the output. The size of the kernel, the stride, and the padding determine the size of the output. A convolutional layer has a set number of filters and a set kernel size.

CNN Input, Kernel, and output example.
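The following is a minimal sketch of the operation in the figure above: a 3×3 kernel strided over a 5×5 input with stride 1 and no padding. The values are made up purely for illustration.

```python
import numpy as np

# 5x5 input, 3x3 kernel, stride 1, no padding (all values illustrative).
input_data = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])
stride, padding = 1, 0

# Output size = (input - kernel + 2 * padding) / stride + 1 = 3
out_size = (input_data.shape[0] - kernel.shape[0] + 2 * padding) // stride + 1
output = np.zeros((out_size, out_size))

for i in range(out_size):
    for j in range(out_size):
        # Element-wise multiply the kernel with the overlapping patch of the
        # input, then sum to produce a single output value.
        patch = input_data[i * stride:i * stride + kernel.shape[0],
                           j * stride:j * stride + kernel.shape[1]]
        output[i, j] = np.sum(patch * kernel)

print(output.shape)  # (3, 3)
```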

References

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space.” arXiv.org, 2013. https://arxiv.org/abs/1301.3781.

Pennington, Jeffrey, Richard Socher, and Christopher Manning. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. https://nlp.stanford.edu/pubs/glove.pdf.

Hochreiter, Sepp, and Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation 9, no. 8 (November 1, 1997): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.

LeCun, Yann, and Yoshua Bengio. “Convolutional Networks for Images, Speech, and Time-Series.” In The Handbook of Brain Theory and Neural Networks, 1995. http://www.iro.umontreal.ca/~lisa/pointeurs/handbook-convo.pdf.

