Character awareness for language models using CNNs

Arnab · Published in Analytics Vidhya · Jun 17, 2021 · 7 min read

This article walks through a use-case of CNNs for NLP, with practical implementation details in PyTorch and Python. It is based on the paper “Character-Aware Neural Language Models”.


Background:

Language models: A language model assigns a probability to a sentence or phrase. Given some words, we can predict the next word as the one that makes the overall phrase most probable. We often see language models used for text completion in messaging services.

The children play in the ________

A good language model would be able to predict the next word as ‘park’, or something else that makes sense.

More formally,

We can model the probability of a given sequence of ‘m’ words by taking in each word and computing its probability conditioned on the words that have appeared before it.
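
In other words, the probability of the whole sequence factorizes into a product of conditional probabilities:

P(w_1, w_2, …, w_m) = ∏_{t = 1…m} P(w_t | w_1, …, w_{t−1})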

In the pre-Transformer era, most papers used RNNs for this task, and that is also the case for the model used in the paper.

Typical RNN architecture for the language modeling task

Given a word, its embedding is looked up from a pre-defined vocabulary and linearly transformed using the weight matrix W_e. The recurrent part comes from the hidden states ‘h’, where we reuse the same weight matrix W_h at every timestep. The result is then squashed through a nonlinearity, usually a tanh.

Finally, when it comes to predicting the next word we use a softmax to output a distribution over the vocabulary.
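
Omitting bias terms for brevity, and writing U for the output projection matrix that maps the hidden state to vocabulary-sized scores (U is my notation; it is not listed among the shapes below), the recurrence can be written as:

h_t = tanh(W_e · x_t + W_h · h_{t−1})
ŷ_t = softmax(U · h_t)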

The shapes of each of the matrices are as follows:

  1. h_t: (D_h, ) → Shape of the hidden state
  2. W_h: (D_h x D_h) → Shape of the weight matrix for hidden states
  3. x_t: ( E, ) → Shape of the embedding of each word
  4. W_e: ( D_h x E ) → Shape of the weight matrix for embeddings

To recap, during training we take the input word by word, look up each word-level representation, and feed it through an RNN to generate a prediction. We then train our network against the true probability distribution of the next word over our vocabulary.
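
For reference, a minimal word-level RNN language model along these lines might be sketched in PyTorch as follows (the layer names and sizes are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class WordRNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word-level lookup table
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # tanh RNN
        self.proj = nn.Linear(hidden_dim, vocab_size)      # hidden state -> vocabulary scores

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) integer indices into the word vocabulary
        emb = self.embed(word_ids)        # (batch, seq_len, embed_dim)
        hidden, _ = self.rnn(emb)         # (batch, seq_len, hidden_dim)
        return self.proj(hidden)          # (batch, seq_len, vocab_size) logits for the next word
```

The logits at each position would then be compared against the index of the true next word, e.g. with nn.CrossEntropyLoss, which applies the softmax internally.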

Although these word-level embeddings work great in practice, when we encounter unknown or rare words our predictions might not be as accurate.

Usually, any unknown word that is not present in our vocabulary is mapped to an ‘unk’ token, which itself has some embedding representation.

In the above-mentioned paper, a character-level representation of each word is used which we obtain through 1d convolutions over character-level embeddings.

Character Representation of words:

We represent each character as a 1D vector and concatenate them to represent each word. I also want to add some practical implementation details, so let’s go through it step by step. First, we want a representation of each character as an index.
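
A minimal sketch of such a mapping (the exact character set and the reserved pad index 0 are assumptions for illustration):

```python
# Character vocabulary: '{' and '}' will serve as the start/end-of-word markers
# described below; index 0 is reserved for padding (an assumption, padding is
# discussed in a moment).
chars = list("abcdefghijklmnopqrstuvwxyz0123456789.,;:!?'-")
char2id = {'<pad>': 0, '{': 1, '}': 2}
for c in chars:
    char2id[c] = len(char2id)
```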

We use the open and close brackets ( { , } ) to denote the start and end of a word in a sentence. We also have to define a function that takes in a list of sentences and outputs the character indices (the ones we mapped earlier).

For example:

Each word is converted into indices using the dictionary we defined earlier. Additionally, each word starts with 1 ( { ) and ends with 2 ( } ). But the lengths of the sentences and words will differ, so we must pad each word to get an equal length.
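
Putting this together, a sketch of the conversion function (reusing the char2id dictionary from above; the function name and padding scheme are my own) might look like:

```python
def sents_to_char_ids(sents, char2id, max_word_len):
    """Convert tokenized sentences into padded character indices.

    sents: a list of sentences, each a list of word strings.
    Returns a nested list of shape (num_sents, max_sent_len, max_word_len).
    """
    pad_id = char2id['<pad>']
    max_sent_len = max(len(sent) for sent in sents)
    batch = []
    for sent in sents:
        sent_ids = []
        for word in sent:
            # wrap the word with the start ('{' -> 1) and end ('}' -> 2) markers
            ids = [char2id['{']] + [char2id[c] for c in word] + [char2id['}']]
            ids = ids[:max_word_len]                      # truncate overly long words
            ids += [pad_id] * (max_word_len - len(ids))   # pad every word to the same length
            sent_ids.append(ids)
        # pad every sentence to the same number of words
        sent_ids += [[pad_id] * max_word_len] * (max_sent_len - len(sent_ids))
        batch.append(sent_ids)
    return batch

batch = sents_to_char_ids([["the", "children", "play"]], char2id, max_word_len=10)
# batch[0][0] -> [1, 22, 10, 7, 2, 0, 0, 0, 0, 0]  (exact values depend on char2id above)
```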

For each of these indices, we look up a specific character embedding and concatenate them together, ready to be passed through the convolutional layer.

So if the size of the largest word is ‘m’ and the character embedding dimension is ‘d’, the size of the matrix passed on to the CNN layer would be (m x d). We can call this X_emb.
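
Continuing the sketch above, this lookup can be done in PyTorch with an nn.Embedding layer (the value of d is illustrative, not the one from the paper):

```python
import torch
import torch.nn as nn

d = 30                                              # character embedding dimension (illustrative)
char_embed = nn.Embedding(len(char2id), d, padding_idx=0)

word_char_ids = torch.tensor(batch[0][0])           # character indices of one word, length m
x_emb = char_embed(word_char_ids)                   # shape (m, d) -- the X_emb matrix from the text
```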

Conv layer:

Now that we’ve got our concatenated character embedding, we pass it through 1D convolutions & max pooling to obtain an embedding for a word of a fixed dimension.

In the original paper, the authors used multiple filters with separate filter widths where each filter would capture a different representation. According to the authors, “ A filter is essentially picking out a character n-gram, where the size of the n-gram corresponds to the filter width”. For instance, if we have the word “Anarchy”, a filter width of size three might focus on “ana”, “nar” or “arc” to see if they make sense.

From the (m x d) X_emb matrix, we follow these steps:

  1. Reshape to (d x m): This is because PyTorch performs convolutions across the last dimension of the tensor, so we want the filter to slide along the columns (i.e., across characters).
  2. We choose the number of “filters”; this will ultimately be the size of the word embedding dimension that we want, so we can call this number “E”, as mentioned above in the RNN example. Another hyperparameter is the filter width “k”.
  3. With the reshaped X_emb matrix, we compute the element-wise product and sum with each of our filters of width “k” (see the equation below).
  4. For each word, we compute an (m−k+1)-long vector for each filter. As we have E filters, we end up with a final matrix of shape E x (m−k+1).
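
Following the paper’s formulation, with X_emb[:, i : i+k−1] denoting the window of k consecutive columns starting at position i:

f[i] = ⟨ X_emb[:, i : i+k−1], H ⟩_F = Σ_{a,b} X_emb[a, i+b−1] · H[a, b]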

The equation above just denotes how we compute the i-th element of our m-k+1 long vector. We take the Frobenius inner product ( which is just like the dot product for matrices, element-wise multiplication + sum ) of the window of size k with our filter H of size ( d x k ).

5. After this, we apply a non-linearity like tanh or a ReLU. In the paper, they used tanh.

6. Finally, we use max-pooling over the second dimension to arrive at a single vector of size E.

To recap, for each word we go from a matrix of (m x d) → E x (m-k+1) → E
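
A minimal sketch of this convolution-plus-pooling step in PyTorch, assuming a single filter width k (the module and argument names are my own, and a batch dimension over words is added):

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, char_embed_dim, word_embed_dim, k=5):
        super().__init__()
        # E output channels == desired word embedding size; the kernel slides over characters
        self.conv = nn.Conv1d(in_channels=char_embed_dim,
                              out_channels=word_embed_dim,
                              kernel_size=k)

    def forward(self, x_emb):
        # x_emb: (batch_of_words, m, d) concatenated character embeddings for each word
        x = x_emb.transpose(1, 2)        # (batch, d, m): convolve along the character axis
        x = torch.tanh(self.conv(x))     # (batch, E, m - k + 1)
        x, _ = x.max(dim=2)              # max-pool over positions -> (batch, E)
        return x
```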

So, finally, for each word we have a word embedding derived from its characters. In the paper, this is passed on to a “Highway Network”. The paper for this technique can be found here.

Highway Network:

Instead of passing the word embedding derived from the conv layer directly to the RNN, we use a “gate” to regulate how much of the input is fed into the RNN. This seems to help with gradient flow, and with optimization in general.

The gate, which is controlled by a sigmoid, represents “how much” of the information needs to be passed on. The projection is calculated with a non-linearity and multiplied by the gate. The rest of the input is passed through directly with a factor of (1 − x_gate).

The weights W_proj and W_gate are of shape (E x E), i.e. word_embedding x word_embedding.
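
A minimal sketch of such a highway layer in PyTorch (the ReLU on the projection path and the layer names are assumptions; the paper describes a generic non-linearity):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, word_embed_dim):
        super().__init__()
        self.proj = nn.Linear(word_embed_dim, word_embed_dim)   # W_proj, shape (E x E)
        self.gate = nn.Linear(word_embed_dim, word_embed_dim)   # W_gate, shape (E x E)

    def forward(self, x):
        # x: (batch, E) word embedding coming out of the CharCNN
        x_proj = torch.relu(self.proj(x))      # transformed path
        x_gate = torch.sigmoid(self.gate(x))   # how much of the transformed path to let through
        return x_gate * x_proj + (1.0 - x_gate) * x
```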

Final Layer and predictions

The output from the highway layer is passed on to the RNN (an LSTM in this case), and the model is trained with the true label of the next word.
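
Putting the pieces together, a rough sketch of the full forward pass (reusing the CharCNN and Highway modules sketched above; the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class CharAwareLM(nn.Module):
    def __init__(self, char_vocab_size, word_vocab_size,
                 char_embed_dim=30, word_embed_dim=300, hidden_dim=300):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab_size, char_embed_dim, padding_idx=0)
        self.char_cnn = CharCNN(char_embed_dim, word_embed_dim)
        self.highway = Highway(word_embed_dim)
        self.lstm = nn.LSTM(word_embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, word_vocab_size)   # scores over the word vocabulary

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_len) character indices for each word
        b, s, m = char_ids.shape
        x = self.char_embed(char_ids.view(b * s, m))   # (b*s, m, d)
        x = self.highway(self.char_cnn(x))             # (b*s, E) word embeddings from characters
        x = x.view(b, s, -1)                           # back to (batch, seq_len, E)
        out, _ = self.lstm(x)                          # (batch, seq_len, hidden_dim)
        return self.proj(out)                          # logits for the next word at each position
```

The logits at each position are then compared against the index of the true next word, e.g. with nn.CrossEntropyLoss.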

According to the paper, this model was able to achieve better performance than a word-level model while using fewer parameters. Here PPL refers to ‘perplexity’, a metric used to evaluate language models (the lower, the better).
