Preparing Text for RNN Language Modeling in Python

James Moody
3 min read · May 25, 2020

Recurrent Neural Networks (RNNs) have an uncanny ability to model natural language. The basic idea is to train an RNN to learn the probability distribution over possible next words of a sentence, given the start of the sentence. After training, the RNN can take in a partial sentence like “The dog looked for the” and give the conditional probability that the sixth word is “bone” versus “hydrant” versus “cat” (and so on), given that the first five words are “the”, “dog”, “looked”, “for”, and “the”. The RNN learns these conditional probabilities from a corpus of example sentences.
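To make that interface concrete, here is a minimal Keras sketch of the kind of query we have in mind. Everything here is an illustrative assumption: the toy vocabulary, the names vocab and word_to_index, and the tiny architecture. The model is untrained, so the printed probabilities are meaningless until you fit it on a corpus; the point is only the shape of the question and the answer.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Toy vocabulary (an illustrative assumption, not a real corpus).
vocab = ["<pad>", "the", "dog", "looked", "for", "bone", "hydrant", "cat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# A tiny RNN language model: embed each word, run an LSTM over the
# sequence, and emit a softmax distribution over the whole vocabulary.
model = Sequential([
    Embedding(input_dim=len(vocab), output_dim=16),
    LSTM(32),
    Dense(len(vocab), activation="softmax"),
])

# Encode the partial sentence "the dog looked for the" as word indices.
prefix = np.array([[word_to_index[w] for w in
                    ["the", "dog", "looked", "for", "the"]]])

# One probability per vocabulary word: P(sixth word | first five words).
probs = model.predict(prefix, verbose=0)[0]
for word in ["bone", "hydrant", "cat"]:
    print(word, probs[word_to_index[word]])
```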

The difficulty is that RNNs expect their inputs to be sequences of vectors. So before we can train anything, we need to convert each sentence into a sequence of vectors.
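What does a “sequence of vectors” look like? One simple (if memory-hungry) representation is one-hot encoding, where each token becomes a vector with a single 1 at that word’s index. The sketch below assumes a toy five-word vocabulary; in practice you would more likely feed word indices into an embedding layer, as in the snippet above.

```python
import numpy as np

# Toy vocabulary: in a one-hot scheme, each token maps to a vector
# with a 1 at that word's index and 0s everywhere else.
vocab = ["the", "dog", "looked", "for", "bone"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def to_vectors(tokens):
    vectors = np.zeros((len(tokens), len(vocab)))
    for position, token in enumerate(tokens):
        vectors[position, word_to_index[token]] = 1.0
    return vectors

print(to_vectors(["the", "dog"]).shape)  # (2, 5): two tokens, 5-dim vectors
```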

Step 1 (Tokenization). We’ll assume you’ve already obtained a corpus split into sentences. The first step is to clean and tokenize those sentences: strip out unwanted symbols and turn each sentence into a list of “tokens”, which can represent words or punctuation marks (periods, commas, etc.). There are a number of ways to accomplish this; one of them is sketched below.
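Here is a minimal sketch using NLTK’s word_tokenize. The helper name clean_and_tokenize and the cleaning regex are my assumptions, and the regex is just one reasonable definition of “unwanted symbols” (it keeps letters, digits, apostrophes, and basic sentence punctuation); adjust it to your corpus.

```python
import re
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, first run only

def clean_and_tokenize(sentence):
    # Strip symbols we don't want to model, keeping letters, digits,
    # whitespace, apostrophes, and basic sentence punctuation.
    cleaned = re.sub(r"[^A-Za-z0-9\s.,!?']", "", sentence)
    # Lowercase so "The" and "the" map to the same token.
    return nltk.word_tokenize(cleaned.lower())

print(clean_and_tokenize("The dog looked for the bone!"))
# ['the', 'dog', 'looked', 'for', 'the', 'bone', '!']
```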
