Getting Text into Tensorflow with the Dataset API
Deep learning has made NLP easier by providing us with algorithms that can operate on arbitrary sequences. While the algorithms are crystal clear and many implementations are widely available, getting your data into them is often opaque, tedious and frustrating. Often, it’s the part of the job that makes me feel like this:
This post will discuss consuming text in Tensorflow with the Dataset API, which makes things almost easy. To illustrate the ideas in this post, I’ve uploaded a repo with an implementation of the end-to-end process described here. It contains a model that reads a verse from the Bible character by character and predicts which book it came from (e.g. “in the beginning…” came from the book of Genesis). The model itself is not the point; rather, I hope the repo serves as a living example of how to use the Dataset API to work with textual data.
How to read this
I learnt a lot from technical blog posts and from reading GitHub, but this post isn’t like that. In my experience, reading a technical post is like training a neural network: it doesn’t generalize to other problems without blood and tears.
Instead, I want to convey the ideas and principles, the whats and whys, of consuming data. I believe that by understanding the principles behind code, we can manipulate it easily after copy-pasting it from the internet.
I suggest reading this post first and only then looking at the GitHub repo.
Motivation (or why I wrote 2000 words about an API)
At LightTag, the company of which I am a founder, we help our clients label their text data for downstream NLP. A part of our product learns from annotators as they are annotating and provides suggestions to help them work faster. Thus a significant part of our pipeline is taking new datasets and putting them in a form that Tensorflow can consume.
Apart from ourselves, everyone who works with text has to go through this. Looking at GitHub, every implementation of an interesting NLP paper also builds the scaffolding for consuming text, each in its own way. Consuming data should be a simple engineering task, not a unique snowflake.
The problems we have consuming text
When dealing with text in Tensorflow, a common pattern emerges:
- Text is stored in some semi structured file
- A process is run that converts it to numpy arrays
- Those arrays are fed into a model with Tensorflow’s feed_dict.
There are a few points with this method that leave room for improvement:
- Batching overhead
Numpy arrays are arrays which means that all “sequences” in an array need to be of the same length. This in turn requires us to pad each example to the length of the longest sequence in our dataset.
While not a showstopper, this is an inconvenience, as it requires more memory and work during preprocessing as well as more disk space for storage.
- Complex structures are hard.
For anything outside of a language model we need to feed our model both sources and targets. Sources are our input text; targets can be a sequence in another language (translation), the sentiment of a tweet (sentiment), the part of speech of each word (sequence tagging) and so on. Oftentimes we want to get fancy and train our network with a few targets.
An approach I’ve seen and used is to store each “item” (source, target1, target2…) as a separate numpy array and feed them separately into the model. It works, but can become tricky when we begin shuffling data or introduce curriculum learning.
- We end up implementing concepts of Dataset and Iterator.
A dataset is a collection of data and an iterator is some way to iterate over it. We always need these abstractions and end up using them when we train our models, batch our data or separate training and validation data and the processing steps applied to them.
Tensorflow (now) provides abstractions for these concepts out of the box, in the form of the Dataset API, and thus it makes little sense to build our own.
Getting text into any deep learning framework consists of the following steps, henceforth called the process:
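To put a number on the batching overhead described above, here is a minimal sketch in plain Python. The toy sequences, the helper names and the batch size are all made up for illustration; it only counts how many cells we store when padding to the longest sequence in the dataset versus the longest sequence in each batch.

```python
# Toy token-id sequences of very different lengths (hypothetical data).
sequences = [[1, 2], [3], [4, 5, 6], [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]]


def pad(batch, length, pad_id=0):
    """Right-pad every sequence in `batch` with `pad_id` up to `length`."""
    return [seq + [pad_id] * (length - len(seq)) for seq in batch]


def padded_cells_global(seqs):
    """Cells stored when everything is padded to the dataset-wide maximum."""
    longest = max(len(s) for s in seqs)
    return len(seqs) * longest


def padded_cells_per_batch(seqs, batch_size):
    """Cells stored when each batch is padded only to its own longest sequence."""
    total = 0
    for start in range(0, len(seqs), batch_size):
        batch = seqs[start:start + batch_size]
        total += len(batch) * max(len(s) for s in batch)
    return total


real_tokens = sum(len(s) for s in sequences)         # 16 real tokens
global_cells = padded_cells_global(sequences)        # 4 * 10 = 40 cells
batch_cells = padded_cells_per_batch(sequences, 2)   # 2 * 2 + 2 * 10 = 24 cells
```

Even on four toy sequences, more than half of the globally padded array is padding. One very long example is enough to inflate every other row.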
- Deciding what your tokens are (e.g. words/characters/phrases)
The first step is deciding what our tokens are. A token is the minimal unit of processing. It can be a word, a character or something else.
- Mapping tokens to embeddings (building a vocabulary)
Deep learning models operate on “vectors”. Words are not vectors. Tokens are not vectors. Only vectors are vectors. To get our tokens into our model we need to map each one to a vector.
But passing around 300-dimensional vectors for our tokens would be very cumbersome, and so the usual paradigm is to assign each token an ID and use the framework’s lookup function to fetch the corresponding vector.
This requires us to make an upfront commitment: what are the tokens/ids/vectors that our model knows about? That commitment to a predefined list of tokens is our vocabulary, and usually we’ll want to keep maps from tokens to IDs as well as from IDs to tokens.
- Mapping Sequences to embeddings
Having decided on a vocabulary and made the mapping from tokens to ids, our next step is to convert entire sequences to their corresponding lists of ids. We need to do this for every training example in our dataset.
- Storing sequences and targets
Having converted our text to lists of IDs, we still need to store it in a way that
1) Our framework can read and
2) Maintains the relationship between a source and target (for example, keeps an English sentence and its German translation together). As I mentioned when describing the common pattern above, this is often done by converting everything to numpy arrays.
- Consuming the data
This really breaks down into two parts, reading (deserializing) the data that we’ve stored and then getting it into the model.
One way to do this is to read persisted numpy arrays and feed them in using Tensorflow’s feed_dict. The other way, which we’ll discuss, is using the Dataset API.
Steps 1–3 in the process are fairly independent of the deep learning framework you work with. And while you always have to do steps 4 and 5, each framework handles them differently. Since this post is about Tensorflow, let’s introduce the Dataset API and its companion, the TFRecord format.
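Steps 1 to 3 of the list above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `<pad>` and `<unk>` entries, the 4-dimensional random vectors and the function names are mine, and the `lookup` function merely imitates what a framework’s embedding lookup does.

```python
import random

# Steps 1 and 2: characters as tokens, and a vocabulary decided up front.
tokenize = list
vocab = ["<pad>", "<unk>"] + sorted(set("in the beginning"))
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

# One small random vector per id stands in for a learned embedding matrix.
EMBEDDING_DIM = 4
rng = random.Random(0)
embeddings = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in vocab]


def encode(text):
    """Step 3: map a raw string to a list of ids, using <unk> for unseen tokens."""
    unk = token_to_id["<unk>"]
    return [token_to_id.get(tok, unk) for tok in tokenize(text)]


def lookup(ids):
    """Imitates an embedding lookup: fetch one vector per id."""
    return [embeddings[i] for i in ids]


ids = encode("begin!")   # "!" is not in the vocabulary, so it maps to <unk>
vectors = lookup(ids)    # one EMBEDDING_DIM-sized vector per character
```

Only the short list of ids needs to be stored and passed around; the vectors are fetched at the last moment inside the model.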
The Dataset API
As I mentioned, we always need some representation of our data in our code, as well as a way to iterate through it. Oftentimes we’d also like to manipulate the data or be clever about how we iterate through it (for example, sort by length for the first three epochs and then shuffle). These are the mechanisms that the Dataset API provides, i.e. a way to represent a Dataset, consume it and manipulate it in a way that is outside of our model but (almost always) internal to the computation graph.
There are a few ways of getting actual data into a Dataset, one of them being numpy arrays. But if we already have our data in numpy arrays we haven’t solved most of the problems we described at the outset.
The alternative that we’ll cover is serializing our data as TFRecords. This corresponds to step 4 in the process. TFRecord is a binary serialization format built on protocol buffers; think of it as a glorified JSON that Tensorflow can read. The best source on the internet for using TFRecords (and the one I follow) is this blog post by Denny Britz and the associated notebook. I’ll add one note to his great explanation:
The TFRecord format defines a concept of an Example, which is basically all the data we need in order to perform one step of training/inference. A nice extension to the Example is the SequenceExample (docs are worthless, read Denny’s post), which, for us NLP people, sounds like exactly what we need.
The SequenceExample contains feature_lists, a dictionary of one or more sequences (source and target, for instance, if you are doing sequence tagging), and additionally context features that describe the example as a whole (such as the length of the sequence or its sentiment).
This abstraction solves one of the problems we mentioned in the beginning: how do I hold all the information needed to do a step of training in one place?
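As a rough sketch of what that looks like in code, the snippet below packs one verse into a SequenceExample using the protocol buffer classes under `tf.train`. The feature names `tokens`, `length` and `book_id` and the helper name are assumptions for illustration, not necessarily what the repo uses.

```python
import tensorflow as tf


def make_sequence_example(token_ids, book_id):
    """Pack one verse (ids, length, label) into a single SequenceExample."""
    example = tf.train.SequenceExample()
    # Context features: one value for the whole example.
    example.context.feature["length"].int64_list.value.append(len(token_ids))
    example.context.feature["book_id"].int64_list.value.append(book_id)
    # Feature lists: one Feature per time step of the sequence.
    tokens = example.feature_lists.feature_list["tokens"]
    for token_id in token_ids:
        tokens.feature.add().int64_list.value.append(token_id)
    return example


example = make_sequence_example([5, 3, 8], book_id=0)
serialized = example.SerializeToString()  # bytes, ready to go into a TFRecord file
```

The source sequence and everything attached to it now travel together as one unit, which is what makes shuffling safe later on.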
A short rant on TFRecords
Another good resource on TFRecords is the YouTube video that follows (don’t watch it yet). Aside from the information in it, the speaker’s hateful rants about the coherence of the TFRecord structure and its documentation are funny and true. There is some pain involved in getting them to work, with subtle gotchas.
I think that for people working on image and video, where performance and especially memory and disk IO were a big issue, TFRecords were more compelling from the start. Also, those vision people like big models and have big data and so, I think, tend to go for distributed training more often (which requires a tighter integration of the data reading and the computation graph).
In NLP our datasets are comparatively small and it’s easy to fit even a “big” NLP dataset in memory. Also, most models I see in industry aren’t ridiculously large, and if they are it is usually more for fun than for profit, so distributed training isn’t as much of a pressing matter.
All that is to say that there wasn’t much motivation to use TFRecords in NLP, since the problems they solved weren’t problems we have.
But if there is one takeaway from the post it should be: the functionality of the Dataset API is useful enough to be worth the hassle of TFRecords.
To close this section, here’s that video. Watch the whole thing, but I’ve pointed you to a funny point of frustration.
A practical example
As promised in the beginning, we’ll train a model to predict which book of the Bible a verse came from. The model we’ll train is a standard GRU, which will run over the characters in a verse. We’ll take the final GRU state and try to predict which book of the Bible the verse came from. It’s a terrible model, and to make it a little less terrible we’ll also train a language model (predict the next character), inspired by this (excellent) paper.
But honestly, the model is beside the point. The point is to get data into it. So what we need to get into the model is:
- A sequence of token ids (in this case each character is a token)
- A number representing which book of the bible the verse/sequence came from
- The length of the sequence (we can actually calculate it ad hoc, but this is more convenient for illustration)
Also, we need to be able to batch a few verses together to make training efficient and that means we need to pad all the verses in the same batch to be the same length.
Getting the data
The data we are using is the King James bible from Project Gutenberg. It’s included in the repo. The first step we need to take is separating one giant text file into verses and marking which book each verse came from. We do that in PrepareBibleExamples.ipynb.
Most of that notebook is some Regex-fu, which is always fun but not in our scope. The last part calls the class BibPreppy, which is defined here. BibPreppy prepares the Bible, hence the name. It executes exactly steps 1–4 in the process.
Steps 1–3 in the process look like this in BibPreppy:
- Deciding what your tokens are
We pass BibPreppy the python function list as a tokenizer. This has the effect of splitting a string into a list of characters. That’s what we want since we are working at the character level.
- Building a vocabulary
We get a little clever here and use python’s defaultdict. This allows us to go over the data in one pass, and every time we see a new character we assign it an as yet unused id.
- Mapping Sequences to embeddings
Since we got clever, this happens concurrently with step 2. This step occurs in the method sentance_to_id_list which takes a raw string, tokenizes it, converts each token to an id and adds new ids if needed.
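The defaultdict trick described above can be sketched roughly as follows. The class and attribute names here are hypothetical and simplified; only the method name sentance_to_id_list is borrowed from the description, so treat this as an illustration of the idea rather than the code in the repo.

```python
from collections import defaultdict


class Vocabulary:
    """Grows a token -> id map on the fly while encoding text."""

    def __init__(self, tokenizer=list):
        self.tokenizer = tokenizer
        # A missing token triggers _next_id, which hands out the next unused id.
        self.token_to_id = defaultdict(self._next_id)

    def _next_id(self):
        # Called before the new key is inserted, so the current size of the
        # map is exactly the first id that has not been used yet.
        return len(self.token_to_id)

    def sentance_to_id_list(self, sentence):
        """Tokenize a raw string and map each token to its id, adding new ids as needed."""
        return [self.token_to_id[token] for token in self.tokenizer(sentence)]

    def id_to_token(self):
        """Reverse map, handy for turning predictions back into text."""
        return {i: token for token, i in self.token_to_id.items()}


vocab = Vocabulary()
first = vocab.sentance_to_id_list("abba")   # [0, 1, 1, 0]
second = vocab.sentance_to_id_list("cab")   # [2, 0, 1]
```

One thing to keep in mind with this design: the map keeps growing whenever it sees a new token, so at inference time you would want to freeze it or fall back to an unknown-token id.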
After going through steps 1–3, we need to store our examples to disk. As discussed, we’ll use the TFRecords format, and we need a way to convert our example (the one we made in the PrepareBibleExamples.ipynb notebook) into TFRecords.
That’s exactly what the method sequence_to_tf_examples does. It uses that abstraction of a SequenceExample we spoke about to store all of the data we need (The sequence, its length, and the book it came from) in one single unit.
The method parse goes in the opposite direction. It knows how to read a TFRecord and convert it into the only thing Tensorflow can really work with, namely a tensor. In fact, it does something a little better, it converts it into a dictionary of Tensors.
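A parse function along these lines might look like the sketch below. It is written against TensorFlow 2.x names (`tf.io.parse_single_sequence_example`), which are newer than what this post was originally written with, and the feature names `tokens`, `length` and `book_id` are assumptions for illustration.

```python
import tensorflow as tf


def parse(serialized):
    """Turn one serialized SequenceExample into a dictionary of Tensors."""
    context_spec = {
        "length": tf.io.FixedLenFeature([], dtype=tf.int64),
        "book_id": tf.io.FixedLenFeature([], dtype=tf.int64),
    }
    sequence_spec = {
        "tokens": tf.io.FixedLenSequenceFeature([], dtype=tf.int64),
    }
    context, sequence = tf.io.parse_single_sequence_example(
        serialized,
        context_features=context_spec,
        sequence_features=sequence_spec,
    )
    return {
        "tokens": sequence["tokens"],
        "length": context["length"],
        "book_id": context["book_id"],
    }


# Build one small example by hand so that the sketch runs on its own.
example = tf.train.SequenceExample()
example.context.feature["length"].int64_list.value.append(3)
example.context.feature["book_id"].int64_list.value.append(7)
for token_id in [5, 3, 8]:
    example.feature_lists.feature_list["tokens"].feature.add().int64_list.value.append(token_id)

parsed = parse(example.SerializeToString())
```

The spec dictionaries have to mirror the names and types used when writing the records. A mismatch there is one of the subtle gotchas mentioned earlier.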
Side Rant: A Dictionary of Tensors, WTF?!!
Sometimes I’m surprised that we can work with dictionaries of Tensors, since a Tensor is a Tensorflow primitive but a python dictionary has no place in the computational graph. This confuses me sometimes. It’s important to remember that when we are working with Tensorflow in python, we are dealing with abstract symbols that go into the graph, and not the computation graph itself. That’s why parse, a python function, can return dictionaries of Tensors.
Using The Data — The Dataset API in action
Inside of prepare_dataset.py you’ll see this code, which shows the Dataset API in most of its glory.
The function make_dataset opens a TFRecord at path, parses it with BibPreppy’s parse method and then…
The bad magic
There is a little bit of bad magic in there: the calls to expand and deflate. They are there because that was the only way I could get padded_batch to work with scalar values, namely length, the length of the sequence, and book_id, the ID of the book the example came from (our target).
The Good Magic
The Good Magic is the call to that function, padded_batch. Not only does it pad our Tensors, but it pads each sequence dynamically, to the length of the longest example in the batch. And it does this for each Tensor.
Even before that, there is a call to shuffle, which shuffles the data. Between these two pieces of magic we’ve solved the remaining two problems we had in the beginning:
- How do I avoid padding my entire dataset to the length of the longest example?
- How do I easily shuffle my data, keeping source and all targets together?
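To make the shuffle and padded_batch steps concrete, here is a small self-contained sketch of such a pipeline. It is not the code from prepare_dataset.py: it assumes TensorFlow 2.x, uses made-up feature names and toy data, and leaves out expand and deflate because, as far as I can tell, recent versions of padded_batch accept scalar components when given an empty padded shape.

```python
import os
import tempfile

import tensorflow as tf


def to_example(token_ids, book_id):
    """Serialize one toy verse as a SequenceExample (feature names are assumptions)."""
    example = tf.train.SequenceExample()
    example.context.feature["length"].int64_list.value.append(len(token_ids))
    example.context.feature["book_id"].int64_list.value.append(book_id)
    tokens = example.feature_lists.feature_list["tokens"]
    for token_id in token_ids:
        tokens.feature.add().int64_list.value.append(token_id)
    return example.SerializeToString()


def parse(serialized):
    """Deserialize one record into a dictionary of Tensors."""
    context, sequence = tf.io.parse_single_sequence_example(
        serialized,
        context_features={
            "length": tf.io.FixedLenFeature([], dtype=tf.int64),
            "book_id": tf.io.FixedLenFeature([], dtype=tf.int64),
        },
        sequence_features={
            "tokens": tf.io.FixedLenSequenceFeature([], dtype=tf.int64),
        },
    )
    return {"tokens": sequence["tokens"], "length": context["length"], "book_id": context["book_id"]}


def make_dataset(path, batch_size=2):
    """Read, parse, shuffle and batch, padding each batch to its own longest sequence."""
    dataset = tf.data.TFRecordDataset([path])
    dataset = dataset.map(parse)
    dataset = dataset.shuffle(buffer_size=1000)
    # [None] means "pad this dimension to the longest in the batch"; [] is a scalar.
    dataset = dataset.padded_batch(
        batch_size,
        padded_shapes={"tokens": [None], "length": [], "book_id": []},
    )
    return dataset


# Write four toy verses of different lengths to a temporary TFRecord file.
path = os.path.join(tempfile.mkdtemp(), "verses.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for ids, book in [([1, 2, 3, 4, 5], 0), ([6, 7], 1), ([8], 2), ([9, 10, 11], 3)]:
        writer.write(to_example(ids, book))

batches = list(make_dataset(path))
```

Because of the shuffle, which verses end up together changes from run to run, but each batch’s tokens tensor is only as wide as the longest verse in that batch, and each row’s length and book_id stay attached to it.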
Bonus — Train and Val iteration
The dataset makes one more thing amazingly convenient, if not downright magical. That thing is doing an epoch of training and then a validation run, possibly with some logic in between.
With feed_dicts this wasn’t too hard, but it tended to be tightly coupled to the representation you chose for the dataset and its iterator. In the days of TFRecords without the dataset API, I think this was impossible, because you ended up hardcoding a certain dataset into the graph. So let’s see how the dataset API makes this easier.
I chopped out some stuff from the code here to make it more legible and stand alone. In the repo I have some logic that reduces the learning rate whenever the validation loss increases from the previous epoch. I always found it annoying to implement that functionality and I found the dataset api to be a convenient abstraction for it.
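The “logic in between” can be sketched without any Tensorflow at all. In the snippet below, train_epoch and validate are hypothetical stand-ins for one pass over the training dataset and one pass over the validation dataset, and halving the learning rate is an arbitrary choice for illustration, not necessarily what the repo does.

```python
def train_with_validation(train_epoch, validate, num_epochs, learning_rate, decay=0.5):
    """Alternate training and validation; shrink the learning rate when validation loss rises."""
    previous_loss = float("inf")
    history = []
    for _ in range(num_epochs):
        train_epoch(learning_rate)   # one full pass over the training dataset
        val_loss = validate()        # one full pass over the validation dataset
        if val_loss > previous_loss:
            learning_rate *= decay   # validation got worse, so take smaller steps
        previous_loss = val_loss
        history.append((val_loss, learning_rate))
    return history


# Stand-ins so the sketch runs without a model: canned validation losses.
fake_losses = iter([1.0, 0.8, 0.9, 0.7])
history = train_with_validation(
    train_epoch=lambda lr: None,
    validate=lambda: next(fake_losses),
    num_epochs=4,
    learning_rate=0.1,
)
```

With the canned losses above, the learning rate is halved exactly once, after the third epoch, when the validation loss goes up from 0.8 to 0.9.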
If you got this far I’m flattered. :-) Here’s what you learned:
- There are a few shortcomings to using numpy arrays for working with text
- The dataset API helps us solve them
- But you need to use TFRecords, which is annoying
- But the dataset API is so good that it is worth it
- And then a few examples to see how to use TFRecords and how to leverage the dataset API
Now that you know all that, see it in action in the repo.
I hope this has helped you. And if you need to label your text data before putting it into Tensorflow, we at LightTag would be happy to help you manage and execute your annotation projects. And if you have questions, comments or suggestions, tweet me at @thetalperry.