Are you eager?

Handling text data in tf.eager

Vineet Kumar
2 min read · Aug 25, 2018

In this post, we will learn how to work with text data in tf.eager. Our goal is very simple: Read a text file and convert each word to an integer.

Users of Graph/Session (non-eager) TensorFlow should appreciate how drastically simpler playing with the data becomes. You can play with the data right from the word go! No data iterators, no table initializers are required.

Since we are dealing with TensorFlow, we will obtain a Tensor of integers. Further, as any machine learning or deep learning model expects, we will also see how to create a (mini)batch of, say, 32 examples. A detailed notebook along with setup instructions and a sample text file is provided, so you can get started quickly.

TextLineDataset: Get string tensors from file

Okay, so let us get started by first enabling eager execution.
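A minimal sketch, assuming TensorFlow 1.x (the version this post was written against); eager must be enabled once, at program startup:

import tensorflow as tf

# Enable eager execution. This must be called at program startup,
# before any other TensorFlow operations run.
tf.enable_eager_execution()

print(tf.executing_eagerly())  # True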

Now, let us play with TextLineDataset, which converts each line (sentence) of a file to a string tensor. You can feed it the sample sentences_file from here, or use your favorite text dataset.
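Here is a sketch of this step; the file name sentences.txt stands in for the sample sentences_file:

# Each line of the file becomes one string Tensor in the dataset.
dataset = tf.data.TextLineDataset("sentences.txt")

# With eager enabled, we can loop over the dataset directly --
# no iterator plumbing needed.
for sentence in dataset:
    print(sentence)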

What makes tf.eager easy to use is that you can directly iterate over the dataset. Here, sentence is a Tensor of type string.

Vocab Tables: Convert String Tensor to word indexes

Next, we convert a sentence to a list of word indexes. We first split the string Tensor into a list of words using tf.string_split. Then, we convert each word to its index. Notice here the use of map, which applies an operation to each member of the dataset.
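A sketch of this step, assuming the vocabulary lives in a file vocab.txt (described below) and using tf.contrib.lookup for the word-to-index table:

# Build a lookup table from the vocabulary file; words not in the
# file map to index 0 (UNK).
vocab_table = tf.contrib.lookup.index_table_from_file(
    vocabulary_file="vocab.txt", default_value=0)

def sentence_to_ids(sentence):
    # Split the string Tensor into words, then look up each word's index.
    words = tf.string_split([sentence]).values
    return vocab_table.lookup(words)

# map applies the conversion to every sentence in the dataset.
word_ids = dataset.map(sentence_to_ids)

for ids in word_ids.take(2):
    print(ids)  # a 1-D int64 Tensor of word indexes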

The example above uses a file vocab.txt, which is simply a list of the words in your corpus. See a sample here. The first word is UNK, which means any word not in the list is assigned index 0.

Batching: Create batches of sentences

We have dealt with only a single sentence till now. What if we want to create batches of sentences, i.e. pack, say, 32 sentences in one go? A key challenge here is that not all sentences have the same length, so we cannot batch them naively. Let us try the following code:
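A sketch of the naive attempt, continuing from the word_ids dataset above:

# Try to pack 32 variable-length sentences into one dense tensor.
batched = word_ids.batch(32)

for batch in batched:
    print(batch)  # fails: cannot batch tensors with different shapes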

A straightforward batch will fail for text, as different sentences have different numbers of words!

All right, so how do we create batches with text? Well, the solution is surprisingly simple. We size each batch to its longest sentence and pad the shorter sentences with, say, 0. Enter padded_batch:
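A sketch, again continuing from word_ids:

# padded_shapes=[None] means "1-D tensors of unknown length";
# shorter sentences are padded (with 0 by default) up to the
# longest sentence in the batch.
padded = word_ids.padded_batch(32, padded_shapes=[None])

for batch in padded:
    print(batch.shape)  # (32, longest_sentence_in_this_batch)
    break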

That is it! We only need to specify the shape of our structure. We don’t know the length of our list, and thus specify [None]. Also, there is another example in the notebook which shows how to pad with a value other than 0. Happy learning!
