A tutorial on the bucket_by_sequence_length API for efficiently batching NLP data during training.

Prateek Bhatt · Published in Analytics Vidhya · Jun 23, 2020 · 6 min read

I was first introduced to the `bucket_by_sequence_length` API at the TensorFlow Dev Summit 2017 (around the 7:00 minute mark of the talk).
The API originally lived in the package `tf.contrib.training` and has since been moved to `tf.data.experimental`.
It has undergone a few changes along the way; we will discuss the latest version in this tutorial.
This tutorial explains why the API is needed and then demonstrates a working example.

So let us begin.

Consider the task of sarcasm detection.
The data looks something like this:

{
    "is_sarcastic": 1,
    "headline": "thirtysomething scientists unveil doomsday clock of hair loss",
    "article_link": "https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205"
}

It is a dictionary object, where `is_sarcastic` is our target and `headline` is our feature.

In the complete dataset, the headlines have different lengths. To train a model to detect sarcasm, we have to bring the headlines to a common length, because training happens in batches and every example in a batch must have the same shape.

One way to bring all the headlines to the same length is padding: pad every headline to the maximum headline length found in the dataset, using a `<pad>` token.
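For illustration, here is a minimal sketch of this whole-dataset padding in plain Python (the `headlines` list below is a made-up example, not the real data):

# A minimal sketch of whole-dataset padding with a '<pad>' token.
# `headlines` is a hypothetical list of headlines already split into words.
headlines = [["thirtysomething", "scientists", "unveil", "doomsday", "clock"],
             ["short", "headline"]]
max_len = max(len(h) for h in headlines)                          # longest headline in the dataset
padded = [h + ['<pad>'] * (max_len - len(h)) for h in headlines]  # pad every headline to max_len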

This works to some degree, but it is not memory-efficient. Let us do some analysis of the dataset.

First, download the v2 data from Kaggle to your local folder, then load the JSON data using the following.

import pandas as pd

df = pd.read_json("./data/Sarcasm_Headlines_Dataset_v2.json", lines=True)

Let us get the maximum length of the headline data.

print('maximum length of headline data is', df.headline.str.split(' ').map(len).max())
We receive the result `maximum length of headline data is 151`.

Let’s get the minimum length of the headline data.

print('minimum length of headline data is', df.headline.str.split(' ').map(len).min())
We receive the result `minimum length of headline data is 2`.

Now let us also get the mean of the headline lengths.

print('mean of the lengths of the headline data is', df.headline.str.split(' ').map(len).mean())
The result is
`mean of the lengths of the headline data is 10.051853663650023`

From the above, it is clear that if we pad each headline to length 151, we will waste a lot of memory, and for some headlines there will be far more `<pad>` tokens than actual words.
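To make the waste concrete, here is a quick back-of-the-envelope check (my own addition, reusing the `df` loaded above); it estimates what fraction of a fully padded dataset would consist of `<pad>` tokens:

# Rough estimate of how much of a fully padded dataset would be padding.
lengths = df.headline.str.split(' ').map(len)
padded_total = lengths.max() * len(lengths)   # every headline padded to the maximum length
actual_total = lengths.sum()                  # actual number of words in the dataset
print('fraction of padding tokens:', 1 - actual_total / padded_total)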

This brings us to the `bucket_by_sequence_length` API. This method is more efficient: it still pads the text data to a common length, but per batch rather than across the complete dataset (i.e. to length 151). Every headline within a batch has the same length, while different batches have different lengths, depending on the longest headline in each batch.

I tried to implement this API myself but struggled to find proper documentation and examples. So once I figured it out, I thought it would be better if more people knew about it and could use it in their work.

If you read the API documentation, it says that it returns a transformation function that can be passed to `tf.data.Dataset.apply`.
From the documentation: `A Dataset transformation function, which can be passed to tf.data.Dataset.apply.`

This means that, first of all, we will have to convert our dataframe to a tf.data.Dataset. TensorFlow recommends using the tf.data.Dataset API because it is optimized for input pipelines. One can apply several transformations to a Dataset, but more on that some other time.

But before heading there, let's first transform the texts into integers. We will use the TensorFlow (Keras) tokenizer for this. We will set the vocabulary size, the embedding dimension, and the out-of-vocabulary token. We will also set the batch size, though later you will see that dynamic batching is possible as well.

vocab_size = 1000
embedding_dim = 16
oov_tok = "<OOV>"
batch_size = 64

After setting the above parameters, let’s transform the texts into integers.

# Import the Keras tokenizer.
from tensorflow.keras.preprocessing.text import Tokenizer

# Creating an instance of the tokenizer.
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok, lower=True)
# Creates and updates the internal vocabulary based on the text.
tokenizer.fit_on_texts(df.headline)
# Add the padding token.
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
# Transforms the sentences to integers.
sentences_int = tokenizer.texts_to_sequences(df.headline)

Let us also get the labels as an array.

labels = df.is_sarcastic.values

Let us now create the tf.data.Dataset, which is recommended for building input pipelines.


import tensorflow as tf

# Using a generator for creating the dataset.
def generator():
    for i in range(0, len(sentences_int)):
        # Yields the x's and y's for the dataset.
        yield sentences_int[i], [labels[i]]

# Calling from_generator to create the dataset.
# The output types and output shapes are very important to initialize here.
# The output types are tf.int64, as our x's are integers and the labels are integers as well.
# The tensor shape for x is tf.TensorShape([None]), as the sentences can be of varied length.
# The tensor shape for y is tf.TensorShape([1]), as it holds only the label, which is either 0 or 1.
dataset = tf.data.Dataset.from_generator(generator, (tf.int64, tf.int64),
                                         (tf.TensorShape([None]), tf.TensorShape([1])))
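As a quick sanity check (optional, and assuming eager execution), you can pull one element from the dataset and look at its shapes:

# Each element is an unbatched (sentence, label) pair; the sentence length varies per element.
sample_x, sample_y = next(iter(dataset))
print(sample_x.shape, sample_y.shape)  # e.g. (headline_length,) and (1,)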

Our dataset is ready. Let us now use the bucket_by_sequence_length API to generate batches and to pad our sentences according to the upper boundary of the bucket they fall into. First, let us create the upper length boundaries of the different buckets. We can create as many buckets as we like; I suggest analyzing the dataset first to understand which buckets you might need, for example with the small sketch below.
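For instance, a sketch like this (my own suggestion, not part of the original code) prints a few quantiles of the word-length distribution, which can help in choosing sensible bucket edges:

# Look at the distribution of headline lengths (in words) to get a feel for useful bucket edges.
word_lengths = df.headline.str.split(' ').map(len)
print(word_lengths.quantile([0.5, 0.75, 0.9, 0.99]))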

# These are the upper length boundaries for the buckets.
# Based on these boundaries, the sentences will be shifted to different buckets.
boundaries = [df.headline.map(len).max() - 850, df.headline.map(len).max() - 700, df.headline.map(len).max() - 500,
              df.headline.map(len).max() - 300, df.headline.map(len).max() - 100, df.headline.map(len).max() - 50,
              df.headline.map(len).max()]

We also have to provide batch sizes for the different buckets. The `batch_sizes` list must have length `len(bucket_boundaries) + 1`.

batch_sizes = [batch_size] * (len(boundaries) + 1)

The bucket_by_sequence_length API also needs a function that determines the length of a sentence.
Once the length of a sentence is known to the API, it can put the sentence in the proper bucket. Ideally, you would use different batch sizes depending on which buckets contain more or fewer sentences, but here I have kept the batch size constant for all the buckets; a sketch of what per-bucket sizes could look like follows.
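For completeness, a per-bucket configuration could look like the sketch below; the sizes are arbitrary placeholders of mine, not values from the original code:

# Hypothetical per-bucket batch sizes: larger batches for buckets with shorter sentences,
# smaller batches for the long tail. The list must have len(boundaries) + 1 entries.
batch_sizes_variable = [128, 128, 64, 64, 32, 32, 16, 16]  # 7 boundaries -> 8 buckets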

# This function determines the length of a sentence.
# It will be used by bucket_by_sequence_length to bucket the sentences according to their length.
def _element_length_fn(x, y=None):
    return tf.shape(x)[0]

Now we have prepared all the parameters needed to call the bucket_by_sequence_length API. Here comes our call to the API.

# bucket_by_sequence_length returns a dataset transformation function that has to be applied using dataset.apply.
# The important parameter here is pad_to_bucket_boundary. If it is set to True, the sentences will be padded to
# the bucket boundaries provided. If it is set to False, the sentences are padded to the maximum length found in the batch.
# The default padding value is 0, so we do not need to supply anything extra here.
dataset = dataset.apply(tf.data.experimental.bucket_by_sequence_length(_element_length_fn, boundaries,
                                                                       batch_sizes,
                                                                       drop_remainder=True,
                                                                       pad_to_bucket_boundary=True))

One important thing about `boundaries` is that the last boundary must cover the maximum sentence length in the dataset. If that maximum is unknown to you, I suggest setting pad_to_bucket_boundary=False.
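To see the effect, you can iterate over a few batches and print their shapes (a quick check of my own, assuming eager execution); each batch is padded only as far as its bucket requires, not to the dataset-wide maximum:

# Different batches end up with different sequence lengths.
for x_batch, y_batch in dataset.take(3):
    print('x:', x_batch.shape, 'y:', y_batch.shape)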

Once the dataset is properly batched and padded to an equal shape within each bucket, we can split it into train and test sets.
I was unable to find a better solution for splitting the Dataset than the answer provided here: https://stackoverflow.com/a/58452268/7220545

# Splitting the dataset for training and testing.
def is_test(x, _):
    return x % 4 == 0

def is_train(x, y):
    return not is_test(x, y)

recover = lambda x, y: y

# Split the dataset for testing/validation.
test_dataset = dataset.enumerate() \
    .filter(is_test) \
    .map(recover)

# Split the dataset for training.
train_dataset = dataset.enumerate() \
    .filter(is_train) \
    .map(recover)

After this step, we have our dataset ready for training and validation.
Training the model is out of the scope of this tutorial, but I have made the code available on GitHub, which demonstrates training the model.
To run the code, download the dataset from Kaggle (https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection) and place it in the ./data folder.
Then you are good to go.

I hope you had a nice time reading the article and that it was useful to you.
