How to use Dataset and Iterators in Tensorflow with code samples
Ever since I started using Tensorflow, I had been feeding data to my graph during training, testing and inference using the feed_dict mechanism of Session. Tensorflow developers strongly advise against this practice, both during training and when repeatedly testing the same dataset. The one scenario where feed_dict is still appropriate is inference on incoming data at deployment time. The replacement for feed_dict is the Dataset and Iterator mechanism. A Dataset can be created from Numpy arrays, TFRecords or text.
In this post, we will explore Datasets and Iterators. We will start with how to create Datasets from source data and then apply various types of transformations to them. We will then demonstrate training with the different types of iterators, using the MNIST handwritten digits data on a LeNet-5 model.
Note: The Tensorflow Dataset class can get confusing when mixed with the usual terms for datasets such as X_train, y_train etc. Hence, going forward in this article, ‘Dataset’ (capital D) refers to the Tensorflow Dataset class and ‘dataset’ refers to data such as X_train, y_train etc.
Datasets Creation
Datasets can be generated from multiple types of data sources such as Numpy arrays, TFRecords, text files, CSV files etc. The most common practice is to generate Datasets from Numpy arrays (or Tensors). Let’s go through each of the functions Tensorflow provides to generate them.
a) from_tensor_slices: This method accepts individual (or multiple) Numpy arrays (or Tensors). If you are feeding multiple objects, pass them as a tuple and make sure that all the objects have the same size in the zeroth dimension.
b) from_tensors: Just like from_tensor_slices, this method accepts individual (or multiple) Numpy arrays (or Tensors). But this method does not slice the input into individual elements, i.e. the whole input is given out at once as a single element. As a result, if you pass multiple objects, they may have different sizes in the zeroth dimension. This method is useful in cases where the dataset is very small or your learning model needs all the data at once.
c) from_generator: In this method, a generator function is passed as input. It is useful in cases where you wish to generate the data at runtime and so have no raw data stored, or where your training data is extremely large and cannot all be stored on disk. I would strongly encourage people not to use this method for generating data augmentations. A short sketch of all three creation methods is shown below.
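Here is a minimal sketch of the three creation functions, written against the Tensorflow 1.x tf.data API. The array shapes, sizes and names are purely illustrative and not taken from the repository code.

```python
import numpy as np
import tensorflow as tf

# Toy data; shapes and names are illustrative only.
X = np.random.rand(100, 32, 32, 1).astype(np.float32)
y = np.random.randint(0, 10, size=100).astype(np.int32)

# a) from_tensor_slices: slices the arrays along the zeroth dimension,
#    producing one (image, label) element per example.
ds_slices = tf.data.Dataset.from_tensor_slices((X, y))

# b) from_tensors: the entire (X, y) tuple becomes a single element.
ds_tensors = tf.data.Dataset.from_tensors((X, y))

# c) from_generator: elements are produced lazily by a Python generator.
def gen():
    for i in range(100):
        yield X[i], y[i]

ds_generator = tf.data.Dataset.from_generator(
    gen,
    output_types=(tf.float32, tf.int32),
    output_shapes=((32, 32, 1), ()))
```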
Datasets Transformations
Once you have created the Dataset covering all your data (or, in cases like runtime data generation, all the scenarios), it is time to apply various types of transformations. Let us go through some of the commonly used ones.
a) Batch: The batch transformation sequentially divides your dataset into batches of the specified size.
b) Repeat: Whatever Dataset you have generated, use this transformation to create duplicates of the existing data in your Dataset (commonly used to run multiple epochs).
c) Shuffle: Shuffle transformation randomly shuffles the data in your Dataset.
d) Map: With the map transformation, you can apply an operation to each individual element of your dataset. Use this particular transformation to apply various types of data augmentation. (You can check my other Medium post on image augmentation.)
e) Filter: If you wish to filter out some elements from the Dataset during training, use the filter transformation.
A code example of various transformations being applied to a Dataset is shown next.
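The following is a representative sketch, not the exact code from my repository; the shapes, buffer sizes and the flip augmentation are illustrative choices.

```python
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 32, 32, 1).astype(np.float32)
y = np.random.randint(0, 10, size=100).astype(np.int32)

dataset = tf.data.Dataset.from_tensor_slices((X, y))

# Filter: keep only the examples whose label is non-zero.
dataset = dataset.filter(lambda image, label: tf.not_equal(label, 0))

# Map: apply a per-element operation (here a simple augmentation).
dataset = dataset.map(
    lambda image, label: (tf.image.random_flip_left_right(image), label))

# Shuffle with a buffer, repeat for 10 epochs and batch.
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.repeat(10)
dataset = dataset.batch(32)
```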
Ordering of transformations
The order in which the transformations are applied is very important. Your model may learn differently from the same Dataset if the transformations are ordered differently. Take a look at the code sample below, which shows that a different set of data is produced.
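As an illustrative sketch of the effect (again, not necessarily the exact example from my repository), compare batching before repeating with repeating before batching:

```python
import numpy as np
import tensorflow as tf

data = np.arange(10, dtype=np.int32)

# Order 1: batch first, then repeat. Epoch boundaries are preserved,
# so the last batch of every epoch is the same partial batch [8, 9].
ds_a = tf.data.Dataset.from_tensor_slices(data).batch(4).repeat(2)

# Order 2: repeat first, then batch. Batches can span epoch boundaries,
# e.g. [8, 9, 0, 1] shows up as a batch.
ds_b = tf.data.Dataset.from_tensor_slices(data).repeat(2).batch(4)

with tf.Session() as sess:
    for ds in (ds_a, ds_b):
        next_element = ds.make_one_shot_iterator().get_next()
        batches = []
        try:
            while True:
                batches.append(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            pass
        print(batches)
```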
Building LeNet-5 Model
Before we start with the iterators, let us quickly build our LeNet-5 model and extract the MNIST data. I have used Tensorflow’s Slim library to build the model in a few lines. This is going to be the common code for all the types of iterators we are going to work with next.
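The sketch below shows roughly what this common code looks like; the exact layer widths, scopes and hyper-parameters in my repository may differ.

```python
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.examples.tutorials.mnist import input_data

# Extract MNIST as Numpy arrays and reshape the images to 4-D.
mnist = input_data.read_data_sets('MNIST_data', one_hot=False)
X_train = mnist.train.images.reshape(-1, 28, 28, 1)
y_train = mnist.train.labels.astype(np.int32)
X_val = mnist.validation.images.reshape(-1, 28, 28, 1)
y_val = mnist.validation.labels.astype(np.int32)

def lenet5(images, num_classes=10):
    # images: [batch, 28, 28, 1] float32 tensor
    net = slim.conv2d(images, 6, [5, 5], scope='conv1')
    net = slim.max_pool2d(net, [2, 2], scope='pool1')
    net = slim.conv2d(net, 16, [5, 5], scope='conv2')
    net = slim.max_pool2d(net, [2, 2], scope='pool2')
    net = slim.flatten(net)
    net = slim.fully_connected(net, 120, scope='fc3')
    net = slim.fully_connected(net, 84, scope='fc4')
    logits = slim.fully_connected(net, num_classes, activation_fn=None, scope='fc5')
    return logits
```

The later iterator sketches reuse lenet5 and these X_train, y_train, X_val, y_val arrays.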
Iterators
Now, let’s start building the iterators. Tensorflow provides four types of iterators, and each of them has a specific purpose and use-case behind it.
Regardless of the type of iterator, the iterator’s get_next function is used to create an operation in your Tensorflow graph which, when run in a session, returns values from the Dataset fed to the iterator. Also, an iterator does not keep track of how many elements are present in the Dataset. Hence, it is normal to keep running the iterator’s get_next operation until Tensorflow’s tf.errors.OutOfRangeError exception is raised. The skeleton code of a Dataset and iterator usually looks like this.
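A generic skeleton might look as follows; X, y and batch_size stand in for your own data and settings.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((X, y))
dataset = dataset.batch(batch_size)          # ...plus any other transformations

iterator = dataset.make_one_shot_iterator()  # or another iterator type
next_element = iterator.get_next()

with tf.Session() as sess:
    try:
        while True:
            batch_x, batch_y = sess.run(next_element)
            # ...use the batch here...
    except tf.errors.OutOfRangeError:
        pass  # the Dataset has been exhausted
```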
Next, let’s look into each type of iterator.
a) One-shot iterator
This is the most basic type of iterator. All the data, along with all the transformations it needs, has to be decided before the Dataset is fed into this iterator. A one-shot iterator will iterate through all the elements present in the Dataset and, once exhausted, cannot be used anymore. As a result, the Dataset generated for this iterator can tend to occupy a lot of memory. An example is shown below.
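The sketch below reuses the lenet5 function and the MNIST arrays from the earlier snippet; the batch size, learning rate and shuffle buffer are illustrative.

```python
dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.repeat(10)   # all 10 epochs are baked into the Dataset
dataset = dataset.batch(128)

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()

logits = lenet5(images)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    try:
        while True:
            _, loss_val = sess.run([train_op, loss])
    except tf.errors.OutOfRangeError:
        pass  # all 10 epochs have been consumed
```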
In the example above, we have generated the Dataset for a total of 10 epochs (via repeat(10)). Use this particular iterator only if your dataset is small, or in cases where you would like to perform testing on your model only once.
b) Initializable
The one-shot iterator had the shortfall that the repeated training dataset sits entirely in memory, and there was no way in our code to periodically validate the model on a validation dataset. The initializable iterator overcomes these problems. It has to be initialized with a dataset before it starts running. Take a look at the code.
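A sketch of how this might look, again reusing lenet5 and the MNIST arrays from above; the placeholder names and hyper-parameters are illustrative.

```python
# Placeholders let us re-initialize the same Dataset with different arrays.
X_ph = tf.placeholder(tf.float32, [None, 28, 28, 1])
y_ph = tf.placeholder(tf.int32, [None])

dataset = tf.data.Dataset.from_tensor_slices((X_ph, y_ph))
dataset = dataset.batch(128)

iterator = dataset.make_initializable_iterator()
images, labels = iterator.get_next()

logits = lenet5(images)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        # Initialize the iterator with the training data and run one epoch.
        sess.run(iterator.initializer, feed_dict={X_ph: X_train, y_ph: y_train})
        try:
            while True:
                sess.run(train_op)
        except tf.errors.OutOfRangeError:
            pass

        # Re-initialize the same iterator with the validation data.
        sess.run(iterator.initializer, feed_dict={X_ph: X_val, y_ph: y_val})
        try:
            while True:
                sess.run(loss)  # or an accuracy op
        except tf.errors.OutOfRangeError:
            pass
```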
As can be seen, using the initializer operation, we have switched between the training and validation data while using the same Dataset object.
This iterator is ideal when you have to train your model with datasets that are split across multiple locations and you are not able to accumulate them in one place.
c) Reinitializable
The initializable iterator had the shortfall that every dataset goes through the same pipeline before the Dataset is fed into the iterator. The reinitializable iterator overcomes this problem, as it lets us feed different Dataset objects that go through different pipelines. The only care to be taken is that the different Datasets are of the same data type. Take a look at the code.
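A sketch of the idea follows; the flip augmentation, shapes and hyper-parameters are illustrative and the exact pipeline in my repository may differ.

```python
X_train_ph = tf.placeholder(tf.float32, [None, 28, 28, 1])
y_train_ph = tf.placeholder(tf.int32, [None])
X_val_ph = tf.placeholder(tf.float32, [None, 28, 28, 1])
y_val_ph = tf.placeholder(tf.int32, [None])

# Training pipeline: shuffling plus a simple (illustrative) augmentation.
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_ph, y_train_ph))
train_dataset = train_dataset.map(
    lambda image, label: (tf.image.random_flip_left_right(image), label))
train_dataset = train_dataset.shuffle(10000).batch(128)

# Validation pipeline: no augmentation, no shuffling.
val_dataset = tf.data.Dataset.from_tensor_slices((X_val_ph, y_val_ph))
val_dataset = val_dataset.batch(128)

# A reinitializable iterator is built from the common structure,
# not from one particular Dataset.
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
                                           train_dataset.output_shapes)
images, labels = iterator.get_next()

train_init_op = iterator.make_initializer(train_dataset)
val_init_op = iterator.make_initializer(val_dataset)

logits = lenet5(images)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        sess.run(train_init_op, feed_dict={X_train_ph: X_train, y_train_ph: y_train})
        try:
            while True:
                sess.run(train_op)
        except tf.errors.OutOfRangeError:
            pass

        sess.run(val_init_op, feed_dict={X_val_ph: X_val, y_val_ph: y_val})
        try:
            while True:
                sess.run(loss)
        except tf.errors.OutOfRangeError:
            pass
```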
Notice that the training Dataset object undergoes additional augmentation which the validation Dataset does not. You could have fed the training and validation datasets into the Dataset objects directly, but I have made use of placeholders just to show the flexibility.
d) Feedable
The reinitializable iterator gave us the flexibility of assigning differently pipelined Datasets to the iterator, but it could not maintain state (i.e. how far the data had been emitted from each individual Dataset). In the code sample below, I am showing how to use the feedable iterator.
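A sketch of a feedable iterator, again reusing lenet5 and the MNIST arrays; the step counts and batch size are illustrative.

```python
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_dataset = train_dataset.shuffle(10000).repeat().batch(128)

val_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))
val_dataset = val_dataset.repeat().batch(128)

# Each underlying iterator keeps its own position in its Dataset.
train_iterator = train_dataset.make_one_shot_iterator()
val_iterator = val_dataset.make_one_shot_iterator()

# The feedable iterator selects between them via a string handle.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, train_dataset.output_types, train_dataset.output_shapes)
images, labels = iterator.get_next()

logits = lenet5(images)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_handle = sess.run(train_iterator.string_handle())
    val_handle = sess.run(val_iterator.string_handle())

    for step in range(1000):
        sess.run(train_op, feed_dict={handle: train_handle})
        if step % 100 == 0:
            # Switching handles does not reset either iterator's position.
            val_loss = sess.run(loss, feed_dict={handle: val_handle})
```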
Though not illustrated in the code sample above, using the string handle we can resume from the exact point where data extraction left off when switching between different Datasets.
This iterator is ideal in scenarios where you are training a model with different datasets simultaneously and you need finer control over which batch of which dataset is fed to the model next.
Datasets can be generated from TFRecords as well. I have written another article on how to create TFRecords and how to feed them into Datasets. Do check it out, as it is like part 2 of this article.
You can check the code used in this article from my Github repository.
Or if you wish to see the code running in notebook, you can check it in my shared Colab notebook.
Do leave your valuable comments on what you feel about this article.