Interest in Tensorflow has increased steadily since its introduction in November 2015. A lesser-known component of Tensorflow is the TFRecord file format, Tensorflow’s own binary storage format.
If you are working with large datasets, using a binary file format for storage of your data can have a significant impact on the performance of your import pipeline and as a consequence on the training time of your model. Binary data takes up less space on disk, takes less time to copy and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to the much lower read/write performance in comparison with SSDs.
However, pure performance isn’t the only advantage of the TFRecord file format. It is optimized for use with Tensorflow in multiple ways. To start with, it makes it easy to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library. Especially for datasets that are too large to be stored fully in memory this is an advantage as only the data that is required at the time (e.g. a batch) is loaded from disk and then processed. Another major advantage of TFRecords is that it is possible to store sequence data — for instance, a time series or word encodings — in a way that allows for very efficient and (from a coding perspective) convenient import of this type of data. Check out the Reading Data guide to learn more about reading TFRecord files.
So, there are a lot of advantages to using TFRecords. But where there is light, there must be shadow, and in the case of TFRecords the downside is that you have to convert your data to this format in the first place, and only limited documentation is available on how to do that. An official tutorial and a number of articles about writing TFRecords exist, but I found that they only got me part of the way toward solving my problem.
In this post I will explain the components required to structure and write a TFRecord file, and explain in detail how to write different types of data. This will help you get started to tackle your own challenges.
A TFRecord file stores your data as a sequence of binary strings. This means you need to specify the structure of your data before you write it to the file. Tensorflow provides two components for this purpose: tf.train.Example and tf.train.SequenceExample. You have to store each sample of your data in one of these structures, then serialize it and use a tf.python_io.TFRecordWriter to write it to disk.
As a software developer, the main problem I had at the beginning was that many of the components in the Tensorflow API don’t have a description of the attributes or methods of the class. For instance, for tf.train.Example only a “.proto” file with cryptic structures called “message” is provided, along with examples in pseudocode. The reason for this is that tf.train.Example isn’t a normal Python class, but a protocol buffer. Protocol buffers are a mechanism developed by Google for serializing structured data efficiently. I will now discuss the two main ways to structure Tensorflow TFRecords, give an overview of the components from a developer’s point of view and provide a detailed example of how to use tf.train.Example and tf.train.SequenceExample.
Movie recommendations using tf.train.Example
If your dataset consists of features, where each feature is a list of values of the same type, tf.train.Example is the right component to use.
Let’s use the movie recommendation application from the Tensorflow documentation as an example:
We have a number of features, each being a list where every entry has the same data type. In order to store these features in a TFRecord, we first need to create the lists that constitute the features.
tf.train.BytesList, tf.train.FloatList, and tf.train.Int64List are at the core of a tf.train.Feature. All three have a single attribute, value, which expects a list of bytes, floats, or ints, respectively.
Python strings need to be converted to bytes (e.g. my_string.encode('utf-8')) before they are stored in a tf.train.BytesList.
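As a minimal sketch of the three list types, assuming a Tensorflow installation that exposes them under tf.train, the snippet below builds one list of each kind. The variable names and values are purely illustrative and are not taken from the dataset in this post.

```python
import tensorflow as tf

# Illustrative values only; the variable names are hypothetical.
movie_names = tf.train.BytesList(
    value=["The Shawshank Redemption".encode("utf-8"), b"Fight Club"]
)
movie_ratings = tf.train.FloatList(value=[9.0, 9.7])
suggestion_purchased = tf.train.Int64List(value=[1])

# Each wrapper exposes its contents through the repeated field `value`.
print(list(movie_names.value))
print(list(movie_ratings.value))
print(list(suggestion_purchased.value))
```

Note that FloatList stores 32-bit floats, so a value such as 9.7 may print with small rounding differences.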
tf.train.Feature wraps a list of data of a specific type so Tensorflow can understand it. It has a single attribute, which is a union of bytes_list/float_list/int64_list. Being a union, the stored list can be of type tf.train.BytesList (attribute name bytes_list), tf.train.FloatList (attribute name float_list), or tf.train.Int64List (attribute name int64_list).
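The union behaviour can be sketched as follows. This assumes the generated protocol buffer class exposes the standard WhichOneof method and that the union is named "kind", as in Tensorflow's feature.proto; the variable names are hypothetical.

```python
import tensorflow as tf

# Wrap one list per Feature; only one of the three keyword arguments
# should be set, because the underlying field is a union.
ratings_feature = tf.train.Feature(
    float_list=tf.train.FloatList(value=[9.0, 9.7])
)
names_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[b"The Shawshank Redemption", b"Fight Club"])
)

# WhichOneof reports which member of the union is populated.
print(ratings_feature.WhichOneof("kind"))
print(names_feature.WhichOneof("kind"))
```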
tf.train.Features is a collection of named features. It has a single attribute feature that expects a dictionary where the key is the name of the features and the value a tf.train.Feature.
tf.train.Example is one of the main components for structuring a TFRecord. A tf.train.Example stores features in a single attribute features of type tf.train.Features.
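Putting the pieces together, a sketch of how named features might be collected in a tf.train.Features and wrapped in a tf.train.Example could look like this. The feature names and values are assumptions made for illustration.

```python
import tensorflow as tf

# Illustrative data for a single user; feature names are hypothetical.
features = tf.train.Features(feature={
    "age": tf.train.Feature(float_list=tf.train.FloatList(value=[29.0])),
    "movie": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b"The Shawshank Redemption", b"Fight Club"])),
    "movie_ratings": tf.train.Feature(
        float_list=tf.train.FloatList(value=[9.0, 9.7])),
    "suggestion_purchased": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[1])),
})

example = tf.train.Example(features=features)

# The `feature` attribute behaves like a mapping from name to Feature.
feature_names = sorted(example.features.feature.keys())
second_movie = example.features.feature["movie"].bytes_list.value[1]
print(feature_names)
print(second_movie)
```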
In contrast to the previous components, tf.python_io.TFRecordWriter actually is a Python class. It accepts a file path in its path argument and creates a writer object that works much like any other file object. The TFRecordWriter class offers write, flush and close methods. The write method accepts a string (a bytes object in Python 3) and writes it to disk as one record, meaning that structured data must be serialized first. To this end, tf.train.Example and tf.train.SequenceExample provide SerializeToString methods:
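A minimal sketch of serializing and writing one sample follows. It assumes a Tensorflow version in which the writer is exposed as tf.io.TFRecordWriter (the newer name for tf.python_io.TFRecordWriter); the file name and feature name are hypothetical.

```python
import os
import tempfile

import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "movie_ratings": tf.train.Feature(
        float_list=tf.train.FloatList(value=[9.0, 9.7])),
}))

# Serialization turns the structured message into a plain bytes object.
serialized = example.SerializeToString()

# Write to a temporary directory so the sketch leaves no files behind
# in the working directory.
record_path = os.path.join(tempfile.mkdtemp(), "ratings.tfrecord")
with tf.io.TFRecordWriter(record_path) as writer:
    writer.write(serialized)

# The serialized bytes can be turned back into a message for inspection.
restored = tf.train.Example.FromString(serialized)
print(restored == example)
```

Using the writer as a context manager closes the file on exit, so no explicit call to close is needed.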
In our example, each record in the TFRecord file represents the movie ratings and corresponding suggestions of a single user (a single sample). Writing recommendations for all users in the dataset follows the same process. It is important that the type of a feature (e.g. float for the movie rating) is the same across all samples in the dataset. This conformance criterion and others are defined in the protocol buffer definition of tf.train.Example.
Here’s a complete example that writes the features to a TFRecord file, then reads the file back in and prints the parsed features.
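The following round trip is a sketch rather than the original code of this post. It assumes Tensorflow 2.x with eager execution, where records are read with tf.data.TFRecordDataset and parsed with tf.io.parse_single_example. The helper make_example, the feature names, and the data are hypothetical.

```python
import os
import tempfile

import tensorflow as tf


def make_example(age, movies, ratings):
    """Build one tf.train.Example for a single (illustrative) user."""
    return tf.train.Example(features=tf.train.Features(feature={
        "age": tf.train.Feature(float_list=tf.train.FloatList(value=[age])),
        "movie": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[m.encode("utf-8") for m in movies])),
        "movie_ratings": tf.train.Feature(
            float_list=tf.train.FloatList(value=ratings)),
    }))


users = [
    (29.0, ["The Shawshank Redemption", "Fight Club"], [9.0, 9.7]),
    (41.0, ["Inception"], [8.5]),
]

# Write one record per user.
record_path = os.path.join(tempfile.mkdtemp(), "users.tfrecord")
with tf.io.TFRecordWriter(record_path) as writer:
    for age, movies, ratings in users:
        writer.write(make_example(age, movies, ratings).SerializeToString())

# Describe the expected features; lists of varying length use VarLenFeature.
feature_description = {
    "age": tf.io.FixedLenFeature([], tf.float32),
    "movie": tf.io.VarLenFeature(tf.string),
    "movie_ratings": tf.io.VarLenFeature(tf.float32),
}

# Read the file back and parse one record at a time.
parsed_users = []
for raw_record in tf.data.TFRecordDataset(record_path):
    parsed = tf.io.parse_single_example(raw_record, feature_description)
    parsed_users.append({
        "age": float(parsed["age"].numpy()),
        "movie": [m.decode("utf-8") for m in parsed["movie"].values.numpy()],
        "movie_ratings": parsed["movie_ratings"].values.numpy().tolist(),
    })

for user in parsed_users:
    print(user)
```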
Now that we’ve covered the structure of TFRecords, the process of reading them is straightforward:
- Read the TFRecord using a tf.TFRecordReader.
- Define the features you expect in the TFRecord by using tf.FixedLenFeature and tf.VarLenFeature, depending on what has been defined during the definition of tf.train.Example.
- Parse one tf.train.Example (one record) at a time using tf.parse_single_example.
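To illustrate the difference between fixed-length and variable-length feature definitions, here is a small sketch that parses a serialized example directly from memory. It assumes Tensorflow 2.x, where these functions live under tf.io (tf.io.FixedLenFeature, tf.io.VarLenFeature, tf.io.parse_single_example); the feature names and values are hypothetical.

```python
import tensorflow as tf

serialized = tf.train.Example(features=tf.train.Features(feature={
    "age": tf.train.Feature(float_list=tf.train.FloatList(value=[29.0])),
    "movie_ratings": tf.train.Feature(
        float_list=tf.train.FloatList(value=[9.0, 9.5])),
})).SerializeToString()

feature_description = {
    # Exactly one value per record: parsed into a dense scalar tensor.
    "age": tf.io.FixedLenFeature([], tf.float32),
    # Any number of values per record: parsed into a SparseTensor.
    "movie_ratings": tf.io.VarLenFeature(tf.float32),
    # Not present in the record above, so the default value is used.
    "suggestion_purchased": tf.io.FixedLenFeature(
        [], tf.int64, default_value=0),
}

parsed = tf.io.parse_single_example(serialized, feature_description)

age = float(parsed["age"].numpy())
ratings = parsed["movie_ratings"].values.numpy().tolist()
purchased = int(parsed["suggestion_purchased"].numpy())
print(age, ratings, purchased)
```

A fixed-length feature without a default value raises an error when the key is missing from a record, so a default is a reasonable choice for optional fields.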
Movie recommendations using tf.train.SequenceExample
tf.train.SequenceExample is the right choice if you have features that consist of lists of identically typed data and maybe some contextual data.
Now, let’s take a slightly different set of data, again from the Tensorflow documentation:
We have a number of context features — Locale, Age, and Favorites — that are user specific and a list of movie recommendations of the user, which consist of Movie Name, Movie Rating, and Actors.
The data looks very similar — in the previous example we had a set of features, where each feature consisted of a single list. Each entry in the list represented the same information for a different movie, for instance the movie rating. This didn’t change, but now we also have Actors, which is a list of the actors with a role in the movie. This type of data cannot be stored in a tf.train.Example. We need a different type of structure for this kind of data, and Tensorflow provides it in the form of tf.train.SequenceExample. In contrast to tf.train.Example, it does not store a list of bytes, floats or int64s, but a *list of lists* of bytes, floats or int64s, and is thus suited for our dataset.
More formally, tf.train.SequenceExample has two attributes:
- context of type tf.train.Features
- feature_lists of type tf.train.FeatureLists
The data from table “Context” is stored in context as tf.train.Features, just like we did for tf.train.Example. The data from table “Data” (Movie Name, Movie Rating, and Actors) is stored in a separate tf.train.FeatureList for each column.
tf.train.FeatureList has a single attribute feature that expects a list with entries of type tf.train.Feature. At first, this might look similar to tf.train.Features, which also contains multiple entries of type tf.train.Feature, but there are two big differences. First, all of the features in the list must have the same internal list type. Second, while tf.train.Features is a dictionary containing (unordered) named features, tf.train.FeatureList is a list containing ordered unnamed features.
A typical example of data stored in a tf.train.FeatureList would be a time series where each tf.train.Feature in the list is a time step of the sequence, or the list of actors for several different movies.
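As a sketch of the second case, the list of actors for two movies could be stored as follows. The values are illustrative and the variable name is hypothetical.

```python
import tensorflow as tf

# Each Feature is one entry of the sequence: here, the actors of one movie.
# The entries may differ in length, but all use the same list type.
actors = tf.train.FeatureList(feature=[
    tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b"Tim Robbins", b"Morgan Freeman"])),
    tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b"Brad Pitt", b"Edward Norton", b"Helena Bonham Carter"])),
])

# The order of the entries is preserved, and the entries carry no names.
number_of_movies = len(actors.feature)
actors_per_movie = [len(f.bytes_list.value) for f in actors.feature]
print(number_of_movies, actors_per_movie)
```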
tf.train.FeatureLists is a collection of named instances of tf.train.FeatureList. This component has a single attribute feature_list that expects a dict. The key of the dict is the name of the tf.train.FeatureList, while the value is the tf.train.FeatureList itself.
tf.train.SequenceExample, just like tf.train.Example, is one of the main components for structuring a TFRecord. In contrast to tf.train.Example, it has two attributes:
- context: this attribute expects type tf.train.Features. It contains information that is relevant for each of the features in the feature_list attribute. The behavior of context is identical to the features attribute of tf.train.Example.
- feature_lists: the type of this attribute is tf.train.FeatureLists. It contains lists of features, where each feature again is some kind of sequential data (e.g. a time series, or frames).
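The two attributes can be sketched together as shown below. The feature names and values are assumptions chosen for illustration, not the exact dataset of this post.

```python
import tensorflow as tf

# Context: information about the user that applies to the whole sequence.
context = tf.train.Features(feature={
    "locale": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"pt_BR"])),
    "age": tf.train.Feature(float_list=tf.train.FloatList(value=[19.0])),
})

# Sequence data: one FeatureList per column, one Feature per movie.
movie_ratings = tf.train.FeatureList(feature=[
    tf.train.Feature(float_list=tf.train.FloatList(value=[4.5])),
    tf.train.Feature(float_list=tf.train.FloatList(value=[5.0])),
])
actors = tf.train.FeatureList(feature=[
    tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b"Tim Robbins", b"Morgan Freeman"])),
    tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b"Brad Pitt", b"Edward Norton", b"Helena Bonham Carter"])),
])

feature_lists = tf.train.FeatureLists(feature_list={
    "movie_ratings": movie_ratings,
    "actors": actors,
})

sequence_example = tf.train.SequenceExample(
    context=context, feature_lists=feature_lists)

list_names = sorted(sequence_example.feature_lists.feature_list.keys())
print(list_names)
```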
You can find more information about tf.train.SequenceExample in the protocol buffer definition. As a side note, while conformance criteria exist, they are not necessarily enforced; for example, feature_list_invalid from the FeatureList example won’t throw an exception.
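The missing enforcement can be sketched like this. The name feature_list_invalid is reused from the text, but its contents here are my assumption: a list that mixes two different internal list types, which violates the rule that all entries share one type.

```python
import tensorflow as tf

# Assumed reconstruction: a FeatureList whose entries do NOT share the
# same internal list type (bytes in the first entry, floats in the second).
feature_list_invalid = tf.train.FeatureList(feature=[
    tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Fight Club"])),
    tf.train.Feature(float_list=tf.train.FloatList(value=[9.7])),
])

# Neither construction nor serialization raises an exception, because the
# protocol buffer itself does not check the conformance criteria.
serialized_invalid = feature_list_invalid.SerializeToString()
kinds = [f.WhichOneof("kind") for f in feature_list_invalid.feature]
print(kinds)
```

Problems with such data tend to surface only later, at parsing time, so it is worth validating the types yourself before writing.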
Here again is a full example:
Reading TFRecords based on tf.train.SequenceExample works the same as for tf.train.Examples. The only difference is that instead of just a single set of features, we need to define two: context and sequence features. Context features work exactly the same as shown before. Sequence features must be of type tf.VarLenFeature or tf.FixedLenSequenceFeature and are parsed using tf.parse_single_sequence_example.
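A sketch of the full round trip for sequence data is shown below. It assumes Tensorflow 2.x with eager execution, where the parsing function is available as tf.io.parse_single_sequence_example. The helper functions bytes_feature and float_feature, the feature names, and the data are hypothetical.

```python
import os
import tempfile

import tensorflow as tf


def bytes_feature(values):
    """Wrap a list of Python strings in a bytes-typed Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[v.encode("utf-8") for v in values]))


def float_feature(values):
    """Wrap a list of floats in a float-typed Feature."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        "locale": bytes_feature(["pt_BR"]),
        "age": float_feature([19.0]),
        "favorites": bytes_feature(["Majesty Rose", "One Direction"]),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        "movie_names": tf.train.FeatureList(feature=[
            bytes_feature(["The Shawshank Redemption"]),
            bytes_feature(["Fight Club"])]),
        "movie_ratings": tf.train.FeatureList(feature=[
            float_feature([4.5]), float_feature([5.0])]),
        "actors": tf.train.FeatureList(feature=[
            bytes_feature(["Tim Robbins", "Morgan Freeman"]),
            bytes_feature(["Brad Pitt", "Edward Norton",
                           "Helena Bonham Carter"])]),
    }),
)

record_path = os.path.join(tempfile.mkdtemp(), "sequence.tfrecord")
with tf.io.TFRecordWriter(record_path) as writer:
    writer.write(sequence_example.SerializeToString())

# Two sets of definitions: one for the context, one for the sequences.
context_features = {
    "locale": tf.io.FixedLenFeature([], tf.string),
    "age": tf.io.FixedLenFeature([], tf.float32),
    "favorites": tf.io.VarLenFeature(tf.string),
}
sequence_features = {
    # One value per step: parsed into a dense tensor of shape [steps].
    "movie_names": tf.io.FixedLenSequenceFeature([], tf.string),
    "movie_ratings": tf.io.FixedLenSequenceFeature([], tf.float32),
    # Varying number of values per step: parsed into a SparseTensor.
    "actors": tf.io.VarLenFeature(tf.string),
}

for raw_record in tf.data.TFRecordDataset(record_path):
    context, sequence = tf.io.parse_single_sequence_example(
        raw_record,
        context_features=context_features,
        sequence_features=sequence_features,
    )
    locale = context["locale"].numpy().decode("utf-8")
    movie_names = [n.decode("utf-8") for n in sequence["movie_names"].numpy()]
    movie_ratings = sequence["movie_ratings"].numpy().tolist()
    actors_shape = sequence["actors"].dense_shape.numpy().tolist()
    print(locale, movie_names, movie_ratings, actors_shape)
```

The sparse shape of the actors feature is [number of movies, largest number of actors in any one movie], which is why sequences of differing length can share one tensor.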
Using Tensorflow TFRecords is a convenient way to get your data into your machine learning pipeline, but understanding all the bits and pieces can be daunting at the beginning. The examples in this post should clarify the whole process and get you started.
You can find all of the code snippets I used, and more, on Github; feel free to copy and use them in any way you like.
I used a number of resources to create this story:
- The official tutorial shows a coding example of how to write image data to a TFRecord using tf.train.Example.
- This tutorial from Machine Learning Guru uses the example from the official tutorial and goes into more depth.
- This StackOverflow post shows how to use tf.train.SequenceExample.