Introducing GulpIO

A bespoke storage format for deep learning on videos

In this blog post we introduce our latest open source offering: GulpIO, a bespoke storage format for deep learning on video datasets. Its primary use-case is to accelerate loading large video datasets from disk in order to better utilize GPU resources while training deep learning models. Compared to training with individual JPEG images, we were able to achieve a ~9x-10x speed-up when loading the same data in GulpIO format from magnetic disks.

Introduction

When we first started training deep learning models on videos we noticed the following effect:

GPUs under-utilized

As you can see, the utilization of the GPUs on this machine fluctuates significantly between busy and idle; in fact, they sit idle for a large part of the time. Our hunch was that the GPUs were starving. What does this mean? "Starvation" is a term coined in the CPU context: it describes a situation in which the compute unit (CPU/GPU) cannot be saturated with data and therefore sits idle for a significant portion of time, simply because it has nothing to operate on. Loading the data from disk to memory, and from memory to CPU caches or GPU memory, seemed to be the bottleneck. Digging a little deeper, we found that the official TensorFlow Performance Guide contains the following two paragraphs:

“One common cause of poor performance is underutilizing GPUs, or essentially “starving” them of data by not setting up an efficient pipeline.
Another simple way to check if a GPU is underutilized is to run watch nvidia-smi, and if GPU utilization is not approaching 100% then the GPU is not getting data fast enough.”

While we primarily use PyTorch rather than TensorFlow, the arguments and insights are equally applicable. Furthermore, this is evidence that our reasoning about the bottleneck is on the right track.

As a side-note: this problem becomes more and more evident as the size of the dataset and the volume of the mini-batches increase, i.e. when going from images to videos. For example, the ImageNet dataset is around 160 GB, whereas the Kinetics video dataset is around 1 TB. Most of the deep learning compute-resources available at TwentyBN are equipped with around 128 to 512 GB of RAM, which means that ImageNet easily fits into the file system cache of many of these machines. Training will therefore only be slow during the first epoch; after that, all the images are cached in memory and you no longer need to access the disk to load samples. The Kinetics dataset, on the other hand, does not fit into the file system cache, and since you iterate through the entire training set during an epoch, you will be hitting the disk for each and every sample. Obviously, the problem is exacerbated when using data parallelism and training across multiple GPUs: the number of disk accesses increases linearly with the number of GPUs. Disk access might not be a big problem for image datasets, however, it gets worse for video, since each instance is a collection of images and consequently requires multiple disk accesses per instance.

While participating in the Kinetics challenge we stumbled upon this problem once again and decided to embark on a quest to discover the ideal storage format for our video data. Until then, we had been using ffmpeg to burst the videos into JPEG frames, which we stored as individual files. Bursting at training time was too compute-intensive, so storing each video as just a bunch of JPEGs (JBOJs) on disk was the best solution we had seen so far. During our explorations we also looked at solutions like PyTables, LMDB and MXNet’s RecordIO, but decided that these would be too complicated to use and hard to reason about due to the complex nature of the formats. Furthermore, we were under immense time-pressure to obtain some reasonable results, with the Kinetics deadline quickly approaching. So we burned the midnight oil and devised a really simple but effective solution, which we dubbed GulpIO. A very hacky (but working) prototype was coded in a few hours and we started gulping the Kinetics video dataset. Initial results were very promising: the format provided much faster reading and training times, and our initial measurements indicated a speed-up of 9.0x-16.0x for magnetic disks and 1.7x-2.0x for solid state disks. We were very happy with the results and continued to use the format throughout the challenge.

After the challenge we put in some effort to clean up the code-base, removed all the copy-and-paste duplicates, added some tests and essentially applied industry best practices for software engineering. During the implementation phase we received lots of valuable feedback internally and the format was quickly adopted throughout the company by our deep learning researchers and engineers. Today, we are releasing this code as an open source project!

<tongue-in-cheek>
Our approach here reminds me a little of Dr. House: if you have a hunch about the illness, treat the patient accordingly. If he recovers, your hunch was right; if not, well, keep “hunching”. In this case, our patient recovered.
</tongue-in-cheek>
Acceptable GPU utilization

Features

GulpIO supports the following features:

Fast data loading from disk to memory.

  • Optimized for read-performance.
  • Sequential/linear data format reduces delays caused by seek time.
  • Data organization in chunks is beneficial for most file systems.
  • Allows for both serial (fast) and random (slower) access.
  • Gives reasonable performance even with little system memory, a small file system cache and a slow magnetic disk.
  • Improved multi-GPU training when using data parallelism.

Easy to hack and modify

  • Index file is easy to replace (e.g. with a B+ tree) since data and metadata are separated.
  • The format is so simple that readers in alternative languages are easy and straightforward to implement.
  • Small, highly extensible code-base.

Generic video and image data format.

  • Single data interface for multiple datasets.
  • Easy join and combine operations across datasets.
  • Included data loader is independent from the deep learning framework. A nice example can be found in this notebook.

Efficient data storage.

  • Compression (JPEG for frames and images) for a minimized storage footprint.
  • Succinct representation with minimal overhead.

Efficient data analysis.

  • Decoupled metadata files to perform independent data analysis.
  • Can load metadata independently of the data.

Efficient data distribution.

  • Coherent representation for easy distribution among sharing parties.
  • Version control among sharing parties.
  • Chunk files can easily be copied across the network.

Easy dataset appending with new data samples.

  • Updates are applied by creating new GulpIO chunks.

Format Description

The format itself is very simple: we concatenate all the frames of a video as JPEG images into a single data-file and save the offsets and metadata in a secondary meta-file. Using JPEG compression reduces the file size significantly compared to storing the raw pixel values of each frame. Using a simple format makes it easy to implement and easy to reason about, which is important when working so close to the metal. The downside is that, being special-purpose, it is no longer generally applicable. For example, storing single-image data is still feasible, but audio data should probably be saved differently. An important feature of the format is that multiple videos can be stored in a single gulp chunk. This means that when gulping a dataset, we can end up with just a handful of gulp chunks, say 20 or so. This is great news for most file systems and additionally ensures that we don’t run out of inodes. Furthermore, having all videos and all frames stored sequentially can be very beneficial when reading from magnetic storage since it circumvents many delays caused by seeking. A second important feature is that every frame is padded to a 4-byte boundary, because data alignment tends to improve performance on most architectures; RecordIO implements this too.
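To make the layout concrete, here is a minimal sketch of the concatenate-and-pad step, with the `[offset, padding, total_length]` bookkeeping that later ends up in the meta-file. The function name and structure are ours for illustration, not the actual GulpIO implementation:

```python
import io

def gulp_frames(frames):
    """Concatenate JPEG-encoded frames into one data blob, padding each
    frame to a 4-byte boundary, and record [offset, padding, total_length]
    for every frame (the triplet stored in the meta-file)."""
    data = io.BytesIO()
    frame_info = []
    for jpeg_bytes in frames:
        offset = data.tell()                       # where this frame starts
        padding = (4 - len(jpeg_bytes) % 4) % 4    # bytes needed for alignment
        data.write(jpeg_bytes + b"\x00" * padding)
        frame_info.append([offset, padding, len(jpeg_bytes) + padding])
    return data.getvalue(), frame_info
```

For example, gulping two frames of 5 and 8 bytes yields the triplets `[0, 3, 8]` and `[8, 0, 8]`: the first frame is padded with 3 zero bytes so that the second one starts on a 4-byte boundary.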

As outlined above, when gulping a dataset, two complementary files are created for every chunk: a *.gulp data-file that contains the actual data and a *.gmeta meta-file that contains the metadata and indexing infos.

The layout of the *.gulp file (when storing videos) is as follows:

GulpIO data-file format

The layout of the *.gmeta file is a mapping, where each id representing a video maps to two further mappings: frame_info and meta_data. The meta_data maps to arbitrary, user-defined metadata, and frame_info maps, for each frame, to a triplet containing the offset (index) into the data-file, the number of bytes used for padding and the total length of the frame (including padding): [<offset>, <padding>, <total_length>]. The frame_info is required to recover the frames from the data-file: based on this location information, one can seek into the data-file to the starting point of a frame and then read the following bytes (minus the padding) to retrieve that frame.
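The recovery step can be sketched as a small helper that consumes one such triplet. Again, this is an illustrative sketch of the lookup logic, not the library’s actual code:

```python
import io

def read_frame(data_file, offset, padding, total_length):
    """Seek to a frame inside a *.gulp data-file and return its JPEG bytes,
    stripping the alignment padding recorded in the meta-file triplet."""
    data_file.seek(offset)
    record = data_file.read(total_length)
    return record[:total_length - padding]
```

Given a triplet such as `[0, 3, 8]`, the helper reads 8 bytes starting at offset 0 and drops the last 3 padding bytes to recover the original 5-byte frame.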

In the case of video data, each sample in the data-file corresponds to a frame and the meta-file includes information per frame. This frame-based approach allows GulpIO to handle image recognition problems as well, although GulpIO mainly targets video recognition.

Usage Examples

We designed a simple API that allows Pythonic read access to GulpIO based datasets.

In addition to this, we also provide a data loader which is independent from the deep learning framework, it simply returns the samples as pure Numpy tensors.

Lastly, in order to ingest an arbitrary dataset, we implemented something like an adapter pattern. Essentially, you have to inherit from the following abstract adapter.
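To give a feel for the shape of such an adapter, here is a minimal sketch. The interface (a length plus an iterator yielding samples) approximates GulpIO’s abstract adapter; the concrete in-memory adapter, its class name and the exact dictionary keys are our own illustrative assumptions:

```python
from abc import ABC, abstractmethod

class AbstractDatasetAdapter(ABC):
    """Interface the ingestor expects: a sample count and an iterator
    that yields one dictionary per sample (id, frames, metadata)."""

    @abstractmethod
    def __len__(self):
        ...

    @abstractmethod
    def iter_data(self):
        ...

class ListAdapter(AbstractDatasetAdapter):
    """Toy adapter over an in-memory list of (id, frames, label) tuples.
    A real adapter would instead read labels from CSV/JSON and decode
    video files or frame folders."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def iter_data(self):
        for sample_id, frames, label in self.samples:
            yield {"id": sample_id, "frames": frames, "meta": {"label": label}}
```

The ingestor only needs these two methods, which is what makes the approach source-agnostic: swapping local disk for S3 or CSV for JSON only changes the adapter’s internals.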

You can then use the GulpIngestor class with your adapter to easily ingest your dataset.

The advantage of the approach is that we are future-proof because we can cater to pretty much any dataset. It doesn’t matter if your labels are in CSV, JSON or YAML format. It doesn’t matter if your data is stored on the local disk, on Amazon S3 or Google Cloud Storage. Writing an appropriate adapter will enable you to quickly ingest your data.

Beyond that, we also provide some ready-made command-line utilities for known video and image datasets. In fact, if you have an adapter, there exists a command-line utility, gulp_register_adapter, which will wrap your adapter and produce a custom command-line utility. While this may not work perfectly, it will certainly yield a useful starting point for further development.

Benchmarks

In order to demonstrate and experiment with the GulpIO code, we are additionally releasing a benchmarking repository. This repository contains a fully-fledged deep learning model implementation in PyTorch which you can train on the 20BN-Jester dataset. Importantly, this repository contains two loaders, a JPEG one and a GulpIO one, which can be compared against each other, both as individual loaders and as part of a training pipeline. All scripts and code used during these benchmarks are available in the repository for you to try!

Initial results when benchmarking the data loaders in isolation using a magnetic disk indicate that the GulpIO loader is roughly one order of magnitude faster than the JPEG loader.

All the results reported here were run on a desktop dual GPU system with the following specs:

  • 2x GTX 1080 Ti
  • Hexacore Intel i7–6850K Processor
  • 128 GB RAM
  • 3TB Western Digital disk
  • MSI x99A Motherboard

For benchmarking, the data loaders fetched 50 batches, each of size torch.Size([10, 3, 18, 84, 84]). The measurements are in seconds.

Run 1

$ nocache python data_loader_jpeg.py
61.415191650390625
$ nocache python data_loader_gulpio.py
5.9158337116241455

Run 2

$ nocache python data_loader_jpeg.py
58.36166548728943
$ nocache python data_loader_gulpio.py
6.112927436828613

As you can see, we achieved a 10x faster loading time under the given circumstances.

In order to replicate somewhat realistic conditions we also benchmarked the data loaders as part of a deep learning training pipeline, as mentioned above. While the model is a fairly simple and standard conv-net which probably won’t give great accuracy, the example still serves to highlight the advantage of using GulpIO.

JPEG Loader

GulpIO Loader

As you can see, using GulpIO leads to 9x faster training under our circumstances.

Summary

In this blog post we presented our latest open source offering: GulpIO. The post contains a description of the newly developed file format, which allows for better utilization of the GPUs. In initial benchmarks we observed a ~9x-10x speed-up compared to a JPEG loader when training on spinning hard drives. The project is still in its early stages but has already become the de-facto standard for very large video datasets at TwentyBN. We look forward to discussing future use-cases and receiving constructive criticism from the community. We encourage you to get in touch: submit issues and pull-requests on our GitHub repositories, run your own benchmarks and discuss with us on our Google group!

Enjoy data!