TFRecords - how to use sharding

Rodrigo Brechard
2 min read · Nov 7, 2019

Image by Gerd Altmann from Pixabay

You have probably already read a lot about TFRecords and how to use them. If not, there are several resources you can check, like this Medium post or the official TensorFlow guide. But you came here looking for help with sharding. If you have a big dataset, sharding is highly recommended: shuffling works on a buffer that only holds part of the data at a time, so without shards you will not be able to shuffle the data correctly, as they explain here.

So how do we create them? It is actually very easy. First, we have to decide how many shards we want. In my case, I’m working with images, and since the recommended size for each shard is 100–200 MB, I found that 800 images per shard was a good number. Second, we iterate over all the files, and every 800 images we start a new record file. The most important thing is the naming of those files. They need a pattern so that TensorFlow can find them later. This pattern is as simple as “*-of-*” (where the first “*” is the shard index and the second one is the number of shards). Of course, if you don’t like this pattern you can change it.
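The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact code used for COCO: the helper names (`shard_path`, `split_into_shards`, `write_shards`), the `.records` extension handling, and the single `"image"` feature key are assumptions made for illustration, and each image is stored as its raw file bytes.

```python
from typing import List, Sequence


def shard_path(prefix: str, shard: int, num_shards: int) -> str:
    """Build a file name that follows the '*-of-*' pattern."""
    return f"{prefix}_{shard:05d}-of-{num_shards:05d}.records"


def split_into_shards(items: Sequence[str], shard_size: int) -> List[Sequence[str]]:
    """Cut the list of files into consecutive chunks of at most shard_size items."""
    return [items[i:i + shard_size] for i in range(0, len(items), shard_size)]


def write_shards(image_paths: Sequence[str], prefix: str, shard_size: int = 800) -> List[str]:
    """Write one TFRecord file per chunk and return the paths that were written."""
    # Imported here so the naming helpers above stay usable without TensorFlow.
    import tensorflow as tf

    chunks = split_into_shards(image_paths, shard_size)
    written = []
    for shard, chunk in enumerate(chunks):
        path = shard_path(prefix, shard, len(chunks))
        with tf.io.TFRecordWriter(path) as writer:
            for image_path in chunk:
                with open(image_path, "rb") as f:
                    image_bytes = f.read()
                # "image" is a hypothetical feature key; add labels, sizes, etc. as needed.
                feature = {
                    "image": tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[image_bytes])
                    )
                }
                example = tf.train.Example(features=tf.train.Features(feature=feature))
                writer.write(example.SerializeToString())
        written.append(path)
    return written
```

Calling `write_shards(image_paths, "COCO_train")` would then produce one file per 800 images, with the last shard holding whatever is left over.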

This gives us files with names like “COCO_train_00000-of-00147.records”, so to read them our tfrecords_pattern_path will be “COCO_train_*-of-*.records”.
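Reading the shards back could look something like this. It is a sketch under a few assumptions: the function name `load_sharded_dataset`, the buffer size, the `cycle_length`, and the `"image"` feature key are all illustrative choices, so adapt them to whatever you actually wrote into your records.

```python
import tensorflow as tf


def parse_example(serialized):
    # "image" is a hypothetical feature key; use the keys stored in your records.
    features = {"image": tf.io.FixedLenFeature([], tf.string)}
    return tf.io.parse_single_example(serialized, features)


def load_sharded_dataset(tfrecords_pattern_path, shuffle_buffer=1000, cycle_length=4):
    # First level of randomness: the matching shard files come back in shuffled order.
    files = tf.data.Dataset.list_files(tfrecords_pattern_path, shuffle=True)
    # Read several shards at once so their records get mixed together.
    dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=cycle_length)
    # Second level: shuffle the records inside a bounded in-memory buffer.
    dataset = dataset.shuffle(shuffle_buffer)
    return dataset.map(parse_example)
```

For example, `dataset = load_sharded_dataset("COCO_train_*-of-*.records")` returns a dataset of dictionaries with the serialized image bytes under `"image"`.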

This way you first shuffle the order of the files, and then the records read from them are shuffled again. And that’s it! Short and simple. If your dataset is really huge, you may have to add some parallelization when creating the shards. As a reference, processing COCO (163,960 images) takes 7 minutes.
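If you do need to parallelize shard creation, one possible approach is to hand each shard to a worker. The sketch below is only an illustration: `write_shards_in_parallel` and the `write_one_shard` callback are hypothetical names, and it uses threads, which mostly help when reading files is the bottleneck. For CPU-heavy preprocessing a process pool may be a better fit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def write_shards_in_parallel(
    shards: Sequence[Sequence[str]],
    write_one_shard: Callable[[int, int, Sequence[str]], str],
    max_workers: int = 4,
) -> List[str]:
    """Call write_one_shard(shard_index, num_shards, image_paths) once per shard."""
    num_shards = len(shards)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_one_shard, index, num_shards, paths)
            for index, paths in enumerate(shards)
        ]
        # result() re-raises any exception from a worker, so failures are not silent.
        # Results come back in shard order because the futures list is ordered.
        return [future.result() for future in futures]
```

Here `write_one_shard` is whatever function writes a single record file and returns its path; since every shard goes to its own file, the workers do not need to coordinate with each other.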
