Introducing TensorFlow Datasets

TensorFlow
Feb 26, 2019

Public datasets fuel the machine learning research rocket (h/t Andrew Ng), but it’s still too difficult to simply get those datasets into your machine learning pipeline. Every researcher goes through the pain of writing one-off scripts to download and prepare each dataset they work with, and each dataset has its own source format and complexities. Not anymore.

Today, we’re pleased to introduce TensorFlow Datasets (GitHub), which exposes public research datasets as tf.data.Datasets and as NumPy arrays. It does all the grungy work of fetching the source data and preparing it into a common format on disk, and it uses the tf.data API to build high-performance input pipelines, which are TensorFlow 2.0-ready and can be used with tf.keras models. We’re launching with 29 popular research datasets such as MNIST, Street View House Numbers, the 1 Billion Word Language Model Benchmark, and the Large Movie Reviews Dataset, and will add more in the months to come; we hope that you join in and add a dataset yourself.

tl;dr

Try tfds out in a Colab notebook.

tfds.load and DatasetBuilder

Every dataset is exposed as a DatasetBuilder, which knows:

  • Where to download the data from and how to extract it and write it to a standard format (DatasetBuilder.download_and_prepare).
  • How to load it from disk (DatasetBuilder.as_dataset).
  • And all the information about the dataset, like the names, types, and shapes of all the features, the number of records in each split, the source URLs, citation for the dataset or associated paper, etc. (DatasetBuilder.info).

You can directly instantiate any of the DatasetBuilders or fetch them by string with tfds.builder.
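As a minimal sketch of the builder workflow, the helper below fetches a builder by name, prepares the data, and loads the splits. It assumes the tensorflow-datasets package is installed and that "mnist" is a registered dataset with "train" and "test" splits; the helper name load_mnist_splits and its data_dir parameter are illustrative and not part of TFDS.

```python
def load_mnist_splits(data_dir=None):
    """Fetch the MNIST builder by name, prepare the data, and load both splits.

    Hypothetical helper for illustration; not part of TFDS.
    """
    # Imported inside the function so that defining the helper does not
    # require TensorFlow Datasets to be installed.
    import tensorflow_datasets as tfds

    # Fetch the DatasetBuilder by its string name.
    builder = tfds.builder("mnist", data_dir=data_dir)

    # Download the source data and write it to disk in a standard format.
    builder.download_and_prepare()

    # With no split argument, as_dataset() returns a dict of tf.data.Datasets
    # keyed by split name.
    datasets = builder.as_dataset()
    return builder.info, datasets["train"], datasets["test"]
```

Calling load_mnist_splits() downloads MNIST on first use, so expect network traffic and some disk usage the first time it runs.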

as_dataset() accepts a batch_size argument which will give you batches of examples instead of one example at a time. For small datasets that fit in memory, you can pass batch_size=-1 to get the entire dataset at once as a tf.Tensor. All tf.data.Datasets can easily be converted to iterables of NumPy arrays using tfds.as_numpy().
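The batch_size and tfds.as_numpy() behavior described above could be exercised roughly as follows. This is a sketch under the same assumptions (tensorflow-datasets installed, "mnist" available, and small enough to fit in memory); mnist_train_as_numpy is a made-up helper name.

```python
def mnist_train_as_numpy(batch_size=-1):
    """Return the MNIST training split converted to NumPy.

    Hypothetical helper for illustration; not part of TFDS.
    """
    # Imported lazily so the helper can be defined without TFDS installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder("mnist")
    builder.download_and_prepare()

    # batch_size=-1 requests the whole split at once as tf.Tensors;
    # a positive value yields a tf.data.Dataset of batches instead.
    train = builder.as_dataset(split="train", batch_size=batch_size)

    # tfds.as_numpy() converts tensors to NumPy arrays, and datasets to
    # iterables of NumPy arrays.
    return tfds.as_numpy(train)
```

With the default batch_size=-1 the result is a single structure of arrays for the full split; with a positive batch size it is an iterable to loop over.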

As a convenience, you can do all the above with tfds.load, which fetches the DatasetBuilder by name, calls download_and_prepare(), and calls as_dataset().

You can also easily get the DatasetInfo object from tfds.load by passing with_info=True. See the API documentation for all the options.
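A short sketch of the tfds.load convenience path, again assuming tensorflow-datasets is installed and "mnist" is available. The helper name load_mnist_train_with_info is illustrative only.

```python
def load_mnist_train_with_info():
    """Load the MNIST training split together with its DatasetInfo.

    Hypothetical helper for illustration; not part of TFDS.
    """
    # Imported lazily so the helper can be defined without TFDS installed.
    import tensorflow_datasets as tfds

    # tfds.load fetches the builder by name, calls download_and_prepare(),
    # and calls as_dataset(). With with_info=True it returns a
    # (dataset, DatasetInfo) tuple instead of just the dataset.
    train_ds, info = tfds.load("mnist", split="train", with_info=True)
    return train_ds, info
```

The returned info object carries the metadata listed earlier; for example, info.features describes the feature names, types, and shapes, and info.splits["train"].num_examples gives the number of records in the training split.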

Dataset Versioning

Note that while we do guarantee the data values and splits are identical given the same version, we do not currently guarantee the ordering of records for the same version.

Dataset Configuration

See the section on dataset configuration in our documentation on adding a dataset.

Text Datasets and Vocabularies

TensorFlow Datasets includes three kinds of text encoders:

  • ByteTextEncoder for byte/character-level encodings
  • TokenTextEncoder for word-level encodings based on a vocabulary file
  • SubwordTextEncoder for subword-level encodings (and the ability to construct the subword vocabulary tuned to a particular text corpus) with a byte-level fallback so that it’s fully invertible. For example, “hello world” could get split into [“he”, “llo”, “ “, “wor”, “ld”] and then integer-encoded. Subwords are a happy medium between word-level and byte-level encodings and are popular in some natural language research projects.
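To make the idea of a subword encoding with a byte-level fallback concrete, here is a toy pure-Python sketch. It is not the SubwordTextEncoder implementation: the class ToySubwordEncoder, its greedy longest-match strategy, and its ID layout (0 reserved for padding, then subwords, then 256 byte values) are all assumptions chosen for illustration.

```python
class ToySubwordEncoder:
    """Greedy longest-match subword encoder with a byte-level fallback.

    Illustrative only; not the TFDS SubwordTextEncoder.
    ID layout: 0 = padding, 1..N = subwords, N+1..N+256 = raw bytes.
    """

    def __init__(self, subwords):
        self.subwords = list(subwords)
        self._ids = {s: i + 1 for i, s in enumerate(self.subwords)}
        self._max_len = max((len(s) for s in self.subwords), default=0)
        self._byte_offset = len(self.subwords) + 1

    @property
    def vocab_size(self):
        # Padding ID + subwords + one ID per possible byte value.
        return 1 + len(self.subwords) + 256

    def encode(self, text):
        ids = []
        pos = 0
        while pos < len(text):
            # Try the longest candidate first, then shorter ones.
            for length in range(min(self._max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + length]
                if piece in self._ids:
                    ids.append(self._ids[piece])
                    pos += length
                    break
            else:
                # No subword matched: fall back to the UTF-8 bytes of one
                # character, which is what keeps the encoding invertible.
                ids.extend(self._byte_offset + b for b in text[pos].encode("utf-8"))
                pos += 1
        return ids

    def decode(self, ids):
        raw = bytearray()
        for i in ids:
            if i == 0:
                continue  # skip padding
            if i >= self._byte_offset:
                raw.append(i - self._byte_offset)
            else:
                raw.extend(self.subwords[i - 1].encode("utf-8"))
        return raw.decode("utf-8")
```

With the vocabulary ["he", "llo", " ", "wor", "ld"], encoding "hello world" yields [1, 2, 3, 4, 5], and text containing characters outside the vocabulary still round-trips through encode and decode because of the byte fallback.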

The encoders, along with their vocabulary sizes, can be accessed through DatasetInfo.
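A sketch of that access path follows. It rests on several assumptions that may not hold for every release: that a text dataset config named "imdb_reviews/subwords8k" is available, that its "text" feature exposes an encoder attribute, and that the encoder has a vocab_size property. Treat the names as illustrative of the release described in this post and check the API documentation for your installed version.

```python
def subword_vocab_size(name="imdb_reviews/subwords8k"):
    """Look up the text encoder of a dataset through its DatasetInfo.

    Hypothetical helper; the default dataset/config name is an assumption
    and may not exist in every TFDS release.
    """
    # Imported lazily so the helper can be defined without TFDS installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder(name)

    # DatasetInfo describes every feature; for an encoded text feature,
    # the encoder is assumed to be attached to that feature.
    encoder = builder.info.features["text"].encoder
    return encoder.vocab_size
```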

Both TensorFlow and TensorFlow Datasets will be working to improve text support even further in the future.

Getting started

We expect to be adding datasets in the coming months, and we hope that the community will join in. Open a GitHub Issue to request a dataset, vote on which datasets should be added next, discuss implementation, or ask for help. Pull requests are very welcome! Add a popular dataset to contribute to the community, or if you have your own data, contribute it to TFDS to make your data famous!

Now that data is easy, happy modeling!

Acknowledgements

We’d like to thank Stefan Webb of Oxford for allowing us to use the tensorflow-datasets PyPI name. Thanks Stefan!

We’d also like to thank Lukasz Kaiser and the Tensor2Tensor project for inspiring and guiding tensorflow/datasets. Thanks Lukasz! T2T will be migrating to tensorflow/datasets soon.
