Introducing TensorFlow Datasets
Public datasets fuel the machine learning research rocket (h/t Andrew Ng), but it’s still too difficult to simply get those datasets into your machine learning pipeline. Every researcher goes through the pain of writing one-off scripts to download and prepare every dataset they work with, which all have different source formats and complexities. Not anymore.
Today, we’re pleased to introduce TensorFlow Datasets (GitHub), which exposes public research datasets as tf.data.Datasets and as NumPy arrays. It does all the grungy work of fetching the source data and preparing it into a common format on disk, and it uses the tf.data API to build high-performance input pipelines, which are TensorFlow 2.0-ready and can be used with tf.keras models. We’re launching with 29 popular research datasets such as MNIST, Street View House Numbers, the 1 Billion Word Language Model Benchmark, and the Large Movie Reviews Dataset, and will add more in the months to come; we hope that you join in and add a dataset yourself.
Try tfds out in a Colab notebook.
Every dataset is exposed as a DatasetBuilder, which knows:
- Where to download the data from and how to extract it and write it to a standard format (DatasetBuilder.download_and_prepare).
- How to load it from disk (DatasetBuilder.as_dataset).
- And all the information about the dataset, like the names, types, and shapes of all the features, the number of records in each split, the source URLs, citation for the dataset or associated paper, etc. (DatasetBuilder.info).
You can directly instantiate any of the DatasetBuilders or fetch them by string with tfds.builder.
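The builder workflow described above can be sketched roughly as follows. This is a minimal illustration rather than the official example: the helper name prepare_and_load and the default dataset name "mnist" are choices made here, and the import is deferred into the function so that merely defining it needs neither the package nor network access.

```python
def prepare_and_load(name="mnist", split="train"):
    """Fetch a DatasetBuilder by name, prepare its data, and load one split.

    Hypothetical helper for illustration; requires the tensorflow-datasets
    package and network access the first time a dataset is prepared.
    """
    # Deferred import so that defining this function has no side effects.
    import tensorflow_datasets as tfds

    # Look the builder up by its string name.
    builder = tfds.builder(name)

    # Download the source data, extract it, and write it to disk
    # in the common on-disk format.
    builder.download_and_prepare()

    # Read the prepared data back from disk as a tf.data.Dataset.
    dataset = builder.as_dataset(split=split)

    # builder.info holds the metadata: features, splits, citation, etc.
    return builder.info, dataset
```

Calling prepare_and_load("mnist") would return the dataset's metadata together with the training split; the features and split sizes can then be inspected through the returned info object.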
as_dataset() accepts a batch_size argument which will give you batches of examples instead of one example at a time. For small datasets that fit in memory, you can pass batch_size=-1 to get the entire dataset at once as a tf.Tensor. All tf.data.Datasets can easily be converted to iterables of NumPy arrays using tfds.as_numpy().
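A small sketch of batching and NumPy conversion, under the same assumptions as before: the helper name load_as_numpy and its defaults are illustrative, and running it requires the tensorflow-datasets package plus a one-time download.

```python
def load_as_numpy(name="mnist", split="train", batch_size=32):
    """Load one split in batches and expose it as NumPy data.

    Hypothetical helper for illustration only.
    """
    # Deferred import so that defining this function has no side effects.
    import tensorflow_datasets as tfds

    builder = tfds.builder(name)
    builder.download_and_prepare()

    # batch_size=N yields batches of N examples;
    # batch_size=-1 returns the whole split at once as tensors.
    dataset = builder.as_dataset(split=split, batch_size=batch_size)

    # Convert the TensorFlow objects into NumPy arrays
    # (an iterable of batches, or a single structure for batch_size=-1).
    return tfds.as_numpy(dataset)
```

Passing batch_size=-1 is only sensible for datasets that fit comfortably in memory.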
As a convenience, you can do all the above with tfds.load, which fetches the DatasetBuilder by name, calls download_and_prepare(), and calls as_dataset().
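For comparison with the step-by-step builder calls, here is a hedged sketch of the one-call form. The wrapper name load_with_info is hypothetical; with_info=True asks tfds.load to also return the dataset's metadata.

```python
def load_with_info(name="mnist", split="train"):
    """Load a split and its metadata in a single tfds.load call.

    Hypothetical wrapper for illustration; requires tensorflow-datasets
    and network access the first time the dataset is prepared.
    """
    # Deferred import so that defining this function has no side effects.
    import tensorflow_datasets as tfds

    # tfds.load looks up the builder, prepares the data if needed,
    # and returns the requested split as a tf.data.Dataset.
    dataset, info = tfds.load(name, split=split, with_info=True)
    return dataset, info
```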
Every dataset is versioned (builder.info.version) so that you can rest assured that the data doesn’t change underneath you and that results are reproducible. For now, we guarantee that if the data changes, the version will be incremented.
Note that while we do guarantee the data values and splits are identical given the same version, we do not currently guarantee the ordering of records for the same version.
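One way to lean on this guarantee is to pin the version an experiment was run against and fail loudly if it changes. The sketch below is an assumption about how you might do that, not a TFDS feature: require_version is a hypothetical helper, and a stand-in object replaces a real builder so the example runs without any downloads.

```python
from types import SimpleNamespace


def require_version(builder, expected):
    """Raise if the builder's dataset version differs from the pinned one."""
    # builder.info.version is compared by its string form, e.g. "1.0.0".
    actual = str(builder.info.version)
    if actual != expected:
        raise ValueError(
            f"Dataset version changed: expected {expected}, found {actual}"
        )
    return actual


# Stand-in for a real DatasetBuilder, used only to demonstrate the check.
fake_builder = SimpleNamespace(info=SimpleNamespace(version="1.0.0"))
require_version(fake_builder, "1.0.0")  # passes silently
```

Because record ordering is not guaranteed within a version, a check like this protects the data values and splits, not the order in which examples arrive.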
Datasets with different variants are configured with named BuilderConfigs. For example, the Large Movie Review Dataset (tfds.text.IMDBReviews) could have different encodings for the input text (for example, plain text, or a character encoding, or a subword encoding). The built-in configurations are listed with the dataset documentation and can be addressed by string, or you can pass in your own configuration.
See the section on dataset configuration in our documentation on adding a dataset.
Text Datasets and Vocabularies
Text datasets can often be painful to work with because of different encodings and vocabulary files. tensorflow-datasets makes it much easier. It’s shipping with many text tasks and includes three kinds of TextEncoders, all of which support Unicode:
- ByteTextEncoder for byte/character-level encodings
- TokenTextEncoder for word-level encodings based on a vocabulary file
- SubwordTextEncoder for subword-level encodings (and the ability to construct the subword vocabulary tuned to a particular text corpus) with a byte-level fallback so that it’s fully invertible. For example, “hello world” could get split into [“he”, “llo”, “ ”, “wor”, “ld”] and then integer-encoded. Subwords are a happy medium between word-level and byte-level encodings and are popular in some natural language research projects.
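To make the idea of subwords with a byte-level fallback concrete, here is a toy sketch in plain Python. It is not the SubwordTextEncoder implementation: the class name TinySubwordEncoder, the greedy longest-match rule, and the id layout (subword ids first, then 256 byte ids) are all assumptions chosen to keep the example short.

```python
class TinySubwordEncoder:
    """Toy subword encoder with a byte-level fallback (illustration only)."""

    def __init__(self, subwords):
        self.subwords = list(subwords)
        self._ids = {s: i for i, s in enumerate(self.subwords)}
        self._max_len = max((len(s) for s in self.subwords), default=0)

    @property
    def vocab_size(self):
        # One id per subword, plus one id per possible byte value.
        return len(self.subwords) + 256

    def encode(self, text):
        ids, i = [], 0
        while i < len(text):
            # Try the longest subword that matches at position i.
            for length in range(min(self._max_len, len(text) - i), 0, -1):
                piece = text[i:i + length]
                if piece in self._ids:
                    ids.append(self._ids[piece])
                    i += length
                    break
            else:
                # No subword matched: fall back to the UTF-8 bytes
                # of the current character.
                ids.extend(len(self.subwords) + b for b in text[i].encode("utf-8"))
                i += 1
        return ids

    def decode(self, ids):
        out = bytearray()
        for idx in ids:
            if idx < len(self.subwords):
                out.extend(self.subwords[idx].encode("utf-8"))
            else:
                out.append(idx - len(self.subwords))
        return out.decode("utf-8")


encoder = TinySubwordEncoder(["he", "llo", " ", "wor", "ld"])
ids = encoder.encode("hello world")      # five subword ids
assert encoder.decode(ids) == "hello world"
```

Because any character missing from the subword list is emitted as raw bytes, every Unicode string can be encoded and decoded back exactly, which is what makes the scheme invertible.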
The encoders, along with their vocabulary sizes, can be accessed through DatasetInfo.
Both TensorFlow and TensorFlow Datasets will be working to improve text support even further in the future.
Our documentation site is the best place to start using tensorflow-datasets.
We expect to be adding datasets in the coming months, and we hope that the community will join in. Open a GitHub Issue to request a dataset, vote on which datasets should be added next, discuss implementation, or ask for help. And Pull Requests are very welcome! Add a popular dataset to contribute to the community, or if you have your own data, contribute it to TFDS to make your data famous!
Now that data is easy, happy modeling!
TensorFlow Datasets was a team effort. Our core developers are Etienne Pot,
Afroz Mohiuddin, Pierre Ruyssen, Marcin Michalski, and Ryan Sepassi. We’d
also like to thank Jiri Simsa for his help with tf.data, and Martin Wicke
for his support of the project. Thanks all!
We’d like to thank Stefan Webb of Oxford for allowing us to use the
tensorflow-datasets PyPI name. Thanks Stefan!
We’d also like to thank Lukasz Kaiser and the Tensor2Tensor project for inspiring and guiding tensorflow/datasets. Thanks Lukasz! T2T will be migrating to tensorflow/datasets soon.