Getting full details about fastai curated datasets

Getting the Most Out of Fastai Curated Datasets

Mark Ryan
Published in The Startup
4 min read · Feb 14, 2021

If you want to learn about deep learning you really can’t go wrong with the fastai framework. This framework is at the heart of a set of courses and is now the topic of a book written by the fastai leader Jeremy Howard.

Here are some of the benefits of fastai:

  • Ease of entry — even beginners can create high-performing deep learning models in a few lines of code with fastai.
  • Intelligent defaults throughout — if you want to take the happy path, fastai makes good default choices for many settings, allowing you to get a working model with minimal hassle.
  • Turnkey development environments, including Google Colab and Paperspace Gradient, that let you get started with fastai with almost no setup.

One of the ways that fastai makes it easy to get started is by providing a set of curated datasets. These datasets can be accessed easily without having to worry about details of their structure or location.

It’s worth noting that Keras also provides a set of curated datasets that overlaps with the fastai curated datasets. For example, both platforms include MNIST in their sets of curated datasets.

What sets fastai apart is its number of curated datasets (50+ vs. 7 for Keras) and the variety of applications they cover, including:

  • tabular data
  • recommender systems
  • natural language
  • image data

Benefits of the fastai curated datasets

Let’s review some of the benefits of fastai’s curated datasets by looking at how you would ingest and examine one of these datasets. Here’s how you initially ingest the FLOWERS image dataset:

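from fastai.vision.all import *  # brings untar_data and URLs into scope
path = untar_data(URLs.FLOWERS)  # downloads and extracts the dataset on first use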

You can examine the directory structure of the dataset using the ls() function on the path variable:
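As a rough sketch of that step: in a fastai session, `path.ls()` returns the entries under the dataset root. The helper below does the equivalent with plain pathlib so it works on any directory. The names `list_dataset_root` and `fetch_flowers` are hypothetical helpers introduced here for illustration, not fastai functions.

```python
from pathlib import Path


def list_dataset_root(path):
    """Return the sorted names of the entries directly under `path`.

    Directories get a trailing slash so they stand out from files.
    """
    root = Path(path)
    return sorted(p.name + "/" if p.is_dir() else p.name for p in root.iterdir())


def fetch_flowers():
    """Download (on first call) and return the local FLOWERS path.

    Needs fastai installed and network access, so the import is deferred.
    """
    from fastai.data.external import untar_data, URLs
    return untar_data(URLs.FLOWERS)


# In a notebook with fastai available:
#   path = fetch_flowers()
#   path.ls()                  # fastai's own listing
#   list_dataset_root(path)    # plain-pathlib equivalent
```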

This tells us how the dataset is organized:

  • train.txt — metadata about the training set
  • valid.txt — metadata about the validation set
  • test.txt — metadata about the test set
  • jpg — directory containing the image files

Now that you know the structure of the dataset, you can examine it. For example, you can ingest the metadata files to examine their structure:
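One dependency-free way to do that is sketched below. It assumes each line of a metadata file holds a relative image path followed by an integer class label, separated by whitespace. That layout and the helper name `read_label_file` are assumptions for illustration, so it is worth looking at a few raw lines of the file first.

```python
from pathlib import Path


def read_label_file(txt_path, limit=None):
    """Parse lines of the form '<relative image path> <integer label>'.

    Returns a list of (image_path, label) tuples. Blank lines are skipped,
    and `limit` caps the number of rows returned (None means all rows).
    """
    rows = []
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            if limit is not None and len(rows) >= limit:
                break
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if len(parts) != 2:
                raise ValueError(f"unexpected line format: {line!r}")
            rows.append((parts[0], int(parts[1])))
    return rows


# With the dataset downloaded, a first look at the training metadata could be:
#   read_label_file(Path(path) / "train.txt", limit=5)
```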

You can apply the path variable to convenience functions to display individual images from the dataset:
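A minimal sketch of that idea, assuming the images sit as .jpg files in the jpg directory described above. `nth_image_path` and `load_image` are hypothetical helper names; `PILImage.create` is the fastai call that opens the file, and a notebook renders the returned image when it is the last expression in a cell.

```python
from pathlib import Path


def nth_image_path(root, n=0, subdir="jpg"):
    """Return the n-th .jpg file (sorted by name) under root/subdir."""
    images = sorted((Path(root) / subdir).glob("*.jpg"))
    if not images:
        raise FileNotFoundError(f"no .jpg files under {Path(root) / subdir}")
    return images[n]


def load_image(image_path):
    """Open an image with fastai; needs fastai installed, so import lazily."""
    from fastai.vision.core import PILImage
    return PILImage.create(image_path)


# In a notebook with the dataset downloaded:
#   load_image(nth_image_path(path))
```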

In sum, combining the fastai curated datasets with fastai’s convenience functions gives you a convenient way to ingest and examine a broad variety of datasets with minimal hassle, so you can focus on the downstream steps of your deep learning project, such as training your model.

Getting the full story on fastai curated datasets

Now that we have established some of the benefits of fastai’s curated datasets, how can we find out more about the complete set of curated datasets? There are two obvious options to learn more about these datasets:

The problem is that neither of these resources tells the whole story. The documentation is not complete, and it does not map the datasets to the names you need to use in the code. For example, above we used FLOWERS to identify the dataset in the code, but the dataset documentation doesn’t tell you that this identifier is associated with Oxford 102 Flowers.

How can you get complete details about all the fastai curated datasets? In a notebook, you can use the magic command “??URLs” to see the source for the URLs class:
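Outside IPython, where `??` is not available, the standard library's `inspect.getsource` returns comparable source text. A minimal sketch follows; `source_of` is a hypothetical helper name, and the fastai import is only shown in a comment because it requires fastai to be installed.

```python
import inspect


def source_of(obj):
    """Return the source code of a class, function, or module as a string."""
    return inspect.getsource(obj)


# With fastai installed:
#   from fastai.data.external import URLs
#   print(source_of(URLs))
```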

Examining this source gives you two benefits:

  1. Complete URLs for the datasets. By working through the code, we can determine the complete URL for any dataset. For FLOWERS, for instance, we can work backwards through the source for URLs:

This way we can determine that the complete URL for the FLOWERS curated dataset is:

https://s3.amazonaws.com/fast-ai-imageclas/oxford-102-flowers.tgz
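The backward trace can be reproduced in a few lines. The stand-in class below, `URLsSketch`, is hypothetical and only mimics how the URLs class composes a dataset URL from shared prefixes. The prefix values are an assumption based on the fastai source around the time of writing, so check `??URLs` in your installed version.

```python
class URLsSketch:
    """Stand-in that mimics how fastai's URLs class builds dataset URLs.

    The literal values are assumptions, not copied from an installed fastai.
    """
    S3 = "https://s3.amazonaws.com/fast-ai-"       # shared bucket prefix
    S3_IMAGE = f"{S3}imageclas/"                   # image classification datasets
    FLOWERS = f"{S3_IMAGE}oxford-102-flowers.tgz"  # Oxford 102 Flowers archive


print(URLsSketch.FLOWERS)
# https://s3.amazonaws.com/fast-ai-imageclas/oxford-102-flowers.tgz
```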

  2. Complete set of curated datasets:

  • The documentation lists 26 curated datasets.
  • The source for the URLs class lists a total of over 50 curated datasets, including all the variations of datasets such as MNIST and IMAGENETTE.
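Rather than counting entries by eye, you could enumerate them programmatically. The sketch below uses a heuristic that is an assumption on my part: it keeps class attributes whose value is an http(s) string ending in a common archive suffix, which filters out base prefixes. `dataset_urls` is a hypothetical helper name, and the result may also include pretrained model archives defined on the same class, so treat the count as an upper bound on datasets.

```python
ARCHIVE_SUFFIXES = (".tgz", ".tar.gz", ".zip")  # assumed archive extensions


def dataset_urls(urls_class):
    """Map attribute names to archive URLs for a URLs-style class.

    Keeps only string attributes that look like a downloadable archive.
    """
    found = {}
    for name, value in vars(urls_class).items():
        if name.startswith("_") or not isinstance(value, str):
            continue  # skip dunder attributes, methods, and non-string values
        if value.startswith(("http://", "https://")) and value.endswith(ARCHIVE_SUFFIXES):
            found[name] = value
    return found


# With fastai installed:
#   from fastai.data.external import URLs
#   urls = dataset_urls(URLs)
#   print(len(urls))
#   print(urls["FLOWERS"])
```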

Conclusions

The curated datasets are a big benefit of fastai. They make it easy to get at a wide variety of datasets that cover all the application types supported by fastai. Particularly if you are a beginner, the curated datasets help you unleash fastai’s promise of a fully-fledged deep learning model with a handful of lines of code. By examining the source of the URLs class you can get full details about these curated datasets.

Video on this topic: https://youtu.be/z6sL_qkcaSg

Video on fastai vs. Keras: https://youtu.be/3d6rGGyPR5c

Technical writing manager at Google. Opinions expressed are my own.