Using the fastai Data Block API

Tom McKenzie
Nov 13, 2018


Updated for release v1.0.24

The new fastai library has made the training of neural networks so seamless that often the most challenging part of the project will be feeding in the data in the right format to be processed. That’s where the new “data block API” comes in. Below I’m going to step through the API to understand what it’s doing and how to use it to quickly get your deep learning projects up and running!

Datasets can be messy. For example, not every set of images will be kept in an immaculate ImageNet-like folder structure (where every class has its own sub-folder). The data block API is designed to be extremely flexible, and can do a lot of heavy lifting for you, saving you from Stack Overflow-ing Linux commands in order to shuffle files around your hard drive like a street magician doing a card trick.

What are we actually trying to create?

In fastai the data-containing object that we need to feed to a neural network is called a DataBunch. This is called a ‘bunch’ because it bunches together several PyTorch classes into one. In PyTorch there are two primary data objects: the Dataset (which contains all of the data items together with their associated label(s)) and the DataLoader (which serves chunks of the items in the Dataset to the model in ‘batches’). For a typical supervised learning problem we will want a ‘training set’ and a ‘validation set’, with a separate Dataset and DataLoader for each, as well as an optional ‘test set’, which we will ignore here for simplicity. All of these are bundled up into the fastai DataBunch!
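The PyTorch pieces being bunched together can be sketched with a toy example (the class and variable names here are my own, for illustration only):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """A minimal PyTorch Dataset: data items paired with labels."""
    def __init__(self):
        self.x = torch.arange(10, dtype=torch.float32).unsqueeze(1)
        self.y = (self.x > 4).float()  # toy binary labels

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

ds = ToyDataset()
dl = DataLoader(ds, batch_size=4)  # serves the items in batches of 4
xb, yb = next(iter(dl))            # xb has shape (4, 1): one batch
```

A DataBunch wraps one such Dataset/DataLoader pair for training and another for validation.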

Because a series of operations are usually performed when preparing data for machine learning, a “pipeline” style is used in the data block API. Methods are chained together consecutively, performing (usually) one specific operation each. Therefore, the order in which we call these methods is important. We can use the data block API to handle various parts of preparing the data by answering the following questions:

  • Where are the items and how do we create them?
  • How do we split them into training and validation sets?
  • How do we label them?
  • What transformations or re-sizing do we apply?
  • What type of problem is this (e.g. multi-label, segmentation, etc.)? (inferred automatically)
  • How do we bundle all of these into a DataBunch?

The data block API helps us turn our data items into a DataBunch that is ready to use in fastai models.

Usage example: RSNA Pneumonia dataset

Let’s look at the ‘RSNA Pneumonia’ dataset of chest x-ray images, which can be downloaded from Kaggle. From the competition page:

“In this challenge competitors are predicting whether pneumonia exists in a given image. They do so by predicting bounding boxes around areas of the lung. Samples without bounding boxes are negative and contain no definitive evidence of pneumonia. Samples with bounding boxes indicate evidence of pneumonia.”

Although this is an object-detection challenge, for simplicity I’ll convert it to a classification problem — i.e. is there evidence of pneumonia or not. (Although it may be argued that with the data block API, preparing the data for object detection might be equally simple!)

After downloading the image files and placing them in a folder called train_images, they look like this:

train_images/
├── 0000a175-0e68-4ca4-b1af-167204a7e0bc.dcm
├── 0005d3cc-3c3f-40b9-93c3-46231c3eb813.dcm
├── 000686d7-f4fc-448d-97a0-44fa9c5d3aa6.dcm

Opening up the accompanying csv containing bounding box locations and labels, we see the following structure:

Annotation file for the RSNA Pneumonia dataset.

It seems that the image files are named with the patientId, while the presence or absence of pneumonia is denoted in the Target field by a 1 or 0.

Note: Because the images are .dcm files we’ll need to write a custom function to open them using the pydicom package.
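One way to do this is to subclass ImageItemList and override its open method. This is a hedged sketch rather than the original code from the post; the pixel scaling and channel handling are my assumptions, and it needs the dataset on disk to actually run:

```python
import numpy as np
import pydicom
import torch
from fastai.vision import Image, ImageItemList

class DicomItemList(ImageItemList):
    "An ImageItemList that knows how to open .dcm files (sketch)."
    def open(self, fn):
        # Read the DICOM file and pull out its raw pixel array
        px = pydicom.dcmread(str(fn)).pixel_array.astype(np.float32)
        px = px / px.max()  # scale to [0, 1] (assumption: no windowing needed)
        # The x-rays are greyscale, so repeat the channel 3 times
        # to match what an ImageNet-pretrained model expects
        t = torch.from_numpy(px).unsqueeze(0).repeat(3, 1, 1)
        return Image(t)
```

With this in place, DicomItemList can be used anywhere ImageItemList appears below.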

Now we can start using the data block API to create our DataBunch. First, we need to tell fastai where to find the images. Because we have a csv file containing the file names, we can use the from_csv() method to collect all of the image files. In the csv file the extension (.dcm) needs to be appended to each patientId, which we can pass as the suffix argument. The file locations for each ‘item’ are stored in an ImageItemList object (or, more generally, the ItemList base class).
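A sketch of this first step (the csv filename is my assumption based on the Kaggle download; the path matches the output shown below):

```python
from fastai.vision import ImageItemList

# Collect the image files listed in the labelling csv,
# appending the .dcm extension to each patientId
il = ImageItemList.from_csv('/home/data/rsna-pneumonia/original',
                            'stage_1_train_labels.csv', suffix='.dcm')
```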

This is just a collection of paths to the images:

ImageItemList (30227 items)
['/home/data/rsna-pneumonia/original/0004cfab-14fd-4e49-80ba-63a80b6bddd6.dcm'
'/home/data/rsna-pneumonia/original/00313ee0-9eaa-42f4-b0ab-c148ed3241cd.dcm'
'/home/data/rsna-pneumonia/original/00322d4d-1c29-4943-afc9-b6754be640eb.dcm' … ]

We can now split these into training and validation sets:
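In v1.0.24 a random split is done with random_split_by_pct (a sketch, continuing from the ImageItemList above):

```python
# Hold out 20% of the items at random for validation
ils = il.random_split_by_pct(0.2)
```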

We can use several other approaches here if using data that we don’t want to split randomly, or if the dataset is already structured into training and validation folders. The default split in the call above is 20% for validation; just pass a float here to change the splitting percentage (e.g. 0.15 for 15%). Splitting the ImageItemList returns us an ItemLists object.

We now need to tell fastai how it should label these images. Using a Pandas dataframe obtained from the labelling csv (shown above) we can use the method label_from_df() to label the items. We’ll also tell it to label the images using the ‘Target’ column:
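Labelling from the ‘Target’ column might look like this (a sketch; the csv's dataframe is attached to the item list when it is created with from_csv):

```python
# Label each item using the 'Target' column: 1 = pneumonia, 0 = not
lls = ils.label_from_df(cols='Target')
```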

This now returns us a LabelLists object. Note that for a multi-label problem here we can pass a column containing multiple class labels and just specify how the labels are separated in the column using the sep argument (the default for sep is None). There are lots of other methods that can be called here if you don’t have a labelling csv or if your files are in a different structure. These include options like giving every item the same label (label_from_const), using the name of the folder the item is in as its label (label_from_folder), extracting part of the filename using a regex to get the label (label_from_re), or defining a custom function to extract the labels (label_from_func).

Next, we need to apply any data transformations, i.e. data augmentation techniques to help avoid overfitting during training. For “top-down” style images (e.g. satellite images, x-rays) we can use the fastai get_transforms() function, while tweaking the default settings a bit:
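For example (the specific settings here are illustrative assumptions, not the post's exact values):

```python
from fastai.vision import get_transforms

# Tweak the defaults: allow vertical flips for 'top-down' images,
# keep rotation and lighting changes small, disable warping
tfms = get_transforms(do_flip=True, flip_vert=True, max_rotate=5.,
                      max_lighting=0.1, max_warp=0.)
```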

Then we apply these transformation settings to our LabelLists object, while also passing a size parameter to re-size the images:
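Something like (the target size of 224 is an assumption):

```python
# Apply the augmentations and re-size every image to 224x224
lls = lls.transform(tfms, size=224)
```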

As we can see, this operation doesn’t change the object’s type. Now we are ready for the final step: creating the DataBunch! The type of problem you are trying to solve (e.g. multi-label classification, object detection, image segmentation, etc.) will be automatically inferred from the data you have fed it so far when creating the DataBunch.
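Creating it is a single call (the batch size is an assumption):

```python
# Bundle the training and validation sets into a DataBunch
data = lls.databunch(bs=64)
```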

Awesome! We now have our DataBunch (in our case an ImageDataBunch), which can be passed directly as the data object to our fastai models during training. We can check that the DataBunch has been created correctly by looking at the first item in the training dataset:
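One way to peek at the first training item (a sketch):

```python
x, y = data.train_ds[0]
print(x.shape, y)  # image tensor dimensions and the associated label
```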

This shows the image dimensions together with the associated label. We can even look at some of the images using the show_batch() function:
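For example (rows and figure size are assumptions):

```python
# Display a grid of training images with their labels
data.show_batch(rows=3, figsize=(8, 8))
```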

Output from the above data.show_batch() call.

All together now

Of course, we can chain all of these calls together into a single code block. Note here that the outer parentheses are only to allow multi-line Python code without requiring \ line breaks.
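The whole pipeline in one block might look like this (the csv filename and the parameters are assumptions, as above):

```python
from fastai.vision import ImageItemList, get_transforms

data = (ImageItemList.from_csv('/home/data/rsna-pneumonia/original',
                               'stage_1_train_labels.csv', suffix='.dcm')
        .random_split_by_pct(0.2)          # 20% validation split
        .label_from_df(cols='Target')      # labels from the csv column
        .transform(get_transforms(), size=224)
        .databunch(bs=64))                 # bundle into a DataBunch
```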

Pretty nice!

Summary

We can see that the data block API passes objects along the chain, applying certain operations at each stage. The chain of objects created can be summarised as:

ItemList ➡️ ItemLists ➡️ LabelLists ➡️ DataBunch

The example shown above is just one way that each of these steps can be performed; the flexibility of the data block API comes from the wide range of functions each of the objects can call depending on how your data is structured, labelled, or named! 💥

Disclaimer: fastai is moving quickly! This was current for v1.0.24. I’ll try to update this post with any updates to the library or to my understanding 😜


Tom McKenzie

Former synthetic chemist now enjoying design-focused data science.