Exploring Fast.ai’s DataBlock API

Alexander Rofail
5 min read · Sep 4, 2021


“In this day and age, nothing costs more than information” — Gigolo Joe from A.I. Artificial Intelligence (2001)

We all know data is the new oil, and that oil powers the engine of the new digital world: ML/DL models. Many of the non-technical AI “thought leaders” (whatever that means) pontificate on how important data is to optimal AI engineering. You basically can’t turn anywhere without someone talking about data in some capacity. I feel we’ve turned data into too much of an abstract mythical juice that’s easy to talk about to make yourself sound intelligent. Frankly, I just get frustrated at this point when I hear someone talking about “data” without going into specifics: what the objects are, the classes of those objects (if we’re talking vision), where the data is stored, how it’s stored (format), what answers can be extrapolated from it, or what better questions can be asked of it.

This post doesn’t cover everything on that list, but I do want to explore more granularly how we actually work with data in AI development. We’ll use Fast.ai, which has become by far my favorite library over the last 18 months, along with PyTorch.

You cannot throw a bunch of data at someone and say “make the computer do some AI” in the same way you cannot throw a bunch of books and notes at a human and say “learn the thing”. The data you are going to use to “do some AI” can be either structured or unstructured (topic for another time), but when we do supervised learning, we need to present our data to the model in a certain structured way. This is why we need the DataBlock, Datasets, and DataLoaders. In other words, we can’t just grab some files or query a DB and shove the raw results into a model.

For this post we will be using one of the many datasets available for download through the Fast.ai library. Our dataset is called “PETS” and contains .jpg images of cats and dogs (not NFTs). PETS spans two classes, cats and dogs, broken into 37 breed categories with roughly 200 images per breed.

So we have all this data; now we need to package it properly so that our model can understand what it’s looking at and why. I think of the DataBlock as a suitcase: suitcases are optimal when we intentionally organize their contents. First we download our PETS data and then initialize an empty DataBlock (our suitcase).

EDIT: I forgot to add this incredibly useful “questionnaire” that the fast.ai team introduces us to for deciding how to construct a datablock:

  • what are the types of our inputs and targets?
  • where is the data?
  • how do we know if a sample is in the training or the validation set?
  • how do we get an image?
  • how do we know the label of an image?
  • do we want to apply a function to a given sample?
  • do we want to apply a function to a batch after it’s created?

In the above code snippet we used the get_image_files function to pull the images from all the subfolders. Now we can start thinking about how to actually pack our suitcase. We need to provide four things: the types of our inputs and labels, a get_items function, a splitter function, and a labeling function. The first step in packing is to create a dataset using fast.ai’s Datasets, which will return a tuple of input and target.

In the first cell above we have not established an input and a target, so the tuple just returns the same thing in both positions. So how do we establish an input and a target? With a label function.
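In the PETS dataset, cat images follow a filename convention: cat breeds start with an uppercase letter while dog breeds are lowercase. A label function can exploit that (the function name `label_func` is my own):

```python
from pathlib import Path

def label_func(fname):
    # PETS filename convention: cat breeds are capitalized, dog breeds are lowercase
    return "cat" if Path(fname).name[0].isupper() else "dog"

label_func("Abyssinian_1.jpg")  # -> 'cat'
label_func("beagle_10.jpg")     # -> 'dog'
```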

Now that we have a label function, we can pass it to the get_y parameter of the DataBlock. Now the tuple will return the .jpg file and a label of either ‘cat’ or ‘dog’.

We can get even more granular and tell the DataBlock the types of our inputs and targets. In this case, in the blocks parameter of the DataBlock API we will pass an ImageBlock and a CategoryBlock to declare that our inputs are images and our targets are categories.

The DataBlock API is also extremely flexible. In our case we only have one type of input and one type of target, but we can handle multiple input and target types by passing an n_inp parameter to the DataBlock and also passing get_x or get_y functions to explain how to process each type.

There are three steps left within the scope of this post: splitting training and validation sets, establishing some transformations to be applied to the images, and finally establishing a DataLoaders object that can be used for training.

Passing in Fast.ai’s Random Splitter

Fast.ai has a nifty RandomSplitter() that we can use. The RandomSplitter() essentially takes a valid_pct parameter between 0 and 1 and randomly splits off a validation set based on the percentage you decide.

The transformation step is a key component before establishing a DataLoaders object and beginning to train. We want all our images to be the same size, so we pass in an item_tfms parameter that will resize all our images to a given size (224×224 pixels in this case).

Now we can finally establish a DataLoaders object that can be used for training a model. And we can even call show_batch() to get a little preview of our images and labels.

This was the most basic walk-through of fast.ai’s DataBlock API in 31 cells of code. There is a ton of flexibility available depending on the data you are working with, but it’s nice to start with the most basic workable example to get a feel for the foundations. In the end, our neatly packed and organized suitcase is ready to travel through a model in the form of a DataLoaders.


Alexander Rofail

Things I like: Startups, AI, Bow Hunting, Powerlifting, Martial Arts.