FastAI V2 DataBlock API code overview: A gentle introduction

Aman Arora
Nov 2 · 7 min read

One of the top things Jeremy says to do:

Please run the code, really run the code. Don’t go deep on theory. Play with the code, see what goes in and what comes out.

I’ll admit — I am not one of the best coders out there, in fact, it really takes me a lot of time to understand what is really going on. I first heard about FastAI a year ago around Nov 2018, and since then I have been an ardent follower of the course. I spent a lot of time watching the lecture videos and pretty much all my learnings have come from being associated with FastAI in some form — whether it’s the forums or the code, there’s REALLY plenty to learn!

In the coming few weeks or months, starting today, I will be going through the code in FastAI V2 (this series builds on top of Jeremy’s code walkthrus) — if I have already struggled and spent a lot of time learning something, the way I look at it, perhaps, you shouldn’t spend the same amount of time, right? It just makes sense to document pretty much everything I’ve learnt, and hopefully, just hopefully, I’ll really be able to save your time.

I’d also like to say thank you to Jeremy for the things I’ve learnt from him: his patience in responding to forum posts, Kaggle tutorials (including medical imaging), Twitter account updates and of course FastAI. While I have never met him, I consider him to be my mentor and follow his advice diligently. You should try it too.

With that said, let’s begin. In this article, we will take a deep dive into the DataBlock object in lesson 1-pets and look at the DataBlock API as a whole.

Please note that this is a code walk-thru rather than a tutorial on how to use the DataBlock API. For that, please follow the FastAI course.


DataBlock API Overview

DataBlock API : high level overview

Essentially, the DataBlock API always returns a DataBunch object. Our end goal is to start with a DataBlock object and end up with a DataBunch object which, as the documentation says, is a basic wrapper around several DataLoaders.

This is done in five steps as shown in the overview above:

  1. Build the DataBlock — you pass in path, items, item_tfms, batch_tfms at this stage. Inside the library, each set of transforms might have a set of defaults which are defined inside the TransformBlock object. (another article on TransformBlock and defaults coming soon).
  2. Once the DataBlock object has been initialised, pass the DataBlock object to DataBunch class’s from_dblock method also passing through the various transforms.
  3. from_dblock calls the databunch method defined in DataBlock class itself. So, essentially, we pass dblock to from_dblock inside DataBunch which in turn calls databunch inside DataBlock. It might sound confusing, but it really isn’t — this will be made clearer once we start to look at the code.
  4. The first step inside the databunch method in dblock is to create a DataSource object. Also, item_tfms, batch_tfms and kwargs get passed to databunch method inside DataSource.
  5. The databunch method inside DataSource gets called, which is inherited from FilteredBase class. This method returns the DataBunch object which was instantiated by passing in the dataloaders.

It’s okay if the above 5 points don’t make a lot of sense yet. What I really want you to get a sense of is how closely DataBlock, DataBunch and DataSource work with each other: a DataBlock creates a DataSource, and the DataSource in turn creates the DataBunch.
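To make the call chain a little more concrete before we open the real source, here is a toy sketch in plain Python. It is not the FastAI code: the classes are simplified stand-ins (hence the Toy prefix), their signatures and the train/valid split are my own assumptions, and the only purpose is to record the order in which the five steps hand off to each other.

```python
# Toy stand-ins only: none of this is FastAI code. The classes mimic the
# delegation described in the five steps and log each call in order.
calls = []


class ToyDataBunch:
    """Basic wrapper around several data loaders."""

    def __init__(self, *dls):
        self.dls = dls

    @classmethod
    def from_dblock(cls, dblock, source, **kwargs):
        # Steps 2 and 3: hand the work straight back to the DataBlock.
        calls.append("ToyDataBunch.from_dblock")
        return dblock.databunch(source, **kwargs)


class ToyFilteredBase:
    def databunch(self, **kwargs):
        # Step 5: one "loader" per subset, wrapped in a ToyDataBunch.
        calls.append("ToyFilteredBase.databunch")
        return ToyDataBunch(*self.subsets)


class ToyDataSource(ToyFilteredBase):
    def __init__(self, items):
        # Hypothetical split: the last item is validation, the rest is training.
        self.subsets = (items[:-1], items[-1:])


class ToyDataBlock:
    def databunch(self, source, **kwargs):
        # Step 4: build the data source first, then delegate to it.
        calls.append("ToyDataBlock.databunch")
        dsrc = ToyDataSource(source)
        return dsrc.databunch(**kwargs)


dbunch = ToyDataBunch.from_dblock(ToyDataBlock(), ["a.jpg", "b.jpg", "c.jpg"])
print(calls)
print(dbunch.dls)
```

Reading the printed list from left to right gives the same hand-off order as steps 2 to 5, and the returned object holds one entry per subset.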

Let’s not give up just yet. Let’s soldier on and look at the code!

DataBlock API Code

I generally use vim to navigate through source code — it is really easy to do so with tags.

Let’s use the example shown in “Lesson 1 — What’s your pet”, to study the DataBlock API in detail.

We create the databunch in the lesson like so:

So the first thing that happens is that the ImageDataBunch.from_name_re method gets called. Ok, no problem!

The from_name_re method calls from_name_func inside the ImageDataBunch class itself. Please note that item_tfms, bs and batch_tfms get passed as kwargs from from_name_re to from_name_func.
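If the kwargs hand-off feels abstract, the pattern can be sketched with two plain functions. This is only an illustration of the idea, not the library’s implementation: the function bodies, the file names and the regex are all assumptions of mine, and the real methods live on ImageDataBunch and do far more.

```python
import re


def from_name_func(fnames, label_func, **kwargs):
    # Toy version: label every file name and return the labels together with
    # whatever keyword arguments were forwarded.
    labels = [label_func(name) for name in fnames]
    return labels, kwargs


def from_name_re(fnames, pat, **kwargs):
    # Turn the regex into a labelling function, then delegate. Everything in
    # kwargs (item_tfms, bs, batch_tfms, ...) passes through untouched.
    regex = re.compile(pat)

    def label_func(name):
        return regex.search(name).group(1)

    return from_name_func(fnames, label_func, **kwargs)


# Hypothetical file names and pattern, used only for this illustration.
fnames = ["images/great_pyrenees_173.jpg", "images/Bengal_12.jpg"]
labels, forwarded = from_name_re(
    fnames, r"([^/]+)_\d+\.jpg$", bs=64, item_tfms="resize"
)
print(labels)     # ['great_pyrenees', 'Bengal']
print(forwarded)  # {'bs': 64, 'item_tfms': 'resize'}
```

The regex version is just a thin convenience layer: it builds the labelling function and leaves every other argument for the function-based version to deal with.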

Step-1

At this stage, we form a DataBlock object by first passing in some default blocks: ImageBlock and CategoryBlock.

These blocks essentially contain a set of defaults that are defined by the FastAI library internally for us. You can read more about the default transforms here.

DataBlock API `default` Transforms

We can confirm the above transforms get set by calling default_type_tfms, default_item_tfms & default_batch_tfms on the dblock object.
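One way to picture how a DataBlock could end up with these three attributes is the toy sketch below. The attribute names are the ones mentioned above; everything else is an assumption on my part: the classes are stand-ins, the merge rule is a guess at the general idea, and the transform names are placeholder strings rather than the library’s actual defaults.

```python
class ToyBlock:
    """Hypothetical stand-in for a block that carries default transforms."""

    def __init__(self, type_tfms=(), item_tfms=(), batch_tfms=()):
        self.type_tfms = list(type_tfms)
        self.item_tfms = list(item_tfms)
        self.batch_tfms = list(batch_tfms)


class ToyDataBlock:
    def __init__(self, blocks):
        # Type transforms stay per block (one list for inputs, one for
        # targets); item and batch transforms are merged without duplicates.
        self.default_type_tfms = [b.type_tfms for b in blocks]
        self.default_item_tfms = self._merge(b.item_tfms for b in blocks)
        self.default_batch_tfms = self._merge(b.batch_tfms for b in blocks)

    @staticmethod
    def _merge(groups):
        merged = []
        for group in groups:
            for tfm in group:
                if tfm not in merged:
                    merged.append(tfm)
        return merged


# Placeholder transform names, not the real FastAI defaults.
image_block = ToyBlock(
    type_tfms=["open_image"], item_tfms=["to_tensor"], batch_tfms=["to_float"]
)
category_block = ToyBlock(type_tfms=["categorize"], item_tfms=["to_tensor"])

dblock = ToyDataBlock([image_block, category_block])
print(dblock.default_type_tfms)   # [['open_image'], ['categorize']]
print(dblock.default_item_tfms)   # ['to_tensor']
print(dblock.default_batch_tfms)  # ['to_float']
```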

Great! So far, we have seen that when the ImageDataBunch.from_name_re method gets called, the first thing that happens is that a DataBlock object called dblock gets formed with the defaults above.

This is the end of step-1 defined in the API overview.

Step-2

What follows is that we pass this dblock object to the from_dblock method defined inside the DataBunch class.

So, let’s have a look at from_dblock and try to understand what goes on here.

Aha! So the from_dblock method calls the databunch method inside the DataBlock class itself, passing in type_tfms, item_tfms and batch_tfms.

This is really the end of step-2 in the DataBlock API overview.

Step-3 and Step-4

Inside the databunch method, the first thing that happens is that a DataSource object gets created and returned as dsrc.

If you are unfamiliar with the DataSource object, please refer to this post and also this one to get an intuition of the hierarchy of things inside FastAI. It’s beautiful, isn’t it? This is really it for step-3 and step-4 in the DataBlock API overview.

To recap: a DataSource object dsrc was created, and item_tfms, batch_tfms and kwargs were passed through to the databunch method inside DataSource.

Step-5

The databunch method is actually defined in FilteredBase, and since DataSource inherits from FilteredBase, it also gets a databunch method.

All we do inside this method is define a bunch of dls from train and valid and pass them to the DataBunch constructor to finally return the DataBunch object.
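The inheritance part is the bit that tripped me up, so here is a small toy sketch of it. Again, this is not the FastAI source: the class names, the subset method, the valid_idx argument and the batching logic are all assumptions made for illustration. The point is only that the base class owns databunch, while the subclass decides what goes into the train and valid subsets.

```python
class ToyLoader:
    """Hypothetical loader: yields fixed-size batches from a list of items."""

    def __init__(self, items, bs):
        self.items, self.bs = items, bs

    def __iter__(self):
        for i in range(0, len(self.items), self.bs):
            yield self.items[i:i + self.bs]


class ToyBunch:
    def __init__(self, *dls):
        self.train_dl, self.valid_dl = dls


class ToyFilteredBase:
    n_subsets = 2  # subset 0 is train, subset 1 is valid

    def subset(self, i):
        raise NotImplementedError

    def databunch(self, bs=2):
        # One loader per subset, all handed to the bunch constructor.
        dls = [ToyLoader(self.subset(i), bs) for i in range(self.n_subsets)]
        return ToyBunch(*dls)


class ToySource(ToyFilteredBase):
    # No databunch method here: it is inherited from ToyFilteredBase.
    def __init__(self, items, valid_idx):
        self.items, self.valid_idx = items, set(valid_idx)

    def subset(self, i):
        keep_valid = (i == 1)
        return [
            x for j, x in enumerate(self.items)
            if (j in self.valid_idx) == keep_valid
        ]


dsrc = ToySource(["a", "b", "c", "d", "e"], valid_idx=[3, 4])
dbunch = dsrc.databunch(bs=2)
print(list(dbunch.train_dl))  # [['a', 'b'], ['c']]
print(list(dbunch.valid_dl))  # [['d', 'e']]
```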

Therefore, the dbunch object that is formed in Lesson-1 is really this DataBunch object that is constructed inside the DataSource class and returned.

This is the end of step-5.


This is the very first article of many to come. Over the next few weeks, I will be looking at the Vision module inside FastAI, and together we will look at various Transforms, Augmentations, Metrics, Data Loaders and also the Learner object.

The goal is to start with a high level API such as the DataBlock API but to really reach the lowest layers by digging into it until we reach PyTorch.

Thank you for reading!

Written by Aman Arora

https://github.com/arora-aman123 LinkedIn.com/aroraaman