FastAI V2 DataBlock API Code Overview: A Gentle Introduction
One of the top things Jeremy says to do:
Please run the code, really run the code. Don’t go deep on theory. Play with the code, see what goes in and what comes out.
I’ll admit — I am not one of the best coders out there, in fact, it really takes me a lot of time to understand what is really going on. I first heard about FastAI a year ago around Nov 2018, and since then I have been an ardent follower of the course. I spent a lot of time watching the lecture videos and pretty much all my learnings have come from being associated with FastAI in some form — whether it’s the forums or the code, there’s REALLY plenty to learn!
In the coming few weeks or months, starting today, I will be going through the code in FastAI V2 (this series builds on top of Jeremy’s code walkthrus) — if I have already struggled and spent a lot of time learning something, the way I look at it, perhaps, you shouldn’t spend the same amount of time, right? It just makes sense to document pretty much everything I’ve learnt, and hopefully, just hopefully, I’ll really be able to save your time.
Also, I’d also like to say thank you to Jeremy for the things I’ve learnt from him — his patience to respond to the forum posts, kaggle tutorials (including medical imaging), twitter account updates and of course FastAI. While I have never met him, I consider him to be my mentor and follow his advice diligently — you should try it too.
With that said, let’s begin. In this article, we will be doing a deep dive into the DataBlock object used in lesson 1-pets, and into the DataBlock API as a whole.
Please note that this is a code walk-thru, rather than a tutorial on how to use the DataBlock API. For that, please follow the FastAI course.
DataBlock API Overview

Essentially, the DataBlock API always returns a DataBunch object. Our end goal is to start with a DataBlock object and return a DataBunch object, which, as can be read from the documentation here, is a "Basic wrapper around several DataLoaders".
This is done in five steps as shown in the overview above:

1. Build the DataBlock: you pass in path, items, item_tfms and batch_tfms at this stage. Inside the library, each set of transforms might have a set of defaults which are defined inside the TransformBlock object (another article on TransformBlock and defaults coming soon).
2. Once the DataBlock object has been initialised, pass the DataBlock object to the DataBunch class’s from_dblock method, also passing through the various transforms.
3. from_dblock calls the databunch method defined in the DataBlock class itself. So, essentially, we pass dblock to from_dblock inside DataBunch, which in turn calls databunch inside DataBlock. It might sound confusing, but it really isn’t; this will be made clearer once we start to look at the code.
4. The first step inside the databunch method in dblock is to create a DataSource object. Also, item_tfms, batch_tfms and kwargs get passed to the databunch method inside DataSource.
5. The databunch method inside DataSource gets called, which is inherited from the FilteredBase class. This method returns the DataBunch object, which is instantiated by passing in the dataloaders.
It’s okay if the above 5 points don’t make a lot of sense yet; what I really want you to get a sense of is how closely DataBlock, DataSource and DataBunch work with each other. A DataBlock is used to create a DataSource, and that DataSource is then used to create the DataBunch.
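To make that hand-off concrete, the call chain described in the five steps above can be sketched with toy stub classes. These are hypothetical stand-ins, not fastai code; the real classes do far more, but the shape of the calls is the same:

```python
# Toy sketch of the DataBlock -> DataSource -> DataBunch call chain.
# All three classes here are simplified stand-ins for the fastai originals.
class DataBunch:
    def __init__(self, *dls): self.dls = dls        # wraps several DataLoaders

    @classmethod
    def from_dblock(cls, dblock, source, **kwargs):  # step 2: receive the dblock
        return dblock.databunch(source, **kwargs)    # step 3: call back into it

class DataSource:
    def __init__(self, items): self.items = items

    def databunch(self, **kwargs):                   # step 5: build the loaders
        dls = ("train_dl", "valid_dl")               # stand-ins for real DataLoaders
        return DataBunch(*dls)

class DataBlock:                                     # step 1: the starting object
    def databunch(self, source, **kwargs):
        dsrc = DataSource(source)                    # step 4: make a DataSource
        return dsrc.databunch(**kwargs)

dbunch = DataBunch.from_dblock(DataBlock(), ["img1.jpg", "img2.jpg"])
print(type(dbunch).__name__, dbunch.dls)  # DataBunch ('train_dl', 'valid_dl')
```

Notice that DataBunch.from_dblock immediately hands control back to the DataBlock, which is exactly the back-and-forth described in steps 2 and 3.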
Let’s not give up just yet; let’s soldier on and look at the code!
DataBlock API Code
I generally use vim to navigate through source code — it is really easy to do so with tags.
Let’s use the example shown in “Lesson 1 — What’s your pet”, to study the DataBlock API in detail.
We create the databunch in the lesson like so:
dbunch = ImageDataBunch.from_name_re(path, fnames, pat, item_tfms=RandomResizedCrop(460, min_scale=0.75), bs=bs,
                                     batch_tfms=[*aug_transforms(size=224, max_warp=0), Normalize(*imagenet_stats)])

So the first thing that happens is that the ImageDataBunch.from_name_re method gets called. Ok, no problem!
@classmethod
@delegates(DataBunch.from_dblock)
def from_name_re(cls, path, fnames, pat, **kwargs):
    "Create from list of `fnames` in `path`s with re expression `pat`."
    return cls.from_name_func(path, fnames, RegexLabeller(pat), **kwargs)
The from_name_re method calls from_name_func inside the ImageDataBunch class itself, wrapping pat in a RegexLabeller. Please note that item_tfms, bs and batch_tfms get passed as kwargs by from_name_re to from_name_func.
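To see what RegexLabeller(pat) contributes, here is a minimal re-based labeller in the same spirit. This is a sketch, not the fastai implementation, and the pattern below is the one used in the lesson for the pets filenames:

```python
import re

def regex_labeller(pat):
    "Toy stand-in for RegexLabeller: return a function that extracts group 1 of `pat`."
    pat = re.compile(pat)
    def _inner(fname):
        match = pat.search(str(fname))
        assert match is not None, f"{fname!r} did not match {pat.pattern!r}"
        return match.group(1)
    return _inner

# Grab everything before the trailing "_<digits>.jpg" as the class name.
label_func = regex_labeller(r'(.+)_\d+.jpg$')
print(label_func('great_pyrenees_173.jpg'))  # great_pyrenees
```

This label_func is exactly what gets handed to from_name_func as its label_func argument, which in turn becomes the DataBlock’s get_y.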
@classmethod
@delegates(DataBunch.from_dblock)
def from_name_func(cls, path, fnames, label_func, valid_pct=0.2, seed=None, **kwargs):
    "Create from list of `fnames` in `path`s with `label_func`."
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       splitter=RandomSplitter(valid_pct, seed=seed),
                       get_y=label_func)
    return cls.from_dblock(dblock, fnames, path=path, **kwargs)
Step-1
At this stage, we form a DataBlock object by passing in some default blocks first: ImageBlock and CategoryBlock.
These blocks essentially contain a set of defaults that are defined by the FastAI library internally for us. You can read more about the default transforms here.
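To build an intuition for what ImageBlock and CategoryBlock carry, here is a toy sketch. The class below is a hypothetical stand-in for fastai’s TransformBlock (the real one also carries things like dl_type and dbunch_kwargs), and the strings stand in for the real transform objects:

```python
# Toy stand-in for fastai's TransformBlock: a bag of default transforms,
# grouped by the stage at which they run.
class TransformBlock:
    def __init__(self, type_tfms=None, item_tfms=None, batch_tfms=None):
        self.type_tfms  = type_tfms  or []   # raw item -> typed object
        self.item_tfms  = item_tfms  or []   # per item, on the CPU
        self.batch_tfms = batch_tfms or []   # per batch, after collation

def ImageBlock():    # images: decode file -> tensor -> float
    return TransformBlock(type_tfms=['PILImage.create'],
                          item_tfms=['ToTensor'],
                          batch_tfms=['IntToFloatTensor'])

def CategoryBlock(): # labels: map strings to class indices
    return TransformBlock(type_tfms=['Categorize'])

# DataBlock instantiates each callable block, then collects its defaults.
blocks = [b() for b in (ImageBlock, CategoryBlock)]
print([b.type_tfms for b in blocks])  # [['PILImage.create'], ['Categorize']]
```

This is why, in the next snippet, dblock.default_type_tfms shows PILImage.create and Categorize without us ever passing them in ourselves.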

We can confirm the above transforms get set by calling default_type_tfms, default_item_tfms & default_batch_tfms on the dblock object.
dblock.default_type_tfms
>> [[<bound method PILBase.create of <class 'local.vision.core.PILImage'>>],
    (#1) [Categorize: True (object,object) -> encodes (object,object) -> decodes]]

dblock.default_item_tfms
>> [ToTensor: False (PILMask,object) -> encodes
    (PILBase,object) -> encodes ]

dblock.default_batch_tfms
>> [Cuda: False (object,object) -> encodes (object,object) -> decodes,
    IntToFloatTensor: True (TensorMask,object) -> encodes
    (TensorImage,object) -> encodes (TensorImage,object) -> decodes]
Great! So far, we have seen that when ImageDataBunch.from_name_re gets called, the first thing that happens is that a DataBlock object called dblock gets formed, having the defaults as above.
This is the end of step-1 defined in the API Overview.
Step-2
What follows is that we pass this dblock object to the from_dblock method defined inside the DataBunch class.
return cls.from_dblock(dblock, fnames, path=path, **kwargs)

So, let’s have a look at from_dblock and try to understand what goes on here.
@classmethod
@delegates(TfmdDL.__init__)
def from_dblock(cls, dblock, source, path='.', type_tfms=None, item_tfms=None, batch_tfms=None, **kwargs):
    return dblock.databunch(source, path=path, type_tfms=type_tfms, item_tfms=item_tfms, batch_tfms=batch_tfms, **kwargs)
Aha! So the from_dblock method calls the databunch method inside the DataBlock class itself, passing in type_tfms, item_tfms and batch_tfms.
This is the end of step-2 in the DataBlock API overview.
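A quick aside on the @delegates decorator that wraps the classmethods we just saw. Below is a minimal sketch of what a delegates-style decorator does; this is a simplified assumed version, not fastcore’s actual implementation. The idea is that it replaces the **kwargs placeholder in a function’s signature with the defaulted keyword arguments of the function it delegates to, so tools like tab-completion show the real arguments:

```python
import inspect

def delegates(to):
    "Toy sketch: rewrite `f`'s visible signature to expose `to`'s keyword arguments."
    def _decorator(f):
        sig_f  = inspect.signature(f)
        sig_to = inspect.signature(to)
        # keep f's own params, dropping the **kwargs placeholder
        params = {k: v for k, v in sig_f.parameters.items()
                  if v.kind is not inspect.Parameter.VAR_KEYWORD}
        # pull in `to`'s defaulted params that f doesn't already have
        extra = {k: v.replace(kind=inspect.Parameter.KEYWORD_ONLY)
                 for k, v in sig_to.parameters.items()
                 if v.default is not inspect.Parameter.empty and k not in params}
        params.update(extra)
        f.__signature__ = sig_f.replace(parameters=list(params.values()))
        return f
    return _decorator

def target(a, b=1, c=2): pass

@delegates(target)
def wrapper(x, **kwargs): pass

print(inspect.signature(wrapper))  # (x, *, b=1, c=2)
```

This is why from_name_re can accept item_tfms, bs and batch_tfms without listing them: they ride along in **kwargs, while the advertised signature comes from DataBunch.from_dblock.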
Step-3 and Step-4
class DataBlock():
    "Generic container to quickly build `DataSource` and `DataBunch`"
    get_x=get_items=splitter=get_y = None
    dl_type = TfmdDL
    _methods = 'get_items splitter get_y get_x'.split()

    def __init__(self, blocks=None, dl_type=None, getters=None, n_inp=None, **kwargs):
        blocks = L(getattr(self,'blocks',(TransformBlock,TransformBlock)) if blocks is None else blocks)
        blocks = L(b() if callable(b) else b for b in blocks)
        self.default_type_tfms  = blocks.attrgot('type_tfms', L())
        self.default_item_tfms  = _merge_tfms(*blocks.attrgot('item_tfms',  L()))
        self.default_batch_tfms = _merge_tfms(*blocks.attrgot('batch_tfms', L()))
        for t in blocks: self.dl_type = getattr(t, 'dl_type', self.dl_type)
        if dl_type is not None: self.dl_type = dl_type
        self.databunch = delegates(self.dl_type.__init__)(self.databunch)
        self.dbunch_kwargs = merge(*blocks.attrgot('dbunch_kwargs', {}))
        self.n_inp,self.getters = n_inp,L(getters)
        if getters is not None: assert self.get_x is None and self.get_y is None
        assert not kwargs

    def datasource(self, source, type_tfms=None):
        self.source = source
        items = (self.get_items or noop)(source)
        if isinstance(items,tuple):
            items = L(items).zip()
            labellers = [itemgetter(i) for i in range_of(self.default_type_tfms)]
        else: labellers = [noop] * len(self.default_type_tfms)
        splits = (self.splitter or noop)(items)
        if self.get_x: labellers[0] = self.get_x
        if self.get_y: labellers[1] = self.get_y
        if self.getters: labellers = self.getters
        if type_tfms is None: type_tfms = [L() for t in self.default_type_tfms]
        type_tfms = L([self.default_type_tfms, type_tfms, labellers]).map_zip(
            lambda tt,tfm,l: L(l) + _merge_tfms(tt, tfm))
        return DataSource(items, tfms=type_tfms, splits=splits, dl_type=self.dl_type, n_inp=self.n_inp)

    def databunch(self, source, path='.', type_tfms=None, item_tfms=None, batch_tfms=None, **kwargs):
        dsrc = self.datasource(source, type_tfms=type_tfms)
        item_tfms  = _merge_tfms(self.default_item_tfms,  item_tfms)
        batch_tfms = _merge_tfms(self.default_batch_tfms, batch_tfms)
        kwargs = {**self.dbunch_kwargs, **kwargs}
        return dsrc.databunch(path=path, after_item=item_tfms, after_batch=batch_tfms, **kwargs)

    _docs = dict(datasource="Create a `Datasource` from `source` with `type_tfms`",
                 databunch="Create a `DataBunch` from `source` with `item_tfms` and `batch_tfms`")
Inside databunch, the first step is that a DataSource object gets created and returned as dsrc.
If you are unfamiliar with the DataSource object, please refer to this post and also this one to get an intuition of the hierarchy of things inside FastAI. It’s beautiful, isn’t it? This is really it for step-3 and step-4 in the DataBlock API overview.
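One subtle thing inside datasource is how it pairs one labeller with one type-transform pipeline per output: the labeller picks the raw value out of each item (get_x, get_y or itemgetter), and the type transforms then convert it. Here is a toy sketch of that pairing with stand-in transforms (not fastai code; str.upper and str.title stand in for PILImage.create and Categorize):

```python
from operator import itemgetter

# Each raw item already holds both the x and the y.
items = [("cat_01.jpg", "cat"), ("dog_07.jpg", "dog")]

labellers = [itemgetter(0), itemgetter(1)]   # pick x, pick y
type_tfms = [[str.upper], [str.title]]       # stand-ins for the real pipelines

def apply_pipeline(item, labeller, tfms):
    "Run one labeller, then its type transforms, over a single raw item."
    x = labeller(item)
    for t in tfms: x = t(x)
    return x

# One (x, y) sample: each output gets its own labeller + pipeline.
sample = tuple(apply_pipeline(items[0], l, t) for l, t in zip(labellers, type_tfms))
print(sample)  # ('CAT_01.JPG', 'Cat')
```

In the real code this pairing is exactly what the map_zip call builds: L(l) + _merge_tfms(tt, tfm) prepends the labeller to the merged transforms for each output.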
dsrc = self.datasource(source, type_tfms=type_tfms)
item_tfms = _merge_tfms(self.default_item_tfms, item_tfms)
batch_tfms = _merge_tfms(self.default_batch_tfms, batch_tfms)
kwargs = {**self.dbunch_kwargs, **kwargs}
return dsrc.databunch(path=path, after_item=item_tfms, after_batch=batch_tfms, **kwargs)

A DataSource object dsrc was created, and item_tfms, batch_tfms and kwargs were defined and finally passed through to the databunch method inside DataSource.
Step-5
The databunch method is actually defined in FilteredBase, and since DataSource inherits from FilteredBase, it also gets a databunch method.
def databunch(self, bs=16, val_bs=None, shuffle_train=True, n=None, path='.', dl_type=None, dl_kwargs=None, **kwargs):
    if dl_kwargs is None: dl_kwargs = [{}] * self.n_subsets
    ns = self.n_subsets-1
    bss = [bs] + [2*bs]*ns if val_bs is None else [bs] + [val_bs]*ns
    shuffles = [shuffle_train] + [False]*ns
    if dl_type is None: dl_type = self._dl_type
    dls = [dl_type(self.subset(i), bs=b, shuffle=s, drop_last=s, n=n if i==0 else None, **kwargs, **dk)
           for i,(b,s,dk) in enumerate(zip(bss,shuffles,dl_kwargs))]
    return DataBunch(*dls, path=path)
All we do inside this method is build a DataLoader for each subset (train and valid) and pass them to the DataBunch constructor, which finally returns the DataBunch object.
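The batch-size and shuffle bookkeeping at the top of the method can be isolated into a small sketch. This replicates just those few lines, assuming two subsets (train and valid): the validation loader defaults to double the training batch size (no gradients means less memory per item) and never shuffles:

```python
def dl_settings(bs=16, val_bs=None, shuffle_train=True, n_subsets=2):
    "Per-subset (batch_size, shuffle) pairs, mirroring FilteredBase.databunch."
    ns = n_subsets - 1
    # Validation batch size defaults to 2*bs when not given explicitly.
    bss = [bs] + [2 * bs] * ns if val_bs is None else [bs] + [val_bs] * ns
    # Only the training subset is (optionally) shuffled.
    shuffles = [shuffle_train] + [False] * ns
    return list(zip(bss, shuffles))

print(dl_settings(bs=64))             # [(64, True), (128, False)]
print(dl_settings(bs=64, val_bs=32))  # [(64, True), (32, False)]
```

Note also, back in the real method, that shuffle=s and drop_last=s share the same flag: the training loader both shuffles and drops the last partial batch, while the validation loader does neither.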
Therefore, the dbunch object that is formed in Lesson-1 is really this DataBunch object that is constructed inside the DataSource class and returned.
This is the end of step-5.
This is the very first article of many to come. Over the next few weeks, I will be looking at the Vision module inside FastAI, and together we will look at various Transforms, Augmentations, Metrics, Data Loaders and also the Learner object.
The goal is to start with a high level API such as the DataBlock API but to really reach the lowest layers by digging into it until we reach PyTorch.
Thank you for reading!
