Constructing a DataBlock using Fastai

Ashik Shaffi
Apr 3 · 5 min read
Photo by Iker Urteaga on Unsplash

These are the notes I took while reading the book Deep Learning for Coders and taking the fastai course. The notes below are a mix of book text and my own understanding of how things work. This material is from Chapter 6: Other Computer Vision Problems.

We have a DataFrame object. How do we convert it into a DataLoaders object? We usually build a data block and use it to create the DataLoaders. In this tutorial we will go step by step through building a DataBlock for a multi-label classification problem.

Before we start building a DataBlock, we should make sure we understand the four terms below.

Dataset: A collection that returns a tuple of the independent and dependent variable for a single item.

DataLoader: An iterator that provides a stream of mini-batches, where each mini-batch is a pair of a batch of independent variables and a batch of dependent variables.

On top of these, fastai provides two classes for bringing the training and validation sets together. They are called Datasets and DataLoaders.

Datasets: An iterator that contains a training Dataset and a validation Dataset, so it gives tuples of the independent and dependent variables for both the training and the validation set.

DataLoaders: An object that contains a training DataLoader and a validation DataLoader.
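To make the difference between a dataset and a data loader concrete, here is a minimal sketch in plain Python. It is not fastai code: the toy `dataset` list and the `mini_batches` helper are made-up names for illustration only.

```python
# A "dataset" in this sense: indexing it returns one (x, y) tuple.
# The items here are toy values, not real images or labels.
dataset = [("a", 0), ("b", 1), ("c", 0), ("d", 1), ("e", 0)]

def mini_batches(ds, batch_size):
    """Yield (batch_of_x, batch_of_y) pairs, like a very simple data loader."""
    for start in range(0, len(ds), batch_size):
        items = ds[start:start + batch_size]
        xs, ys = zip(*items)  # regroup per-item tuples into per-field tuples
        yield list(xs), list(ys)

batches = list(mini_batches(dataset, batch_size=2))
print(batches[0])
```

The dataset only knows about single items, while the loader decides how items are grouped into mini-batches. A real DataLoader also does things like shuffling and stacking tensors.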

Since a DataLoader is built on top of a Dataset and adds functionality to it (collating multiple items into a mini-batch), it’s often easier to start by creating and testing Datasets, and then look at DataLoaders once that is working.

Let’s start creating from scratch!!!

First, we create an empty DataBlock object. To create a Datasets from it we need a source, which can be a collection of images, a DataFrame, or any other kind of data you want to feed in.

Looking at an item from this Datasets, it’s a bit confusing that the same row is printed out twice. That is the default behavior: the data block assumes we have two things:

  • Input
  • Target
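The default behavior can be imitated with a few lines of plain Python. This is only a sketch of the idea, not fastai's implementation: `make_item` and `identity` are hypothetical helpers, and the example row uses column names I am assuming from the book's DataFrame (`fname`, `labels`, `is_valid`).

```python
def identity(item):
    # With no getter supplied, the item is passed through unchanged.
    return item

def make_item(row, get_x=identity, get_y=identity):
    """Build an (input, target) pair from one row of the source."""
    return get_x(row), get_y(row)

# Illustrative row; the column names are assumptions, not taken from this article.
row = {"fname": "000005.jpg", "labels": "chair", "is_valid": True}

x, y = make_item(row)
print(x)
print(y)
```

Because neither getter is given, the input and the target are both the whole row, which is why the same row shows up twice.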

But we don’t need all of that, so we are going to grab the appropriate fields from the DataFrame, which we can do by passing get_x and get_y functions.

But the independent variable (x) needs to be converted into a complete path so that we can open it as an image, and the dependent variable (y) should be split on the space character so that it becomes a list. The labels are stored together in one string, but in reality they are separate categories.
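A minimal sketch of what such functions could look like is below. The dataset root `path`, the `train` subfolder, and the column names `fname` and `labels` are assumptions on my part (they follow the layout I remember from the book's dataset), so adjust them to your own data.

```python
from pathlib import Path

# Hypothetical dataset root; in practice this comes from wherever the data was downloaded.
path = Path("pascal_2007")

def get_x(r):
    # Turn the bare file name into a full path that can be opened as an image.
    return path / "train" / r["fname"]

def get_y(r):
    # Split the space-separated label string into a list of labels.
    return r["labels"].split(" ")

# Illustrative row; a DataFrame row can be indexed the same way.
row = {"fname": "000005.jpg", "labels": "horse person"}
print(get_x(row))
print(get_y(row))
```

Both functions only index the row by column name, so they work on a plain dict as shown here and should behave the same on a pandas row.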

Alrighty! We have the path now, but how do we open the images and convert them to tensors?

For this we need a set of transforms, and block types provide them. The blocks we have seen so far are ImageBlock and CategoryBlock. We can use ImageBlock again, but not CategoryBlock: it produces a single integer per item, whereas in this problem each item can have multiple labels.

To solve this we use MultiCategoryBlock, which expects to receive a list of strings, as we have in this case.

If we take a look, the list of categories is not encoded in the same way as it was for the regular CategoryBlock. In that case we had a single integer representing which category was present.

For instance, is it a cat? (or) is this an Abyssinian cat?

In this case, we instead have a list of 0s, with a 1 in any position where that category is present.

Here, if there is a 1 in the second and fourth positions, that means vocab items two and four are present in this image. This is known as one-hot encoding.
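As a small illustration of this encoding, here is a plain Python sketch. The `vocab` list and the helper names `one_hot` and `decode` are made up for the example; fastai builds its own vocab from the data.

```python
# Hypothetical vocab; the real one is built from the labels in the dataset.
vocab = ["bird", "cat", "dog", "horse", "person"]

def one_hot(labels, vocab):
    """Return a list of 0s with a 1 wherever a vocab item is present."""
    present = set(labels)
    return [1 if name in present else 0 for name in vocab]

def decode(encoded, vocab):
    """Recover the label names from a one-hot encoded list."""
    return [name for name, flag in zip(vocab, encoded) if flag == 1]

encoded = one_hot(["cat", "horse"], vocab)
print(encoded)
print(decode(encoded, vocab))
```

Every encoded item has the same length as the vocab, no matter how many labels the image carries.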

The reason we can’t easily just use a list of category indices is that each list would be a different length, and PyTorch requires tensors, where everything has to be the same length.

With NumPy arrays, PyTorch tensors, and fastai’s L class, we can index directly using a list or vector, which makes it easy to look up which vocab items the 1s correspond to.

We have ignored the is_valid column up until now, which means that DataBlock has been using a random split by default. To explicitly choose the elements of our validation set, we need to write a function and pass it as the splitter argument.

It will take the items (here, our whole DataFrame) and must return two (or more) lists of integers: the indices of the items that belong to each set.
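The shape of such a function can be sketched in plain Python. Note that this is a stand-in that works on a list of dict rows rather than on a DataFrame, and `split_by_flag` is a hypothetical name; a real splitter for this problem would read the is_valid column of the DataFrame instead.

```python
def split_by_flag(rows):
    """Return (train_indices, valid_indices) based on each row's is_valid flag."""
    train = [i for i, r in enumerate(rows) if not r["is_valid"]]
    valid = [i for i, r in enumerate(rows) if r["is_valid"]]
    return train, valid

# Illustrative rows standing in for the DataFrame.
rows = [
    {"fname": "a.jpg", "is_valid": False},
    {"fname": "b.jpg", "is_valid": True},
    {"fname": "c.jpg", "is_valid": False},
]

train_idx, valid_idx = split_by_flag(rows)
print(train_idx, valid_idx)
```

The important part is the return value: two lists of integer positions that together cover the items, with no item appearing in both.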

As we discussed, a DataLoader collates the items from a Dataset into a mini-batch. This is a tuple of tensors, where each tensor simply stacks the items from that location in the Dataset item.

Now we have to make sure every item is the same size before putting them into a DataLoaders. To do this, we use RandomResizedCrop.

from fastai.vision.all import *

# df, splitter, get_x and get_y are the DataFrame and functions defined earlier
dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=splitter,
                   get_x=get_x,
                   get_y=get_y,
                   item_tfms=RandomResizedCrop(128, min_scale=0.353))
# Build the DataLoaders from the DataFrame
dls = dblock.dataloaders(df)
# Display a sample of our data
dls.show_batch()

Not much has changed from before: we built the DataBlock from scratch and used a MultiCategoryBlock instead of a CategoryBlock.

Our data is now ready for the model. Nothing changes when creating the Learner either, except that this time we are going to use a new loss function, called binary cross entropy.
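To give a feel for what binary cross entropy computes, here is a rough plain Python sketch. It is not the fastai or PyTorch implementation, just the usual formula written out: each raw activation goes through a sigmoid, and the loss is the mean negative log of the probability assigned to the correct answer for each label.

```python
import math

def sigmoid(z):
    # Squash a raw activation into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(activations, targets):
    """Mean binary cross entropy over raw activations and 0/1 targets."""
    losses = []
    for z, t in zip(activations, targets):
        p = sigmoid(z)
        # Use p when the label is present, 1 - p when it is absent.
        losses.append(-math.log(p if t == 1 else 1.0 - p))
    return sum(losses) / len(losses)

# Toy values: two labels, the first present and the second absent.
loss = binary_cross_entropy([0.0, 0.0], [1, 0])
print(loss)
```

Each label is treated as its own yes/no question, which is what makes this loss a natural fit for one-hot encoded targets where several labels can be present at once.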
