Train a cnn with the fastai library

Published in

Deena Does Data Science

15 min read · Dec 19, 2018

Note: This article is not polished, but gives an insight into how I push forward through a project and learn along the way.

Below is a step by step process of getting images into an ImageDataBunch followed by the first round of making a learn object with resnet34 and fitting a model.

This blog goes well with Lesson 2 of Fastai Deep Learning Part 1 Course-v3.

My Dataset

Breast histopathology images from Kaggle:

https://www.kaggle.com/paultimothymooney/breast-histopathology-images/

This dataset is designed to identify parts of tissue biopsies that indicate the presence or absence of invasive ductal carcinoma (IDC), i.e. cancer. The label “0” means no cancer. The label “1” means cancerous cells were observed in the tissue biopsy image.

A summary of accuracy achieved by others in public Kaggle Kernels

Marsh achieved an accuracy and F1 score of about 88% based on a classification threshold of 0.5 using TensorFlow.
Bukun got 85.5% accuracy with SciKitLearn and Keras.
Paul Mooney got 84.6% accuracy with TensorFlow.
Md. Rezwanul Haque did not complete the project.

It should be noted that the creators of the paper relating to this dataset used a classification threshold of 0.29. (I don’t know what this means yet!)
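
For readers with the same question, one common meaning is this: a classification threshold is the cutoff applied to a model's predicted probability to turn it into a class label. The sketch below uses made-up probabilities (not results from this dataset) and a hypothetical helper called apply_threshold to show how lowering the cutoff from 0.5 to 0.29 flags more patches as IDC-positive.

```python
def apply_threshold(probabilities, threshold):
    """Label each predicted probability 1 (IDC) if it meets the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities of IDC for five image patches.
probs = [0.10, 0.30, 0.45, 0.60, 0.90]

labels_050 = apply_threshold(probs, 0.5)    # default-style cutoff
labels_029 = apply_threshold(probs, 0.29)   # lower cutoff, more patches flagged
print(labels_050)
print(labels_029)
```

A lower threshold trades more false positives for fewer missed cancers, which is presumably why a paper on cancer detection would choose one below 0.5.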

Step 1: Make a dataframe with paths to the image files and class labels

To make a dataframe, navigate to a directory where you want a new folder called “models” to be created. In my case I’m choosing the base folder of the course-v3 repo.

Make a list of all the paths to each image file and view it with code similar to that below. Replace data_GPU/breast_histo_PTM/before/ with your own folder structure. The Path class comes from the pathlib module, which was added in Python 3.4 (from pathlib import Path). Read the pathlib doc page to see additional ways to obtain file path information. The rglob() method searches the given folder and all of its subfolders recursively and returns only the paths whose names match the pattern '*.png'.

fpaths = list(Path('./data_GPU/breast_histo_PTM/before/').rglob('*.png')); fpaths

Turn the list of paths into a pandas Series and view the first few rows.

fpaths_s = pd.Series(fpaths,name='fpaths'); fpaths_s.head()

From the list of file paths, make a list of file stems and view the first 5. A file stem is the name of the file without the ‘.png’ part.

fstems = []
for path in fpaths:
    fstem = path.stem
    fstems.append(fstem)

fstems[:5]
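
To see rglob() and .stem in action without the real dataset, here is a small self-contained sketch. It builds a throwaway folder tree (the file names and the patient/class layout are invented for illustration), then collects the .png paths and their stems. It also uses a list comprehension, which is a compact alternative to the append loop.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: <patient_id>/<class_label>/<file>
    for rel in ["p1/0/a.png", "p1/1/b.png", "p2/0/notes.txt"]:
        f = root / rel
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()

    # rglob searches root and every subfolder for names matching the pattern.
    fpaths = sorted(root.rglob("*.png"))
    # One stem per path: the file name without its '.png' suffix.
    fstems = [p.stem for p in fpaths]

print(fstems)
```

The notes.txt file is skipped because its name does not match the pattern.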

Make another Series, this time with the file stems, and view the head.

fstems_s = pd.Series(fstems, name='fstems'); fstems_s.head()

Concatenate the two series into a dataframe and view the head.

df = pd.concat([fpaths_s, fstems_s], axis=1); df.head()

Each path in the fpaths column of the dataframe is a Path object. One easy way to get the label for the class of each image is to parse it out of the path, but first you need to turn the file path into a string.

Create a new column with path strings

The .apply() function takes a function as an argument and, when called on a dataframe column, applies that function to the value in each row. Here each value is a Path object. To hand .apply() a function we use lambda. Writing lambda in Python (and many other languages) defines a small anonymous function in place: the name before the colon (x) is its argument and the expression after the colon (str(x)) is what it returns. Here, each Path object is passed in as x, str() is called on it, and the result is stored in the new column named before the = sign.

df['fpaths_str'] = df['fpaths'].apply(lambda x: str(x)); df.head()
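
Because the lambda only forwards its argument to str(), the function str can be passed directly. The sketch below shows the equivalence with plain Python's map() on two invented paths; with pandas the same shortcut would be df['fpaths'].apply(str).

```python
from pathlib import PurePosixPath

# Invented example paths; PurePosixPath keeps the '/' separators on any OS.
paths = [PurePosixPath("before/10253/0/a.png"),
         PurePosixPath("before/10253/1/b.png")]

# lambda x: str(x) is an anonymous function that just calls str() on its argument.
with_lambda = list(map(lambda x: str(x), paths))
# Passing str itself does the same job without the wrapper.
with_str = list(map(str, paths))

print(with_lambda == with_str)
```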

Parse the class label from the path string

Use the split() function to separate the path string at each forward slash character ('/'). With expand=True the pieces are spread across separate columns, so you can pick the piece at your desired position (4 in the example below) and put it into a new dataframe column.

df['class_label'] = df['fpaths_str'].str.split('/', expand=True)[4]; df.head()
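
A fixed index such as 4 breaks as soon as the folder depth changes. As an alternative sketch, assuming the class label is always the name of the folder that directly contains the image (the path below is invented to follow that layout), pathlib can read the label without counting slashes.

```python
from pathlib import PurePosixPath

# Hypothetical path following the <patient_id>/<class_label>/<file> layout.
p = PurePosixPath("data_GPU/breast_histo_PTM/before/10253/0/example.png")

by_index = str(p).split("/")[4]   # position counted from the start of the path
by_parent = p.parent.name         # name of the folder that directly holds the file

print(by_index, by_parent)
```

Both approaches give the same label here, but only parent.name keeps working if the data is moved one folder deeper.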

Export your dataframe to a csv file, so that you can upload it to new notebooks.

df.to_csv('./projects/breast-histo-PTM/dfs/20181213-path-class.csv')

Import your csv to a dataframe

df = pd.read_csv('./projects/breast-histo-PTM/dfs/20181213-path-class.csv'); df.head()

Note that any numbers that were originally strings are now integers or floats.
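
The type change can be reproduced on a tiny dataframe. This is a minimal sketch with invented rows, using an in-memory buffer in place of a file; it also shows two optional tweaks: index=False to avoid writing the row index as an extra column, and the dtype argument of read_csv to keep labels as strings.

```python
import io
import pandas as pd

df_out = pd.DataFrame({"fstems": ["a", "b"], "class_label": ["0", "1"]})

buffer = io.StringIO()
df_out.to_csv(buffer, index=False)   # index=False skips the unnamed index column

buffer.seek(0)
df_default = pd.read_csv(buffer)     # class_label is inferred as an integer column

buffer.seek(0)
df_str = pd.read_csv(buffer, dtype={"class_label": str})  # labels stay strings

print(df_default["class_label"].tolist(), df_str["class_label"].tolist())
```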

Step 2: Make a databunch

A databunch is composed of datasets and dataloaders. There is one dataset and one dataloader for each group: train, valid, and (optionally) test. A dataset includes your actual images and any labels that go with them. A dataloader resizes the images, applies transforms (when appropriate), and passes the images to the processing unit in batches.

Each dataset (e.g. train_ds and valid_ds) has an ItemList (x), which is an ImageItemList for images, and a CategoryList (y). The ImageItemList identifies the type of item, which is an Image, and lists the dimensions of the tensor(?). The dimensions of the images I am using are: 3, 50, 50. Three is for the three color channels of a color picture (RGB). The images are square, so 50 pixels wide and 50 pixels high. Note that the order of the height and width can confuse you.

To make a databunch from a dataframe, use fastai’s data_block API. Fastai calls this an API because it is an intermediary that is used to pipe data from its storage place into the learning process of your choice. The ‘data_block’ term indicates that a databunch is made by bringing together many blocks of code. For images, creating a databunch can take five steps. It starts with (1) the creation of an ImageItemList, then (2) a function to split up the images into a training and a validation set, this is followed by (3) a function to label the images, then (4) methods to transform the images, and finally (5) it is all bundled up into a databunch.

The first step in creating a databunch is to make an ImageItemList. In this case we’ll make it from the paths that are in the column labeled ‘fpaths’.

ImageItemList.from_df(path, df, cols) takes three arguments. Upon concatenation, the path and cols variables need to represent the absolute path to each image. The cols argument should be fed the dataframe column name that holds the final part of the path to your images and it needs to be a string. The end of the path variable will dictate where a new directory, called “models” that will contain your trained convolutional neural network (cnn) weights, will be created. For my project, I navigated into the base folder in my “course-v3” repo.

Define the three arguments for ImageItemList:

path = Path('.') or path = '.' both work
df = df (see above)
cols = 'fpaths' this is the column name within my ‘df’ dataframe. Note that the folder ‘data_GPU’ is inside my ‘course-v3’ folder.

Be sure to navigate to your chosen base folder, in my case ‘course-v3’, before running the next line.

data = ImageItemList.from_df(path=path, df=df, cols=cols); data

A successful output will look like this:

ImageItemList (277524 items)
[Image (3, 50, 50), Image (3, 50, 50), Image (3, 50, 50), Image (3, 50, 50), Image (3, 50, 50)]...
Path: .

One great way to get to know the inner workings of any python package is to use the tab key. In a new field in your jupyter notebook type data. and press tab. Pressing tab after a . is the same as typing dir(data) and running the code. The dir() function returns a list of the attributes and methods of any object. The list below shows the attributes and methods available for data after creating an ImageItemList. To determine what each returns, simply type the full phrase, eg data.convert_mode, and run the expression in a jupyter notebook cell. Attributes return values, while functions return something that starts with “<bound method”.

data.analyze_pred     RETURNS: <bound method ...
data.convert_mode     RETURNS: 'RGB'
data.copy_new         RETURNS: <bound method ...
data.filter_by_(folder, func, rand)  RETURNS: <bound method ...
data.from_(csv, df, folder)          RETURNS: <bound method ...
data.get              RETURNS: <bound method ...
data.get_label_cls    RETURNS: <bound method ...
data.items            RETURNS: array([paths to .pngs], dtype='<U79')
data.label_(cls, const, empty)
data.label_from_(df, folder, func, list, re)
data.new              RETURNS: <bound method ...
data.num_parts        RETURNS: 0
data.open             RETURNS: <bound method ...
data.path             RETURNS: PosixPath('.')
data.process          RETURNS: <bound method ...
data.process_one      RETURNS: <bound method ...
data.processor        RETURNS: <bound method ...
data.random_split_by_pct       RETURNS: <bound method ...
data.reconstruct      RETURNS: <bound method ...
data.show_xys         RETURNS: <bound method ...
data.show_xyzs        RETURNS: <bound method ...
data.sizes            RETURNS: {0: torch.Size([50, 50]), 1: ...4: )}
data.split_by_(files, fname_file, folder, idx, idxs, list, valid_func)           RETURNS: <bound method ...
data.split_from_df    RETURNS: <bound method ...
data.to_text          RETURNS: <bound method ...
data.use_partial_data RETURNS: <bound method ...
data.x                RETURNS: <bound method ...
data.xtra             RETURNS: the whole dataframe
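
The attribute-versus-method distinction in the list above can also be checked programmatically. The sketch below uses a toy class named Demo (not part of fastai) with one attribute and one method, and sorts the names returned by dir() into the two kinds.

```python
class Demo:
    """Toy class (not part of fastai) with one attribute and one method."""
    def __init__(self):
        self.convert_mode = "RGB"

    def get(self, i):
        return i

obj = Demo()
# Skip the underscore-prefixed names that every Python object carries.
public = [name for name in dir(obj) if not name.startswith("_")]
kinds = {name: "method" if callable(getattr(obj, name)) else "attribute"
         for name in public}

print(kinds)
print(repr(obj.get))   # methods print as '<bound method ...>'
```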

The list above gives you an idea of what expressions you can perform next. To learn more about each of the functions you can put ItemList or ImageItemList before the dot and use ?? or doc(), for example:

??ItemList.analyze_pred

?? returns the signature, which tells you what arguments the function takes, returns the docstring, and returns a small snippet of source code

doc(ItemList.analyze_pred)

doc() returns a link to the appropriate fastai docs page and a link to the full source code in github

I don’t think a single ? or help() are useful with the fastai library, since they tell you less than the info provided by the two information retrieval methods above. I prefer using doc() because going to the source code modules helps me understand how everything fits together.

However doc() doesn’t always work! When, for example, adding ItemList before the dot (.) doesn’t work (as it won’t for some attributes and functions below) your best bet is to look in the same module of the source code for the last function where it did work. For example doc(ItemList.transform) returns AttributeError: type object 'ItemList' has no attribute 'transform'. This error is thrown because an intermediate step returned objects that were inherited by new classes and .transform() is within one or many of those new classes. Luckily within fastai, the data_block.py file contains all the functions that are used to make a databunch, so you can search the data_block source code file directly to find where the transform() function is defined. Note that doc(ItemLists.transform), doc(LabelList.transform), and even doc(LabelLists.transform) all take you to the same data_block.py source code file. Also note that the highlighted line that fastai takes you to is not always relevant and you need to search through the .py file to find the function you are interested in learning more about.

When you see anything with process or processor think ‘pre-processes’. Pre-processes are methods that you need to make sure you apply while all of your data is still together, meaning before you split it into training and validation sets. For example, if you were using tabular data and wanted to fill NaNs with an average you need that average to be the same in your training and validation datasets. With images, if you wanted to normalize the RGB channels, maybe you would need to do that before splitting your data.

As the label_ and split_ entries in the list above show, after performing .from_df() you can either label your images or split them into training and validation sets.

If you label first:

data = data.label_from_df(cols='class_label')

data becomes a non-subscriptable object, but data.(tab) indicates that you can perform the following:

data.c               RETURNS: object has no len()
data.export
data.item            
data.load_empty
data.new             RETURNS: object is not subscriptable
data.predict
data.process
data.set_item
data.tfm_y           RETURNS: False
data.tfmargs         RETURNS: {}
data.tfms            RUNS 
data.to_(csv, df)
data.transform       RETURNS: object is not subscriptable
data.transform_y     RETURNS: object is not subscriptable
data.x               RETURNS: ImageItemList (# items) [Image (3, 50, 50), ...] Path: .
data.y               RETURNS: object is not subscriptable

If you split first:

data = data.random_split_by_pct(valid_pct=0.2, seed=10)

data returns a Train and a Valid ImageItemList, and you can perform the following:

data.label_from_lists
data.lists              RETURNS: ImageItemLists for Train & Valid
data.path               RETURNS: PosixPath('.')
data.test
data.train              RETURNS: ImageItemList of train
data.transform          RETURNS: <bound method ... of ItemLists;...
data.transform_y        RETURNS: <bound method ... of ItemLists;...
data.valid              RETURNS: ImageItemList of valid

In lesson 7, Jeremy says we should split first, then label. Note that the above list of data. functions does not include data.label_from_df. This is probably a bug, so I reported it here. (Check out my short blog with useful links on how to contribute to fastai.) Indeed, it was a bug and it was fixed within 24 hours— thanks Sylvain Gugger! The fix required manually stipulating some extra attributes and methods that could be performed on the object. The dir() function did not perform as desired on its own because __getattr__ was redefined in the ItemLists() class.
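
The effect of redefining __getattr__ is easy to reproduce. This is a toy sketch, not fastai's code: Wrapper forwards unknown attributes to an inner object, so the forwarded method works but dir() does not list it. Defining __dir__, as the hypothetical WrapperWithDir does, is one way to advertise the extra names; whether the fastai fix took exactly this form is an assumption.

```python
class Inner:
    def label_from_df(self):
        return "labelled"

class Wrapper:
    """Toy stand-in that forwards unknown attributes to an inner object."""
    def __init__(self):
        self.inner = Inner()

    def __getattr__(self, name):
        # Only called when normal lookup fails; forwards to the wrapped object.
        return getattr(self.inner, name)

class WrapperWithDir(Wrapper):
    def __dir__(self):
        # Advertise the forwarded names so tab completion can find them.
        return sorted(set(super().__dir__()) | set(dir(self.inner)))

w, w2 = Wrapper(), WrapperWithDir()
works = w.label_from_df()                     # the call succeeds...
listed_before = "label_from_df" in dir(w)     # ...but dir() does not show it
listed_after = "label_from_df" in dir(w2)     # until __dir__ is defined

print(works, listed_before, listed_after)
```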

It seems a little weird to label after splitting, since if you have a class with a small number of images, you may end up with a training or validation dataset that doesn’t have any of one class. I posted to the forum to see if there is a way to stipulate that you want to split each class by a certain percent and got some suggestions. I won’t pursue this now, since I don’t need it for my current dataset.
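
For reference, splitting each class by the same percentage is usually called a stratified split. Below is a minimal pure-Python sketch of the idea, not one of the forum suggestions and not a fastai function; stratified_split is a hypothetical helper and the labels are invented.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_pct=0.2, seed=10):
    """Return (train_idx, valid_idx) with roughly valid_pct of each class in valid."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Keep at least one validation item for any class with two or more items.
        n_valid = max(1, round(len(idxs) * valid_pct)) if len(idxs) > 1 else 0
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

labels = ["0"] * 10 + ["1"] * 5      # an imbalanced toy dataset
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

The resulting validation indices could then be handed to an index-based split method, such as the split_by_idx entry that shows up in the tab-completion list, although I have not checked its exact signature.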

The function .random_split_by_pct() does what it says. It splits your data into two folders, one labeled “train” and one labeled “valid”. If you want a “test” set, you have to add it separately. Note that you will not be able to see these folders on your hard drive because you have not saved them. The information is stored in your data object. The split function is given two arguments:

valid_pct=0.2 tells the split function to put 20% of the images into a folder labeled “valid” and 80% of the images into a folder labeled “train”
seed=10 Setting a seed makes the randomness repeatable. Random functions in Python aren’t really random; they get their randomness by following a pattern from a different starting point each time. If you set the starting point, aka seed, to the same place each time, you will get the same images in each set. It is important to include a seed so that you can compare the accuracy of one run to a later analysis.
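
The effect of the seed can be demonstrated with Python's own random module. The function random_split below is a toy stand-in, not fastai's implementation: it shuffles the item indices and cuts off the requested fraction for validation.

```python
import random

def random_split(n_items, valid_pct, seed):
    """Toy split: shuffle the indices, then cut off valid_pct for validation."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(n_items * valid_pct)
    return idxs[cut:], idxs[:cut]   # (train, valid)

train_a, valid_a = random_split(100, 0.2, seed=10)
train_b, valid_b = random_split(100, 0.2, seed=10)

print(valid_a == valid_b)   # same seed, same validation set
```

Calling the function again with a different seed will almost certainly put different items into the validation set, which is why accuracy numbers from runs with different seeds are harder to compare.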

Code recap

To recap, I have now performed the following commands, in succession:

data = ImageItemList.from_df(path=path, df=df, cols='fpaths')
data = data.random_split_by_pct(valid_pct=0.2, seed=10)
data = data.label_from_df(cols='class_label')

The list below shows the attributes and methods that are available after splitting, then labelling.

data.add_test          RETURNS: <bound method...
data.add_test_folder   RETURNS: <bound method...
data.databunch         RETURNS: <bound method...
data.get_processors    RETURNS: <bound method...
data.label_from_lists  RETURNS: <bound method...
data.lists             RETURNS: x: ImageItemList and y: CategoryList for train and valid LabelLists (not clearly labeled)
data.path              RETURNS: PosixPath('.')
data.process           RETURNS: x: ImageItemList and y: CategoryList for train and valid LabelLists (clearly labeled)
data.test
data.train             RETURNS: x: ImageItemList and y: CategoryList for train LabelList
data.transform         RETURNS: <bound method...
data.transform_y       RETURNS: <bound method...
data.valid             RETURNS: x: ImageItemList and y: CategoryList for valid LabelList

Technically, you could perform any of the ‘bound’ methods next, including going straight to a databunch. Skipping .transform() is attractive because fastai has experimented and selected default settings that work best for… I’m not sure what. They might be best for differentiating between cat and dog breeds, which is definitely not the same as histology images! However, even though you can make a databunch object without including .transform(), your databunch object does not have all the attributes needed for downstream methods like data.show_batch() or learn.fit_one_cycle(). Therefore we have to apply some transformations.

Transformations (aka image augmentation or transforms)

Transforms were covered by Jeremy in fastai DL1 course-v3, lesson 6 (see notebook “lesson6-pets-more”). During this lesson, Jeremy mentioned that there is a lot to be determined in terms of which transformations improve the models of different types of images that we want to classify.

In the data_block.py module, the transform() function is first defined within the class ItemLists(). It takes the arguments self, tfms, and **kwargs. Note that tfms is a tuple of transformation lists (TfmList,TfmList). The first TfmList is applied to the training set and the second TfmList is applied to the validation set. The transform() function is applied to the x variables (the ImageItemList) of the sets.
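
The idea of a (training transforms, validation transforms) tuple can be illustrated without fastai. In this toy sketch the "image" is just a row of numbers and the transforms are ordinary functions; none of the names below are fastai functions.

```python
def flip_horizontal(row):
    """Toy transform: reverse a row of pixel values."""
    return row[::-1]

def identity(row):
    """Toy transform that leaves the row unchanged."""
    return row

def apply_tfms(row, tfm_list):
    """Apply each transform in the list, in order."""
    for tfm in tfm_list:
        row = tfm(row)
    return row

# (training transforms, validation transforms), mirroring the (TfmList, TfmList) tuple.
tfms = ([flip_horizontal], [identity])
train_tfms, valid_tfms = tfms

train_out = apply_tfms([1, 2, 3], train_tfms)
valid_out = apply_tfms([1, 2, 3], valid_tfms)
print(train_out, valid_out)
```

Keeping the validation list free of random changes means the validation score is measured on the same images every epoch.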

In practice, to add the transform block to the data processing path use the following code, which has 3 arguments:

data = data.transform(tfms=tfms, size=49, padding_mode=padding_mode)

tfms the argument tfms needs to be defined with the function get_transforms() (see below for more details)
size=49 Setting a size as a kwarg is important here. Jeremy mentioned that multiples of 7 are the best use of computer processing power. This has something to do with the model (e.g. resnet34) that you will use to fit the data. I started with images that were 50x50 pixels and I thought this would make each picture into a tensor of 3x49x49, but my ImageItemList still says each image is (3, 50, 50)
padding_mode This argument can be set to equal any of the following strings: ‘reflection’, ‘zeros’, …. The default is ‘reflection’. ‘zeros’ is easiest to see because it is black

Let’s define the argument tfms. These are transformations that you want to have performed randomly to some of your images before they are passed into your neural net. Below is a list of all the options that fastai offers in the function get_transforms(). The actions of these variables are described in the fastai vision docs. The code is in the file path fastai/fastai/fastai/vision/transform.py.

tfms = get_transforms(do_flip=True, 
                      flip_vert=True, 
                      max_rotate=4., 
                      max_zoom=1.1, 
                      max_lighting=0.2, 
                      max_warp=0., 
                      p_affine=0.75, 
                      p_lighting=0.75)

get_transforms() does more than set these variables. It also performs crop_pad() and rand_crop(). These ensure your images are made into squares. For now (i.e. while we are still in DL1) all pictures need to be squares. If you downloaded your images from the internet, chances are they are different sized rectangles. To squarify images, you need to either use get_transforms() or feed .transform() the tuple ([rand_crop()], [crop_pad()]) in place of the output of get_transforms(). [rand_crop()] is applied to the training set and [crop_pad()] is applied to the validation set of images. In data_block format, this would look like the code below:

data = data.transform(([rand_crop()], [crop_pad()]),
                      size=49,
                      padding_mode=padding_mode)

To do:

Use code in lesson 6, under “Data Augmentation” header to visualize transforms …

try changing the rotation (and see if I can change the padding from zeros to reflection)
try changing the zoom (and see if images get blurrier)
what’s the difference between max_lighting & p_lighting?
what does p_affine do?
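
One reading of the fastai vision docs, offered as an assumption rather than a confirmed answer to the questions above: the max_ arguments bound how strong a change can be, while the p_ arguments set the probability that the change is applied to a given image at all. The toy function below (maybe_brighten is invented, and this is not how fastai implements lighting) separates the two roles.

```python
import random

def maybe_brighten(pixel, max_lighting, p_lighting, rng):
    """With probability p_lighting, shift brightness by up to max_lighting."""
    if rng.random() >= p_lighting:
        return pixel                       # transform skipped for this image
    change = rng.uniform(-max_lighting, max_lighting)
    return min(1.0, max(0.0, pixel + change))

rng = random.Random(0)
# p_lighting=0.0: the change is never applied, whatever max_lighting is.
never = [maybe_brighten(0.5, 0.2, 0.0, rng) for _ in range(100)]
# p_lighting=1.0: always applied, but never by more than max_lighting.
always = [maybe_brighten(0.5, 0.2, 1.0, rng) for _ in range(100)]

print(set(never), min(always), max(always))
```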

I will skip the list of possible functions that can be seen with data.(tab) after .transform() because it is very similar to the ones above.

Databunch

data = data.databunch(bs=128, num_workers=4)

The .databunch() function creates data loaders for each of the datasets using the given batch size. LabelLists.databunch calls _bunch.create(), which I believe somehow calls the DataBunch.create() function from fastai’s basic_data.py module. .databunch() can be given several arguments, which are passed on to DataBunch.create(): path, bs, num_workers, and collate_fn.

path will override self.path presumably from the path given in the first ItemList block
bs=128 Batch size: the default is 64. Since my images are relatively small, I doubled the batch size. If you get an out of memory error, it is a good idea to decrease batch size
num_workers is the number of CPUs to use and the default is 4
I don’t understand what the collate_fn is or does. It seems like there is a pytorch default called default_collate

One epoch runs all of your data through the neural net in batches of the size you determine here.
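
As a rough back-of-the-envelope sketch, the number of batches in one epoch is the number of training images divided by the batch size, rounded up. The item count comes from the ImageItemList output earlier; how the 20% validation share is rounded is my assumption, and fastai's exact rule may differ.

```python
import math

def batches_per_epoch(n_items, batch_size):
    """Batches needed to see every item once (the last batch may be smaller)."""
    return math.ceil(n_items / batch_size)

n_total = 277524                  # item count reported by the ImageItemList
n_valid = int(n_total * 0.2)      # assumed rounding of the 20% validation share
n_train = n_total - n_valid

train_batches = batches_per_epoch(n_train, 128)
print(n_train, train_batches)
```

A data loader that drops the last incomplete batch would report one batch fewer.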

Normalize

Normalization can be added to the data block if necessary.

Question: for images, should this be done before splitting?

data = data.normalize()
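
As a minimal sketch of what normalizing a channel means, assuming the usual definition of subtracting the mean and dividing by the standard deviation: the pixel values below are made up, and which statistics fastai uses when .normalize() is called without arguments is not covered here.

```python
from statistics import mean, pstdev

def normalize_channel(values):
    """Shift and scale one color channel to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

red_channel = [0.2, 0.4, 0.6, 0.8]     # made-up pixel intensities
normalized = normalize_channel(red_channel)
print(normalized)
```

Whatever mean and standard deviation are chosen, the same pair has to be applied to both the training and validation images, which is the concern behind the question above.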

Recap

A cleaner way to program the creation of a databunch is:

from fastai.vision import *

# df is the dataframe built above
path = '.'
tfms = get_transforms()
padding_mode = 'reflection'

data = (ImageItemList.from_df(df=df, path=path, cols='fpaths')
                     .random_split_by_pct(valid_pct=0.2, seed=10)
                     .label_from_df(cols='class_label')
                     .transform(tfms, size=49, padding_mode=padding_mode)
                     .databunch(bs=128, num_workers=4))

data.(tab) now has tons of attributes and methods:

data.add_tfm
data.batch_size
data.batch_stats
data.create
data.create_from_ll
data.device
data.dl
data.dls
data.export
data.from_(csv,df,folder,lists,name_func,name_re)
data.load_empty
data.load_func
data.normalize
data.one_batch
data.one_item
data.path
data.show_batch
data.single_dl
data.single_ds
data.single_from_classes
data.test_dl
data.test_ds
data.tfms
data.train_dl
data.train_ds
data.valid_dl        RETURNS: DeviceDataLoader(dl= ...
data.valid_ds        RETURNS: LabelList x:ImageItemList (# of items) Image (3, 50, 50) y:CategoryList (# of items)

data.classes and data.c aren’t in the list, but they are performable (probably another bug). data.classes returns a list of class labels. data.c returns the number of classes (there will be more to this number in DL2).

You can also call len() on the datasets, e.g. len(data.train_ds) to see how many images are in each dataset.

View your data

data.show_batch(rows=3, figsize=(7,7), hide_axis=False)

rows The show_batch function is set up so that it always shows the same number of rows and columns. Meaning that if you assign rows=3, then you are also assigning columns=3, so you will see a 3x3 set of images
figsize 7 by 7 fits well on a screen
hide_axis shows the ticks on the side of the images when set to False (this should have been called show_axis, so True would show them!)

Train a model

First, create a learn object

learn = create_cnn(data, models.resnet34, metrics=[error_rate, accuracy])
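
The two metrics requested here are complementary, assuming error_rate is defined as 1 minus accuracy (my understanding of fastai's definition). The sketch below computes both by hand on invented predictions; accuracy_and_error is a hypothetical helper, not a fastai function.

```python
def accuracy_and_error(predicted, actual):
    """Return (accuracy, error_rate) for two equal-length lists of class labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    acc = correct / len(actual)
    return acc, 1 - acc

predicted = [0, 1, 1, 0, 1]   # invented model outputs
actual = [0, 1, 0, 0, 1]      # invented true labels
acc, err = accuracy_and_error(predicted, actual)
print(acc, err)
```

Tracking both is therefore redundant, but it saves doing the subtraction when comparing against results that are reported as accuracy.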

Second, fit the model with fit_one_cycle

learn.fit_one_cycle(6)

6 is the number of epochs to run

Learn about fit_one_cycle from fastai’s train.py module.

What is a learner? From fastai’s basic_train.py module: the learner class is a “trainer for `model` using `data` to minimize `loss_func` with optimizer `opt_func`.”

What is fit? “Fit the model on this learner with `lr` learning rate, `wd` weight decay for `epochs` with `callbacks`.”

Experimenting with `.transforms()`

To avoid transformations, I first tried creating a data object without including .transform() as shown below, but I got an error.

data = (ImageItemList.from_df(df=df, path=path, cols='fpaths')
                     .random_split_by_pct(valid_pct=0.2, seed=10)
                     .label_from_df(cols='class_label')
                     .databunch(bs=128))

Next, I fed the following three sets of arguments to .transform(), then performed .fit_one_cycle(6) to see which returned the highest accuracy after 1 and 6 epochs.

FFandZero

tfms = get_transforms(do_flip=False, 
                      flip_vert=False, 
                      max_rotate=0., 
                      max_zoom=0., 
                      max_lighting=0., 
                      max_warp=0., 
                      p_affine=0., 
                      p_lighting=0.)

data = data.transform(tfms, size=49, padding_mode='reflection')

Tfm defaults, except max_warp=0

I had to set max_warp to zero since an error was returned when it was not zero. That turned out to be due to a bug in PyTorch and I could have used the nightly update to fix it. (See my post and Sylvain’s response on the fastai forum.)

tfms = get_transforms(do_flip=True, 
                      flip_vert=True, 
                      max_rotate=10., 
                      max_zoom=1.1, 
                      max_lighting=0.2, 
                      max_warp=0., 
                      p_affine=0.75, 
                      p_lighting=0.75)

data = data.transform(tfms, size=49, padding_mode='reflection')

rand_crop crop_pad Only

data = data.transform(([rand_crop()], [crop_pad()]), 
                      size=49, 
                      padding_mode='reflection')

As you can see in the results table above, not using any transformations worked better than using the transformation defaults (without warping) after both 1 and 6 epochs. I expected “rand_crop and crop_pad Only” to return results similar to the absence of transformations. I’m not sure why they are different and will need to inspect the code further (and post to the forum to get help) to figure out why they differ.

Written by Deena Blumenkrantz