Data Cleaning in Deep Learning using FastAI!

Ijaz Khan
Published in unpack
3 min read, Oct 26, 2020

Without data, deep learning would not be a topic of discussion today, because data is the food for deep learning models. A model’s results depend not only on its architecture and parameters, but also on having the right data. What does “the right data” mean here?

The right data is data that is labelled correctly. Wrongly labelled data confuses the model and results in poor output.

Data for deep learning models is often gathered by querying the web. This way a huge amount of data can be collected, but there is a good chance that many of the items are labelled incorrectly, which can lead the model to give erroneous results.

Therefore, data collected from the web almost always needs cleaning. But cleaning and correctly labelling hundreds or thousands of items by hand would take a huge amount of time.

Here I will share my own experience of data cleaning with fastai. It really was a time saver.

I decided to make a “vehicles classifier”: a classifier that can tell apart car, truck, jeep, motorbike, and rickshaw, plus a “Random Picture” class for images that show none of these vehicles.

I used a Microsoft Azure Bing Image Search key, read from the environment, e.g. key = os.environ.get('AZURE_SEARCH_KEY', 'XXX').
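The download step could look roughly like the sketch below. It is not my exact code: the query strings, the "vehicles" folder name, and the helper names collect_urls, make_class_dirs, and download_all are made up for illustration. It also assumes the search_images_bing helper from the fastbook package and download_images from fastai; the 'contentUrl' field name depends on the fastbook version, so treat it as an assumption.

```python
import os
from pathlib import Path

# Class folder name -> search query. The query strings are illustrative assumptions.
QUERIES = {
    "car": "car",
    "truck": "truck",
    "jeep": "jeep",
    "motorbike": "motorbike",
    "rickshaw": "rickshaw",
    "random": "random picture",
}


def collect_urls(search_fn, queries, max_images=150):
    """Return {class_name: [url, ...]} using a caller-supplied search function."""
    return {
        label: list(search_fn(query))[:max_images]
        for label, query in queries.items()
    }


def make_class_dirs(root, labels):
    """Create one sub-folder per class so the folder name can serve as the label."""
    root = Path(root)
    for label in labels:
        (root / label).mkdir(parents=True, exist_ok=True)
    return root


def download_all(root="vehicles"):
    """Search Bing for every class and download the results (needs network and a key)."""
    # Assumption: fastbook and fastai are installed; imported here so the
    # helpers above stay usable without them.
    from fastbook import search_images_bing
    from fastai.vision.all import download_images

    key = os.environ.get("AZURE_SEARCH_KEY", "XXX")

    def bing(query):
        # 'contentUrl' is an assumption; older fastbook releases used 'content_url'.
        return search_images_bing(key, query).attrgot("contentUrl")

    urls = collect_urls(bing, QUERIES)
    root = make_class_dirs(root, QUERIES)
    for label, class_urls in urls.items():
        download_images(root / label, urls=class_urls)
```

Keeping the search function as a parameter makes it easy to try out different queries per class before downloading anything.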

I downloaded 150 images for each class and saved them. I trained a resnet18 model for 5 epochs, which gave me poor results. I tried increasing the number of epochs, but it still did not improve much. So I first used a confusion matrix to find out what was going wrong.
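For readers who want to reproduce the training step, here is a minimal sketch in the style of the fastai course notebooks. The 80/20 split, the 128-pixel resize, the use of fine_tune, and the function name train_vehicle_classifier are assumptions for illustration, not a record of my exact settings. Recent fastai versions call the learner factory vision_learner; when this post was written it was called cnn_learner.

```python
def train_vehicle_classifier(path, epochs=5):
    """Train a resnet18 on class-named folders; return (learner, interpretation)."""
    # Imported inside the function so this file loads even without fastai installed.
    from fastai.vision.all import (
        CategoryBlock,
        ClassificationInterpretation,
        DataBlock,
        ImageBlock,
        RandomSplitter,
        Resize,
        error_rate,
        get_image_files,
        parent_label,
        resnet18,
        vision_learner,
    )

    vehicles = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=RandomSplitter(valid_pct=0.2, seed=42),  # assumed 80/20 split
        get_y=parent_label,  # the parent folder name is the label
        item_tfms=Resize(128),  # assumed image size
    )
    dls = vehicles.dataloaders(path)
    learn = vision_learner(dls, resnet18, metrics=error_rate)
    learn.fine_tune(epochs)  # assumption: transfer learning via fine_tune
    interp = ClassificationInterpretation.from_learner(learn)
    return learn, interp


# In a notebook:
#   learn, interp = train_vehicle_classifier("vehicles")
#   interp.plot_confusion_matrix()
```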

The confusion matrix showed that many labels were predicted incorrectly, but thanks to the fastai shortcuts, I was able to see which data was labelled wrongly.
I used interp.plot_top_losses(20, nrows=5) to see which images were predicted with the highest loss. The image below shows the output.
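To make “highest loss” concrete, here is a toy sketch of the idea as I understand it: compute the cross-entropy loss of each image against its stored label and sort in descending order. The file names and probabilities are made up, and top_losses is a hypothetical helper, not part of fastai.

```python
import math


def top_losses(samples, k):
    """Rank samples by cross-entropy loss against their stored label.

    samples: list of (name, actual_label, {label: probability}).
    Returns the k worst entries as (loss, name, actual_label, predicted_label).
    """
    scored = []
    for name, actual, probs in samples:
        # Loss is high when the model gives the stored label a low probability.
        loss = -math.log(max(probs[actual], 1e-12))
        predicted = max(probs, key=probs.get)
        scored.append((loss, name, actual, predicted))
    return sorted(scored, reverse=True)[:k]


# Made-up examples: img_02 is stored as "random" but clearly looks like a car.
samples = [
    ("img_01.jpg", "car", {"car": 0.90, "truck": 0.10}),
    ("img_02.jpg", "random", {"car": 0.95, "random": 0.05}),
    ("img_03.jpg", "jeep", {"jeep": 0.40, "car": 0.60}),
]

for loss, name, actual, predicted in top_losses(samples, 2):
    print(f"{name}: actual={actual} predicted={predicted} loss={loss:.2f}")
```

An image with a wrong stored label tends to land at the top of this ranking, because a reasonably trained model confidently disagrees with the label.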

We can see from the above output that most of these images are labelled incorrectly, which means the model was trained on bad labels. This is where the fastai magic begins: it helps you easily re-label or delete wrong data.

fastai gives us the ImageClassifierCleaner widget, which displays an interactive panel in the notebook, just like the one in the image below.

As can be seen, for each image we are given the option to delete it, keep it, or re-label it as one of the other vehicle types.

In my case, most of the errors came from the query “Random picture”. It downloaded many random pictures, including all types of vehicles, which confused the model.

After a few rounds with ImageClassifierCleaner, I managed to correct many of my wrong labels, and my model’s results improved.
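The widget only records your choices; the files still have to be deleted or moved afterwards. Below is a small sketch of that step, following the pattern from the fastai course. The function name apply_cleaner_choices is hypothetical, and the notebook usage in the comments assumes a trained learn object and a path pointing at the image folders.

```python
import shutil
from pathlib import Path


def apply_cleaner_choices(fns, delete_idxs, change_pairs, root):
    """Delete and move image files according to the choices made in the widget.

    fns: list of image paths, indexed the same way as the widget's selections.
    delete_idxs: indices of images marked for deletion.
    change_pairs: (index, new_label) pairs for images to re-label.
    root: dataset folder that contains one sub-folder per label.
    """
    root = Path(root)
    for idx in delete_idxs:
        Path(fns[idx]).unlink()
    for idx, new_label in change_pairs:
        dest = root / new_label
        dest.mkdir(parents=True, exist_ok=True)
        # Moving into the new label's folder re-labels the image; this raises
        # an error if a file with the same name already exists there.
        shutil.move(str(fns[idx]), str(dest))


# In a notebook (assumed usage):
#   from fastai.vision.widgets import ImageClassifierCleaner
#   cleaner = ImageClassifierCleaner(learn)
#   cleaner  # display the widget and make your selections
#   apply_cleaner_choices(cleaner.fns, cleaner.delete(), cleaner.change(), path)
```

After applying the choices, the model needs to be retrained on the cleaned folders for the corrections to take effect.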

Now my model can predict new, unseen pictures correctly. See the picture below!

I uploaded a new image of a rickshaw; see the prediction results below.
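Predicting on a single new image can be sketched as follows. It assumes a trained fastai learner named learn; predict_image and top_class are hypothetical helper names, and the file name in the usage comment is made up.

```python
def top_class(vocab, probs):
    """Return (label, probability) for the most likely class."""
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[idx], float(probs[idx])


def predict_image(learn, image_path):
    """Run the learner on one image and return its best guess with confidence."""
    # learn.predict returns (decoded label, label index, per-class probabilities).
    _, _, probs = learn.predict(image_path)
    return top_class(list(learn.dls.vocab), probs)


# In a notebook (assumed usage):
#   label, confidence = predict_image(learn, "new_rickshaw.jpg")
#   print(f"{label} ({confidence:.1%})")
```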

Although fastai gives us shortcuts to clean our data easily, I would say it is very important to use correct and specific queries when collecting data from the web, which can save you some time.
