Diving into deep learning — Part I — Ensembling to build a world-class image classifier — Top 3 on the leaderboard
Experiments solving tough business problems hosted on Hackerearth by FactorBranded Data Warriors with the [Power of fastai library]
[Precaution: A wall of text with analysis incoming. Also includes some unwanted monologue. The post below heavily uses the fastai library. If you have some experience with fastai, the code will make a lot of sense]
Problem statement background:
Build an automated system to classify T-shirts based on print or pattern type. We are given 70,000 images with their categories (an image classification problem), along with a validation set of over 15,000 images to test against on Hackerearth.
An image of the Myntra website with the categories — Abstract/Biker/Checked/Solid etc.
Unlike the other problems, there is potential to use this model to classify the current T-shirts on Myntra (which excited me). Myntra could also analyse which dress patterns are trending on social media and stock inventory based on that (70% of dresses designed are apparently not sold). Understanding the patterns has huge business implications for Myntra.
Note:
I took a 3-day break from work to solve this problem and Delhivery's problem (both required considerable effort). After mulling over the problem, I was able to get to around ~72% accuracy (7th/2300 currently — dated 23/3/2018) with a Densenet (link) approach. Getting 45GB onto my Paperspace server was arduous, followed by getting the huge images onto the convnet (cropping/resizing and proper preprocessing) and removing faulty/unresponsive links.
This blog is not about how I got to 72%, but rather the steps I am taking to improve the deep learning model. For reference, transfer learning from VGG16/Resnet18 gave about 65–70% accuracy.
Challenge overview and analysis of results that give 72% accuracy:
There are subtle patterns which are primarily left to the judgement of the person categorising. Building a model to extract the essence of why the image was categorised as such is the challenge.
I split the data 80–20 (20% of the 70,000 images as the cross-validation set); the confusion matrix is as follows:
30% of the data given is Solids (Yeahhhhh!!! we have 94% accuracy on Solids). The problem is categorising Graphic and Typography: Typography is the third largest category and Graphic the fourth.
Why is this problem hard? Because of THIS!
The above could be Typography (it has text), Graphic (a printed shirt), Abstract (the abstract pattern on the hand) or Striped (the band on the right hand looks similar to stripes), but it is actually Biker. To make matters worse, the background is a different colour with a weird pattern.
A look at some of the wrongly classified images:
Plan for the night ahead: Break into top 3 perhaps? Let’s try.
When in doubt, always do ENSEMBLE
As I continue to apply deep learning to computer vision problems, I have repeatedly been recommended to ensemble high-performing architectures to win competitions. After all, the winning entry in the 2016 ImageNet competition was an ensemble of 6 models. Let's get our hands dirty with ensembling by first figuring out the individual high-performing architectures.
How do we find high performing architectures:
- Find resizing or cropping which feels intuitive.
- Rescale to an image size for which results can be analysed quickly.
- Tune hyperparameters, ensuring that we don't overfit.
- Reiterate for different variations of architecture (Densenet/Resnet/Inception-Resnet/Resnext), choosing the best performing parameters for each architecture.
- Finally, ensemble them. Ensembling is the easiest part, as simple as ([A]+[B]+[C]+[D])/4.
Step 1 : Resizing or cropping
Let's take the image above and do some cropping.
A crop of 400px from the top, 100px from the bottom and 50px on each side gives the following.
The above should be a reasonable image to feed into the network, since in most of the images the person is facing in a similar direction. A small modification to transforms.py gives the above implementation:
Function scale_min — since the images are stored as numpy arrays, we can just slice them to crop:
r,c,*_ = im.shape
im = im[400:r-100,50:c-50,:]
r,c,*_ = im.shape
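In context, the modified function could look roughly like this (a sketch following fastai 0.7's transforms.py, where cv2 and the scale_to helper are already available; the crop slice is the one above, the rest of the body is an assumption):

def scale_min(im, targ, interpolation=cv2.INTER_AREA):
    """ Crop the fixed margins, then scale the image so the smaller side equals targ. """
    r,c,*_ = im.shape
    im = im[400:r-100, 50:c-50, :]   # 400px off the top, 100px off the bottom, 50px off each side
    r,c,*_ = im.shape                # recompute dimensions after the crop
    ratio = targ/min(r,c)
    sz = (scale_to(c, ratio, targ), scale_to(r, ratio, targ))
    return cv2.resize(im, sz, interpolation=interpolation)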
Step — 2: Rescaling image
There are a few limitations of the Paperspace P4000 instance. Images of size 320x320 with a batch size of 64 take about 15 min for preprocessing. (As Jeremy noted in Class 8 (will share the link when the fast.ai part 2 course of 2018 is open-sourced), the bottleneck was indeed happening during data augmentation.)
Note: The default crop type is CropType.CENTER; since we want to feed the entire image (but rescaled), CropType.NO makes sense for this problem.
Step-3: Reiterating over different architectures — given the constraints of the P4000 instance, we can only train the final layers (precompute=True).
Dataset split : (54676,13668)
Architecture 1 : Densenet
1.1 — Densenet 121 (Code):
Insights after training over Dn121
Accuracy of 75.24 (on the cross-validation set) with dropout at 0.4
0.005 was a good learning rate
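For reference, a minimal sketch of the fastai 0.7 training flow behind these numbers (the linked gist has the real code; PATH, the labels CSV and the folder name are placeholders, the calls are from the old fastai library):

from fastai.conv_learner import *
import pandas as pd

PATH = 'data/myntra/'                            # placeholder: root folder with the downloaded images
label_csv = f'{PATH}labels.csv'                  # placeholder: image-name -> category csv
n_images = len(pd.read_csv(label_csv))           # number of labelled images
val_idxs = get_cv_idxs(n_images, val_pct=0.2)    # the 80/20 split used throughout

arch, sz, bs = dn121, 320, 64
tfms = tfms_from_model(arch, sz, crop_type=CropType.NO)   # rescale the whole image, no centre crop
data = ImageClassifierData.from_csv(PATH, 'train', label_csv, bs=bs, tfms=tfms, val_idxs=val_idxs)

learn = ConvLearner.pretrained(arch, data, ps=0.4, precompute=True)   # only the head is trained
learn.fit(0.005, 3)   # lr = 0.005; 3 epochs as an example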
Confusion matrix for Dn121- Dropout at 0.4
The diagonal data are looking brighter (good sign)
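For reference, the confusion matrices in this post can be produced roughly like this (a sketch; plot_confusion_matrix is the helper shipped with fastai 0.7's plots module, and learn/data are the objects from the sketch above):

import numpy as np
from sklearn.metrics import confusion_matrix
from fastai.plots import plot_confusion_matrix

log_preds = learn.predict()                 # log-probabilities over the validation set
preds = np.argmax(log_preds, axis=1)        # predicted class per image
cm = confusion_matrix(data.val_y, preds)    # rows: true class, columns: predicted class
plot_confusion_matrix(cm, data.classes)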
Dropout at 0.20 fares marginally better, although the val_loss is a bit higher. Confusion matrix when ps = 0.2:
1.2 — Densenet 161 (Code): — the kernel crashed when the size was 320x320 (heartbreak -.- )
Running again with num_workers = 2 as a workaround (preprocessing now takes ~30 min).
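(The workaround is just the num_workers argument when building the data object; continuing the sketch above, and assuming from_csv's signature from fastai 0.7:)

data = ImageClassifierData.from_csv(PATH, 'train', label_csv, bs=64, tfms=tfms,
                                    val_idxs=val_idxs, num_workers=2)   # fewer loader processes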
A score of 74.09 when ps is at 0.2
Confusion matrix(normalize = True)
A score of 73.4 when ps = 0.4 (does not generalise better)
1.3 — Densenet 201 (Code):
Got a maximum score of 75.24
Result — Densenet121 and Densenet201 fare the same (75.24, with 61% and 58% accuracy on Graphic respectively). Note to self: is the validation data used for DN121 and DN201 the same? If so, both getting 75.24 implies the deeper and the shallower network have similar pattern-discerning capabilities on this problem.
Using Densenet121 — reasoning: marginally better classification of Graphic (0.61 vs 0.58)
Why does a dropout (ps) of 0.2 give better results?
It makes me consider the possibility that overfitting a bit is perhaps not a bad idea. The usual dropout parameter is 0.5 (50% of activations are dropped); dropping only 20% implies we are trading some generalisation for accuracy.
Perhaps an ensemble of an overfit Dn121 with a generalised Dn121 will give better results? — TODO
Architecture 2: Wide Residual Networks(Paper)
2.0- One flavour of architecture ( WRN50)
We can only vary the dropout parameters here. Again, constraints.
Dropout at 0.40 gives a maximum score of 73.4% accuracy
Dropout at 0.20 gives a score of 74.6% accuracy (again, lower dropout -> more active parameters -> better result). We are starting to see a pattern.
I wanted to make sure that what was being fed into the CNN was actually the cropped image we desired (should have done this earlier). A helper method to verify the image being fed:
def show_img(im, figsize=None, ax=None):
    # Plot a single image without axes (plt comes from the fastai imports)
    if not ax: fig,ax = plt.subplots(figsize=figsize)
    ax.imshow(im)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    return ax

# Grab one batch from the validation loader and undo the normalisation before plotting
x,y = next(iter(data.val_dl))
show_img(data.val_ds.denorm(to_np(x))[0]);
Voila,
Getting back to WRN:
Let's take a look at the Confusion Matrix (CM from here on),
For WRN, Dropout @ 0.2 seems the best.
Result : Choose WRN, @ 0.2
Architecture 3: InceptionNetV4 — a variation of GoogLeNet, which won the 2014 ImageNet challenge (Code)
Dropout @ 0.4 gives an accuracy of 67%, vs 71% accuracy with dropout @ 0.2.
Let's look at the CM when ps = 0.4:
Result: 71% accuracy is not that impressive. Ignore InceptionNet for this problem.
Architecture 3.1: InceptionResnet (Paper)(Code)
Analysis
Accuracy plateaus at 67% (for ps = 0.4) and 71% (for ps = 0.2)
CM when PS = 0.4
CM when PS = 0.2
Apparently, an ensemble of InceptionV4 and InceptionResnet outperformed Resnet151. Since we only get 71% accuracy with InceptionV4 and InceptionResnet, we can safely ignore Resnet architectures for this problem.
Result : Safely ignore InceptionV4, InceptionResnet, Resnet architectures
Architecture 4: ResNext (Paper) — Final Architecture(Code)
4.1 : ResNext50
From the above tries, it's clear that ps = 0.2 (or maybe 0.1 or lower — TODO) will give the best results.
Accuracy plateaued at 70.6 (When PS = 0.4)
CM at PS= 0.2 (Accuracy of 73.7) — 3% increase?
Result: Use ResNext, again with ps = 0.2, for better results.
Note: Since accuracy varies with the dropout parameter (by about 3%), we have to choose the best dropout parameter.
Step 3.1 : Experiments to choose best dropout parameters
How? Take our best performing architecture (Dn121) and vary dropout appropriately.
Digging into the source code, the default dropout parameters are [0.25, 0.50], meaning 25% of activations are randomly dropped in the first dropout layer of the head, while 50% are dropped at the final layer? - Not sure - TODO: dig more.
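For reference, this is roughly how the dropout parameter is passed in the old fastai library (a sketch continuing from the earlier one, with the values tried in this post):

learn = ConvLearner.pretrained(dn121, data, ps=[0.25, 0.5], precompute=True)   # library default: one value per dropout layer in the head
learn = ConvLearner.pretrained(dn121, data, ps=0.2, precompute=True)           # a single value is applied to all dropout layers in the head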
Done for the night. Will follow up on the analysis when time permits.
It's Sunday and back to choosing the best dropout parameters.
Tried the default [0.25, 0.5], plus [0.1], [0.2], [0.4] and [0.6]. Apparently, 0.2 does marginally better.
For 0.1, the training loss went as low as 0.2 but the validation loss did not reduce (overfitting).
0.2 is the best
Step 3.2 : Experiments to choose best size to feed into the network
Intuitively, removing 400px from the top, 100px from the bottom and 50px from each side makes sense. But what if the hands and their various positions cause our classifier to classify wrongly?
Example:
Let's take the image on the left: cropping 400px from the top and 100px from the bottom would give two colours in the middle, and this would not be classified as Solid (but it actually is).
The best way is perhaps to do object detection and find where the shirts are (draw a bounding box around the shirt and crop it), then feed that into the network. — TODO
For now, trying another variation: taking the centre 320x320 (most of the features can be interpreted from the centre of the shirts? — let's try).
In the fastai lib, the processed data for a specific architecture and size is stored in the tmp folder as numpy arrays, so running 320x320 again would not do the augmentation actually needed but reuse the tmp data (not sure about this). Edit: data augmentation happens at runtime.
Let's try 310x310 and check the images we feed into the network (with crop_type = CropType.CENTER).
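Continuing the earlier sketch, the only change for this experiment is the transform (sz = 310, centre crop); the data object is rebuilt with it:

tfms = tfms_from_model(arch, 310, crop_type=CropType.CENTER)   # resize, then crop the centre 310x310
data = ImageClassifierData.from_csv(PATH, 'train', label_csv, bs=64, tfms=tfms, val_idxs=val_idxs)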
Accuracy was at ~74%, but what caught my eye was the accuracy on Solid (94%; our best performing model gave about 91%), which makes total sense: cropping at the centre removes the features learnt from the weird background patterns or accessories worn by the person posing. Accuracy on Colorblocked is at 54% vs 60% (because the network does not see the entire shirt) -> again, getting a feel for why ensembling will work.
Time for Ensemble
Attempt One: Ensemble Dn121 and Dn201, and check scores on the Hackerearth validation set and our cross-validation set (Code)
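The ensemble itself is just an average of the predicted class probabilities, as in the ([A]+[B]+...)/n formula earlier. A minimal sketch, assuming learn_dn121 and learn_dn201 are the two trained learners (hypothetical names):

import numpy as np

log_preds_121 = learn_dn121.predict()    # log-probabilities on the validation set
log_preds_201 = learn_dn201.predict()
probs = (np.exp(log_preds_121) + np.exp(log_preds_201)) / 2   # ([A]+[B])/2
ens_preds = np.argmax(probs, axis=1)
acc = (ens_preds == data.val_y).mean()                        # ensemble accuracy on the 20% split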
76.8 accuracy over an ensemble of dn121 and dn201( OHHHHHHHH! ENSEMBLE WORKS!)
Best CM as of now (with considerable improvements to Floral, Checked and ColorBlocked)
Time to submit to Hackerearth to check on their validation set [Lets hope ensemble workssss!]
Hmm… got a score of 72.887 (marginally lower than my best score of 72.941), although the CM looks way better.
Attempt Two : Ensemble Dn121 and Dn201 with ResNext (Code)
77.2 accuracy — A slight improvement
Result on Hackerearth — 73.5 (a 0.5% improvement :) ) — my first ensemble which improved results.
Attempt Three: Ensemble Dn121 and Dn201 with ResNext (but with a 95%/5% validation split)
77.4% accuracy — an even slighter improvement, but this is over a small validation set.
Result on Hackerearth — 73.599 (approx. 0.1 difference). No substantial improvement.
Attempt Four (Final): Ensemble Dn121, Dn201, WRN and ResNext (Code)
Got 77.6% accuracy. Switched back to 80/20 split.
Result on Hackerearth — 74.05%(A bit better)
An improvement of ~1.8% with the ensemble; our ensembled models are not very different from one another, which might be the reason it is not larger.
Things we can further do:
- Important: We are only training the final layers. Unfreeze and train the initial layers too, using differential learning rates (see the sketch after this list).
- Differential learning parameters
- Augment better ( rotate/flip?)
- Try different sizes? (512x512 or 1024x1024)
- Change optimisers (Adagrad/Adam; we use SGD with momentum).
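A minimal sketch of the first item using the old fastai API (the /9 and /3 scaling of the base learning rate across layer groups follows the usual fast.ai convention and is an assumption here):

learn.precompute = False                      # stop using precomputed activations so augmentation and earlier layers matter
learn.unfreeze()                              # make the pretrained layer groups trainable
lrs = np.array([0.005/9, 0.005/3, 0.005])     # differential learning rates: smaller for earlier layer groups
learn.fit(lrs, 3, cycle_len=1)                # 3 SGDR cycles over the full network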
Going production level (Things needed):
- Object detection of shirts in an image: draw a bounding box, crop it and feed it to the network.
- Identify the text on each shirt and build a text classifier to decide whether it is Sports / Sports and Team Jersey / Humour / Superhero or simply Graphic.
Things I could have done better
- Spend time with the data — as I looked at a few images of the dataset, deciphering the pattern alone does not suffice. There are many categories where the content of the text matters. E.g. if a shirt says "New York", it is categorised as "People and Places", and "Arsenal" would be Sports and Team Jersey. These are difficult to learn via neural networks alone; we would have to build a text detection algorithm. We do badly with People and Places, Sports and Team Jersey and Varsity, all of which involve some text (Varsity implies numbers on shirts).
- Save and load weights
- Commenting the code a bit better?
- Predicting in a vectorized way
Things wrong with validation set on Hackerearth
- ~300 links are missing but they are not considered; although setting them as Striped gives a marginally better score.
- Several images fit more than one class, i.e. this is really multi-label rather than single-class image classification.
Example:
This image could be colour-blocked, text or stripes. A model which predicts only the single highest-probability class would probably consider this Graphic/Typography.
- Similar images in different classes (from the training set: one is in Music, the other in People and Places)
The final best confusion matrix that we have. Let's check our model on the offline validation set that we will be provided before the contest ends (first place is @ 76% now => 150 more images classified correctly).
Note :
There were various data leaks that I used to get to ~75–76% accuracy, mainly relating to Superheroes and Sports teams: the URL links had the names of the superheroes in them, e.g. Batman / Dawn of Justice etc. — not sure if this was allowed.
Blogging this has helped me track my progress. If anyone has read until here, thank you for taking the time. I would appreciate any feedback. This is way better than my whiteboard approach.
In part 2, I will be posting my entire approach for the Delhivery contest, where I am currently first on the leaderboard.