Towards Advanced Accommodation: Deep Learning for Photos Classification — Part 2

Arie Pratama Sutiono
Published in Airy ♥ Science · Oct 4, 2019

We described the problem, transfer learning, oversampling, and data augmentation in the previous post [13]. In this post, we will describe slightly more advanced techniques that we have used in our image classification engine: choosing an optimizer, selecting a learning rate, unfreezing partial layers, and adding dropout to regularize our model.

Choosing an Optimizer: AdamW, AMSGrad, and RAdam

The problem with Adam is its convergence [11]; for some tasks, it has also been reported to take a long time to converge if not properly tuned [10].

There have been fixes to Adam since then, and AMSGrad is one of them. Its idea is to keep the maximum of past squared gradients rather than their exponential average, so the adaptive step size cannot blow up. However, as reported by fast.ai [10], it turns out to have disappointing performance: in their experiments, AMSGrad consistently performed worse than Adam.

The idea of AdamW [15] is to decouple the weight decay computation from the loss-based gradient update. The authors argue that while L2 regularization and weight decay are equivalent for SGD, this is not the case for momentum-based optimizers like Adam. They proposed that weight decay in Adam should therefore be applied differently [10]. You may find an AdamW implementation in native PyTorch [16].
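As a minimal sketch of using the native PyTorch AdamW (the model below is just a placeholder for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model for illustration

# AdamW applies weight decay directly to the weights instead of adding
# an L2 term to the loss, as plain Adam implementations do.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```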

The basic idea of RAdam [12] is to rectify the adaptive momentum term, especially at the beginning of training. Although Adam has been used successfully in many deep learning tasks, it introduces large variance early in training, because the exponential moving average (the "momentum" used to adapt the learning rate) is estimated from only a handful of samples. This leads to an inconsistent gradient distribution being used to update the weights during backpropagation. RAdam tries to minimize that phenomenon by "rectifying" the momentum term.

Comparison of vanilla Adam and Adam with warmup [12], showing that Adam has large variance in the early stage of training.

Using RAdam did not necessarily increase our accuracy, yet it significantly sped up convergence, from 40–50 epochs down to 5–10 epochs! This is beneficial when collecting results from multiple models and training schemes. Moreover, the paper's authors have provided a PyTorch implementation of RAdam [9].
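A rough sketch of swapping it in as a drop-in replacement for Adam, assuming radam.py from [9] is placed next to the training script (the model is again a placeholder):

```python
import torch.nn as nn
from radam import RAdam  # radam.py from the authors' repository [9]

model = nn.Linear(10, 2)  # placeholder model for illustration

# Drop-in replacement for Adam; the rectification happens inside the optimizer.
optimizer = RAdam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```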

Learning Rate

You might wonder, "How should I set my learning rate for a given optimizer?" First, you should not set it too small, because training will take too long and you will get bored waiting. On the other hand, you should not set it too large, because you might overshoot the local optimum.

Tip: watch out for typos! In our experience, we once accidentally set our learning rate to 1e3 instead of 1e-3!

If you are using the Adam optimizer [1], you might want to try Karpathy's magic number, 3e-4 [2]. But how sure are we that it will be the right number? Leslie Smith [3] proposed a simpler method than the well-known grid search.

Decaying Learning Rate

We initially experimented with a relatively large learning rate and decayed it as training progressed. The big idea is to learn more slowly as the model approaches the local optimum, as explained by Andrew Ng [6]. One of the simplest things you could try is to decay your learning rate with PyTorch's StepLR scheduler [7].
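A minimal sketch of StepLR; the step size and decay factor below are illustrative values, not the exact ones we used:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Multiply the learning rate by gamma every step_size epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... one epoch of training with optimizer.step() per batch ...
    scheduler.step()  # decay the learning rate according to the schedule
```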

Cyclical Learning Rate and Learning Rate Range Test

Cyclical Learning Rate idea:

  1. Select a minimum and a maximum learning rate, plus a step size (in iterations or epochs).
  2. Train while cycling the learning rate between the two bounds: increase it during one half of the cycle, then decrease it during the other half (see the sketch after this list).
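A minimal sketch using PyTorch's built-in CyclicLR scheduler; the bounds, step size, and model are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Cycle the learning rate between base_lr and max_lr; step_size_up is the
# number of batches spent going from base_lr up to max_lr.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-3, max_lr=6e-2, step_size_up=200, mode="triangular"
)

for batch in range(1000):
    # ... forward / backward / optimizer.step() for one batch ...
    scheduler.step()  # update the learning rate every batch
```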

LR range test idea:

  1. Start with a small learning rate.
  2. At each step, increase the learning rate.
  3. Watch for the point where the validation loss starts to increase (the learning rate is too large beyond that point).

Protip: for PyTorch users, you might want to check out pytorch-lr-finder [4] or fast.ai [5] instead of building your own LR Finder.

We used the LR Finder method to select an approximately optimal learning rate for the chosen optimizer. Using LRFinder is straightforward, as shown in the code below.
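A sketch following the usage documented in [4]; the model, criterion, and data loader here are tiny placeholders standing in for our actual network, loss, and training data:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch_lr_finder import LRFinder  # pip install torch-lr-finder [4]

# Tiny placeholder model and dataset so the snippet is self-contained.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-7)
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,))), batch_size=32
)

lr_finder = LRFinder(model, optimizer, criterion, device="cpu")
lr_finder.range_test(train_loader, end_lr=1, num_iter=100)  # exponentially increase the LR
lr_finder.plot()   # plot loss vs. learning rate to pick a good range
lr_finder.reset()  # restore the model and optimizer to their initial state
```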

LRFinder Result Plot

We then chose a learning rate between 1e-2 and roughly 6e-2, because the plot shows the loss still decreasing within that range.

Unfreezing Higher Layers

The usual transfer learning approach is to train a base network and then copy its first n layers to the first n layers of a target network. The remaining layers of the target network are then randomly initialized and trained toward the target task.

Yosinski et al. [8] described this usual way of doing transfer learning. In our task, we let the top 20% of layers be trained for our photo classification engine. Example code for PyTorch can be seen below.
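A rough sketch of what this looks like in PyTorch; the simple 80/20 split over the parameter list is an illustration, not our exact layer selection:

```python
import torch
from torchvision import models

# Start from a pre-trained DenseNet169 and keep only the top ~20% of
# parameters trainable; the lower ~80% are frozen.
model = models.densenet169(pretrained=True)

params = list(model.parameters())
split = int(len(params) * 0.8)

for p in params[:split]:
    p.requires_grad = False  # freeze the lower layers
# The higher layers keep requires_grad=True and start from the
# pre-trained weights, not from random initialization.

# Only pass the trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```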

As you might see in the code above, we take a slightly different approach from the one suggested in the paper. Instead of initializing the higher layers with random weights, we start from the pre-trained weights and let the network fine-tune from there.

Adding Dropout to The Last Layer

It is well known in transfer learning that the last layer must be changed to match the number of target classes. What we found in dealing with class imbalance is that oversampling + data augmentation can lead to overfitting on classes with few samples. One of the easiest yet most powerful approaches, and one that is easy to miss, is to add a Dropout [14] layer at the last layer.
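For example, a sketch of a DenseNet169 head with dropout added before the final linear layer; num_classes and the dropout probability are illustrative values:

```python
import torch.nn as nn
from torchvision import models

num_classes = 8  # illustrative; set to the actual number of photo classes
model = models.densenet169(pretrained=True)

# Replace the classifier head: dropout regularizes the last layer, which
# helps the oversampled minority classes overfit less.
in_features = model.classifier.in_features
model.classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(in_features, num_classes),
)
```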

Results and Conclusion

Using transfer learning with the DenseNet169 architecture, combined with oversampling, data augmentation, layer unfreezing, RAdam, and dropout, results in a 91% F1-score. With the techniques described in this post and the previous one, we achieved a very good result from relatively little data (3,000 images). There are still some images that the model fails to classify correctly, but most of them are hard even for humans to tell apart.

Sample of Misclassified Images

Closing Remarks

At Airy, especially in Engineering, we embrace three big values, BIC: Bold, Innovative, and Customer-centric. We strive to bring current state-of-the-art technology to the accommodation space in the spirit of BIC. If you share the same passion, come and join us!

Acknowledgment

I want to thank Winson Waisakurnia for contributing ideas; Ali Akbar, who reviewed this blog post and prepared the Kaggle classroom so our team could join this internal competition; and all Airy data team members who collaboratively helped tag these photos and made this initiative happen.

References

  1. Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13
  2. https://twitter.com/karpathy/status/801621764144971776
  3. Smith, Leslie N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay. arXiv
  4. https://github.com/davidtvs/pytorch-lr-finder
  5. https://github.com/fastai/fastai
  6. Ng, Andrew. Learning Rate Decay (C2W2L09). https://www.youtube.com/watch?v=QzulmoOg2JE
  7. https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.StepLR
  8. Yosinski J, Clune J, Bengio Y, and Lipson H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (NIPS ’14), NIPS Foundation, 2014
  9. https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam.py
  10. https://www.fast.ai/2018/07/02/adam-weight-decay/
  11. Reddi, S., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR 2018.
  12. Liu, L., et al. (2019). On the Variance of the Adaptive Learning Rate and Beyond. arXiv.
  13. https://medium.com/airy-science/towards-advanced-accommodation-photos-classification-part-1-a64b542b31ed
  14. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research.
  15. Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
  16. https://pytorch.org/docs/stable/_modules/torch/optim/adamw.html
