Week 7 — Landmark Identifier

Serhat Sağlık
Published in bbm406f17
Jan 3, 2018

After weeks of experimenting with different neural network models and optimization algorithms, we decided to use the ResNet50 model with the Stochastic Gradient Descent (SGD) optimizer, and we made the necessary changes to our code.

ResNet is a short name for Residual Network.

To address the problem that deeper neural networks are harder to train, researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun from Microsoft introduced Deep Residual Learning for Image Recognition in 2015.

Residual Networks have shortcut connections. Shortcut connections (connections that skip one or more layers) perform identity mapping, and their result is added to the output of the stacked layers, effectively reducing the difficulty of training a deep network while keeping the accuracy gains that come with depth. For detailed information you can check their paper: https://arxiv.org/pdf/1512.03385.pdf

We didn’t want to train our network from scratch, as that would take a lot of time. Instead, we took ResNet50 weights pretrained on the ImageNet dataset and fine-tuned them to best suit our needs.

We used ResNet50 as our base model and added our own softmax output layer for our 176 classes.
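The setup above can be sketched roughly as follows, assuming tf.keras. The class count (176) and the SGD learning rate (0.0001) come from the post; the input size, pooling choice, and loss are illustrative assumptions. The post loads ImageNet weights (`weights="imagenet"`); the sketch uses `weights=None` only to avoid a download here.

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

# Base model without the original 1000-class ImageNet head.
# The post uses weights="imagenet"; None avoids the download in this sketch.
base = ResNet50(weights=None, include_top=False, pooling="avg",
                input_shape=(224, 224, 3))

# Our own softmax output layer for the 176 landmark classes.
outputs = Dense(176, activation="softmax")(base.output)
model = Model(inputs=base.input, outputs=outputs)

# SGD with the learning rate the post settled on (0.0001).
model.compile(optimizer=SGD(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```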

ResNet50 has 174 layers. At the beginning we froze all ResNet50 layers and trained only our output layer for a few epochs, to give the model a general idea of our classes. Then we unfroze the last identity blocks and started fine-tuning ResNet50, leaving the first 153 layers untouched.
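The two-phase freezing schedule described above could look like this in tf.keras; the cut-off index 153 comes from the post, and `weights=None` is used here only to skip the ImageNet download.

```python
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights=None, include_top=False)

# Phase 1: freeze every ResNet50 layer, so only the new
# softmax head gets trained for the first few epochs.
for layer in base.layers:
    layer.trainable = False

# Phase 2: unfreeze the last identity blocks (layer 153 onward)
# and fine-tune them together with the head, leaving the
# first 153 layers untouched.
for layer in base.layers[153:]:
    layer.trainable = True
```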

To improve generalization, we used Keras’ ImageDataGenerator. It lets us rotate, shift, and zoom images, and the generated variants increase our training set’s size.
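A minimal sketch of that augmentation, assuming Keras’ ImageDataGenerator; the exact rotation, shift, and zoom ranges are illustrative choices, not values from the post.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # random rotations up to 20 degrees
    width_shift_range=0.1,   # horizontal shifts up to 10% of width
    height_shift_range=0.1,  # vertical shifts up to 10% of height
    zoom_range=0.2,          # random zoom in/out up to 20%
)

# Each batch drawn from the generator is freshly augmented,
# effectively enlarging the training set.
images = np.random.rand(4, 64, 64, 3)
batch = next(datagen.flow(images, batch_size=4))
```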

We tried different learning rates for our model and, in the end, settled on 0.0001, as it gave the best results.

We saved the weights after each epoch. After around 100 epochs the model converged and gave us these results:

loss: 0.0825 - acc: 0.9793 - top_5_categorical_accuracy: 0.9994 
val_loss: 2.1778 - val_acc: 0.6548
val_top_5_categorical_accuracy: 0.8372
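The per-epoch weight saving mentioned above can be done with Keras’ ModelCheckpoint callback; the filename pattern here is an illustrative assumption.

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Writes the model weights to a new file after every epoch,
# e.g. weights-epoch01.weights.h5, weights-epoch02.weights.h5, ...
checkpoint = ModelCheckpoint(
    "weights-epoch{epoch:02d}.weights.h5",
    save_weights_only=True,
)
# Passed to training as: model.fit(..., callbacks=[checkpoint])
```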

Some of the reasons behind the gap between validation and training accuracy/loss are the small size of our dataset (~24k training, ~6k test images) and the noise we failed to clean. The model learns the classes of the given images with outstanding accuracy but can’t apply that knowledge to different real-life images with the same accuracy.

Our valuable project member Ege Ucak, behind Tarihi Asansor, İzmir

For example:
We have a total of 244 training images for Tarihi Asansor (twice the average number of images per class), and most of them are full-size shots from the front. There are only 2 similar pictures in our training set. So when the network looks at Ege’s photo, it recognizes the street and the buildings around it before it recognizes the top of the Tarihi Asansor building, and guesses the class as Fransız Sokağı (La Rue Française) with 33% certainty.

Fransız Sokağı (La Rue Française), Istanbul

This showed us the importance of the size and variety of the dataset.
