[Week 7 — Eat & Count]

Eat & Count
Published in bbm406f16
Jan 14, 2017

In our Week 5 blog post, we talked about training a ConvNet from scratch and shared our preliminary results. We used the AlexNet model and trained it on 10 classes from the Food-101 dataset. The preliminary result we got from this model was about 50% accuracy, but training from scratch is time consuming even on 10 classes.

Transfer Learning and VGG-16

The solution is transfer learning. In transfer learning, the network starts from pre-trained weights instead of the randomly initialized weights used when training from scratch. To implement transfer learning on the TensorFlow backend, the most suitable tool we encountered is Keras, a deep learning library that works on both the TensorFlow and Theano frameworks. VGG-16, VGG-19, ResNet50 and Inception v3 are the pre-trained networks provided by Keras.
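In Keras, loading one of these pre-trained networks takes only a couple of lines. A minimal sketch (the 150x150 input size is just an assumption for illustration, not our actual setting):

```python
# A minimal sketch of loading a pre-trained network from Keras' applications
# module; the chosen model and input size are assumptions for illustration.
from keras.applications.vgg16 import VGG16

# include_top=False drops the fully connected classifier, so we can later
# attach our own head for the Food-101 classes.
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(150, 150, 3))
base_model.summary()
```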

Image Credit: François Chollet

First we fine-tuned a pre-trained VGG-16 model following the steps in this blog post, which shows two ways to apply fine-tuning. The first one is using bottleneck features.

Let’s say we want to fine-tune only the last block of the network; that is, the weights up to the last block will not be updated. We can store the output of Conv block 4 with a single forward pass, then use this output as the input features of the last block and train it as a separate network.
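A rough sketch of the bottleneck-feature idea in Keras; the block4_pool layer name, the placeholder data and the file name are assumptions for illustration, not our exact code:

```python
# Sketch: compute and store bottleneck features from Conv block 4 once,
# then train the remaining layers on these stored features alone.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model

base = VGG16(weights='imagenet', include_top=False)

# Truncate the network at the end of Conv block 4; everything below this
# point is frozen, so its output for each image never changes.
feature_extractor = Model(inputs=base.input,
                          outputs=base.get_layer('block4_pool').output)

# images: preprocessed training images, e.g. shape (N, 150, 150, 3)
images = np.random.rand(8, 150, 150, 3).astype('float32')  # placeholder data
bottleneck_features = feature_extractor.predict(preprocess_input(images),
                                                batch_size=8)

# Save once; the last block is then trained on these features as its input.
np.save('bottleneck_block4.npy', bottleneck_features)
```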

The other method is freezing the layers that we don’t want to fine-tune: we simply set the learning rate of these layers to zero, so their weights are never updated. In this case we still have to run the entire network for every batch, which may lead to slower convergence. In conclusion, using bottleneck features gives faster convergence, but we need extra memory to store these features. For example, the output of Conv block 4 for a single image is [512x11x11] when the input image dimensions are [150x150x3]. Multiplying this by the number of training images (75,000) and the size of a float (4 bytes), we get about 17 GB, which exceeds our memory.
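A minimal sketch of the freezing approach, assuming a VGG16 base with a small custom classifier on top; the layer selection and hyperparameters are illustrative, not our exact settings:

```python
# Sketch: freeze everything up to Conv block 5 and fine-tune the rest.
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.optimizers import SGD

base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))

# Keras skips weight updates for layers whose trainable flag is False,
# which has the same effect as setting their learning rate to zero.
for layer in base.layers:
    if not layer.name.startswith('block5'):
        layer.trainable = False

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
predictions = Dense(101, activation='softmax')(x)  # 101 Food-101 classes

model = Model(inputs=base.input, outputs=predictions)
# A small learning rate keeps fine-tuning from destroying the pre-trained weights.
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])
```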

We ran experiments with the VGG-16 model on 20 classes and on the full 101 classes. For 20 classes we got 70% accuracy, while our first result for 101 classes was 36% accuracy. Since it seemed it would take a long time to reach convergence, we switched the network to ResNet50.

ResNet

Let’s try to explain the idea behind ResNet. In theory, deeper networks give better results, but deep learning suffers from the vanishing gradient problem. When gradients become too small during backpropagation, representing these values in a computer becomes harder. Recall that in the backpropagation algorithm the error signal flows back through the network via the chain rule. Since the gradients are already small, their products become even smaller, and as a result the first layers of the network stop being updated. This problem gets worse as the number of layers increases.

There are partial solutions to this problem, such as ReLU activation units and batch normalization. ReLUs reduce the effect of the vanishing gradient problem, but the gradient can still be zero when the input to a neuron is negative. This scenario can occur when a large negative bias term is learned.

In conclusion, deeper networks have more representational power thanks to the hierarchical feature representations built at each layer, yet in practice the problems described above lead to the opposite situation. This is where the idea behind Residual Networks (Deep Residual Learning for Image Recognition by He et al.) comes in: it allows deeper neural networks to be trained effectively.

The image is taken from Lab41’s post; the original version of this image appears in the Deep Networks with Stochastic Depth paper.

The technique proposed in the paper is based on learning at least the identity mapping of the given input. A residual block learns some function f(x) and sums it with the identity id(x) = x, so the learned mapping becomes h(x) = f(x) + x. If the best mapping for a block is close to the identity, the block only has to push f(x) toward zero, which is easier than learning the identity from scratch.
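A minimal sketch of a residual block written with the Keras functional API; the two-convolution design and filter sizes here are simplifications for illustration, not the exact bottleneck block used inside ResNet50:

```python
# Sketch of h(x) = f(x) + x: a shortcut connection adds the block input
# back onto the output of a small stack of convolutions.
from keras.layers import Input, Conv2D, BatchNormalization, Activation, add
from keras.models import Model

inputs = Input(shape=(56, 56, 64))

# f(x): two 3x3 convolutions with batch normalization.
x = Conv2D(64, (3, 3), padding='same')(inputs)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(64, (3, 3), padding='same')(x)
x = BatchNormalization()(x)

# h(x) = f(x) + x: the shortcut carries the identity around the block.
x = add([x, inputs])
outputs = Activation('relu')(x)

block = Model(inputs=inputs, outputs=outputs)
```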

With this idea, deeper neural networks can be trained successfully. In the original paper, the authors evaluated residual nets with a depth of up to 152 layers, 8× deeper than VGG nets but still with lower complexity, and they won the ImageNet competition with a 3.57% top-5 error rate.

We evaluated ResNet50, which is 50 layers deep. For 101 classes the accuracy was 72%. Then we applied more aggressive preprocessing and data augmentation: mean subtraction, normalization, rotation, horizontal flips and shear mapping. After that, the accuracy increased to 74%.
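A sketch of what such a ResNet50 setup with the augmentations above could look like in Keras; the parameter values, input size and placeholder data are assumptions for illustration, not our exact configuration:

```python
# Sketch: ResNet50 base with a 101-class head, trained on augmented batches.
import numpy as np
from keras.applications.resnet50 import ResNet50
from keras.models import Model
from keras.layers import GlobalAveragePooling2D, Dense
from keras.preprocessing.image import ImageDataGenerator

base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base.output)
predictions = Dense(101, activation='softmax')(x)
model = Model(inputs=base.input, outputs=predictions)
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Mean subtraction and normalization plus geometric augmentation.
datagen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True,
                             rotation_range=30,
                             shear_range=0.2,
                             horizontal_flip=True)

# x_train / y_train stand in for the Food-101 images and one-hot labels.
x_train = np.random.rand(8, 224, 224, 3).astype('float32')  # placeholder data
y_train = np.eye(101)[np.random.randint(0, 101, 8)]         # placeholder labels

datagen.fit(x_train)  # computes the dataset mean and std used by the generator
model.fit_generator(datagen.flow(x_train, y_train, batch_size=8),
                    steps_per_epoch=1, epochs=1)
```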

On the subject of deep learning based food recognition, there is a recent paper by Liu et al. that achieves 77.4% accuracy on the Food-101 dataset using Inception modules. In order to increase our accuracy, we are planning to employ a deeper network.

Resources
