Logo detection using YOLOv2
Logo recognition in images and videos is a key problem in a wide range of applications, such as copyright infringement detection, vehicle logo recognition for intelligent traffic-control systems, augmented reality, contextual advertisement placement and others. Today, we will look at logos not just from a recognition point of view but also from a localisation point of view. To this end, we will use a state-of-the-art object detection method called YOLO9000[1], also known as YOLOv2 (there is a small difference between the two), to detect logos.
What is object detection?
Object detection is the process of identifying which objects are present in an image and where they are located. The image below depicts this:
Here, the algorithm isn’t just identifying the objects (dog, bicycle, car) that are present in the image but also locating them.
How to perform object detection?
Over the years, people have used various methods to tackle the task of object detection. They can broadly be separated into two approaches: the classical approach and the deep learning approach. The classical approach involved keypoint-based detectors and descriptors, bag-of-words models, local feature-based recognition, and others.
Things took a big turn in 2012 when AlexNet[2] rekindled research interest in Convolutional Neural Networks (CNNs). Since then there has been a lot of work on using CNNs for various image-related tasks. Three methods are currently noteworthy for the object detection task:
- R-CNN: Region-based Convolutional Neural Networks[3]
- SSD: Single Shot MultiBox Detector[4]
- YOLO: You only look once
Of the three, YOLO (rather, YOLOv2) is the current state-of-the-art method for object detection, and it is what we will talk about today and use for the task of logo detection.
What is YOLO?
YOLO stands for You Only Look Once. It is a real-time object detection system. There are currently two versions of it: YOLOv1[5] and YOLOv2. YOLOv1 was first introduced in 2015. It was an important work because the model could process images in real time at 45 frames per second, which was remarkable compared to other object detection methods at the time. Here is how YOLOv1 compared to other methods of that time:
Clearly, YOLOv1 was a lot faster than the other methods, but its detection accuracy was ~10 mAP lower than that of Faster R-CNN with VGG-16[6]. This was one of the motivations behind the second version of YOLO, which was introduced in late 2016. YOLOv2 outperforms all the other methods in both speed and detection accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP (mean Average Precision) on VOC 2007; at 40 FPS, it gets 78.6 mAP.
How does YOLOv2 work?
YOLO suffered from a variety of shortcomings relative to state-of-the-art detection systems. Error analysis of YOLO compared to Fast R-CNN[7] showed that YOLO makes a significant number of localization errors. Furthermore, YOLO had relatively low recall compared to region proposal-based methods. So, the authors focused mainly on improving recall and localization while maintaining classification accuracy.
Computer vision generally trends towards larger, deeper networks. Better performance often hinges on training larger networks or ensembling multiple models together. However, with YOLOv2, they wanted a more accurate detector that is still fast. Instead of scaling up the network, they simplified the network to make the representation easier to learn. Here’s how they improved YOLO’s performance:
Batch Normalization:
They added batch normalization[8] after every convolutional layer of YOLOv1. This alone resulted in a 2% improvement in mAP. Batch normalization improves model convergence while regularising it, which allowed them to remove the Dropout[9] layers used in YOLOv1.
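As a rough illustration, here is a minimal PyTorch sketch (my own, not the original darknet code) of the basic pattern: every convolution is followed by batch normalization and a leaky ReLU, so the convolution's bias and the Dropout layers become unnecessary.

```python
import torch.nn as nn

def conv_bn_leaky(in_channels, out_channels, kernel_size):
    """Convolution followed by batch normalization and leaky ReLU, the basic
    building block used throughout YOLOv2/Darknet-19 (sketch, not darknet source)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size,
                  padding=kernel_size // 2, bias=False),  # bias is absorbed by BatchNorm
        nn.BatchNorm2d(out_channels),
        nn.LeakyReLU(0.1, inplace=True),
    )
```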
High Resolution Classifier:
The general recipe for object detection is to pretrain the model on ImageNet[10] for classification, and then, for detection, run the network at a higher input resolution, particularly so that smaller objects in a scene can be detected.
YOLOv2 was first pretrained for classification on ImageNet at 224x224; the classification network was then fine-tuned on higher resolution images (448x448) before being adapted for detection. This resulted in a ~4% increase in mAP.
Anchor Boxes:
There are two ways of predicting bounding boxes: directly predicting the bounding box of the object, or predicting it as an offset from a set of pre-defined bounding boxes (anchor boxes).
YOLOv1 predicts the coordinates of bounding boxes directly, using fully connected layers on top of the convolutional feature extractor, but it makes a significant number of localisation errors. It is easier to predict offsets relative to anchor boxes than to predict the coordinates directly.
Dimension clusters:
Instead of using predefined anchor boxes, the authors looked at the bounding boxes in the training data (VOC 2007[11], COCO[12]) and ran K-means clustering on those boxes. The resulting cluster centres are the dimension clusters.
When they ran K-means clustering on the VOC 2007 and COCO training data, this is what they obtained:
The graph on the left shows how well the dimension clusters overlap (in terms of average IOU) with the training data’s bounding boxes. Using these cluster centres (together with the direct location prediction described below) helped increase mAP by almost 5%. The table below shows how dimension clusters compare to pre-defined anchor boxes:
With only 5 priors, the centroids perform similarly to 9 hand-picked anchor boxes, with an average IOU of 61.0 compared to 60.9. With 9 centroids, a much higher average IOU is obtained.
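A minimal sketch of the idea, assuming the ground-truth boxes are given as (width, height) pairs: run K-means, but with 1 − IOU as the distance metric instead of Euclidean distance, as described in the paper.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, with every box treated as if centred at the origin."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def dimension_clusters(boxes_wh, k=5, iterations=100, seed=0):
    """K-means on box widths/heights using d(box, centroid) = 1 - IOU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)]
    for _ in range(iterations):
        # smallest distance = largest IOU
        assignment = np.argmax(iou_wh(boxes_wh, centroids), axis=1)
        centroids = np.array([boxes_wh[assignment == i].mean(axis=0)
                              if np.any(assignment == i) else centroids[i]
                              for i in range(k)])
    return centroids
```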
Direct location prediction:
When anchor boxes were used with YOLO, the model was unstable, especially during early iterations. So instead of predicting offsets, YOLOv2 follows the approach of YOLOv1 and predicts location coordinates relative to the location of the grid cell. Using dimension clusters along with directly predicting the bounding box centre location improves YOLO by almost 5% over the version with anchor boxes.
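Concretely, for each box the network outputs (t_x, t_y, t_w, t_h) and the final box is decoded relative to the grid cell and the prior, following the equations in the paper. A small numpy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one prediction as in the YOLO9000 paper. (cx, cy) is the offset of the
    grid cell from the top-left corner of the image and (pw, ph) is the prior's size;
    all values are expressed in grid-cell units."""
    bx = sigmoid(tx) + cx            # sigmoid keeps the centre inside its grid cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh
```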
Fine-grained features:
YOLOv2 predicts detections on a 13 x 13 feature map. This is sufficient for identifying large objects but not smaller ones. To better localize smaller objects, a passthrough layer takes features from an earlier layer at 26 x 26 resolution and concatenates them with the lower-resolution features. This gives a 1% increase in performance.
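The passthrough layer stacks adjacent spatial positions into channels so that the 26 x 26 x 512 feature map becomes 13 x 13 x 2048 and can be concatenated with the 13 x 13 map. Here is a numpy sketch of one possible ordering (darknet's reorg layer may differ in the exact channel order):

```python
import numpy as np

def passthrough(features, stride=2):
    """Rearrange a (H, W, C) feature map into (H/stride, W/stride, C*stride*stride)
    by moving each stride x stride spatial block into the channel dimension."""
    h, w, c = features.shape
    out = features.reshape(h // stride, stride, w // stride, stride, c)
    out = out.transpose(0, 2, 1, 3, 4)
    return out.reshape(h // stride, w // stride, c * stride * stride)

fine = np.random.rand(26, 26, 512)
print(passthrough(fine).shape)   # (13, 13, 2048), ready to concatenate with the 13 x 13 map
```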
Multi-scale training:
YOLOv1 used a fixed 448 x 448 input resolution. In YOLOv2, the input image is randomly resized during training to different resolutions between 320 x 320 and 608 x 608 (always a multiple of 32). This multi-scale training can be thought of as a form of data augmentation: it forces the network to learn to predict well across a variety of input dimensions. It increased mAP by 1.5%.
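A sketch of the schedule: every few batches a new input size is drawn from the multiples of 32 between 320 and 608 (the paper changes the size every 10 batches).

```python
import random

SCALES = list(range(320, 608 + 1, 32))   # 320, 352, ..., 608

def input_size_for_batch(batch_index, current_size, change_every=10):
    """Return the training resolution for this batch; a new size is drawn every
    `change_every` batches, as in the multi-scale training described above."""
    if batch_index % change_every == 0:
        return random.choice(SCALES)
    return current_size

size = 416
for batch_index in range(30):
    size = input_size_for_batch(batch_index, size)
    # resize the batch of images to (size, size) here before the forward pass
```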
How does YOLOv2 fare against other methods?
The graph and table below show how different methods perform with respect to precision and speed on VOC 2007 dataset and VOC 2007 + 2012 dataset respectively:
What makes YOLOv2 fast?
Most applications of object detection, such as robotics and self-driving cars, rely on low-latency predictions from their detection algorithms. Most algorithms use VGG-16[13] as their base feature extractor, but VGG-16 is a complex network that requires 30.69 billion floating point operations for a single pass over one image at 224 x 224 resolution. YOLOv1 relied on a custom network based on the GoogLeNet architecture[14], which is faster than VGG-16 as it uses only 8.52 billion operations for a forward pass; however, its accuracy is slightly worse than VGG-16’s. For single-crop, top-5 accuracy at 224 x 224, YOLO’s custom model gets 88.0% on ImageNet compared to 90.0% for VGG-16.
To overcome this shortcoming of YOLOv1, YOLOv2 uses a new architecture named Darknet-19. It has 19 convolutional layers and 5 max-pooling layers. Darknet-19 requires only 5.58 billion operations to process an image, yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet. The full network is shown below:
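In case the figure is hard to parse, here is a compact PyTorch re-creation of the layer layout described in the paper (3 x 3 convolutions interleaved with 1 x 1 bottlenecks and five 2 x 2 max-pools, ending in a 1 x 1 convolution to the class scores and global average pooling). This is my own sketch, not the darknet source.

```python
import torch.nn as nn

# 3x3 convolution output channels; tuples mark 1x1 convolutions; 'M' is a 2x2 max-pool.
DARKNET19_CFG = [32, 'M', 64, 'M', 128, (64,), 128, 'M', 256, (128,), 256, 'M',
                 512, (256,), 512, (256,), 512, 'M',
                 1024, (512,), 1024, (512,), 1024]

def conv_bn_leaky(in_ch, out_ch, k):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

def darknet19(num_classes=1000):
    layers, in_ch = [], 3
    for v in DARKNET19_CFG:
        if v == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        else:
            out_ch, k = (v, 3) if isinstance(v, int) else (v[0], 1)
            layers.append(conv_bn_leaky(in_ch, out_ch, k))
            in_ch = out_ch
    # 19th convolution maps to class scores, followed by global average pooling
    layers += [nn.Conv2d(in_ch, num_classes, 1), nn.AdaptiveAvgPool2d(1), nn.Flatten()]
    return nn.Sequential(*layers)
```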
For details about how the model was trained for classification and detection, please refer to the original paper.
How does YOLO9000 differ from YOLOv2?
The mechanism proposed here is very interesting: the model is trained jointly on both classification and detection data. Images labelled for detection are used to learn detection-specific information, such as bounding box coordinate prediction and objectness, as well as how to classify common objects, while images that only have class labels are used to expand the number of categories the model can detect. How cool is that!
During training, images from both detection and classification datasets are mixed. When the network sees an image labelled for detection, it backpropagates the full YOLOv2 loss function. When it sees a classification image, it only backpropagates loss from the classification-specific parts of the architecture. To understand how they use hierarchical classification, how they combine the datasets and how they perform joint classification and detection, please refer to section 4 of the original paper.
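Schematically, the training loop just switches between the two losses depending on which dataset the current image came from. A sketch with placeholder loss functions (not the actual YOLO9000 code), where `target` is assumed to be a dict that has a "boxes" entry only for detection images:

```python
def joint_training_step(model, image, target, optimizer,
                        detection_loss_fn, classification_loss_fn):
    """One schematic step of joint training. `detection_loss_fn` stands in for the
    full YOLOv2 loss and `classification_loss_fn` for its classification part only;
    both are placeholders supplied by the caller."""
    optimizer.zero_grad()
    prediction = model(image)
    if target.get("boxes") is not None:        # image from a detection dataset (e.g. COCO)
        loss = detection_loss_fn(prediction, target)
    else:                                      # image with only a class label (e.g. ImageNet)
        loss = classification_loss_fn(prediction, target)
    loss.backward()
    optimizer.step()
    return loss
```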
YOLO9000 differs from YOLOv2 in that it was trained to be a large-scale detector on a combined dataset built from the COCO detection dataset and the top 9000 classes from the full ImageNet release. YOLO9000 uses the base YOLOv2 architecture, but with only 3 priors instead of 5 to limit the output size. Through joint training, YOLO9000 learns to find objects in images using the detection data in COCO and learns to classify a wide variety of these objects using data from ImageNet. When evaluated on the ImageNet detection task, YOLO9000 gets 19.7 mAP overall, and 16.0 mAP on the disjoint 156 object classes for which it has never seen any labelled detection data.
Logo Detection Dataset
For the task of logo detection, the FlickrLogos-47[15] dataset has been used. It consists of real-world images collected from Flickr depicting company logos in various circumstances, along with annotations for the object detection task. FlickrLogos-47 covers 47 logo classes:
Adidas (Symbol), Adidas (Text), Aldi, Apple, Becks (Symbol), Becks (Text), BMW, Carlsberg (Symbol), Carlsberg (Text), Chimay (Symbol), Chimay (Text), Coca-Cola, Corona (Symbol), Corona (Text), DHL, Erdinger (Symbol), Erdinger (Text), Esso (Symbol), Esso (Text), Fedex, Ferrari, Ford, Foster’s (Symbol), Foster’s (Text), Google, Guiness (Symbol), Guiness (Text), Heineken, HP, Milka (Symbol), Milka (Text), Nvidia (Symbol), Nvidia (Text), Paulaner (Symbol), Paulaner (Text), Pepsi (Symbol), Pepsi (Text), Ritter Sport, Shell, Singha (Symbol), Singha (Text), Starbucks, Stella Artois (Symbol), Stella Artois (Text), Texaco, Tsingtao (Symbol), Tsingtao (Text) and UPS.
The dataset is maintained by the Multimedia Computing and Computer Vision Lab, Augsburg University. I would like to thank Christian Eggert for providing the dataset and Augsburg University for maintaining it, without which this work would not be possible.
Implementation
I used AlexeyAB’s implementation of YOLOv2, mainly because it provides good instructions for training YOLOv2 on custom data, which in this case is the FlickrLogos-47 dataset. It also provides a Windows build.
The logo detection model was trained starting from the weights of the convolutional layers of Darknet-19 that were pre-trained on ImageNet.
All training was performed on an Nvidia Tesla K80.
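For reference, training was launched along these lines, following AlexeyAB’s instructions for training YOLOv2 on custom data. The file names below are my own placeholders: the .data and .cfg files describe the 47 logo classes and the train/validation image lists, and the .conv.23 file holds the ImageNet-pretrained convolutional weights of Darknet-19. Adjust paths to your own setup.

```python
import subprocess

# Hypothetical file names; only the general invocation pattern is the point here.
subprocess.run([
    "./darknet", "detector", "train",
    "data/flickrlogos47.data",
    "cfg/yolo-voc-logos47.cfg",
    "darknet19_448.conv.23",
], check=True)
```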
Results
Training was performed with the standard YOLOv2 VOC 2.0 configuration (yolo-voc.2.0.cfg), which uses a learning rate of 0.0001 and a batch size of 64. I trained for a total of 24000 iterations, where 1 iteration corresponds to 1 batch of images (64 images in my case).
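One detail when adapting yolo-voc.2.0.cfg to the 47 logo classes (as I understand the standard YOLOv2 recipe): the convolutional layer that feeds the region layer must output num_anchors * (classes + 5) filters.

```python
num_anchors = 5    # yolo-voc.2.0.cfg uses 5 anchor boxes (priors)
num_classes = 47   # FlickrLogos-47
# 4 box coordinates + 1 objectness score + 47 class scores per anchor
filters = num_anchors * (num_classes + 5)
print(filters)     # 260 — the value to set for `filters=` in the last convolutional layer
```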
The table below shows the average Intersection-over-Union (IoU) and recall scores on the test set. The best scores were obtained at the 10000th iteration, with an average IoU of 48.03% and a recall of 58.11%.
Since the scores were steadily increasing until the 10000th iteration, I increased the learning rate to 0.001 to see if that would speed up convergence, but it hurt the scores drastically. I then continued training with the 0.0001 learning rate until the 20000th iteration; since there was no increase in the scores, I reduced the learning rate by a factor of 2 for iterations 20000 to 23000. There was no significant change in performance. Finally, I increased the learning rate to 0.0002, but that did not improve performance either.
The weights for the 10,000th iteration can be found here. The code to convert the FlickrLogos-47 dataset annotations to YOLOv2 annotations can be found in this repository.
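The conversion itself boils down to turning absolute corner coordinates into the normalised centre format darknet expects. A minimal sketch (the corner-coordinate input format and the example class id are assumptions about the FlickrLogos-47 ground-truth files):

```python
def to_yolo_annotation(class_id, x1, y1, x2, y2, img_w, img_h):
    """Turn an absolute (x1, y1, x2, y2) box into the darknet/YOLOv2 line format:
    "<class_id> <x_center> <y_center> <width> <height>", all relative to image size."""
    x_center = (x1 + x2) / 2.0 / img_w
    y_center = (y1 + y2) / 2.0 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a logo occupying (200, 150)-(300, 200) in a 640 x 480 image, class id 11 (hypothetical)
print(to_yolo_annotation(11, 200, 150, 300, 200, 640, 480))
```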
References
[1] YOLO9000: Better, Faster, Stronger- https://arxiv.org/abs/1612.08242
[2] ImageNet Classification with Deep Convolutional Neural Networks- https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[3] Rich feature hierarchies for accurate object detection and semantic segmentation- https://arxiv.org/abs/1311.2524
[4] SSD: Single Shot MultiBox Detector- https://arxiv.org/abs/1512.02325
[5] You Only Look Once: Unified, Real-Time Object Detection- https://arxiv.org/abs/1506.02640
[6] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks- https://arxiv.org/abs/1506.01497
[7] Fast R-CNN- https://arxiv.org/abs/1504.08083
[8] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift- https://arxiv.org/abs/1502.03167
[9] Dropout: A Simple Way to Prevent Neural Networks from Overfitting- http://jmlr.org/papers/v15/srivastava14a.html
[10] ImageNet: A Large-Scale Hierarchical Image Database- http://www.image-net.org/papers/imagenet_cvpr09.pdf
[11] The PASCAL Visual Object Classes (VOC) Challenge- http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf
[12] Microsoft COCO: Common Objects in Context- https://arxiv.org/abs/1405.0312
[13] Very Deep Convolutional Networks for Large-Scale Image Recognition- https://arxiv.org/abs/1409.1556
[14] Going deeper with convolutions- https://arxiv.org/abs/1409.4842
[15] FlickrLogo-47 Dataset- http://www.multimedia-computing.de/flickrlogos/