Classifying Logos in Images
https://www.iosb.fraunhofer.de/servlet/is/78045/samples.jpg

Classifying Logos in Images with Convolutional Neural Networks

Ann-Kristin Juschka
Published in NEW IT Engineering
7 min read · Aug 16, 2019


Why should you read about machine learning and in particular this post?

The father of artificial intelligence (AI), John McCarthy, called AI “the science and engineering of making intelligent machines”. Machine learning is the part of AI that is already a reality today: computers learn from experience (= data), as humans do, instead of being explicitly programmed.
As more and more data becomes available on which today’s powerful computers can be trained, machine learning is now used everywhere: from customized video recommendations, image and speech recognition in medical research, spam and fraud detection, stock market analysis and video games to self-driving cars.
As a follow-up to our review of the first term of Udacity’s Machine Learning Engineer Nanodegree, this blog post explains in detail how, in our final deep learning project of the second term, we implement Convolutional Neural Networks (CNNs) with the Python library Keras in order to classify logos in images. We are also interested in this topic because we later wish to analyze the sentiment of tweets with pictures in which a given company’s logo is detected.

1. What image data do we use to train our Convolutional Neural Network?

Finding an interesting dataset is always the first step of any machine learning problem. With our later goal of classifying logos in arbitrary images from tweets in mind, we obtained access to the large Logos in the Wild Dataset with 11,052 images containing different logo classes. Instead of downloading the images from the URLs provided with this dataset, we get 9,428 of them as a subset of the QMUL-OpenLogo Dataset. First, we load the images and label them with the target brand given by the corresponding folder name:
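The loading step can be sketched with the standard library alone. This is a minimal illustration, not the code from our notebook: the function name `load_labeled_paths`, the set of accepted file suffixes and the one-folder-per-brand layout are assumptions made for this example.

```python
from pathlib import Path

# Assumed set of image file endings; adjust to the dataset at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def load_labeled_paths(root):
    """Return (paths, labels, class_names) for a folder-per-brand layout."""
    root = Path(root)
    # Each subfolder name is treated as one logo class.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_index = {name: i for i, name in enumerate(class_names)}

    paths, labels = [], []
    for name in class_names:
        for file in sorted((root / name).iterdir()):
            # Skip anything that does not look like an image file.
            if file.suffix.lower() in IMAGE_SUFFIXES:
                paths.append(file)
                labels.append(class_index[name])
    return paths, labels, class_names
```

The integer labels index into `class_names`, so a predicted label can later be mapped back to a brand name.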

After each training epoch, it makes sense to validate the resulting parameters and machine learning algorithm on a separate subset of the image files, the validation set. We define a checkpointer that saves the weight parameters with the best result on this validation set. In the end, after finishing the whole training process, we want to test our machine learning algorithm on another separate subset, the test set. Thus we split our image dataset into a training, a validation and a test set, and define the checkpointer as follows:
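The split itself can be illustrated without any machine learning library. The sketch below is an assumption-based example, not our original code: the 20% fractions are inferred from the set sizes quoted in this post, and `split_indices` is a hypothetical helper. It does not cover the checkpointer.

```python
import math
import random


def split_indices(n_items, test_fraction=0.2, val_fraction=0.2, seed=0):
    """Shuffle indices and split them into train, validation and test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    # Hold out the test set first, then carve the validation set
    # out of the remaining items.
    n_test = math.ceil(test_fraction * n_items)
    n_val = math.ceil(val_fraction * (n_items - n_test))

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(9428)
print(len(train), len(val), len(test))  # 6033 1509 1886
```

Holding out 20% for testing and then 20% of the remainder for validation reproduces the 6,033 / 1,509 / 1,886 split of the 9,428 images.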

As the last preprocessing step for Keras’ Convolutional Neural Networks, we resize each image to 224 x 224 pixels with the 3 RGB channels red, green and blue. This lets us represent the 6,033 training, 1,509 validation and 1,886 test image files as three 4D tensors of shape (number of images, 224, 224, 3).

2. Why & how to use Convolutional Neural Networks for our logo classification problem?

In the second step, we decide which machine learning algorithm to use. As CNNs are a specific type of neural network most commonly used for analyzing images, let us first look at a simple neural network:

A neural network with two hidden layers

Roughly speaking, a neural network is a graph consisting of one input layer, hidden layers and one output layer whose nodes are connected with each other. Here our inputs are the features of our input data, e.g., the RGB values of each single pixel of our images. The output nodes correspond to the possible logo classes in our dataset. The connecting edges have different importance given by their weight values, and the output of any node is computed by applying an activation function, like the ones below, to the weighted sum of its respective input values.

The activation functions ReLU and Softmax in case of two input values (for others, see https://en.wikipedia.org/wiki/Activation_function)

While we use the rectified linear unit (ReLU) activation for the hidden layers, we use the Softmax activation for the output layer to obtain a probability in the interval [0,1] for each output node. When “training the neural network”, the computer repeatedly adjusts the weights of the connecting edges so that the neural network’s output becomes as close as possible to the true logo classes.
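To make the two activation functions concrete, here is a minimal pure-Python sketch. The helper names are chosen for this example only; Keras provides its own implementations.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs are clipped to zero."""
    return max(0.0, x)


def softmax(values):
    """Turn arbitrary scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow without changing the result.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def neuron_output(inputs, weights, activation=relu):
    """Apply the activation to the weighted sum of the inputs."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return activation(weighted_sum)
```

For example, `neuron_output([1.0, 2.0], [0.5, -1.0])` has the weighted sum -1.5, which ReLU maps to 0, while `softmax` always returns values between 0 and 1 that add up to 1.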

When analyzing images, the number of parameters in a classical neural net becomes very large: if an image measures 224 x 224 pixels and we record the 3 RGB values of each pixel as input, then we have 224 x 224 x 3 = 150,528 input nodes, so the neural net needs to optimize a large number of weight parameters for all the fully connected edges. Further, the input nodes are arranged as a vector, so the neural net does not know about any regional patterns in the image. These problems motivate Convolutional Neural Nets like this one:
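A quick calculation shows how fast the weight count grows. The hidden layer size of 100 nodes below is a hypothetical value chosen only for illustration, and bias terms are ignored.

```python
def dense_weight_count(n_inputs, n_outputs):
    """Number of weights in a fully connected layer, ignoring bias terms."""
    return n_inputs * n_outputs


n_inputs = 224 * 224 * 3
hidden_nodes = 100  # hypothetical layer size, not taken from our model

print(n_inputs)                                    # 150528
print(dense_weight_count(n_inputs, hidden_nodes))  # 15052800
```

Even this modest first hidden layer would already require about 15 million weights.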

A convolutional neural network with three feature maps

In our example, the input nodes of the CNN are arranged in a 4x4 matrix shape. Then we define three 2x2 filters for the four input regions, where each region is connected only to the three correspondingly colored nodes in the hidden layer. Note that the 3 filters define 3 feature maps, each of which detects a feature like a vertical, horizontal or diagonal line in the four regions. In our use case, such a feature might instead indicate a Starbucks, Burger King or Telekom logo.

As each of the three filters shares its weight parameters across all four input regions, we have only 2 x 2 x 3 = 12 weight parameters to optimize between the input layer and the convolutional layer. Thanks to these fewer connecting edges, the training process becomes cheaper, and the CNN “understands” the regional contributions of the input data. In addition, the training process of course optimizes the 12 x 9 = 108 weight parameters for the final fully connected output layer.
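The toy network can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions: the image values and the three hand-picked filters are made up for illustration (in a real CNN the filter weights are learned), the stride of 2 is what yields the four non-overlapping regions, and the 9 output classes are implied by the 12 x 9 = 108 count above.

```python
def convolve(image, kernel, stride):
    """Slide a square kernel over a square image and return the feature map."""
    k = len(kernel)
    size = (len(image) - k) // stride + 1
    feature_map = []
    for r in range(size):
        row = []
        for c in range(size):
            top, left = r * stride, c * stride
            # The same kernel weights are reused for every region.
            row.append(sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k) for j in range(k)
            ))
        feature_map.append(row)
    return feature_map


# Made-up 4x4 input with one value per node.
image = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
# Hand-picked 2x2 filters standing in for learned weights.
filters = {
    "vertical":   [[1, 0], [1, 0]],
    "horizontal": [[1, 1], [0, 0]],
    "diagonal":   [[1, 0], [0, 1]],
}
feature_maps = {name: convolve(image, k, stride=2) for name, k in filters.items()}

conv_weights = sum(len(k) * len(k[0]) for k in filters.values())
hidden_nodes = sum(len(fm) * len(fm[0]) for fm in feature_maps.values())
n_classes = 9  # implied by the 12 x 9 = 108 weights of the output layer
dense_weights = hidden_nodes * n_classes

print(feature_maps["vertical"])                   # [[2, 2], [0, 1]]
print(conv_weights, hidden_nodes, dense_weights)  # 12 12 108
```

The vertical filter responds most strongly in the two top regions, where the made-up image contains vertical strokes.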

With respect to our concrete example of classifying small logos in images, a further advantage of CNNs is that they are largely translation invariant: because the same filter weights are applied to every region, they can recognize a logo regardless of where it appears in the image.

3. Training a Convolutional Neural Net for Classifying Logos in Keras

We are almost ready to define our own CNN architecture from scratch. To do so, we use pooling layers in addition to the convolutional layers described above. There are two common types, both of which reduce the dimensions of the feature maps: max pooling and global average pooling layers.

Max pooling layer with 2x2 pool size and two feature maps

A max pooling layer maps each pooling window in a feature map to its maximal value. E.g., the two green 2x2 pooling windows in the two feature maps on the left-hand side are mapped to their respective maximal values 0.6 and 0.7 in the max pooling layer on the right-hand side. We note that the feature maps shrink by a factor of 2 along each axis.
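Max pooling is easy to write out by hand. The following sketch assumes non-overlapping windows (stride equal to the pool size); the feature map values are made up and are not those of the figure.

```python
def max_pool(feature_map, pool_size=2):
    """Non-overlapping max pooling with stride equal to the pool size."""
    rows = len(feature_map) // pool_size
    cols = len(feature_map[0]) // pool_size
    return [
        [
            # Keep only the largest value inside each pooling window.
            max(
                feature_map[r * pool_size + i][c * pool_size + j]
                for i in range(pool_size) for j in range(pool_size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Made-up 4x4 feature map.
feature_map = [
    [0.1, 0.6, 0.2, 0.0],
    [0.3, 0.4, 0.5, 0.1],
    [0.0, 0.2, 0.7, 0.3],
    [0.1, 0.1, 0.2, 0.4],
]
print(max_pool(feature_map))  # [[0.6, 0.5], [0.2, 0.7]]
```

The 4x4 feature map becomes a 2x2 one, so later layers see a quarter of the values.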

Global average pooling layer with two feature maps

Similarly, a global average pooling layer maps each feature map to the average of the values of its nodes. E.g., the average value for the blue feature map on the right-hand side is (-0.1 - 0.2 - 0.4 - 0.1) / 4 = -0.2. Here we point out that each resulting feature map consists of a single node, so global average pooling layers reduce the dimension drastically.
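The same average can be checked in code. In this sketch the four values of the blue feature map are arranged as a 2x2 grid, which is an assumption about the figure; the arrangement does not affect the average.

```python
def global_average_pool(feature_maps):
    """Reduce each 2D feature map to the mean of all its values."""
    pooled = []
    for feature_map in feature_maps:
        values = [v for row in feature_map for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


blue = [[-0.1, -0.2], [-0.4, -0.1]]
print(global_average_pool([blue]))  # approximately [-0.2]
```

Each feature map, whatever its size, contributes exactly one number to the next layer.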

Now we are ready to define our CNN architecture as follows:

Finally, we are ready to train our CNN model. For this, we choose an optimization algorithm called Adam, cross entropy as the error function that measures how close we are to the true logo classes, and the standard metric accuracy = number of correct predictions / number of all predictions.
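Both quantities can be written out in a few lines. This sketch uses hypothetical helper names and computes the cross entropy for a single sample with an integer class label, which for one-hot targets equals the negative log of the probability assigned to the true class.

```python
import math


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def cross_entropy(true_label, probabilities):
    """Cross entropy of one sample: -log of the true class probability."""
    return -math.log(probabilities[true_label])


print(accuracy([0, 1, 2, 1], [0, 2, 2, 1]))  # 0.75
# A confident correct prediction is penalized less than an unsure one.
print(cross_entropy(0, [0.9, 0.05, 0.05]) < cross_entropy(0, [0.4, 0.3, 0.3]))  # True
```

Accuracy only counts hits and misses, whereas the cross entropy also rewards assigning a high probability to the correct class, which is why it is the quantity being minimized during training.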

4. Evaluating our trained CNN model

For the source code, see https://gist.github.com/ajuschka/5150cd989b10990b77ac32f7cf41b680

After training, we use Python’s library Matplotlib to plot how our chosen metric, accuracy, develops over the training epochs. We notice that while the accuracy computed on the training set after each epoch keeps increasing, the accuracy computed on the validation set stays at around 30%. So even though we added a dropout layer after the convolutional layer with the largest number of parameters to reduce the chance of overfitting, overfitting still seems to occur. That is, our CNN model tends to memorize our training data, so that its predictions on unseen data, in this case the validation set, are not that good.

Thus it would make sense to now collect more training data or fine-tune some parameters like the learning rate, the activation function, or the number of nodes and filters used in our architecture. However, for our capstone project we did not achieve much better results than this, so due to space limits we skip the fine-tuning step here.

Finally, with cnn_model.evaluate(test_tensors, test_targets, verbose=0)[1], we obtain that our CNN model achieves an accuracy of 31.18% on the test set.

5. Predicting Logo classes with our trained CNN

Having successfully trained our CNN model, we are ready to predict some logo classes using Keras’ function predict_classes:
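Conceptually, predicting a class means picking the output node with the highest Softmax probability. The sketch below shows this step in plain Python; the class names and probabilities are hypothetical and do not come from our trained model.

```python
def predict_class_names(probabilities, class_names):
    """Map each row of class probabilities to the most likely class name."""
    names = []
    for row in probabilities:
        # Index of the largest probability in this row (argmax).
        best = max(range(len(row)), key=row.__getitem__)
        names.append(class_names[best])
    return names


class_names = ["burgerking", "starbucks", "telekom"]  # hypothetical classes
probabilities = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
print(predict_class_names(probabilities, class_names))  # ['starbucks', 'burgerking']
```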

For more results on fine-tuning parameters and other neural network architectures that we developed for our logo classification capstone project, we refer the reader to our submitted Jupyter notebook and capstone project report in our GitHub repository. There you will also find instructions for using TensorFlow’s Object Detection API for Faster R-CNN models, as well as this blog post as a Jupyter notebook.


Ann-Kristin Juschka
NEW IT Engineering

Software Engineer and math Ph.D. working at Accenture Technology