Logo Detection Using PyTorch

Published in Diving in Deep, 9 min read, Jun 21, 2018

I wrote this blog post to wrap up my first ever public talk, at PyCon Thailand 2018, and to add some more details. Download all materials.

To run the notebook on Google Colab’s free GPU, read here.

Ad Tech

Advertising technology, commonly known as “Ad Tech”, is used by brands, vendors, and agencies to analyze potential customers’ online activities and gain insights from them. In the past year, machine learning and deep learning became major tools for Ad Tech. For example, image recognition systems are used to identify brands, products, and logos in publicly posted images. The easiest way to identify a brand in an image is by its logo. In deep learning terms, this is an image classification problem.

Deep Learning

Deep learning is all about creating, training, and deploying networks. Let’s look at some basic networks.

Single Layer Neural Network

This is the most fundamental network. It has one layer with a single neural unit, or neuron. The neuron computes the dot product of its inputs and its weights, and applies an activation function to the result. A very popular activation function today is ReLU (Rectified Linear Unit), which computes max(0, x).
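As a minimal sketch of the neuron described above, the following plain-Python snippet computes the dot product and applies ReLU. The function names, the optional bias term, and the sample numbers are illustrative assumptions, not taken from the talk.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def neuron(inputs, weights, bias=0.0):
    """Dot product of inputs and weights (plus an optional bias), then ReLU."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)


# Hypothetical inputs and weights, chosen only for illustration.
positive = neuron([1.0, 2.0], [0.5, 0.25])   # z = 1.0, so ReLU passes it through
negative = neuron([1.0, 2.0], [-1.0, 0.25])  # z = -0.5, so ReLU clips it to 0.0
print(positive, negative)
```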

Fully Connected Network

This is a network with multiple layers and multiple neurons per layer. Every neuron in one layer is connected to every neuron in the next. You can have any number of hidden layers and neurons.

Convolutional Neural Network (CNN)

This is an important type of network that is widely used for image recognition. Its building blocks are filters and activation maps. A filter, with the same depth as the image, slides over every location of the image. At each location it computes the dot product of the values in the area it covers and its weights, which are shared across all locations. The result is an activation map.
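To make the filter and activation map shapes concrete, here is a small sketch using PyTorch's nn.Conv2d. The 32 x 32 input size and the choice of 6 filters are assumptions picked for illustration only.

```python
import torch
import torch.nn as nn

# Six 5x5 filters; each filter has depth 3 to match an RGB image.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)

# One hypothetical 32x32 RGB image (batch, channels, height, width).
image = torch.randn(1, 3, 32, 32)
activation_maps = conv(image)

# Each filter yields one 28x28 activation map (32 - 5 + 1 = 28).
out_shape = tuple(activation_maps.shape)
# The weights are stored per filter: (filters, depth, height, width).
weight_shape = tuple(conv.weight.shape)
print(out_shape, weight_shape)
```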

Max Pooling

Max pooling selects the maximum value in the area it covers. The picture shows 2 x 2-pixel max pooling with stride = 2 (the stride is the step the window moves each time).
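A plain-Python sketch of 2 x 2 max pooling with stride 2 might look like this. The 4 x 4 input values are made up for the example and are not the ones in the picture.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 over a 2-D list of numbers."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - 1, 2):          # move two rows at a time
        pooled_row = []
        for c in range(0, cols - 1, 2):      # move two columns at a time
            window = [image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1]]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 input.
image = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool_2x2(image)
print(pooled)  # [[6, 8], [3, 4]]
```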

Create Network

We can build an image classification network from the components described above. For example, LeNet-5, shown in this picture, is a network used to classify handwritten digits 0–9. It consists of convolutional layers, pooling layers, and a fully connected network.

Loss Functions

A loss function measures the difference between the network’s output and the target. For image classification, we apply the softmax function to the outputs and use cross-entropy as the loss function. For example, if the target y_i is class 2 (one-hot encoding [0, 0, 1]), the loss in this example is 0.532.
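The softmax and cross-entropy computation can be sketched in plain Python as follows. The logits here are hypothetical, so the resulting loss is not meant to reproduce the 0.532 from the slide, which depends on outputs not listed in the text.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log of the probability assigned to the target class."""
    return -math.log(softmax(logits)[target_index])


logits = [1.0, 2.0, 3.0]   # hypothetical network outputs for 3 classes
probs = softmax(logits)
loss = cross_entropy(logits, target_index=2)  # target is class 2, i.e. [0, 0, 1]
print(probs, loss)
```

The more probability the network assigns to the target class, the closer the loss gets to zero.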

Gradient Descent

Gradient descent is an algorithm used to minimize the loss, that is, to optimize the network. If we plot the loss or cost J(w) against a weight w, the gradient is the slope, which measures how much the loss changes when the weight changes. If the gradient is positive, decreasing the weight decreases the loss. If the gradient is negative, increasing the weight decreases the loss.

Source: Python Machine Learning 2nd Edition by Sebastian Raschka, Packt Publishing Ltd. 2017

We update each weight using the equation below. The learning rate is a hyperparameter that we have to choose. If it’s too high, the updates overshoot and may never reach the minimum (see upper-right picture).

Update weight : w += - learning_rate * gradient
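Here is a small sketch of that update rule on a toy cost function, J(w) = (w - 3)^2, whose minimum is at w = 3. The cost function, the starting weight, and both learning rates are assumptions chosen to show convergence and overshooting.

```python
def gradient_descent(grad_fn, w, learning_rate, steps):
    """Repeatedly apply w += -learning_rate * gradient."""
    for _ in range(steps):
        w += -learning_rate * grad_fn(w)
    return w


def grad(w):
    """Gradient of the toy cost J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


# A small learning rate walks steadily toward the minimum at w = 3.
w_good = gradient_descent(grad, w=0.0, learning_rate=0.1, steps=100)
# A learning rate that is too high overshoots further on every step.
w_bad = gradient_descent(grad, w=0.0, learning_rate=1.1, steps=100)
print(w_good, w_bad)
```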

Network Training Loop

Now we put it all together in a training loop: compute the outputs, the loss, and the gradients, then update the parameters (weights and biases). We repeat the process until the loss reaches its minimum.
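The loop can be sketched without any framework on a toy problem. This example fits a line to four points generated from y = 2x + 1, with a mean squared error loss. The data, the loss, the learning rate, and the number of epochs are all illustrative assumptions.

```python
# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0          # parameters: one weight and one bias
learning_rate = 0.05
n = len(xs)

for epoch in range(2000):
    # 1. Compute outputs.
    outputs = [w * x + b for x in xs]
    # 2. Compute loss (mean squared error).
    loss = sum((o - y) ** 2 for o, y in zip(outputs, ys)) / n
    # 3. Compute gradients of the loss with respect to w and b.
    grad_w = sum(2 * (o - y) * x for o, y, x in zip(outputs, ys, xs)) / n
    grad_b = sum(2 * (o - y) for o, y in zip(outputs, ys)) / n
    # 4. Update parameters.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, loss)
```

Writing the gradient formulas by hand is manageable for two parameters, but not for a real network.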

PyTorch

As you can see, deep learning requires a lot of work and computation. The effective way to handle it is to use a deep learning framework.
PyTorch is a deep learning framework for Python. It comes with Autograd, which computes gradients automatically, and with tools for building and training deep learning models easily and efficiently. It supports GPUs (graphics processing units), runs on Linux, Mac, and Windows, and is easy to install (see pytorch.org).
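A tiny Autograd sketch: we define a loss in terms of a weight, call backward(), and read the gradient PyTorch computed for us. The quadratic loss and the starting value are assumptions for illustration.

```python
import torch

# A single weight that Autograd should track.
w = torch.tensor(2.0, requires_grad=True)

# Toy loss J(w) = (w - 3)^2; its gradient is 2 * (w - 3).
loss = (w - 3.0) ** 2
loss.backward()              # Autograd computes dJ/dw

gradient = w.grad.item()     # 2 * (2 - 3) = -2
print(gradient)
```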

Tensor

The tensor is the fundamental data structure of PyTorch. It’s a multi-dimensional array, similar to NumPy’s ndarray, but it can also run on a GPU to accelerate computation.
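As a short sketch, the snippet below converts a NumPy array into a tensor and runs a matrix multiplication on the GPU when one is available, falling back to the CPU otherwise. The array values are arbitrary.

```python
import numpy as np
import torch

array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
tensor = torch.from_numpy(array)          # shares memory with the NumPy array

# Use the GPU if PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
on_device = tensor.to(device)

# Matrix product computed on the chosen device, then moved back to the CPU.
result = (on_device @ on_device).cpu()
print(result)
```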

Project Pipelines

Now we’re going to build a logo detector with PyTorch. You can download the Jupyter notebook here if you want to get straight to the code.

1. Get the Data

We use the FlickrLogos-32 dataset from https://www.uni-augsburg.de/en/fakultaet/fai/informatik/prof/mmc/research/datensatze/flickrlogos/ (see the Download section).

32 brands + no-logo
Adidas, Aldi, Apple, Becks, BMW, Carlsberg, Chimay, Coca-Cola, Corona, DHL, Erdinger, Esso, Fedex, Ferrari, Ford, Foster’s, Google, Guinness, Heineken, HP, Milka, Nvidia, Paulaner, Pepsi, Ritter Sport, Shell, Singha, Starbucks, Stella Artois, Texaco, Tsingtao and UPS.
There are 320 logo images for training, 960 logo images for validation, 3,960 images for testing, and 3,000 no-logo images.

Import the required libraries.
Define the directories.
Create a load_datasets utility function to download the dataset from FLICLLOGOS_URL and unzip it to SOURCE_DIR.
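The notebook's own load_datasets is not reproduced here, but a minimal sketch of such a helper, using only the standard library, could look like this. The parameter names and the archive file name are hypothetical, and the sketch assumes the dataset is distributed as a zip archive.

```python
import os
import urllib.request
import zipfile


def load_datasets(url, source_dir, archive_name="dataset.zip"):
    """Download a zip archive from `url` and extract it into `source_dir`.

    `archive_name` is a hypothetical local file name for the download.
    """
    os.makedirs(source_dir, exist_ok=True)
    archive_path = os.path.join(source_dir, archive_name)

    # Skip the download if the archive is already there.
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)

    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(source_dir)
    return source_dir


# Usage (not run here): load_datasets(FLICLLOGOS_URL, SOURCE_DIR)
```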

2. Prepare Data for Network

Create a list_image_paths utility function to read the relative image file paths from a text file into list variables.
Add train_logo_relpaths and half of val_logo_relpaths to train_relpaths.
Add train_logo_relpaths and the other half of val_logo_relpaths to val_relpaths.
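A possible sketch of the path-reading helper and the half-and-half split is shown below. It assumes the text file holds one relative path per line. The split_in_half helper and the sample paths are hypothetical and only illustrate the idea.

```python
def list_image_paths(list_file):
    """Read relative image paths from a text file, one path per line."""
    with open(list_file) as f:
        return [line.strip() for line in f if line.strip()]


def split_in_half(paths):
    """Hypothetical helper: return the first and second half of a list."""
    middle = len(paths) // 2
    return paths[:middle], paths[middle:]


# Made-up validation paths, just to show the split.
val_logo_relpaths = ["adidas/1.jpg", "adidas/2.jpg", "aldi/3.jpg", "aldi/4.jpg"]
first_half, second_half = split_in_half(val_logo_relpaths)
print(first_half, second_half)
```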

We’re going to use datasets.ImageFolder(), which expects the directory structure dataset/classes/img.jpg.
Create a prepare_datasets utility function to copy the image files, according to the lists of relative paths, into that directory structure.
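One way such a copy helper could be written is sketched here. It assumes the class name is the name of the folder that directly contains each image, which may not match the notebook. The parameter names are hypothetical.

```python
import os
import shutil


def prepare_datasets(relpaths, source_dir, target_dir):
    """Copy images into target_dir/<class_name>/<file>, as ImageFolder expects.

    Assumption: the class name is the image's immediate parent folder.
    """
    for relpath in relpaths:
        class_name = os.path.basename(os.path.dirname(relpath))
        class_dir = os.path.join(target_dir, class_name)
        os.makedirs(class_dir, exist_ok=True)
        shutil.copy(os.path.join(source_dir, relpath), class_dir)


# Usage (not run here): prepare_datasets(train_relpaths, SOURCE_DIR, "dataset/train")
```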

Import the torch and torchvision libraries.
Define data_transforms, which resizes the inputs, converts them to tensors, and normalizes them by the mean and standard deviation of the training dataset.
Create the datasets using torchvision.datasets.ImageFolder, passing the dataset directories and data_transform.
Create the dataloaders using DataLoader with batch size = 32 on each dataset.

We create an imshow utility function to display images.

3. Create Network

We will create a LeNet-5-like network for our image classification.

Create our network by subclassing nn.Module.
Define the __init__ method: create the conv1 layer using nn.Conv2d with 3 in-channels, 6 out-channels, and a 5 x 5-pixel filter (stride=1 by default), and the conv2 layer with 6 in-channels, 16 out-channels, and the same filter size.
Create the pool layer using nn.MaxPool2d with a 2 x 2-pixel filter and stride=2.
Create the fully connected layer fc1 using nn.Linear with (16 * 53 * 53) in-features and 120 out-features, fc2 with 120 in-features and 84 out-features, and fc3 with 84 in-features and 33 out-features.

Define the forward method for the inputs.
Forward the inputs through the conv1 layer, apply nn.functional.relu, and forward through the pool layer.
Forward the result through the conv2 layer, apply nn.functional.relu, and forward through the pool layer.
Flatten x to a two-dimensional tensor of shape [number of instances, 16*53*53] using the view method.
Forward through fc1, apply relu, forward through fc2, apply relu, and forward through fc3 without relu because we will use softmax instead.
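Put together, the layers and the forward pass described above could be written like this. The class name LogoNet is a placeholder, and the 224 x 224 input size is inferred from the 16 * 53 * 53 features: 224 becomes 220 after conv1, 110 after pooling, 106 after conv2, and 53 after the second pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogoNet(nn.Module):  # hypothetical class name
    def __init__(self, num_classes=33):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)       # 3 in-channels, 6 out-channels, 5x5 filter
        self.conv2 = nn.Conv2d(6, 16, 5)      # 6 in-channels, 16 out-channels, 5x5 filter
        self.pool = nn.MaxPool2d(2, 2)        # 2x2 filter, stride 2
        self.fc1 = nn.Linear(16 * 53 * 53, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 53 * 53)          # flatten to [instances, features]
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                    # raw scores; softmax is applied in the loss


net = LogoNet()
logits = net(torch.randn(2, 3, 224, 224))     # two random stand-in images
logits_shape = tuple(logits.shape)
print(logits_shape)  # (2, 33)
```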

Before instantiating our network, use torch.device to select the GPU if one is available, and the CPU otherwise.
Then move our network to the device.
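A short sketch of device selection follows. The small nn.Linear model is only a stand-in for the real network.

```python
import torch
import torch.nn as nn

# Pick the first GPU if PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)       # stand-in for the real network
model = model.to(device)      # move all parameters to the chosen device

param_device = next(model.parameters()).device
print(device, param_device)
```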

4. Train the Network

Import torch.optim.
Define the criterion, or loss function, using nn.CrossEntropyLoss, which already combines the softmax function with the cross-entropy loss.
Use optim.SGD (Stochastic Gradient Descent) as the optimizer with lr (learning rate) = 0.001.
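The following sketch sets up the criterion and optimizer, and checks that nn.CrossEntropyLoss applied to raw scores matches a manual log-softmax computation. The stand-in model and the sample logits are assumptions, and no optimizer arguments beyond lr are set because the post does not mention any.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 33)                      # stand-in for the real network
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Hypothetical raw scores for one image and three classes; the target is class 2.
logits = torch.tensor([[1.0, 2.0, 3.0]])
target = torch.tensor([2])

loss_value = criterion(logits, target).item()
# Same result by hand: negative log-softmax of the target class.
manual_value = -torch.log_softmax(logits, dim=1)[0, 2].item()
print(loss_value, manual_value)
```

Because the softmax is built into the loss, the network's last layer returns raw scores.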

Create a train_val function to train the network on the training dataset and evaluate it on the validation dataset.
Set the model (network) to train mode for training and to eval mode for evaluation.

Source: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

Move the inputs and labels to the device.
Reset the gradients to zero using optimizer.zero_grad so that gradients do not accumulate across iterations.
Set torch.set_grad_enabled to True only when training.
Compute the outputs.
Compute the loss.
If training, compute the gradients using loss.backward and update the parameters using optimizer.step.

The function returns the model with the best accuracy on the validation dataset.
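A condensed sketch of such a function, written in the spirit of the PyTorch transfer learning tutorial rather than copied from the notebook, is shown below. The signature, the tiny synthetic dataset, and the hyperparameters in the usage part are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


def train_val(model, dataloaders, criterion, optimizer, device, num_epochs=5):
    """Train on dataloaders['train'], evaluate on dataloaders['val'],
    and return the model with the best validation accuracy."""
    best_weights = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    for epoch in range(num_epochs):
        for phase in ("train", "val"):
            is_train = phase == "train"
            model.train(is_train)                 # train mode or eval mode
            correct, total = 0, 0
            for inputs, labels in dataloaders[phase]:
                inputs, labels = inputs.to(device), labels.to(device)
                optimizer.zero_grad()             # clear old gradients
                with torch.set_grad_enabled(is_train):
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    if is_train:
                        loss.backward()           # compute gradients
                        optimizer.step()          # update parameters
                correct += (outputs.argmax(dim=1) == labels).sum().item()
                total += labels.size(0)
            accuracy = correct / total
            if not is_train and accuracy > best_acc:
                best_acc = accuracy
                best_weights = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_weights)
    return model


# Usage on a tiny synthetic two-class problem (stand-in for the logo data).
torch.manual_seed(0)
features = torch.randn(64, 2)
labels = (features[:, 0] > 0).long()
loaders = {
    "train": DataLoader(TensorDataset(features[:48], labels[:48]), batch_size=16),
    "val": DataLoader(TensorDataset(features[48:], labels[48:]), batch_size=16),
}
device = torch.device("cpu")
model = nn.Linear(2, 2).to(device)
trained = train_val(model, loaders, nn.CrossEntropyLoss(),
                    optim.SGD(model.parameters(), lr=0.1), device, num_epochs=3)
```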

Train and validate the network.
It took about 11 minutes on one GPU, with a best accuracy of 61.87%, which is not very good.

5. Evaluate

We evaluate on the test dataset using a test function, which is very similar to the validate function.
We get better accuracy on the test set.

There are many things we can do to improve our own network. But there is one practice in deep learning that is very useful and effective: transfer learning.

Transfer Learning

Transfer learning is a machine learning technique where a model trained on one task is re-purposed on a second related task.

Source: https://machinelearningmastery.com/transfer-learning-for-deep-learning/

Simply speaking, we can run our data through a pretrained network with some tweaks. It saves a lot of time and is very effective.

How?

Select the pretrained network.
Match our data to the network input’s format.
Replace the output layer.
Retrain the network.

ResNet18

This is ResNet18. You can see how complex the network is. It could take many days and a lot of computing power to train it from scratch.

Source: https://arxiv.org/pdf/1512.03385.pdf

Match the network input’s format

We normalize our datasets with the mean and standard deviation of ResNet18’s training dataset.

Load Pretrained Network

Load ResNet18 using torchvision.models.resnet18.
The output layer fc is the layer we’re going to replace.

Replace output layer

Replace fc with a new fc layer that has 512 in-features and 33 out-features.

Retrain the Network

Retrain the network using train_val().
The accuracy on the validation dataset is much better.

Evaluate on Test Dataset

Evaluate the network on the test dataset using test().

As Fixed Feature Extractor

The previous approach to transfer learning is called fine-tuning. We can also use the pretrained network as a fixed feature extractor by freezing the whole network except the final layer. We need to set requires_grad = False to freeze the parameters so that their gradients are not computed in backward().