Object Localization using Keras

Roy Ganz
Published in Analytics Vidhya
4 min read · Jul 13, 2020

In this blog, I will explain the task of object localization and how to implement a CNN-based architecture that solves it.

Resources:
code: https://colab.research.google.com/drive/169pJ-xECBWDW9Q92naaNE3oRyBr7D-uh#scrollTo=IjFHABNA7nZL
images: https://drive.google.com/drive/folders/1YsxDywjUZi-l_YgzXt__9__eUU-I_6EH?usp=sharing

Agenda

  • What is Object Localization?
  • How to perform Object Localization using Deep Neural Networks
  • Step-by-step Keras implementation

Object Localization

Object localization is also known as “classification with localization”: given an image, classify the object that appears in it and find its location, usually with a bounding box. In object localization, only a single object can appear in the image. If more than one object can appear, the task is called “object detection”.

Classification vs Object Localization

How to perform Object Localization using DNNs?

Object localization can be treated as a regression problem - predicting continuous values, such as a weight or a salary. For instance, we can represent our output (a bounding box) as a tuple of size 4:
(x, y, height, width)
- (x, y): the coordinates of the top-left corner of the bounding box
- height: the height of the bounding box
- width: the width of the bounding box

(x,y,height,width) as a bounding-box representation

Hence, we can use a CNN architecture that outputs 4 numbers, but what will that architecture look like?

First, notice that these 4 numbers are constrained: (x, y) must be inside the image, and so must (x + width, y + height). We will scale the image width and height to 1.0. To make sure the CNN outputs are in the range [0, 1], we will use a sigmoid activation layer. It enforces that (x, y) is inside the image, but not necessarily (x + width, y + height); that property has to be learned by the network during training.

Suggested architecture:

Given an image, the network outputs a bounding box represented by 4 numbers

What about the loss function? The output of a sigmoid can be treated as a probability, so we can use the binary_crossentropy loss.

binary_crossentropy formula: L = -(y·log(ŷ) + (1-y)·log(1-ŷ))

Usually, this loss is used with binary ground-truth values ({0, 1}), but it does not have to be: any values in [0, 1] work. For our use case, the ground-truth values are indeed in the range [0, 1], since they represent locations and dimensions inside an image.
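A quick numeric sanity check (not from the original notebook; the helper name `bce` is mine) shows that binary cross-entropy is well defined for continuous targets in [0, 1] and is minimized when the prediction matches the target:

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy, averaged over elements; eps avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# a plausible ground-truth box: all coordinates already in [0, 1]
y_true = np.array([0.25, 0.40, 0.30, 0.30])

# minimized (but not zero) when y_pred equals the continuous target
print(bce(y_true, y_true))
# larger for a clearly wrong prediction
print(bce(y_true, np.array([0.9, 0.9, 0.9, 0.9])))
```

Note that with continuous targets the loss does not reach zero at the optimum, but its minimum is still exactly at y_pred = y_true, which is all the optimizer needs.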

Step-by-Step Keras implementation

We will solve an object localization task in three steps:
1) Synthetic use case - detect white blobs on a black background
2) Semi-synthetic use case - detect cats on a black background
3) Final use case - detect cats on a natural background

  1. Synthetic use case:
    First of all, we will add the necessary imports and definitions:
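The linked notebook contains the exact imports; a plausible set, with constant names of my own choosing, might look like:

```python
# Imports and constants (IMG_SIZE and BATCH_SIZE are my names,
# not necessarily the original notebook's).
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

IMG_SIZE = 128    # images will be IMG_SIZE x IMG_SIZE pixels
BATCH_SIZE = 32   # images per generated batch
```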

Then, we will define our CNN-based model. We will perform transfer learning, using a pre-trained version of VGG.

The model’s architecture
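A sketch of such a model, assuming a frozen pre-trained VGG16 backbone with a small regression head on top; the head's layer sizes are illustrative, not necessarily the author's exact choices:

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_model(img_size=128, weights="imagenet"):
    """VGG16 backbone + 4-unit sigmoid head predicting (x, y, h, w) in [0, 1]."""
    backbone = VGG16(weights=weights, include_top=False,
                     input_shape=(img_size, img_size, 3))
    backbone.trainable = False  # transfer learning: freeze pre-trained weights
    x = layers.Flatten()(backbone.output)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(4, activation="sigmoid")(x)  # sigmoid keeps outputs in [0, 1]
    model = Model(backbone.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```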

Our next step is creating a dataset using a Python generator. It will create batches of white blobs over a black background.

X is a batch of images and Y holds the ground-truth bounding boxes. For each image, we will create a white blob and make sure that it stays inside the borders of the image.
An example of the circle generator
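A minimal sketch of such a generator (the function name and internals are mine, following the description above):

```python
import numpy as np

def circle_gen(img_size=128, batch_size=32):
    """Yield (X, Y): black images with one white blob, and boxes scaled to [0, 1]."""
    while True:
        X = np.zeros((batch_size, img_size, img_size, 3), dtype=np.float32)
        Y = np.zeros((batch_size, 4), dtype=np.float32)
        for i in range(batch_size):
            r = np.random.randint(5, img_size // 4)   # random radius
            cx = np.random.randint(r, img_size - r)   # center chosen so the blob
            cy = np.random.randint(r, img_size - r)   # stays inside the image
            yy, xx = np.ogrid[:img_size, :img_size]
            X[i][(xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2] = 1.0
            # ground truth: top-left corner plus height/width, scaled by img_size
            Y[i] = [(cx - r) / img_size, (cy - r) / img_size,
                    2 * r / img_size, 2 * r / img_size]
        yield X, Y
```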

Now, we can train the model to predict the ground-truth bounding box and visualize its performance:
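A hypothetical training-and-visualization step, assuming a compiled `model` and a batch generator as described above; the step and epoch counts are illustrative:

```python
import matplotlib.pyplot as plt

def train_and_show(model, gen, img_size=128, epochs=5, steps=100):
    """Fit the model on the generator, then draw one predicted box."""
    model.fit(gen, steps_per_epoch=steps, epochs=epochs)
    X, _ = next(gen)
    # the network outputs (x, y, h, w) in [0, 1]; scale back to pixels
    pred = model.predict(X[:1])[0] * img_size
    plt.imshow(X[0])
    plt.gca().add_patch(plt.Rectangle((pred[0], pred[1]), pred[3], pred[2],
                                      edgecolor="red", fill=False))
    plt.show()
    return pred
```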

2. Semi-synthetic use case:
For this use case, we will use the same architecture with a different type of image: images of a cat over a black background. This will be achieved with a different generator:

For each image, this generator resizes the cat to a random size and draws it at a random position over a black background.
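A sketch of such a generator. It assumes a cat cut-out with a transparent background at `cat.png` (a hypothetical path; the article's image folder is linked in the Resources section), and treats the pasted region as the ground-truth box:

```python
import numpy as np
from PIL import Image

def cat_gen(img_size=128, batch_size=32, cat_path="cat.png"):
    """Yield (X, Y): cats pasted on black backgrounds, boxes scaled to [0, 1]."""
    cat = Image.open(cat_path).convert("RGBA")
    while True:
        X = np.zeros((batch_size, img_size, img_size, 3), dtype=np.float32)
        Y = np.zeros((batch_size, 4), dtype=np.float32)
        for i in range(batch_size):
            s = np.random.randint(img_size // 4, img_size // 2)  # random cat size
            x0 = np.random.randint(0, img_size - s)              # random position,
            y0 = np.random.randint(0, img_size - s)              # kept inside the image
            canvas = Image.new("RGB", (img_size, img_size), "black")
            resized = cat.resize((s, s))
            canvas.paste(resized, (x0, y0), mask=resized)        # alpha-aware paste
            X[i] = np.asarray(canvas, dtype=np.float32) / 255.0
            Y[i] = [x0 / img_size, y0 / img_size, s / img_size, s / img_size]
        yield X, Y
```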

The output of the above is:

Now, we can define a similar model, train it and examine its results:

Our trained model is capable of locating a cat in images with a black background:

3. Final use case
In this use case, we will draw our cat over natural backgrounds. As in the 2nd use case, we will implement a new image generator: each image consists of a random-sized cat at a random position, drawn over a natural background image chosen at random from a set of images.

For each image, this generator picks a random natural background from a given set, resizes the cat to a random size, and draws it over the background at a random position.
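The same sketch extended with random natural backgrounds; `backgrounds/` is a hypothetical folder holding the natural images (the article's image folder is linked in the Resources section):

```python
import os
import numpy as np
from PIL import Image

def natural_cat_gen(img_size=128, batch_size=32,
                    cat_path="cat.png", bg_dir="backgrounds"):
    """Yield (X, Y): cats pasted on random natural backgrounds."""
    cat = Image.open(cat_path).convert("RGBA")
    # load every image in the background folder once, up front
    bgs = [Image.open(os.path.join(bg_dir, f)).convert("RGB")
           for f in os.listdir(bg_dir)]
    while True:
        X = np.zeros((batch_size, img_size, img_size, 3), dtype=np.float32)
        Y = np.zeros((batch_size, 4), dtype=np.float32)
        for i in range(batch_size):
            canvas = bgs[np.random.randint(len(bgs))].resize((img_size, img_size))
            s = np.random.randint(img_size // 4, img_size // 2)  # random cat size
            x0 = np.random.randint(0, img_size - s)              # random position
            y0 = np.random.randint(0, img_size - s)
            resized = cat.resize((s, s))
            canvas.paste(resized, (x0, y0), mask=resized)        # alpha-aware paste
            X[i] = np.asarray(canvas, dtype=np.float32) / 255.0
            Y[i] = [x0 / img_size, y0 / img_size, s / img_size, s / img_size]
        yield X, Y
```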

The natural_cat_gen() output will be similar to:

Next, we will define a new model, train it and visualize some of its results:


Roy Ganz
MSc student at Technion for Deep Learning and Computer Vision