Finding “Waldo” using a simple Convolutional Neural Network

Kaneel Senevirathne
Published in Analytics Vidhya · Aug 31, 2021 · 9 min read

I owned dozens of “Where’s Waldo?” books when I was young, and I am sure you did too. If you don’t know it, “Where’s Waldo?” (sometimes called “Where’s Wally?”) is a series of books consisting of detailed, double-page cartoon illustrations showing hundreds of people doing a variety of amusing things at a given location. The challenge is to find a fictional character named “Waldo” (or Wally) hidden among the other people in the illustration. Waldo is well known for wearing a red-and-white striped shirt, a bobble hat, and glasses, and he is very challenging to find.

Recently, after learning some new object localization and classification techniques, I thought of going back to basics and building an object localization and classification tutorial from scratch using the TensorFlow functional API. First, I created some sample images, then implemented a convolutional neural network, and finally trained the model to see how well it can predict where Waldo is in the image. I also introduced another character, “Wilma”, to see if the model can distinguish between the two characters while locating them in the image. In this blog, I describe the process from data synthesis to model training, so if you are a beginner to object localization this should be an easy and fun tutorial. If you would like to check out the Jupyter Notebook, click here.

Synthesizing data

First, we need to synthesize some data for the neural network to learn from. To do this, we will take a sample background image and randomly place “Waldo” or “Wilma” on it.

The background image used to synthesize data
Images of Waldo (left) and Wilma (right) used to synthesize sample images

To create a sample image we will use the following function. The function imports the background image and the images of Waldo and Wilma, and then pastes either Waldo or Wilma at a random location on the background (note that the upper left corner of the pasted image will coincide with this random location). The function outputs the synthesized image, the random location where we placed the character (the x and y coordinates as a tuple), and which character we placed (Waldo or Wilma).

#imports needed throughout the tutorial
import numpy as np
from PIL import Image, ImageDraw

#create a function to generate images
#(background_dir, waldo_dir and wilma_dir are paths to the source images)
def generate_sample_image():

    #background image
    background_im = Image.open(background_dir)
    background_im = background_im.resize((500, 350))

    #waldo
    waldo_im = Image.open(waldo_dir)
    waldo_im = waldo_im.resize((60, 100))

    #wilma
    wilma_im = Image.open(wilma_dir)
    wilma_im = wilma_im.resize((60, 100))

    #select the x and y coordinates randomly from (0, 410) and (0, 230),
    #so the 60x100 character always fits inside the 500x350 background
    col = np.random.randint(0, 410)
    row = np.random.randint(0, 230)

    #pick randomly between waldo and wilma: 1 selects Waldo, 0 selects Wilma
    rand_person = np.random.choice([0, 1], p = [0.5, 0.5])

    if rand_person == 1:
        background_im.paste(waldo_im, (col, row), mask = waldo_im)
        cat = 'Waldo'
    else:
        background_im.paste(wilma_im, (col, row), mask = wilma_im)
        cat = 'Wilma'

    return np.array(background_im).astype('uint8'), (col, row), rand_person, cat
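
To sanity-check the function, we can synthesize one image and display it. (This short snippet is my own addition; it assumes matplotlib is available, and the original notebook may display the image differently.)

import matplotlib.pyplot as plt

#generate one sample and show it with its label and location
sample_im, pos, person, cat = generate_sample_image()
plt.imshow(sample_im)
plt.title(cat + ' placed at ' + str(pos))
plt.axis('off')
plt.show()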

Below is a sample image created by our function. See if you can find Waldo or Wilma in it yourself!

Sample image generated by our function

If you didn’t find either Waldo or Wilma in this picture, don’t worry! Our algorithm will help us find them. (In this example, Waldo is between the two umbrellas.)

Creating bounding boxes

A bounding box is a common way of showing where an object is in an image. Typically, we provide the coordinates of the bounding box to the model and train it to predict those coordinates. Ideally, a fully trained model will predict a bounding box that overlaps the actual bounding box.
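
Although this tutorial only visualizes that overlap, a standard way to quantify it is intersection over union (IoU). The box_iou helper below is my own addition, not part of the original notebook, specialized to our fixed 60x100 boxes:

#intersection over union (IoU) for two 60x100 boxes, each given by the
#(col, row) of its upper left corner; 1.0 means a perfect match
def box_iou(coords_a, coords_b, w = 60, h = 100):

    #overlap along each axis is the box size minus the offset, floored at 0
    overlap_x = max(0, w - abs(coords_a[0] - coords_b[0]))
    overlap_y = max(0, h - abs(coords_a[1] - coords_b[1]))

    intersection = overlap_x * overlap_y
    union = 2 * w * h - intersection

    return intersection / union

For example, box_iou((100, 50), (110, 60)) gives 0.6, while two identical locations give 1.0.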

Next, we will create a bounding box to visualize the location of Waldo and Wilma. As mentioned before, the upper left corner of each pasted image of Waldo or Wilma coincides with the random location, so the upper left corner of the bounding box will be the random location generated by the generate_sample_image function.

def plot_bounding_box(image, gt_coords, pred_coords = None):

    #convert the array back to a PIL image so we can draw on it
    image = Image.fromarray(image)
    draw = ImageDraw.Draw(image)

    #true box in green; all boxes are 60x100, anchored at the upper left corner
    draw.rectangle((gt_coords[0], gt_coords[1], gt_coords[0] + 60, gt_coords[1] + 100), outline = 'green', width = 5)

    if pred_coords:

        #predicted box in red
        draw.rectangle((pred_coords[0], pred_coords[1], pred_coords[0] + 60, pred_coords[1] + 100), outline = 'red', width = 5)

    return image

This function takes the image, the true coordinates, and (optionally) the predicted coordinates, and outputs the image with bounding boxes drawn at the given locations. Below is an image generated by this function when we supply both the actual location of the character and a predicted location. Note that the true bounding box is drawn as a green rectangle and the predicted bounding box as a red rectangle.

Actual and Predicted bounding boxes generated by the function

Data generator

Next, we will create a data generator for our model. This data generator function continuously feeds images to the model during training. For each image we generate the training image itself, the class (whether the image contains Waldo or Wilma), and the position of the bounding box (the x and y coordinates of its upper left corner). Because our bounding boxes are rectangles of the same size, we only need to feed this upper left coordinate to the model. The data generator yields images in batches, with a default batch size of 16.

#data generator function
def generate_data(batch_size = 16):

    while True:

        #create empty arrays for the generated data
        x_batch = np.zeros((batch_size, 350, 500, 3))
        y_batch = np.zeros((batch_size, 1))
        boundary_box = np.zeros((batch_size, 2))

        for i in range(batch_size):

            #generate an example image
            sample_im, pos, person, _ = generate_sample_image()

            #put the images into the arrays
            x_batch[i] = sample_im/255 #normalize
            y_batch[i] = person
            boundary_box[i, 0] = pos[0]
            boundary_box[i, 1] = pos[1]

        yield {'image': x_batch}, {'class': y_batch, 'box': boundary_box}
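
Before wiring the generator to a model, it is worth pulling a single batch and confirming the shapes the model will receive. (This quick check is my own addition.)

#sanity check: one batch of 4 images with their class and box targets
images, targets = next(generate_data(batch_size = 4))
print(images['image'].shape)   #(4, 350, 500, 3)
print(targets['class'].shape)  #(4, 1)
print(targets['box'].shape)    #(4, 2)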

Implementing the model using Tensorflow

Our model has three main parts, and we will define a function for each using the TensorFlow functional API. The first part extracts the features, the second produces a regression output, and the third produces a classification output. Below is an overview of our model.

An overview of our model

As mentioned earlier, the first part of the model extracts features from the training images. To do this, we use multiple convolutional layers, each followed by batch normalization and max pooling. Below is the function that implements this part of the model.

import tensorflow as tf

#create the model
def convolutional_block(inputs):

    x = tf.keras.layers.Conv2D(16, 3, padding = 'same', activation = 'relu')(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool2D(2)(x)

    x = tf.keras.layers.Conv2D(32, 3, padding = 'same', activation = 'relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool2D(2)(x)

    x = tf.keras.layers.Conv2D(64, 6, padding = 'valid', activation = 'relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool2D(2)(x)

    x = tf.keras.layers.Conv2D(64, 6, padding = 'valid', activation = 'relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool2D(2)(x)

    return x
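
If you are curious how the input shrinks through this block (the trace below is my own addition), the two 'same' convolutions preserve the spatial size, each pooling layer halves it, and each 'valid' 6x6 convolution removes 5 pixels from each spatial dimension, so a 350x500x3 input comes out as an 18x27x64 feature map:

#trace the feature extractor's output shape
test_inputs = tf.keras.Input((350, 500, 3))
print(convolutional_block(test_inputs).shape)  #(None, 18, 27, 64)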

Next, we will define the regression block, which produces the output for the box location. We first flatten the output of the convolutional block, then apply a few dense layers, followed by a final dense layer with no activation (since the coordinates are unbounded regression targets).

def regression_block(x):

    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(1024, activation = 'relu')(x)
    x = tf.keras.layers.Dense(512, activation = 'relu')(x)
    x = tf.keras.layers.Dense(2, name = 'box')(x)

    return x

Finally, we will define the classification block. This also has a flatten layer followed by a few dense layers and a final dense layer with a single neuron and a sigmoid activation. We use a sigmoid activation because we are performing binary classification.

def classification_block(x):

    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(1024, activation = 'relu')(x)
    x = tf.keras.layers.Dense(512, activation = 'relu')(x)
    x = tf.keras.layers.Dense(1, activation = 'sigmoid', name = 'class')(x)

    return x

Finally, we will define the inputs and outputs and instantiate the model.

#create the model instance; the input is named 'image' so it matches the
#dictionary keys yielded by the data generator
inputs = tf.keras.Input((350, 500, 3), name = 'image')
#conv block
x = convolutional_block(inputs)
#outputs
box_output = regression_block(x)
class_output = classification_block(x)
#model instance
model = tf.keras.Model(inputs = inputs, outputs = [class_output, box_output])
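
To reproduce the plot shown below, Keras offers model.summary() for a textual overview and tf.keras.utils.plot_model for a graph rendering; the latter needs the pydot and graphviz packages installed.

#textual summary of the layers and parameter counts
model.summary()
#render the architecture graph with the output shape of each layer
tf.keras.utils.plot_model(model, show_shapes = True)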

Here is a summary plot of the model. (Note that the first few convolutional layers are not illustrated.)

Plot of the model

Compile and train our model

Now let’s compile and train our model. Before we do, we will create a custom callback to visualize the output during training: a function that makes predictions and plots them every two epochs. This will let us see how the model’s predictions improve with more training.

The function test_model below creates three example images and uses the current model to predict the class and the bounding box location for each. It plots the true bounding box in green and the predicted bounding box in red, and prints the predicted class in the x label of the image.

#custom function to visualize the predictions after epochs
def test_model():

    fig, ax = plt.subplots(1, 3, figsize = (15, 5))

    for i in range(3):

        #get a sample image and normalize it
        sample_im, pos, _, cat = generate_sample_image()
        sample_image_normalized = sample_im.reshape(1, 350, 500, 3)/255
        predicted_class, predicted_box = model.predict(sample_image_normalized)

        #threshold the sigmoid output: 1 is Waldo, 0 is Wilma
        if predicted_class > 0.5:
            predicted_class = 'Waldo'
        else:
            predicted_class = 'Wilma'

        #assign color: green if the classification is correct, red otherwise
        col = 'green' if (predicted_class == cat) else 'red'

        #draw the true and predicted bounding boxes
        im = plot_bounding_box(sample_im, pos, (predicted_box[0][0], predicted_box[0][1]))

        #plot image
        ax[i].imshow(im)
        ax[i].set_xticks([])
        ax[i].set_yticks([])
        ax[i].set_ylabel('True: ' + cat, color = 'green')
        ax[i].set_xlabel('Predicted: ' + predicted_class, color = col)

    plt.show()

#custom callback that visualizes predictions every two epochs
class VisCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs = None):

        if epoch % 2 == 0:
            test_model()

We will also create a function to reduce the learning rate over time. Lowering the learning rate as training progresses helps the model settle into a good minimum instead of overshooting it, while the larger initial rate keeps early training fast.

#learning rate scheduler: divide the learning rate by 5 every 5 epochs,
#with a floor of 3e-7
def lr_schedule(epoch, lr):

    if (epoch + 1) % 5 == 0:
        lr *= 0.2

    return max(lr, 3e-7)
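
To see what this schedule actually does (a quick trace of my own, assuming Adam's default learning rate of 1e-3), the rate stays constant within each block of five epochs and is divided by 5 at epochs 5 and 10:

#simulate the scheduler over 10 epochs (Keras passes 0-indexed epochs)
lr = 1e-3
for epoch in range(10):
    lr = lr_schedule(epoch, lr)
    print('epoch', epoch + 1, 'lr', lr)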

Now let’s finally compile the model and train it. We will use the Adam optimizer, binary cross-entropy as the loss for classification (because we are doing binary classification), and mean squared error as the loss for regression.

#compile with a loss per output: binary cross-entropy for the class,
#mean squared error for the box coordinates
model.compile(optimizer = tf.keras.optimizers.Adam(),
              loss = {'class': 'binary_crossentropy', 'box': 'mse'},
              metrics = {'class': 'accuracy', 'box': 'mse'})

#fit the model
model.fit(generate_data(), epochs = 10, steps_per_epoch = 100,
          callbacks = [VisCallback(), tf.keras.callbacks.LearningRateScheduler(lr_schedule)])

Here are some of the class and bounding box predictions made during training.

At the end of epoch 1: the x label shows the model’s prediction and the y label the actual class. The red box is the predicted bounding box and the green box is the true bounding box.
At the end of epoch 5: the x label shows the model’s prediction and the y label the actual class. The red box is the predicted bounding box and the green box is the true bounding box.
At the end of epoch 9: the x label shows the model’s prediction and the y label the actual class. The red box is the predicted bounding box and the green box is the true bounding box.

Here we can clearly see the model learning as the epochs increase. After the first epoch, only one of the three classifications is correct, and the predicted bounding boxes are very inaccurate. By epoch 5 there is a slight improvement, and after epoch 9 the model correctly classifies all three test cases and locates the bounding boxes with ease. Below are some more test examples after the model was fully trained (10 epochs).

Predictions by the fully trained model

Conclusion

Here, I recreated the classic game “Where’s Waldo?” to show how we can use deep learning for object localization and classification. We synthesized data, created bounding boxes, built a data generator, and finally implemented a convolutional neural network to classify and localize our fictional characters, Waldo and Wilma. We could also test the generalizability of the model on real “Where’s Waldo?” images. However, since we used the same background for every training image, the model is likely biased toward that background; to make it more generalizable, we could diversify the training dataset with other backgrounds, as sketched below.
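
As a sketch of that extension (the backgrounds folder and the random_background helper below are hypothetical, not part of the original notebook), generate_sample_image could draw a different background on every call:

import glob, random

#hypothetical folder holding several candidate background images
background_paths = glob.glob('backgrounds/*.jpg')

def random_background():
    #pick one background at random and resize it to the expected 500x350
    return Image.open(random.choice(background_paths)).resize((500, 350))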

Hope you enjoyed the article. Thank you for reading!
