Can a Computer Answer “Where’s Waldo?” | Using Machine Learning to Find Waldo

Reece Riherd
13 min read · Apr 25, 2023


I remember pouring hours upon hours into answering the age-old question Where’s Waldo? For those unfamiliar with Where’s Waldo?, the premise is to find a character named Waldo, or one of his counterparts like the British Wally, within a chaotic scene. Even though it sounds pretty straightforward, finding Waldo can get tough in scenes like the one below:

Sample Scene from a “Where’s Waldo?” Book [1]

Though I was sometimes successful (and most of the time not) in finding him, the growing prominence of machine learning has made me wonder whether a computer could answer the question faster than I could. With image classification gaining traction, it seems natural that a machine could find Waldo much faster than we can, so here we explore some different approaches to finding him.

Computer Vision Approach

Computer vision is a field of data science centered around image processing: performing operations, often simple linear filters, on an image to accomplish some task. Although it falls outside the machine learning portion of data science, it was interesting to see this method find Waldo in such a short time frame when many machine learning algorithms failed to converge altogether. As such, we won’t go too deep into the computer vision algorithms, but we will still provide a strong overview. By separating an image into its color channels and performing an operation on them, we can isolate the area of the image that (hopefully) contains Waldo. For this section, we will use this Where’s Waldo? puzzle.

Sample Image for Finding Waldo with Computer Vision [5]

Method

The most distinguishing features of Waldo are his glasses, hat, jeans, and, most notably, his red and white stripes. Using the Mahotas computer vision library for Python [5], we split the image into red, green, and blue channels, subtract the overall brightness (the mean of the three channels) from the red channel, and convolve the result with a striped pattern. This picks out the most prominent region of red against white in the image, which we then dilate into a mask and use to dim everything else in the original image. The code looks like this:

import numpy as np
import mahotas
import mahotas.demos

# Load the sample "Where's Wally?" image that ships with Mahotas.
wally = mahotas.demos.load('Wally')

wfloat = wally.astype(float)
r, g, b = wfloat.transpose((2, 0, 1))  # split into color channels

# "Whiteness" of each pixel: the mean over the three channels.
w = wfloat.mean(2)

# Striped kernel: alternating bands of +1 and -1, matching Waldo's shirt.
pattern = np.ones((24, 16), float)
for i in range(2):
    pattern[i::4] = -1

# Convolve "red minus white" with the stripe pattern.
v = mahotas.convolve(r - w, pattern)

# Keep only the strongest response, grow it into a region, and dim
# everything outside that region in the original image.
mask = (v == v.max())
mask = mahotas.dilate(mask, np.ones((48, 24)))
wally = (wally - 0.8 * wally * ~mask[:, :, None]).astype(np.uint8)

Results

And, this code finds Waldo quickly and effectively!

Plotted Image from Mahotas Code

However, while this is great for a quick solution, finding Waldo can get pretty tricky with some scenes like this one:

“Land of Woofs” from “Where’s Wally? The Wonder Book” [9]

And, this computer vision approach identifies an object with red and white stripes, but not the correct Waldo!

Mahotas Code Run on Land of Woofs from “Where’s Wally? The Wonder Book”

Looking a little closer, we can see the real Waldo here:

Waldo Circled in “Land of Woofs” from “Where’s Wally? The Wonder Book”

Since this method doesn’t work universally, we can look at other approaches, specifically those involving machine learning, that may improve upon this method.

Convolutional Neural Network Approach

Since the computer vision method has some drawbacks, we decided to experiment with the capabilities of neural networks. Convolutional neural networks were chosen specifically because they are much more efficient for images, processing an image channel by channel and subsection by subsection. They do this through layers not found in traditional neural networks, such as convolutional and max pooling layers, and by applying the network to only one patch of the image at a time.

Data

The data that we used was taken from a Kaggle dataset of 10,000+ images [11]. However, this dataset was far larger than was viable for our team to work with on our own machines. As such, we altered our use of the dataset: we kept the same base images but created the training examples ourselves. Since each image was the same background with Waldo in a different location, we took the background without Waldo or Wilma and pasted Waldo or Wilma into it. Likewise, since Waldo and Wilma are rarely fully visible, we cropped each pasted figure to simulate hiding behind another object, taking care not to hide the face. These crops were different for each image to simulate Waldo hiding behind different objects and were created by taking a rectangular subsection of the Waldo image that includes a recognizable portion of his face and body. From here, we were able to instantiate samples for our training data dynamically, drastically decreasing the space the data takes up, as sketched below.
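As a rough illustration of this dynamic generation (the function name and file paths are hypothetical, and PIL is just one convenient way to do the compositing), each training sample can be produced on the fly like this:

import random
from PIL import Image

def make_sample(background_path, figure_path, min_keep=0.6):
    # Hypothetical sketch: paste a randomly cropped Waldo/Wilma cut-out onto
    # a Waldo-free background and return the image plus the paste location.
    background = Image.open(background_path).convert('RGB')
    figure = Image.open(figure_path).convert('RGBA')

    # Keep the top of the cut-out (the face) and crop away a random amount
    # of the lower body to simulate hiding behind another object.
    w, h = figure.size
    keep_h = random.randint(int(min_keep * h), h)
    figure = figure.crop((0, 0, w, keep_h))

    # Paste at a random location that keeps the cut-out inside the scene;
    # the alpha channel doubles as the paste mask.
    x = random.randint(0, background.width - figure.width)
    y = random.randint(0, background.height - figure.height)
    background.paste(figure, (x, y), figure)

    # The upper-left corner of the paste is the regression target later on.
    return background, (x, y)

Because samples are composed on demand, only the handful of source images needs to be stored, rather than thousands of pre-rendered scenes.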

Reasoning

Deep learning models, such as convolutional neural networks (CNNs), can learn to automatically extract the image features that are relevant to the task at hand, such as distinguishing Waldo from other characters and objects in a scene. By training a CNN on a large dataset of labeled images, the model can learn to recognize Waldo’s unique visual features, such as his red and white striped shirt, hat, glasses, and jeans. Once the model has learned these features, it can use them to detect Waldo in new, unseen images. One common way to do this is to apply a sliding window or a region proposal algorithm to the image, which extracts smaller regions and passes them through the CNN to classify them as Waldo or Wilma. The model can also output a heat map or a bounding box around the detected location of Waldo, which can be used to visualize its predictions.
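For intuition, here is a minimal sketch of the sliding-window idea described above (our final model instead regresses the box directly from the whole image; patch_classifier stands in for any patch-level Waldo scorer and is not defined here):

import numpy as np

def sliding_window_detect(image, patch_classifier, window=(64, 64), stride=32):
    # Score every window position with a patch-level classifier and return
    # the highest-scoring location as the detected bounding box.
    h, w = image.shape[:2]
    best_score, best_box = -np.inf, None
    for top in range(0, h - window[0] + 1, stride):
        for left in range(0, w - window[1] + 1, stride):
            patch = image[top:top + window[0], left:left + window[1]]
            score = patch_classifier(patch)
            if score > best_score:
                best_score = score
                best_box = (left, top, left + window[1], top + window[0])
    return best_box, best_score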

At a more technical level, CNNs use multiple layer types, such as convolutional layers, max pooling layers, and fully connected layers: the convolutional layers apply filters to the image to extract features, and the pooling layers reduce the spatial size of the representation. Effectively, the network takes a small patch of the input image, passes it through a shared set of weights, and slides over the entire image to process it piece by piece.

Neural networks can be difficult to construct, so we looked to TensorFlow guides to inform the design of the model architecture [4].

Model Architecture

To classify the images, we employ a basic Dense layer with a sigmoid activation; the classification loss is binary cross-entropy, and back-propagation refines the model parameters. For localization, a regression head is employed. Our objective is to estimate the position of the bounding box’s upper-left corner (the point at which we randomly placed the image of Waldo or Wilma). This is accomplished with a regular Dense layer that has two units and no activation, and the loss is mean squared error. The exact architecture of our model is below:

import tensorflow as tf

def convolutional_block(inputs):
    # Shared feature extractor: four conv / batch-norm / max-pool stages.
    x = tf.keras.layers.Conv2D(16, 4, padding='same', activation='relu')(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool2D(2)(x)

    x = tf.keras.layers.Conv2D(32, 4, padding='same', activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool2D(2)(x)

    x = tf.keras.layers.Conv2D(64, 6, padding='valid', activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool2D(2)(x)

    x = tf.keras.layers.Conv2D(64, 6, padding='valid', activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool2D(2)(x)

    return x

def regression_block(x):
    # Localization head: predicts the (x, y) upper-left corner of the box.
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(1024, activation='relu')(x)
    x = tf.keras.layers.Dense(512, activation='relu')(x)
    x = tf.keras.layers.Dense(2, name='box')(x)

    return x

def classification_block(x):
    # Classification head: a single sigmoid unit for Waldo vs. Wilma.
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(1024, activation='relu')(x)
    x = tf.keras.layers.Dense(512, activation='relu')(x)
    x = tf.keras.layers.Dense(1, activation='sigmoid', name='class')(x)

    return x

inputs = tf.keras.layers.Input(shape=(350, 500, 3))
x = convolutional_block(inputs)

box_output = regression_block(x)
class_output = classification_block(x)

Inputs and Outputs

For input, the model takes in an array of size 350x500x3, i.e. a color image (three color channels) of 350 pixels by 500 pixels. This size was somewhat arbitrarily chosen, but remembering that images are stored as three separate color arrays was an important distinction to make; a short usage sketch follows the list below.

The model then outputs two things:

  1. The class of whatever it detected (Waldo or Wilma)
  2. The location of the box that bounds what it detected
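To make those shapes concrete, here is a minimal sketch of assembling the blocks above into a single Keras model and reading back its two predictions. The image path is a placeholder, tf.keras.utils.load_img is just one way to obtain a 350x500x3 array, and any rescaling should match however the training images were prepared.

# Assemble the two heads into one model; the output order chosen here is
# [box, class], which is how the predictions are unpacked below.
model = tf.keras.Model(inputs=inputs, outputs=[box_output, class_output])

# A single 350x500 RGB image with a batch dimension added.
img = tf.keras.utils.load_img('scene.png', target_size=(350, 500))
batch = tf.keras.utils.img_to_array(img)[None]

box_pred, class_pred = model.predict(batch)
print(box_pred[0])    # predicted (x, y) of the bounding box's upper-left corner
print(class_pred[0])  # sigmoid score distinguishing Waldo from Wilma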

Training, Results, and Evaluation

For this model, we decided on ten epochs with 100 steps per epoch. We were very careful about tuning these parameters: at the beginning, we were under the assumption that more epochs simply meant better performance, but we soon realized this would lead to over-fitting. Trying to strike a balance between over- and under-fitting, we finally settled on ten epochs with a hundred steps each. For the learning rate, we used a schedule that updates the learning rate every five epochs by a factor of 0.2. A rough sketch of this training setup follows.
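Under the same assumptions as the sketch above, the training setup might look roughly like this. Here training_generator is a stand-in for our dynamic sample generator (not shown), and the initial learning rate of 1e-3 is the Keras default rather than a value we report explicitly.

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss={'class': 'binary_crossentropy', 'box': 'mse'},  # heads are named in the Dense layers above
    metrics={'class': 'accuracy', 'box': 'mse'},
)

# Decay the learning rate by a factor of 0.2 every five epochs.
def schedule(epoch, lr):
    return lr * 0.2 if epoch > 0 and epoch % 5 == 0 else lr

model.fit(
    training_generator,  # hypothetical generator yielding (images, {'class': ..., 'box': ...})
    epochs=10,
    steps_per_epoch=100,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(schedule)],
)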

After the first epoch, we get the following scores:

loss: 2643.4141 
class_loss: 3.5829
box_loss: 2639.8313
class_accuracy: 0.5531
box_mse: 2639.83

With these resulting images:

CNN Predictions After One Epoch

As you can see, after one epoch the classification is barely better than chance, and the model is unable to even locate the figure to begin with.

But, after 10 epochs, we get these scores:

loss: 21.2049 
class_loss: 0.1783
box_loss: 21.0266
class_accuracy: 0.9369
box_mse: 21.0266

And these resulting images:

CNN Predictions After Ten Epochs

As you can see, this is a significant improvement over the first epoch. Classification has reached roughly 93% accuracy, and the error between the predicted and actual bounding boxes has dropped by roughly a factor of 100.

We then tried evaluating on randomly generated images, and got very pleasing results:

CNN Predictions on Randomly Generated Evaluation Images

Modified Convolutional Neural Network Approach

While the convolutional neural network produced successful results for the cases that it was given, it is easy to see why this method isn’t particularly practical — it is easier to find Waldo by eye than it is to run the model.

So, our next endeavor was to make this neural network more practical. Waldo is rarely seen in the same position, let alone against the same background, and he is typically the same size as the rest of the characters. This meant widening the scope of the model to work across different examples of Waldo. We did this by first shrinking Waldo down, then providing more instances of Waldo (since he is rarely in the same position), and then providing different backgrounds for him to hide in. By continuing to crop Waldo to account for instances where he may be partially obscured, we figured this method would provide a solid representation of a standard Where’s Waldo? puzzle. Below are the instances of Waldo used:

Images of Waldo Used to Train [12,2]

We also tested without Wilma, since our main goal of finding Waldo is independent of Wilma, and including her would introduce additional complexity. And here are the backgrounds tested on, each with Waldo removed to avoid biasing our dataset:

Where’s Waldo? Backgrounds Without Waldo [7,10]

From here, our methods are mostly similar: we create a convolutional network, generate images with a cropped Waldo, and train the network on its ability to identify the correct Waldo and on the proximity of its predicted bounding box to that of the placed Waldo. The biggest difference between this implementation and the previous one is the expansion of the classes from binary (Waldo or not Waldo) to multi-class, with three different representations of Waldo. As such, the output of our model is a softmax over three classes, as opposed to a single sigmoid value; a minimal sketch of this change is below.
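A minimal sketch of that change, assuming everything else in the earlier architecture stays the same and that the three Waldo variants are labeled with integers 0-2:

def classification_block(x):
    # Multi-class head: a softmax over the three Waldo variants replaces the
    # single sigmoid unit used in the binary model.
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(1024, activation='relu')(x)
    x = tf.keras.layers.Dense(512, activation='relu')(x)
    x = tf.keras.layers.Dense(3, activation='softmax', name='class')(x)
    return x

# The classification loss changes to match, e.g.:
# loss={'class': 'sparse_categorical_crossentropy', 'box': 'mse'}

After one epoch of this new method, we get the following: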

loss: 51496.9102
class_loss: 40678.6992
box_loss: 10818.2148
class_accuracy: 0.3025
box_mse: 10818.2148
lr: 0.0010

With the attempts at finding Waldo as follows:

Widened Scope Waldo Results After 1 Epoch

Although much less accurate than the previous method, this makes sense — Waldo is much more difficult to find, and there were many additional factors introduced. Then, after 10 epochs, we get the following:

loss: 745096.3125
class_loss: 736048.6250
box_loss: 9047.4590
class_accuracy: 0.3487
box_mse: 9047.4590
lr: 4.0000e-05

With attempts at finding Waldo as follows:

Widened Scope Waldo Results After 10 Epochs

Overall, this was much less effective, but the results are explained by the increased scope of the problem. Although the classification of Waldo is barely better than random guessing among three classes, it is interesting to consider the reason: was the network unable to identify Waldo at all, or just unable to distinguish between the three variants? Since the input images were all similar, the latter is possible as well. To account for the increased complexity of the problem, we could alter the training of the CNN by increasing the number of epochs or the batch size, or by tuning other hyperparameters like the learning rate. However, the largest drawback of using a CNN is that, despite being faster than a traditional fully connected network, it was still incredibly slow, taking over an hour to train a single model. Thus, hyperparameter tuning would use up resources unavailable to us, namely the time (or stronger computation) required to train such models.

Transfer Learning

Another approach that is viable for finding Waldo is transfer learning. Transfer learning is a process by which a model trained for one task is reused for a separate task; in short, we don’t need to reinvent the wheel when a similar task has already been accomplished.

For the case of finding Waldo, TensorFlow, an open-source library for machine learning and artificial intelligence, provides an object-detection API that places bounding boxes around identified objects, as seen below. Object detection models typically take one of two forms: a Faster R-CNN, which separates the tasks of locating objects and classifying them, or a YOLO or SSD network, which predicts class scores and bounding boxes in a single pass.

Example of Object Detection from TensorFlow Tutorial [6]

Given the nature of the problem, we can extend this object detection model to detect Waldo by training it on further images of Waldo. Setting up an environment to train a TensorFlow object-detection model on images of Waldo [13], we can then use the same technology to place a bounding box around Waldo in new images!
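Once such a detector has been fine-tuned and exported as a SavedModel, inference follows the usual Object Detection API pattern; here is a rough sketch, where the export path and image file are placeholders:

import numpy as np
import tensorflow as tf

# Load the exported detector (placeholder path for the fine-tuned model).
detect_fn = tf.saved_model.load('exported_waldo_model/saved_model')

# The API expects a uint8 batch of shape [1, height, width, 3].
image = tf.io.decode_image(tf.io.read_file('puzzle.jpg'), channels=3)
detections = detect_fn(tf.expand_dims(image, 0))

# Boxes come back as normalized [ymin, xmin, ymax, xmax]; keep the
# highest-scoring detection as our Waldo candidate.
scores = detections['detection_scores'][0].numpy()
boxes = detections['detection_boxes'][0].numpy()
best = int(np.argmax(scores))
print('Waldo found with confidence %.2f at %s' % (scores[best], boxes[best]))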

TensorFlow Object-Detection API Trained to Find Waldo with 97% Certainty

While this method worked well for cases in which Waldo was found, finding him quickly and accurately, it did not fare as well in other cases. Similar to the computer vision model, this model also had trouble isolating Waldo in images that were mostly red or white. It also had trouble identifying Waldo on his own: running the model on an image of just Waldo himself reported that no Waldo was found.

Conclusion

After comparing each model, it is easy to see that no model was perfect, each having its own pros and cons. Computer vision, while quick and effective the majority of the time, fell short for cases that were either more complex or far less complex. Transfer learning shared this behavior, working well the majority of the time. Training our own convolutional neural network, however, resulted in success where the other models failed: our CNN had great success finding Waldo on his own, and this behavior was mostly independent of the background he was placed on, just needing time to train.

As such, it is easy to see that there hasn’t been a perfect solution to this problem yet, though it would be interesting to see how changes to our models would affect the results; we could put much more time and complexity into training a CNN, or we could try training our transfer learning model with a different network. Regardless, there is still much room for improvement in this problem.

Further Applications

While finding Waldo himself may be a trivial problem, it acts as a gateway to many more complex imaging processes. For example, it points toward the broader task of identifying objects in 2-D space. When objects or people are represented in 2-D, they lose many traits and attributes that help define them; after all, many traits like a nose or glasses are just a line in 2-D! Additionally, Where’s Waldo? poses a very unusual dataset: a collection of Where’s Waldo? puzzles has a huge class imbalance, giving very few instances of Waldo. Add to that the plethora of objects placed with the intention of looking similar to Waldo, and he becomes very difficult to differentiate.

Despite this, the very attributes that make finding Waldo difficult also make the application of this problem that much more interesting and practical. There are plenty of realistic scenarios in which finding an object in the minority becomes important, especially in image processing. Whether identifying different species of animals, a constellation in the night sky, or an animal camouflaged against a background, identifying an object in the minority class becomes extremely useful. And, well, it becomes easy to see the similarities between Waldo and a camouflaged animal.

“The Land of Wallies” in “Where’s Wally? The Fantastic Journey” [8]

References

[1] “Awesome where’s Waldo Wallpapers — WallpaperAccess.” [Online]. Available: https://wallpaperaccess.com/wheres-waldo. [Accessed: 25-Apr-2023].

[2] BigCommerce Help Center. [Online]. Available: https://support.bigcommerce.com/s/article/Using-the-Image-Manager. [Accessed: 25-Apr-2023].

[3] “Finding Wally,” Mahotas. [Online]. Available: https://mahotas.readthedocs.io/en/latest/wally-1.png. [Accessed: 25-Apr-2023].

[4] “Deep convolutional generative Adversarial Network : Tensorflow Core,” TensorFlow. [Online]. Available: https://www.tensorflow.org/tutorials/generative/dcgan. [Accessed: 25-Apr-2023].

[5] “Finding wally,” Finding Wally — mahotas 1.4.3+git documentation. [Online]. Available: https://mahotas.readthedocs.io/en/latest/wally.html. [Accessed: 25-Apr-2023].

[6] “Installation,” Installation — TensorFlow 2 Object Detection API tutorial documentation. [Online]. Available: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html. [Accessed: 25-Apr-2023].

[7] M. Ginn, “My Christmas gift to the internet is these where’s Waldo Pictures with Waldo photoshopped out Pic.twitter.com/pio88omfn1,” Twitter, 21-Dec-2015. [Online]. Available: https://twitter.com/shutupmikeginn/status/679045081937018880?lang=en. [Accessed: 25-Apr-2023].

[8] M. Handford, “The Land of Wallies,” in Where’s wally?: The fantastic journey, London: Walker Books and Subsidiaries, 2017.

[9] M. Handford, “The Land of Woofs,” in Where’s wally?: The wonder book, Gardners Books, 2018.

[10] “Preview.redd.it image urls,” Reddit. [Online]. Available: https://www.reddit.com/r/redditdev/comments/kzoq8i/previewreddit_image_urls/. [Accessed: 25-Apr-2023].

[11] S. N. Gupta, “Waldo and Wilma Dataset,” Kaggle, 17-Mar-2023. [Online]. Available: https://www.kaggle.com/datasets/sheshngupta/waldowilma. [Accessed: 25-Apr-2023].

[12] “Stickpng — free transparent pngs, stickers, clipart & more!” [Online]. Available: https://www.stickpng.com/??ref=blogduwebdesign.com. [Accessed: 25-Apr-2023].

[13] Tadejmagajna, “Tadejmagajna/Hereiswally: Deep Learning Project that solves where’s Wally puzzles by finding Wally in an image,” GitHub. [Online]. Available: https://github.com/tadejmagajna/HereIsWally. [Accessed: 25-Apr-2023].
