Image Recognition the Hard Way

Marker-Based Image Recognition in JavaScript

Meaza Dawit Abate
Instrument Stories
15 min read · Oct 18, 2019


As part of the Google Cloud x NYT project, we created an image recognition application. The goal was to recognize 7 images consistently on a variety of devices and operating systems — all in-browser and requiring as little bandwidth as possible. The clients’ primary concerns for the experience were accessibility and that the technology reach as many users as possible, and this informed most of the goals we set for our image recognition software. Choosing to build the software in-browser, as opposed to creating a native application, allowed us to achieve all of our accessibility goals. Bandwidth, on the other hand, was limited for one reason: a majority of the posters were going to be placed underground in New York subways. This was another argument for an in-browser experience: we simply could not afford the bandwidth it would take to hit a server every time we needed to recognize an image.

During this process we learned a lot about AR, existing AR libraries, testing, lighting, and other aspects of computer vision that we thought were worth sharing.

AR Libraries

We started by taking a look at JSARToolKit, an augmented reality library written in JavaScript, and experimented with using its markers. The markers worked well, but were riddled with limitations: they had a limited number of aesthetic properties and did not scale well with the image.

After further exploration, we found libraries called Tracking.js and Pixelmatch. Tracking.js is a JavaScript library that can be used for color and face detection. Pixelmatch is an image comparison library that calculates the percentage of pixels that differ between two images. The idea then became to use these two libraries in tandem to run our image recognition.

In the end, we didn’t use Tracking.js, but it was still instrumental in teaching us how to approach the issue. We decided to use markers to pick up on our images, but now had a better idea of how to do that without relying on libraries like JSARToolKit, which provided pre-trained markers and dictated very limiting design directions.

Image Detection Process

We created a seemingly straightforward two-step process to detect our images. Unfortunately, we quickly learned there was nothing straightforward about it.

The images we analyze are frames from the user’s camera video feed. Analyzing the video feed to see if one of our seven images is in frame means we have to draw every few frames of the video onto a canvas. This video canvas maintains the aspect ratio of the video but is much smaller in size. Shrinking down the canvas allows for easier and more reliable analysis of the frames.
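As a rough sketch, the capture step might look something like this (the element id, scale factor and interval below are illustrative, not the values from our production code):

```javascript
// Grab the running camera feed (assumes a <video id="camera"> element
// already playing a getUserMedia stream).
const video = document.getElementById('camera');

// A small canvas that keeps the video's aspect ratio; 0.25 is an example
// scale factor.
const SCALE = 0.25;
const frameCanvas = document.createElement('canvas');
const frameCtx = frameCanvas.getContext('2d');

function captureFrame() {
  frameCanvas.width = video.videoWidth * SCALE;
  frameCanvas.height = video.videoHeight * SCALE;

  // Draw the current video frame, shrunk down, onto the analysis canvas.
  frameCtx.drawImage(video, 0, 0, frameCanvas.width, frameCanvas.height);

  // The raw RGBA pixel data we will analyze.
  return frameCtx.getImageData(0, 0, frameCanvas.width, frameCanvas.height);
}

// Analyze every few frames rather than every frame of the video.
setInterval(captureFrame, 100);
```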

Working with The Canvas

A very large part of this project was understanding how the video is rendered and how best to alter that to fit our needs: essentially, understanding how to work with the canvas.

It is easy to look at the image above and say it has a yellow beak, green feathers and is displayed on a light brown background. Moreover, when given another image, we are able to use these obvious details to compare the two. The software will work differently; it will log the color of every pixel. The background we described as brown will be made up of hundreds of shades, and the same will be true for the other colors.

So with our image recognition software, two images are a perfect match only if every pixel has the exact same color value as its counterpart. If we are even one value off, those pixels will be considered different and the images will be marked as a mismatch. To avoid this, we can do some image preprocessing, like shrinking the image, converting it to grayscale, blurring it, applying color correction and so on.
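As an example of one such preprocessing step, here is a generic grayscale conversion over the raw pixel data; it is a sketch of the technique, not the exact code we shipped:

```javascript
// Convert an ImageData object to grayscale in place, using a standard
// luminance weighting.
function toGrayscale(imageData) {
  const data = imageData.data; // flat array: [r, g, b, a, r, g, b, a, ...]
  for (let i = 0; i < data.length; i += 4) {
    const gray = 0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2];
    data[i] = data[i + 1] = data[i + 2] = gray;
    // data[i + 3] is alpha and is left untouched.
  }
  return imageData;
}
```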

It may seem counterintuitive to apply preprocessing steps like shrinking down your canvas. However, such steps can massively improve the performance of your image recognizer.

In the case of the shrunk canvas, having fewer pixels to look through improves performance, which directly improves the experience. Don’t underestimate the effect a tired processing unit has; a phone that has been trying to detect photos for the last 10 minutes will do much worse than one that just started.

Moreover, and perhaps less obvious, shrinking the canvas removes small details that are insignificant when looking at the entire image but are easy to fixate on when going one pixel at a time. It is very rare that the image we have stored and the image we are pulling from a user’s video stream look exactly the same. Things like lighting, angle, camera resolution, distance and a myriad of other factors can affect the look of the incoming image. As humans, we can easily discern that two images are the same but one is in brighter lighting or the other is tilted 5 degrees to the left. A computer running image recognition software such as this cannot.

Thus, shrinking our canvas allows us to remove the small details that actually make image recognition harder for a machine. In the case of the image above, the computer would now see a blob of brown followed by a blob of yellow followed by a blob of green and so on, and would be able to analyze the image using the larger and more obvious details, much as we would.

This does not mean you should shrink your canvas down to a tiny little speck or make your image so blurry you can’t pick out any details. How and how much to preprocess depends on your images and the degree of detail necessary to discern between them without loss of accuracy.

Now that we understand how the canvas works, we can look into the detection process. The first step is detecting the unique markers.

The Marker

After several rounds of marker designs, we ended up using a multi-color bar at the bottom of the image composed of the Google colors — yellow, green, red and blue — with the length and order of the colors also being identifying aspects. A single permutation of these aspects made up the unique marker associated with each image.

A grayscale image of a Christmas tree in NYC with one of the 7 markers on the bottom and beside it a layout of all 7 markers.
The image markers
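Conceptually, each marker can be thought of as an ordered list of colors and relative lengths. The structure below is purely hypothetical, but it captures the idea:

```javascript
// Hypothetical description of the markers: the order of the colors and the
// fraction of the bar each color occupies together identify the image.
const markers = {
  'christmas-tree': [
    { color: 'yellow', length: 0.4 },
    { color: 'green',  length: 0.2 },
    { color: 'red',    length: 0.1 },
    { color: 'blue',   length: 0.3 },
  ],
  // ...one entry per image, seven in total.
};
```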

The first step is to find the marker. Once we find one of the seven markers, we take a snapshot and draw that frame onto a separate smaller canvas of the same aspect ratio as the video canvas. We then use the Pixelmatch library to compare the images on these two canvases. But when is it considered a match? If we match a marker and an image for 3 consecutive frames, it is declared a match. This is not a perfect system and what we ended up with is a result of a lot of trial and error.
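Roughly speaking, the comparison step looks like the sketch below. The thresholds and the way the consecutive-frame counter is wired up are illustrative; only Pixelmatch’s (img1, img2, output, width, height, options) call is taken from the library’s actual API:

```javascript
import pixelmatch from 'pixelmatch';

const REQUIRED_CONSECUTIVE_MATCHES = 3; // matched for 3 frames in a row
const MATCH_THRESHOLD = 0.15;           // illustrative: max fraction of differing pixels
let consecutiveMatches = 0;

// frameData and referenceData are ImageData objects of the same dimensions.
function isMatch(frameData, referenceData) {
  const { width, height } = frameData;
  const diffPixels = pixelmatch(
    frameData.data,
    referenceData.data,
    null,            // we don't need a diff image
    width,
    height,
    { threshold: 0.1 }
  );

  const mismatchRatio = diffPixels / (width * height);
  consecutiveMatches = mismatchRatio < MATCH_THRESHOLD ? consecutiveMatches + 1 : 0;

  // Only declare a match after three consecutive matching frames.
  return consecutiveMatches >= REQUIRED_CONSECUTIVE_MATCHES;
}
```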

An obvious question here is why we needed a second step. If the marker is unique to the image, wouldn’t identifying the marker be enough? In theory, yes, but that is not the case in reality. The marker has a one-to-one relationship with the images, but the marker itself is not globally unique: it is a combination of 4 common colors that can easily be found out in the wild, so the extra step was necessary to make sure it was attached to one of the 7 images.

Another question we can ask is why not skip the marker and just use Pixelmatch instead. Pixelmatch works by analyzing the pixels in one image and comparing them to the pixels in another. Running every nth frame against 7 images requires processing power that we cannot afford to use. So the marker narrows our options from 7 images down to 1 and enables us to use Pixelmatch without a serious loss in performance.

The process wasn’t as simple as finding the marker and matching it with the image. We needed to set up a reliable system to test our confidence in the results. The reality was that no matter how well we set up our marker recognition or how well we use Pixelmatch, it would not be perfect. Therefore, we needed to put measures in place to make sure we reduced the error rate as much as possible.

We knew this was something we were going to have to focus on early in the project because we were working in a browser. A lot of us have seen very cool, pretty reliable applications of AR, but most of these are leveraging native device capabilities. When an application is able to use the capabilities of the device, it has access to better data, such as depth, color and surface detection. There isn’t only more data, but also more specific and accurate data. Working in a browser, we were no longer able to leverage a device’s capabilities. Instead, we needed to make the experience independent of the device and make use of capabilities that all the devices and operating systems we were considering already had.

To make sure we were making the most of what we had to work with, we made optimizing for performance a priority from the get-go. The marker is placed at the bottom of the images and so we were able to make the assumption that in most cases the marker will appear in the bottom ⅔ (probably ½ but we wanted padding) of the user’s viewport. So when looking for the marker in the pixel matrix, we start at the bottom-left corner and work our way up. This way, our application became more performant because it did not have to look through all the pixels.
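A simplified sketch of that scan, where findMarkerInRow stands in for the actual color-sequence check and the two-thirds cutoff mirrors the padding mentioned above:

```javascript
// Scan the frame from the bottom row upward, since the marker is expected
// in the lower portion of the viewport.
function scanForMarker(imageData) {
  const { width, height, data } = imageData;
  const searchTop = Math.floor(height / 3); // only look at the bottom 2/3

  for (let y = height - 1; y >= searchTop; y--) {
    const rowStart = y * width * 4; // 4 values (RGBA) per pixel
    const row = data.subarray(rowStart, rowStart + width * 4);
    const marker = findMarkerInRow(row); // stand-in for the color check
    if (marker) return { marker, y };
  }
  return null;
}
```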

Colors

One of the things we struggled with when detecting the marker was dealing with different lighting conditions and different devices, because color values vary greatly depending on both. To account for this, we knew we had to really understand the different ways we could represent color in our application and decide which one worked best for us.

There were three representations we could use: RGB, HSL and HSV. RGB is a way of representing colors using varying combinations of the primary colors. HSL and HSV (HSB) are more human-friendly ways of describing colors. The H stands for hue, S for saturation, L for lightness and V for value (B for brightness). The main difference between the two is the last letter: lightness refers to the amount of white in a color, while value considers the amount of light on a color.

A venn diagram-like representation of RGB colors, followed by cylindrical representations of HSL and HSV.
Left to right: RGB, HSL, HSV.

We can see in the images above the differences between RGB color representation and HSL/HSV. The latter are placed on spectrums that make it easy to understand how different colors are represented relative to one another and how we can control the range of a given color; RGB values don’t live on a human-friendly spectrum like that. This makes it very difficult to set ranges and to understand how the representation relates colors to one another.

We had defaulted to using RGB values because that is the representation used in the pixel matrices we were analyzing.

A black and white image and the pixel matrix that represents it.
A grayscale image and its pixel matrix.

What we see on the left side and the right side are the same image, just represented in different ways. In this case, a single number represents each pixel because it is a grayscale image: a value between 0 and 255 dictates the brightness of each pixel.

In a color image, each pixel is represented using RGB(A) values (the A is alpha and represents opacity). The first pixel is the first 4 cells of the matrix, the second pixel is represented by the next 4 values, and so on. You can see this in the next image.

A 2 by 3 table with each cell divided into 4 cells to represent the values of each pixel using RGBA.
Pixel matrix for RGBA images.
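In code, reading the pixel at a given (x, y) out of an ImageData object is just a matter of computing the right offset into that flat array:

```javascript
// Read the RGBA values of the pixel at (x, y) from an ImageData object.
// Each pixel occupies four consecutive cells in the flat data array.
function getPixel(imageData, x, y) {
  const i = (y * imageData.width + x) * 4;
  return {
    r: imageData.data[i],
    g: imageData.data[i + 1],
    b: imageData.data[i + 2],
    a: imageData.data[i + 3],
  };
}
```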

Although we started with RGB color representation, we quickly realized it was not the way to go. We tried both HSL and HSV values and, after a lot of testing, ended up using the HSV representation, because it allowed us to account for lighting, which ended up being our biggest challenge. We gave a small range to the hue and saturation and allowed the value to vary to a much higher degree, to account for as many lighting conditions as possible.
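For a sense of what that looks like, here is a standard RGB-to-HSV conversion and an example range check; the specific ranges are illustrative, not the ones we shipped:

```javascript
// Convert RGB (0-255) to HSV: h in degrees [0, 360), s and v in [0, 1].
function rgbToHsv(r, g, b) {
  r /= 255; g /= 255; b /= 255;
  const max = Math.max(r, g, b);
  const min = Math.min(r, g, b);
  const d = max - min;
  let h = 0;
  if (d !== 0) {
    if (max === r) h = ((g - b) / d) % 6;
    else if (max === g) h = (b - r) / d + 2;
    else h = (r - g) / d + 4;
    h *= 60;
    if (h < 0) h += 360;
  }
  const s = max === 0 ? 0 : d / max;
  return { h, s, v: max };
}

// Example check for "red": a tight hue/saturation window, but a wide value
// window so that different lighting conditions still match. The numbers
// are illustrative.
function looksRed({ h, s, v }) {
  return (h < 15 || h > 345) && s > 0.5 && v > 0.2;
}
```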

Framing the Image

Detecting the colors to get the marker is the first step in the recognition process. As mentioned earlier, there is a second step that comes after recognizing the marker and for this second step, it is vital that we are able to correctly frame the image.

In order to reliably use Pixelmatch to compare the incoming video stream with an existing image, we need a well-framed crop of the video stream. The marker is placed at the bottom of the image and gives us its width. We have all the aspect ratios stored, so it is easy to calculate a height given a width. This is only straightforward when the user is looking at the image head-on. When looking at it from an angle, the apparent aspect ratio of the image changes, which makes this step a lot more complicated.
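For reference, the head-on case really is just arithmetic; the aspect ratio value and the marker coordinates below are placeholders for what the detection step would produce:

```javascript
// Aspect ratios (height / width) stored for each of the seven images;
// the value here is a placeholder.
const ASPECT_RATIOS = { 'christmas-tree': 1.4 /* , ... */ };

// Given the detected marker's position and width, work out the crop of the
// video canvas that should contain the full image (head-on case only).
function cropForImage(imageId, markerX, markerY, markerWidth) {
  const height = markerWidth * ASPECT_RATIOS[imageId];
  return {
    x: markerX,
    y: markerY - height, // the marker sits at the bottom of the image
    width: markerWidth,
    height,
  };
}
```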

We tried a lot of things to make this work.

Faking some Maths

There is an event listener called ‘deviceorientation’ that returns the positioning of the phone in a 3-D world. We get the alpha, beta and gamma angles, which allow us to understand how the individual is holding their device. We tried to use this information to get a better sense of the change in aspect ratio. However, we did not know how far the individual was standing from the image, which we needed in order to make the correct calculations. So we fudged some numbers, did some ‘scientific’ guess-work and tested out a formula we put in place. It was an improvement, but not good enough.
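The listener itself is simple; figuring out what to do with the angles was the hard part:

```javascript
// Listen for the device's orientation: alpha (rotation around the z-axis),
// beta (front-to-back tilt) and gamma (left-to-right tilt), in degrees.
let orientation = { alpha: 0, beta: 0, gamma: 0 };

window.addEventListener('deviceorientation', (event) => {
  orientation = {
    alpha: event.alpha,
    beta: event.beta,
    gamma: event.gamma,
  };
});
```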

Hough Transform

Hough Transform is a technique used in computer vision to isolate shapes in an image. It makes use of simple linear algebra to deduce the existence of lines and shapes in an image. This seemed like a great approach, because we knew how to detect edges in a given image, and often, there would be several edges detected around the image, framing it.

Hough Transform is able to take a set of points and figure out the most likely lines that exist in that set of points. We wrote up some code that allowed us to do just that. If it finds two lines that make a perpendicular angle and are located within a reasonable distance of the marker, it flags those as potential frames for the image.
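For the curious, a bare-bones version of the voting step looks something like this, assuming edge detection has already produced a list of {x, y} points; the bin size and vote threshold are illustrative:

```javascript
// Classic line Hough transform: each edge point votes for every (theta, rho)
// line it could lie on; well-supported bins correspond to likely lines.
function houghLines(points, voteThreshold = 50) {
  const thetaSteps = 180;        // 1-degree bins
  const accumulator = new Map(); // "thetaIndex,rho" -> votes

  for (const { x, y } of points) {
    for (let t = 0; t < thetaSteps; t++) {
      const theta = (t * Math.PI) / thetaSteps;
      const rho = Math.round(x * Math.cos(theta) + y * Math.sin(theta));
      const key = `${t},${rho}`;
      accumulator.set(key, (accumulator.get(key) || 0) + 1);
    }
  }

  const lines = [];
  for (const [key, votes] of accumulator) {
    if (votes >= voteThreshold) {
      const [t, rho] = key.split(',').map(Number);
      lines.push({ theta: (t * Math.PI) / thetaSteps, rho, votes });
    }
  }
  return lines;
}
```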

A white canvas with several blue dots and orange lines crossing some of the blue dots.
Hough Transform applied to a set of points.

Here is what the application of the Hough Transform we coded up looks like, independent of the image recognition. Given a set of points, it attempts to find the most likely lines. It seemed like a sure thing but alas, it was a disaster.

The issue here was performance. Analyzing all those edges took more runtime power than we had to offer and made the experience impossible to use. So, while in theory this was a great idea, in practice it was terrible. Our application simply froze a couple frames into the experience. We were analyzing the pixels to find edges while, at the same time, trying to find the marker and logging the colors we observed. Once we found our edges, we ran through all of them again, in a nested loop, trying to find the most likely lines and shapes. It kind of made sense why it wasn’t working, but it was still a bummer.

Hough Transform + ML Clustering Algorithms

Hough Transform alone did not work because there were too many edges and it affected the performance badly. In an attempt to counter this, we looked at clustering algorithms. The idea was to analyze a cluster at a time to minimize the number of edges we had to look at. We chose to look at clusters because the images were often surrounded by blank space. The hope was that the edges on the image would compose a single cluster. This didn’t work simply because we couldn’t control the clustering. Sometimes the image was made of several clusters and sometimes it included things outside the image.

Using Empirical Data

After countless sessions of trial and error, what ended up working best was using what we found to be the average angle at which people held their phones and scaling the height to work for that case. This was done by collecting empirical data.

We logged the values we got from the event listener when people used their phones on a variety of devices. We tested both when the image was flat on a surface like a table, and when it was plastered onto a wall. We then took an average of those logged angles and used that value to scale the height.
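In code, the idea boils down to something like the sketch below. The sine-based correction is a stand-in; the scaling we actually used came from the empirical data, not from this particular expression:

```javascript
// Average the beta (front-to-back tilt) angles logged during testing.
function averageAngle(loggedBetas) {
  return loggedBetas.reduce((sum, b) => sum + b, 0) / loggedBetas.length;
}

// Rough foreshortening guess: the further the phone tilts away from facing
// the image head-on (90 degrees), the shorter the image appears vertically.
// This formula is illustrative only.
function correctedHeight(baseHeight, betaDegrees) {
  return baseHeight * Math.sin((betaDegrees * Math.PI) / 180);
}

// Example usage with made-up logged angles:
// const beta = averageAngle([55, 62, 58, 65]);
// const height = correctedHeight(markerWidth * aspectRatio, beta);
```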

While this methodology is far from perfect, it produced the most consistent results. This, again, was something we only discovered through trial and error.

Image Processing

After a lot of hard work, we created a version of the application that we were happy with. We ran tests on proofs that were based on NYT newspapers and were getting consistent results across all platforms and devices. The first image with our marker was going to be included in the Sunday edition of the NYT, and by Friday night we felt confident in our image recognizer. We were excited for Sunday morning.

Sunday was a nightmare!

The NYT paper printing proved to be troublesome for the marker detection we had set up, because the samples we used for testing were different from the real thing. The colors in the bar bled into one another, which wasn’t the case with the proofs we tested on, since the printing processes were different. Not only were the printing machines different, but printing also uses another way to represent colors, CMYK, and even a simple Google search will tell you a lot about the difficulties and inconsistencies of converting between it and RGB. The colors were different shades of the red, yellow, green and blue we expected because we hadn’t taken these changes into account.

Additionally, the background color was much darker than the one we tested on. While this doesn’t seem like a big deal at first glance, the lightness of the background we tested on created a contrast between the image/marker and the paper that made it easier to isolate and identify what we were looking for.

We learned this lesson the hard way after the first newspaper ad was printed and made several adjustments to account for changes in the printing process. We also made sure we had proofs that were printed using the same printer the second time around.

Blurring

Adding a blur to the incoming video stream did two things for us: it significantly decreased the impact of the color bleeding and it made the pixel matching process a lot easier. By removing a lot of the noise, we were able to focus on larger details and get more accurate results.
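The 2D canvas context can apply the blur for us while drawing the frame; the radius here is an example value:

```javascript
// Draw a video frame onto a canvas context with a slight blur applied.
// The 2-pixel radius is illustrative; too much blur erases the details we
// still need for matching.
function drawBlurredFrame(ctx, video, width, height) {
  ctx.filter = 'blur(2px)';
  ctx.drawImage(video, 0, 0, width, height);
  ctx.filter = 'none';
}
```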

Color Correction

One of the most challenging aspects of the work was the variety of lighting conditions we had to account for. What happens when we are in natural vs. artificial lighting? What happens if the lighting is tinted with a certain shade? What happens if part of the image is under bright light and the other part is in shadow? So many scenarios to deal with. With color correction, we were able to adjust brightness based on the brightest pixel, and that helped with a lot of the problems we were having.
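A simplified version of the idea: find the brightest channel value in the frame and scale everything so that it maps to pure white. This is a sketch of the general technique, not the exact correction we used:

```javascript
// Normalize brightness in place: find the brightest channel value in the
// frame and scale all channels so that it maps to 255.
function normalizeBrightness(imageData) {
  const data = imageData.data;
  let brightest = 1; // avoid dividing by zero on an all-black frame

  for (let i = 0; i < data.length; i += 4) {
    brightest = Math.max(brightest, data[i], data[i + 1], data[i + 2]);
  }

  const scale = 255 / brightest;
  for (let i = 0; i < data.length; i += 4) {
    data[i] = Math.min(255, data[i] * scale);
    data[i + 1] = Math.min(255, data[i + 1] * scale);
    data[i + 2] = Math.min(255, data[i + 2] * scale);
  }
  return imageData;
}
```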

This, however, is not a foolproof solution and at times could even make things worse.

Optical Character Recognition (OCR)

We considered using OCR at the peak of our struggles. The images often had a phrase or title associated with them, so we thought using that as an additional identifier when the marker could not be detected would help. It was pretty simple to set up Google Cloud’s OCR software, but this had its issues as well. First, it is not guaranteed the user will frame the phrase when pointing at the image, and second, a lot of the posters were placed underground where users had limited network connection, so hitting a server in those conditions would not always be successful.
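For reference, a minimal text-detection request against the Cloud Vision REST API looks roughly like this; the API key is a placeholder and a real deployment would proxy the call server-side:

```javascript
const API_KEY = 'YOUR_API_KEY'; // placeholder

// Send a base64-encoded frame to the Cloud Vision API for text detection
// and return any text annotations it finds.
async function detectText(base64Image) {
  const response = await fetch(
    `https://vision.googleapis.com/v1/images:annotate?key=${API_KEY}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        requests: [
          {
            image: { content: base64Image },
            features: [{ type: 'TEXT_DETECTION' }],
          },
        ],
      }),
    }
  );
  const result = await response.json();
  return result.responses[0].textAnnotations || [];
}
```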

There was a lot we learned from this project. In the end, we built an in-browser experience that worked pretty well, but there were a lot of things we would have changed: choosing a marker that would have allowed us to easily infer the positioning of the image, taking the printing process into account, or not zeroing in on one approach when so much of the project’s details were unknown.

After the project wrapped, we wanted to apply our learnings, so we started working on a small internal project. Its goals are similar to those of the Cloud x NYT project: in-browser image recognition software that works on multiple devices and operating systems, with the additional goal of seeing how well we can apply our learnings. This time, we are using Google’s TensorFlow library to produce an image recognition experience that relies on ML. We will probably write an article about that too.

If you’d like to join me in building projects like this, check out our open Technology roles here.
