How do we figure out what every store sells without losing our minds?

Karan Kanwar
Published in Wing Assistant
Mar 20, 2018

The Wing team has a bit of a reputation for going above and beyond, and this is no exception. Thanks to Kevin Norgaard and his fraternity, we sent people to all the major grocery stores in our area and took thousands of pictures of items and item labels on pretty much every shelf. That's over 5,000 pictures. Just in Irvine.

To summarize why we have 5,000 pictures of grocery shelves: when a user asks us something like "I need someone to get me a bottle of merlot, onions, and garlic," we need to figure out that this likely means a trip to a grocery store. In Irvine, that might be Trader Joe's, Target, or Albertsons. How exactly we do that part of it is proprietary, so we won't delve much deeper than that.

However, we need to know what these various grocery stores have on their shelves. Maybe Target doesn't have merlot? Then what? We just wasted the user's time, the fulfiller's gas, and our money. So, how do we read the product labels in over 5,000 pictures, some handwritten, some in Arial, some in Times New Roman?

Enter Neural Networks!

We think we might be able to do this using TensorFlow, but that's an oversimplification. What does it actually entail?

What is a label?

Computers are dumb, so we have to train a computer to recognize what a label looks like. An image might contain 1 product label, or it might contain 15. Reading the products themselves is an insane undertaking, so it makes much more sense to focus on the relatively uniform labels. We need to train a system to extract cutouts of labels from images, even if those images are rotated, a little blurry, etc. Some combination of OpenCV's contour detection might be ideal for this use case, roughly along the lines of the sketch below.
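For the curious, here's what that could look like. This is just a sketch, not our actual pipeline: it assumes the labels are roughly rectangular and lighter than the shelf behind them, and the min_area threshold is made up for illustration.

```python
import cv2

def extract_label_cutouts(image_path, min_area=2000):
    """Rough sketch: find label-sized regions in a shelf photo via contours."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Blur a little to tolerate slightly out-of-focus photos, then use Otsu
    # thresholding so the (usually light) labels stand out from the shelf.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    _, thresh = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # OpenCV 4.x: findContours returns (contours, hierarchy).
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    cutouts = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue  # skip specks, price-tag clips, shelf edges
        x, y, w, h = cv2.boundingRect(contour)
        cutouts.append(image[y:y + h, x:x + w])
    return cutouts
```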

What does that label say?

Labels contain a lot of supplemental information, most of it useless to us, since all we want is the product name and price. So, what now? It seems prudent to decrease the room for error in our classifier, so instead of training a perceptron on every variant of character it could encounter, it probably makes sense to first figure out whether we're looking at a letter or a number. That's not exactly simple, since some letters and numbers look very similar, but we can get smart and use contextual clues like dollar signs and the length of the text snippet.
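That first-stage model could be as simple as a small dense network over single-character crops. To be clear, the 28x28 input size and the architecture below are assumptions we're making for illustration, not what we've actually settled on:

```python
import tensorflow as tf

def build_letter_or_digit_model():
    """First-stage sketch: given a 28x28 grayscale crop of one character,
    decide whether it's a letter or a digit."""
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # >0.5 means "digit"
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```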

Once we figure out whether it's a letter or a number, the classifier only has to choose among 26 or 10 classes instead of 36, so a blind guess goes from a 1-in-36 chance of being right to 1-in-26 or 1-in-10. Assuming we're looking at letters, the next perceptron has to figure out which letters we're looking at and output the entire string, likely character by character. If it's a number, the same applies, but the odds of getting it right are much higher.
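The second stage could then be two small classifiers, one for letters and one for digits, with the first-stage model routing each character crop to the right one. Again, this is a sketch under the same assumptions as above (28x28 crops, ordered left to right), and read_snippet is just a name we made up for this post:

```python
import numpy as np
import tensorflow as tf

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"

def build_character_model(num_classes):
    """Second-stage sketch: a small dense network that picks one of
    `num_classes` characters (26 for letters, 10 for digits)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def read_snippet(char_crops, router, letter_model, digit_model):
    """Read a text snippet one character crop at a time.
    `char_crops` is assumed to be a list of 28x28 arrays, left to right."""
    text = []
    for crop in char_crops:
        batch = np.expand_dims(crop, axis=0)  # shape (1, 28, 28)
        if router.predict(batch, verbose=0)[0, 0] > 0.5:
            idx = int(np.argmax(digit_model.predict(batch, verbose=0)))
            text.append(DIGITS[idx])
        else:
            idx = int(np.argmax(letter_model.predict(batch, verbose=0)))
            text.append(LETTERS[idx])
    return "".join(text)
```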

In other words, for every image, we should probably: find and crop each label, split the text into character snippets, decide whether each snippet is letters or numbers, and then read it character by character. That's roughly the pipeline sketched below.
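Tying the sketches above together, the per-image flow might look something like this. Note that segment_characters is a hypothetical helper we haven't shown, which would split a label cutout into ordered 28x28 character crops:

```python
def read_shelf_image(image_path, router, letter_model, digit_model):
    """End-to-end sketch, reusing extract_label_cutouts and read_snippet
    from the sketches above."""
    snippets = []
    for cutout in extract_label_cutouts(image_path):
        char_crops = segment_characters(cutout)  # hypothetical helper, not shown
        snippets.append(read_snippet(char_crops, router, letter_model, digit_model))
    return snippets
```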

We’ll keep you guys updated on this one.
