How We Use Machine Learning to Turn Product Packaging into Structured Data

Published in Nielsen Forward · Oct 16, 2019

By Chris Ballard, Machine Learning Research Leader, Nielsen Connect Data Science

At the supermarket shelf, consumers make their purchase choices based on an array of information, including pack design, market positioning and product claims. But what if we remove the physical shelf from the equation?

In an online environment, the range of options expands dramatically, and package information becomes that much more critical in driving purchase decisions. Think about it: If a consumer can’t pick up the product, hold it in their hands and inspect it, every aspect of a product’s packaging has to work that much harder.

Product pack information isn’t needed only by online retailers, but the acceleration of digital channels and the growing wealth of choice in the retail market shine a spotlight on a long-standing efficiency impediment: Manually converting this type of information into usable data is painstakingly laborious. It’s also encumbered by the need for constant human involvement. Specifically, human “coders” identify product information and then enter it into a dedicated field in the data schema used to record this information.

While recording product information manually might be an option for small product groups, it’s far less feasible at scale across the entire consumer packaged goods (CPG) space — dense with variety, myriad categories and frequent new product offerings. Until now.

As the source for one consumer truth, Nielsen helps clients understand consumers and the decisions they make along their paths to purchase, and product packaging plays a critical role in those decisions. To help retailers track this information, Nielsen Brandbank collects images of product packaging, which it turns into structured data and syndicates to retailers for use in their e-commerce grocery sites. This information includes ingredient listings, nutrition tables, manufacturer branding and standard industry logos, such as whether the packaging comes from recycled material and if the product meets specific dietary needs (e.g., vegan).

Given our footprint in the space, in addition to understanding the arduous process of converting package information into usable data, the Nielsen Connect AI R&D team decided to tackle the challenge of corralling this data in a better way.

The team’s goal was to develop an automated approach to recognize specific sections of product packaging based on section type, such as ingredient listing and nutrition table. Since we have many thousands of images of product packaging and the corresponding structured data readily available, we wanted an image recognition algorithm that would be able to locate the text for the information on the image, and convert it to structured data. Doing this would reduce the amount of manual effort that the current process requires.

At the outset, we needed to assess which tools were already available that we might be able to leverage. So we started analyzing convolutional neural networks (CNNs) as a possible option, given their frequent use in object detection algorithms. However, as we explored the suitability of CNNs, we began to realize that they had limitations for our use case. Since our goal is to capture sections on the packaging that typically contain text, we needed to incorporate other information in addition to the image to help the algorithm differentiate between sections more precisely.

Why the need to be so precise? Now more than ever, consumers are focused on ingredients, product claims and transparency. And that means that product labeling can make or break a sale. It can also have much more serious implications. If a consumer has an allergy, for example, and a product’s labeling is incorrect, the consequences could be much worse than just a lost sale.

Object Detection 101

To get started, let’s look at object detection, which is the task of identifying specific objects and their location within an image. In a street scene, for example, we might be tasked with identifying the pedestrians, which could involve sorting through any number of other objects in the scene. If we’re designing AI for a self-driving car, we need to be able to identify the location of any objects to avoid, such as pedestrians, other cars and bicycles. So object detection algorithms typically classify object type as well as location.
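To make this concrete, here is a minimal sketch (purely illustrative, not our production code) of running a pretrained Faster R-CNN detector from torchvision over a street image; the file name and confidence threshold are placeholders. Note how each detection pairs a class label with a bounding box.

```python
# Illustrative only: off-the-shelf object detection with torchvision's
# pretrained Faster R-CNN. The model returns a class label, a confidence
# score and a bounding box for every object it finds.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("street_scene.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    predictions = model([to_tensor(image)])[0]

# Each detection pairs a location (box) with a class and a confidence score.
for box, label, score in zip(predictions["boxes"], predictions["labels"], predictions["scores"]):
    if score > 0.8:  # arbitrary threshold for the example
        print(f"class={label.item()} score={score.item():.2f} box={box.tolist()}")
```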

So how does this relate to detecting information on product packaging?

Unlike in a street scene, the objects that need to be detected on a product package are regions that contain specific types of information. The trick with product packaging, however, is that there’s no uniform structure across products. Manufacturers need to distinguish their products from others, so it’s common to see variances in how different brands style and position the information on their packs. The degree of overlap from region to region further complicates the issue. The ingredients, for example, may “flow” around the nutrition table or be entirely independent of it. For these reasons, a naïve approach that extracts all of the text from an image will yield unstructured text with no context, diminishing the quality of the extracted data because of the “noise” of the surrounding information.

To solve for the lack of consistency across products, we use an object detection algorithm to both classify and locate specific regions on the packaging before we extract the text. For example, as well as locating the ingredient listing on the packaging, we also label it as an ingredient list to differentiate it from other section types. That way, we ensure that we extract the text in context. This has a number of benefits. If we know that a portion of text is an ingredient listing, we can clean the detected text using specific ingredient dictionaries, which yields much higher-quality ingredients than if this process were done without knowing the context.
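As an illustration of that idea (not our production pipeline), the sketch below crops a region that the detector has labelled as an ingredient list, runs OCR on just that crop and keeps only terms found in an ingredient dictionary. The OCR engine, file path, box coordinates and the toy dictionary are all assumptions for the example.

```python
# Illustrative sketch: once a region has been detected AND labelled as an
# ingredient list, we can OCR just that crop and clean the result against
# an ingredient dictionary.
import pytesseract  # one possible OCR engine; an assumption here
from PIL import Image

INGREDIENT_DICTIONARY = {"sugar", "salt", "wheat flour", "palm oil"}  # toy example

def extract_ingredients(image_path, box):
    """box = (left, top, right, bottom) predicted by the object detector."""
    crop = Image.open(image_path).convert("RGB").crop(box)
    raw_text = pytesseract.image_to_string(crop).lower().replace("ingredients:", "")
    candidates = [token.strip() for token in raw_text.split(",")]
    # Keeping only terms found in the dictionary filters out OCR noise.
    return [c for c in candidates if c in INGREDIENT_DICTIONARY]

# Hypothetical usage with a box predicted by the detector:
# print(extract_ingredients("pack_back.jpg", (40, 310, 620, 480)))
```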

The Need for New Object Detection Techniques

Certainly we aren’t pioneers in object detection. That said, however, standard object detection algorithms, including Faster R-CNN, aren’t a match for our goals. Networks like Faster R-CNN, which combine region proposal and classification into a single deep network architecture, work well when they’re trained to recognize regions that can be identified with visual cues learned through convolutional layers. For example, there are strong visual indicators for the presence of people in a street scene.

However, when we train a Faster R-CNN network to recognize regions on packaging, we need the network to locate textual regions, such as ingredients and allergen information. These regions look very similar visually, and we have found that the standard Faster R-CNN architecture does not perform well in these instances.
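For reference, the kind of baseline we’re describing looks roughly like the following: torchvision’s standard Faster R-CNN with its classification head swapped out for packaging-specific classes. The class list here is illustrative rather than our exact label set.

```python
# A hedged sketch of a Faster R-CNN baseline for packaging regions:
# torchvision's detector with its box-classification head replaced so it
# predicts our own (illustrative) packaging classes.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

PACKAGING_CLASSES = ["background", "ingredients", "nutrition", "allergens"]  # example labels

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, len(PACKAGING_CLASSES))
# Trained this way on appearance alone, visually similar text regions
# (e.g., ingredients vs. allergen statements) are easily confused.
```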

Appearance and Text Information in a CNN

When training a network to recognize regions on product packaging, it’s not feasible to rely on appearance alone. Textual information plays a key role in differentiating content regions, so it’s worth considering how humans recognize the difference between regions of text, such as nutritional content and ingredients.

Importantly, when we read ingredient information, we expect to see a comma-separated listing of the included ingredients. By contrast, we expect nutritional content to be presented in a tabular format, complete with standard headings alongside weights and percentages. To address this challenge, we developed a method that generates a text map to inject textual information into the network. A text map is essentially a heat map in which the intensity of the color of each word varies depending on how likely that word is to be present in a given region.

To build text maps, we use optical character recognition (OCR) to capture the text on an image. We then color each word according to the probability of it being present in the target region. For example, the word “ingredients” and commas will appear with greater intensity than other words because they are much more likely to be present within the ingredient section. Similar text maps can also be generated for other sections, such as nutrition facts.
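A highly simplified sketch of the idea follows; the word-probability table and the way intensities are painted onto the map are assumptions for illustration, not the exact recipe from our paper.

```python
# Simplified text-map construction: each OCR'd word is drawn onto a
# single-channel "heat map" whose intensity reflects how likely that word
# is to occur in the target section (here, the ingredient listing).
import numpy as np

# Toy word-likelihood table; in practice these weights would be estimated
# from an existing corpus of labelled packaging data.
INGREDIENT_WORD_PROB = {"ingredients": 0.95, ",": 0.90, "sugar": 0.80, "energy": 0.05}

def build_text_map(ocr_words, image_size, default_prob=0.1):
    """ocr_words: list of (word, (left, top, right, bottom)) tuples from OCR."""
    height, width = image_size
    text_map = np.zeros((height, width), dtype=np.float32)
    for word, (l, t, r, b) in ocr_words:
        prob = INGREDIENT_WORD_PROB.get(word.lower(), default_prob)
        text_map[t:b, l:r] = prob  # paint the word's bounding box with its intensity
    return text_map
```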

Standard convolutional networks typically ingest three channels of information: one for each of the RGB channels containing visual information. To inject text maps, we provide this information as additional channels, which allows the convolutional layers to generate feature maps that account for both the visual and textual appearance combined. We create three additional channels for each text map, which encodes word-level information by color according to predefined metrics, such as punctuation, word occurrence and Bayesian distance between words.
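The sketch below shows one common way to widen a backbone’s first convolution from three to six input channels so that RGB and text-map channels are ingested together. It illustrates the general technique rather than our published architecture; reusing the pretrained RGB filters and initializing the new channels from their mean is one reasonable choice, not necessarily ours.

```python
# Illustrative sketch: feed text maps to a CNN as extra input channels by
# widening the backbone's first convolution from 3 (RGB) to 6 (RGB + text maps).
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(pretrained=True)
old_conv = backbone.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
new_conv = nn.Conv2d(6, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=False)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight  # keep the pretrained RGB filters
    new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)  # init text-map filters
backbone.conv1 = new_conv

# A 6-channel input: an RGB image stacked with three text-map channels.
x = torch.cat([torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)], dim=1)
features = backbone(x)  # the forward pass now accepts the combined input
```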

In testing, we have found that our automated recognition system performs significantly better than standard Faster R-CNN for detecting ingredient and nutritional information.

While we’ve always provided retailers and other clients with product characteristics that allow them to segment markets (e.g., by fat percentage), integrating ingredient and nutrition detection will provide much richer data sets. And the benefit isn’t limited to retailers: By incorporating ingredient data into our sales analytics, our clients will be able to see how the presence or absence of certain ingredients factors into consumer purchase decisions. That will, in turn, drive the innovation that feeds consumer demand.

Notes

The Nielsen Connect AI team first presented this research at CVPR 2019 in the Language and Vision Workshop. The team also presented the research in a paper dated May 2019 (authors: Roberto Arroyo, Javier Tovar, Francisco J. Delgado, Emilio J. Almazán, Diego G. Serrador and Antonio Hurtado).
