Attribute Enrichment: Under the Hood

Anton Baltachev
Constructor Engineering
10 min read · Nov 15, 2023

In the previous article, we gave an overview of how Attribute Enrichment at Constructor works. In this part, we’ll explore the machine learning techniques used for this process in more detail.

Choosing which attributes to display alongside products depends largely on what customers want to see and know. For instance, when shopping for a highly technical product like a computer chip or a specific set of screws, many customers dig into the details, reading about features, compatibility, and technical capabilities, and they often prioritize these over the images associated with the products. In these domains, words and descriptions play a crucial role.

When someone’s looking for a new Lego set, it’s all about how cool the model looks once it’s put together. Sure, the size and recommended age might be good to know, but what really grabs customers’ attention is the design and the minifigures that come with it. With Lego, it’s those eye-catching details that usually seal the deal.

Now, let’s look at furniture. Think of someone arranging their living room and looking for a new sofa. They’d want to see pictures of the sofa from different sides to get a good idea of its design. At the same time, they’d want to read about its size to make sure it fits in their room, in addition to what it’s made of to avoid poor quality materials. In cases like this, both pictures and words are really important to help customers decide what to buy.

You might see what I’m getting at. There’s no silver bullet, and it’s essential to have a full arsenal at hand, including image-based, text-based, and multimodal approaches that merge visual and textual information.

Image-Based Approach

An image-based approach is an obvious choice for attribute enrichment due to the availability of corresponding images for all items. In our exploration of various methods, we have uncovered some valuable insights that we would like to share.

Fine-Tuning an Image Encoder

When we talk about fine-tuning an image encoder, we mean adjusting a model that already knows a lot about pictures to get better at spotting product details. Think of it like this: models like ConvNets or Vision Transformers have seen tons of images before, so they know general things about pictures. With fine-tuning, we’re helping them get better at specific tasks, like spotting the details of a shirt’s pattern or the color of a shoe. So, if you have products with unique details that need to be noticed, fine-tuning is a good way to help the model see and understand those details better (a minimal code sketch appears at the end of this section). However, it comes with its own set of limitations:

  • Overfitting risks. If the training data lacks diversity or contains noise, the model may overfit and struggle to generalize well on unseen data.
  • Ambiguity in prediction. Let’s say you’re using this model for an e-commerce platform that sells home decor. If an image of a living room shows both a plain sofa and a rug with a diagonal checkered pattern, the model can get confused because there’s no clear signal about which item’s pattern it should predict.
  • Limited consideration of textual data. Imagine browsing for a raincoat online. The image showcases its design, color, and fit on the model. However, textual data is crucial here. It can detail whether the material is breathable, if the coat is 100% waterproof or just water-resistant, and the types of temperatures or conditions it’s best suited for.

Now, consider looking for a pair of hiking boots. The picture gives you a sense of the boots’ style and construction, but it’s the textual information that could specify features like ankle support, type of insulation, or the kind of terrain it’s designed for. Customer reviews could reveal how comfortable they are during long treks or if they’re truly slip-resistant in wet conditions.

In both scenarios, the picture grabs attention, but the textual data provides the depth and context necessary for making a purchase. Ignoring such textual insights can lead the model to make less informed predictions, potentially missing key attributes that customers care about.
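Before moving on, here’s a minimal sketch of what fine-tuning a pre-trained image encoder for an attribute could look like in PyTorch. The ResNet-50 backbone, the "pattern" attribute, the class count, and the training-step helper are illustrative placeholders, not our production setup:

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative example: fine-tune a pre-trained encoder to predict a
# "pattern" attribute (e.g., plain, striped, checkered) from product images.
num_patterns = 4  # placeholder class count

# Start from an encoder that already "knows a lot about pictures".
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the backbone so we keep its general visual knowledge.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for our attribute classes.
model.fc = nn.Linear(model.fc.in_features, num_patterns)

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One training step on a batch of (image tensor, attribute label) pairs."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the backbone and training only the new head is also one simple guard against the overfitting risk mentioned above when labeled data is limited.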

Multimodal Approach

Fully-Connected Layer on Top of a Multimodal Encoder

To overcome the problem of ambiguous predictions, we chose to explore a multimodal approach. This method involves using a fully-connected layer on top of a multimodal encoder, combining visual and textual information.

How It Works:

  • Visual Encoding. Firstly, images of products go through an image encoder, much like our previous approach, turning the image into a list of numbers that computers can understand (often called a feature vector).
  • Textual Encoding. At the same time, any text associated with that product — product description, title, specifications, etc. — is processed by a text encoder, turning that text into its own list of numbers or a feature vector.
  • Combining The Two. Once we have the image and text both turned into these lists of numbers, they are combined. This combination can be as simple as just stacking them on top of each other or as complex as merging them with specialized mathematical operations.
  • Fully-Connected Layer. The combined data then goes through the fully-connected layer, which works as the final decision maker, figuring out which attributes best fit the product.
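As a rough sketch of these four steps (not our exact architecture), the combination and the fully-connected layer could look like this; the class name, feature dimensions, and class count are illustrative, and the image and text encoders are assumed to be pre-trained models that output fixed-size feature vectors:

```python
import torch
import torch.nn as nn

class MultimodalAttributeClassifier(nn.Module):
    """Sketch: concatenate image and text feature vectors, then classify.

    `image_dim`, `text_dim`, and `num_classes` are placeholder values.
    """

    def __init__(self, image_dim=512, text_dim=512, num_classes=10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(image_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, image_features, text_features):
        # "Combining the two": the simplest option is concatenation.
        combined = torch.cat([image_features, text_features], dim=-1)
        # The fully-connected layer acts as the final decision maker.
        return self.classifier(combined)
```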

However, it’s not the best solution either. There are several factors to consider:

  • Overfitting risks. As mentioned, if our training data isn’t varied enough or if it has mistakes, our model might get too fixated on wrong details. The fully-connected layer, especially when not pre-trained, can amplify these mistakes. It starts from a “noisy” or random state and can latch onto this noise.
  • Underfitting risks. On the flip side, if the fully-connected layer is too simple, it might not capture all the intricate details from the combined data. This means it might miss out on some important patterns, leading to more generic or inaccurate predictions.
  • Model maintenance challenges. As with any model, changes or updates can be a headache. If a new type of product attribute comes into the mix, it might mean retraining a large part of the model.

Fine-Tuning Multimodal Models for Zero-Shot Classification

Unfortunately, we ran into overfitting when training fully-connected layers from scratch. So, instead of learning new layers, we looked into fine-tuning multimodal models that are already pre-trained and using them in a zero-shot setting.

We employ a sophisticated model known as CLIP for this purpose. CLIP is quite impressive in its abilities. For instance, when a user uploads an image of a t-shirt with an unconventional neckline like a “mock neck,” CLIP can intelligently select the most suitable option from the list of possible choices: v-neck, round neck, square neck, or mock neck. Remarkably, it accomplishes this even though it wasn’t explicitly trained to identify necklines. These capabilities come from its extensive contrastive pretraining on image-text pairs.

CLIP Pretraining Method. Image retrieved from https://arxiv.org/abs/2103.00020.
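To make the zero-shot idea concrete, here is a small sketch using the Hugging Face transformers implementation of CLIP; the checkpoint name, the image path, and the candidate prompts are illustrative:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate attribute values phrased as short text prompts.
necklines = [
    "a t-shirt with a v-neck",
    "a t-shirt with a round neck",
    "a t-shirt with a square neck",
    "a t-shirt with a mock neck",
]

image = Image.open("tshirt.jpg")  # placeholder image path
inputs = processor(text=necklines, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# Similarity of the image to each prompt, turned into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(necklines[probs.argmax().item()])
```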

However, like any powerful tool, CLIP does have its limitations:

  • Risk of forgetting. Multimodal models are susceptible to forgetting previously learned information when fine-tuning, making it challenging to maintain overall model quality. The improvements in specific tasks or domains may come at the cost of losing other valuable qualities.
    One way to prevent this is by freezing most of the layers, meaning we don’t change them during fine-tuning (see the short sketch after this list). This way, we don’t erase what the model already knows, and it’s less likely to forget important information.
  • More complex training setup. Fine-tuning multimodal models for zero-shot classification requires careful consideration of negative sampling strategies, adding complexity to the training process compared to regular classification.
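As an illustration of the freezing idea, one option is to keep most of CLIP frozen and fine-tune only its small projection layers; the attribute names below follow the transformers CLIPModel implementation, so treat this as a sketch rather than a prescription:

```python
# Freeze everything, then unfreeze only the projection layers so that
# fine-tuning cannot erase most of what CLIP already knows.
for param in model.parameters():
    param.requires_grad = False

for param in model.visual_projection.parameters():
    param.requires_grad = True
for param in model.text_projection.parameters():
    param.requires_grad = True
```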

In summary, we find value in utilizing CLIP as it greatly assists us in addressing challenging issues, such as identifying t-shirt necklines. Nonetheless, it’s worth noting that not all aspects can be discerned from images alone. In some cases, we rely on textual information like titles and descriptions to categorize items, even though CLIP primarily specializes in visual understanding.

Text-Based Approaches

Keeping these considerations in mind, we’ve started exploring text-based approaches, where our primary objective is to extract attributes from textual data. In the following sections, we will delve deeper into these approaches, exploring their variations, practical applications, and advantages and disadvantages.

Regular Expressions

One prominent technique within text-based approaches is the use of regular expressions, which are incredibly powerful for pinpointing specific patterns within text data. By defining patterns or rules up front, we can achieve remarkable results with this simple yet effective concept.

To illustrate how we can extract valuable product information from textual descriptions, consider the following real-world examples:

Product descriptions often include technical specifications such as screen size, weight, or camera resolution. You can use regular expressions to extract these specifications by defining patterns that match relevant numeric values and units:

  • Regular Expression Pattern for Screen Size: (\d+(\.\d+)?)-inch (screen|display)
  • Regular Expression Pattern for Weight: (\d+(\.\d+)?) (pounds|kg)

Suppose the product descriptions contain sentences like:

  • “This laptop features a 15.6-inch display.”
  • “Our camera weighs only 0.5 pounds.”
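A minimal Python sketch of applying these patterns to the example sentences (the pattern strings mirror the ones above):

```python
import re

SCREEN_SIZE = re.compile(r"(\d+(\.\d+)?)-inch (screen|display)")
WEIGHT = re.compile(r"(\d+(\.\d+)?) (pounds|kg)")

descriptions = [
    "This laptop features a 15.6-inch display.",
    "Our camera weighs only 0.5 pounds.",
]

for text in descriptions:
    screen = SCREEN_SIZE.search(text)
    weight = WEIGHT.search(text)
    if screen:
        print("screen size:", screen.group(1), "inches")
    if weight:
        print("weight:", weight.group(1), weight.group(3))
```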

Applying the respective regular expression patterns would extract screen sizes (e.g., “15.6-inch”) and weights (e.g., “0.5 pounds”) from the descriptions. Regular expressions offer several advantages:

  • Suitable for Multi-Word Text Patterns. Regular expressions excel in extracting multi-word patterns from text, making them ideal for scenarios where attribute values span multiple words.
  • Fast and Effective. Regular expressions are efficient for extracting attributes from large volumes of text data, making them a speedy choice for processing.
  • Deterministic. You have full control over attribute extraction results, ensuring consistency.
  • One-Time Extraction. Results can be extracted once and used offline, reducing computational overhead.
  • Customizable. Regular expressions can be easily modified to accommodate changes or extensions in the extraction rules.

However, there are some challenges to consider:

  • Complexity. Regular expressions can become intricate, especially when dealing with complex text patterns, potentially leading to difficult-to-maintain code.
  • Limited Contextual Understanding. They lack contextual understanding, making it challenging to cover all possible variations of attribute values and limiting their suitability for tasks like sentiment analysis.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is a powerful technique for extracting specific entities — such as product brands, names, sizes, colors, and materials — from unstructured text. NER models can significantly enhance attribute enrichment by automatically identifying and categorizing these entities. Let’s explore NER in more detail, including real-world examples of its application:

Suppose you have a dataset of fashion product descriptions:

  • “This T-shirt is available in Small, Medium, and Large sizes.”
  • “Our backpack collection includes sizes from 15 to 30 liters.”
  • “Choose from shoe sizes ranging from 5 to 12.”

A NER model trained on your product data can recognize size information like small, medium, and large as well as numeric sizes like 5, 12, 15, and 30.
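For reference, running an off-the-shelf NER model takes only a few lines. The sketch below uses spaCy’s small English model; its general-purpose labels (such as CARDINAL or QUANTITY) illustrate both how easy extraction is and why generic entity types may not map cleanly onto product attributes:

```python
import spacy

# Assumes the model has been installed via: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Our backpack collection includes sizes from 15 to 30 liters.")
for ent in doc.ents:
    # Generic labels such as QUANTITY or CARDINAL, not product-specific ones.
    print(ent.text, ent.label_)
```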

Open-source NER models work well when your data is similar to the data these models were trained on. In that case, these models offer:

  • Semantic Understanding. If your data aligns with the model’s training data, it can accurately detect entities. For instance, if both the data and the NER model focus on food, it can correctly identify food items.
  • Automatic Categorization. These models can automatically categorize extracted values into predefined attributes, such as brands, names, and sizes, eliminating the need for additional clustering methods.

However, these models have cons, such as:

  • Limited Entities. The models come with predefined attributes, which might not cover all the entities you need.
  • Missed Values. Since these models are trained on external datasets, they might not extract or recognize specific values unique to your data.

NER models fine-tuned on your own data perform very well, provided you have enough high-quality labeled data for training (a small sample of such labeled data is sketched after the lists below). If we lived in an ideal world, these models would be the solution for the majority of AE tasks because the models:

  • Understand Data. Because these models are trained on your data, they have a deep understanding of it.
  • Recognize Semantics. They grasp the semantic properties of your data.
  • Identify Unlimited Attribute Values. They can identify a vast range of attribute values, from known ones to entirely new entries in your dataset.

However, these models also have their own drawbacks, such as:

  • Data Requirement. A significant amount of labeled data is required for training, which can be challenging to obtain.
  • Resource Intensive. Training these models demands time and robust computational resources, like GPUs.
  • Validation Needs. Post-training, robust validation techniques are essential to ensure the model’s accuracy.
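To give a feel for the labeling effort, here is what a tiny slice of training data for a custom size entity might look like in spaCy’s (text, annotations) format; the SIZE label and the character offsets are our own illustrative choices:

```python
# Each example pairs raw text with character-offset spans for a custom label.
TRAIN_DATA = [
    (
        "This T-shirt is available in Small, Medium, and Large sizes.",
        {"entities": [(29, 34, "SIZE"), (36, 42, "SIZE"), (48, 53, "SIZE")]},
    ),
    (
        "Choose from shoe sizes ranging from 5 to 12.",
        {"entities": [(36, 37, "SIZE"), (41, 43, "SIZE")]},
    ),
]
```

Producing thousands of such examples with consistent spans is exactly the data requirement mentioned above.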

Large Language Models

Question Answering (QA) is a technique commonly used to enrich the values of specific attributes. While you can ask any question about your data, including requests to extract all mentioned attributes, being too broad might not guarantee accurate results. It’s straightforward to understand how to use QA models for Attribute Enrichment (AE): the idea is to provide the model with a prompt containing product details, followed by a question about the specific attribute you’re interested in. Then, the model provides an answer containing the value of that attribute.

For instance, imagine you have a product description, and you want to extract flavor attributes using a QA model:

Context: “Our gourmet chocolate box is a delightful indulgence for chocolate aficionados. It boasts an array of handcrafted chocolates with flavors like dark truffle, sea salt caramel, and raspberry ganache, all crafted from top-notch, sustainably harvested cocoa beans.”
Question: “What are the flavors of the chocolates?”
Answer: “Dark Truffle, Sea Salt Caramel, Raspberry Ganache.”
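A sketch of how such a prompt might be sent to an LLM via the OpenAI Python client; the model name is illustrative, and any chat-completion-style API would work similarly:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

context = (
    "Our gourmet chocolate box is a delightful indulgence for chocolate "
    "aficionados. It boasts an array of handcrafted chocolates with flavors "
    "like dark truffle, sea salt caramel, and raspberry ganache, all crafted "
    "from top-notch, sustainably harvested cocoa beans."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {
            "role": "system",
            "content": "Extract the requested attribute from the product "
                       "description. Answer with a comma-separated list only.",
        },
        {
            "role": "user",
            "content": f"{context}\n\nQuestion: What are the flavors of the chocolates?",
        },
    ],
)
print(response.choices[0].message.content)
```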

We selected a GPT-like model because it performs well on QA tasks. When provided with a clear prompt about specific details, these models can offer precise answers. Their strengths lie in their extensive training on a vast amount of data and their inherent ability to identify entities, eliminating the need for separate NER models. However, there are points you should consider:

  • Commercial Restrictions. Some models may not be available for commercial use.
  • Performance Limitations. Cloud-based models may have token and request limitations.
  • Processing Time. It can be lengthy, whether using an API or running the model locally.

Conclusion

In the ever-evolving landscape of attribute enrichment, there’s no one-size-fits-all solution. The key lies in having a versatile toolkit at our disposal, with the flexibility to choose the right approach based on the specific needs of each product category. As technology continues to advance, we remain committed to refining our methods and ensuring that customers have all the information they need to make confident purchasing decisions.
