Introducing Attribute Enrichment at Constructor

Published in Constructor Engineering · 10 min read · Jul 26, 2023

Bad product data makes it hard for customers to find what they need on websites, even with well-formulated search queries or specific search filters. This happens when products don’t have searchable attributes or complete and correct tags.

Working with some of the biggest retailers in the world, Constructor is continually developing new solutions to help ecommerce teams run successful businesses and improve their customer experience. We got tired of poor product attribute data leading to poor results, and so we set out to improve it in a way that wouldn’t only make search, filtering, and product discovery better, but could optimize a B2C or B2B e-commerce company’s entire value chain.

Enter Attribute Enrichment: Constructor’s new solution to automatically generate and fix product attribute data using a mix of Constructor’s existing best-in-class AI and new innovations in the fields of machine vision and text classification.

Attribute Enrichment automatically tags a customer’s products with new relevant attributes and product categories. Enriched product attributes enable better search and browse experiences, leading to increased conversions and revenue and minimizing returns from people mistakenly buying products based on incomplete or incorrect data. On top of those direct lifts, Attribute Enrichment also provides valuable input to other systems like Product Information Management (PIM) platforms and improves operational efficiency, saving time and effort for employees who no longer have to manually enter this information.

We know a lot of companies are making big claims around AI these days without a lot of substance, so we wanted to dive into how Attribute Enrichment works from the technical side. In this article, we give an overview of Constructor’s Attribute Enrichment engine, including overall system design, data we work with, and challenges we tackled while building the system.

Methodology

Our Attribute Enrichment engine leverages machine learning and deep learning techniques under the hood to predict both product categories and product attributes based on image data, text data, and clickstream data.

Available data

Data is an essential part of Attribute Enrichment as we train our models on it. As inputs, we take raw data from an ecommerce company’s product catalogs, along with the ecommerce website’s behavioral clickstream data.

Clickstream data contains important user/website and user/app interactions including search queries, browse filters, shopping sessions, clicks on search results, add-to-carts, purchases, etc. Currently, Attribute Enrichment relies heavily on search query history to detect high-demand attributes and prioritize them during enrichment.

Because clickstream has always been critical to Constructor’s algorithms, we’ve collected a very large amount of this data across many industries and countries, and we have years of experience ensuring that the clickstream data we receive is correct and trustworthy. Fusing this clickstream data with data from product catalogs gives us the ability to predict the most accurate and relevant attributes for every product.

Customers send us product catalog data on a daily basis. Typically, this data set contains SKUs, product titles, descriptions, images, variations, and categories for each product, along with various product attribute data like color, size, or material. However, it’s not uncommon for this attribute data to be missing or broken.

Attributes we predict

Constructor works with some of the largest ecommerce companies in the world, both in B2C and B2B and spanning many different domains, including but not limited to apparel fashion, grocery, home decor, pet supplies, B2B manufacturing/distribution, and more. Customers across the same domain have similar product taxonomies, attributes, and attribute values.

We began this new product by focusing on fashion and home, as those were the domains where we saw the most interest and already had the most robust trained models. But because we are constantly developing our models for other domains and adapting them to customer needs, we have since applied Attribute Enrichment to everything from eyeglasses and frames to grocery to esoteric construction supplies.

We define attribute taxonomy as a tree of the following structure: domain > product category > attribute > attribute values. Below are several examples of attributes we extract in different domains (though this list is certainly not exhaustive).
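As a rough illustration of that tree structure (not the examples from the original figures), the taxonomy can be sketched as a nested mapping. The domains, categories, attributes, and values below are made up for illustration, and the helper names `allowed_values` and `is_valid` are hypothetical.

```python
# Hypothetical taxonomy: domain > product category > attribute > attribute values.
# All names and values here are illustrative assumptions, not a real catalog.
TAXONOMY = {
    "fashion": {
        "dresses": {
            "neckline": ["boat neck", "v-neck", "crew neck"],
            "sleeve length": ["cap sleeve", "short", "long"],
        },
    },
    "home": {
        "sofas": {
            "material": ["leather", "velvet", "linen"],
        },
    },
}


def allowed_values(taxonomy, domain, category, attribute):
    """Walk the tree and return the values defined for one attribute.

    Returns an empty list if any level of the path is missing.
    """
    return taxonomy.get(domain, {}).get(category, {}).get(attribute, [])


def is_valid(taxonomy, domain, category, attribute, value):
    """Check whether a predicted value belongs to the taxonomy."""
    return value in allowed_values(taxonomy, domain, category, attribute)


print(allowed_values(TAXONOMY, "fashion", "dresses", "neckline"))
print(is_valid(TAXONOMY, "home", "sofas", "neckline", "v-neck"))
```

A lookup like this makes it easy to reject a prediction that does not belong to the category it was made for, such as a neckline value on a sofa.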

Enrichment methods

Every e-commerce website contains textual and visual information on available products. To extract that information, we leverage three approaches:

Image-based approaches are an obvious choice for attribute enrichment due to the availability of corresponding images for all items. We use a cadre of different visual enrichment methods based on pretrained and fine-tuned neural networks, zero-shot vision models, and others.
Text-based approaches are also super helpful here as lots of product information is represented in textual form, e.g. product titles, descriptions, reviews, categories, and tags. We developed a variety of methods from very basic (like those based on regular expressions or token matching) to more advanced NLP techniques like transformer-backed or GPT-based question answering.
Multimodal approaches can combine both visual and textual information for learning and inference.

The choice of approach depends on the specific needs of each customer. For instance, textual information plays a significant role in domains like grocery and electronics where ingredients or tech specs matter most, while visual information is more crucial for apparel and home decor.
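The most basic text-based method mentioned above, matching known attribute values with regular expressions, can be sketched as follows. This is a minimal illustration under assumptions: the color vocabulary and the function names `build_matcher` and `extract_values` are hypothetical, and a production system would need to handle synonyms, misspellings, and negations.

```python
import re

# Hypothetical vocabulary of values for a single attribute (color).
COLOR_VALUES = ["navy blue", "blue", "red", "black"]


def build_matcher(values):
    """Compile one case-insensitive pattern that matches any known value.

    Longer values come first so "navy blue" wins over "blue", and word
    boundaries keep "red" from matching inside words like "colored".
    """
    ordered = sorted(values, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(v) for v in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def extract_values(text, matcher):
    """Return the distinct attribute values found in the text, in order."""
    found = []
    for match in matcher.finditer(text):
        value = match.group(1).lower()
        if value not in found:
            found.append(value)
    return found


COLOR_MATCHER = build_matcher(COLOR_VALUES)
print(extract_values("Navy Blue Cotton Dress with red trim", COLOR_MATCHER))
# ['navy blue', 'red']
```

Even this simple approach illustrates why text methods suit spec-heavy domains: when a value is written out in a title or description, it can be recovered without any model training.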

At the end of the day, it’s most important for us to find and enrich attributes that are important for e-commerce businesses and their customers. We use clickstream data to achieve this. While businesses may have their own theories of what attributes are important for shoppers, the customers themselves know the exact answer to this question. Through historical analysis of search queries, we can discover the most necessary attributes and attribute values for users (as well as product categories) and enrich them.
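One simple way to picture this kind of search query analysis is to count how often queries mention a value of each attribute. The sketch below assumes a tiny hand-written vocabulary and query log, and the function name `attribute_demand` is hypothetical. It only matches single-token values, so it is a starting point rather than a description of the production analysis.

```python
from collections import Counter

# Hypothetical attribute vocabulary (single-token values only, for simplicity).
ATTRIBUTE_VALUES = {
    "color": ["red", "black"],
    "material": ["leather", "linen"],
    "neckline": ["v-neck"],
}


def attribute_demand(queries, attribute_values):
    """Count how many queries mention at least one value of each attribute."""
    demand = Counter()
    for query in queries:
        tokens = set(query.lower().split())
        for attribute, values in attribute_values.items():
            if any(value in tokens for value in values):
                demand[attribute] += 1
    return demand


# Hypothetical search query history.
queries = ["red leather sofa", "black dress", "linen shirt", "v-neck sweater red"]

for attribute, count in attribute_demand(queries, ATTRIBUTE_VALUES).most_common():
    print(attribute, count)
```

Attributes at the top of this ranking are the ones shoppers ask for most often, which makes them natural candidates to enrich first.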

Finally, rather than limiting ourselves to existing methods, we are constantly improving our models and researching new approaches that work better for our customers. We’ll cover them in detail in the next article, so stay tuned.

Challenges

There are a number of challenges in product data enrichment that can hinder accurate identification of product or category attributes.

Lack of domain understanding

In fashion, for example, understanding important attributes like fabric type, sleeve length, or neckline requires domain expertise (what is a cap sleeve? A boat neck?). Without this knowledge, it becomes difficult to enrich product descriptions with the appropriate attributes.

Below are some examples of fabric patterns (first two products) and neckline (the last one) which are hard to figure out without being a fashion expert.

To address this challenge, we can take several approaches:

Investigate the Domain. Explore customers’ websites and their competitors’ sites to gain insights into how attributes are described and presented. This can provide valuable information about common attributes used in the domain.
Research Customer Search Queries. Conduct research on customers’ search queries to understand what attributes or product features they search for the most. This helps in identifying important attributes from the customers’ perspective.
Collaborate with Experts. Engage with domain experts who have a deep understanding of the specific industry or product domain. Collaborating with these experts can provide valuable insights and ensure that attribute enrichment efforts align with industry standards and customer expectations.
Leverage External Sources of Information. Use tools like GPT as a dense source of information from the internet. This can help in acquiring a broader understanding of the domain and gathering relevant information for Attribute Enrichment.

By combining these approaches, it is possible to bridge the gap in domain understanding and improve the accuracy of attribute identification and annotation for better attribute enrichment outcomes.

Lack of annotated data

Another challenge in attribute enrichment is the limited availability of annotated data. Not all of our customers provide complete or accurate annotations, which makes it difficult to enrich attributes effectively (which explains why our customers wanted Attribute Enrichment in the first place!).

Sure, we could hire people to label images or texts, but it can be costly because the ecommerce field is vast and has many unique cases. Instead, it’s important to focus on showing the model the most difficult cases.

Active learning methods can help with this by selecting the samples that the model is least sure about. To estimate that uncertainty, we need predicted probabilities, which we can get either from pretrained models or from an initial model trained on a small amount of data we collect ourselves. Even if it’s not perfect, this can be a starting point to identify the most challenging samples.
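A common form of active learning is uncertainty sampling, where products are ranked by the entropy of the model’s predicted class probabilities and the most uncertain ones are sent for annotation. The sketch below assumes that strategy. The probabilities, SKU names, and the function name `select_for_annotation` are made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Pick the `budget` products whose predictions are most uncertain.

    predictions: dict mapping product ID -> list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda product_id: entropy(predictions[product_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for one attribute with three possible values.
predictions = {
    "sku-1": [0.98, 0.01, 0.01],  # confident prediction
    "sku-2": [0.40, 0.35, 0.25],  # very uncertain
    "sku-3": [0.70, 0.20, 0.10],  # somewhat uncertain
}

print(select_for_annotation(predictions, budget=2))
# ['sku-2', 'sku-3']
```

With a fixed labeling budget, annotators then spend their time on the products the model finds hardest rather than on ones it already handles well.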

Huge attribute taxonomy

As Constructor works with many customers in many different domains, attribute taxonomy grows very quickly. Every customer has from tens to thousands of unique product categories. Each unique category has its own set of attributes, and each attribute has a list of values associated with it.

The sheer size of this taxonomy makes it impractical to develop a dedicated model for each category and each customer need. Instead, we have to look for generic yet efficient techniques that can be transferred between domains and categories. This is where deep learning comes in.

Pretrained neural networks are able to interpret our data and extract low-level features from it. All we need to do is fine-tune them for specific attributes. We use this approach to enrich attributes for apparel and fashion. We also have a team of data annotators who label products with relevant attributes, and we later use these annotations to fine-tune backbone models.

System Design

Our Attribute Enrichment system begins with the product catalog that a customer sends us daily, which is a major piece of data for model training and validation. To obtain even more quality data we involve a data annotation team whose goal is to manually annotate products with relevant attributes and tags for further model training.

To train ML & DL models on a regular basis, we use data pipelines built on Spotify’s Luigi. Then enrichment pipelines use these models for inference. For example, we have enrichment pipelines to predict product category, colors, apparel attributes, text entity recognition, and others. Enrichment pipelines run on a daily basis, taking the most recent catalog and predicting new attributes for products. Thus we have fresh attributes even if a catalog has recently been updated.

The enrichment pipelines save the attributes into a MySQL database. Another microservice builds search indices for production use and delivers them right to the search backend.

Our search backend is largely microservices implemented in Rust, C++ and Python and designed for high-performance index data access. It takes each processed search query as an input and returns relevant products from a search index that matches the query.

The search backend reloads search indices when they’re updated. So when a new search index with enriched attributes is built, the search backend loads it immediately and serves search requests with this new index under the hood.

Results

Customer dashboard

A customer dashboard serves as a valuable tool for merchandisers, allowing them to review and manage enriched attributes that impact search and browse functionalities. This is necessary because no model can achieve 100% accuracy, and our domain knowledge can be limited.

The purpose of the customer dashboard is twofold:

Reviewing Enriched Attributes. Merchandisers should have the ability to review the attributes that have been enriched. This allows them to assess the accuracy and relevance of the attributes that influence the search and browse experience for customers. By reviewing these attributes, merchandisers can leverage their domain expertise to identify any inaccuracies or irrelevant values. We also make sure this work creates more global value: it feeds back into our Attribute Enrichment system, both helping us produce better results in the future and telling us about the attribute preferences of each individual company we work with.
Attribute Management. The dashboard should provide functionality for merchandisers to remove attribute values that they consider irrelevant. This action should have a direct and timely impact on search and browse functionalities, ensuring that customers are presented with the most relevant and accurate results. Modifications made here also help train the system.

To fulfill these requirements, we have developed a dedicated web application as a separate tool for merchandisers. The application presents a user-friendly table where merchandisers can access a comprehensive overview of items and their variations. Each item is associated with its unique attribute values, and merchandisers have the ability to remove or correct any irrelevant values as needed.

Once a review session is completed and customers have made edits in the dashboard, it is essential that the changes take effect on search and browse functionalities in a short period of time. To achieve this, we trigger an index building process. This process ensures that the updated attribute information is reflected accurately in the search index, allowing our customers to experience the impact of the attribute modifications promptly.
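The flow from dashboard edits to a rebuilt index can be sketched as two steps: apply the merchandiser’s removals and corrections to the enriched attributes, then rebuild a lookup from attribute values to products. Everything in this sketch is an assumption made for illustration, including the data shapes and the function names `apply_review_edits` and `build_attribute_index`.

```python
from collections import defaultdict


def apply_review_edits(enriched, removals, corrections):
    """Return a reviewed copy of the enriched attributes.

    enriched:    {product_id: {attribute: value}}
    removals:    set of (product_id, attribute) pairs to drop
    corrections: {(product_id, attribute): corrected_value}
    """
    reviewed = {}
    for product_id, attributes in enriched.items():
        kept = {}
        for attribute, value in attributes.items():
            if (product_id, attribute) in removals:
                continue  # merchandiser marked this value as irrelevant
            kept[attribute] = corrections.get((product_id, attribute), value)
        reviewed[product_id] = kept
    return reviewed


def build_attribute_index(products):
    """Invert the product attributes into {(attribute, value): [product_id, ...]}."""
    index = defaultdict(list)
    for product_id in sorted(products):
        for attribute, value in products[product_id].items():
            index[(attribute, value)].append(product_id)
    return dict(index)


# Hypothetical enriched attributes and one review session's edits.
enriched = {
    "sku-1": {"color": "red", "neckline": "boat neck"},
    "sku-2": {"color": "blue", "neckline": "v-neck"},
}
removals = {("sku-1", "neckline")}
corrections = {("sku-2", "color"): "navy"}

reviewed = apply_review_edits(enriched, removals, corrections)
index = build_attribute_index(reviewed)
print(index)
```

Rebuilding the index from the reviewed data, rather than patching the old index in place, keeps the reviewed attributes as the single source of truth for search and browse.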

Enrichment examples

Conclusion

Constructor’s Attribute Enrichment is a relatively new product offering for us, but with the early results we’ve seen, we believe that there is a real opportunity to help customers further augment their product data by leveraging powerful deep learning techniques. For too long, bad product data was an excuse search engines gave for bad results. We want to remove that excuse and ensure attractive, revenue-optimizing results, even for product catalogs with poor data.

In the next installment in this series, we’ll cover in detail some machine learning and deep learning techniques we use to enrich products with new attributes.

We’re constantly improving our enrichment models, covering new domains and attributes to address customer needs. Our goal is to make sure end users find what they are looking for, are satisfied with the whole experience, and end their sessions by easily finding and purchasing products they’ll love.

Ready to learn more about Attribute Enrichment? See our product page here.