eCommerce catalogs are created by sourcing data from sellers (3P) and suppliers/brands (1P). The data provided by these partners is often incomplete, sometimes missing crucial bits of information that our customers are looking for. Even though partners adhere to a spec (an agreed format for sending product data), a vast amount of data remains buried in titles, descriptions and images. Beyond the data provided by our partners, there is also a lot of unstructured data on the internet in the form of product manuals, product reviews, blogs, social media sites, etc.
At Walmart we are working on building a Retail Graph that captures knowledge about products and their related entities to help our customers better discover products in our catalog. It’s a product knowledge graph that can answer questions about products and related knowledge in the retail context. Such a system can be used to power semantic search, recommendation systems, etc. This article expounds on what the Retail Graph is, how we built it, our technology choices around the graph model and database, and some use cases.
What Is Walmart’s Retail Graph
Retail Graph captures the connections between products and entities that exist in the retail world. Entities are objects, things, concepts or abstractions, e.g. living room, wildlife photography, bright colors, back to school, farmhouse style. We focus on two broad kinds of entities — abstract (subjective, like kid-friendly) and concrete (attributes, like red color). The former helps us answer queries like “summer pool party supplies”, “farmhouse living room furniture” and “lenses for wildlife photography”, while the latter helps answer queries like “blue jean pants” and “wooden dining table”. The graph also captures relationships between products in two large buckets — substitutes and complements (accessories, compatible products, etc.). It also attempts to map abstract concepts like bright colors to concrete product attributes.
Having worked on the Walmart catalog, we knew some of the challenges in building such a system. The biggest was the lack of a single authoritative source of truth for product data. Our catalog also contained erroneous data from our partners. So we started off by:
- building a bipartite graph with products on one side and related entities on the other
- leveraging our existing taxonomy and enriching the entities as we discovered new ones.
- connecting products (SKUs) with entities.
Building a Retail Graph
At a high level, our main focus was on the following relationships to build our Retail Graph:
1. Product <-> Entities
2. Product <-> Product (broadly classified as substitutes & complements)
1. Product <-> Entities
To build the product-to-entity graph, we started off by extracting entities from product content and then linking them to abstract or concrete concepts to form triples. We added a layer of governance that allows humans to validate triples below a certain confidence level, keeping the quality bar high.
a. Entity Extraction
The goal of the entity extraction module is to extract “entities” from product titles and descriptions. Product description content comes in all kinds of flavors: sometimes it is verbose, while at other times it is short phrases in bullet points. Keeping this in mind, we developed two algorithms for extracting entities from product content:
i. NLP based model
We started off by extracting entities from the product title, description and other metadata. This was done by building a linguistic model that leverages the POS tagger provided by Stanford CoreNLP. This model suited our use case well because product titles and descriptions are typically bullet points featuring product highlights rather than well-constructed sentences. Below is an example of the output from our NLP-based model.
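To make the idea concrete, here is a minimal sketch of POS-based entity extraction. The tiny `POS` dictionary below is a hypothetical stand-in for a real tagger such as Stanford CoreNLP; the chunking rule (collect maximal runs of adjective/noun tags) is illustrative, not our production logic.

```python
# Hypothetical POS tags standing in for a real tagger's output
# (JJ = adjective, NN/NNS = noun; everything else breaks a phrase).
POS = {
    "mid-century": "JJ", "modern": "JJ", "velvet": "NN", "sofa": "NN",
    "with": "IN", "tufted": "JJ", "cushions": "NNS", "and": "CC",
    "walnut": "NN", "legs": "NNS",
}

def extract_entities(text):
    """Collect maximal runs of adjectives/nouns as candidate entities."""
    entities, current = [], []
    for token in text.lower().split():
        if POS.get(token, "OTHER") in ("JJ", "NN", "NNS"):
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

title = "Mid-century modern velvet sofa with tufted cushions and walnut legs"
print(extract_entities(title))
# → ['mid-century modern velvet sofa', 'tufted cushions', 'walnut legs']
```

A real pipeline would run the tagger over each bullet point independently, since short noun phrases tag more reliably than full sentences.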
ii. Heuristic model
The other approach we took, which yielded good results, was to use rules to parse descriptions. Sellers and suppliers use certain formats (HTML tags) to highlight the key features of a product. We built rules around how to parse and extract key information by applying a set of heuristics to them. Here is a sample product description and its output:
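As a sketch of what such a heuristic might look like, the snippet below pulls bullet points out of an HTML description using only the standard library; the actual production rules are richer, but the shape is the same.

```python
from html.parser import HTMLParser

class BulletExtractor(HTMLParser):
    """Collect the text of <li> bullets, which partners often use
    to highlight key product features."""
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.bullets = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.bullets.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.bullets[-1] += data

description = "<ul><li>Solid walnut legs</li><li>Stain-resistant fabric</li></ul>"
parser = BulletExtractor()
parser.feed(description)
print(parser.bullets)  # → ['Solid walnut legs', 'Stain-resistant fabric']
```

The extracted bullets then feed into the same downstream entity extraction and linking steps.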
In production we use both of the models described above. This gives us a good balance between the heuristic model, which is very accurate, and the NLP model, which gives us good coverage.
b. Entity Linking
Once the entities are extracted, we need to identify what they represent and their relationship back to the product (SKU). For example, with an entity like “mid-century sofa” we must identify what mid-century stands for in the context of the sofa. This is achieved by a process called entity linking, where we attempt to find the relationship between the extracted entity and its SKU. The entity linking module has another important function: disambiguation given a context. For example, “cherry” could mean a scent in the context of a candle, a flavor in the context of a juice, a finish in the context of furniture, a color in the context of clothing, or a fruit in the context of grocery. The context referred to here is typically a product category or product type.
The linker takes the context (product type) and entity as input and produces a triple (subject-predicate-object). As there is no single accurate source of truth for product data, the task of linking entities is hard. We started off by creating a dictionary of (product type, attribute name, attribute value) triples from a set of top-selling SKUs (the assumption being that top-selling SKUs have more accurate data). The first step is to use this dictionary to identify possible candidates agnostic of the context. A second model then ranks them using the context.
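The two-step candidate-then-rank flow can be sketched as follows. The dictionary contents and the tie-breaking rule are hypothetical; in reality the second step is a learned ranking model, not a sort.

```python
# Hypothetical dictionary mined from top-selling SKUs:
# product type -> attribute name -> known attribute values.
ATTRIBUTE_DICT = {
    "candle":    {"scent": {"cherry", "vanilla"}},
    "juice":     {"flavor": {"cherry", "apple"}},
    "furniture": {"finish": {"cherry", "walnut"}},
}

def link_entity(entity, product_type):
    """Return (subject, predicate, object) triples for an extracted entity."""
    # Step 1: context-agnostic candidates — every attribute the value maps to.
    candidates = [
        (ptype, attr)
        for ptype, attrs in ATTRIBUTE_DICT.items()
        for attr, values in attrs.items()
        if entity in values
    ]
    # Step 2: rank by context — here, simply prefer candidates whose
    # product type matches the query context (a stand-in for the ranker).
    ranked = sorted(candidates, key=lambda c: c[0] != product_type)
    return [(product_type, attr, entity) for _, attr in ranked[:1]]

print(link_entity("cherry", "candle"))  # → [('candle', 'scent', 'cherry')]
print(link_entity("cherry", "juice"))   # → [('juice', 'flavor', 'cherry')]
```

The same surface form (“cherry”) resolves to different predicates depending on the product type, which is exactly the disambiguation problem described above.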
For the entities extracted above, the linker output is shown below:
c. Entity Governance
As part of entity extraction, a good amount of “noise” gets extracted as well. We used the existing product metadata to construct a dictionary that serves as a reference for classifying an extracted entity as noise or as an “unknown” concept. We then added a governance module that weeds out noise using a combination of heuristics and manual tagging. This ensures that the data entering the knowledge graph is always clean and reliable.
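A minimal sketch of this gate is below. The reference dictionary, the confidence threshold and the routing rule are all illustrative assumptions; the point is that only known or high-confidence entities enter the graph, and everything else goes to human review.

```python
# Hypothetical reference dictionary built from existing product metadata.
KNOWN_ENTITIES = {"walnut legs", "tufted cushions", "mid-century"}
CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff

def govern(extracted):
    """Split (entity, confidence) pairs into graph-ready vs. manual review."""
    accepted, review_queue = [], []
    for entity, confidence in extracted:
        if entity in KNOWN_ENTITIES:
            accepted.append(entity)        # known concept: admit directly
        elif confidence >= CONFIDENCE_THRESHOLD:
            accepted.append(entity)        # unknown but high-confidence
        else:
            review_queue.append(entity)    # likely noise: human validation
    return accepted, review_queue

accepted, review = govern([("walnut legs", 0.95), ("lorem ipsum", 0.2),
                           ("boho chic", 0.9)])
print(accepted)  # → ['walnut legs', 'boho chic']
print(review)    # → ['lorem ipsum']
```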
2. Product <-> Product
To identify substitutes for a given SKU we leverage both text and image data. In certain product categories, like furniture and apparel, visual similarity plays an important role in identifying substitutes. We built image and text embeddings for our SKUs and pushed them into a FAISS index (Faiss is a library for efficient similarity search and clustering of dense vectors, developed by Facebook). For each SKU we generate its k-nearest neighbors (KNN) from both the text and image embeddings to arrive at candidate sets. We then apply category-specific ranking logic to arrive at the final set. For example, in the furniture category, “home decor style” (mid-century/coastal/farmhouse) plays a critical role in determining substitutability, so it biases the ranking logic.
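Conceptually, the candidate-generation step is a nearest-neighbor search over embeddings. The toy sketch below does this by brute-force cosine similarity with made-up 3-dimensional vectors; FAISS performs the same lookup at catalog scale with approximate indexes over high-dimensional embeddings.

```python
import math

# Hypothetical (tiny) embeddings; real ones come from text/image models.
EMBEDDINGS = {
    "sofa-a": [0.9, 0.1, 0.0],
    "sofa-b": [0.85, 0.15, 0.05],
    "lamp-c": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn(sku, k=2):
    """Candidate substitutes: the k most similar SKUs by embedding."""
    query = EMBEDDINGS[sku]
    scored = [(other, cosine(query, vec))
              for other, vec in EMBEDDINGS.items() if other != sku]
    return [s for s, _ in sorted(scored, key=lambda x: -x[1])[:k]]

print(knn("sofa-a"))  # → ['sofa-b', 'lamp-c']
```

The union of the text-embedding and image-embedding neighbor lists forms the candidate set, which the category-specific ranker then reorders.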
When we embarked on this journey of building the Retail Graph, we weren’t quite sure what the final state of the system would look like. All we knew was that we needed components to extract entities, link them and then store them. Given the size of our catalog, we knew that each of these must scale to hundreds of millions of products. There was also a need to rapidly experiment, build POCs and iterate quickly on them to get feedback. We decided to adopt Evolutionary Architecture principles for building our system.
An evolutionary architecture supports incremental, guided change as a first principle across multiple dimensions.
The entity extraction and linking were built as simple libraries, which were then exposed as REST APIs for other systems to integrate with. We also built Hive UDFs on top of the entity extraction and entity linking libraries to run them at scale on our on-prem Hadoop cluster.
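The library-first design means the same function can sit behind multiple surfaces. As a sketch, here is how an extraction library call might be wrapped in a bare-bones REST endpoint using only the standard library; the `extract_entities` body is a hypothetical placeholder, and a production service would of course use a proper web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the entity-extraction library call.
def extract_entities(text):
    return [t for t in text.lower().split() if t not in {"with", "and"}]

class ExtractHandler(BaseHTTPRequestHandler):
    """POST {"text": ...} -> {"entities": [...]}"""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        text = json.loads(self.rfile.read(length))["text"]
        body = json.dumps({"entities": extract_entities(text)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

# To serve: HTTPServer(("0.0.0.0", 8080), ExtractHandler).serve_forever()
```

The same library function is what a Hive UDF would invoke per row for batch runs on the Hadoop cluster.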
Data Processing Pipelines
We have two pipelines — one for generating the product <-> entity relationships and another for product <-> product. They run periodically on the on-prem Hadoop cluster managed by our data platform team. Once a run is complete, we use the bulk ingestion APIs to load the results into Cosmos DB on Azure. Below is a very high-level overview of the data processing pipelines:
Graph Data Model & Graph DB
We evaluated both LPG (Labelled Property Graph) and RDF (Resource Description Framework) graph data models for our read and write use cases before converging on LPG. Here is a nice read on the comparison between the two. After a few in-house experiments with graph databases, we narrowed down on Azure Cosmos DB (graph model). We worked closely with the Azure team to add Java support for bulk ingestion of data. For graph traversals we use Gremlin.
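To give a flavor of what a traversal over such a graph looks like, here is an illustrative Gremlin query; the vertex labels, edge labels and SKU value are hypothetical examples, not our actual schema.

```groovy
// Hypothetical traversal: find candidate substitutes that share the
// same home decor style entity as a given product.
g.V().has('product', 'sku', 'SKU123').
  out('hasStyle').          // e.g. the "mid-century" entity
  in('hasStyle').           // other products connected to that style
  hasLabel('product').
  dedup().
  limit(10)
```

In an LPG model like this, both products and entities are vertices, and the predicates produced by the linker become labelled edges between them.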
Applications within Walmart
Building a product knowledge graph at the scale of the Walmart catalog takes a considerable amount of time. We took a one-category-at-a-time approach: build, learn, and then scale out to other categories. We kicked off this effort by focusing on the Home & Garden category. Working with our item recommendation team, we ran an A/B test on the Walmart product page that used the product <-> product relationships.
Our eCommerce semantic search team is working closely with us to build a new query understanding system leveraging the relationships in the Retail Graph. We are currently running interleaving and A/B tests to gather customer feedback on our new semantic search implementation.
It’s very hard to go into detail on all aspects of the Retail Graph in a single post, but I hope this provides a decent overview. We have barely scratched the surface of this problem and there is still a long way to go. An initiative like this requires rapid iteration, lots of experimentation and a willingness to learn from mistakes before figuring out the right approach. I’m fortunate to have a great team of engineers and data scientists to work with on this fun project!