A Journey into Knowledge Graphs at Instacart

Michelle Yi (Yulle) · Published in knowledge-bytes · 5 min read · Apr 21, 2022


[Download the presentation as a PDF here]

Key Knowledge Graph Applications:

  • Accelerate machine learning by serving as a more consistent feature store with high quality data
  • Enable semantic search, which improves both external customer experiences (finding the items they need / personalization) and internal customer experiences (finding the data they need in the format they need it)

Online Shopping

When the pandemic started, many of us desperately turned to online ordering through grocery delivery services like Instacart. With quarantine procedures, the increase in COVID-19 variants, and ongoing cases, these services are now almost essential to our daily lives.

With this unprecedented demand, companies like Instacart had to quickly adapt; however, this speed to scale has come with its own challenges.

For example, have you ever ordered four individual apples, but instead received four bags? Or searched for your favorite vegan items, but couldn’t find them, even though you knew they were in stock?

To solve these problems, Instacart turned to knowledge graphs to do two things that would improve the customer experience:

  1. Standardize machine learning training data
  2. Improve search functionality

But inevitably, as with any large-scale knowledge graph, there were data challenges to overcome to unlock these capabilities.

Detecting Noise in Knowledge Graphs

The Opportunity and Challenge

Example from Slide 5 on the target state knowledge graph

At the time, the Instacart knowledge graph was relatively new, with about 70 million facts related to groceries and their attributes. While the team recognized the advantages this knowledge graph could unlock in standardizing machine learning training data and improving search, they encountered data quality challenges common to large-scale knowledge graphs.

In particular, large-scale knowledge graphs involve many extract-transform-load (ETL) data engineering pipelines and automated processes to collect and clean source data that could be coming from Wikipedia, the public web, or the many store catalogs that Instacart relies on. Data quality issues are bound to happen, and incorrect data can end up populating the knowledge graph.

Example from Slide 11 on inconsistent product size data

For instance, in the Instacart example, similar granola bar products at various stores can come in four different box sizes, with quantity reported in multiple ways (e.g., bars, grams). Creators of the knowledge graph, in this scenario, are at the mercy of the store catalogs, and there is not necessarily one right or wrong answer. In fact, all accurate information related to a product is valuable given the right context.

Different personas need the unit of measure expressed in different ways, for different purposes:

  • A customer wants to know what the cheapest value per bar is.
  • A machine learning engineer building search algorithms wants to know what the bulk size is per item for packaging.
  • A supply chain expert wants to know what the dimensions are per pallet for warehouse optimization.
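One way to serve all of these personas from the same facts is to keep every accurate representation of a product's quantity and derive each view on demand. The sketch below is illustrative only; the fields and numbers are hypothetical, not Instacart's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record keeping several accurate facets of one product's
# quantity, since each persona needs a different view of the same item.
@dataclass
class ProductQuantity:
    bars_per_box: int
    grams_per_box: float
    price_per_box: float

    def price_per_bar(self) -> float:
        # Customer view: cheapest value per bar.
        return self.price_per_box / self.bars_per_box

    def grams_per_bar(self) -> float:
        # ML engineer view: per-item bulk size.
        return self.grams_per_box / self.bars_per_box

granola = ProductQuantity(bars_per_box=10, grams_per_box=250.0, price_per_box=5.0)
print(granola.price_per_bar())  # 0.5
print(granola.grams_per_bar())  # 25.0
```

Storing all facets rather than normalizing to a single unit preserves the context each consumer of the graph needs.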

To maintain the data integrity of diverse attributes of grocery products and serve a variety of personas, the Instacart team tried starting with a series of simple and explainable tests that take advantage of the inherent qualities of knowledge graphs.

The Solution

Some of the solutions implemented that take advantage of knowledge graph features include:

1. Leverage the semantic meaning of strings — The strings in this case can include a brand, product type, product name, and other attributes, such as “Raspberry European Biscuits”. Each of these elements carries a specific meaning and relation to the others that can then be mapped in vector space. This means that algorithms such as k-nearest neighbors (k-NN) can be applied to determine the proximity of related taxonomies, as shown in the image below:

Example from Slide 19 on approximating a taxonomy
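As a rough sketch of this idea, product strings can be mapped into vector space and compared with k-NN; items whose nearest neighbors sit far away are candidates for taxonomy errors. The product names below are hypothetical, and a character n-gram TF-IDF vectorizer stands in for whatever embedding model Instacart actually uses:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical product strings; real embeddings would come from a trained model.
products = [
    "Raspberry European Biscuits",
    "Strawberry European Biscuits",
    "Blueberry Biscuits",
    "Organic Granola Bars",
    "Chocolate Chip Granola Bars",
    "Laundry Detergent Pods",
]

# Map strings into vector space using character n-grams as a stand-in embedding.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
vectors = vectorizer.fit_transform(products)

# k-NN finds the closest product for each item; a large distance to the
# nearest neighbor hints at a taxonomy placement error.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
distances, indices = knn.kneighbors(vectors)

for i, product in enumerate(products):
    neighbor = products[indices[i][1]]  # indices[i][0] is the item itself
    print(f"{product!r} -> nearest: {neighbor!r} (distance {distances[i][1]:.2f})")
```

Here “Raspberry European Biscuits” lands closest to “Strawberry European Biscuits”, while an unrelated item like the detergent sits far from everything else in the grocery cluster.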

2. Account for business-relevant logic — By this, we mean logic that can be derived from domain knowledge, historical trends such as seasonality, external information, or other sources. Some might call these “integrity constraints”, and these can vary in complexity. At the simplest level, an example might be that we know that one gram of fat has nine calories, so any values that do not match this fact will be flagged as erroneous.
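The nine-calories-per-gram-of-fat rule above can be expressed as a minimal integrity-constraint check. The function name, tolerance, and catalog records below are hypothetical:

```python
def check_fat_calories(fat_grams, calories_from_fat, tolerance=0.1):
    """Flag records whose calories-from-fat deviate from the 9 kcal/g rule."""
    expected = fat_grams * 9
    if expected == 0:
        return calories_from_fat == 0
    # Allow a small relative tolerance for rounding on nutrition labels.
    return abs(calories_from_fat - expected) / expected <= tolerance

# Hypothetical catalog records: (product, fat grams, calories from fat)
records = [
    ("Granola Bar A", 6, 54),   # 6 g * 9 = 54 kcal: consistent
    ("Granola Bar B", 6, 120),  # far from 54 kcal: flagged as erroneous
]

flagged = [name for name, fat, cal in records if not check_fat_calories(fat, cal)]
print(flagged)  # ['Granola Bar B']
```

More complex constraints (seasonality trends, cross-attribute rules) would follow the same pattern: encode the domain knowledge once, then flag any fact that violates it.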

3. Weave metadata and disparate sources together for fact-checking — One powerful attribute of a knowledge graph is the ability to tie not just many pieces of information together, but also to interpret the metadata, or context, around the data. In this scenario, Instacart pulled information from store catalogs, the public web, etc., but also looked at the metadata involved to see whether there was consistency across sources that could validate the quality of the data. For instance, we could see when product information is updated for biscuits across multiple sources to determine its validity.
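A toy version of this cross-source validation might accept a fact only when enough recently-updated sources agree on it. Everything below is an assumption for illustration: the source names, the size values, the freshness window, and the agreement threshold are all hypothetical:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical observations of one product's size from several sources,
# each carrying the metadata of when it was last updated.
observations = [
    {"source": "store_catalog_a", "size": "10 bars", "updated": date(2022, 3, 1)},
    {"source": "store_catalog_b", "size": "10 bars", "updated": date(2022, 2, 15)},
    {"source": "public_web",      "size": "8 bars",  "updated": date(2020, 6, 1)},
]

def validate_fact(observations, min_agreement=2, max_age=timedelta(days=365)):
    """Accept a value only when enough recently-updated sources agree on it."""
    newest = max(obs["updated"] for obs in observations)
    # Use the update-time metadata: discard observations that are stale
    # relative to the freshest source.
    fresh = [obs for obs in observations if newest - obs["updated"] <= max_age]
    counts = Counter(obs["size"] for obs in fresh)
    value, support = counts.most_common(1)[0]
    return (value, support) if support >= min_agreement else (None, support)

value, support = validate_fact(observations)
print(value, support)  # 10 bars 2
```

The stale public-web observation is filtered out by its metadata, and the two fresh catalog sources agree, so “10 bars” is accepted as the validated fact.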

Conclusion

This use case comes from what are still the early days of the knowledge graph journey at Instacart. However, even in its current state, leveraging the knowledge graph to comprehensively understand products serves as a powerful foundation that has unlocked new capabilities and benefits:

  1. Marketing or business people can leverage the knowledge graph to quickly find the most reliable data to push a holiday sale or campaign and drive additional revenue.
  2. More powerful algorithms can be developed at a faster rate for areas like search, recommendation, and personalization.
  3. Data scientists and machine learning engineers have consistent and reliable training data at scale across the organization, without having to hunt down 50 databases or re-engineer features.
  4. Leaders can save on operating costs and time spent querying unreliable information.

If you’re interested in learning more about the technical details, check out these references from the speaker:

About the Speaker

Thomas Grubb is a 5th Year Ph.D. Student at UC San Diego with a background in math and economics. Tom’s work on knowledge graphs started last summer with an internship at Instacart and will continue this summer with an internship at Coupang, where he will focus on applying knowledge graphs to search and query understanding. He hopes to continue working in this area after graduation.

Talk Summary: Building large knowledge graphs often relies on automated “extract, transform, load” techniques, which can allow noise from source data to be incorporated into the resulting graph. This talk surveys techniques for detecting unreliable facts in a knowledge graph at scale, with the goal of preventing this noise from corrupting downstream applications of the graph.

About Us

Graph Thinking is a community whose mission is to:

  1. Raise community awareness about business and industry knowledge graph use cases
  2. Create interactions and connections that inspire knowledge graph applications

Join the meetup group, run by Diffbot and RelationalAI, to see when our next event is and subscribe to Knowledge Bytes to receive summaries and write-ups!


Michelle Yi (Yulle)

Technology leader who specializes in AI and machine learning. She is passionate about diversity in STEAM and innovating for a better future.