Recreating Amazon’s New Generative AI Feature: Product Review Summaries

How to generate summaries from data in your Weaviate vector database with an OpenAI LLM in Python using a concept called “Generative Feedback Loops”

Nov 21, 2023

Customer reviews are one of the most important features on Amazon, the world’s largest online retailer. Shoppers like to learn what others who spent their own money on a product thought about it before deciding whether to buy it. Since introducing customer reviews in 1995 [1], Amazon has made various improvements to the feature.

On August 14th, 2023, Amazon introduced its latest improvement to customer reviews using Generative AI. Amazon offers millions of products, with some accumulating thousands of reviews. In 2022, 125 million customers contributed nearly 1.5 billion reviews and ratings [1]. The new feature summarizes the customer sentiment from thousands of verified customer reviews into a short paragraph.

The feature can be found at the top of the review section under “Customers say” and comes with a disclaimer that the paragraph is “AI-generated from the text of customer reviews”. Additionally, the feature comes with AI-generated product attributes mentioned across reviews, enabling customers to filter for specific reviews mentioning these attributes.

Amazon review summary of “Aokeo Professional Microphone Pop Filter Mask Shield” (Screenshot by my colleague Jonathan with permission to use)

Currently, this new feature is being tested and is only available to a subset of mobile shoppers in the U.S. across a selection of products. The release of the new feature has already sparked some discussion around the reliability, accuracy, and bias of this type of AI-generated information.

Since summarizing customer reviews is one of the more obvious use cases of Generative AI, other companies such as Newegg or Microsoft have also already released similar features. Although Amazon has not released any details on the technical implementation of this new feature, this article will discuss how you can recreate it for your purposes and implement a simple example.

Implementation

To recreate the review summary feature, you can follow a concept called generative feedback loops: you retrieve information from a database, use it to prompt a generative model, and store the newly generated data back in the database.
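
The loop can be sketched without any database or model at all. The following is a minimal illustration under stated assumptions: the dictionary stands in for the database, and `fake_model` is a hypothetical placeholder for the LLM call, not part of Weaviate or OpenAI.

```python
from typing import Callable, Dict, List


def feedback_loop(
    reviews_by_product: Dict[str, List[str]],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Retrieve reviews per product, prompt a model, and store the result."""
    summaries: Dict[str, str] = {}  # hypothetical stand-in for the database
    for product_id, reviews in reviews_by_product.items():
        # 1. Retrieve: `reviews` holds the stored reviews for this product
        # 2. Generate: build a prompt from them and call the model
        prompt = "Summarize these customer reviews:\n" + "\n".join(reviews)
        # 3. Store: write the generated text back next to the source data
        summaries[product_id] = generate(prompt)
    return summaries


def fake_model(prompt: str) -> str:
    # Placeholder for an LLM call; it only reports how many reviews it received
    return f"summary of {len(prompt.splitlines()) - 1} reviews"


stored = feedback_loop({"A1": ["Great sound.", "Clamp is weak."]}, fake_model)
print(stored)  # {'A1': 'summary of 2 reviews'}
```

The rest of this article replaces the dictionary with Weaviate and the placeholder with an OpenAI model.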

Prerequisites

You will need a database to store the data and a generative model. For the database, we will use a Weaviate vector database, which integrates with many different generative modules (e.g., OpenAI, Cohere, Hugging Face).

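# The code in this article uses the v3 Python client API (weaviate.Client), hence the pin below v4
!pip install --upgrade "weaviate-client<4"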

For the generative model, we will use OpenAI’s gpt-3.5-turbo, for which you will need to set your OPENAI_API_KEY environment variable. To obtain an API key, you need an OpenAI account; then select “Create new secret key” under API keys. Since OpenAI’s generative models are directly integrated with Weaviate, you don’t need to install any additional package.
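
Before connecting, it can help to fail early with a readable message if the key is missing. This is a small sketch; the helper name `get_api_key` is hypothetical and not part of any library.

```python
import os


def get_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key, or raise a readable error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before connecting.")
    return key
```

Calling `get_api_key()` at the top of your notebook would surface a missing key before any request is sent.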

Dataset Overview

For this small example, we will use the Amazon Musical Instruments Reviews dataset (License: CC0: Public Domain) with 10,254 reviews across 900 products on Amazon in the musical instruments category.

import pandas as pd 

df = pd.read_csv("/kaggle/input/amazon-music-reviews/Musical_instruments_reviews.csv",
                usecols = ['reviewerID', 'asin', 'reviewText', 'overall', 'summary', 'reviewTime'])

df = df[df.reviewText.notna()]

Amazon Musical Instruments Reviews dataset preview

Setup

As a first step, you will need to set up your database. You can use Weaviate’s Embedded option for playing around, which doesn’t require any registration or API key setup.

import os

import weaviate
from weaviate.embedded import EmbeddedOptions

# Start an embedded Weaviate instance and pass the OpenAI key for the OpenAI modules
client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]
    }
)

Next, we will define the schema to populate the database with the data (review_text, product_id, and reviewer_id). Note that we’re skipping the vectorization of product_id and reviewer_id with "skip": True to keep inference costs down. You can adjust the vectorization settings if you want to expand this feature and enable semantic search across reviews.

if client.schema.exists("Reviews"):
    client.schema.delete_class("Reviews")

class_obj = {
    "class": "Reviews", # Class definition
    "properties": [     # Property definitions
        {
            "name": "review_text",
            "dataType": ["text"],
        },
        {
            "name": "product_id",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": { 
                    "skip": True, # skip vectorization for this property
                    "vectorizePropertyName": False
                }
            }
        },
        {
            "name": "reviewer_id",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": { 
                    "skip": True, # skip vectorization for this property
                    "vectorizePropertyName": False
                }
            }
        },
    ],
    "vectorizer": "text2vec-openai", # Specify a vectorizer
    "moduleConfig": { # Module settings
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        },
        "generative-openai": {
          "model": "gpt-3.5-turbo"
        }
        
    },
}

client.schema.create_class(class_obj)
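
The per-property `moduleConfig` in the schema above repeats itself, so it could be factored into a small helper. This is only a sketch: `text_property` is a hypothetical name, and the dictionaries it returns mirror the property definitions used above.

```python
def text_property(name, vectorize=True, vectorizer="text2vec-openai"):
    """Build a text property definition, optionally excluded from vectorization."""
    prop = {"name": name, "dataType": ["text"]}
    if not vectorize:
        # Same settings as in the schema above: skip this property entirely
        prop["moduleConfig"] = {
            vectorizer: {"skip": True, "vectorizePropertyName": False}
        }
    return prop


properties = [
    text_property("review_text"),
    text_property("product_id", vectorize=False),
    text_property("reviewer_id", vectorize=False),
]
```

The resulting list could be used as the "properties" value of `class_obj`.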

Now, you can populate the database in batches.

from weaviate.util import generate_uuid5

# Configure batch
client.batch.configure(batch_size=100) 

# Initialize batch process
with client.batch as batch:
    for _, row in df.iterrows():
        review_item = {
            "review_text": row.reviewText,
            "product_id": row.asin,
            "reviewer_id": row.reviewerID,
        }

        batch.add_data_object(
            class_name="Reviews",
            data_object=review_item,
            uuid=generate_uuid5(review_item)
        )
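
The import derives each object's UUID from its content via `generate_uuid5`, so the same review always maps to the same ID and re-running the import should not create duplicates. The idea can be illustrated with the standard library alone. Note that this is an analogous sketch, not the Weaviate helper's exact implementation, and the reviewer IDs are made up.

```python
import json
import uuid


def deterministic_id(data_object: dict) -> str:
    # Serialize with sorted keys so that key order does not change the ID
    payload = json.dumps(data_object, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, payload))


a = deterministic_id({"product_id": "1384719342", "reviewer_id": "R1"})
b = deterministic_id({"reviewer_id": "R1", "product_id": "1384719342"})
c = deterministic_id({"product_id": "1384719342", "reviewer_id": "R2"})
print(a == b, a == c)  # True False
```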

Generate new data object (summary)

Now, you can start generating the review summary for every product. Under the hood, you are performing retrieval-augmented generation.

First, prepare a prompt template that can take in review texts as follows:

generate_prompt = """
Summarize these customer reviews into a one-paragraph long overall review: 
{review_text}
"""

Then, build a generative search query that follows these steps:

Retrieve all reviews (client.query.get('Reviews')) for a given product (.with_where())
Stuff the retrieved review texts into the prompt template and feed it to the generative model (.with_generate(grouped_task=generate_prompt))

summary = client.query\
                .get('Reviews', 
                     ['review_text', "product_id"])\
                .with_where({
                    "path": ["product_id"],
                    "operator": "Equal",
                    "valueText": product_id
                })\
                .with_generate(grouped_task=generate_prompt)\
                .do()["data"]["Get"]["Reviews"]
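
Since the same filter is needed for every product, it could be built by a small function. The helper name `product_filter` is hypothetical; the dictionary it returns is the one passed to `.with_where()` above.

```python
def product_filter(product_id: str) -> dict:
    """Build the where filter that selects all reviews of one product."""
    return {
        "path": ["product_id"],
        "operator": "Equal",
        "valueText": product_id,
    }


print(product_filter("1384719342"))
```

The query above could then use `.with_where(product_filter(product_id))`.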

Once a review summary is generated, store it together with the product ID in a new data collection called Products.

new_review_summary = {
    "product_id": product_id,
    "summary": summary[0]["_additional"]["generate"]["groupedResult"]
}
    
# Create new object
client.data_object.create(
  data_object = new_review_summary,
  class_name = "Products",
  uuid = generate_uuid5(new_review_summary)
)
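
Indexing straight into the response assumes that the query returned at least one review and that generation succeeded. A more defensive variant might look like the sketch below. It takes the raw dictionary returned by `.do()`. The helper name is hypothetical, and the top-level "errors" key and the "error" field under "generate" are assumptions about the response shape that you should check against your Weaviate version.

```python
from typing import Optional


def extract_summary(response: dict, class_name: str = "Reviews") -> Optional[str]:
    """Return the grouped result from a raw query response, or None."""
    if response.get("errors"):
        return None
    get_section = (response.get("data") or {}).get("Get") or {}
    objects = get_section.get(class_name) or []
    if not objects:
        return None  # no reviews matched the filter
    generate = (objects[0].get("_additional") or {}).get("generate") or {}
    if generate.get("error"):
        return None  # the generative module reported a problem
    return generate.get("groupedResult")
```

With this helper, a loop over products could skip a product when the result is None instead of failing with an IndexError on an empty result list.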

If you want to take this a step further, you could also add a cross-reference between the review summary in the Products class and the corresponding reviews in the Reviews class (see Generative Feedback Loops for more details).

Now, repeat the above steps for all available products:

generate_prompt = """
Summarize these customer reviews into a one-paragraph long overall review: 
{review_text}
"""

for product_id in list(df.asin.unique()):
    # Generate summary
    summary = client.query\
                .get('Reviews', 
                     ['review_text', "product_id"])\
                .with_where({
                    "path": ["product_id"],
                    "operator": "Equal",
                    "valueText": product_id
                })\
                .with_generate(grouped_task=generate_prompt)\
                .do()["data"]["Get"]["Reviews"]
    
    new_review_summary = {
        "product_id" : product_id,
        "summary": summary[0]["_additional"]["generate"]["groupedResult"]
    }
    
    # Create new object
    client.data_object.create(
      data_object = new_review_summary,
      class_name = "Products",
      uuid = generate_uuid5(new_review_summary)
    )

For the product with asin = 1384719342, you have the following five reviews:

reviews = client.query\
                .get('Reviews', ['review_text', "product_id"])\
                .with_where({
                    "path": ["product_id"],
                    "operator": "Equal",
                    "valueText": "1384719342"
                })\
                .do()

 {
    "product_id": "1384719342",
    "review_text": "Not much to write about here, but it does exactly what it's supposed to. 
    filters out the pop sounds. now my recordings are much more crisp. 
    it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,"
  },
  {
    "product_id": "1384719342",
    "review_text": "The product does exactly as it should and is quite affordable.
    I did not realized it was double screened until it arrived, so it was even better than I had expected.
    As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :D
    If you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!
    Buy this product! :]"
  },
  {
    "product_id": "1384719342",
    "review_text": "The primary job of this device is to block the breath that would otherwise produce a popping sound, while allowing your voice to pass through with no noticeable reduction of volume or high frequencies. 
     The double cloth filter blocks the pops and lets the voice through with no coloration. 
     The metal clamp mount attaches to the mike stand secure enough to keep it attached. 
     The goose neck needs a little coaxing to stay where you put it."
  },
  {
    "product_id": "1384719342",
    "review_text": "Nice windscreen protects my MXL mic and prevents pops. 
     Only thing is that the gooseneck is only marginally able to hold the screen in position and requires careful positioning of the clamp to avoid sagging."
  },
  {
    "product_id": "1384719342",
    "review_text": "This pop filter is great. 
     It looks and performs like a studio filter. 
     If you're recording vocals this will eliminate the pops that gets recorded when you sing."
  }

The resulting review summary for this product is:

res = client.query\
            .get('Products', ['product_id', 'summary'])\
            .with_where({
                "path": ["product_id"],
                "operator": "Equal",
                "valueText": "1384719342"
            })\
            .do()

  {
    "product_id": "1384719342",
    "summary": "Overall, customers are highly satisfied with this pop filter. 
     They praise its ability to effectively filter out pop sounds, resulting in crisp recordings. 
     Despite its low price, it performs just as well as more expensive options. 
     Additionally, customers appreciate the double screening and the added bonus of a pleasant aroma. 
     The device successfully blocks breath pops without reducing volume or high frequencies, and the metal clamp mount securely attaches to the microphone stand. 
     The only minor issue mentioned is that the gooseneck requires careful positioning to avoid sagging. 
     Overall, this pop filter is highly recommended for vocal recordings as it effectively eliminates pops and performs like a studio filter."
  }

As you can see, the generated summary reflects the points mentioned in the original reviews quite well, including points such as the cost-benefit ratio.

Summary

Generating summaries from extensive amounts of text data you already have is one of the more straightforward applications of generative AI and has already been rolled out by different companies, such as Amazon, Microsoft, and Newegg. If you are interested in similar use cases, you can check out the Healthsearch demo, where you can semantically search through supplements and get a summary of their reviews.

This tutorial only covered the basic concepts of how you can approach building an AI-generated summary feature with generative feedback loops. To make this production-ready, you’d have to improve the prompt with some prompt engineering, think about how to handle this at scale when the number of reviews overflows the context window of your generative model, make sure you can identify verified reviews to filter on, and so on.
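
One possible way to approach the context window problem is to split a product's reviews into chunks, summarize each chunk separately, and then summarize the chunk summaries. The sketch below covers only the chunking step and uses a character budget as a rough proxy for tokens. Both the function name and the budget are assumptions for illustration.

```python
from typing import List


def chunk_reviews(reviews: List[str], max_chars: int) -> List[List[str]]:
    """Group reviews into chunks whose combined length stays within max_chars."""
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for review in reviews:
        # Start a new chunk when adding this review would exceed the budget
        if current and size + len(review) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(review)
        size += len(review)
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_reviews(["a" * 40, "b" * 40, "c" * 40], max_chars=100)
print([len(chunk) for chunk in chunks])  # [2, 1]
```

A single review longer than the budget still ends up in its own chunk, so very long reviews would need to be truncated or split separately.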

This article only covered how to generate the review summaries but not the AI-generated highlight features. If you are interested in a sequel covering the highlight feature, please leave a message in the comments.

Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.

Get an email whenever Leonie Monigatti publishes.

Find me on LinkedIn, Twitter, and Kaggle!

Disclaimer

I am a Developer Advocate at Weaviate at the time of this writing.

References

Literature

[1] V. Schermerhorn, Director, Community Shopping at Amazon (August 14th, 2023): How Amazon continues to improve the customer reviews experience with generative AI (accessed November 10th, 2023)

Dataset

Amazon Musical Instruments Reviews (License: CC0: Public Domain)

Images

If not otherwise stated, all images are created by the author.