Key Insights from Starbucks Review: Extracting Keywords using Open Source LLM

Published in ScrapeHero, 6 min read, Nov 1, 2023

In today’s digitally connected world, customer reviews play a pivotal role in shaping our decisions. They offer valuable insights into products, services, and businesses.

In this post, we look at Starbucks reviews and show how the Mistral 7B LLM (Large Language Model) can extract essential keywords and insights from them.

Mistral 7B — An Open Source 7B Instruct Model

Mistral 7B is a large language model created by Mistral AI, known for its impressive performance and efficiency.
It is a generative text model with 7 billion parameters, and it has gained recognition for outperforming other pre-trained large language models of similar size.
According to the benchmarks Mistral AI published, Mistral 7B outperforms Llama 2 13B on every benchmark reported and performs comparably to Llama 1 34B on many of them.
Mistral 7B uses grouped-query attention for faster inference and sliding-window attention to handle longer sequences at a lower cost.
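
To make the sliding-window idea concrete, here is a minimal sketch of the attention mask it implies: each token may look only at itself and a fixed number of tokens before it. The function name and the toy window size of 3 are our own choices for readability; this illustrates the masking pattern only and is not Mistral's implementation.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Causal (j <= i) and within the last `window` positions (i - j < window).
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most three allowed positions, so the attention cost per token stays constant no matter how long the sequence grows.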

About Data

The data consists of reviews and ratings of Starbucks store locations in the USA.

The data was collected by ScrapeHero, one of the leading web-scraping services in the world, and is the data source we used for this analysis.

Columns:

ID, Name, Address, Street, Zip_Code, State, City, Author, Review, Rating

Sample Data:

Installation

pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
pip install keybert "ctransformers[cuda]"
pip install --upgrade git+https://github.com/huggingface/transformers
pip install wordcloud matplotlib numpy pandas

Keyword Extraction using Mistral 7B

Let's begin by loading the model. We'll offload 50 layers to the GPU, which reduces RAM usage at the cost of VRAM. If you run into memory errors, try lowering the gpu_layers parameter.

Loading Model:

from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True,  # return a model that works with the Hugging Face pipeline
)
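
If you are not sure how many layers fit on your GPU, one option is to retry with fewer layers after a failure. The sketch below is a hypothetical helper of ours (load_with_fallback is not part of ctransformers), and it assumes the loader raises MemoryError or RuntimeError when it runs out of memory. Some backends abort the process instead, in which case lowering gpu_layers by hand is the only option.

```python
def load_with_fallback(load_fn, gpu_layers=50, step=10):
    """Call load_fn(gpu_layers=n), lowering n by `step` after each failure.

    Returns (model, layers_used). Re-raises if loading fails with 0 GPU layers.
    """
    layers = gpu_layers
    while True:
        try:
            return load_fn(gpu_layers=layers), layers
        except (MemoryError, RuntimeError):
            if layers == 0:
                raise  # even CPU-only loading failed
            layers = max(0, layers - step)


# Stand-in loader for demonstration: pretends only 20 layers fit in VRAM.
def fake_loader(gpu_layers):
    if gpu_layers > 20:
        raise RuntimeError("out of memory")
    return f"model with {gpu_layers} GPU layers"


model_stub, layers_used = load_with_fallback(fake_loader)
print(layers_used)
```

With the real model, load_fn could be a functools.partial around AutoModelForCausalLM.from_pretrained with every argument except gpu_layers filled in.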

Setting up the Hugging Face pipeline:

from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)

Loading data:

import pandas as pd

df = pd.read_csv("Starbucks_reviews.csv")

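Selecting all the reviews from New Hampshire:

# Column names must match the CSV header exactly. The column list above shows State and Review, so adjust the casing here if your file differs.
newhampshire_reviews = df[df.state == "NH"]['review'].tolist()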
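
Scraped review columns often contain empty cells, which pandas turns into float NaN values when we call tolist(). A small, optional cleaning step before sending the reviews to the model could look like the sketch below. The clean_reviews helper and the sample list are ours, for illustration only, and are not taken from the dataset.

```python
def clean_reviews(reviews):
    """Keep only non-empty string reviews, with surrounding whitespace removed."""
    cleaned = []
    for review in reviews:
        if not isinstance(review, str):
            continue  # skips None and the float NaN pandas uses for missing cells
        text = review.strip()
        if text:
            cleaned.append(text)
    return cleaned


sample = ["  Great coffee!  ", float("nan"), "", None, "Slow drive through"]
print(clean_reviews(sample))  # ['Great coffee!', 'Slow drive through']
```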

Now let's try to extract the keywords from a single review. We start with the following prompt, which holds the instruction we give to the model.

prompt = f"""
I have the following document:


* {newhampshire_reviews[0]}


Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
"""
response = generator(prompt)
print(response[0]["generated_text"])

Output:

"""
I have the following document:
* Always a pleasure visiting this Starbucks drive through! The staff are friendly

Extracted keywords from that document.
**Answer:**
1. Starbucks
2. drive through
3. staff
4. friendly
"""

The model returns unstructured output: it echoes the prompt, adds its own heading, and gives a numbered list instead of comma-separated keywords. To make the output consistent, we need to change the prompt template.
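
Until the prompt is fixed, a response like this can still be turned into a plain list with a little post-processing. The sketch below is one possible approach, with a helper name of our own: it strips the echoed prompt, collects numbered or bulleted items, and falls back to splitting on commas. The transformers text-generation pipeline also accepts return_full_text=False, which should remove the echoed prompt at the source.

```python
import re

# Matches list items such as "1. staff", "2) staff", "- staff" or "* staff".
LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)\s*$")


def parse_keywords(generated_text, prompt=""):
    """Return the keywords found in a model response as a list of strings."""
    answer = generated_text
    if prompt and answer.startswith(prompt):
        answer = answer[len(prompt):]  # drop the echoed prompt
    items = [m.group(1) for line in answer.splitlines() if (m := LIST_ITEM.match(line))]
    if items:
        return items
    # No list markers found: assume a comma-separated answer.
    return [part.strip() for part in answer.split(",") if part.strip()]


demo_prompt = "I have the following document:\n* Always a pleasure visiting\n"
demo_output = demo_prompt + "**Answer:**\n1. Starbucks\n2. drive through\n3. staff\n4. friendly\n"
print(parse_keywords(demo_output, demo_prompt))
```

Stripping the prompt first matters here, because the document line in the prompt starts with "* " and would otherwise be mistaken for a keyword.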

Like many Large Language Models, Mistral 7B Instruct expects a specific prompt format: each instruction is wrapped in [INST] and [/INST] tags, and a complete example exchange starts with <s> and ends with </s>. Following this format also lets us show the model what a "correct" interaction looks like.

Using this format, let's build a template for keyword extraction with two components:

Example Prompt: shows the model what a "high-quality" output looks like.
Keyword Prompt: instructs the model to perform keyword extraction on a new document.

example_prompt = """
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.


Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] website, deliver, days, received</s>"""




keyword_prompt = """
[INST]


I have the following document:
- [DOCUMENT]


Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""


prompt = example_prompt + keyword_prompt
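
The [DOCUMENT] tag in the keyword prompt is a placeholder. As far as we understand, KeyBERT's TextGeneration wrapper replaces it with each review before calling the model. The sketch below imitates that substitution with a helper of our own (fill_template is not a KeyBERT function) and a shortened template, so you can inspect the exact text the model would receive.

```python
def fill_template(template: str, document: str, placeholder: str = "[DOCUMENT]") -> str:
    """Insert one document into the prompt template."""
    if placeholder not in template:
        raise ValueError(f"template has no {placeholder} placeholder")
    return template.replace(placeholder, document)


# Shortened stand-in for the keyword prompt defined above.
template = (
    "[INST]\n"
    "I have the following document:\n"
    "- [DOCUMENT]\n\n"
    "Please give me the keywords that are present in this document "
    "and separate them with commas.\n"
    "[/INST]\n"
)

filled = fill_template(template, "The staff are friendly and the drive through is fast.")
print(filled)
```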

Example:

Using KeyBERT and KeyLLM to Extract Keywords

from keybert.llm import TextGeneration
from keybert import KeyLLM


# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)
keywords = kw_model.extract_keywords(newhampshire_reviews)
keywords

Since there are a lot of keywords present, let’s generate a word cloud from the keywords.

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np


# Flatten the list of lists into a single list
flattened_data = [word for sublist in keywords for word in sublist]


# Convert the list into a space-separated string
text = " ".join(flattened_data)


# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text)


# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
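
One caveat: joining the keywords with spaces splits multi-word keywords such as "drive through" into separate words. If you want the cloud to keep phrases intact, a possible alternative is to count each keyword and pass the counts to WordCloud's generate_from_frequencies method. The helper names and the toy input below are ours, for illustration only.

```python
from collections import Counter


def keyword_frequencies(keyword_lists):
    """Count how often each keyword phrase occurs across all reviews."""
    counts = Counter()
    for keywords_for_one_review in keyword_lists:
        for keyword in keywords_for_one_review:
            # Lowercase and collapse whitespace so "Drive  Through" == "drive through".
            normalized = " ".join(keyword.lower().split())
            if normalized:
                counts[normalized] += 1
    return counts


def render_phrase_cloud(counts):
    """Build a word cloud from phrase counts (requires the wordcloud package)."""
    from wordcloud import WordCloud  # imported here so counting works without it

    cloud = WordCloud(width=800, height=400, background_color="white")
    return cloud.generate_from_frequencies(counts)


toy_keywords = [["Drive Through", "staff"], ["drive  through", "friendly"]]
frequencies = keyword_frequencies(toy_keywords)
print(frequencies.most_common(1))  # [('drive through', 2)]
```

The object returned by render_phrase_cloud can be displayed with plt.imshow exactly like the word cloud above.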

We can clearly see what the customers are talking about.

Key takeaways are friendly staff, service, understaffed, leadership, wait time, expensive, teamwork, slow, etc.

Now let's check one more state, Arkansas:

arkansas_reviews = df[df.state == "AR"]['review'].tolist()

keywords = kw_model.extract_keywords(arkansas_reviews)

Wordcloud:

Key takeaways are drive thru, fast, complaint, location, customer service, good taste, wait time, excellent, friendly, parking, etc.
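
To compare two states side by side, simple set operations are enough. The sketch below uses the takeaways listed in this post for New Hampshire and Arkansas. The compare_keywords helper is ours, and it matches strings exactly, so "friendly staff" and "friendly" count as different keywords.

```python
def compare_keywords(first, second):
    """Split two keyword collections into shared and state-specific keywords."""
    first, second = set(first), set(second)
    return {
        "shared": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


nh_takeaways = ["friendly staff", "service", "understaffed", "leadership",
                "wait time", "expensive", "teamwork", "slow"]
ar_takeaways = ["drive thru", "fast", "complaint", "location", "customer service",
                "good taste", "wait time", "excellent", "friendly", "parking"]

comparison = compare_keywords(nh_takeaways, ar_takeaways)
print(comparison["shared"])  # ['wait time']
```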

Conclusion

In the era of information overload, efficient methods of extracting essential information from a sea of data are invaluable. Mistral 7B, an open source 7B language model, can be a game-changer when it comes to distilling key insights from Starbucks reviews.

By utilizing its powerful natural language processing capabilities, we can extract and present essential keywords that enable consumers and businesses to make informed decisions.

Whether you’re a shopper looking for the best product or a business aiming to understand customer sentiments, Mistral 7B’s capabilities in keyword extraction open up new possibilities for efficient data analysis and decision-making.

As online shopping continues to grow, extracting valuable insights from customer reviews is more critical than ever.

Mistral 7B and similar local models offer a compelling solution to enhance our understanding of consumer sentiments and preferences.

So, next time you read a Starbucks review, keep in mind the technology working behind the scenes to bring you the most critical takeaways.

The code above is available in my GitHub repository.

Hope you learned something new today. Happy learning!