Building an aspect-based sentiment analysis pipeline using GPT-3

Michele Pangrazzi
11 min read · Sep 10, 2022


Large Language Models (LLMs) like GPT-3 (but also BLOOM) have been described as “few-shot learners”: given just a few examples, they can perform a very wide variety of tasks with excellent results.

The GPT-3 Davinci model is, at the time of writing, the most capable model in the GPT-3 family and one of the largest LLMs out there. Davinci is able to solve logic problems, determine cause and effect, understand the intent of text, produce creative content, explain character motives, and handle complex summarization tasks.

Considering all of this, I was wondering:

Can a model so capable be the main component of an aspect-based sentiment analysis pipeline?

To answer this question, I tried to build a proof-of-concept ABSA pipeline around it (the link to the full code is available at the end of the article). In this article I will go through the implementation details.

TL;DR

Feedback about the Boss MT-2 Metal Zone guitar distortion pedal, analysed by GPT-3

The pipeline

We can model our ABSA POC pipeline in this way:

Scraping feedback

The first thing to do is to get some user feedback about a product or service. How much? When I have to buy something from Amazon, I tend to briefly read:

  • At most 25 reviews from Top reviews / All stars (ranked by Amazon’s algorithm)
  • At most 25 reviews from Most recent / All stars

Or, alternatively:

  • At most 25 reviews from Top reviews / All positive (ranked by Amazon’s algorithm)
  • At most 25 reviews from Top reviews / All critical (ranked by Amazon’s algorithm)

In both situations, we’ll need to scrape around 50 selected reviews of the target product.

Since Amazon is quite a common source of user feedback, we will get our reviews from there. To do that, we need to implement a basic scraper.

I chose to implement a basic Scrapy spider for amazon.com. Scrapy is a well-known, easy-to-use and powerful scraping framework, and it’s more than enough to get the reviews we need from Amazon.

In less than 50 lines of code you can build a (basic) spider which:

  • Scrapes the number of items we need (50 reviews)
  • Adds a delay between requests (to avoid getting blocked too easily)
  • Crawls review pages with ease

We can run the spider either programmatically or using a CLI command.
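The spider itself is in the linked repo; as a stdlib-only sketch of its crawl logic, here is how the “next page” URL can be derived from the current one (the helper name `next_page_url` is mine, and the settings in the comment are standard Scrapy options, not necessarily the repo’s exact values):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url: str) -> str:
    """Return the review-page URL with pageNumber incremented by one.

    This is the core of what the spider does after parsing each page;
    in an actual Scrapy spider the result would be fed to response.follow().
    """
    parts = urlparse(url)
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    query["pageNumber"] = str(int(query.get("pageNumber", "1")) + 1)
    return urlunparse(parts._replace(query=urlencode(query)))

# Delay and item limit are plain Scrapy settings, e.g. on the spider class:
#   custom_settings = {
#       "DOWNLOAD_DELAY": 2,          # seconds between requests
#       "CLOSESPIDER_ITEMCOUNT": 50,  # stop after ~50 scraped reviews
#   }
```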

Now, since I’m a guitar player, let’s scrape the user reviews we need for a Boss MT-2 Metal Zone (a famous distortion pedal for electric guitar) and save the output to a CSV file:

# Top reviews / All positive
$ scrapy runspider \
    absa/scraping/amazon.py \
    -O feedbacks.csv \
    -a start_url='https://www.amazon.com/MT-2-Metal-Distortion-Guitar-Pedal/product-reviews/B0002CZVA0/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&filterByStar=positive&pageNumber=1'

# Top reviews / All critical
$ scrapy runspider \
    absa/scraping/amazon.py \
    -o feedbacks.csv \
    -a start_url='https://www.amazon.com/MT-2-Metal-Distortion-Guitar-Pedal/product-reviews/B0002CZVA0/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&filterByStar=critical&pageNumber=1'

It takes only a few seconds to scrape the reviews we need, and that’s enough for our scraping purposes.

Using GPT-3 for analysis

Nowadays, a fully-featured ABSA pipeline requires a lot of steps, including:

  • EDA of the data / domain
  • Aspect dictionary building (or topic modelling)
  • Sampling and annotation of examples for each aspect (manually, with active learning or using rules)
  • Feature extraction / model training
  • Model evaluation
  • Model deployment

…And so on.

This approach, instead, will rather ambitiously rely on only two things:

  • The immense “knowledge” of GPT-3 (it was trained on about 45 TB of text data from multiple sources, including Wikipedia and books)
  • The impressive [zero|one|few]-shot learning capability

Zero, one and few-shot approaches

Prompt engineering is crucial if you want to get the most out of GPT-3.

Let’s start with a basic ABSA-related task. We will ask GPT-3 to extract aspect expressions, related segments and related sentiments from a sentence:

Zero-shot approach

Impressive, but maybe not satisfactory. This is because we didn’t provide any examples to the model, so it’s only guessing the best way to fulfil the task and format the output (which can vary if you run the query a few times). This was a zero-shot approach.

Let’s focus only on aspect extraction. We will give GPT-3 a single example and see the outcome:

One-shot approach

With only one example (a one-shot approach), it did exactly what we asked for. As you may have already figured out, the main idea of prompt engineering is to create a text prompt which helps GPT-3 better understand the specific task requirements and then fulfil them.

Let’s now use a more specific prompt, asking the model to format the output as JSON, so it will be a lot easier to parse its responses:

One-shot approach and JSON format

Now you know that GPT-3 is (also) aware of the JSON format. However, for a basic ABSA analysis, aspect extraction alone is not enough. We also need to extract:

  • The text segments related to the extracted aspects
  • The sentiment value related to each aspect/segment pair

Moreover, we’d still like an easily parsable JSON output. So let’s ask GPT-3 to kindly do all of this (I know it sounds a bit weird 😉).
Prompt-wise, we will pass two generic examples of the ABSA output we expect, in JSON format (a few-shot approach).

Few-shot approach

This is exactly what we need. To summarize, using the few-shot approach, GPT-3 was able to:

  • Extract relevant aspects
  • Extract sentence segments related to aspects
  • Assign a sentiment value to each aspect/segment pair

Even if we are looking at a single result, I think it’s quite impressive, especially considering that, as I wrote above, this output is usually the combination of several stages, which can be very complex (depending on what one wants to achieve).

Of course, since the idea is to build an automated pipeline, we need to figure out how to programmatically query GPT-3.

This can be easily done using the OpenAI Python library. To query GPT-3 via the API, you first need to sign up to OpenAI and obtain an API key. You will get a free $18 credit (more than enough to play with and run this experiment); then, depending on whether you’re an individual or an organization, you will pay as you go.

Let’s call OpenAI APIs asking for a basic Completion:
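A minimal sketch of that call, using the `openai` Python package (the model name `text-davinci-002` and the small `completion_request` helper are my own choices for illustration, not necessarily what the repo uses):

```python
import os

def completion_request(prompt: str, temperature: float = 0.0, max_tokens: int = 256) -> dict:
    """Build the keyword arguments for openai.Completion.create()."""
    return {
        "model": "text-davinci-002",  # a capable GPT-3 model at the time of writing
        "prompt": prompt,
        "temperature": temperature,   # 0.0 keeps runs reproducible
        "max_tokens": max_tokens,
    }

# The actual API call (only runs if an API key is configured):
if os.environ.get("OPENAI_API_KEY"):
    import openai  # pip install openai
    openai.api_key = os.environ["OPENAI_API_KEY"]
    resp = openai.Completion.create(**completion_request("Hi, how are you?"))
    print(resp.choices[0].text.strip())
```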

This will print something like:

I'm doing well. How are you?

Now we can wrap the Completion call in a more useful method which, for example, takes care of prepending the examples discussed above to the prompt, in order to leverage the few-shot learning capability of GPT-3 and obtain the output we want.
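A sketch of such a wrapper (the example sentences and the exact JSON schema below are illustrative stand-ins for the ones shown in the screenshots):

```python
import json

# Two generic, illustrative few-shot examples showing the JSON output we expect.
FEW_SHOT = """\
Extract aspects, related segments and sentiments as JSON.

Sentence: "Battery life is great but the screen scratches easily."
Output: [{"aspect": "battery life", "segment": "Battery life is great", "sentiment": "positive"},
         {"aspect": "screen", "segment": "the screen scratches easily", "sentiment": "negative"}]

Sentence: "Shipping was fast."
Output: [{"aspect": "shipping", "segment": "Shipping was fast", "sentiment": "positive"}]
"""

def absa_prompt(feedback: str) -> str:
    """Prepend the few-shot examples to the feedback we want analysed."""
    return f'{FEW_SHOT}\nSentence: "{feedback}"\nOutput:'

def parse_absa_response(text: str) -> list:
    """Parse the JSON list GPT-3 returns for one feedback."""
    return json.loads(text)
```

With this, analysing one review would mean calling `openai.Completion.create()` with `prompt=absa_prompt(feedback)` and feeding the returned text to `parse_absa_response`.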

A note about “temperature”

On the GPT-3 Playground, you may have noticed that there are some parameters you can tune.

The temperature value can be important here.

It basically controls the randomness of the generated text. If you want your experiments to be reproducible, use a value like 0 or 0.1. The more you raise it, the more the output will vary between runs (though it will still be meaningful). More info can be found in the excellent docs.

Evaluation of analysis results

As also shown at the beginning of the article, here is another analysed review of the Boss MT-2 Metal Zone.

An analysed feedback about MT-2

It’s interesting to see that GPT-3 performs well in a very specific context (electric guitar effects) even though the prompt we passed to the Completion calls is quite general.

Assuming that the analysis quality on single reviews is empirically acceptable, we will now “zoom out” to see the “big picture” of the whole analysis. To do that, we’ll generate a simple Aspect vs Sentiment chart using Plotly (you will find the link to the full code at the end of the article):

Aspect vs Sentiment on Boss MT-2
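Behind a chart like this there is only a small aggregation step over the parsed analyses. A stdlib-only sketch (the aspect names and exact JSON fields here are made up for illustration; the real code is linked at the end):

```python
from collections import defaultdict

# Per-review analysis results, as parsed from GPT-3's JSON output.
analyses = [
    [{"aspect": "sound quality", "sentiment": "negative"},
     {"aspect": "price", "sentiment": "positive"}],
    [{"aspect": "sound quality", "sentiment": "positive"}],
]

def aspect_sentiment_counts(analyses):
    """Count positive/negative/neutral mentions per aspect across all reviews."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for analysis in analyses:
        for item in analysis:
            counts[item["aspect"]][item["sentiment"]] += 1
    return dict(counts)

# A long-form list of rows built from these counts can then be passed to
# plotly.express.bar(..., x="aspect", y="count", color="sentiment").
```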

Looking at this chart, we can quickly see that the vast majority of the aspects extracted automatically by GPT-3 are valid. We can also see a bit of noise (e.g. “This pedal” is not a valid aspect).

Disclaimer here: I have already tried the Boss MT-2 and have a precise idea of how it performs. Looking at the chart, I’d say it reflects the general (and my own) opinion quite well: people are very divided about the sound quality, while they tend to appreciate the features.

Let’s now run the analysis in exactly the same way, but on a totally different product: a vacuum cleaner. I picked a random vacuum cleaner from Amazon, and here are some sample results:

Analysed feedbacks about a vacuum cleaner

While it’s not perfect, I think it did a very good job. It’s also worth mentioning that it handles languages other than English well:

A Spanish review of a vacuum cleaner, analysed

The Aspect vs Sentiment chart looks nice in this case too. We can easily see which are the best and worst aspects.

“What if the analysis quality is not good enough?”

If the results are not satisfactory even after good prompt engineering, the few-shot approach presented here is probably not enough for your specific task.

In this case, I recommend fine-tuning GPT-3. As stated in the docs, a task-adapted model brings notable benefits:

  • Higher quality results than prompt design
  • Ability to train on more examples than can fit in a prompt
  • Token savings due to shorter prompts
  • Lower latency requests

Last but not least, fine-tuning GPT-3 using the OpenAI APIs is way easier than, e.g., fine-tuning GPT-2 using transformers.

“Can I use GPT-3 to further improve the pipeline?”

Yes! As we already know, GPT-3 is able to handle a lot of different tasks. Let’s give some starting points.

Words-to-concept abstraction

If you’re not satisfied with a flat list of aspects, you can ask GPT-3 to group them by similarity. Here’s an (almost) zero-shot example with generic words:

Grouping words by concept
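The exact wording I used is in the screenshot above; one possible shape for such a grouping prompt (the wording below is illustrative) is:

```python
def grouping_prompt(words: list) -> str:
    """An (almost) zero-shot prompt asking GPT-3 to group words by concept."""
    return ("Group the following words by concept:\n\n"
            + ", ".join(words)
            + "\n\nGroups:")

# Example: grouping_prompt(["guitar", "bass", "fridge", "oven"])
```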

A sample use case: if you don’t like the aspects GPT-3 generates during analysis, you can still use GPT-3 (or other tools, if you prefer) to extract noun phrases:

Extracting noun phrases with GPT-3

Then, once you have many of them, use GPT-3 to abstract them into concepts as we did above.

Summarize long feedbacks

It might not be worth analysing very long reviews directly. Instead, a good idea may be to use GPT-3 to summarize them before analysis:

A zero-shot text summarization done by GPT-3
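One possible zero-shot summarization prompt (the wording is mine, not the screenshot’s), plus a deliberately crude token estimate to gauge the savings:

```python
def summarize_prompt(feedback: str, max_sentences: int = 3) -> str:
    """A zero-shot prompt asking GPT-3 to summarize a long review."""
    return (f"Summarize the following product review in at most "
            f"{max_sentences} sentences:\n\n{feedback}\n\nSummary:")

def rough_token_count(text: str) -> int:
    """Crude token estimate: roughly 4 characters per English token."""
    return max(1, len(text) // 4)
```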

In terms of cost, it’s much cheaper to analyse the summarized reviews (fewer tokens to pass), and most of the relevant aspects are preserved. GPT-3’s performance here seems in line with (if not better than) BART-large.

Extraction of insights from datasets

Charts are basically visual representations of data. So you can ask GPT-3 to extract insights from the dataset behind the chart, like this (a zero-shot example):

Querying GPT-3 with a CSV dataset

You will obtain something like:

Insights extracted by GPT-3 from a CSV dataset

This would be great to show just below the chart.
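A sketch of how a small CSV can be embedded in such a prompt (the function and wording are illustrative; keep the dataset small enough to fit GPT-3’s prompt-size limit):

```python
import csv
import io

def insights_prompt(csv_text: str, max_rows: int = 50) -> str:
    """Embed a small CSV dataset into a zero-shot prompt asking for insights."""
    # Keep the header plus at most `max_rows` data rows, so the prompt
    # stays within the model's context window.
    rows = list(csv.reader(io.StringIO(csv_text)))[: max_rows + 1]
    table = "\n".join(",".join(row) for row in rows)
    return (f"Here is a dataset in CSV format:\n\n{table}\n\n"
            "List three insights from this dataset:")
```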

Note that, for simplicity, I’ve always used a zero-shot approach here. It worked, but it might not handle every kind of situation, especially when the context is very specific.

The tricks to improve the results are exactly the ones described above: first, try with a few examples; if that’s still not enough, fine-tune GPT-3. The results will be a lot more interesting then.

“What if I want to build a real pipeline with this?”

If you look at the code (link is provided at the end of the article), you will find that the main flow of what we described in this article lives in some Jupyter Notebooks.

This is fine for a proof-of-concept (the goal here is only to demonstrate the potential of the approach), but what if you really want to build a production-ready pipeline which uses Scrapy spiders and GPT-3? If this is the case, you may consider the following things.

Scaling scraping

Scrapy spiders can scale if you deploy them on Zyte, Scrapy’s managed cloud. This way, managing a large number of spiders shouldn’t be too complicated.

Scaling GPT-3 analysis

There is nothing to provision or worry about here (no clusters to manage, no GPUs to handle, …): you rely only on the OpenAI APIs. Take a look at their usage limits and set up your workloads accordingly.

It’s also worth mentioning that OpenAI API pricing has dropped significantly since September 1.

If GPT-3 doesn’t suit you because, e.g., you want something more “open”, I suggest trying BLOOM (via the HuggingFace APIs) or OPT-175B.

Create a workflow

Since both the scraping and the analysis parts can scale, at some point you may want to build a real pipeline.

Spotify’s Luigi is a nice option here; I’ve already written about it. You could also consider newer tools like Metaflow, which seems very promising.

Finally, you probably want to run your workflows on AWS EC2 or similar services.

Conclusions

We have just seen that GPT-3 can perform well on relatively complex NLP tasks like aspect-based sentiment analysis.

Considering the quality of the results, it didn’t take much time to “put the pieces together” and see relevant results, mainly thanks to GPT-3’s capabilities. Of course, you could obtain similar results using multiple task-specific transformer models, but it would require a lot more work.

We have also seen that GPT-3 is not perfect, so it’s important to keep in mind the possible flaws of such an approach.

Still, I think that LLMs will become the new standard in NLP and will be used more and more in the future to solve many NLP tasks.

They will help to improve your current projects or -why not- they will help you to create something new.

References

The full code of the experiment is available here.

Thanks a lot for reading!


Michele Pangrazzi

Software Engineer. Mostly JavaScript and Python. Node.js addict.