Automating App Store Review Analysis with AI

How we built a tool for turning feedback into actionable insights

Pierluigi Rufo · Published in Snapp AI · 6 min read · Jul 18, 2024


We’ve all been there: trying to listen to our users, empathise with them, and understand their pain points so that we can eventually improve our solutions.

Analysing user feedback and extracting insights is a core activity for every company that builds successful products.

This feedback can come from multiple sources, such as direct feedback, qualitative research, customer service inquiries, social media channels and customer reviews.

In the mobile software space, there’s a perfect place to find valuable user feedback and insights: App Store reviews.

I personally find App Store reviews particularly helpful because:

  • They’re a mix of qualitative (textual feedback) and quantitative (star ratings) data.
  • They’re publicly accessible: this allows companies to analyse not only their own reviews but also those of their competitors.
  • They provide a regular feedback stream. You can segment them by timeframe, app version, country and language.
  • They’re a free data source. No expensive tool or investment needed.

There’s a catch though: if you are building a successful app, you might get hundreds of reviews per day, and analysing all of them is extremely time-consuming.

In fact:

  • You need to go through a large set of qualitative data, then categorise and synthesise it.
  • You need to perform the analysis regularly to bring real value to the product development cycle.

In many companies, creating review reports takes precious time from UX designers and researchers. When resources are scarce and the backlog is full, these types of initiatives often fall at the bottom of the list.

Leveraging the power of AI for review automation

Large language models (LLMs) are powerful tools when it comes to analysing and summarising large amounts of data.

Our colleague and research specialist Ed Roberts has been using LLMs to create extensive UX analyses for several apps in the UK market.

Ed gave us the idea of building a tool to help researchers, designers and PMs streamline the process of gathering user insights while saving hours of manual work.

Our hypothesis was the following:
We believe that, with the right setup, it’s possible to use LLMs to analyse user reviews and extract insights with enough accuracy to add value to the product.

Our approach

At Snapp we’re big fans of prototyping and experimenting with different tools. We decided to take on this challenge and set up an exploration with a small team of devs and designers.

We approached the experiment in 4 phases:

  1. Understand & Define
  2. Prototype
  3. Evaluate
  4. Deploy

Structuring the experiment. A clear path ahead helps to keep focus and momentum.

1. Understand & Define

With our hypothesis clearly defined, we started to dive into the problem space.

We used the “Start at the end” methodology to outline user goals and then work backward, brainstorming potential solutions.

Imagine you’re a PM, UX Researcher or Product Designer wanting to gather regular insights about your app. What would you be most interested in?

We identified 3 main goals:

  1. Get an overview of your app’s performance for a defined timeframe.
  2. Spot common pain points, bugs, feedback and feature requests.
  3. Identify product improvements and prioritise action points.

We wanted to build a solution that could meet those needs simply by inputting an App Store link. The system would take care of the rest.

2. Prototype

After scoping the core solution we went into prototyping mode.
The goal of this phase was to create a PoC that could be tested.

We approached this phase in two stages:
A technical exploration, followed by the creation of an alpha build.

2.1 Technical Exploration
For the technical setup we used Python notebooks, as they’re a great tool for structuring and documenting experiments.

There are a few different ways to fetch reviews from the App Store. We opted for an approach based on the App Store’s public RSS feed. This solution comes with a limit of 500 reviews, but that wasn’t a blocker for the scope of our experiment.
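
For context, here’s a minimal sketch of the kind of fetcher we mean, assuming the public iTunes RSS endpoint that was available at the time of writing. The URL format and the roughly 10-pages-of-50-reviews cap are assumptions based on that feed, not ReviewWizard’s actual code:

```python
import requests


def fetch_app_store_reviews(app_id: str, country: str = "us", max_pages: int = 10) -> list[dict]:
    """Fetch recent App Store reviews from the public iTunes RSS feed.

    The feed is paginated (roughly 50 reviews per page) and stops after
    about 10 pages, which is where the ~500-review limit comes from.
    """
    reviews = []
    for page in range(1, max_pages + 1):
        url = (
            f"https://itunes.apple.com/{country}/rss/customerreviews/"
            f"page={page}/id={app_id}/sortby=mostrecent/json"
        )
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        entries = response.json().get("feed", {}).get("entry", [])
        if isinstance(entries, dict):
            entries = [entries]  # a single review comes back as a dict
        if not entries:
            break  # no more pages available
        for entry in entries:
            if "im:rating" not in entry:
                continue  # skip any non-review entries
            reviews.append({
                "rating": int(entry["im:rating"]["label"]),
                "title": entry["title"]["label"],
                "text": entry["content"]["label"],
                "version": entry.get("im:version", {}).get("label", ""),
            })
    return reviews


# Example usage with a placeholder app id:
# reviews = fetch_app_store_reviews("123456789", country="de")
```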

The challenging part was deriving meaningful and consistent insights from the reviews via the LLMs.

To achieve that we tested:

  • Different approaches to the data analysis (e.g. bulk analysis vs individual review analysis; see the sketch below)
  • Different models (GPT-4 vs GPT-4o vs LLaMA-3)
  • Different prompts and prompt-chaining structures

👉 There are a lot of things we tried out. We’ll write a bit more about the technical exploration in a separate post.
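
In the meantime, here’s a simplified illustration of the bulk vs individual trade-off. It’s a sketch using LangChain-style chains with made-up prompts, not the prompts or chain structure we actually shipped:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Swap in a different chat model class here to compare outputs
# (we compared GPT-4, GPT-4o and LLaMA-3).
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Bulk analysis: feed a whole batch of reviews into one prompt and ask for
# aggregated themes. Fewer API calls, but long batches can dilute details.
bulk_prompt = ChatPromptTemplate.from_template(
    "You are analysing App Store reviews.\n\n"
    "Reviews:\n{reviews}\n\n"
    "List the main pain points, bugs and feature requests as bullet points."
)
bulk_chain = bulk_prompt | llm

# Individual analysis: classify each review on its own, then aggregate the
# labels in code. More API calls, but more consistent per-review output.
single_prompt = ChatPromptTemplate.from_template(
    "Classify this App Store review as one of: bug, feature request, praise, "
    "complaint, other. Reply with the label only.\n\nReview: {review}"
)
single_chain = single_prompt | llm


def analyse_bulk(reviews: list[str]) -> str:
    return bulk_chain.invoke({"reviews": "\n---\n".join(reviews)}).content


def analyse_individually(reviews: list[str]) -> list[str]:
    return [single_chain.invoke({"review": r}).content for r in reviews]
```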

2.2 Alpha Build
Happy with the first results, we built an alpha version with the LangChain stack, deploying via LangServe and using LangSmith for monitoring and tracing.
LangServe provides a way to easily deploy LLM-based chains, each backed by a playground for quick testing. We used this playground as a first point of contact for internal validation of the results.
Using LangSmith also allowed us to immediately monitor prompt sizes and project costs, and to create evaluations that flag unexpected results.
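
To make that setup a bit more concrete, here’s a minimal LangServe sketch (the route, prompt and model choice are illustrative, not ReviewWizard’s actual code). Serving a chain this way also gives you a playground endpoint, and LangSmith tracing is switched on via environment variables:

```python
# Minimal LangServe sketch (illustrative, not ReviewWizard's actual code).
# LangSmith tracing is enabled via environment variables, e.g.:
#   export LANGCHAIN_TRACING_V2=true
#   export LANGCHAIN_API_KEY=<your LangSmith key>
from fastapi import FastAPI
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langserve import add_routes

prompt = ChatPromptTemplate.from_template(
    "Summarise the main pain points, bugs and feature requests "
    "in these App Store reviews:\n\n{reviews}"
)
chain = prompt | ChatOpenAI(model="gpt-4o", temperature=0)

app = FastAPI(title="Review analysis (sketch)")
add_routes(app, chain, path="/analyse")  # also serves a playground at /analyse/playground

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```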

After a few tests on the playground it was time to go for a more extensive evaluation.

An example of AI hallucination. It takes a lot of trial and error to identify edge cases and get the prompt right.

3. Evaluate

We decided to manually analyse small data sets from different apps and compare the results with the LLM output of our tool, ReviewWizard.

Surprisingly, we could match most of the insights from the manual reports to the AI-generated ones 🎉

We noticed, though, that the AI insights were randomly ordered rather than ranked by how frequently they appeared in the reviews.
Adding ranking and weighting to the insights was definitely an important aspect to consider. However, as it required a fundamental restructuring of the prompt chain, we decided to keep this improvement for the post-MVP version.
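
To illustrate the kind of ranking we have in mind (a rough sketch with made-up data, not the planned implementation): once each review is tagged with the insights it supports, ranking becomes a simple frequency count.

```python
from collections import Counter

# Hypothetical example: each review has already been tagged with the
# insights it supports (e.g. by a per-review classification step).
tagged_reviews = [
    {"review_id": 1, "insights": ["login fails after update", "dark mode request"]},
    {"review_id": 2, "insights": ["login fails after update"]},
    {"review_id": 3, "insights": ["widget crashes", "login fails after update"]},
]

counts = Counter(insight for review in tagged_reviews for insight in review["insights"])

# Most frequently mentioned insights first.
for insight, n in counts.most_common():
    print(f"{n}x  {insight}")
```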

Example of output comparison between ReviewWizard and a manual report for the Tesla app.

We were satisfied with the results, so we decided to break out of the prototyping mode and create a proper tool for everyone to use.

4. Deploy

From prototype to public release there are a lot of steps to take.
Two of the critical blockers were authentication and cost management.

After brainstorming a couple of options, we decided to offer authentication via the user’s own OpenAI API key.
Even if the setup is a bit geeky, this option avoids request limits, and users always stay in control of their usage costs.
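
In practice this means the client sends their own key with each request and the server never stores one. A minimal sketch of the idea (the header name, endpoint and prompt are made up for illustration, not ReviewWizard’s actual API):

```python
from fastapi import FastAPI, Header
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

app = FastAPI()


class AnalyseRequest(BaseModel):
    reviews: list[str]


@app.post("/analyse")
def analyse(body: AnalyseRequest, x_openai_api_key: str = Header(...)):
    # The model client is built with the caller's own key, so usage is billed
    # to (and controlled by) the user rather than the service.
    llm = ChatOpenAI(model="gpt-4o", temperature=0, api_key=x_openai_api_key)
    prompt = ChatPromptTemplate.from_template(
        "Summarise the main pain points in these App Store reviews:\n\n{reviews}"
    )
    result = (prompt | llm).invoke({"reviews": "\n---\n".join(body.reviews)})
    return {"summary": result.content}
```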

Besides this, there are a lot of improvements we wanted to add to ReviewWizard, but we forced ourselves to keep the MVP scope small, release fast and gather feedback from a larger audience.

Slicing the features using the Now-Next-Later framework helped us to contain the MVP scope and release fast.

And that’s where we are 🚀

Today ReviewWizard is available as a public beta. You can try it out 👉 here.

No registration is needed: just add your OpenAI API key and start creating your automated reports. We are curious to hear your thoughts!

What’s next

Based on early feedback, we identified a few core improvements we want to include next. Here are the ones at the top of our list:

  1. Option for downloading the report as .pdf
  2. Attribution: Linking insights to the reviews that generated them
  3. Ranking of insights: Show how different insights score against each other
  4. Google Play Store support

We hope you enjoyed this article. If you get a chance to test ReviewWizard let us know how it went. We’re looking forward to hearing your feedback and suggestions!
