Just noise? Using GPT3 to analyse Google reviews of London police stations

Published in

Behavioural Insights

13 min read · Jan 17, 2023

At a push, most of us might leave a review of a particularly good or bad restaurant. But who reviews a police station? And can we learn anything from the reviews?

There are often formal complaints channels for public services. But you’re unlikely to file a formal complaint if you just can’t get through to the front desk. One avenue people do have is public reviews on sites like Google Maps. So let’s jump in and see what we can learn.

For this analysis I’m taking reviews of Metropolitan Police Service (MPS) police stations. I’ve used these because: a) from a quick search, many of the stations had more than 10–20 reviews; b) the Met is under pressure to improve engagement with the public; and c) policing and criminal justice is a personal area of interest.

A quick caveat: this is a side project to explore analysing review data using the tool du jour, OpenAI’s models.

The data

The data extract is for 37 Met police stations. The list of stations comes from this website, which was the first one I found online that I could copy and paste from.

The Google reviews were then scraped using Outscraper, a tool for scraping data from the internet.

In total we have 1538 reviews dating all the way back to 2011. When you split that out over 37 stations (about 42 reviews per station) and more than 10 years, it’s really not that much data. But let’s see what we can find out.

Exploratory analysis

It’s love-hate: 86% of reviews are either 1 star or 5 star reviews
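
As a rough illustration of how a headline figure like this can be computed, here is a minimal Python sketch. The ratings below are made-up toy values, not the scraped data, and `extreme_share` is a hypothetical helper name (the original analysis was done in R).

```python
from collections import Counter


def extreme_share(ratings):
    """Return the fraction of ratings that are either 1 or 5 stars."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    return (counts[1] + counts[5]) / len(ratings)


# Made-up toy ratings for illustration only (not the real reviews).
toy_ratings = [1, 1, 5, 3, 1, 5, 2, 5, 1, 4]
share = extreme_share(toy_ratings)
print(f"{share:.0%} of the toy reviews are 1 or 5 stars")
```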

There are very few reviews before 2016

Average review ratings have been declining with a significant drop in 2022

Some stations have higher ratings than others but small sample sizes limit the conclusions we can draw

Text analysis: Word pairs

Ok, so the quantitative data is vaguely interesting. There are some trends and differences in there, but nothing we’d draw robust conclusions from.

So let’s instead turn to the text. We can use text analysis tools to find out what people are saying. We’ll start out with something simple.

Bi-grams are simply pairs of words that appear beside one another, such as “good service” or “long wait”. To understand what words are appearing in 1 and 5 star reviews, we’ll look at each of those separately.
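
To make the idea concrete, here is a minimal Python sketch of bi-gram counting using only the standard library. It is not the code behind the charts in this post (that was done in R), the toy reviews are invented, and it skips steps such as stop-word removal or stemming that a fuller pipeline would probably include.

```python
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text, split it into word tokens and pair up neighbours."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(zip(tokens, tokens[1:]))


def top_bigrams(reviews, n=3):
    """Count bi-grams across all reviews and return the n most common."""
    counts = Counter()
    for review in reviews:
        counts.update(bigrams(review))
    return counts.most_common(n)


# Invented toy reviews for illustration only.
toy_reviews = [
    "Long wait and rude staff",
    "Such a long wait at the front desk",
    "Rude staff, long wait",
]
print(top_bigrams(toy_reviews, n=2))
```

Note that this simple tokeniser ignores punctuation, so it will also pair words across a comma or full stop. To compare 1 and 5 star reviews, you would run `top_bigrams` on each group separately.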

One star reviews seem to be about waiting times, rudeness and trying to contact the station. Five star reviews seem to be thanking/praising the police

Note that for some reason ‘y’ has been replaced by ‘i’ 😕

This is interesting, but really it’s just a fancier version of a word cloud. We can try something more sophisticated.

Text analysis: Topic modelling

Topic modelling can be used to identify the topics within text documents. It works by identifying groups of words that appear together within the document (in this case, the review).

If we run the topic model on our data and ask it to find 4 topics, here’s what we get.

It’s hard to distinguish these topics from one another. Topic 1 might be about thanking the police. The second might be about waiting times and the fourth might be about trouble contacting the police.

Text analysis: OpenAI GPT models 💻

After mixed results and limited insights from that text analysis, we’ll now use OpenAI’s models to analyse the data.

Firstly, I wanted to categorise the reviews. To do this I manually looked at a sample of the reviews and assigned them to a few categories that seemed to pop up regularly. Note that this wasn’t very systematic.

Below are the categories and some reviews I assigned to them.

Waiting time

Struggling to get in contact

Thanks and/or appreciation of service

Poor investigation or response to crime

Generally poor customer service

Sarcastic (some of these are quite funny 😆)

So let’s see if we can use the AI model to categorise them.

Based on some trial and error and reading OpenAI’s guidance and other tips online I found that a long prompt providing plenty of context and examples led to better results.

Prompt to identify review categories

Imagine you are reading google reviews for a police station. 

Your task is to categorise these reviews. 

There are 7 categories: 
1) 'Waiting times', people complaining that they were waiting for a long time in the police station or for someone to visit them. 
2) 'Poor customer service', people complaining that they were not well treated at the station or an officer was rude to them. 
3) 'Thank you', the reviewer had a positive experience with the police and wants to show their appreciation. 
4) 'Difficulty contacting', people complaining that they could not contact the police over the phone. 
5) 'Poor investigation of crime', people complaining that the police did not investigate a crime properly. 
6) 'Sarcastic', positive reviews about how they received as good service a bit like a hotel guest after being arrested. 
7) 'Other', reviews that do not fall into any of the other categories. Here are some examples. 

Example 1: Review: Absolutely Useless! My car got destroyed and answer to this.. sorry nothing we can do for you... my question- are you going to investigate it? Answer- more likely No, So technically none cares .. what is POLICE for then? if you need them they give you papers to complete and that's all, job done! Well done! Oh yeah we don't care who done it, just call your insurance. So where is the justice?!? 
Category: 'Poor investigation of crime'. 
Example 2: Review: Why do they not ever pick up the phone? Is there a reason for this? I've been trying to get hold of the police station all week! Is there another phone number I can use!!!. 
Category: 'Difficulty contacting'. 
Example 3: Very nice and friendly staff who are attentive and patient, they helped me very kindly when I was hopeless and homeless. Thank you. 
Category: 'Thank you'. 
Example 4: Only 1 "officer" working the desk at any time, expect long waiting times if any other person decides to be there at the same time as you. 
Category: 'Waiting times'. 
Example 5: Review: worst police station ever!!!! They didn't even help me or try to understand my point of view! Very poor service and lots of lie, they never update you on things! 
Category: 'Poor customer service'. 
Example 6: Review: CAN YOU PLEASE STOP LEAVING DOGS OUTSIDE IN COURTYARD AT NIGHT. I live next door and constantly hear distressed and terrified dogs barking for up to 12 hours at night/into the day. It wakes my dogs, makes them bark, it wakes me up and I have a terrible day at work. 
Category: 'other'. 
Example 7: Review: amazing hotel my compliments to the chef, had some nice slop and reused teabag tea. 
Awesome. Category: 'Sarcastic' 
Example 8: Review: Love Brixton police station. Order pizza from my cell, got my Xbox....top place. 
Category: 'Sarcastic'. 
Review: [INSERT THE REVIEW TO BE CATEGORISED]
Category: [MODEL COMPLETION]

Technical notes

I used OpenAI’s davinci-002 model. This is a faster and cheaper version of davinci-003 but not as sophisticated in what it can do. I tried using 003 but didn’t see much difference and it took longer to run.
Temperature in the model was set to 1. I tried varying this and didn’t see much difference in the results so stuck with the default.
Analysis was carried out in R using the OpenAI package.
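
For readers who want to try something similar, here is a minimal Python sketch of how a few-shot prompt like the one above can be assembled and the completion mapped back to a category. It is an illustration, not the R code used for this post. The `complete` argument is a hypothetical stand-in for whatever function sends a prompt to the model and returns its text, and falling back to 'Other' for unrecognised output is my assumption.

```python
from typing import Callable, List, Tuple

CATEGORIES = [
    "Waiting times", "Poor customer service", "Thank you",
    "Difficulty contacting", "Poor investigation of crime", "Sarcastic", "Other",
]


def build_prompt(instructions: str, examples: List[Tuple[str, str]], review: str) -> str:
    """Join the instructions, the labelled examples and the new review into one prompt."""
    parts = [instructions.strip(), ""]
    for i, (text, category) in enumerate(examples, start=1):
        parts.append(f"Example {i}: Review: {text}")
        parts.append(f"Category: '{category}'.")
    parts.append(f"Review: {review}")
    parts.append("Category:")  # the model is expected to continue from here
    return "\n".join(parts)


def parse_category(completion: str) -> str:
    """Map raw model output onto a known category (assumed fallback: 'Other')."""
    cleaned = completion.strip().strip("'\".").lower()
    for category in CATEGORIES:
        if cleaned.startswith(category.lower()):
            return category
    return "Other"


def categorise(review: str, complete: Callable[[str], str],
               instructions: str, examples: List[Tuple[str, str]]) -> str:
    """Build the prompt, call the (hypothetical) model function and parse the answer."""
    return parse_category(complete(build_prompt(instructions, examples, review)))


# Stubbed model call so the sketch runs without an API key.
def fake_complete(prompt: str) -> str:
    return " 'Sarcastic'."


label = categorise("Five stars, lovely cell.", fake_complete,
                   "Your task is to categorise these reviews.",
                   [("Why do they never pick up the phone?", "Difficulty contacting")])
print(label)
```

Normalising the completion before matching matters because the model often returns the category with quotes, a trailing full stop or different capitalisation.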

How accurate were the categorisations?

My assessment wasn’t very systematic, but I took a sample of 50 and read through to assess how accurate the categorisation was. I assessed 44 out of 50 (88%) as correct. It was prone to some mistakes, such as mis-labelling some genuine reviews as sarcastic. Tweaking the prompt might have improved this, but I didn’t go back and re-run it all again.
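
With a sample of only 50, it is worth putting a range around that accuracy figure. The sketch below uses the Wilson score interval, which is my choice of method rather than something done in the original analysis, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(44, 50)
print(f"accuracy = {44 / 50:.0%}, 95% interval roughly {low:.0%} to {high:.0%}")
```

For 44 correct out of 50 this gives an interval of roughly 76% to 94%, so the true accuracy could plausibly be quite a bit lower or a little higher than 88%.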

The results… 🥁

Breakdown of categories for different review ratings — Poor customer service and waiting times make up a majority of 1 star reviews.

Breakdown of categories over time — No obvious strong trends especially given relatively small sample size

I found this pretty interesting. We were able to categorise reviews relatively accurately according to a framework that we defined for the AI model.

Text analysis: Using AI to extract recommendations for improvement

Ok so that’s encouraging. But surely the promise of AI is that we can get it to do more than just categorise.

Next we try to extract recommendations as to how we can improve customer service. I just did this with a sample of 50 reviews categorised as ‘Poor customer service’.

The steps were:

Run a prompt asking the AI to suggest changes as to how we could improve service to address the issue raised by the reviewer
This gives us some long sentences with recommended improvements. So we run the AI over these outputs and ask it to summarise them into a few short words
Then give the AI an aggregate of individual recommendations and ask it to make a series of suggestions to improve our service
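
The three steps above can be sketched as a small pipeline. This is an illustrative Python outline, not the R code I ran. `complete` is a hypothetical stand-in for the model call, and the three prompt arguments stand for the prompt texts shown in this section.

```python
from typing import Callable, List


def recommend(reviews: List[str], complete: Callable[[str], str],
              solution_prompt: str, summary_prompt: str, final_prompt: str) -> str:
    """Run the three steps: solution per review, short summary, overall recommendations."""
    # Step 1: one suggested solution per review.
    solutions = [complete(f"{solution_prompt}\nComplaint: {r}\nSolution:").strip()
                 for r in reviews]
    # Step 2: shorten each solution to a few words.
    summaries = [complete(f"{summary_prompt}\nSolution: {s}\nSummary:").strip()
                 for s in solutions]
    # Step 3: one final call over the combined list of summaries.
    bullet_list = "\n".join(f"- {s}" for s in summaries)
    return complete(f"{final_prompt}\n{bullet_list}").strip()


# Stubbed model call that records prompts, so the sketch runs without an API key.
calls = []


def fake_complete(prompt: str) -> str:
    calls.append(prompt)
    return " placeholder output "


result = recommend(["review a", "review b"], fake_complete, "P1", "P2", "P3")
print(len(calls), "model calls;", repr(result))
```

The design means the number of model calls grows as two per review plus one, and only the short summaries are passed to the final prompt, which keeps that prompt small.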

First we run the following prompt over all the reviews

Prompt to generate customer service solutions

Imagine you are in charge of making improvements to customer service in a police station.
You're going to read customer reviews that are about the service and/or how they were treated at the station.

You need to suggest changes or solutions that could address the issue. 

Example: Complaint: The staff at reception were rude to me and didn't seem to take me seriously. 
Solution: Staff should take every complaint or report seriously. 

Example: I was waiting for ages, no one told us anything there was only one person at the reception. 
Solution: More staff at reception at busier times and better communication with customers. 

Complaint: [INSERT TEXT OF COMPLAINT]
Solution: [MODEL COMPLETION]

A sample of the solutions generated by the AI model

Next we get it to create short summaries of all 50 solutions it generated

Prompt to summarise the suggested solutions

Please summarise these suggested changes at the police station in a few words. 

Example: Solution: This customer had a number of issues with their experience at the police station. 
To address these issues, the staff should be more polite and efficient, and better equipped to handle customer inquiries. 
In addition, the police station should make an effort to be more responsive to the needs of women and victims of violence. 
Summary: More polite staff, more understanding of needs of women and victims of violence. 

Example: Solution: Staff should be respectful and helpful to all customers. 
If a customer reports a bad experience, the staff should take the complaint seriously and try to resolve the issue. 
Summary: More respectful, take all complaints seriously. 

Example: Solution: The officers should have used better judgement in deciding whether or not to arrest the individuals. 
If the individuals were not causing any harm and were fully clothed, there was no need for them to be arrested. 
The officers also needs to be more professional and respectful when dealing with individuals, regardless of their age. 
Summary: Exercise judgement before arrest, more professional and respectful. 

Solution: [INSERT TEXT OF SOLUTION TO BE SUMMARISED]

Summary: [MODEL COMPLETION]

A sample of the shortened solution summaries

This gave us brief summaries of suggested solutions to 50 customer service related reviews. The final step was to use these as part of a final prompt to develop the recommendations. Note that because I was only running this once, I used the (better) davinci-003 model.

Prompt to generate a series of recommendations

Imagine you are in charge of making specific and actionable recommendations to improve customer service at a police station. 

The following are a series of summaries of solutions to customer service complaints at our police station. 

There are a lot of them and we need 10 specific and actionable recommendations based on these to make changes to improve our service. 

The summaries of the suggested solutions are as follows: 
[INSERT LIST OF 50 SOLUTION SUMMARIES]

So what did our AI management consultant recommend?

1. Offer more staff at the police station, especially during busy times.
2. Enhance customer service training for all staff to learn how to better communicate with guests and handle customer service inquiries.
3. Improve the communication between staff and customers by summarising complaints and apologising for any inconvenience.
4. Make sure staff members adhere to customer service policies, providing accurate information and taking responsibility for actions.
5. Make sure staff members are properly identified when dealing with the public.
6. Require officers to attend refresher courses on the law, customer service and other skills related to their work.
7. Investigate complaints thoroughly, document them and work with victims to provide helpful advice.
8. Provide more compassionate and understanding staff, better trained to deal with customers.
9. Work to improve customer service by creating a Police Complaints Department and assuring customers that they will not be threatened with arrest for complaining.
10. Make sure conversations with customers have a polite and professional phone manner and staff are more attentive and responsive to customer needs.

Ok so, these aren’t the most groundbreaking of recommendations but they might be a helpful starting point and are based directly on customer feedback. I also think that playing around with the ‘solution generation’ prompt could have generated more creative ideas.

So there we are. We’ve managed to analyse a batch of complaints and extract some actionable changes based on the data.

Extracting important information

Another use case for AI is to extract key information from text data. Reading through some reviews I occasionally noticed that the reviewer would mention names of officers. I ran a prompt over the reviews to extract the names of any officers mentioned in the review. I ran this over a sample of 150 reviews.

Prompt to identify named officers

The following is a review about a police station. 
If the name of an officer is mentioned in the review, please identify the name of the officer or officers. 
Return the exact names of any officers mentioned. 

Example 1: Review: On the 9th of February I had an argument with police officer Barry Jones in the station who was very rude to me. 
Specific officers named: Barry Jones. 

Example 2: Review: The police didn't investigate at all after I reported my bike stolen!! 
Specific officers named: none. 

Example 3: Review: I had a terrible experience with an officer who pulled me over in my car outside kings cross station and gave no reason for doing so. 
I need this to be investigated. 
Specific officers named: none. 

Example 4: Review: When I went to the station in Brixton I tried to speak to the officer investigating the burglary at my house. 
Ed Smith the officer I spoke to was really helpful, thanks. 
Specific officers named: Ed Smith.

Example 5: Review: Horrible horrible experience. 
I went there to complain about someone and the police officer in the reception was very rude and was kinda blaming Me. 
She did not even ask for any relevant question. It was just a waste of time. worst police station ever!!!! 
Specific officers named: none. 

Example 6: Review: The constable didn't even help me or try to understand my point of view! 
Very poor service and lots of lies, they never update you on things! 
Specific officers named: none. 

Review: [INSERT REVIEW HERE]
Specific officers named: [MODEL COMPLETION]

How does it perform?

I didn’t read all 150 reviews thoroughly, so I can’t say for sure that there were no false negatives. It returned names in 11 out of 150 reviews. Four of those correctly identified the names of officers.

The names in the remaining reviews were those of reviewers who signed off their review, another person related to the incident and, in two cases, ‘Old Bill’, which the AI took to be a person 😆.
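
Some of these false positives could be caught with light post-processing of the model output. The Python sketch below is an assumption about how one might do that, not something I ran: `parse_officer_names` and its `not_names` list of slang terms are hypothetical. It would not catch reviewers signing off with their own name.

```python
import re


def parse_officer_names(completion, not_names=("old bill",)):
    """Turn a completion such as ' Barry Jones.' into a list of names.

    Returns an empty list for 'none' and drops entries on the not_names list.
    """
    cleaned = completion.strip().rstrip(".")
    if cleaned.lower() in ("", "none"):
        return []
    # Split on commas or the word 'and' to handle several officers.
    names = [n.strip() for n in re.split(r",|\band\b", cleaned)]
    return [n for n in names if n and n.lower() not in not_names]


print(parse_officer_names(" Barry Jones."))
print(parse_officer_names(" none."))
print(parse_officer_names(" Old Bill."))
```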

What are the key takeaways?

The volume of Google Maps reviews for police stations is low, limiting their value
Reviews may give user insights not captured in standard police feedback channels
AI models show very promising capabilities for analysis of unstructured text:

Prompt engineering matters a lot: Iterating the prompts you give the model can dramatically improve the results. Much as with providing instructions to a person, context and examples seem to help a lot. I’m sure that this is going to become its own niche art/science.
Fine tuning models should improve accuracy a lot: I didn’t try to do this, but OpenAI recommends fine tuning a model specific to your use case by training it on prompts and exemplar solutions.
The AI can be given a framework for analysis: Just as you might write an analysis protocol for analysing qualitative data, you can instruct the AI by giving it a role, context and examples to get the best results.
Model outputs do need human checking: As with human analysis, you need to QA the results. I found it to be pretty good, but it was prone to mistakes, for example mis-labelling some genuinely positive ‘thank you’ reviews as sarcastic. Or maybe I was wrong and it was right?! 😕