SteamVox: Data Science for Learning What Players Say and Feel

Alfred Tang
13 min read · Aug 5, 2019


For when you absolutely have to draw conclusions from 10,000 reviews in a single day.

NOTE: This is a non-technical writeup. For the detailed technical writeup, click here.

SteamVox in a Nutshell

I built SteamVox to obtain the “Voice of the Player” from players’ reviews on Steam (one of the largest online distributors of PC games).

SteamVox scrapes, cleans, and analyses reviews from Steam to identify topics, and the sentiment attached to each topic, for a chosen game.

Using the power of machine learning, it delivers convenient snapshots and narrows down areas of negativity to investigate (and areas of positivity to celebrate!).

A future iteration will be made applicable across most games on Steam.

This project is hosted on GitHub at:

One of the world’s biggest PC game distribution platforms.

Voice of the Player

If you could get a summary of your customers’ opinions on your product’s features with just a click or two, you could quickly address any changes in sentiment by solving the correct problem associated with that feature.

I stress “correct problem”. Passionate developers often build their idealised version of their game and have to course-correct to deliver what players want.

Intended or not, the effect is the same in the end. Players who dislike your features will quit very quickly.

The Problem

Time for doing anything comes at a premium in the video game industry:

  1. Game developers are always pressed for time.
  2. Game publication is time-sensitive.
  3. Reviews are important, but reading reviews is time-consuming.

Game developers are pressed for time dealing with these problems.

Source: Petrillo et al. (2009); see the References below.

Players won’t keep spending money while features are less-than-satisfactory!
Would you spend a whole month just manually processing reviews?

Problem Statement

We want to get actionable conclusions (backed by data) ASAP.

SteamVox Built to Answer a Real-World Business Problem

In my previous job, I encountered precisely the problem described above.

Due to time constraints, we had to make do with what we could get by talking to a small group of players and quickly skimming some forums.

The conclusions derived did not inspire confidence because the sample did not represent our playerbase well.

Looking back, I believe we could have done much better.

SteamVox is a Solution

SteamVox delivers insights that shorten the time taken to improve features.

Inspired by DoctorSnapshot by Nuo Wang, I developed my own methodology for generating sentiment snapshots for Steam games.

I decided to test my concept on one game first, before scaling up. I selected Total War: Three Kingdoms, a strategy game from the Total War series.

SteamVox Anatomy

  1. Scrape Steam Reviews
  2. Cleaning & Tokenisation
  3. Topic Modelling
  4. Sentiment Analysis
  5. Scoring & Aggregation

Output: Tableau Dashboard

SteamVox deployed as a Tableau Dashboard
Sort by score and click on any bar to see the review, dominant topic, and sentiment score.

Steam Reviews as a Dataset

Steam reviews can be downloaded directly from Steam through an API.

When using reviews, take note:

  1. Reviewers only make up a portion of your player base. The rest thrive/suffer/quit in silence.
  2. Paid reviews tend to be longer and more usable. Free-to-play (F2P) games will require a modified approach.

Paid games get fewer spam reviews

To illustrate, I tested out my review qualification criteria on Counter-Strike: Global Offensive, now an F2P game.

I scraped 92,074 reviews (1 year’s worth) and found that for the whole year, only 6,000 reviews had at least 5 words in them.

In contrast, Total War: Three Kingdoms has 3,661 usable reviews out of 8,235 available. At the time of writing, it had been available on Steam for 2 to 3 months.

The Internet is for Spam

Steam reviews are similar to social media posts, which is exactly the kind of text the VADER SentimentIntensityAnalyzer is designed to read.

However, there are issues that plague all informal text found on the Internet:

  1. Spam and memes.
  2. Most reviews are incredibly short.
  3. Gamers fully demonstrate their proficiency in the ancient and mystic art of sarcasm when they are upset. This throws VADER off.
  4. Players may spam the same review repeatedly.
  5. Some reviewers do not use punctuation.

There is no easy fix, but we can minimise such issues with effective data cleaning.

Total War: Three Kingdoms Reviews

  • 8,235 English Steam reviews
  • 8,150 unique reviews
  • 3,661 determined usable for training

3,661 reviews may not sound like much, but the total word count exceeds 100,000 words!

For reference, the New Testament of the Bible has around 184,000 words.

Step 1: Scrape Steam Reviews

An awesome Internet citizen (woctezuma on GitHub) developed steamreviews, a Python package that allows you to download reviews from Steam directly.
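Based on the package’s documented usage, the download step looks roughly like this (the request parameters are illustrative, and the app ID is what I believe to be Total War: Three Kingdoms’, so verify it against the store page URL before running):

```python
import steamreviews

# Only fetch English reviews; see the steamreviews README for other
# supported request parameters.
request_params = {'language': 'english'}

# App ID for Total War: Three Kingdoms (check the store page URL:
# store.steampowered.com/app/<app_id>/...).
app_id = 779340

# Downloads every available review for the app (with on-disk JSON caching)
# and reports how many API calls were made.
review_dict, query_count = steamreviews.download_reviews_for_app_id(
    app_id, chosen_request_params=request_params)

# review_dict['reviews'] is keyed by recommendation ID; each value holds
# the review text plus metadata such as playtime and vote counts.
reviews = [entry['review'] for entry in review_dict['reviews'].values()]
```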

I started this project during the Steam Summer Sale of 2019 and ran into numerous scraping failures.

After adding my own error-handling clauses to identify and troubleshoot the failures, I traced them to 502 Bad Gateway errors.

It seems that Valve throttles traffic on Steam during big sales to prevent the epic site crashes of yesteryear.

I asked woctezuma to make 2 improvements:

  1. Improve error handling to complete the review scraping even if there are 502 Bad Gateway errors.
  2. Add an option to show error reports.

woctezuma has added those features and made steamreviews better for everyone. Kudos!

Step 2: Clean and Tokenise

This is arguably the most important step for Latent Dirichlet Allocation (LDA), and you will revisit it multiple times as you seek out a coherent model.

LDA is not a simple plug-and-play algorithm that accepts a list of reviews as scraped from Steam. We have to process the text into a format that it can read.

Garbage in, garbage out.

We want to:

  1. Get coherent reviews
  2. Clean those reviews into a format the LDA model can process

Here’s what that looks like:

Essentially, we want to turn reviews into lists of keywords.

General process of cleaning and tokenisation:

  1. Filter reviews by criteria to maximise the number of usable reviews
  2. Transform reviews into lists of important terms
  3. Identify phrases (n-grams)
  4. Lemmatise each token and retain only nouns

2–1. Filtering the Data

4 conditions for qualifying a review as usable
Before filtering, the shortest review is just 1 word long!
After filtering, reviews are much meatier and more likely to be usable.

2–2. Tokenisation

Before the LDA Model can read any text, you have to clean the text in each review and transform it into a list of keywords, as shown earlier.

Cleaning Steps

Cleaning these 6 things up makes the text more machine-readable
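The six cleaning steps themselves are shown in the image above; as a minimal sketch of this kind of cleaning (my own illustration, using gensim’s simple_preprocess and NLTK’s stopword list, not necessarily the exact steps SteamVox takes):

```python
import re

from gensim.utils import simple_preprocess
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

STOPWORDS = set(stopwords.words('english'))

def tokenise(review: str) -> list:
    """Turn one raw review into a list of cleaned keyword tokens."""
    # Strip URLs and Steam BBCode-style markup such as [h1]...[/h1].
    text = re.sub(r'https?://\S+', ' ', review)
    text = re.sub(r'\[/?\w+\]', ' ', text)
    # simple_preprocess lowercases, removes punctuation and digits,
    # and drops extremely short or long tokens.
    return [t for t in simple_preprocess(text, deacc=True)
            if t not in STOPWORDS]

# `reviews` is the list of raw review strings from Step 1.
token_lists = [tokenise(review) for review in reviews]
```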

2–3. Identify phrases

It’s important to identify phrases because they will make your topic model more coherent.

Many words consistently appear together, such as ‘total’ and ‘war’ in this dataset:

‘total’ and ‘war’ appear together a lot in this Word Cloud. ‘three’ and ‘kingdom’ as well.
Makes more sense to put them together as ‘total_war’ and ‘three_kingdom’!

After making phrases, you can be more confident that the remaining single-token terms genuinely occurred on their own. This makes it easier to differentiate topics.

To tokenise phrases, we turn them into single tokens using gensim’s Phraser:

Visual representation of n-grams
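A minimal sketch with gensim’s Phrases and Phraser (min_count and threshold are illustrative values that control how aggressively co-occurring words are merged):

```python
from gensim.models.phrases import Phrases, Phraser

# Learn bigrams such as 'total_war' from the tokenised reviews.
bigram = Phraser(Phrases(token_lists, min_count=5, threshold=10))

# A second pass over the bigrammed corpus merges longer phrases,
# yielding trigram-level tokens where pairs keep co-occurring.
trigram = Phraser(Phrases(bigram[token_lists], min_count=5, threshold=10))

phrased = [trigram[bigram[tokens]] for tokens in token_lists]
```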

2–4. Lemmatise tokens and retain nouns only

Words appear in many inflected forms, and those forms often derive from the same lemma.

For example:

inflected: “playing”, “played”, “plays”

lemma: “play”

By reducing them to their lemma, “play”, we get 3 of the same word. This reduces noise without losing information.
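The post doesn’t name the lemmatiser used, so here is one way to do this step with spaCy, keeping nouns and passing the underscore-joined n-grams through untouched:

```python
import spacy

# Small English model; only the POS tagger and lemmatiser are needed.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatise_nouns(tokens: list) -> list:
    """Reduce tokens to their lemmas, keeping nouns and n-gram tokens only."""
    kept = []
    for token in nlp(' '.join(tokens)):
        if '_' in token.text:          # n-grams like 'total_war' pass through
            kept.append(token.text)
        elif token.pos_ in ('NOUN', 'PROPN'):
            kept.append(token.lemma_)  # e.g. 'battles' -> 'battle'
    return kept

lemmatised = [lemmatise_nouns(tokens) for tokens in phrased]
```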

I prepared 3 variations to try out in Topic Modelling:

  1. All types of words
  2. Nouns, verbs, and 3-grams only
  3. Nouns and 3-grams only

Step 3: Topic Modelling

For Topic Modelling, I used Latent Dirichlet Allocation.

What is Latent Dirichlet Allocation?

In NLP, LDA is a machine learning algorithm that explains observations in the dataset using latent (hidden) variables derived from the observations themselves.

LDA in Plain English

Let’s say you are given news articles without headlines or category labels, but with body text intact. Given no additional information, you are asked to label the categories.

One solution is to read the documents and find out what they’re about, then label them.

LDA identifies topics using the words in each document

As you read each document, you identify the topics within it. Essentially, that is the task LDA performs.

Next, using LDAvis, we can generate a visualisation for our topic model.
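A minimal gensim sketch of training and visualising the model (the hyperparameters are my assumptions; only the 5-topic count is taken from the final run described below):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # 'pyLDAvis.gensim' in older releases

# Build the vocabulary and bag-of-words corpus from the cleaned token lists.
dictionary = Dictionary(lemmatised)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop rare/common terms
corpus = [dictionary.doc2bow(tokens) for tokens in lemmatised]

# Train the topic model; 5 topics matches the final SteamVox run.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
               passes=10, random_state=42)

# Generate the interactive intertopic-distance visualisation.
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, 'steamvox_lda.html')
```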

1st Run: All Types of Words

This model allows all types of words

There is little overlap between topics at first glance, but inspect the relevant terms on the right and the topics will seem less coherent.

To further clean the data, I removed both rare and overly common terms, and other terms that seemed to only create noise.

12th Run: Nouns and n-grams only

After intensive cleaning and retaining only nouns, I got this largely coherent topic model.

I identified 5 topics that made sense to me as a human: Strategic Gameplay, Tactical Gameplay, Content, Authenticity, and Replayability.


I decided not to get too granular with the topics because the model is just meant to quickly identify general areas to investigate and improve.

Bonus Round: Nouns, Verbs, and 3-grams

Visually similar, less coherent

Despite running the model with both verbs and nouns at least 5 times, the results did not appear more coherent than the nouns-only model.

However, it would be hasty to exclude the Nouns-and-Verbs dataset from future analysis.

Verbs could still be useful when identifying more general features using a large Steam dataset that contains reviews for thousands of games.

For example, some verbs that mapped cleanly to topics:

  • Replayability: 'replay'
  • Content: 'read', 'feel', 'connect'

Nouns & n-grams Most Effective

Topics are nouns themselves, and we look for the nouns most associated with them because they carry the most information.

n-grams also act like nouns in this case.

Which type of word is more informative about the topic?

Model Validation: Overview

Models like LDA do not have built-in validation reports, because there are no labels in the original dataset to check against.

To validate my model, I randomly sampled 10% of the dataset and manually checked the labels using 2 different sets of conditions.
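As a rough illustration, assuming the classified reviews sit in a pandas DataFrame with a column for the assigned dominant topic (the variable and column names here are mine), the sampling step could look like:

```python
# df: pandas DataFrame, one row per review, with the model's
# assigned dominant topic. Draw a reproducible 10% sample for
# manual label checking.
validation_sample = df.sample(frac=0.10, random_state=42)

# Export text and assigned topic side by side for manual review.
validation_sample[['review_text', 'dominant_topic']].to_csv(
    'validation_sample.csv', index=False)
```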

Validation Findings

The model was able to assign topics correctly 85% of the time.

The lower the token count, the worse the classification.

Rows with 0 tokens were classified inaccurately most of the time.

This is expected. If I asked you to guess what I was thinking and I didn’t give you a single word, you wouldn’t have enough information to do so.

I also checked performance on the sample dataset across all token counts:

Note the red lines for 25th, 50th, and 75th percentiles for token counts

Reviews with 0 to 2 tokens tended to be misclassified more than those with ≥ 3 tokens.

At this point, I dropped all reviews with 1 or 0 tokens.

I was unwilling to drop everything up to 2 tokens as well: for only a marginal decrease in misclassification rates, we would have to sacrifice a whole 25% of the usable dataset. That seemed like a bad tradeoff to me.

In summary:

  • The model learned 5 topics, and all were coherent topics corresponding to game features of Total War: Three Kingdoms.
  • The model was able to assign the dominant topics of reviews accurately 85% of the time; higher with 0- and 1-token reviews removed.
  • It makes no sense to classify reviews with 0 or 1 tokens, so they are excluded entirely from the final aggregation.

Step 4: Sentiment Analysis

Now that we know what players are talking about, we want to know how they feel about each topic.

Let’s pause here to address a critical question.

Why not use the provided sentiment label?

Steam reviews force the user to give a positive or a negative verdict; there is no middle ground.

This simplifies data analysis, but in reality, most people have some level of mixed opinion.

Even if most reviews are “mostly” positive, I didn’t want to stop at an “overall positive” label for the whole review, for reasons discussed in the next section.

Here’s a preview of VADER at work:

Yes, my instance of the VADER SentimentIntensityAnalyzer is named “Anakin”.

VADER takes the text that you feed it (including punctuation and emotes) and analyses the sentiment behind the words. Out of the 4 scores it outputs, we just need the compound score to get the overall sentiment rating.
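A minimal example of VADER in action (the analyser name follows the screenshot above; the sample text is my own):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

anakin = SentimentIntensityAnalyzer()

scores = anakin.polarity_scores("Great campaign, but sieges feel broken :(")
# polarity_scores returns {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...};
# 'compound' is the normalised overall sentiment in [-1, 1].
compound = scores['compound']
```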

Caveats:

  • We have to assume that (at least for paid games) most players have a vested interest in making the developers improve the game.
  • VADER performs better when it is fed more text.

VADER’s compound score can produce good sentiment ratings on long documents! (This is the same review.)

Step 5: Scoring & Aggregation

Sample output: tokenised and scored by sentences. Note that the classification is not highly accurate.

After we identify topics and score each document, we need to aggregate scores across the whole dataset in order to generate snapshots.

Preprocessing for Aggregation

The compound score returned by VADER is a decimal. For this snapshot, however, that level of precision is more than we need.

Histogram of sentiment scores for sentences in a test review

For generating a snapshot of how many reviews/sentences in a dataset are positive, the exact value of the score is not needed.

We just need to know whether it’s positive, neutral, or negative (represented by 1, 0, and -1 respectively).

I decided to change the sentiment scores to integers:

  • Score > 0.1 → 1 (Pos)
  • -0.1 ≤ Score ≤ 0.1 → 0 (Neutral)
  • Score < -0.1 → -1 (Neg)
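In Python, that bucketing rule fits in one small function:

```python
def to_sentiment_class(compound: float) -> int:
    """Collapse a VADER compound score into 1 (Pos), 0 (Neutral), or -1 (Neg)."""
    if compound > 0.1:
        return 1
    if compound < -0.1:
        return -1
    return 0
```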

Aggregation Logic

Here, we score and aggregate by paragraphs, because that is close to how humans read reviews: readers identify topics and gauge the reviewer’s sentiment paragraph by paragraph as they work through the document.

In the end, readers may generally recall, “This person seems to really like the strategic gameplay. They also liked tactical gameplay and authenticity”.

Score each review, then aggregate for the whole dataset

This method uses Syntok, a specialised text segmentation tool. Using Syntok, I parsed the data into paragraphs and pseudo-paragraphs, all of which were more coherent than the short sentences I had worked with up to that point.
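Based on Syntok’s documented API, the paragraph-splitting step looks roughly like this (the pseudo-paragraph logic mentioned above is not reproduced here):

```python
import syntok.segmenter as segmenter

def to_paragraphs(review: str) -> list:
    """Split one raw review into paragraph strings using Syntok."""
    paragraphs = []
    for paragraph in segmenter.process(review):
        # Each paragraph is a list of sentences; each sentence is a list
        # of tokens that carry their original spacing.
        text = ''.join(token.spacing + token.value
                       for sentence in paragraph
                       for token in sentence)
        paragraphs.append(text.strip())
    return paragraphs
```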

This was the final result, visualised in Python:

Final output, plotted with pandas DataFrame.plot()

I was more confident about using these results than those from previous attempts because:

  • Aggregation logic follows how a human reads and scores reviews
  • Longer sentences have good classification rates

How to Use the Data

SteamVox on Tableau

How to use the data (Tableau example):

  1. Narrow down which features to pay most attention to (most negative)
  2. Sort Original Text by Compound Score, then click on any bar to read the review/paragraph.

Insights and How to Use Them

In the end, it’s up to a human to make sense of the data.

I inspected reviews for the most negative category (Strategic Gameplay) first and sorted the most negative ones to the top.

For Total War: Three Kingdoms, Strategic Gameplay has the highest proportion of negative reviews.

That doesn’t necessarily mean it’s badly built. Numbers aren’t everything; context matters as well.

Sample negative review (Strategic Gameplay)

This reviewer is unhappy with how the Diplomacy system works in Total War: Three Kingdoms, but that doesn’t mean it is objectively bad.

Sample positive review (Strategic Gameplay).

Put in context, the Three Kingdoms era was full of impermanent diplomatic deals and alliances.

The developers built Total War: Three Kingdoms to maximise immersion, and they included the era’s political upheavals in their game design.

It may be that the player doesn’t agree with the design due to their personal preferences.

Positive review by a Chinese player (Content & Authenticity)

If Creative Assembly’s aim was to target the Chinese PC gaming market, they may have succeeded. Most reviews focus on the game’s authenticity and on how well the studio delivered an authentic and fun Three Kingdoms experience.

Another negative review for Strategic Gameplay

The above review presents legitimate concerns about Strategic Gameplay. The chief complaint is that there are bugs in the diplomacy system and on the campaign map.

With this information on hand, developers can quickly pinpoint what exactly to investigate and fix.

Conclusion

Doing this project challenged me to pick up and apply machine learning to a real-world problem using real-world data, the gathering and cleaning of which came with numerous real-world challenges.

It was also immensely satisfying to combine the different skills in my diverse skill set.

Thank you for reading, and stay tuned for more!

My work has not ended; there is much more room for improvement.

Future Work (in progress):

  • Add timestamps and enable aggregation by time period
  • Add a review age field (reviews become less relevant as new updates are released)
  • Make the model more generally applicable to most/all Steam games

References

Liu, S. (2019) Dirichlet distribution. https://towardsdatascience.com/dirichlet-distribution-a82ab942a879

Amazon Web Services (2019) How LDA Works. https://docs.aws.amazon.com/sagemaker/latest/dg/lda-how-it-works.html

Clark, S. (2013) Topic Modelling and Latent Dirichlet Allocation. Machine Learning for Language Processing: Lecture 7. https://www.cl.cam.ac.uk/teaching/1213/L101/clark_lectures/lect7.pdf

Pandey, P. (2018) Simplifying Sentiment Analysis using VADER in Python (on Social Media Text). https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f

cjhutto. vaderSentiment Documentation. https://github.com/cjhutto/vaderSentiment

SIL International. SIL Glossary of Linguistic Terms: Valency. https://glossary.sil.org/term/valency

Petrillo, F., Pimenta, M., Trindade, F. and Dietrich, C. (2009) What went wrong? A survey of problems in game development. ACM Computers in Entertainment, 7(1).

About the Author

Alfred Tang is a writer-turned-data scientist with 3.5 years’ experience as a copywriter and sales optimisation analyst in the video game industry.

He comes from a diverse educational background, having studied data science, finance, hotel administration, and mechanical engineering.

In this project, he put on different hats, drawing on his experience as a writer, analyst, polyglot, and gamer to create this model and its aggregation logic.
