SteamVox: Technical Writeup

Alfred Tang
27 min read · Aug 3, 2019


All the nitty-gritty you need to get started using LDA.

NOTE: This is a detailed technical writeup meant to describe the logic behind every decision made. The shorter version is here.

About the Author

Alfred Tang is a writer-turned-data scientist with 3.5 years’ experience as a copywriter and sales optimisation analyst in the video game industry.

He comes from a diverse educational background, having studied data science, finance, hotel administration, and mechanical engineering.

In this project, he puts on different hats, drawing on his experience as a writer, polyglot, analyst, and gamer to create this model and its aggregation logic.

Introduction

SteamVox is a model for identifying topics and analysing sentiment for player reviews on Steam.

Steam is an app store for PC game distribution, one of the biggest in the world, if not the biggest.

“Vox” is Latin for “voice” and refers to “Vox populi, vox dei”: “The voice of the people is the voice of God.”

This project is hosted on GitHub at:

One of the world’s biggest PC game distribution platforms.

The business problems behind this project are discussed in the non-technical article.

SteamVox Overview

SteamVox delivers insights that shorten the time taken to improve features.

SteamVox Anatomy

  1. Scrape Steam Reviews
  2. Cleaning & Tokenisation
  3. Topic Modelling
  4. Sentiment Analysis
  5. Scoring & Aggregation

Toolbox

  1. Scraping — steamreviews, a package by woctezuma
  2. Cleaning & Tokenisation — NLTK, SpaCy, gensim, Syntok
  3. Topic Modelling — Latent Dirichlet Allocation, ldaMulticore
  4. Sentiment Analysis — VADER SentimentIntensityAnalyzer
  5. Scoring & Aggregation — Python, Tableau

Output: Tableau Dashboard

SteamVox deployed as a Tableau Dashboard
Sort by score and click on any bar to see the review, dominant topic, and sentiment score.

Current State

SteamVox v0.1 supports only Total War: Three Kingdoms, as it is a proof of concept.

Future versions will be expanded to be more generally applicable across Steam.

Steam Reviews as a Dataset

Steam reviews for each game are stored as JSON files. This is a convenient format for Python, because it can be read in as a Python dictionary.

The Internet is for Spam

Steam reviews are similar to social media posts. This is good for the VADER SentimentIntensityAnalyzer, which is designed to analyse such text.

However, there are issues that plague virtually all informal text found on the Internet:

  1. Spam and meme reviews are common and often hold no meaning.
  2. Most reviews are incredibly short (fewer than 5 words).
  3. Gamers fully demonstrate their proficiency in the ancient and mystic art of sarcasm when they are upset (this throws VADER off).
  4. Players may spam the same review over and over, sometimes in coordination with other reviewers, to “review bomb” a game.
  5. Some reviewers do not use punctuation. This can affect tokenisation.

No easy fix exists for these problems, but we can minimise the issues faced by using effective cleaning and tokenisation techniques.

Paid games get fewer spam reviews

It seems that paying players have more incentive to give serious feedback.

I tested out my review qualification criteria on Counter-Strike: Global Offensive, a free-to-play (F2P) game.

I scraped 92,074 reviews (1 year’s worth) and found that only 6,000 reviews had at least 5 words in them.

Total War: Three Kingdoms, on the other hand, had 3,661 such reviews from just 2 months on Steam.

Total War Reviews

  • 8,235 English Steam reviews
  • 8,150 unique reviews
  • 3,661 determined usable for training

3,661 reviews may not sound like much, but the total word count exceeds 100,000 words!

For reference, the New Testament of the Bible has around 184,000 words.

Step 1: Scrape Steam Reviews

An awesome Internet citizen by the username of woctezuma developed steamreviews, a package available on PyPI. It allows you to query reviews from the Steam API directly with just a few lines of code.

To use it, just type the following in your Anaconda Prompt:

pip install steamreviews

To use it, import it like you would any other Python package:

import steamreviews

Then follow the instructions here on the main page.
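
As a rough sketch, a one-year, English-only scrape looks something like this (the app ID and request parameters below are examples; check the package README for the full list of options):

import steamreviews

# Parameters are passed straight through to the Steam reviews API
request_params = dict()
request_params['language'] = 'english'   # only reviews marked as English
request_params['day_range'] = '365'      # roughly one year of reviews

app_id = 779340   # Total War: THREE KINGDOMS (the ID appears in the store page URL)
review_dict, query_count = steamreviews.download_reviews_for_app_id(
    app_id, chosen_request_params=request_params)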

NOTE 1: If you want to get verbose progress reports, you need to pass the parameter verbose=True (by default, verbose=False).

Sometimes the Steam API cannot be queried, and you may want to be informed of errors.

steamreviews.download_reviews_for_app_id(app_id,
                                         query_count=query_count,
                                         chosen_request_params=request_params,
                                         verbose=True)

NOTE 2: During high-traffic periods e.g. Steam sales, you may have to run the query multiple times (or wait for the sale to end) to get your dataset.

Valve throttles traffic on Steam during sales, so every time you access a page, there is a chance of getting a 502 Bad Gateway Error.

I started this project during the Steam Summer Sale of 2019 and had trouble running this package. I would get incomplete review datasets that did not go back the 365 days I specified. However, it worked fine outside of the sale period.

After adding my own error handling clauses to identify and troubleshoot this problem, I asked woctezuma to make 2 improvements:

  1. Improve error handling so that the API querying process does not end upon encountering a single 502 Bad Gateway error.
  2. Add option for verbose error reports so you can choose to be informed of such errors.

woctezuma has added those features and made steamreviews better for everyone. Kudos!

Step 2: Clean and Tokenise

This is arguably the most important step in this project. It will be revisited multiple times as you tweak the model for coherence.

There is a heavy focus on cleaning and tokenisation in this article because LDA is not a simple plug-and-play algorithm that accepts a list of reviews as scraped from Steam.

Garbage in, garbage out.

We want to:

  1. Get coherent reviews
  2. Clean coherent reviews for LDA Model to process

Here’s what that looks like:

Essentially, we want to turn reviews into lists of keywords.

General process of cleaning and tokenisation:

  1. Filter reviews by criteria to maximise number of usable reviews
  2. Tokenise reviews into lists of important terms
  3. Identify phrases (n-grams)
  4. Lemmatise each token and retain only nouns

2–1. Filtering

Raw data

The shortest review is just 1 word long!

2–1 (i). Playtime filtering

The shortest review is still 1 word long.

Playtime conditions:

  • Player must have played at least 3 hours
  • Player must have played at least 10 minutes in the 2 weeks leading up to publication of review

Steam allows refunds for up to 2 hours of game time, and opinions are more valid if the player has clocked playtime recently.
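
In pandas, with the scraped JSON flattened into a DataFrame, that filter might look like the sketch below. The structure of review_dict and the author.* column names follow my reading of the Steam API’s review payload (playtimes are reported in minutes); adjust them to your data.

import pandas as pd

# One row per review; nested 'author' fields become 'author.*' columns
df = pd.json_normalize(list(review_dict['reviews'].values()))

played_enough = df['author.playtime_forever'] >= 3 * 60        # at least 3 hours in total
played_recently = df['author.playtime_last_two_weeks'] >= 10   # at least 10 minutes in the last 2 weeks
df = df[played_enough & played_recently]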

Even after dropping more than 3000 reviews, there was one problem: the shortest review was still 1 word long. That’s not usable.

2–1 (ii). Review length (in words) filtering

Now, we can be more certain that our dataset is full of usable reviews.

I decided to drop any reviews whose word counts were equal to or shorter than…

“Best Total War game ever.”

This sentence happens to be 5 words long, and 25% of the data was 1 to 5 words long. It can at least tell you that the player likes this game and that it has earned its place in the Total War series.

However, I wanted meatier, more informative reviews, so I decided to drop these reviews and anything shorter.

Dropping reviews containing 5 or fewer words, I essentially traded 25% of the data to ensure that the rest of the data was usable. It was worth it.
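
A minimal sketch of that cut, assuming the review text sits in a ‘review’ column:

# Count whitespace-separated words and keep only reviews longer than 5 words
df['word_count'] = df['review'].str.split().str.len()
df = df[df['word_count'] > 5]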

2–1 (iii). Remove duplicates

No duplicate data remains.

Sometimes, by chance, 4 different people will type exactly the same thing, and that’s fine. We just don’t want to include reviews if they are spammed by the same person. Duplicates are more noise than data.
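
With the flattened dataframe from above, that amounts to dropping rows where the same account posted the same text (column names as before are assumptions about the Steam payload):

# Keep one copy of each (author, review text) pair to drop same-person spam
df = df.drop_duplicates(subset=['author.steamid', 'review'])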

2–1 (iv). Language filtering

SpaCy langdetect at work (detection by sentence) on the first of 3 randomly selected reviews

I designed SteamVox to analyse English reviews, so naturally, I needed a dataset of English reviews. During the scraping step, I set steamreviews to scrape only reviews marked as being in English, but I wanted to be absolutely sure.

I manually inspected 3 randomly selected reviews and found that 1 of them did in fact contain 3 languages: English, Turkish, and German. If 1 review can have 3 languages, surely there are more such reviews. (There were others with Chinese and other languages.)

Using SpaCy langdetect, we can detect languages by sentences in a document, or for the whole document in general. However, nothing is perfect.

False positives detected!

SpaCy langdetect tends to detect more false positives when detecting languages by sentence.

Same review, better result.

On the same document, SpaCy langdetect returned English only when detecting the language of the whole document.

Think of it as a tool that gets the mean probability score for each language detected in the document. It returns only the language with the highest mean probability score, greatly reducing the chances of getting false positives.

I decided to filter languages at the document level, so it would only filter out those that were not mainly in English. Any remaining documents would likely only have small amounts of non-English text, which could be filtered out based on their (in)frequency.
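
A sketch of that document-level check with spacy-langdetect (spaCy 2.x pipeline style; the ‘review’ column is the same assumption as before):

import spacy
from spacy_langdetect import LanguageDetector

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

def is_mainly_english(text):
    # Whole-document detection, e.g. {'language': 'en', 'score': 0.98}
    return nlp(text)._.language['language'] == 'en'

df = df[df['review'].apply(is_mainly_english)]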

Only 1.77% of reviews had to be dropped for not being mainly in English!

2-2. Tokenisation

Before the LDA Model can read any text, you have to clean the text in each review and transform it into a list of keywords.

Cleaning Steps

i. Remove BBCode (a markup language used by Steam)

ii. Expand contractions

iii. Remove punctuation and split reviews into lists of words

iv. Convert single digits and Roman numerals to words

v. Clean out incoherent/unhelpful/spam words

2–2 (i). Remove BBCode

There’s a convenient BBCode parser package (linked above) for Python. I thought of using Regex to remove all the markup, but the BBCode package already does it. No need to reinvent the wheel.

2–2 (ii). Expand contractions

Contractions and other shorthand are very common in informal text. It’s better to expand contractions so that they can more easily be removed when cleaning out stopwords (words that do not have much meaning by themselves, e.g. “of” and “the”).
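
One simple way to do this is a lookup of common contractions. A minimal sketch of such an expandContractions() helper (the mapping below is a tiny illustrative subset):

import re

# Tiny illustrative subset of a contraction map
CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "i've": "i have",
}

def expandContractions(text, mapping=CONTRACTION_MAP):
    pattern = re.compile('|'.join(re.escape(k) for k in mapping), flags=re.IGNORECASE)
    return pattern.sub(lambda match: mapping[match.group(0).lower()], text)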

2–2 (iii). Remove punctuation

Testing out the regular expression on Regex101

You can split a string on non-word characters so it will return ['I','like','Total','War','3','Kingdoms']. That makes the tokens easier to transform.
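
In code, that split is a one-liner (the pattern matches runs of non-word characters):

import re

# Split on runs of non-word characters and drop empty strings
tokens = [t for t in re.split(r'\W+', "I like Total War: 3 Kingdoms!") if t]
# ['I', 'like', 'Total', 'War', '3', 'Kingdoms']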

2–2 (iv). Convert single digits and Roman numerals to words

We want to capture phrases such as 'shogun_two' and 'three_kingdoms' , and we can only do so if we convert single digits and Roman numerals into their word forms.

‘2’ → ‘two’

However, the Roman numeral “I” is skipped because it conflicts with the English pronoun “I”. It’s not a problem anyway, because first games in a series are often mentioned without the number.
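
A sketch of that conversion, using a small lookup table (only the numerals likely to appear in game titles are mapped, and ‘I’ is deliberately left out):

# Single digits and common Roman numerals; 'I' is skipped because it collides with the pronoun
NUMBER_WORDS = {'1': 'one', '2': 'two', '3': 'three', '4': 'four', '5': 'five',
                '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine',
                'ii': 'two', 'iii': 'three', 'iv': 'four', 'v': 'five'}

def numerals_to_words(tokens):
    return [NUMBER_WORDS.get(tok.lower(), tok) for tok in tokens]

numerals_to_words(['Shogun', '2'])  # ['Shogun', 'two']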

2–2 (v). Clean out incoherent/spam/unhelpful words

Because of the informal nature of Steam reviews, people do not write in a format that many Natural Language Processing (NLP) toolkits are trained to handle. We need to convert tokens into terms that such toolkits are equipped to process.

For example, there are Internet memes such as “REEEEEEEEEEEEEEEE”, where there are no hard rules on the capitalisation or number of letters.

To control the occurrence of such words, we retain only words that are longer than 1 character and at most 45 characters long:

1 → ‘a’

45 → ‘pneumonoultramicroscopicsilicovolcanoconiosis’
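
In code, that’s a one-line filter over each review’s token list:

# Keep tokens between 2 and 45 characters long (inclusive)
tokens = [tok for tok in tokens if 1 < len(tok) <= 45]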

2–3. Identify phrases

It’s important to identify phrases because they will make your topic modelling more coherent.

Many words consistently appear together, such as ‘total’ and ‘war’ in this dataset:

‘total’ and ‘war’ appear together a lot in this Word Cloud. ‘three’ and ‘kingdom’ as well.
Makes more sense to put them together as ‘total_war’!

After making phrases, you can be more certain that single-token terms most likely occurred by themselves. Players can talk about ‘war’ as a part of the game rather than ‘war’ in reference to the ‘total_war’ series.

To tokenise phrases, we turn words into single tokens using gensim’s Phraser:
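
A minimal sketch of that step (the min_count and threshold values are illustrative; tuning them changes how aggressively phrases are formed, and token_lists is the list of tokenised reviews from the previous steps):

from gensim.models.phrases import Phrases, Phraser

# token_lists: one list of tokens per review, e.g. [['total', 'war', 'is', 'fun'], ...]
bigram = Phraser(Phrases(token_lists, min_count=5, threshold=100))
trigram = Phraser(Phrases(bigram[token_lists], min_count=5, threshold=100))

def make_trigrams(doc_tokens):
    # 'total' + 'war' -> 'total_war', and so on for longer phrases
    return trigram[bigram[doc_tokens]]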

For those who prefer to see this visually
Visualising how make_trigrams works

2–4. Lemmatise tokens and retain nouns only

Words come in many forms and often come from the same lemma.

Lemma: The canonical form of an inflected word; i.e., the form usually found as the headword in a dictionary, such as the nominative singular of a noun, the bare infinitive of a verb, etc.

A simple example would be:

inflected: “playing”, “played”, “plays”

lemma: “play”

Returning tokens to their dictionary form makes it easier to identify topics.

Take for example the following sentences:

  • I love playing Total War!
  • I have played every Total War there is.
  • I play Total War every Friday night.

They are all talking about the same topic in different ways, using different forms of the word “play”.

You only need to know that all three reviews mention ‘play’ and ‘total_war’, to get an idea of what the three sentences are about. They all refer to the ‘total_war’ series and the ‘play’ activity.

Left uncleaned, “playing”, “played”, and “plays” will all be treated as different tokens in the model. By reducing them to their lemma, “play”, we get 3 of the same token, reducing the noise in the LDA model without losing information.

SpaCy has a built-in lemmatiser that we can use for this purpose. Even more conveniently, we can use Part-Of-Speech (POS) tags to select what kinds of words we want to include.

Sample code for SpaCy lemmatisation
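
A minimal sketch of such a POS-filtered lemmatisation step (allowed_postags and the n-gram check are illustrative choices, not the exact original code):

import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatise(doc_tokens, allowed_postags=('NOUN',)):
    # Rejoin tokens so spaCy can POS-tag them, then keep only the allowed parts of speech
    doc = nlp(' '.join(doc_tokens))
    return [tok.lemma_ for tok in doc
            if tok.pos_ in allowed_postags or '_' in tok.text]  # keep n-grams like 'total_war'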

I prepared 3 variations to try out in Topic Modelling:

  1. All types of words allowed
  2. Nouns, verbs, and 3grams only
  3. Nouns and 3grams only

Step 3: Topic Modelling

For Topic Modelling, I used Latent Dirichlet Allocation.

What is Latent Dirichlet Allocation?

Fun fact: Dirichlet comes from the name Peter Gustav Lejeune Dirichlet, the German mathematician for whom the Dirichlet Distribution is named.

In NLP, LDA is a topic model used for unsupervised machine learning: it discovers topics in a dataset that has no labels.

In this case, we use the reviews to determine the topics that exist within them, rather than having topic labels that another model (such as a Logistic Regression) could use to predict the topic of the review.

To go deeper into the mechanics of LDA, check out this link.

LDA in Plain English

There’s an easier way to understand LDA.

Let’s say you’re asked to provide category labels for a set of news articles which do not have headlines or category labels but have completed body text. You are given no additional information, such as the distribution of article categories.

One way of solving this problem is to read the documents and find out what they’re about, then label them.

As you read each document, you identify topics within the document.

Essentially, LDA performs the above task.

How does LDA work?

First, let’s talk about what LDA assumes.

LDA assumptions:

  • Each document is a bag of words. Frequency of occurrence is key; the order of word placement is not important.
  • Each document contains multiple topics.
  • Each topic is a distribution over a set of keywords.
  • Each document is assumed to be generated by LDA’s probabilistic model.
  • Only the number of topics is specified in advance.

Based on these assumptions, we have cleaned and tokenised our data into a format that LDA can easily read and analyse.

Each review has been tokenised into a list of keywords that LDA can then be trained on.

Training the LDA Model

NOTE: You may wish to use ldaMulticore if your computer has more than 1 or 2 cores. This significantly shortens the time required to run the model.

First, we have to specify a number of topics that the LDA model must identify.

The LDA Model then learns topic distributions across the documents and fits what it finds into the number of topics you specify.

Caveat: LDA rarely returns a full set of coherent topics the first time it is run, partly because it is random unless you set a random_state.

In case you’re wondering, I set 10 topics the first time, got a beautiful distribution of topics, and didn’t save my model before running it again. That distribution of topics is now lost to us forever. Always save your model!

Second, the LDA Model assigns temporary topics to all documents. Don’t worry about this; no human input is required.

Third, the LDA Model will evaluate topics in each document according to the keywords associated with each topic. The model iterates over the number of passes that you instruct it to use.

Next, using LDAvis, we can generate a visualisation for our topic model.
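
A sketch of the training and visualisation steps with gensim’s LdaMulticore and pyLDAvis (the hyperparameters and filter_extremes thresholds are illustrative; newer pyLDAvis versions moved the gensim helper to pyLDAvis.gensim_models):

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
import pyLDAvis.gensim

# token_lists: one list of lemmatised tokens per review (from Step 2)
id2word = Dictionary(token_lists)
id2word.filter_extremes(no_below=10, no_above=0.5)   # drop overly rare and overly common terms
corpus = [id2word.doc2bow(tokens) for tokens in token_lists]

lda = LdaMulticore(corpus=corpus, id2word=id2word, num_topics=5,
                   passes=10, random_state=42, workers=3)
lda.save('steamvox_lda.model')   # always save your model!

vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)
pyLDAvis.save_html(vis, 'lda_vis.html')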

1st Run: All Types of Words

LDA visualisation for 1st run

Topic separation looks good at first glance, but human assessment of the smaller topics will tell you that they are not so coherent.

At this point, I revisited cleaning and tokenisation. I also applied filter_extremes() to remove overly common terms and added additional stopwords for removal. Such words seemed to only create more noise, most notably ‘total_war’, which appeared in numerous topics.

12th Run: Nouns and n-grams only

After intensive cleaning and retaining only nouns, I got this largely coherent model.

I identified 5 topics that made sense to me as a human, among them Strategic Gameplay and Content & Authenticity.


I decided not to go too granular with the topics because such granularity is not important for the purpose of this model, which is to quickly identify general areas to investigate and improve.

Specific details can be gathered by reading relevant reviews (or playing that part of the game yourself).

Bonus Round: Nouns, Verbs, 3grams

Visually similar, less coherent

I tried the nouns-and-verbs dataset last, because I only thought to test it after I got a working model on my 12th run.

Despite running this model at least 5 times, the results did not appear more coherent than the nouns-only model.

Nouns & n-grams Most Effective

When we identify topics, we are really identifying nouns, along with the other nouns most closely associated with them. Nouns carry the most information in this case, compared to adjectives and pronouns.

For example, “bicycle” (noun) relates to the sports topic known as “cycling” (noun). “Exciting” and “fun” can also be related to cycling, but they are less informative about the topic than “bicycle”, “helmet”, and “race”.

n-grams also act like nouns in this case. After forming n-grams, we are less concerned about the meaning of the phrase and more concerned about how frequently the phrase occurs.

However, it would be hasty to exclude the Nouns-and-Verbs dataset from future analysis off-hand. I can see certain verbs such as “replay”, “read”, and “feel” being important for general game features such as Replayability and Content.

This could still be useful when identifying more general features using a large Steam dataset that contains reviews for thousands of games.

Model Validation: Overview

It is always a good practice to check your model’s performance after you build it.

Whereas supervised learning models can be used on a test set of data and then scored using pre-built functions to get R-squared values or classification reports, unsupervised learning models do not have such validation reports. There is no ground truth in the original dataset to check against.

I still thought validation was necessary, so to validate my model, I checked the labels manually to see if the classification was reasonably accurate.

I randomly sampled 10% of my dataset twice, using different conditions for each sample dataset:

  • Token count ≤ 5
  • All token counts

I chose 10% as a sample size because I wanted to be 95% confident of my conclusions. Trying to be 99% confident would have meant much more labelling work for a marginal increase in accuracy, which is a bad tradeoff.

I wanted to see whether low token counts would have horrible classification accuracy.

Hazards and Pitfalls

Beware that checking in this manner can be somewhat arbitrary.

That’s because it is up to you (the creator of this model) to decide whether each row is classified correctly. Check your bias!

Another problem is that you are basically checking against your training data. Typically, training set accuracy will be higher than test set accuracy. Checking against an unseen dataset may be helpful for validating your model, but beware another problem.

Manual labelling is extremely time-consuming. Be sure that you are satisfied with your LDA Model’s topics before you start labelling any sampled dataset. I burned a good 10 hours or so on this particular activity and restarted many times.

Validation Findings

The model achieved 85% accuracy overall.

This was based on the sample that included all token counts. This result seemed quite good for an unsupervised learning model.

The lower the token count, the worse the classification.

I treated my classification results as harshly as I could, because when I was lenient with the results, the rows with 0 tokens had high classification accuracy.

This made no sense to me because by simple reasoning, you can’t possibly know what topic a sentence is about if there are 0 keywords. (Of course, this 0-keywords situation happened only after we cleaned our text intensively.)

Additionally, I found that most of the time, the model defaulted to Content & Authenticity, the most dominant topic of the entire dataset.

Through manual inspection of the sampled data, I found that most short reviews with 0 tokens said something like “This is the best Total War game!” which naturally would have only ‘total_war’ as a token (which we removed during cleaning for a good reason).

Worth noting: since LDA is a probabilistic model, when given 0 tokens it often returns the most commonly occurring topic, because that topic is more likely to be correct than any other. This behaviour may change when using an upsampled dataset with balanced classes.

I checked the performance of the all-token-counts sample dataset as well:

Note the red lines for 25th, 50th, and 75th percentiles for token counts

The cumulative misclassification rate dropped sharply as token counts rose from 0 to 3, then stabilised at around 15% for higher token counts (in line with the ~85% overall accuracy).

At this point, I decided to drop all reviews with 0 or 1 tokens. I was unwilling to drop everything up to 2 tokens, because that would have meant losing 25% of the dataset, a whopping 916 out of 3,661 reviews. It simply wasn’t worth trading a whole quarter of the usable data for a marginal improvement.

In summary:

  • The model learned 5 topics, and all were coherent topics corresponding to game features of Total War: Three Kingdoms.
  • The model can assign the dominant topics of reviews accurately 85% of the time.
  • Makes no sense to classify reviews with 0 or 1 tokens, so they will be excluded entirely in the final aggregation.

Step 4: Sentiment Analysis

Now that we know what players are talking about, we want to know how they feel about each topic.

Let’s pause here to address a critical question.

Steam data comes with sentiment labels provided by the users themselves: a user is not allowed to post a review without choosing one. There are also only two options, positive and negative; there is no middle ground.

That would normally be a good thing, and Steam’s review system is designed that way to simplify data analysis, but I did not think it was reflective of reality.

Most people have some level of mixed opinion. It’s highly unlikely that anyone will love every single feature in a given game.

Even if they are “mostly” positive, for reasons discussed in the next section, I didn’t want to stop at getting an “overall positive” for the whole review.

Now, let’s discuss VADER and some issues it may present when you use it.

Known issues using VADER:

  • Tends to give neutral scores.
  • Sarcastic reviews are likely to be given the wrong sentiment score.
  • VADER may read vulgarities used for emphasis as negative when they are in fact positive.
  • Short documents tend to be given the wrong sentiment score.

Here’s a preview of VADER at work:

Yes, my instance of the VADER SentimentIntensityAnalyzer is named “Anakin”.

VADER takes the text that you feed it (including punctuation and emotes) and analyses the sentiment behind the words. Trained to analyse social media posts, VADER performs well on Steam reviews because of how similar they are.
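
A minimal example of the call (the review text here is made up, and the instance is named anakin in the spirit of the screenshot above):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

anakin = SentimentIntensityAnalyzer()
scores = anakin.polarity_scores("Diplomacy is a mess, but the battles are fantastic!")
# scores is a dict with 'neg', 'neu', 'pos' and 'compound' keys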

Its output is 4 separate scores: negative, neutral, positive, and compound. For specifics on each, and how they are calculated, check out this article.

We are only interested in the compound score because we just need the overall sentiment for each document on hand. Learn more about the compound score here.

The compound score is a weighted composite score for the valence scores of each word in the text, so you can think of it as the overall score for the entire document’s sentiments. More on valency here.

Caveats:

  • We have to assume that (at least for paid games) most players want to give serious feedback that can help improve the game, because they have a vested interest (money spent).
  • We must feed as much text to VADER as possible, before we get sentiment scores.
VADER compound score can get good sentiment scores on long documents! (This is the same review)

Step 5: Scoring & Aggregation

Sample output: tokenised and scored by sentences. Note that the classification rates look poor at the sentence level.

After we identify topics and score each document, we need to aggregate scores across the whole dataset in order to generate snapshots.

Preprocessing for Aggregation

The compound score returned by VADER is a float. For the purpose of aggregation, however, I found the float scores just made the problem more complex.

Histogram of sentiment scores for sentences in a test review

I decided to change the sentiment scores to integers and store them in a new column.

For generating a snapshot of how many reviews/sentences in a dataset are positive, the magnitude of the score is not needed. We just need to know whether it’s positive, neutral, or negative (represented by 1, 0, and -1 respectively).

Here’s how that looks:

  • Score > 0.1 → 1 (Pos)
  • -0.1 ≤ Score ≤ 0.1 → 0 (Neutral)
  • Score < -0.1 → -1 (Neg)
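
As a function (assuming the compound score sits in a ‘compound’ column):

def to_sentiment_label(compound):
    # Collapse VADER's compound score into 1 (positive), 0 (neutral) or -1 (negative)
    if compound > 0.1:
        return 1
    if compound < -0.1:
        return -1
    return 0

df['sentiment_int'] = df['compound'].apply(to_sentiment_label)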

Aggregation Logic

Steps in general:

  1. For each review in the dataset…
  2. Iterate through each topic in the topic dictionary…
  3. Form smaller dataframe just for that topic.
  4. Count the number of Pos/Neg/Neutral rows associated with that topic.
  5. Divide Pos/Neg/Neutral row count by total number of rows for that topic to get proportion percentages.
  6. Find the category with the highest proportion (the mode of this distribution); append the category to a topic score list inside a score_dict.
  7. If there are 2 dominant sentiments, append Neutral sentiment. (More than 1 dominant sentiment → unclear sentiment → Neutral).
  8. At the end, score_dict contains lists with 1 sentiment rating per review. Divide Pos/Neg/Neutral review count by total number of reviews to get proportion percentages.
  9. Final result is your numerical snapshot. Use df.plot() in Python, Tableau, or any other visualisation/deployment tool you like.

I liked this method because appending to a list inside a dictionary saved me the trouble of creating “NA” fields for reviews that had fewer than 5 topics and cleaning the “NAs” away later during final calculations.

Deriving Final Output from Aggregation Logic

Here’s the code for it:
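
The gist of it, as a sketch rather than the original implementation: the column and variable names are illustrative, and the dataframe is assumed to have one row per review/topic unit, with a ‘review_id’, a ‘dominant_topic’ label, and the integer ‘sentiment_int’ from the previous step.

from collections import Counter
import pandas as pd

def aggregate_sentiment(df, topics):
    score_dict = {topic: [] for topic in topics}

    for _, review_rows in df.groupby('review_id'):            # 1. each review...
        for topic in topics:                                  # 2. ...and each topic
            topic_rows = review_rows[review_rows['dominant_topic'] == topic]   # 3. rows for that topic
            if topic_rows.empty:
                continue
            counts = Counter(topic_rows['sentiment_int'])     # 4. count pos/neg/neutral rows
            (top, n), *rest = counts.most_common()            # 5./6. most frequent category
            if rest and rest[0][1] == n:                      # 7. tie -> call it neutral
                top = 0
            score_dict[topic].append(top)

    # 8. proportion of reviews that are positive/neutral/negative for each topic
    snapshot = {topic: pd.Series(scores).value_counts(normalize=True)
                for topic, scores in score_dict.items() if scores}
    return pd.DataFrame(snapshot).fillna(0)                   # 9. the numerical snapshot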

The above aggregation logic works for any number of rows you derive from the dataset, as long as you correctly assign the review numbers before and during transformation.

Similar/identical logic is used in all 4 methods as the final step. The difference in results comes from how I parse and group the rows for final aggregation.

Aggregation Methods

  1. By dominant topic of each review
  2. By sentences (regardless of review)
  3. By sentences, aggregated by review
  4. By “paragraphs”, aggregated by review (Chosen method)

I tried so many different ways of aggregation because I wanted to summarise the scores efficiently while minimising bias and maximising accuracy and coherence. Nobody should use biased and/or unreliable insights.

5–1. By review

A review may contain multiple topics, each with their own sentiment.

The most obvious place to start was to just get the dominant topic for each document and score the sentiment of that review to get the overall sentiment.

There were 2 problems:

  • Data wastage: Using this method, any other topics present in a review would be ignored. I thought it was wasteful to just leave them out.
  • Biased scoring: Scoring based on the whole review while only taking the dominant topic into account does not make sense to me at the review level. (I think it’s reasonable at the sentence/paragraph level.)

The results look good at first glance, but they are biased:

Method 1: Aggregate by dominant topic of each review

5–2. By sentences (regardless of review)

Next, I tried going by sentences instead, to solve the 1st method’s problems.

The reasoning was that if I went by sentences, I could simplify the aggregation since (in theory) it should not matter which review each sentence comes from.

I tokenised the reviews into sentences and validated my model one more time before I proceeded. All in all, there were 16,739 sentences as tokenised by SpaCy, parsed from 3,661 reviews.
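
The split itself is straightforward with spaCy’s default English pipeline, whose parser provides the sentence boundaries:

import spacy

nlp = spacy.load('en_core_web_sm')

def to_sentences(review_text):
    # One string per sentence, as segmented by spaCy
    return [sent.text.strip() for sent in nlp(review_text).sents]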

I found that going by sentences, the model produced a much lower classification accuracy of 67%.

After investigating, I found out that tokenisation by sentence caused all documents to have fewer tokens (obviously), and more documents with 0 and 1 token counts were created.

I found 2 problems:

  • SpaCy’s tokenisation (with default settings) tends to break sentences incoherently at times.
  • Low classification accuracy.

I simply couldn’t trust the model’s classification at the sentence level. The aggregation would be wrong as well because aggregation is based on the classification.

Allow me to illustrate:

SpaCy’s sentence tokenisation generates a lot of sentences…

I don’t think that “1.” should be a sentence. To be fair, of course, SpaCy uses punctuation to help determine sentence boundaries. With some parameter changes, it’s possible to change how the sentences are tokenised. I decided to resolve this in a later attempt.

I then counted the number of pos/neg/neutral sentences and got them as a percentage of the total number of sentences, grouped by topic, throughout the whole dataset.

Method 2: Aggregate by sentences

Neutral counts increased dramatically, most likely because VADER had no idea how to score them and decided to assign a score close to or equal to 0.

I had serious doubts about this result because:

  • Classification accuracy was low (~67% on the sample)
  • Shortened sentences tend to get wrong or neutral scores from VADER
  • The context of a sentence matters. Aggregating all sentences by the review they belonged to made more sense to me.

5–3. By sentences, grouped by review

I repeated the above method with 1 difference: I grouped the sentences by the reviews they belonged to, to get the overall sentiment for each topic found in each review. Then, I aggregated for the dataset.

The main difference is the logic behind the parsing. I thought about how a human might read reviews and manually assign the sentiment score.

One way could be to:

  1. Label all sentences with topics
  2. Aggregate using sentences instead of reviews as documents (still separate rows in the dataset)

This summarises the scores for all sentences about a topic within each review.

The results are as follows:

Method 3: By dominant topic of sentences, grouped by review

However, at this point I had not yet resolved the SpaCy sentence tokenisation issue. I wanted to see if this would change the results significantly.

There were some minor changes, but the overall result still looks like the one derived from Method 2. Some of the same issues remained, too.

However, I believed in this result a little bit more because of the reasoning behind the parsing and grouping.

I felt like I was on the right track, building the aggregation logic to follow a typical human’s way of identifying topics and scoring sentiment while reading text.

5–4. By “paragraphs”, grouped by review

All this exploration seemed to lead me to a dead end. I was stuck as long as I couldn’t solve the SpaCy tokenisation issues.

At this point, I reviewed the logic one more time. How would a human read reviews? Did I really have to go by sentences?

After some consideration, I thought it would be more reasonable to assume that readers identify topics and sentiments by the paragraph as they read the document. Readers would then generally recall, “This person seems to really like the strategic gameplay”, and/or “This review was mostly about strategic gameplay.”

So, I decided to parse each review into paragraphs before getting dominant topics and scores. I looked around for a paragraphing tool and thought I found one in Syntok.

Comparing SpaCy and Syntok:

SpaCy sentence tokenisation
Syntok sentence tokenisation

For the same review, SpaCy generated more sentences than Syntok (41 vs 27). I found the sentences tokenised by Syntok to be generally more coherent. (‘1. Diplomacy.’ is better than ‘1.’ by itself).

Some of them also contained more than one sentence, if we’re a little stricter about the definition of a “sentence”.

Stumbling Block

I was more satisfied with this tokenisation, but it still wasn’t generating what I would call “paragraphs”, which should contain maybe 3 lines or more. (Or at least 2 long lines.)

I found out that Syntok works best when paragraphs are demarcated by double newline characters (‘\n\n’). This was a bit of an issue. Steam reviews are typically short.

For this dataset, the median review length (before dropping any rows) was 18 words. The chances of a newline appearing are not high with 18 words; I would expect there to be at most 3 sentences in such a review.

I decided I would make do with the new sentence tokenisation anyway, since it was more coherent and some sentences were joined together into pseudo-paragraphs.

Happy Accidents Do Happen

I had to slightly modify my cleaning process to work with Syntok (into something I now call Syntokenize):
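
A sketch of what such a wrapper might look like (syntok’s segmenter groups tokens into sentences and sentences into paragraphs; expandContractions is the helper sketched in 2–2 (ii)):

from syntok import segmenter

def syntokenize(review_text):
    # Expand contractions first, then let syntok group sentences into paragraphs
    text = expandContractions(review_text)
    paragraphs = []
    for paragraph in segmenter.process(text):
        sentences = [''.join(tok.spacing + tok.value for tok in sentence).strip()
                     for sentence in paragraph]
        paragraphs.append(' '.join(sentences))
    return paragraphs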

When I ran this function on a list of test reviews, I got paragraphs!

The numbers in the 2nd list are index numbers for the review that the text belongs to.

Investigating this, I discovered an accidental interaction between expandContractions() and Syntok: when expandContractions is run on a review, Syntok tokenises it into very long sentences (each made up of several sentences). It wasn’t quite the plug-and-play function I had hoped for, but it did what I needed it to.

Using Syntok, I parsed the data into paragraphs and pseudo-paragraphs, all of which were more coherent than the short sentences I had up to now. 642 rows (a 17.5% increase) were gained this way, so there were 4,303 rows of data before dropping those with low token counts (after dropping low-token-count reviews, 3,724 rows remained).

Method 4: By paragraphs in each review

I was a lot more confident of using these results than those from previous attempts.

The logic more closely followed how a human would read and score a review, and I was confident that Syntok would parse coherent paragraphs out of reviews.

How to Use the Data

SteamVox on Tableau

Deployed on Tableau, you can make a dashboard that allows you to see at a glance what the top/lowest-ranked features are in each category of Pos/Neg/Neutral. (I included the original compound sentiment score to enable more effective sorting.)

How to use the data:

  1. Narrow down which features to pay most attention to (most negative)
  2. Sort Original Text by Compound Score, then click on any bar to read the review/paragraph.

Insights and How to Use Them

In the end, it’s up to a human to make sense of the data.

I inspected reviews for the most negative category (Strategic Gameplay) first, and sorted the most negative ones to the top.

Sample negative review (Strategic Gameplay)

In this case, Strategic Gameplay has the highest proportion of negative reviews, relative to all reviews on the topic. That doesn’t necessarily mean it’s badly built. Numbers aren’t everything; context matters as well.

This reviewer is unhappy with how the Diplomacy system works in Total War: Three Kingdoms, but that doesn’t mean it is objectively bad. It may be that they don’t agree with the design due to their personal preferences.

Sample positive review (Strategic Gameplay).

Put in context, the Three Kingdoms era was full of impermanent diplomatic deals and alliances. Your ally from yesterday could be your enemy tomorrow. The developers built Total War: Three Kingdoms to maximise immersion, and they included the era’s political upheavals in their game design.

Positive review (Content & Authenticity)

I would venture to say that if their aim was to target the Chinese PC gaming market, they have succeeded. Most reviews are about the authenticity of the game and how Creative Assembly succeeded in making an authentic Three Kingdoms game.

Another negative review for Strategic Gameplay

The above review presents legitimate concerns about Strategic Gameplay. The chief complaint is that there are bugs in the diplomacy system and on the campaign map.

With this information on hand, developers can quickly pinpoint what they need to fix.

Conclusion

Doing this project challenged me to pick up and apply machine learning to a real-world problem using real-world data, the gathering and cleaning of which came with numerous real-world challenges.

SteamVox let me combine several abilities from my diverse skill set to solve a business problem I previously encountered.

I’m glad to see it come together, and I hope to take it many steps further and develop a tool that can be used by all game developers who use Steam.

If you’ve read up to this point, thank you for reading. Stay tuned for more! My work has not ended; there is much more room for improvement.

Future Work (in progress):

  • Add timestamps and enable aggregation by time period
  • Add a review age field (reviews become less relevant as new updates are released)
  • Make the model more generally applicable to most/all Steam games

References

Liu, S. (2019) Dirichlet distribution. https://towardsdatascience.com/dirichlet-distribution-a82ab942a879

Amazon Web Services (2019) How LDA Works. https://docs.aws.amazon.com/sagemaker/latest/dg/lda-how-it-works.html

Clark, S. (2013) Topic Modelling and Latent Dirichlet Allocation. Machine Learning for Language Processing: Lecture 7. https://www.cl.cam.ac.uk/teaching/1213/L101/clark_lectures/lect7.pdf

Pandey, P. (2018) Simplifying Sentiment Analysis using VADER in Python (on Social Media Text). https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f

cjhutto. vaderSentiment Documentation. https://github.com/cjhutto/vaderSentiment

SIL International. SIL Glossary of Linguistic Terms: Valency. https://glossary.sil.org/term/valency
