Linear Regression: Predicting the Magnitude of Steam discount

Da Guo
Published in Analytics Vidhya
6 min read · Apr 20, 2020
A project born out of curiosity.

Have you ever wondered why some games get bigger discounts than others? I know I have, and if you are just as curious as I am, please join me and read on.

(You can find this project on my git repository.)

Screenshot taken from Steam — 4/19/2020

Primary Goal

In this project, I will seek to find out what influences the discount percentage of strategy-genre video games on Steam, and build a model to predict the % discount using linear regression.

Data

The data in this project will be directly scraped from Steam’s official website.

The scraping algorithm is written in Python, utilizing libraries including:

  1. "re" — Python's built-in regular expression library, for pattern finding.
  2. "csv" — for writing scraped data into .csv files that can be read by pandas.
  3. "requests" — to make requests to the Steam website for any dynamic elements.
  4. "BeautifulSoup" — a popular web scraping library for finding HTML elements/tags.
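As a rough sketch of how the pieces fit together: the real scraper fetches pages with requests and parses tags with BeautifulSoup, but the pattern-finding and CSV-writing parts can be illustrated with the standard library alone. The HTML snippet and class names below are illustrative assumptions, not Steam's actual markup:

```python
import csv
import io
import re

# Hypothetical fragment of a store-page search result; the real scraper
# would obtain this HTML via requests and narrow it down with BeautifulSoup.
html = (
    '<span class="title">Example Strategy Game</span>'
    '<div class="discount_pct">-40%</div>'
    '<div class="discount_final_price">$5.99</div>'
)

# "re": pull the fields out with regular expressions.
title = re.search(r'class="title">([^<]+)<', html).group(1)
discount = re.search(r'discount_pct">-(\d+)%', html).group(1)
price = re.search(r'\$([\d.]+)', html).group(1)

# "csv": write one row, as the scraper would into a .csv file for pandas.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Title", "DiscountPercentage", "FinalPrice"])
writer.writerow([title, discount, price])
print(buf.getvalue().strip())
```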

You can find the scraper in my git repository.

The scraped data when loaded into Pandas looks like this:

The overlapping columns are not an issue in the Jupyter Notebook environment; apologies for the messy look.

For all future reference, the target I want to predict in the data set will be ‘DiscountPercentage’.

Data Cleaning

The raw data scraped contains many things we do not need for our purpose:

I. Free games that have no price, including demos and upcoming releases.

II. Games that are not on sale.

But we are not done yet. Because we are building a linear regression model, it is important that we also exclude

III. Non-numerical entries.

That means we will drop some string-based columns. While we are at it, why not also try to extract some information from them through Feature Engineering?
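A minimal pandas sketch of filters I and II above, run on a toy frame (the values and the "Title" column are made up; "OriginalPrice" and "DiscountPercentage" follow the column names used in this data set):

```python
import pandas as pd

# Toy frame standing in for the scraped data.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "OriginalPrice": [19.99, 0.0, 9.99, 59.99],
    "DiscountPercentage": [40, 0, 0, 25],
})

# I.  Drop free games that have no price (demos, upcoming releases).
df = df[df["OriginalPrice"] > 0]
# II. Drop games that are not on sale.
df = df[df["DiscountPercentage"] > 0]

print(df["Title"].tolist())
```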

I will cover all of the Feature Engineering done while modeling and testing in later chapters, but for the baseline model, see below:

We can also add a “Season” column to see which season a game is released in:
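One way to derive that "Season" column is to map the release month to a season. The month-to-season boundaries below are an assumption (meteorological seasons), as is the "ReleaseDate" column name:

```python
import pandas as pd

def season_of(month):
    """Map a calendar month to a season (meteorological boundaries)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"

df = pd.DataFrame({"ReleaseDate": ["2019-07-01", "2020-01-15", "2018-10-30"]})
df["Season"] = pd.to_datetime(df["ReleaseDate"]).dt.month.map(season_of)
print(df["Season"].tolist())
```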

After the above process, we can now safely drop all string based columns from the data frame:
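Dropping the string-based columns can be done in one step with `select_dtypes`, which keeps only the numeric columns a linear model can use (the toy columns here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B"],             # string-based column to drop
    "Reviews": [120, 45],
    "DiscountPercentage": [40, 25],
})
# Keep only numeric columns for the linear regression model.
df = df.select_dtypes(include="number")
print(list(df.columns))
```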

This process shrinks our data from 14,806 entries and 12 features to 370 entries and 7 features. Unfortunately, the smaller sample size makes this model more prone to inaccuracies.

The cleaned baseline data frame looks like this.

Data Analysis

The analysis part of this project includes three steps using Scikit-Learn:

  1. Exploratory Data Analysis (EDA)
  2. Feature Engineering (FE)
  3. Modeling

And all three steps are repeated every time something changes.

The general work flow for this project looks like this:

  1. EDA to find outstanding Feature-Target relationships (through pair plots/heat maps, Lasso coefficients, etc.)
  2. Based on available information, conduct Feature Engineering (mathematical transformation, binning, get dummy entries, etc)
  3. Modeling and scoring using R² and/or other metrics (RMSE, MAE, etc.)
  4. Rinse and repeat until all potential FE ideas are tried and exhausted or an acceptable R² score has been achieved.

In total, 7 rounds of the aforementioned cycle were conducted; below I will summarize and share the code for each one:

.corr() is a built-in pandas function.
An example heat map showing feature-target relationships. The higher the score, the stronger the relationship.
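A small sketch of how such a heat map is produced (toy values; the real frame has more features). `.corr()` computes pairwise Pearson correlations, and the target's row/column is what reveals the feature-target relationships; seaborn's `heatmap(corr, annot=True)` would render it as the figure above:

```python
import pandas as pd

df = pd.DataFrame({
    "Days_Since_Release": [30, 400, 900, 1500],
    "OriginalPrice": [59.99, 29.99, 19.99, 9.99],
    "DiscountPercentage": [10, 33, 50, 75],
})
# Pairwise correlation matrix; read off the target column for
# feature-target strength.
corr = df.corr()
print(corr["DiscountPercentage"].round(2))
```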
A compiled function for linear regression modeling. Credits to all the smart people before me.
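The compiled modeling function can be sketched roughly as below. This is a hedged reconstruction, not the author's exact code: the model list, regularization strengths, and the 5-fold split are assumptions, but it produces the "mean +- std" R² scores reported in each round:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

def score_models(X, y, cv=5):
    """Cross-validated R^2 (mean, std) for a few linear models."""
    results = {}
    for name, model in [("OLS", LinearRegression()),
                        ("Lasso", Lasso(alpha=0.1)),
                        ("Ridge", Ridge(alpha=1.0))]:
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        results[name] = (scores.mean(), scores.std())
    return results

# Synthetic demo data; the real X/y come from the cleaned data frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=100)
for name, (mean, std) in score_models(X, y).items():
    print(f"{name}: {mean:.3f} +- {std:.3f}")
```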

Round 1: Baseline Model

Changes made: removing games that have fewer than 30 reviews

Best Model: Lasso
Score: 0.419 +- 0.073

Round 2: Mathematical Transformation

Changes made:
Log transformed “Reviews” & “OriginalPrice” due to skewed pair-plot results.
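The log transform in this round can be sketched as follows; `log1p` (log of 1 + x) is one common way to compress the long right tails seen in the pair plots while handling small values safely. Column names match the data set; the values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Reviews": [10, 100, 10000],
    "OriginalPrice": [0.99, 9.99, 59.99],
})
# Compress the heavy right tails of both skewed features.
df["Reviews_log"] = np.log1p(df["Reviews"])
df["OriginalPrice_log"] = np.log1p(df["OriginalPrice"])
print(df["Reviews_log"].round(2).tolist())
```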

Best Model: Lasso
Score: 0.437 +- 0.104

Round 3: Incorporate Main_Tag as dummies into the model

Changes made:
Adding dummy data generated from “Main_Tag” column to the data frame.

Dummies created from “Main_Tag” column in the original data frame.
“df3” is a direct copy of the original data frame with the “Main_Tag” column added back in.
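Generating those dummies boils down to `pd.get_dummies` on the "Main_Tag" column and joining the result back onto the frame (toy tags and values here):

```python
import pandas as pd

df = pd.DataFrame({
    "Main_Tag": ["Strategy", "RTS", "Strategy"],
    "DiscountPercentage": [40, 25, 60],
})
# One-hot encode the main tag, then attach the indicator columns.
dummies = pd.get_dummies(df["Main_Tag"], prefix="Tag")
df3 = pd.concat([df.drop(columns="Main_Tag"), dummies], axis=1)
print(list(df3.columns))
```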

Best Model: Lasso
Score: 0.330 +- 0.073

Round 4: Try with All_Tags as dummies and see if there is any improvement

Changes made:
Dummy columns added based on which csv is used, “top 5 tags” (scraped games usually have 10 to 20 tags) in this case.

Values in ‘All_Tags’ are comma separated, so we need to do some additional work before we can turn them into dummy values.
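That extra work can be done with `str.get_dummies`, which splits a delimited string column into one indicator column per tag. To keep only the most common tags, something like `tag_dummies.sum().nlargest(5)` could pick the top 5 (the tag strings below are invented examples):

```python
import pandas as pd

df = pd.DataFrame({"All_Tags": ["Strategy,RTS,Base Building",
                                "Strategy,Turn-Based"]})
# Split the comma-separated tag string into indicator columns.
tag_dummies = df["All_Tags"].str.get_dummies(sep=",")
print(sorted(tag_dummies.columns))
```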

Best Model: Lasso
Score: 0.359 +- 0.080

Round 5: Mathematical Transformation & All_Tags dummies

Changes made:
Log transformed “Reviews” + top 5 tags.

Best Model: Lasso
Score: 0.359 +- 0.080
(Note: exactly the same score as round 4, which suggests “Reviews” has virtually no impact on DiscountPercentage. The feature can be safely dropped if needed.)

Round 6: Binning Features in three brackets

Changes made:
Separating “Reviews” and “Days_Since_Release” into three brackets.

We run df6.describe() to decide the binning thresholds; in this case I will use the 25% and 75% quantiles.
The two columns are separated into 3 brackets.
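A sketch of the three-bracket binning with `pd.cut`, using the 25% and 75% quantiles as cut points (toy "Reviews" values; the bracket labels are my own):

```python
import pandas as pd

df6 = pd.DataFrame({"Reviews": [10, 50, 200, 1000, 40000]})
# Thresholds come from df6.describe(): the 25% and 75% quantiles.
q25, q75 = df6["Reviews"].quantile([0.25, 0.75])
df6["Reviews_bin"] = pd.cut(
    df6["Reviews"],
    bins=[-float("inf"), q25, q75, float("inf")],
    labels=["low", "mid", "high"],
)
print(df6["Reviews_bin"].tolist())
```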

Best Model: Ridge
Score: 0.273 +- 0.044

Round 7: Binning Features in two brackets + adding Floor Threshold

Changes made:
Separating “Reviews” and “Days_Since_Release” into two brackets and dropping observations that have fewer than 30 reviews.

Again, we look at the thresholds for the brackets; with 2 bins, I will pick the 50% quantile (the median) this time.
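The floor threshold plus two-bracket split can be sketched like this (toy values; the real frame carries both features and the target):

```python
import pandas as pd

df7 = pd.DataFrame({
    "Reviews": [5, 80, 300, 5000],
    "Days_Since_Release": [30, 200, 700, 2000],
})
# Floor threshold: drop observations with fewer than 30 reviews.
df7 = df7[df7["Reviews"] >= 30]
# One split at the 50% quantile (the median) gives two brackets.
median = df7["Days_Since_Release"].median()
df7["Release_bin"] = (df7["Days_Since_Release"] > median).astype(int)
print(df7["Release_bin"].tolist())
```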

Best Model: Lasso
Score: 0.313 +- 0.098

Conclusion

Let’s start with some statistics generated from Data Cleaning:

Some statistics generated from data cleaning.

Notice that this was not during any sales season, which is interesting: it may mean that sales seasons do not offer as wide a range of discounts as I previously imagined.

Now, let’s see the top 2 features that influence the discount %:

‘Days_Since_Release’ has a Lasso coefficient of 11.3; it has the most impact.
In simple terms, the more poorly a game is reviewed, the bigger the discount.

But what doesn’t impact the discount %? Let’s look at the top 2 features that do not influence it:

Even though some indie games sell for $2 and some AAA titles for $60, original price has almost no impact on the discount %.
The number of reviews is highly correlated with the number of units sold, according to this article. In other words, how well a game sells does not impact the discount %. This is perhaps the biggest surprise I got from this project.

And the best model is in fact the baseline model, with the smallest error margin, though its accuracy sits in the lower-middle bracket.

This is a low R² score for a prediction model.

R² of 0.42 is nothing to brag about, and I think it has a lot to do with how Steam handles the discounts —

Only publishers/developers have the authority to put their games on sale.

That means the discount % depends heavily on each publisher’s/developer’s marketing strategy and financial situation. Such data is beyond my ability to gather at this point in time, though I hope things will improve in the future.

Use Case

Other than curiosity, another important factor that drove me to do a project like this is its practical implications: it can be particularly useful for online video game re-sellers.

A simple diagram showing how they operate.

As the diagram above shows, my prediction model can help them predict the next big discount, so they can allocate their resources better and potentially increase their profit margins.

Thank you for reading, and I hope this brings helpful insight into whatever you do.
