Linear Regression: Predicting the Magnitude of Steam Discounts
Have you ever wondered why some games get bigger discounts than others? I know I have, and if you are just as curious as I am, please join me and read on.
(You can find this project on my git repository.)
Primary Goal
In this project, I will seek to find out what influences the discount percentage of strategy games on Steam, and build a linear regression model to predict that discount.
Data
The data in this project will be directly scraped from Steam’s official website.
The scraping algorithm is written in Python, utilizing libraries including:
- “re” — Python’s built-in regular-expression library, for pattern matching.
- “csv” — for writing the scraped data into .csv files that pandas can read.
- “requests” — for fetching pages from the Steam store.
- “BeautifulSoup” — a popular web-scraping library for finding HTML elements/tags.
You can find the scraper in my git repository.
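To give a flavor of how these libraries fit together, here is a minimal, self-contained sketch of the parsing-and-writing step. The HTML snippet and field names below are simplified stand-ins for Steam’s actual markup, not the real scraper:

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Simplified stand-in for one row of Steam's search results. The real
# scraper would fetch the live page first with "requests", e.g.:
#   html = requests.get("https://store.steampowered.com/search/?tags=9").text
html = """
<a class="search_result_row" href="#">
  <span class="title">Example Strategy Game</span>
  <div class="search_discount"><span>-40%</span></div>
  <div class="search_price"><strike>$19.99</strike> $11.99</div>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for row in soup.find_all("a", class_="search_result_row"):
    title = row.find("span", class_="title").get_text(strip=True)
    # "re" pulls the numeric parts out of strings like "-40%" and "$19.99"
    discount = re.search(r"-(\d+)%", row.get_text())
    prices = re.findall(r"\$(\d+\.\d\d)", row.get_text())
    rows.append({
        "Title": title,
        "DiscountPercentage": int(discount.group(1)) if discount else 0,
        "OriginalPrice": float(prices[0]) if prices else None,
    })

# "csv" writes the scraped rows so pandas can read them back later
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Title", "DiscountPercentage", "OriginalPrice"])
writer.writeheader()
writer.writerows(rows)
```

In the real scraper the same parse loop simply runs over every result row of every fetched page before the csv step.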
The scraped data when loaded into Pandas looks like this:
For all future references, the Target I want to predict in the data set will be ‘DiscountPercentage’.
Data Cleaning
The raw data scraped contains many things we do not need for our purpose:
I. Free games that have no price, including demos and upcoming releases.
II. Games that are not on sale.
But we are not done yet. Because we are building a linear regression model, we must also exclude:
III. Non-numerical entries.
That means dropping some string-based columns. While we are at it, why not also try to extract some information from them through Feature Engineering?
I will cover all of the Feature Engineering done during modeling and testing in later chapters; for the baseline model, see below:
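The two filtering steps above can be sketched as follows. The column names (“OriginalPrice”, “DiscountPercentage”) follow the data set, but the toy frame itself is made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame; in the project the data comes from the scraped
# CSV, e.g. df = pd.read_csv("steam_strategy.csv")
df = pd.DataFrame({
    "Title": ["Free Demo", "Full Price Game", "Discounted Game"],
    "OriginalPrice": [0.0, 29.99, 19.99],
    "DiscountPercentage": [0, 0, 40],
})

# I.  Drop free games (demos, upcoming releases) that have no price
df = df[df["OriginalPrice"] > 0]

# II. Drop games that are not on sale
df = df[df["DiscountPercentage"] > 0]
```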
We can also add a “Season” column to see which season a game is released in:
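One possible way to derive such a column is shown below. The “Release_Date” column name and the month-to-season mapping are assumptions on my part, not the project’s exact code:

```python
import pandas as pd

# Toy release dates; the real values come from the scraped data
df = pd.DataFrame({
    "Release_Date": pd.to_datetime(["2020-01-15", "2020-07-04", "2020-10-31"]),
})

def month_to_season(month):
    # Dec-Feb: Winter, Mar-May: Spring, Jun-Aug: Summer, Sep-Nov: Fall
    return {12: "Winter", 1: "Winter", 2: "Winter",
            3: "Spring", 4: "Spring", 5: "Spring",
            6: "Summer", 7: "Summer", 8: "Summer",
            9: "Fall", 10: "Fall", 11: "Fall"}[month]

df["Season"] = df["Release_Date"].dt.month.map(month_to_season)
```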
After the above process, we can now safely drop all string based columns from the data frame:
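With pandas, dropping the string-based columns can be as simple as keeping only the numeric ones — a sketch, not the project’s exact code:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B"],               # string column, to be dropped
    "OriginalPrice": [9.99, 19.99],
    "DiscountPercentage": [25, 50],
})

# Keep only numeric columns so the frame is ready for linear regression
df = df.select_dtypes(include="number")
```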
This process shrinks our data from 14,806 entries and 12 features down to 370 entries and 7 features. Unfortunately, the smaller sample size makes the model more prone to inaccuracy.
Data Analysis
The analysis part of this project consists of three steps, using scikit-learn:
- Exploratory Data Analysis (EDA)
- Feature Engineering (FE)
- Modeling
And all three steps are repeated every time something changes.
The general work flow for this project looks like this:
- EDA to find notable Feature-Target relationships (through pair plots/heat maps, Lasso coefficients, etc.)
- Based on the available information, conduct Feature Engineering (mathematical transformations, binning, dummy variables, etc.)
- Modeling and scoring using R² and/or other metrics (RMSE, MAE, etc.)
- Rinse and repeat until all promising FE ideas have been tried and exhausted, or an acceptable R² score has been achieved.
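The scoring step of this loop can be sketched as follows, with synthetic data standing in for the cleaned features. The “score +- margin” figures reported in each round are the mean and spread of cross-validated R² across folds:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 observations, 5 features, two of which
# carry no signal (coefficients of 0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=200)

# Cross-validated R²: the mean is the score, the std is the +- margin
scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=5, scoring="r2")
print(f"{scores.mean():.3f} +- {scores.std():.3f}")
```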
Seven rounds of the aforementioned cycle were conducted; below I summarize and share the code for each one:
Round 1: Baseline Model
Changes made: removing games that have less than 30 reviews
Best Model: Lasso
Score: 0.419 +- 0.073
Round 2: Mathematical Transformation
Changes made:
Log transformed “Reviews” & “OriginalPrice” due to skewed pair-plot results.
Best Model: Lasso
Score: 0.437 +- 0.104
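The Round 2 transformation can be sketched like this. I use log1p so that zero values do not blow up; the toy values are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame with right-skewed features, as seen in the pair plots
df = pd.DataFrame({
    "Reviews": [10, 100, 10000],
    "OriginalPrice": [4.99, 19.99, 59.99],
})

# log1p(x) = log(1 + x): compresses the long right tail, safe at zero
df["Reviews"] = np.log1p(df["Reviews"])
df["OriginalPrice"] = np.log1p(df["OriginalPrice"])
```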
Round 3: Incorporate Main_Tag as dummies into the model
Changes made:
Adding dummy data generated from “Main_Tag” column to the data frame.
Best Model: Lasso
Score: 0.330 +- 0.073
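One-hot encoding the main tag can be done with pandas’ get_dummies; the tag values here are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Main_Tag": ["City Builder", "Turn-Based", "City Builder"],
    "DiscountPercentage": [40, 25, 60],
})

# Replace the string column with one dummy column per tag value
df = pd.get_dummies(df, columns=["Main_Tag"], prefix="Tag")
```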
Round 4: Try with All_Tags as dummies and see if there is any improvements
Changes made:
Dummy columns added based on which csv file is used; in this case, the top 5 tags (a scraped game usually has 10 to 20 tags).
Best Model: Lasso
Score: 0.359 +- 0.080
Round 5: Mathematical Transformation & All_Tags dummies
Changes made:
Log transformed “Reviews” + top 5 tags.
Best Model: Lasso
Score: 0.359 +- 0.080
(Note: the score is exactly the same as in Round 4, which suggests that log-transforming “Reviews” changes nothing, likely because Lasso has already zeroed out its coefficient; the feature can be safely dropped if needed.)
Round 6: Binning Features in three brackets
Changes made:
Separating “Reviews” and “Days_Since_Release” into three brackets.
Best Model: Ridge
Score: 0.273 +- 0.044
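The binning itself can be sketched with pandas’ qcut, which cuts a feature into roughly equal-sized brackets; the values and labels below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Reviews": [35, 120, 450, 2000, 15000, 90000]})

# Three quantile-based brackets; Round 7 would use q=2 instead
df["Reviews_Bracket"] = pd.qcut(df["Reviews"], q=3, labels=["low", "mid", "high"])
```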
Round 7: Binning Features in two brackets + adding Floor Threshold
Changes made:
Separating “Reviews” and “Days_Since_Release” into two brackets and dropping observations that have less than 30 reviews.
Best Model: Lasso
Score: 0.313 +- 0.098
Conclusion
Let’s start with some statistics generated from Data Cleaning:
Notice that this does not fall within any sales season, which is interesting: it means sale seasons may not offer as wide a range of discounts as I previously imagined.
Now, let’s see the top two features that influence the discount percentage:
But what doesn’t impact it? Here are the top two features that have no influence on the discount percentage:
And the best model is in fact the baseline model, which has the smallest error margin, though its accuracy sits in the lower-middle bracket.
An R² of 0.42 is nothing to brag about, and I think that has a lot to do with how Steam handles discounts:
Only publishers/developers have the authority to put their games on sale.
That means the discount percentage depends heavily on each publisher’s or developer’s marketing strategy and financial situation. Such data is beyond my ability to gather at this point in time, though I hope things will improve in the future.
Use Case
Other than curiosity, another important factor that drove me to do this project is its practical implications: it can be particularly useful for online video game resellers.
As the above diagram shows, my prediction model can help them predict the next big discount, so they can allocate their resources better and potentially increase their profit margin.
Thank you for reading, and I hope this brings helpful insight to whatever you do.