Analysis of Video Game Sales
Using data from VgChartz.com to predict video game copies shipped in the millions.
Overview
In this project, I scraped data from VgChartz.com to predict video game copies shipped in the millions. The data was neither the cleanest nor the most comprehensive, but it was a great exercise in scraping web pages as well as in linear and Poisson regression. A stronger analysis would need more numerical features.
Approach
The first step was to acquire the data from the website. Using BeautifulSoup in Python, I scraped 55,000 rows of data from VgChartz search results. I then had to engineer features from which to predict the shipped copies, given the limited information on VgChartz.
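As a rough illustration of the row-extraction step, here is a minimal sketch that pulls table rows out of an HTML snippet. It deliberately swaps BeautifulSoup for Python's built-in html.parser so that it runs without extra installs, and the markup, column order, and the names SAMPLE_HTML, TableRowParser, and parse_rows are all made up for the example; the real VgChartz page layout is not shown in this post.

```python
from html.parser import HTMLParser

# Hypothetical markup: NOT the real VgChartz layout, just a stand-in table.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Publisher</th><th>Shipped</th></tr>
  <tr><td>Example Game A</td><td>Example Publisher</td><td>1.20m</td></tr>
  <tr><td>Example Game B</td><td>Another Publisher</td><td>0.01m</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def parse_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows


rows = parse_rows(SAMPLE_HTML)
# Convert the assumed "1.20m" format to a float in millions of copies.
shipped_millions = [float(row[2].rstrip("m")) for row in rows]
```

With BeautifulSoup, the same idea is usually written as a loop over soup.find_all("tr"), reading the td cells of each row.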
The features I engineered and used were average rating, years since release, whether or not the release date was within 4 days of a US holiday, and whether the publisher was among the top 10 publishers by revenue in 2018. I originally planned to also flag releases near Chinese, South Korean, or Japanese holidays, but those flags correlated only weakly with the target variable, so I dropped them.
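The feature engineering described above could look something like the sketch below. The holiday list, the publisher set, and the function names are placeholders chosen for illustration, not the ones used in the project; a real version would need a full US holiday calendar and the actual 2018 top-10 publisher list.

```python
from datetime import date

# Assumption: a tiny fixed-date holiday list, for illustration only.
US_HOLIDAYS = [(1, 1), (7, 4), (12, 25)]  # (month, day)

# Placeholder names, not the real top-10 publishers by 2018 revenue.
TOP_PUBLISHERS = {"Example Publisher"}


def near_us_holiday(release, window_days=4):
    """True if `release` falls within `window_days` of any listed holiday."""
    # Check neighbouring years too, so late December can match January 1.
    for year in (release.year - 1, release.year, release.year + 1):
        for month, day in US_HOLIDAYS:
            if abs((release - date(year, month, day)).days) <= window_days:
                return True
    return False


def years_since_release(release, today):
    """Approximate age of the game in years."""
    return (today - release).days / 365.25


def build_features(avg_rating, release, publisher, today):
    """Assemble the four features used in the models as a dict."""
    return {
        "avg_rating": avg_rating,
        "years_since_release": years_since_release(release, today),
        "near_us_holiday": int(near_us_holiday(release)),
        "top_publisher": int(publisher in TOP_PUBLISHERS),
    }


example = build_features(
    avg_rating=8.1,
    release=date(2017, 12, 30),
    publisher="Example Publisher",
    today=date(2018, 12, 30),
)
```

Encoding the two yes/no features as 0/1 integers keeps every column numeric, which is what both regression models expect.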
I first tried standard Ordinary Least Squares (OLS) without transforming the data, but got an R² of only 0.11. I then decided to try a Poisson regression because the sales data appeared to be discrete counts and right-skewed: most of the copies-sold entries were around 0.01 million, the minimum for the website.
I also tried transformations such as a log transform of the target, but this proved unhelpful.
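For reference, the two models compared above can be written out as follows. This is the textbook form, not taken from the project code. Here y_i is the copies shipped for game i and x_{i1}, ..., x_{i4} are the four features; treating y_i as a count (for example in units of 0.01 million) is an assumption made so that the Poisson form applies.

```latex
% OLS: the mean of y is linear in the features
y_i = \beta_0 + \sum_{j=1}^{4} \beta_j x_{ij} + \varepsilon_i

% Poisson regression: the log of the mean is linear in the features
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \sum_{j=1}^{4} \beta_j x_{ij}

% Poisson log-likelihood, maximized over \beta
\ell(\beta) = \sum_{i=1}^{n} \left( y_i \log \mu_i - \mu_i - \log (y_i!) \right)
```

Because the Poisson model works on the log of the mean, its coefficients act multiplicatively on expected sales, while OLS coefficients act additively, so the two sets of coefficients are not directly comparable in size.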
Results
Both the Poisson regression and the standard linear regression gave log-likelihoods in the -2000 to -3000 range. A log-likelihood on its own does not show that a fit is better than chance; that would take a comparison against an intercept-only model, for example with a likelihood-ratio test. Their root mean squared error (RMSE) was also fairly high: about 0.36 million copies, against a maximum of 1.6 million copies sold in the dataset, or roughly 22% of the max.
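To make the RMSE comparison concrete, here is a small sketch of how the metric is computed and how it relates to the dataset maximum. The rmse helper and the toy actual/predicted values are illustrative only; just the 0.36 and 1.6 figures come from the results above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Toy values in millions of copies, made up for the example.
toy_actual = [0.01, 0.05, 0.40, 1.60]
toy_predicted = [0.20, 0.20, 0.30, 0.90]
toy_rmse = rmse(toy_actual, toy_predicted)

# Figures reported above: RMSE of 0.36 million against a 1.6 million maximum.
reported_share_of_max = 0.36 / 1.6  # 0.225, i.e. about 22% of the max
```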
I plotted the residuals (the gap between each data point and the model’s prediction, or how far off the model was) to diagnose the issue. There was a clear negative trend, suggesting that my models were consistently overestimating the copies sold for underperforming games and underestimating them for successful games. The plot shows that these models, while capturing some correlation, were not very accurate.
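The pattern described above can be reproduced with a toy example. The sketch below assumes residuals are defined as predicted minus actual and are compared against the actual values; the post does not state which convention the plot used, and the data and the ols_slope helper are made up. The toy "model" simply shrinks every prediction toward the mean, which is what a weak model tends to do.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy sales figures in millions of copies (illustrative only).
actual = [0.01, 0.05, 0.20, 0.80, 1.60]
mean_actual = sum(actual) / len(actual)

# A weak toy model: predictions only move 30% of the way from the mean.
predicted = [mean_actual + 0.3 * (a - mean_actual) for a in actual]

# Assumed convention: residual = predicted - actual.
residuals = [p - a for p, a in zip(predicted, actual)]

# Negative slope: low sellers are overestimated, high sellers underestimated.
slope = ols_slope(actual, residuals)
```

Under this convention a positive residual means the model overestimated, so the lowest-selling game gets a positive residual and the best-selling game a negative one.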
The most helpful feature across both models seemed to be whether or not the game was made by a top 10 publisher. The other features had differing signs between models as well as relatively low coefficients, suggesting that the models couldn’t use those features to consistently predict an advantage or disadvantage in sales.
Conclusion
It makes sense that it’s difficult to accurately estimate the copies of a video game sold given only two numerical features, average rating and years since release. I would like to explore this analysis further with data such as the total budget, number of mentions on Twitter before release, sales during opening week, and genre of the game, in order to predict the success of a video game over its lifetime.
My code can be found here.