Analysis of Video Game Sales

Using data from VgChartz.com to predict video game copies shipped in the millions.

Jeremy Chow
Modeling (The Data Kind)
3 min read · Apr 24, 2019


Overview

In this project, I scraped data from VgChartz.com to predict video game copies shipped in the millions. The data was neither the cleanest nor the most comprehensive, but it was a great exercise in scraping web pages as well as in linear and Poisson regression. A stronger analysis would need more numerical data.

Approach

The first step was to acquire the data from the website. Using BeautifulSoup in Python, I scraped 55,000 rows of data from VgChartz search results. I then had to engineer features for predicting the shipped copies, given the limited information on VgChartz.
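To give a feel for the row-extraction step, here is a minimal sketch. The project itself used BeautifulSoup; this version swaps in the standard library's html.parser so it runs without extra installs. The sample HTML, its column layout, and the class name TableRowParser are all hypothetical and do not reflect the real VgChartz markup.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip header rows, which contain only <th> cells
                self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for one page of search results.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Publisher</th><th>Shipped</th></tr>
  <tr><td>Example Game A</td><td>Example Publisher</td><td>1.20m</td></tr>
  <tr><td>Example Game B</td><td>Another Publisher</td><td>0.01m</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
```

Each entry in rows is one scraped record, ready to be cleaned (for example, stripping the trailing "m" and converting to a float) before feature engineering.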

Screenshot of rows scraped from VgChartz.com

The features I engineered and used were average rating, years since release, whether the release date was within 4 days of a US holiday, and whether the publisher was one of the top 10 publishers by revenue in 2018. I originally planned to also check whether the game was released near a Chinese, South Korean, or Japanese holiday, but those features had low correlations with the target variable, so I dropped them.
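The engineered features can be sketched roughly as follows. This is an illustration, not the project's actual code: the holiday list is a partial, hypothetical one covering only 2018, the publisher names are placeholders, and the as_of reference date is an assumption.

```python
from datetime import date

# Hypothetical inputs: a partial 2018 holiday list and placeholder publisher names.
US_HOLIDAYS_2018 = [
    date(2018, 1, 1),    # New Year's Day
    date(2018, 7, 4),    # Independence Day
    date(2018, 11, 22),  # Thanksgiving
    date(2018, 12, 25),  # Christmas Day
]
TOP_PUBLISHERS = {"Publisher A", "Publisher B"}


def near_holiday(release, holidays, window_days=4):
    """True if the release date falls within window_days of any holiday."""
    return any(abs((release - h).days) <= window_days for h in holidays)


def build_features(release, publisher, as_of=date(2019, 4, 24)):
    """Turn one game's release date and publisher into model features."""
    return {
        "years_since_release": (as_of - release).days / 365.25,
        "near_us_holiday": int(near_holiday(release, US_HOLIDAYS_2018)),
        "top_publisher": int(publisher in TOP_PUBLISHERS),
    }


features = build_features(date(2018, 12, 22), "Publisher A")
```

A real version would need holiday dates for every release year in the data, since this list only matches releases in 2018.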

Correlation heatmap of the engineered features and the target variable. The left column shows the correlation between each feature and the copies shipped in millions. Overall, correlations are low across the board.

I first tried standard Ordinary Least Squares (OLS) regression without transforming the data, but got an R² of only 0.11. I then decided to try a Poisson regression because the sales data appeared to be discrete counts and was right-skewed: most of the copies-sold entries were around 0.01 million, the minimum value listed on the website.
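For readers unfamiliar with Poisson regression, here is a minimal sketch of what the fit does for a single feature. It models log(expected count) = b0 + b1 * x and maximizes the Poisson log-likelihood with Newton's method. The function name fit_poisson and the toy data are hypothetical; in practice a statistics library would handle the fitting, and this is not the project's code.

```python
import math


def fit_poisson(xs, ys, iterations=25):
    """Fit log(mean) = b0 + b1 * x by Newton's method on the Poisson log-likelihood."""
    b0 = math.log(sum(ys) / len(ys))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus))
        # Negative Hessian (Fisher information), a symmetric 2x2 matrix.
        h00 = sum(mus)
        h01 = sum(mu * x for x, mu in zip(xs, mus))
        h11 = sum(mu * x * x for x, mu in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: add (negative Hessian)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical toy data: one feature and a small count outcome.
xs = [0, 1, 2, 3, 4, 5]
ys = [2, 2, 3, 4, 5, 8]
b0, b1 = fit_poisson(xs, ys)
predicted = [math.exp(b0 + b1 * x) for x in xs]
```

Because predictions are exp(b0 + b1 * x), they are always positive, which is one reason this model is a natural candidate for skewed, count-like targets where OLS can predict negative values.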

The histogram of the target variable (number of copies sold) shows a discrete, right-skewed distribution, the kind of shape a Poisson distribution is often used to model.

I also tried transformations such as log-transforming the target, but this proved unhelpful.

Results

Both the Poisson regression and the standard linear regression gave log-likelihoods in the -2000 to -3000 range. A log-likelihood says little on its own, though, and the root mean squared error (RMSE) of both models was fairly high: about 0.36 million copies against a maximum of 1.6 million copies sold in the dataset, or roughly 22% of the maximum.
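As a quick reference, RMSE and its size relative to the largest value in the data can be computed like this. The rmse helper is a generic sketch, and only the 0.36 and 1.6 figures come from the results reported here.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Figures reported in this post, in millions of copies.
reported_rmse = 0.36
max_copies_sold = 1.6
share_of_max = reported_rmse / max_copies_sold  # about 0.225
```

Comparing RMSE to the maximum is a generous yardstick. Since most games sit near 0.01 million copies, the error is far larger relative to a typical game.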

The negative trend between residuals and sales (in millions) suggests the model systematically overpredicts sales of underperforming games and underpredicts sales of successful games.

I plotted the residuals (the distance between my data points and my model's predictions, or how far off my model was) to diagnose the issue. There was a clear negative trend, suggesting that my models were consistently overestimating the copies sold for underperforming games and underestimating the copies sold for successful games. The chart shows that these models, while capturing some correlation, were not very accurate.
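The trend in a residual plot can also be summarized as a number. The sketch below assumes residuals are computed as predicted minus actual, which is the sign convention consistent with the negative trend described above. The function name and the toy data are hypothetical.

```python
def residual_slope(actual, predicted):
    """Least-squares slope of residuals (predicted - actual) against actual sales."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    mean_x = sum(actual) / len(actual)
    mean_r = sum(residuals) / len(residuals)
    covariance = sum((x - mean_x) * (r - mean_r) for x, r in zip(actual, residuals))
    variance = sum((x - mean_x) ** 2 for x in actual)
    return covariance / variance


# Hypothetical sales in millions; this "model" predicts the average for every game.
actual = [0.01, 0.05, 0.20, 0.60, 1.60]
flat_prediction = [sum(actual) / len(actual)] * len(actual)
slope = residual_slope(actual, flat_prediction)  # approximately -1
```

A model that ignores its inputs and always predicts the average gives a slope of about -1, while a perfect model gives 0. The closer the slope is to -1, the less the predictions track actual sales.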

If the model were 100% accurate, the points on this graph would fall on a 1:1 line: actual sales would match the predicted sales exactly. Here, we have a scattered and flat distribution, meaning that the predictions don’t rise with the actual sales numbers and the model is not predicting accurately from the features it was given.

The most helpful feature in both models seemed to be whether the game was made by a top 10 publisher. The other features had differing signs between models as well as relatively low coefficients, suggesting that the models couldn’t use those features to consistently predict an advantage or disadvantage in sales.

Conclusion

In conclusion, it makes sense that it’s difficult to accurately estimate the copies of a video game sold given only two numerical features, average rating and years since release. I would like to extend this analysis with data such as total budget, number of mentions on Twitter before release, sales during opening week, and genre, in order to predict the success of a video game over its lifetime.

My code can be found here.

