The Similarities of Video Games Sold

Michael Kelley
INST414: Data Science Techniques
3 min read · May 11, 2022

Using an existing dataset of video game statistics from Kaggle, I analyzed different video games, their sales, the regions where they were sold, and other statistics. I hoped to learn which video games have more similar, or more different, histories than people might initially expect.

One insight that I hope to extract from this data is which video games in the dataset have the most in common with each other in terms of their backgrounds and their commercial success around the world. This insight could inform future decisions about how these games and their franchises are marketed, and about which audiences video game companies try to appeal to with the design choices in their future projects.

The source of my data is Kaggle, specifically a dataset called “vg-stats” uploaded by a user named Suhaib Ahmad. This dataset gives several important pieces of information on each video game, such as its title, console, year of release, developer, user ratings, global sales, and sales in specific regions of the world. For this analysis, I use North American sales as the metric of similarity: two games count as similar when their North American sales figures are close. This alone is not a sufficient measure of similarity, but it works well enough for this exercise in quantifying similarity between video games.

The three query items that I have chosen for the sake of this analysis are Wii Sports, Super Mario Land 2: 6 Golden Coins, and Pokémon Gold/Pokémon Silver. The most similar items to each of these games in the dataset, in terms of North American sales, are as follows:

Wii Sports (41.36 million sales)
Super Mario Land 2: 6 Golden Coins (6.16 million sales)
Pokémon Gold/Pokémon Silver (9.00 million sales)
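As a rough illustration of how such a ranking can be computed, here is a minimal sketch that orders games by the absolute difference between their North American sales and the query game's. This is not the exact notebook code. The `most_similar` helper and the three "Placeholder" rows are hypothetical; only the three query titles and their sales figures come from the list above.

```python
def most_similar(games, query_title, k=3):
    """Return the k games whose NA sales are closest to the query game's."""
    sales = dict(games)
    query_sales = sales[query_title]
    # Exclude the query itself, then sort by absolute sales difference.
    others = [(title, s) for title, s in games if title != query_title]
    others.sort(key=lambda item: abs(item[1] - query_sales))
    return others[:k]


# (title, North American sales in millions).
# The "Placeholder" rows are made up purely for illustration.
games = [
    ("Wii Sports", 41.36),
    ("Super Mario Land 2: 6 Golden Coins", 6.16),
    ("Pokémon Gold/Pokémon Silver", 9.00),
    ("Placeholder Game A", 6.50),
    ("Placeholder Game B", 8.90),
    ("Placeholder Game C", 30.00),
]

result = most_similar(games, "Pokémon Gold/Pokémon Silver", k=2)
print(result)
```

With these made-up rows, the two closest games to Pokémon Gold/Pokémon Silver are the placeholders at 8.90 and 6.50 million, since they have the smallest sales gaps from 9.00 million.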

To carry out this analysis, I used Jupyter Notebook via Anaconda. This allowed me to load the dataset and use Python to sort and inspect its values. Along the way, there were some setbacks. At first, I had trouble figuring out how to sort the data and display all of the rows I needed, which made the analysis frustratingly hard to finish even after the data had been processed. I moved past this by studying some example code and applying the same principles to my own analysis. My output also included a lot of additional data that was unnecessary for this particular analysis. I resolved this by manually transcribing the relevant data into tables in Google Docs, which could then be displayed in this article.
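For readers who hit the same sorting and display problems, the following is one possible way to handle them in pandas. It is a sketch, not the code I actually ran. The column names (`title`, `console`, `na_sales`) and the placeholder rows are assumptions, so they would need to be adjusted to match the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle file; in practice this would be
# something like pd.read_csv(...) with the dataset's real column names.
df = pd.DataFrame({
    "title": ["Wii Sports", "Placeholder Game A",
              "Placeholder Game B", "Placeholder Game C"],
    "console": ["Wii", "X", "Y", "Z"],
    "na_sales": [41.36, 30.00, 8.90, 6.50],
})

query = "Wii Sports"
query_sales = df.loc[df["title"] == query, "na_sales"].iloc[0]

closest = (
    df[df["title"] != query]                                   # drop the query row
    .assign(diff=lambda d: (d["na_sales"] - query_sales).abs())  # sales gap
    .sort_values("diff")                                       # smallest gap first
    .head(2)[["title", "na_sales", "diff"]]                    # keep only needed columns
)

# Temporarily lift pandas' row limit so no rows are elided when printing.
with pd.option_context("display.max_rows", None):
    print(closest)
```

Selecting only the needed columns avoids the clutter of unrelated fields, and `display.max_rows` set to `None` keeps pandas from truncating long outputs.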

The limitations of this analysis are significant. It is a rather basic analysis of the data, with only one factor, sales in North America, being used to measure the similarity of video games. A more thorough analysis in the future could incorporate factors such as genre, console, reviews, data size, common players, and more in order to truly quantify the similarities of video games with a more advanced algorithm.
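To give a sense of what a multi-factor version might look like, here is a small sketch that rescales several numeric features to a common range and ranks games by Euclidean distance. This is only one possible approach, not something I implemented. The feature choice (North American sales, user rating, release year), the helper names, and all of the rows are hypothetical. Categorical factors such as genre or console would need an extra encoding step that is not shown.

```python
import math


def min_max_scale(rows):
    """Rescale each numeric column to [0, 1] so no single feature dominates."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]


def rank_by_distance(titles, features, query_title):
    """Return the other titles ordered from most to least similar to the query."""
    scaled = min_max_scale(features)
    query_vec = scaled[titles.index(query_title)]
    ranked = [
        (math.dist(query_vec, vec), title)
        for title, vec in zip(titles, scaled)
        if title != query_title
    ]
    return [title for _, title in sorted(ranked)]


# Made-up rows: (NA sales in millions, user rating, release year).
titles = ["Game A", "Game B", "Game C", "Game D"]
features = [
    (10.0, 8.0, 2000),
    (9.5, 7.8, 2001),
    (10.2, 3.0, 1990),
    (1.0, 8.1, 2020),
]

ranking = rank_by_distance(titles, features, "Game A")
print(ranking)
```

In this toy data, Game C is closest to Game A on sales alone, but once rating and release year are included Game B ranks first. That is the kind of difference a sales-only comparison cannot capture.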

Overall, I believe that I was able to get a good idea of which video games have had similar levels of success in North America. This can give people a sense of which games may have a similar type or level of appeal to Western audiences. If and when a more complete similarity analysis is conducted, it could be very useful for game developers in determining how best to market their products and whom to target.
