Video-Game Sales Analysis with Python

Kaif Kohari
Published in Nerd For Tech
6 min read · Jul 15, 2020

The goal is to turn data into information, and information into insight. - Carly Fiorina

Whether you are a data scientist or an ML engineer, data analysis will always be an integral part of your work. Andrew Ng said in one of his webinars that most of the work his team does in their research labs is cleaning and analysing data before feeding it into ML models. This shows how important it is to understand and analyse data before applying ML or DL algorithms to it.

In this blog we are going to work with a cool dataset which contains data about Video Games sales across various regions in the world from 2015. I have tried to make this blog as interesting as possible by keeping it simple and short.

Disclaimer: This data is from 2015/16 and because of this it doesn’t contain the latest released games.

The link to the github repo which contains the code and the dataset is at the end of the blog. I have also shared some great resources for beginners in data science to follow.

Prerequisite: Make sure you have matplotlib, numpy, seaborn and pandas installed before starting.

Pandas, in conjunction with Matplotlib and Seaborn, provides a wide range of opportunities for data analysis. These three are among the most widely used libraries for data analysis in Python, and they are all you need to follow this blog.

EDA cycle: understanding data quality, description, shape, patterns and relationships, and visualizing the data for better understanding. We will go on a fun ride of plotting cool graphs and a pie chart, and finding out the best games, publishers, critic scores, etc.

So, without any further delay let’s start.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
#Read the csv file
df = pd.read_csv('E:/videogame/Video_Games_Sales_as_at_22_Dec_2016.csv')
df.head(20)#Check first 20 rows of our data

Columns in our dataset:

Name of the game, Platform (like PS2, PS3, Wii,etc), Year_of_Release, Genre of game, Critic score, user score, Ratings.

Check the number of rows and columns in your dataset

df.shape
(16719, 16)
#16719 rows and 16 columns
#Some statistical analysis of our data
df.describe()
#this gives count, mean, std, min, quartiles and max of all columns containing numerical values.

df.describe() gives us a statistical overview of the columns containing numerical values. Below I have used only one column to show what the results look like. (It gives the same statistical output for all other numerical columns.)

N.America_Sales

  • count: 16179
  • max: 41.360000
  • min: 0.000000
  • std: 0.813514
  • 25%: 0.000000
  • 50%: 0.080000
  • 75%: 0.240000
  • mean: 0.263330
df.info()
#Displays the data type of each column in the dataset, e.g. whether the column holds float, int or object values.
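The single-column summary above can be reproduced by calling describe() on one column instead of the whole DataFrame. Here is a minimal sketch on a tiny hand-made DataFrame; the four values are placeholders and not rows from the dataset, and the column name N.America_Sales is simply the one used in this blog.

```python
import pandas as pd

# Toy stand-in for the real dataset; the numbers are made-up placeholders.
toy = pd.DataFrame({"N.America_Sales": [1.0, 2.0, 3.0, 4.0]})

# describe() on a single numeric column returns a Series with
# count, mean, std, min, 25%, 50%, 75% and max.
stats = toy["N.America_Sales"].describe()
print(stats)

# Individual statistics can be read by label.
print(stats["mean"])
```

On the real data, the same call on df['N.America_Sales'] should give a summary like the one listed above.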

Filtering

Let's see which game rules the market in global sales by filtering for the game with the maximum global sales.

Output: Wii Sports (Most Popular game by global sales in 2015)
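A filter like this can be written as a boolean mask. The snippet below is only a minimal sketch on a tiny hand-made DataFrame: the sales figures and the names "Game B" and "Game C" are placeholders, and I am assuming the columns are called Name and Global_Sales.

```python
import pandas as pd

# Toy stand-in for the real dataset; the sales figures are made-up placeholders.
toy = pd.DataFrame({
    "Name": ["Wii Sports", "Game B", "Game C"],
    "Global_Sales": [3.0, 2.0, 1.0],
})

# Boolean mask: True only for rows whose Global_Sales equals the column maximum.
is_top_seller = toy["Global_Sales"] == toy["Global_Sales"].max()

# Use the mask to keep only the name(s) of the matching game(s).
top_games = toy.loc[is_top_seller, "Name"]
print(top_games.tolist())
```

Comparing against max() keeps every row that ties for first place. If a single row is enough, idxmax() on the same column is a common alternative.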

Filtering is something I use a lot, and it is very important for extracting a particular piece of information from a dataset. Below is a great video to get started with filtering in pandas.

Some Sports Games:

Let’s say your favourite genre of gaming is sports, and we know that most of the top sports games are published by EA. So we will filter out all the games published by EA. We will use the groupby function.

groupby splits the dataset into one group per value of the Publisher column; from those groups we can pick out all the rows in which the Publisher column is Electronic Arts (EA).

The output after grouping out the games published by EA, in decreasing order of year:

  • FIFA 16 (2015)
  • FIFA Soccer 13 (2012)
  • The Sims 3 (2009)
  • Star Wars Battlefront (2015)
  • ……..
  • The Godfather (JP sales) (2006)
  • Psychic Detective (1995)
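One way to do this grouping is groupby followed by get_group. This is a sketch under assumptions, not necessarily the exact code behind the list above: it runs on a four-row toy DataFrame and assumes the columns are named Name, Publisher and Year_of_Release.

```python
import pandas as pd

# Toy stand-in for the real dataset (only the columns needed here).
toy = pd.DataFrame({
    "Name": ["The Sims 3", "FIFA 16", "Wii Sports", "FIFA Soccer 13"],
    "Publisher": ["Electronic Arts", "Electronic Arts", "Nintendo", "Electronic Arts"],
    "Year_of_Release": [2009, 2015, 2006, 2012],
})

# groupby splits the rows into one group per publisher;
# get_group pulls out the rows belonging to a single group.
ea_games = toy.groupby("Publisher").get_group("Electronic Arts")

# Sort the EA titles from the newest to the oldest release year.
ea_games = ea_games.sort_values("Year_of_Release", ascending=False)
print(ea_games[["Name", "Year_of_Release"]])
```

For a single publisher, a plain boolean filter such as toy[toy["Publisher"] == "Electronic Arts"] returns the same rows. groupby pays off when you want to work with every publisher at once.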

Seaborn is a great library for plotting great visualizations in a few lines of code. Know more about Seaborn here.

sns.pairplot(df): Plot pairwise relationships in a dataset.

By default, this function (sns.pairplot(df)) will create a grid of Axes such that each numeric variable in the data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.

sns.pairplot(df)

By analyzing the above pairwise relationships between different columns, we can draw many conclusions, such as detecting outliers, judging which columns suit regression analysis, and spotting columns which do not contribute to the data.
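The pair plot is a visual tool. As a numeric companion (my own addition, not a replacement for the plot), a correlation matrix gives one number per pair of numeric columns. The sketch below uses toy values chosen so that the result is easy to check by eye; the column names follow the ones used in this blog.

```python
import pandas as pd

# Toy data: Europe sales rise exactly with N.America sales, Japan sales fall.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "N.America_Sales": [1.0, 2.0, 3.0, 4.0],
    "Europe_Sales": [2.0, 4.0, 6.0, 8.0],
    "Japan_Sales": [4.0, 3.0, 2.0, 1.0],
})

# Keep only the numeric columns (pairplot also ignores text columns like Name).
numeric = toy.select_dtypes(include="number")

# Pearson correlation between every pair of numeric columns.
corr = numeric.corr()
print(corr.round(2))
```

Values close to 1 or -1 point to a strong linear relationship, and values near 0 to a weak one. Real sales data will of course be much noisier than this toy example.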

Let’s compare the sales in America, Japan and Europe of some of the most famous games with matplotlib:

plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(25,25))
#One line per region, for the first 10 games in the dataset
plt.plot(df['Name'].head(10), df['N.America_Sales'].head(10), color='red', label='N.America_Sales')
plt.plot(df['Name'].head(10), df['Japan_Sales'].head(10), color ='pink', label='Japan_Sales')
plt.plot(df['Name'].head(10), df['Europe_Sales'].head(10), color='yellow', label='Europe_Sales')
plt.legend()
plt.xlabel('Famous Games')
plt.ylabel('Sales')
plt.title('Popularity of Famous Games')
#Call tight_layout last so it accounts for the labels and the title
plt.tight_layout()
plt.show()

EA vs Nintendo in terms of sales

Nin = (df['Publisher']=='Nintendo')
EA = (df['Publisher']=='Electronic Arts')
#filtering out EA sports and Nintendo to compare which company dominates

Nintendo

#Nintendo sales across various regions
print(df['Japan_Sales'][Nin].sum())
print(df['Europe_Sales'][Nin].sum())
print(df['N.America_Sales'][Nin].sum())
print(df['Global_Sales'][Nin].sum())
#Output for Nintendo
458.15 #Japan sales of Nintendo
419.01 #Europe_Sales of Nintendo
816.9700000000001 #N.America Sales of Nintendo
1788.81 #Global_Sales of Nintendo

EA

#EA sales across various regions
print(df['Japan_Sales'][EA].sum())
print(df['Europe_Sales'][EA].sum())
print(df['N.America_Sales'][EA].sum())
print(df['Global_Sales'][EA].sum())
#Output for EA
14.350000000000001 #Japan sales of EA
373.90999999999997 #Europe_Sales of EA
599.5 #N.America Sales of EA
1116.96 #Global_Sales of EA

Nintendo dominates over E.A. in all regions in terms of sales.
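The eight separate sums above can also be produced as one table with groupby. This is a sketch on toy rows: the figures are placeholders and not the real totals, the "Other" publisher is hypothetical, and the column names are the ones used in this blog.

```python
import pandas as pd

# Toy stand-in for the real dataset; all sales figures are made-up placeholders.
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Nintendo", "Electronic Arts", "Electronic Arts", "Other"],
    "Japan_Sales": [2.0, 1.0, 0.1, 0.2, 5.0],
    "Europe_Sales": [1.5, 0.5, 1.0, 0.5, 5.0],
    "N.America_Sales": [3.0, 1.0, 2.0, 1.0, 5.0],
    "Global_Sales": [7.0, 3.0, 3.5, 2.0, 16.0],
})

regions = ["Japan_Sales", "Europe_Sales", "N.America_Sales", "Global_Sales"]

# Keep only the two publishers we want to compare.
both = toy[toy["Publisher"].isin(["Nintendo", "Electronic Arts"])]

# One row per publisher, one column per region; round() hides float noise
# such as 816.9700000000001.
totals = both.groupby("Publisher")[regions].sum().round(2)
print(totals)
```

Having both publishers in one table makes the region-by-region comparison easier to read than two blocks of printed numbers.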

Which games are the best according to the critics? Let’s find out.

filter2 = (df['Critic_Score']==df['Critic_Score'].max())
df['Name'][filter2] #best games according to critic scores
#Below were the top 4 games according to critic score in 2015.
51 Grand Theft Auto IV
57 Grand Theft Auto IV
227 Tony Hawk's Pro Skater 2
5350 SoulCalibur

Let’s plot a cool pie chart to find out which gaming genres have the largest number of games published.

Watch the video below to learn how to make cool pie charts like this.
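A pie chart of games per genre can be sketched by counting the genres with value_counts() and passing the counts to matplotlib. The snippet runs on a toy Genre column with made-up rows and assumes the column is called Genre; the chart in this blog may have been styled differently.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the Genre column of the real dataset.
toy = pd.DataFrame({"Genre": ["Sports", "Action", "Sports", "Racing", "Action", "Action"]})

# Number of games per genre, largest first.
genre_counts = toy["Genre"].value_counts()

# One wedge per genre; autopct prints each genre's share as a percentage.
fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(genre_counts.values, labels=genre_counts.index.tolist(), autopct="%1.1f%%")
ax.set_title("Share of games per genre (toy data)")

# In a notebook the figure renders automatically; in a script, call plt.show().
```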

Rating system in Gaming

We also have a Rating column in our dataset. It contains values like E, M, T, E10+, K-A, AO and RP. Let’s take a look at what they mean:

  • E: Everyone can play it.
  • M: Mature 17+
  • T: For Teens
  • K-A: Kids to Adults
  • AO: Adults Only
  • RP: Rating Pending
  • E10+: Everyone aged 10 and up

To know more about how video games are rated, click here.

Below is a simple bar-plot to check how many games belong to each rating category.

Pandas is rarely used for plotting, but it is quite useful sometimes. To learn more about data visualization with pandas, click here.

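from collections import Counter
#Count how many games fall into each rating category
a = list(df['Rating'])
letter_counts = Counter(a)
#Turn the counts into a DataFrame (one row per rating) and draw a bar chart
d = pd.DataFrame.from_dict(letter_counts, orient='index')
d.plot(kind='bar')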

Time Series

#Convert to datetime format so pandas can process it.
df['Year_of_Release'] = pd.to_datetime(df['Year_of_Release'], format='%Y')
df['Year_of_Release'].min()
#Timestamp('1980-01-01') (earliest game).
df['Year_of_Release'].max()
#Timestamp('2015-01-01') (latest game)
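Once the year is a datetime, the .dt accessor makes simple time-based summaries easy, for example the number of games released per year. This is a minimal sketch on toy rows with made-up years. Here I convert the year to a string first so that format='%Y' clearly applies; that is a choice for this example, not a requirement of the code above.

```python
import pandas as pd

# Toy stand-in for the real dataset; the years are made-up placeholders.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year_of_Release": [2006, 2015, 2006, 1995],
})

# Parse the year as a string so that format='%Y' applies unambiguously.
toy["Year_of_Release"] = pd.to_datetime(toy["Year_of_Release"].astype(str), format="%Y")

# Count releases per calendar year, ordered chronologically.
releases_per_year = toy["Year_of_Release"].dt.year.value_counts().sort_index()
print(releases_per_year)
```

Calling releases_per_year.plot() on the result would turn these counts into a simple line chart of releases over time.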

Want to learn time series analysis? Watch the video below.

Data analysis is not limited to this, and you can certainly do a lot more, like plotting 3D graphs, histograms and scatter plots, finding mathematical relations between different columns of the dataset, detecting outliers, etc. I advise you to take this data and analyse and visualize more cool stuff with it.

That’s it for now. Don’t forget to leave some claps if you enjoyed the blog. If you have any queries, you can reach out to me on LinkedIn.

Link to the code and dataset: click here.

Some great free resources for beginners to learn data analysis and visualization in Python:

  1. https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_
  2. https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS
  3. https://seaborn.pydata.org/tutorial.html
  4. https://www.youtube.com/playlist?list=PLQVvvaa0QuDc-3szzjeP6N6b0aDrrKyL-
  5. https://github.com/guipsamora/pandas_exercises
