Video Games Sales Analysis

8 min readOct 13, 2020

An attempt to extract useful information from the Video Games Sales dataset.

Introduction

In this project, I will use the video games sales data for analysis. I will try to gain useful insights from this dataset and improve my skills while doing so. This analysis is part of the Zero to Pandas course offered by Jovian ML and FreecodeCamp.

Downloading the Dataset

The dataset can be obtained from this Kaggle link: https://www.kaggle.com/gregorut/videogamesales/notebooks This dataset contains a list of video games with sales greater than 100,000 copies. It has sales records for North America, Europe, Japan, and the rest of the world in millions. Though we can conclude that some countries must have been excluded but it is not clear.

Let’s begin by downloading the data, and listing the files within the dataset.

dataset_url = 'https://www.kaggle.com/gregorut/videogamesales/notebooks'import opendatasets as od
od.download(dataset_url)Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: aditya7jetely
Your Kaggle Key: ········


100%|██████████| 381k/381k [00:00<00:00, 46.9MB/s]

Downloading videogamesales.zip to ./videogamesales

The dataset has been downloaded and extracted.

data_dir = './videogamesales'import os
os.listdir(data_dir)['vgsales.csv']

Data Preparation and Cleaning

This is the main step before the data analysis portion. In this step I will try to make the dataset ready for analysis, i.e. I would choose useful columns, handle missing or incorrect values and would perform parsing if necessary.

import pandas as pd

Loading the dataset into a pandas data frame.

video_sales_df = pd.read_csv('./videogamesales/vgsales.csv')
video_sales_df.head() # Having a look at the first five rows of the dataframe

Evaluating the shape of our data frame.

video_sales_df.shape(16598, 11)

We can see that our dataset has 16598 rows and 11 columns

List of columns in our data frame.

video_sales_df.columnsIndex(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
      dtype='object')

As the number of columns in our dataset is not very high we can use all the columns to some extend.

Now let's see some basic information about our dataset.

video_sales_df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB

As we can observe in the above result that the year column has the data type as a float which should have been datetime, so we must convert it into datetime by the help of pandas functions.

video_sales_df['Year'] =  pd.to_datetime(video_sales_df['Year'], format='%Y', errors='coerce')

Now we can confirm the data type of Year column.

video_sales_df.dtypesRank                     int64
Name                    object
Platform                object
Year            datetime64[ns]
Genre                   object
Publisher               object
NA_Sales               float64
EU_Sales               float64
JP_Sales               float64
Other_Sales            float64
Global_Sales           float64
dtype: object

We can see that the data type of our Year column is now datetime.

Now let us have some stats about the numeric columns of our data set.

video_sales_df.describe()

As we see that our Rank column is also numeric but it is not meant to perform statistics on it so we must remove it from the statistics info.

video_sales_df_copy = video_sales_df.copy() # A copy dataframe to remove Rank column from statistical descriptionvideo_sales_df_copy.drop(columns = 'Rank', inplace = True) # droping the rank columnvideo_sales_df_copy.describe() # Now we dont have the Rank column in the statistical info

There doesn’t seem to be any problem with our numeric columns as of now. It is to be noted that all the sales are in millions.

Exploratory Analysis and Visualization

Before we can ask interesting questions about the dataset it would help to understand what the different columns of our dataset look like.

Let’s begin by importingmatplotlib.pyplot and seaborn.

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Genre

Let's explore the genre column of our data frame to see what is the majority of games whose sales record are present.

video_sales_df.Genre.value_counts()Action          3316
Sports          2346
Misc            1739
Role-Playing    1488
Shooter         1310
Adventure       1286
Racing          1249
Platform         886
Simulation       867
Fighting         848
Strategy         681
Puzzle           582
Name: Genre, dtype: int64plt.pie(video_sales_df.Genre.value_counts(), labels=video_sales_df.Genre.value_counts().index, autopct='%1.0f%%', pctdistance=1.1, labeldistance=1.2)
plt.title('Video Game Genre Share')
plt.ylabel(' ')
plt.show()

Action games have the highest count in the dataset will the Puzzle games have the least count. It could be due to the fact that many people are interested in video games that involve action in them as many consider puzzle-based games to be boring.

Platform

Let's analyse the number of games based on the platform.

video_sales_df.Platform.value_counts()DS      2163
PS2     2161
PS3     1329
Wii     1325
X360    1265
PSP     1213
PS      1196
PC       960
XB       824
GBA      822
GC       556
3DS      509
PSV      413
PS4      336
N64      319
SNES     239
XOne     213
SAT      173
WiiU     143
2600     133
NES       98
GB        98
DC        52
GEN       27
NG        12
WS         6
SCD        6
3DO        3
TG16       2
GG         1
PCFX       1
Name: Platform, dtype: int64plt.figure(figsize = (15,9))
plt.title('Number of different games per game platforms')
sns.countplot(y='Platform', data=video_sales_df)
plt.show()

The most games are available for the DS and PS2 platforms which might mean that DS and PS2 are used by more customers But we can't be sure about that.

Publisher

Let's analyze the Publisher column to see which company publishes the what number of games. Let's consider publishers with game counts above 100 only.

plt.figure(figsize=(15,12))
publishers = video_sales_df.Publisher.value_counts()
publishers = publishers[publishers.values > 100]
publishers.plot(kind = 'barh')
plt.xlabel('Count')
plt.ylabel('Publishing Company')
plt.title('Publishing companies with more than 100 games')
plt.show()

We see that Electronics Arts is the undisputed leader in terms of the video games sale count. One of the reason may be the popularity of its sports-based games like FIFA and Cricket

Year

video_sales_df['Year'].value_counts().sort_values(ascending=False)
video_sales_df['year_only'] = video_sales_df['Year'].dt.year
video_sales_df['year_only'] = video_sales_df['year_only'].astype(int, errors='ignore')plt.figure(figsize = (15,9))
plt.title('Number of different games per game platforms')
sns.countplot(y='year_only', data=video_sales_df)
plt.show()

We can see that the sales were the highest in the years 2006 to 2011 and after that, we can see a downfall of the total sales.

Sales

Let's analyse the numeric sales column for North America, Europe, Japan, and the Rest of the World.

fig, axes = plt.subplots(2,2, figsize= (18, 14))
axes[0, 0].scatter(video_sales_df.year_only, video_sales_df.NA_Sales, color = 'b')
axes[0, 0].set_title('North America Video Game Sales')
axes[0, 0].set_xlabel('Year')
axes[0, 0].set_ylabel('Sales in Million')
axes[0, 1].scatter(video_sales_df.year_only, video_sales_df.EU_Sales, color = 'r')
axes[0, 1].set_title('Europe Video Game Sales')
axes[0, 1].set_xlabel('Year')
axes[0, 1].set_ylabel('Sales in Million')
axes[1, 0].scatter(video_sales_df.year_only, video_sales_df.JP_Sales, color = 'g')
axes[1, 0].set_title('Japan Video Game Sales')
axes[1, 0].set_xlabel('Year')
axes[1, 0].set_ylabel('Sales in Million')
axes[1, 1].scatter(video_sales_df.year_only, video_sales_df.Other_Sales)
axes[1, 1].set_title('Other county Video Game Sales')
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_ylabel('Sales in Million')
plt.show()

In the above visualization, we saw the video game sales in the four different regions, the outliers in all the four regions are very few. We can see that sales have a higher number in the North American region.

Asking and Answering Questions

We’ve already gained several insights about the dataset and video game sales in different regions, let us further improve our insights by asking some interesting questions.

Q1: What are the top five game genres?

video_sales_df.Genre.value_counts().head(5)Action          3316
Sports          2346
Misc            1739
Role-Playing    1488
Shooter         1310
Name: Genre, dtype: int64video_sales_df.Genre.value_counts().head(5).plot(kind = 'bar', color = 'g')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.title('Top 5 game genre')
plt.show()

Action and Sports are the top-grossing genres of video games followed by miscellaneous, role-playing and shooting games. This also shows why Electronics Art has a large share of sales because it mainly makes sports-based games.

Q2: Which region has the largest share in the global video games sales?

Total_Video_Games_Sales = video_sales_df.Global_Sales.sum()
North_America_Sales_percent = (video_sales_df.NA_Sales.sum()*100)/Total_Video_Games_Sales
Europe_Sales_percent = (video_sales_df.EU_Sales.sum()*100)/Total_Video_Games_Sales
Japan_Sales_percent = (video_sales_df.JP_Sales.sum()*100)/Total_Video_Games_Sales
Other_Region_Sales_percent = (video_sales_df.Other_Sales.sum()*100)/Total_Video_Games_Salesplt.figure(figsize=(16,9))
plt.pie([North_America_Sales_percent, Europe_Sales_percent, Japan_Sales_percent, Other_Region_Sales_percent], labels=['North America', 'Europe', 'Japan', 'Other Regions'], autopct='%1.0f%%', pctdistance=1.1, labeldistance=1.2)
plt.title('Video Game Genre Share')
plt.ylabel(' ')
plt.show()

Through this pie chart, we see that the major shareholder is North America. Though this result can be a bit biased because of the nature of the survey in which more importance might have been given to the North American area.

Q3: What are the top 5 highest-grossing games in North America?

video_sales_NA_sorted_df = video_sales_df.sort_values(by = ['NA_Sales'], ascending = False)
video_sales_NA_sorted_df.head()[['Name', 'NA_Sales']]

The top-grossing game in North America is Wii Sports with a total sales of 41.49 million in North America followed by Super Mario Bros., Duck Hunt, Tetris and Mario Kart Wii.

Q4: Which game has the highest sales in the other region(Excluding North America, Europe and Japan)?

video_sales_Other_sorted_df = video_sales_df.sort_values(by = ['Other_Sales'], ascending = False)
video_sales_NA_sorted_df.head(1)[['Name', 'Other_Sales']]

GTA: San Andreas has the highest sales in the other regions which are 10.57 million.

Q5: Which game has the highest difference in their sales in North America and Europe?

video_sales_df['NA_EU_Sales_Diff'] = abs(video_sales_df['NA_Sales'] - video_sales_df['EU_Sales'])
video_sales_df.head()

video_sales_df.sort_values(by = ['NA_EU_Sales_Diff'], ascending = False).head(1)[['Name', 'NA_EU_Sales_Diff']]

Duck Hunt has the highest difference in their sales when the comparison is made between North America and Europe.

Inferences and Conclusion

Based on this dataset we cant say that equal importance was given to all the regions of the world as it might be the case that the data was collected mostly from North American stores.
Games in the Action genre tend to fetch the most revenue followed by Sports based games.
Electronic Arts are the top-grossing publisher clearly because of their supper popular sports-based games.
GTA: San Andreas is the top-grossing game in the other Regions of the World.
Top-grossing game in North America is Wii Sports.
Duck Hunt has the highest difference in the sales made in Europe and North America which shows it is more popular in North America when compared with Europe.
These stats can be a bit biased based on the nature of data collection.

References and Future Work

This dataset has a lot of information about the sales trend of video games which can be further exploited to gain useful insights.

A region-based analysis can be performed in-depth to compare video games sales trend

References:

Stack Overflow Developer Survey: https://insights.stackoverflow.com/survey
Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
Matplotlib user guide: https://matplotlib.org/3.3.1/users/index.html
Seaborn user guide & tutorial: https://seaborn.pydata.org/tutorial.html
open datasets Python library: https://github.com/JovianML/opendatasets