Video Games Sales Analysis

Aditya Jetely
8 min readOct 13, 2020

--

An attempt to extract useful information from the Video Games Sales dataset.

Photo by Kelly Sikkema on Unsplash

Introduction

In this project, I will use the video games sales data for analysis. I will try to gain useful insights from this dataset and improve my skills while doing so. This analysis is part of the Zero to Pandas course offered by Jovian ML and FreecodeCamp.

Downloading the Dataset

The dataset can be obtained from this Kaggle link: https://www.kaggle.com/gregorut/videogamesales/notebooks This dataset contains a list of video games with sales greater than 100,000 copies. It has sales records for North America, Europe, Japan, and the rest of the world in millions. Though we can conclude that some countries must have been excluded but it is not clear.

Let’s begin by downloading the data, and listing the files within the dataset.

dataset_url = 'https://www.kaggle.com/gregorut/videogamesales/notebooks'import opendatasets as od
od.download(dataset_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: aditya7jetely
Your Kaggle Key: ········


100%|██████████| 381k/381k [00:00<00:00, 46.9MB/s]

Downloading videogamesales.zip to ./videogamesales

The dataset has been downloaded and extracted.

data_dir = './videogamesales'import os
os.listdir(data_dir)
['vgsales.csv']

Data Preparation and Cleaning

This is the main step before the data analysis portion. In this step I will try to make the dataset ready for analysis, i.e. I would choose useful columns, handle missing or incorrect values and would perform parsing if necessary.

import pandas as pd

Loading the dataset into a pandas data frame.

video_sales_df = pd.read_csv('./videogamesales/vgsales.csv')
video_sales_df.head() # Having a look at the first five rows of the dataframe
png

Evaluating the shape of our data frame.

video_sales_df.shape(16598, 11)

We can see that our dataset has 16598 rows and 11 columns

List of columns in our data frame.

video_sales_df.columnsIndex(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
dtype='object')

As the number of columns in our dataset is not very high we can use all the columns to some extend.

Now let's see some basic information about our dataset.

video_sales_df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 16598 non-null int64
1 Name 16598 non-null object
2 Platform 16598 non-null object
3 Year 16327 non-null float64
4 Genre 16598 non-null object
5 Publisher 16540 non-null object
6 NA_Sales 16598 non-null float64
7 EU_Sales 16598 non-null float64
8 JP_Sales 16598 non-null float64
9 Other_Sales 16598 non-null float64
10 Global_Sales 16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB

As we can observe in the above result that the year column has the data type as a float which should have been datetime, so we must convert it into datetime by the help of pandas functions.

video_sales_df['Year'] =  pd.to_datetime(video_sales_df['Year'], format='%Y', errors='coerce')

Now we can confirm the data type of Year column.

video_sales_df.dtypesRank                     int64
Name object
Platform object
Year datetime64[ns]
Genre object
Publisher object
NA_Sales float64
EU_Sales float64
JP_Sales float64
Other_Sales float64
Global_Sales float64
dtype: object

We can see that the data type of our Year column is now datetime.

Now let us have some stats about the numeric columns of our data set.

video_sales_df.describe()
png

As we see that our Rank column is also numeric but it is not meant to perform statistics on it so we must remove it from the statistics info.

video_sales_df_copy = video_sales_df.copy() # A copy dataframe to remove Rank column from statistical descriptionvideo_sales_df_copy.drop(columns = 'Rank', inplace = True) # droping the rank columnvideo_sales_df_copy.describe() # Now we dont have the Rank column in the statistical info
png

There doesn’t seem to be any problem with our numeric columns as of now. It is to be noted that all the sales are in millions.

Exploratory Analysis and Visualization

Before we can ask interesting questions about the dataset it would help to understand what the different columns of our dataset look like.

Let’s begin by importingmatplotlib.pyplot and seaborn.

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Genre

Let's explore the genre column of our data frame to see what is the majority of games whose sales record are present.

video_sales_df.Genre.value_counts()Action          3316
Sports 2346
Misc 1739
Role-Playing 1488
Shooter 1310
Adventure 1286
Racing 1249
Platform 886
Simulation 867
Fighting 848
Strategy 681
Puzzle 582
Name: Genre, dtype: int64
plt.pie(video_sales_df.Genre.value_counts(), labels=video_sales_df.Genre.value_counts().index, autopct='%1.0f%%', pctdistance=1.1, labeldistance=1.2)
plt.title('Video Game Genre Share')
plt.ylabel(' ')
plt.show()
png

Action games have the highest count in the dataset will the Puzzle games have the least count. It could be due to the fact that many people are interested in video games that involve action in them as many consider puzzle-based games to be boring.

Platform

Let's analyse the number of games based on the platform.

video_sales_df.Platform.value_counts()DS      2163
PS2 2161
PS3 1329
Wii 1325
X360 1265
PSP 1213
PS 1196
PC 960
XB 824
GBA 822
GC 556
3DS 509
PSV 413
PS4 336
N64 319
SNES 239
XOne 213
SAT 173
WiiU 143
2600 133
NES 98
GB 98
DC 52
GEN 27
NG 12
WS 6
SCD 6
3DO 3
TG16 2
GG 1
PCFX 1
Name: Platform, dtype: int64
plt.figure(figsize = (15,9))
plt.title('Number of different games per game platforms')
sns.countplot(y='Platform', data=video_sales_df)
plt.show()
png

The most games are available for the DS and PS2 platforms which might mean that DS and PS2 are used by more customers But we can't be sure about that.

Publisher

Let's analyze the Publisher column to see which company publishes the what number of games. Let's consider publishers with game counts above 100 only.

plt.figure(figsize=(15,12))
publishers = video_sales_df.Publisher.value_counts()
publishers = publishers[publishers.values > 100]
publishers.plot(kind = 'barh')
plt.xlabel('Count')
plt.ylabel('Publishing Company')
plt.title('Publishing companies with more than 100 games')
plt.show()
png

We see that Electronics Arts is the undisputed leader in terms of the video games sale count. One of the reason may be the popularity of its sports-based games like FIFA and Cricket

Year

video_sales_df['Year'].value_counts().sort_values(ascending=False)
video_sales_df['year_only'] = video_sales_df['Year'].dt.year
video_sales_df['year_only'] = video_sales_df['year_only'].astype(int, errors='ignore')
plt.figure(figsize = (15,9))
plt.title('Number of different games per game platforms')
sns.countplot(y='year_only', data=video_sales_df)
plt.show()
png

We can see that the sales were the highest in the years 2006 to 2011 and after that, we can see a downfall of the total sales.

Sales

Let's analyse the numeric sales column for North America, Europe, Japan, and the Rest of the World.

fig, axes = plt.subplots(2,2, figsize= (18, 14))
axes[0, 0].scatter(video_sales_df.year_only, video_sales_df.NA_Sales, color = 'b')
axes[0, 0].set_title('North America Video Game Sales')
axes[0, 0].set_xlabel('Year')
axes[0, 0].set_ylabel('Sales in Million')
axes[0, 1].scatter(video_sales_df.year_only, video_sales_df.EU_Sales, color = 'r')
axes[0, 1].set_title('Europe Video Game Sales')
axes[0, 1].set_xlabel('Year')
axes[0, 1].set_ylabel('Sales in Million')
axes[1, 0].scatter(video_sales_df.year_only, video_sales_df.JP_Sales, color = 'g')
axes[1, 0].set_title('Japan Video Game Sales')
axes[1, 0].set_xlabel('Year')
axes[1, 0].set_ylabel('Sales in Million')
axes[1, 1].scatter(video_sales_df.year_only, video_sales_df.Other_Sales)
axes[1, 1].set_title('Other county Video Game Sales')
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_ylabel('Sales in Million')
plt.show()
png

In the above visualization, we saw the video game sales in the four different regions, the outliers in all the four regions are very few. We can see that sales have a higher number in the North American region.

Asking and Answering Questions

We’ve already gained several insights about the dataset and video game sales in different regions, let us further improve our insights by asking some interesting questions.

Q1: What are the top five game genres?

video_sales_df.Genre.value_counts().head(5)Action          3316
Sports 2346
Misc 1739
Role-Playing 1488
Shooter 1310
Name: Genre, dtype: int64
video_sales_df.Genre.value_counts().head(5).plot(kind = 'bar', color = 'g')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.title('Top 5 game genre')
plt.show()
png

Action and Sports are the top-grossing genres of video games followed by miscellaneous, role-playing and shooting games. This also shows why Electronics Art has a large share of sales because it mainly makes sports-based games.

Q2: Which region has the largest share in the global video games sales?

Total_Video_Games_Sales = video_sales_df.Global_Sales.sum()
North_America_Sales_percent = (video_sales_df.NA_Sales.sum()*100)/Total_Video_Games_Sales
Europe_Sales_percent = (video_sales_df.EU_Sales.sum()*100)/Total_Video_Games_Sales
Japan_Sales_percent = (video_sales_df.JP_Sales.sum()*100)/Total_Video_Games_Sales
Other_Region_Sales_percent = (video_sales_df.Other_Sales.sum()*100)/Total_Video_Games_Sales
plt.figure(figsize=(16,9))
plt.pie([North_America_Sales_percent, Europe_Sales_percent, Japan_Sales_percent, Other_Region_Sales_percent], labels=['North America', 'Europe', 'Japan', 'Other Regions'], autopct='%1.0f%%', pctdistance=1.1, labeldistance=1.2)
plt.title('Video Game Genre Share')
plt.ylabel(' ')
plt.show()
png

Through this pie chart, we see that the major shareholder is North America. Though this result can be a bit biased because of the nature of the survey in which more importance might have been given to the North American area.

Q3: What are the top 5 highest-grossing games in North America?

video_sales_NA_sorted_df = video_sales_df.sort_values(by = ['NA_Sales'], ascending = False)
video_sales_NA_sorted_df.head()[['Name', 'NA_Sales']]
png

The top-grossing game in North America is Wii Sports with a total sales of 41.49 million in North America followed by Super Mario Bros., Duck Hunt, Tetris and Mario Kart Wii.

Q4: Which game has the highest sales in the other region(Excluding North America, Europe and Japan)?

video_sales_Other_sorted_df = video_sales_df.sort_values(by = ['Other_Sales'], ascending = False)
video_sales_NA_sorted_df.head(1)[['Name', 'Other_Sales']]
png

GTA: San Andreas has the highest sales in the other regions which are 10.57 million.

Q5: Which game has the highest difference in their sales in North America and Europe?

video_sales_df['NA_EU_Sales_Diff'] = abs(video_sales_df['NA_Sales'] - video_sales_df['EU_Sales'])
video_sales_df.head()
png
video_sales_df.sort_values(by = ['NA_EU_Sales_Diff'], ascending = False).head(1)[['Name', 'NA_EU_Sales_Diff']]
png

Duck Hunt has the highest difference in their sales when the comparison is made between North America and Europe.

Inferences and Conclusion

  • Based on this dataset we cant say that equal importance was given to all the regions of the world as it might be the case that the data was collected mostly from North American stores.
  • Games in the Action genre tend to fetch the most revenue followed by Sports based games.
  • Electronic Arts are the top-grossing publisher clearly because of their supper popular sports-based games.
  • GTA: San Andreas is the top-grossing game in the other Regions of the World.
  • Top-grossing game in North America is Wii Sports.
  • Duck Hunt has the highest difference in the sales made in Europe and North America which shows it is more popular in North America when compared with Europe.
  • These stats can be a bit biased based on the nature of data collection.

References and Future Work

This dataset has a lot of information about the sales trend of video games which can be further exploited to gain useful insights.

  • A region-based analysis can be performed in-depth to compare video games sales trend

References:

--

--

Aditya Jetely

Final Year Electronics and Communication Engineering Student with a keen interest in data science and open source. https://www.linkedin.com/in/aditya-jetely