A Deep Dive Into Exploratory Data Analysis: Pokemon Dataset

Introduction

Exploratory data analysis (EDA) is one of the most important and useful parts of data science and machine learning. It is a powerful way to understand a dataset before modeling it, mainly through visualization techniques.

In this article, I cover some of the most important visualization techniques for understanding data better.

I have picked the Pokemon dataset, one of the most interesting datasets around for anyone who grew up in the 90s. Don’t worry if you are not familiar with Pokemon; the dataset is easy to understand.

So here we go!

Getting Ready

There are many visualization libraries available in Python today that produce excellent graphs and plots. Some of them are:

  • Matplotlib: Python’s 2D plotting library
  • Seaborn: Provides a high-level interface
  • Plotly: Creates interactive plots
  • Bokeh: It’s a highly interactive library
  • ggplot: Based on R’s ggplot and the Grammar of Graphics

In this article, we are going to explore some of the most useful visualization techniques using Matplotlib and Seaborn. Visualizing the data is the first step; after that, we will look at some important data manipulation techniques. I’ll be using pandas and NumPy frequently.

Installation

#install Matplotlib
pip install matplotlib
#install Seaborn
pip install seaborn

Dataset

As mentioned above, this is the Pokemon dataset, which contains 13 attributes for 800 different Pokemon.

Link to Dataset-> https://www.kaggle.com/abcsds/pokemon

We will try to see whether the type of a Pokemon can be predicted from its other attributes.

Getting started

Before getting started a few things must be kept in mind:

  • Check for Missing Values
  • Check for Class Imbalance
  • Check for Redundant Attributes
  • Check for the distribution of various attributes
  • Check for Outliers

Importing Data

The dataset is imported from a CSV file using pandas. The shape of the resulting DataFrame gives the size of the data as (rows, columns).
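As a minimal sketch of this step, the snippet below reads a tiny in-memory stand-in for the CSV so that it runs without downloading anything. The three sample rows and the file name Pokemon.csv mentioned in the comment are assumptions for illustration; the column names follow the Kaggle file.

```python
import io

import pandas as pd

# Tiny stand-in for the Kaggle CSV (assumed column layout, illustrative rows).
sample_csv = io.StringIO(
    "#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary\n"
    "1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False\n"
    "3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False\n"
    "4,Charmander,Fire,,309,39,52,43,60,50,65,1,False\n"
)

# With the downloaded file this would be: daily_Data = pd.read_csv("Pokemon.csv")
daily_Data = pd.read_csv(sample_csv)

print(daily_Data.shape)  # (rows, columns)
print(daily_Data.head())
```

On the full dataset the same two calls report 800 rows and 13 columns.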

Correcting spelling mistakes

As you can see, the names of the Mega Pokemon are not correct, so we correct the names of the Mega Evolved Pokemon. [VenusaurMega Venusaur -> Mega Venusaur]

Now the names of the Mega Evolved Pokemon are correct.
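One possible way to do this correction is a regular-expression replace that drops everything in front of the word "Mega". This is a sketch on a few sample names, not necessarily the exact expression used in the original notebook; the pattern requires a space after "Mega" so that a name such as Meganium is left alone.

```python
import pandas as pd

# Sample names only; on the real data this would be daily_Data["Name"].
names = pd.Series(
    ["Bulbasaur", "VenusaurMega Venusaur", "CharizardMega Charizard X", "Meganium"]
)

# Keep only the part starting at "Mega " when some text precedes it.
fixed = names.str.replace(r"^.+(Mega .+)$", r"\1", regex=True)

print(fixed.tolist())
```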

To check the Missing Values:
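A minimal sketch of the check, using a four-row toy DataFrame (the rows are illustrative; the column names are assumed to match the Kaggle file):

```python
import numpy as np
import pandas as pd

# Toy data: two of the four Pokemon have no second type.
daily_Data = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Pidgey"],
    "Type 1": ["Grass", "Fire", "Water", "Normal"],
    "Type 2": ["Poison", np.nan, np.nan, "Flying"],
})

# Number of missing values in each column.
missing_per_column = daily_Data.isnull().sum()
print(missing_per_column)
```

Running the same call on the full dataset gives the per-column counts discussed next.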

We can see that the Type2 attribute has 386 missing values.

Missing values are a great pain if not dealt with. The missingno module is one of the best ways to visualize the missing values present in a dataset. You can also use a seaborn heatmap to plot a graph depicting all the missing values.

Read the article “Visualizing the patterns of missing value occurrence with Python” on visualizing missing values using these two approaches.

What to do next after detecting the missing value?

We can do the following things once we have found the missing values:

  1. Remove Rows With Missing Values.
  2. Choose a Global Value to replace all the missing values from the attribute.
  3. Use the Mean of the attribute and replace all the missing values with this mean.
  4. Use the median of the attribute and replace all the missing values with this median.
  5. Use a predictive model to estimate the value to be filled in.
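The first four options can be sketched with pandas as follows. The column values and the global value 0 are illustrative assumptions; option 5 needs a fitted model and is left out.

```python
import numpy as np
import pandas as pd

# Toy numeric column with two gaps.
df = pd.DataFrame({"HP": [45.0, np.nan, 44.0, 80.0, np.nan]})

# 1. Remove rows with missing values.
dropped = df.dropna(subset=["HP"])

# 2. Replace every missing value with one global value (0 is an arbitrary choice).
constant_filled = df["HP"].fillna(0)

# 3. Replace missing values with the mean of the attribute.
mean_filled = df["HP"].fillna(df["HP"].mean())

# 4. Replace missing values with the median of the attribute.
median_filled = df["HP"].fillna(df["HP"].median())

print(len(dropped), mean_filled.tolist(), median_filled.tolist())
```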

For more on filling in missing values, see the article “How to Handle Missing Data”.

Taken from the article “How to Handle Missing Data”

Here, since the Type2 attribute has 386 missing values, I have simply copied the Type1 value into Type2. Some Pokemon have only one type, and even when there are two, the first type is generally the stronger one, so this works for the missing values.

For other datasets, any of the five options above can be used to fill in the missing values.

Now there are no missing values in the dataset.
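A minimal sketch of this fill, again on toy rows with the column names assumed to be "Type 1" and "Type 2" as in the Kaggle file:

```python
import numpy as np
import pandas as pd

daily_Data = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Pidgey"],
    "Type 1": ["Grass", "Fire", "Water", "Normal"],
    "Type 2": ["Poison", np.nan, np.nan, "Flying"],
})

# Where Type 2 is missing, reuse the Pokemon's own Type 1 (aligned row by row).
daily_Data["Type 2"] = daily_Data["Type 2"].fillna(daily_Data["Type 1"])

print(daily_Data)
print(daily_Data.isnull().sum().sum())  # total number of missing values left
```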

Check for class Imbalance

There are many ways to check for class imbalance. I prefer to use either a histogram or a pie chart of all the classes present in the target attribute.

Gives all the unique values of the attribute
Gives counts of all the values in the attribute
Plotting the Pie chart using the above data
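These three steps can be sketched as follows. The six sample values are an assumption for illustration; on the real data the Series would be the Type1 column of the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy target attribute.
type1 = pd.Series(["Water", "Water", "Water", "Fire", "Fire", "Grass"], name="Type 1")

classes = type1.unique()       # all the unique values of the attribute
counts = type1.value_counts()  # how often each value occurs, most frequent first

# Pie chart of the class counts; autopct labels each wedge with its percentage.
fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(counts.values, labels=counts.index.tolist(), autopct="%1.1f%%")
ax.set_title("Share of each Type 1 class")

print(classes)
print(counts)
```

Call plt.show() afterwards to display the figure in an interactive session.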

A pie chart is one way to represent the different classes, but a histogram can be used as well.

For the pie chart, the count of each class must be computed first and passed to plt.pie, which can then show the percentage of each class.

Check for Redundant Attributes:

Selecting the features that best represent the dataset is one of the most important steps before feeding the data into any kind of machine learning model.

There are various ways of selecting the best features. Some of them are:

  1. Univariate Selection
  2. Feature Importance
  3. Correlation Matrix with Heatmap

You can visit this article “Feature Selection Techniques in Machine Learning with Python” for more information.

I have used a heat map, which shows the correlation between different attributes. We check the heat map for highly correlated attributes and remove them.

On the scale, 0 means least correlated and 1 means most correlated.
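A minimal sketch of the heat map step, assuming a handful of numeric columns. The six sample rows and the 0.9 cut-off for "highly correlated" are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy numeric attributes.
stats = pd.DataFrame({
    "HP": [45, 60, 80, 39, 58, 78],
    "Attack": [49, 62, 82, 52, 64, 84],
    "Defense": [49, 63, 83, 43, 58, 78],
    "Speed": [45, 60, 80, 65, 80, 100],
})

# Pairwise Pearson correlation between the attributes.
corr = stats.corr()

# Annotated heat map of the correlation matrix.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", ax=ax)

# List attribute pairs whose absolute correlation exceeds the assumed threshold.
threshold = 0.9
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Listing the pairs explicitly is a convenience on top of the plot; with many attributes it is easier than reading every cell of the heat map.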

For now, we remove the categorical features in order to analyze the effect of the non-categorical attributes on Type1 and Type2.

new_Data is a copy of the daily_Data dataset with ‘Generation’ and ‘Legendary’ removed for now.
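A sketch of this step on toy rows; drop returns a new DataFrame, so daily_Data itself is left unchanged:

```python
import pandas as pd

daily_Data = pd.DataFrame({
    "Name": ["Bulbasaur", "Mewtwo"],
    "Total": [318, 680],
    "Generation": [1, 1],
    "Legendary": [False, True],
})

# Copy of the data without the two categorical attributes.
new_Data = daily_Data.drop(columns=["Generation", "Legendary"])

print(new_Data.columns.tolist())
```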

Check for Outliers

Box plots are a great way to depict the attributes: they show the quartile ranges and the outliers.

BoxPlots:

  • Median (Q2/50th Percentile): the middle value of the dataset.
  • First quartile (Q1/25th Percentile): the middle number between the smallest number and the median of the dataset.
  • Third quartile (Q3/75th Percentile): the middle value between the median and the highest value of the dataset.
  • Interquartile range (IQR): 25th to the 75th percentile.
  • Whiskers (shown in blue)
  • Outliers (shown as green circles)
  • “maximum”: Q3 + 1.5*IQR
  • “minimum”: Q1 - 1.5*IQR
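The same quantities can be computed directly, which makes the box plot's outlier rule concrete. The nine sample values below are an assumption for illustration.

```python
import pandas as pd

# Toy attribute with one unusually large value.
attack = pd.Series([49, 62, 82, 52, 64, 84, 48, 63, 190], name="Attack")

q1 = attack.quantile(0.25)  # first quartile
q3 = attack.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

lower_fence = q1 - 1.5 * iqr  # the "minimum" of the box plot
upper_fence = q3 + 1.5 * iqr  # the "maximum" of the box plot

# Points outside the fences are the ones a box plot draws as outliers.
outliers = attack[(attack < lower_fence) | (attack > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence)
print(outliers.tolist())
```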

Refer to this article “Understanding Boxplots” to know more about boxplots.

Visualizing different attributes

Strip plots, box plots, bar plots, swarm plots, factor plots, and radar charts are all great ways to visualize the various attributes in the dataset.

The function takes Type1 or Type2 (the target) and draws a strip plot and a box plot for the attribute selected in the function. In addition, a box plot is drawn using the mean of that attribute grouped by Type1 or Type2 (the target).

The function mean_attribute can be called for each of the other, non-categorical attributes together with Type1 or Type2 (the target).

Checking the categorical attributes: the effect of the ‘Generation’ and ‘Legendary’ attributes on the dataset.
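The original body of mean_attribute is not shown here, so the following is only a hypothetical sketch of such a function. The argument names, the figure layout, and the choice to draw the per-group mean as a bar plot are all assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def mean_attribute(data, target, attribute):
    """Hypothetical sketch: plot `attribute` against the `target` column."""
    fig, axes = plt.subplots(3, 1, figsize=(10, 12))

    # Every observation, grouped by the target class.
    sns.stripplot(data=data, x=target, y=attribute, ax=axes[0])

    # Quartiles and outliers per target class.
    sns.boxplot(data=data, x=target, y=attribute, ax=axes[1])

    # Mean of the attribute per target class (drawn as bars: an assumption).
    means = data.groupby(target)[attribute].mean().sort_values(ascending=False)
    sns.barplot(x=means.index.tolist(), y=means.values.tolist(), ax=axes[2])
    axes[2].set_ylabel(f"Mean {attribute}")

    fig.tight_layout()
    return means


# Usage on toy rows.
toy = pd.DataFrame({
    "Type 1": ["Fire", "Fire", "Water", "Water"],
    "Attack": [52, 64, 48, 63],
})
means = mean_attribute(toy, "Type 1", "Attack")
print(means)
```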

Swarm plots

A swarm plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

Further, we have to see how the number of Pokemon of each type depends on the generation:
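As a numeric complement to the plot, the counts per type and generation can be tabulated with a cross-tabulation. The seven sample rows are an assumption for illustration.

```python
import pandas as pd

# Toy rows: one Pokemon per row.
df = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water", "Water", "Fire", "Grass", "Water"],
    "Generation": [1, 1, 1, 2, 2, 3, 3],
})

# Rows are types, columns are generations, cells are Pokemon counts.
type_by_generation = pd.crosstab(df["Type 1"], df["Generation"])
print(type_by_generation)
```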

We have seen how the Types of Pokemon are related to Generation and how each generation of Pokemon contributes to different attributes in a swarm plot.

Now we will see the effect of Legendary on the different attributes. I have used factor plots (seaborn’s factorplot, renamed catplot in newer releases) to clearly differentiate between Legendary and Non-Legendary Pokemon.

Swarm kind of factor plot
Bar kind of factor plot
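Both kinds can be sketched with seaborn’s catplot, which replaced factorplot in newer seaborn releases. The six sample rows and the choice of Generation on the x-axis are assumptions for illustration.

```python
import pandas as pd
import seaborn as sns

# Toy rows with one Legendary Pokemon per generation.
df = pd.DataFrame({
    "Generation": [1, 1, 1, 2, 2, 2],
    "Legendary": [False, False, True, False, False, True],
    "Total": [318, 405, 680, 320, 410, 680],
})

# Swarm kind: one point per Pokemon, colored by Legendary.
swarm_grid = sns.catplot(data=df, x="Generation", y="Total", hue="Legendary", kind="swarm")

# Bar kind: mean Total per generation, split by Legendary.
bar_grid = sns.catplot(data=df, x="Generation", y="Total", hue="Legendary", kind="bar")

# The same comparison as plain numbers.
mean_total = df.groupby("Legendary")["Total"].mean()
print(mean_total)
```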

We can see from the above plot that Legendary Pokemon (orange) have a higher Total than Non-Legendary Pokemon (blue).

Comparison of Different Pokemon

  • Bar-plot for each attribute for comparison of two different Pokemon
  • Radar chart for comparison of two different Pokemon
# Code for Radar Chart
# Assumes daily_Data is indexed by Pokemon name, and that pokemon1, pokemon2
# (names) and color1, color2 (matplotlib colors) are defined beforehand.

# Libraries
import matplotlib.pyplot as plt
import pandas as pd
from math import pi

pok1 = daily_Data.loc[pokemon1]
pok2 = daily_Data.loc[pokemon2]

fig = plt.figure(figsize=(10, 10))

# Set data
categories = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
df = pd.DataFrame({
    'group': ['A', 'B'],
    'HP': [pok1["HP"], pok2["HP"]],
    'Attack': [pok1["Attack"], pok2["Attack"]],
    'Defense': [pok1["Defense"], pok2["Defense"]],
    'Sp. Atk': [pok1["Sp. Atk"], pok2["Sp. Atk"]],
    'Sp. Def': [pok1["Sp. Def"], pok2["Sp. Def"]],
    'Speed': [pok1["Speed"], pok2["Speed"]]
})

# Maximum among all plotted stats (HP included, so no value is cut off)
maximum = int(df[categories].max().max())

# ------- PART 1: Create background

# Number of variables
N = len(categories)

# Radial ticks: one every 10 points, from 0 up to the maximum
aa = list(range(0, maximum + 10, 10))
aaa = [str(tick) for tick in aa]

# Angle of each axis in the plot (the full circle divided by the number of variables)
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]  # repeat the first angle to close the polygon

# Initialise the spider plot
ax = plt.subplot(111, polar=True)

# Put the first axis on top and go clockwise
ax.set_theta_offset(pi / 2)
ax.set_theta_direction(-1)

# Draw one axis per variable and add the labels
plt.xticks(angles[:-1], categories, size=15, color="black")

# Draw the radial labels
ax.set_rlabel_position(0)
plt.yticks(aa, aaa, color="black", size=15)
plt.ylim(0, maximum)

# ------- PART 2: Add plots

# Plot each individual = each row of the data
# Ind1
values = df.loc[0].drop('group').values.flatten().tolist()
values += values[:1]
ax.plot(angles, values, linewidth=4, linestyle='solid', label=pokemon1, color=color1)
ax.fill(angles, values, color=color1, alpha=0.1)

# Ind2
values = df.loc[1].drop('group').values.flatten().tolist()
values += values[:1]
ax.plot(angles, values, linewidth=4, linestyle='solid', label=pokemon2, color=color2)
ax.fill(angles, values, color=color2, alpha=0.1)

# Add legend
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
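The bar-plot comparison from the list above can be sketched in a similar way. This is a minimal example under stated assumptions: the two-row DataFrame stands in for daily_Data indexed by name, and the chosen Pokemon and figure size are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Stand-in for daily_Data indexed by Pokemon name (illustrative rows).
daily_Data = pd.DataFrame(
    {
        "HP": [39, 44],
        "Attack": [52, 48],
        "Defense": [43, 65],
        "Sp. Atk": [60, 50],
        "Sp. Def": [50, 64],
        "Speed": [65, 43],
    },
    index=pd.Index(["Charmander", "Squirtle"], name="Name"),
)

pokemon1, pokemon2 = "Charmander", "Squirtle"

# One row per attribute, one column per Pokemon.
comparison = daily_Data.loc[[pokemon1, pokemon2], stats].T

# Grouped bars: for each attribute, one bar per Pokemon side by side.
ax = comparison.plot.bar(figsize=(10, 6), rot=0)
ax.set_ylabel("Value")
ax.set_title(f"{pokemon1} vs {pokemon2}")
```

Transposing the selection puts the attributes on the x-axis, so each pair of bars compares one attribute at a time.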

The code for the entire article can be found at https://www.kaggle.com/ayush01/complete-pokemon-stats-visualization


Ayush Gupta

I am working as a Senior Data Scientist at Aplazo, India, trying to solve real-world problems with AI. I previously worked at IBM Research and Swiggy Research.