A Deep Dive Into Exploratory Data Analysis: Pokemon Dataset
Introduction
Exploratory data analysis (EDA) is one of the most important and useful aspects of Data Science and Machine Learning. It is a powerful approach to understanding data through summary statistics and various visualization techniques before any model is built.
In this article, I have tried to cover some of the most important visualization techniques that are required to understand the data in a better way.
I have chosen one of the most interesting datasets, Pokemon, for all those 90s people. Don’t worry if you are not familiar with Pokemon: the dataset is pretty easy to understand.
So here we go!
Getting Ready
There are many visualization libraries available in Python today that produce amazing graphs and plots. Some of these libraries are:
- Matplotlib: Python’s 2D plotting library
- Seaborn: Provides a high-level interface for statistical graphics, built on top of Matplotlib
- Plotly: Creates interactive plots
- Bokeh: It’s a highly interactive library
- ggplot: Based on R’s ggplot and the Grammar of Graphics
In this article, we are going to explore some of the most awesome visualization techniques using Matplotlib and Seaborn. Visualizing the data is the first step. After visualization, we are going to learn some of the important data manipulation techniques. I’ll be using pandas and NumPy frequently throughout.
Installation
# Install Matplotlib
pip install matplotlib
# Install Seaborn
pip install seaborn
Dataset
As mentioned above, the dataset describes Pokemon: it contains 13 attributes for 800 different Pokemon.
Link to dataset: https://www.kaggle.com/abcsds/pokemon
We will try to see whether the type of a Pokemon can be predicted from its other attributes.
Getting started
Before getting started a few things must be kept in mind:
- Check for Missing Values
- Check for Class Imbalance
- Check for Redundant Attributes
- Check for the distribution of various attributes
- Check for Outliers
Importing Data
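The loading step can be sketched roughly as follows. So that the snippet runs on its own, it reads a small hand-typed sample from memory instead of the downloaded file. The header is assumed to match the Kaggle CSV, and the variable name data and the path "Pokemon.csv" are placeholders of mine, not names taken from the original notebook.

```python
import io

import pandas as pd

# Hand-typed sample rows standing in for the Kaggle file. The header is
# assumed to match the real CSV (13 columns); only three rows are shown.
sample_csv = io.StringIO(
    "#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary\n"
    "1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False\n"
    "3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False\n"
    "4,Charmander,Fire,,309,39,52,43,60,50,65,1,False\n"
)

# With the downloaded file this would be pd.read_csv("Pokemon.csv"),
# where "Pokemon.csv" is a hypothetical local path.
data = pd.read_csv(sample_csv)

print(data.shape)   # (rows, columns)
print(data.dtypes)  # column types as inferred by pandas
print(data.head())  # first few rows
```

Looking at the shape, the inferred types, and the first rows is usually enough to spot problems such as the fused Mega names or the empty Type 2 cells early.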
Correcting spelling mistakes
The names of the Mega Pokemon are not stored correctly: the base name is fused in front of the Mega name. So we correct the names of all Mega-evolved Pokemon (for example, VenusaurMega Venusaur -> Mega Venusaur).
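One possible way to do this correction is with a regular expression that drops everything in front of the word "Mega". This is a minimal sketch, not necessarily how the original notebook does it. The helper name fix_mega_name and the sample names are illustrative.

```python
import re

import pandas as pd


def fix_mega_name(name):
    """Drop the base name fused in front of a 'Mega ...' name."""
    # The lookahead requires "Mega " (with a space), so ordinary names
    # that merely start with "Mega", such as Meganium, are left alone.
    return re.sub(r"^.*(?=Mega )", "", name)


names = pd.Series(
    ["VenusaurMega Venusaur", "CharizardMega Charizard X", "Bulbasaur", "Meganium"]
)
fixed = names.map(fix_mega_name)

print(fixed.tolist())
# ['Mega Venusaur', 'Mega Charizard X', 'Bulbasaur', 'Meganium']
```

On the full dataset the same helper would be mapped over the Name column.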
To check the Missing Values:
Missing values are a great pain if not dealt with. The missingno module is a great way to visualize the missing values present in a dataset, and a Seaborn heatmap can also be used to plot all the missing values.
Read the article “Visualizing the patterns of missing value occurrence with Python” for more on visualizing missing values using these two approaches.
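Before plotting anything, the missing values can also be counted directly with pandas. The sketch below uses a tiny hand-typed frame, and the column names are assumed to follow the Kaggle file. The boolean mask it builds is the same table a heatmap of missing values would display.

```python
import numpy as np
import pandas as pd

# Tiny hand-typed sample; np.nan marks a missing second type.
data = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Charizard"],
    "Type 1": ["Grass", "Fire", "Water", "Fire"],
    "Type 2": ["Poison", np.nan, np.nan, "Flying"],
    "HP": [45, 39, 44, 78],
})

missing_mask = data.isnull()               # True where a value is missing
missing_counts = missing_mask.sum()        # missing values per column
missing_share = missing_mask.mean() * 100  # same, as a percentage of rows

print(missing_counts)
print(missing_share)
```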
What to do next after detecting the missing value?
We can do the following things once we have found the missing values:
- Remove Rows With Missing Values.
- Choose a Global Value to replace all the missing values from the attribute.
- Use the Mean of the attribute and replace all the missing values with this mean.
- Use the median of the attribute and replace all the missing values with this median.
- Use a predictive model to estimate the values to be filled in.
For more on filling in missing values, see the article “How to Handle Missing Data”.
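The first four options above map onto one-line pandas calls, sketched here on a toy column. The numbers are made up for illustration and are not taken from the Pokemon dataset, which has no missing numeric values according to this article. The predictive-model option is not shown.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (illustrative values only).
toy = pd.DataFrame({"Attack": [50.0, 60.0, np.nan, 100.0, np.nan]})

# 1. Remove rows with missing values.
dropped = toy.dropna(subset=["Attack"])

# 2. Replace every missing value with one global value (0 is an arbitrary choice).
filled_global = toy["Attack"].fillna(0)

# 3. Replace missing values with the mean of the attribute.
filled_mean = toy["Attack"].fillna(toy["Attack"].mean())

# 4. Replace missing values with the median of the attribute.
filled_median = toy["Attack"].fillna(toy["Attack"].median())

print(len(dropped), filled_mean.tolist(), filled_median.tolist())
```

The mean is pulled toward extreme values while the median is not, which is why the two fills differ here (70 versus 60).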
Here the Type2 attribute has 386 missing values. I have simply copied the Type1 value into Type2 for those rows, because a missing Type2 means the Pokemon has only one type. Even when a Pokemon has two types, its first type is generally the stronger one, so this works as a fill value.
For other datasets, any of the five options above can be used to fill in the missing values.
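Copying the first type into the missing second type could look like the sketch below. The column names "Type 1" and "Type 2" are assumed to follow the Kaggle file, and the three rows are a hand-typed sample.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Charizard"],
    "Type 1": ["Grass", "Fire", "Fire"],
    "Type 2": ["Poison", np.nan, "Flying"],
})

# fillna aligns on the row index, so each missing Type 2
# receives the Type 1 value from the same row.
data["Type 2"] = data["Type 2"].fillna(data["Type 1"])

print(data)
```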
Check for class Imbalance
There are a lot of ways to check for class imbalance. Among those, I prefer to use either a histogram or a pie chart of all the classes present in the target attribute.
For a pie chart, the count of each class must be computed first and passed to plt.pie, which can then display the percentage of each class.
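A minimal sketch of the pie-chart check, assuming the target column is called "Type 1" and using six made-up labels instead of the real 800 rows. The counts come from value_counts, and the autopct argument asks Matplotlib to print each share as a percentage.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up target labels for illustration.
type1 = pd.Series(["Water", "Water", "Water", "Fire", "Fire", "Grass"], name="Type 1")

# Count of each class, sorted from most to least frequent.
class_counts = type1.value_counts()
print(class_counts)

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(
    class_counts.values,
    labels=list(class_counts.index),
    autopct="%1.1f%%",  # show each class's share as a percentage
)
ax.set_title("Share of each Type 1 class")
```

If one wedge dominates the chart, the classes are imbalanced and accuracy alone will be a misleading metric for a classifier trained on them.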
Check for Redundant Attributes:
Selecting the best features that represent the Dataset has always been the most important step before putting our data in any kind of machine learning model.
There are various ways of selecting the best features, some of them are:
- Univariate Selection
- Feature Importance
- Correlation Matrix with Heatmap
You can visit this article “Feature Selection Techniques in Machine Learning with Python” for more information.
I have used a heat map, which shows the correlation between different attributes. We have to look for highly correlated attributes in the heat map and remove them, since they carry largely the same information.
For now, I remove the categorical features in order to analyze the effect of the non-categorical data on Type1 and Type2.
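The correlation matrix behind such a heat map can be computed with pandas alone, as sketched below on six hand-typed rows. The 0.9 cut-off is an arbitrary threshold chosen for illustration, not a value from the original analysis. The resulting corr table is what would be passed to a heat map function such as Seaborn's heatmap.

```python
import pandas as pd

# Six hand-typed rows; Name and Type 1 are categorical, the rest numeric.
stats = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Pikachu", "Charizard", "Mewtwo"],
    "Type 1": ["Grass", "Fire", "Water", "Electric", "Fire", "Psychic"],
    "Total": [318, 309, 314, 320, 534, 680],
    "HP": [45, 39, 44, 35, 78, 106],
    "Attack": [49, 52, 48, 55, 84, 110],
    "Defense": [49, 43, 65, 40, 78, 90],
    "Speed": [45, 65, 43, 90, 100, 130],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric = stats.select_dtypes(include="number")
corr = numeric.corr()

# List attribute pairs whose absolute correlation exceeds the threshold.
threshold = 0.9  # arbitrary cut-off for "highly correlated"
columns = list(corr.columns)
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(columns)
    for b in columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print(high_pairs)
```

Total is the sum of the individual stats, so it tends to correlate strongly with each of them, which makes it a natural candidate for removal.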
Check for Outliers
Box plots are a great way to depict the attributes: they show the quartile ranges and the outliers.
BoxPlots:
- Median (Q2/50th percentile): the middle value of the dataset.
- First quartile (Q1/25th percentile): the middle number between the smallest number and the median of the dataset.
- Third quartile (Q3/75th percentile): the middle value between the median and the highest value of the dataset.
- Interquartile range (IQR): 25th to the 75th percentile.
- Whiskers (shown in blue)
- Outliers (shown as green circles)
- “maximum”: Q3 + 1.5*IQR
- “minimum”: Q1 - 1.5*IQR
Refer to this article “Understanding Boxplots” to know more about boxplots.
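The “maximum” and “minimum” fences described above can also be computed by hand, which is a handy way to list the outliers a box plot would draw. The values below are toy numbers chosen so the arithmetic is easy to follow. They are not taken from the dataset.

```python
import numpy as np

# Toy attribute values with one obvious outlier.
values = np.array([45, 49, 50, 55, 60, 62, 65, 70, 180])

q1 = np.percentile(values, 25)  # first quartile
q3 = np.percentile(values, 75)  # third quartile
iqr = q3 - q1                   # interquartile range

lower_fence = q1 - 1.5 * iqr    # the box plot "minimum"
upper_fence = q3 + 1.5 * iqr    # the box plot "maximum"

# Anything outside the fences is drawn as an outlier.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)               # 50.0 65.0 15.0
print(lower_fence, upper_fence)  # 27.5 87.5
print(outliers)                  # [180]
```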
Visualizing different attributes
Strip plots, box plots, bar plots, swarm plots, factor plots, and radar charts are all great ways to visualize the various attributes in the dataset.
The function takes Type1 or Type2 (the target) and draws a strip plot and a box plot of the attribute selected in the function. In addition, a box plot is drawn using the mean of the chosen attribute grouped by Type1 or Type2 (the target).
Checking the categorical attributes: effects of the ‘Generation’ and ‘Legendary’ attributes on the dataset.
A swarm plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.
Next, we look at how the number of Pokemon of each type depends on the generation.
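The counts behind such a plot can be tabulated with a cross-tabulation, sketched here on seven hand-typed rows with column names assumed to follow the Kaggle file. Each cell is the number of Pokemon of that type in that generation.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Psyduck",
             "Chikorita", "Cyndaquil", "Totodile"],
    "Type 1": ["Grass", "Fire", "Water", "Water", "Grass", "Fire", "Water"],
    "Generation": [1, 1, 1, 1, 2, 2, 2],
})

# Rows are types, columns are generations, cells are counts.
type_by_generation = pd.crosstab(data["Type 1"], data["Generation"])

print(type_by_generation)
```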
We have seen how the Types of Pokemon are related to Generation and how each generation of Pokemon contributes to different attributes in a swarm plot.
Now we will see the effect of Legendary on the different attributes. I have used a factor plot (factorplot, renamed catplot in Seaborn 0.9) to clearly differentiate between Legendary and Non-Legendary Pokemon.
We can see from the above plot that Legendary Pokemon (orange) have a higher Total than Non-Legendary Pokemon (blue).
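The same comparison can be checked numerically with a group-by. The sketch below uses five hand-typed rows, so the averages only illustrate the technique and say nothing about the full dataset.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Articuno", "Mewtwo"],
    "Total": [318, 309, 314, 580, 680],
    "Legendary": [False, False, False, True, True],
})

# Readable group labels instead of True/False.
group = data["Legendary"].map({True: "Legendary", False: "Non-Legendary"})

# Mean Total per group.
mean_total = data.groupby(group)["Total"].mean()

print(mean_total)
```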
Comparison of Different Pokemon
- Bar-plot for each attribute for comparison of two different Pokemon
- Radar chart for comparison of two different Pokemon
# Code for Radar Chart
# Libraries
import matplotlib.pyplot as plt
import pandas as pd
from math import pi

def radar_chart(daily_Data, pokemon1, pokemon2, color1, color2):
    # daily_Data is assumed to be the Pokemon DataFrame indexed by Name;
    # pokemon1/pokemon2 are names, color1/color2 are Matplotlib colors.
    pok1 = daily_Data.loc[pokemon1]
    pok2 = daily_Data.loc[pokemon2]
    fig = plt.figure(figsize=(10, 10))
    # ------- PART 1: Create background
    # The variables, one axis per stat
    categories = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
    N = len(categories)
    # Set data: one row per Pokemon
    df = pd.DataFrame({'group': ['A', 'B'],
                       **{c: [pok1[c], pok2[c]] for c in categories}})
    # Maximum among all stats (HP included) sets the radial range
    maximum = int(df[categories].to_numpy().max())
    # Radial ticks every 10 units, up to the maximum
    tick_values = list(range(0, maximum + 1, 10))
    tick_labels = [str(t) for t in tick_values]
    # Angle of each axis in the plot (full circle / number of variables)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]
    # Initialise the spider plot
    ax = plt.subplot(111, polar=True)
    # Put the first axis on top and go clockwise
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)
    # Draw one axis per variable and add labels
    plt.xticks(angles[:-1], categories, size=15, color="black")
    # Draw y-labels
    ax.set_rlabel_position(0)
    plt.yticks(tick_values, tick_labels, color="black", size=15)
    plt.ylim(0, maximum)
    # ------- PART 2: Add plots
    # Plot each individual = each row of the data
    for row, label, color in [(0, pokemon1, color1), (1, pokemon2, color2)]:
        values = df.loc[row].drop('group').values.flatten().tolist()
        values += values[:1]  # repeat the first value to close the polygon
        ax.plot(angles, values, linewidth=4, linestyle='solid', label=label, color=color)
        ax.fill(angles, values, color, alpha=0.1)
    # Add legend
    plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
    return fig
The code for the entire article can be found at https://www.kaggle.com/ayush01/complete-pokemon-stats-visualization