Statistical Analysis with Python: Pokémon
Editor — Ishmael Njie
Why analyse Pokémon?
I wanted to start off with a dataset that was relatively small and not too complicated. I found this dataset on Kaggle: “Pokémon with stats”. I have a fair understanding of the columns and the data in the CSV file from that page so I thought, why not? The file consists of 800 rows and 13 columns, detailing the features of each Pokémon spanning 6 Generations.
For this post, it may be worthwhile to keep my Kaggle Kernel open alongside and follow it in its entirety. Now let’s start.
Preliminaries: Import Libraries
Read CSV file and save as a variable
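A minimal sketch of this step, assuming pandas. To keep it self-contained, three rows in the layout of the Kaggle file are read from an in-memory string; with the real download you would pass the file path instead (the name `Pokemon.csv` in the comment is an assumption about where you saved it).

```python
import io

import pandas as pd

# Three rows in the layout of the Kaggle file, standing in for the CSV on disk.
sample_csv = io.StringIO(
    "#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary\n"
    "1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False\n"
    "2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False\n"
    "3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False\n"
)

# With the real file this would be something like pd.read_csv("Pokemon.csv").
df = pd.read_csv(sample_csv)

# The first row shows both the automatic row index (0) and the file's own '#' column.
first_row = df.iloc[0]
print(first_row)
```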
As you can see from the first row of the data frame, there are two indices identifying each Pokémon: one created when the rows were loaded into the data frame, and one from the file itself, ‘#’.
We will set the ‘#’ column as our index. We are also going to rename the columns so that all the spaces in their names are removed.
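One way to do both steps, sketched on a two-row stand-in frame (the sample rows are only there to make the snippet runnable on its own):

```python
import pandas as pd

# Small stand-in for the loaded file, using a few of its column names.
df = pd.DataFrame({
    "#": [1, 4],
    "Name": ["Bulbasaur", "Charmander"],
    "Type 1": ["Grass", "Fire"],
    "Type 2": ["Poison", None],
    "Sp. Atk": [65, 60],
    "Sp. Def": [65, 50],
})

# Use the file's own '#' column as the index.
df = df.set_index("#")

# Remove every space from the column names: "Sp. Atk" becomes "Sp.Atk".
df.columns = df.columns.str.replace(" ", "", regex=False)

print(df.columns.tolist())
```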
Let’s look at the head of the data frame we have.
After looking at the head and tail of the data frame, we can see that there is unnecessary text in front of some Pokémon names. This needs to be removed using a regular expression (regex). Digging further through the data, I found other names with unnecessary text in front of them as well; all of this is rectified in the cell below.
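As an illustration, here is a sketch of the kind of clean-up meant here. The pattern is my own assumption and not necessarily the exact expression used in the Kernel: it drops everything that comes before "Mega " or "Primal " in a name.

```python
import pandas as pd

# Names as they appear in the raw file, plus two that must stay untouched.
names = pd.Series([
    "Venusaur",
    "VenusaurMega Venusaur",
    "CharizardMega Charizard X",
    "KyogrePrimal Kyogre",
    "Meganium",
])

# Lookahead: match the text before "Mega " or "Primal " without consuming the keyword.
# The trailing space keeps names such as "Meganium" out of the match.
pattern = r"^.*(?=Mega |Primal )"
cleaned = names.str.replace(pattern, "", regex=True)

print(cleaned.tolist())
```

Other prefixes found in the data would need their own alternatives added to the pattern.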
Mega, Primal and Legendary Pokémon
Generation 6 saw the introduction of Mega Pokémon. This evolution is not applicable to all Pokémon. An example of this evolution is the Mega Evolution of Charizard. In this case, Charizard has two Mega forms, where they both have a Total base stat of 634, as opposed to Charizard’s base form which has a Total base stat of 534.
Legendary Pokémon are Pokémon that feature in myths in the Pokémon world; two of these take a Primal form. Primal Reversion is a transformation affecting the Legendary Pokémon Kyogre and Groudon.
All of the above are very powerful and therefore, their base stats are expected to be of the highest level amongst the dataset. Since not all Pokémon can take these forms, it would be a good idea to omit these types of Pokémon from our analysis.
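The filtering described above could look roughly like this. It is a sketch that assumes the names have already been cleaned so that the special forms start with "Mega " or "Primal ", and that `Legendary` is a boolean column; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Charizard", "Mega Charizard X", "Kyogre",
             "Primal Kyogre", "Mewtwo", "Pikachu"],
    "Total": [534, 634, 670, 770, 680, 320],
    "Legendary": [False, False, True, True, True, False],
})

# True for Mega and Primal forms (cleaned names start with the keyword).
is_special_form = df["Name"].str.contains(r"^(?:Mega|Primal) ", regex=True)

# Keep only ordinary Pokémon: not a special form and not Legendary.
df_base = df[~is_special_form & ~df["Legendary"]]

print(df_base)
```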
Following this, we can look at the proportion of Pokémon in the dataset that are Legendary and those that are not.
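Because `Legendary` is a boolean column, its mean is the share of Legendary Pokémon. A small sketch on a four-row sample (the shares printed here describe the sample, not the full dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bulbasaur", "Pikachu", "Dragonite", "Mewtwo"],
    "Legendary": [False, False, False, True],
})

# Mean of a boolean column = fraction of True values.
legendary_share = df["Legendary"].mean()
non_legendary_share = 1 - legendary_share

print(f"Legendary: {legendary_share:.1%}, not Legendary: {non_legendary_share:.1%}")
```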
The dataset we have consists of Pokémon from 6 Generations. Conventionally, generations work independently of each other, so one option would be to analyse the Pokémon generation by generation, that is, with respect to their region.
2. Type Analysis
Single vs Dual. We can look at the proportion of Pokémon that are dual types vs those that are not.
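A sketch of how the split can be computed, assuming that single-type Pokémon have a missing value in the `Type2` column (the column name follows the renaming done earlier; the four rows are a stand-in sample):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Charizard"],
    "Type1": ["Grass", "Fire", "Water", "Fire"],
    "Type2": ["Poison", None, None, "Flying"],
})

# A missing secondary type means the Pokémon has a single type.
kind = df["Type2"].isna().map({True: "Single", False: "Dual"})

# Relative frequency of each label; this is what a pie chart would display.
shares = kind.value_counts(normalize=True)

print(shares)
```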
The pie chart shows that the split is fairly even. We have 50.9% of the Pokémon in this data frame that have only a single PokeType. Moving on, we can analyse the distribution of primary and secondary Pokémon types.
Primary Types
Secondary Types
A reminder to follow this post alongside the Kaggle Kernel.
3. Base Stat Analysis
The following cell and graphic show the pairwise correlation between the base stats.
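A minimal sketch of the correlation matrix with pandas. The six rows are a small stand-in sample, so the coefficients it prints will differ from those of the full data frame.

```python
import pandas as pd

stats = ["Total", "HP", "Attack", "Defense", "Sp.Atk", "Sp.Def", "Speed"]

df = pd.DataFrame(
    [
        ["Bulbasaur", 318, 45, 49, 49, 65, 65, 45],
        ["Charmander", 309, 39, 52, 43, 60, 50, 65],
        ["Squirtle", 314, 44, 48, 65, 50, 64, 43],
        ["Pikachu", 320, 35, 55, 40, 50, 50, 90],
        ["Dragonite", 600, 91, 134, 95, 100, 100, 80],
        ["Slaking", 670, 150, 160, 100, 95, 65, 100],
    ],
    columns=["Name"] + stats,
)

# Pearson correlation between every pair of base stats.
corr = df[stats].corr()

# The stat most strongly correlated with Total, ignoring Total itself.
best_stat = corr["Total"].drop("Total").idxmax()

print(corr.round(2))
print("Most correlated with Total in this sample:", best_stat)
```

If seaborn is installed, passing `corr` to `sns.heatmap(corr, annot=True)` is one way to draw such a matrix as a heat map.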
From the heat map, we can see that the correlation between the Sp.Def and Total is 0.68, which is the highest in the matrix (excluding the diagonal). We can go one step further and create a scatter plot of the Sp.Def and Total.
In the game, there are two types of attacks: Attack and Special Attack.
- Attacks (Physical Attack) make contact with the Pokémon, and damage is calculated based on the opponent’s Defense.
- Special Attacks (Sp.Atk) do not make contact with the Pokémon, and damage is calculated based on the opponent’s Special Defense.
To reinforce the comments made above, we can print the summary statistics of the fields in our dataframe.
Calling the summary statistics confirms the assumptions about the variance and skewness of both plots. The standard deviation (‘std’) of Attack is smaller than that of Defense, meaning that the Defense values are more spread out. For the special stats the order is reversed: the ‘std’ of Sp.Atk is larger than that of Sp.Def. Skewness can be gauged by comparing the mean with the median (50%). In all four cases (Attack, Defense, Sp.Atk and Sp.Def) the mean is greater than the median, which indicates that the distributions are right-skewed (positively skewed).
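The mean-versus-median check can be sketched as follows. The six-row sample is only for illustration, so its numbers are not those of the full data frame; `skew()` is added as a second, more direct measure.

```python
import pandas as pd

cols = ["Attack", "Defense", "Sp.Atk", "Sp.Def"]

df = pd.DataFrame(
    [
        [49, 49, 65, 65],     # Bulbasaur
        [52, 43, 60, 50],     # Charmander
        [48, 65, 50, 64],     # Squirtle
        [55, 40, 50, 50],     # Pikachu
        [134, 95, 100, 100],  # Dragonite
        [160, 100, 95, 65],   # Slaking
    ],
    columns=cols,
)

# count, mean, std, min, quartiles and max for every column.
summary = df.describe()

# A mean above the median (the '50%' row) hints at a right-skewed distribution.
mean_above_median = summary.loc["mean"] > summary.loc["50%"]

# Sample skewness; positive values also point to a right skew.
skewness = df.skew()

print(summary.loc[["mean", "50%", "std"]])
print(mean_above_median)
```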
Here, we will create a user-defined function (UDF) for the minimum and maximum of the base statistics. The user can pass in any data frame along with the stats array to find the Pokémon with the highest and lowest stats. First, an array needs to be formed containing the set of base stats to analyse.
An example of using the function:
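The Kernel's own function is not reproduced here; below is a sketch of what such a UDF could look like. The name `min_max_stats` and its return format are my assumptions for illustration.

```python
import pandas as pd


def min_max_stats(frame, stats):
    """For each stat, return the (name, value) of the lowest and highest Pokémon."""
    result = {}
    for stat in stats:
        low = frame[stat].idxmin()   # index label of the lowest value
        high = frame[stat].idxmax()  # index label of the highest value
        result[stat] = {
            "min": (frame.loc[low, "Name"], int(frame.loc[low, stat])),
            "max": (frame.loc[high, "Name"], int(frame.loc[high, stat])),
        }
    return result


df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Pikachu", "Dragonite", "Slaking"],
    "Total": [318, 309, 314, 320, 600, 670],
    "HP": [45, 39, 44, 35, 91, 150],
    "Speed": [45, 65, 43, 90, 80, 100],
})

# The array of base stats to analyse.
stats = ["Total", "HP", "Speed"]
extremes = min_max_stats(df, stats)

for stat, values in extremes.items():
    print(stat, values)
```

Note that `idxmin` and `idxmax` return only the first match, so ties are not reported by this sketch.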
Visually, we can compare the base stat total of each generation.
From this, we can see instantly that Generation 3 has the Pokémon with the highest base stat total. Printing the max stats with our UDF shows that this Pokémon is Slaking, with a base stat total of 670. The other Generations all share a highest base stat total of 600: Dragonite for Generation 1, Tyranitar for Generation 2, Garchomp for Generation 4, Hydreigon for Generation 5 and Goodra for Generation 6.
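The same comparison can be done numerically. This is a sketch on a small sample that contains the Pokémon named above plus two lower-stat ones, grouping by `Generation` and picking the row with the highest `Total` in each group:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bulbasaur", "Pikachu", "Dragonite", "Tyranitar",
             "Slaking", "Garchomp", "Hydreigon", "Goodra"],
    "Generation": [1, 1, 1, 2, 3, 4, 5, 6],
    "Total": [318, 320, 600, 600, 670, 600, 600, 600],
})

# Row label of the highest Total within each generation (first match on ties).
top_idx = df.groupby("Generation")["Total"].idxmax()

# Look those rows up and index the result by generation.
top = df.loc[top_idx, ["Generation", "Name", "Total"]].set_index("Generation")

print(top)
```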
That is it for the statistical analysis! There is more in the Kaggle Kernel I have published, so check that out here. It follows on with some great user-defined functions:
- One about finding the Top 10 strongest Pokémon based on Generation and base stat.
- Another that allows the user to find the Top 6 Pokémon that can combat a specific PokéType (a concept based on the handheld games).