Statistical Analysis with Python: Pokémon

Editor — Ishmael Njie

DataRegressed Team

Published in

DataRegressed

6 min read · Jan 1, 2018

Why analyse Pokémon?

I wanted to start off with a dataset that was relatively small and not too complicated. I found this dataset on Kaggle: “Pokémon with stats”. I have a fair understanding of the columns and the data in the CSV file from that page so I thought, why not? The file consists of 800 rows and 13 columns, detailing the features of each Pokémon spanning 6 Generations.

It may be worthwhile to keep my Kaggle Kernel open alongside this post to follow the analysis in its entirety. Now let’s start.

Preliminaries: Import Libraries

Read CSV file and save as a variable
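As a minimal sketch of these two steps, assuming pandas is the library in use: the snippet below reads a three-row stand-in for the Kaggle file from an in-memory string so that it runs on its own. The column headers follow the Kaggle file as I understand it, and the file name in the comment is an assumption.

```python
import io

import pandas as pd

# A three-row stand-in for the Kaggle CSV so the snippet is self-contained.
# With the real download you would pass its path instead, for example
# pd.read_csv("Pokemon.csv") (the file name is an assumption).
sample_csv = io.StringIO(
    "#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary\n"
    "1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False\n"
    "2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False\n"
    "3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False\n"
)

df = pd.read_csv(sample_csv)

print(df.shape)    # (number of rows, number of columns)
print(df.iloc[0])  # first row, including the '#' column from the file
```

On the full file, `df.shape` should report the 800 rows and 13 columns mentioned earlier.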

Looking at the first row of the data frame, you can see two indices that identify each Pokémon: one created when the rows were loaded into the data frame, and one from the file itself, ‘#’.

We will set the ‘#’ column as our index. We are also going to rename the columns so that all the spaces in their names are removed.
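One way this could be done in pandas is sketched below on a tiny hand-made frame; the frame and its contents are illustrative, not the kernel’s actual cell.

```python
import pandas as pd

# Small illustrative frame with the same kind of column names as the CSV.
df = pd.DataFrame(
    {
        "#": [1, 2],
        "Name": ["Bulbasaur", "Ivysaur"],
        "Type 1": ["Grass", "Grass"],
        "Sp. Atk": [65, 80],
    }
)

# Use the file's own '#' column as the index instead of the default 0, 1, 2...
df = df.set_index("#")

# Strip every space from the column names: "Type 1" -> "Type1", "Sp. Atk" -> "Sp.Atk".
df.columns = df.columns.str.replace(" ", "", regex=False)

print(df.columns.tolist())
```

Removing the spaces means columns such as `Type1` can also be reached with attribute access (`df.Type1`).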

Let’s look at the head of the data frame we have.

Looking at the head and tail of the data frame shows that there is unnecessary text in front of some Pokémon names. This needs to be removed using a regular expression (regex). Searching further through the data, I found similar unnecessary text in front of other names; all of this is rectified in the cell below.
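A hedged sketch of this kind of clean-up follows. The sample names are illustrative of how I understand the raw file to look (the base name fused in front of the form name), and the pattern only handles the “Mega” and “Primal” cases; the kernel’s own expressions may differ.

```python
import pandas as pd

# Illustrative raw names: the base name is fused in front of the form name.
names = pd.Series(
    [
        "VenusaurMega Venusaur",
        "CharizardMega Charizard X",
        "KyogrePrimal Kyogre",
        "Meganium",  # must stay untouched: "Mega" here is part of the name
        "Pikachu",
    ]
)

# Remove everything that precedes "Mega " or "Primal " (note the trailing space).
# The lookahead (?=...) keeps the form prefix itself in the result.
cleaned = names.str.replace(r"^.+(?=Mega |Primal )", "", regex=True)

print(cleaned.tolist())
```

Requiring the trailing space in the lookahead is what stops a name like Meganium from being altered.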

Mega, Primal and Legendary Pokémon

From top left: Mega Charizard X, Mega Charizard Y and Charizard

Generation 6 saw the introduction of Mega Pokémon. This evolution is not available to all Pokémon. One example is the Mega Evolution of Charizard, which has two Mega forms. Both have a Total base stat of 634, as opposed to Charizard’s base form, which has a Total base stat of 534.

Legendary Pokémon are Pokémon that feature in myths in the Pokémon world; two of these take a Primal form. Primal Reversion is a transformation affecting the Legendary Pokémon Kyogre and Groudon.

From left to right: Primal Groudon and Primal Kyogre

All of the above are very powerful, so their base stats are expected to be among the highest in the dataset. Since not all Pokémon can take these forms, it would be a good idea to omit these types of Pokémon from our analysis.

Omitting Mega and Primal Pokémon: Mega Venusaur is no longer present in this data frame
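A minimal sketch of such a filter, assuming the names have already been cleaned so that Mega and Primal forms start with “Mega ” or “Primal ”; the sample rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Venusaur", "Mega Venusaur", "Meganium", "Kyogre", "Primal Kyogre"],
        "Total": [525, 625, 525, 670, 770],
    }
)

# True for names that begin with "Mega " or "Primal " (trailing space included,
# so that Meganium is not mistaken for a Mega form).
is_mega_or_primal = df["Name"].str.contains(r"^(?:Mega|Primal) ")

# Keep only the rows that are NOT Mega or Primal forms.
base_forms = df[~is_mega_or_primal]

print(base_forms["Name"].tolist())
```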

Poke holds all Pokémon that are not Legendary; Poke L holds all those that are.

Following this, we can look at the proportion of Pokémon in the dataset that are Legendary and those that are not.
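The split and the proportion could be sketched as follows. The variable names `poke` and `poke_legendary` are stand-ins chosen for this example, and the four rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charizard", "Mewtwo", "Pikachu"],
        "Legendary": [False, False, True, False],
    }
)

# Boolean masks split the frame into the two groups.
poke = df[~df["Legendary"]]           # not Legendary
poke_legendary = df[df["Legendary"]]  # Legendary

# The mean of a boolean column is the fraction of True values.
legendary_share = df["Legendary"].mean()

print(len(poke), len(poke_legendary), legendary_share)
```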

The dataset we have consists of Pokémon from 6 Generations. Conventionally, generations work independently of each other, so one option would be to analyse Pokémon with respect to their region.

2. Type Analysis

Single vs Dual. We can look at the proportion of Pokémon that are dual types vs those that are not.
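A minimal sketch of how that proportion could be computed, assuming a missing value in the `Type2` column marks a single-type Pokémon; the four rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Charizard", "Squirtle"],
        "Type1": ["Grass", "Fire", "Fire", "Water"],
        "Type2": ["Poison", None, "Flying", None],
    }
)

# A Pokémon is dual-typed when it has a value in the Type2 column.
is_dual = df["Type2"].notna()

# Percentage of single-type and dual-type Pokémon.
shares = is_dual.map({True: "Dual", False: "Single"}).value_counts(normalize=True) * 100

print(shares)
```

The resulting Series can be passed straight to a pie chart, for example with `shares.plot.pie()`.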

The pie chart shows that the split is fairly even: 50.9% of the Pokémon in this data frame have only a single PokéType. Moving on, we can analyse the distribution of primary and secondary Pokémon types.

Primary Types

Water has the highest frequency as a primary PokéType. Flying has the lowest. The bar plot uses ‘type1_colours’ to colour each bar to match its type. The ‘type1_colours’ were set in a cell beforehand.
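One way to keep the colours aligned with the bars is to build the colour list from the same order as the counts. In the sketch below the `type_colour_map` dictionary and its hex values are hypothetical, and the six rows are illustrative.

```python
import pandas as pd

# Hypothetical mapping from type name to a hex colour.
type_colour_map = {"Water": "#6890F0", "Fire": "#F08030", "Grass": "#78C850"}

primary = pd.Series(["Water", "Water", "Water", "Fire", "Fire", "Grass"], name="Type1")

# Frequency of each primary type, most common first.
counts = primary.value_counts()

# Build the colour list in the same order as the bars will be drawn.
type1_colours = [type_colour_map[t] for t in counts.index]

print(counts)
print(type1_colours)
# With matplotlib installed, counts.plot.bar(color=type1_colours) draws the bars.
```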

Secondary Types

The ‘None’ type was set in this cell, for Pokémon that did not have a secondary PokéType.
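A sketch of that step, assuming missing secondary types appear as missing values in a `Type2` column; the five rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Type1": ["Grass", "Fire", "Fire", "Water", "Water"],
        "Type2": ["Poison", None, "Flying", None, None],
    }
)

# Replace missing secondary types with the literal label "None"
# so that they show up as their own bar in the frequency plot.
df["Type2"] = df["Type2"].fillna("None")

secondary_counts = df["Type2"].value_counts()

print(secondary_counts)
```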

Here, we can see that the ‘None’ field has the highest frequency. We can also see that ‘Flying’ is the most common secondary PokéType, even though it is the least common on the primary plot.

A reminder to follow this post alongside the Kaggle Kernel.

3. Base Stat Analysis

The following cell and graphic show the pairwise correlations between the base stats.
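A minimal sketch of how such a correlation matrix could be computed with pandas. The six rows are illustrative, so the coefficients will not match those from the full dataset.

```python
import pandas as pd

stats = ["HP", "Attack", "Defense", "Sp.Atk", "Sp.Def", "Speed"]

df = pd.DataFrame(
    {
        "HP": [45, 60, 80, 39, 58, 78],
        "Attack": [49, 62, 82, 52, 64, 84],
        "Defense": [49, 63, 83, 43, 58, 78],
        "Sp.Atk": [65, 80, 100, 60, 80, 109],
        "Sp.Def": [65, 80, 100, 50, 65, 85],
        "Speed": [45, 60, 80, 65, 80, 100],
    }
)
df["Total"] = df[stats].sum(axis=1)

# Pearson correlation between every pair of columns.
corr = df[stats + ["Total"]].corr()

# The stat most strongly correlated with Total (Total itself excluded).
strongest_with_total = corr["Total"].drop("Total").idxmax()

print(corr.round(2))
print(strongest_with_total)
```

A heat map of `corr` can then be drawn with a plotting library, for example seaborn’s `heatmap` function.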

From the heat map, we can see that the correlation between the Sp.Def and Total is 0.68, which is the highest in the matrix (excluding the diagonal). We can go one step further and create a scatter plot of the Sp.Def and Total.

Overall, for all generations, the correlation of 0.68 is echoed by the scatter plot, which shows a positive correlation between Sp.Def and Total.

In the game, there are two types of attacks: Attack and Special Attack.

Attacks (Physical Attack) make contact with the Pokémon, and damage is calculated based on the opponent’s Defense.

Here, the distributions of the two attributes are similar, and one could suggest that both are positively skewed. The Defense stat has a noticeably longer right tail than the Attack stat, indicating that more Pokémon have very high Defense stats than very high Attack stats. You could also argue that the Defense stat has a higher variance than the Attack stat.

Special Attacks (Sp.Atk) do not make contact with the Pokémon, and damage is calculated based on the opponent’s Special Defense.

Sp.Def and Sp.Atk have similar distributions. One could argue that both are positively skewed, as a large number of the Pokémon have relatively low base statistics, with a few Pokémon having a large Sp.Def and/or Sp.Atk stat. Visually, you could argue that Sp.Atk, in blue, has a larger variance than Sp.Def. One can also see that Sp.Def reaches the higher maximum of the two, at approximately 225.

To reinforce the comments made above, we can print the summary statistics of the fields in our dataframe.

By calling on the summary statistics, we can see that the assumptions about the variance and skewness of both plots were correct. The ‘std’ metric (standard deviation) of Attack is less than that of Defense, meaning that the Defense statistics are more spread out. Similarly, the Sp.Atk ‘std’ is larger than that of Sp.Def. Skewness can be gauged from the relative positions of the median (50%) and the mean. Since in all four cases (Attack, Defense, Sp.Atk and Sp.Def) the mean is greater than the median, the distributions appear to be right-skewed (positively skewed).
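The comparison described above can be sketched as follows. The numbers are made up for illustration (they are not taken from the dataset) and were chosen so that both columns have a long right tail.

```python
import pandas as pd

# Made-up values with a long right tail in both columns.
df = pd.DataFrame(
    {
        "Attack": [30, 45, 50, 55, 60, 75, 130],
        "Defense": [20, 40, 45, 50, 65, 80, 230],
    }
)

summary = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max

# A mean above the median (the 50% row) hints at a right-skewed distribution.
mean_minus_median = summary.loc["mean"] - summary.loc["50%"]

# pandas can also report the sample skewness directly; positive means right-skewed.
skewness = df.skew()

print(summary.loc[["mean", "50%", "std"]])
print(mean_minus_median)
print(skewness)
```

Comparing the mean with the median is only a rule of thumb, so checking it against `df.skew()` is a useful cross-check.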

Here, we will create a user-defined function (UDF) for the minimum and maximum of the base statistics. The user can input any data frame along with the stats array to find the Pokémon with the highest and lowest stats. First, an array needs to be formed containing the set of base stats to analyse.

An example of using the function:

This shows the Pokémon with the highest stat in each attribute, along with the generation they belong to.
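A function of this kind could be sketched as below. The name `min_max_stats`, its return format and the three-row sample are all assumptions made for this example, not the kernel’s actual code.

```python
import pandas as pd


def min_max_stats(frame, stats):
    """Return the Pokémon with the highest and lowest value for each stat.

    Hypothetical helper: expects 'Name' and 'Generation' columns
    and a unique index on `frame`.
    """
    rows = []
    for stat in stats:
        for label, idx in (("max", frame[stat].idxmax()), ("min", frame[stat].idxmin())):
            rows.append(
                {
                    "Stat": stat,
                    "Extreme": label,
                    "Name": frame.loc[idx, "Name"],
                    "Value": frame.loc[idx, stat],
                    "Generation": frame.loc[idx, "Generation"],
                }
            )
    return pd.DataFrame(rows)


# Illustrative three-row frame and the array of stats to analyse.
sample = pd.DataFrame(
    {
        "Name": ["Chansey", "Shuckle", "Slaking"],
        "HP": [250, 20, 150],
        "Attack": [5, 10, 160],
        "Defense": [5, 230, 100],
        "Generation": [1, 2, 3],
    }
)
stats = ["HP", "Attack", "Defense"]

extremes = min_max_stats(sample, stats)
print(extremes)
```

Because the function takes the frame as an argument, it can be applied to the whole dataset or to a subset such as a single generation.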

Visually, we can compare the base stat total of each generation.

From this, we can see instantly that Generation 3 has the Pokémon with the highest total base stat. From printing the max stats using our UDF, we know that this Pokémon is Slaking, with a base stat total of 670. In every other generation the highest base stat total is 600: Dragonite for Generation 1, Tyranitar for Generation 2, Garchomp for Generation 4, Hydreigon for Generation 5 and Goodra for Generation 6.
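The per-generation maximum can also be read off directly with a group-by. The sketch below uses five illustrative rows rather than the full dataset.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Dragonite", "Pikachu", "Tyranitar", "Slaking", "Treecko"],
        "Generation": [1, 1, 2, 3, 3],
        "Total": [600, 320, 600, 670, 310],
    }
)

# Index label of the row with the highest Total within each generation.
top_idx = df.groupby("Generation")["Total"].idxmax()

# Look those rows up and index the result by generation.
top_per_generation = df.loc[top_idx, ["Generation", "Name", "Total"]].set_index("Generation")

print(top_per_generation)
```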

That is it for the statistical analysis! There is more in the Kaggle Kernel I have published, so check that out here. It follows on with some great user-defined functions:

One about finding the Top 10 strongest Pokémon based on Generation and base stat.
Another that allows the user to find the Top 6 Pokémon that can combat a specific PokéType (a concept based on the handheld games).

Also check out the code here at my GitHub.
