Pokémon Data Analysis

Nike Halifax
CodeX
Published in
9 min readAug 14, 2021
Photo by Michael Rivera 🇵🇭 on Unsplash

My goal was to look at the type distribution of Pokémon throughout the series, what types were the strongest and weakest, and which individual Pokémon were the strongest and weakest.

Pokémon possess various attributes that lend themselves to statistical analysis such as individual Attack, Defense, Special Attack, Special Defense, Speed, and Health stats. Additionally, they have at least one (often two) elemental type(s). They can be further sorted by what generation they debuted in, what egg group they belong to, what typing they possess, and whether they’re “legendary” or not.

An added bonus is Pokémon data is messy. Pokédex numbers are useless as an indexing tool, as multiple Pokémon have multiple forms that share the same number. Mega evolutions (super-powered versions of certain Pokémon) skew statistical results if not taken into account. Also, depending on the dataset and your goals, there might be extraneous data like height, weight, egg groups (what Pokémon are able to breed with each other), or even color or typing matchups. While these can be useful, they aren’t needed in my project.

I initially used this dataset from Kaggle containing all Pokémon between Generations 1–6. It has 13 columns: Pokédex #, Name, Type 1, Type 2, Base Stat Total, Base HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation, and a True/False “Legendary” column. Long story short, Legendary Pokémon tend to have higher stats and more advantageous typing.

At time of writing there are 8 Pokémon generations, so to add the remaining two I cobbled together a web scraping program in Python to rip them from Bulbapedia. That program has been lost to time and can’t be shown here. However, I can tell you that this data possessed fewer columns and thus required alteration in Excel to make it congruous with the rest of the dataset. For example, I had to add data for the legendary column and what generation the Pokémon belonged to. With the Pokédex of all 8 generations sitting in one Excel file, I set up a table to import it into MS SQL Server:

Then I added a Region ID column. “Generation” and “Region” are usually redundant, but when you run into regional forms (e.g. Diglett introduced in Gen I vs. Alolan Diglett introduced in Generation VII) and Mega Evolutions (unveiled in Generation VI), Generation loses some of its utility. Initially the dataset had Mega Evolutions share the generation of their base Pokémon, but I found it useful to confine Megas to their respective generation, if not filter them out entirely.

I’m sure there’s a sexier way to do this.

Next was manually inputting which Pokémon were “Legendary.” If you ask most Pokémon fans, Pokémon like Mew, Celebi, Zapdos, and Jirachi all qualify. However, a lot of Pokémon that Western merchandise refer to as “Legendary” are in fact “Mythical” Pokémon. The difference is annoying and doesn’t mean much, but the original dataset uses the Legendary/Mythical distinction. For statistical purposes I found this could skew the results of type-wide analyses, as Mythical Pokémon still tend to have significantly higher stat spreads than non-legendary/mythical Pokémon. In the future, a better move might be to create 3 categories in one column: Legendary, Mythical, and Neither.

At the time it felt more concise to just put all the “special” Pokémon together and have the column store their legendary status as a binary value. This requires knowledge of what Pokémon are considered what category, but the dataset is small enough that manual entry isn’t a hassle.

Next, I had to fix the Mega entries for all Mega Evolutions. The original dataset has all Mega Pokémon as “NameMega Name” (e.g. VenusaurMega Venusaur). I updated the column to put a space between all name entries with “Mega” or “Primal” in them, as well as to set their Generations to 6. I then switched the Pokémon Meganium and Yanmega back to their correct generations, as I knew the previous query would change them.

Then I split the table into two: one table containing the statistics for each Pokémon, and one containing the Name, Generation, Region, and Legendary status. I also made a view replicating the original combined table, just in case. These steps aren’t necessary for a dataset this small, but it’s good practice for database design. It also makes Power Pivot a little easier to use.

From here I corrected a few small errors I noticed on initial export back to Excel. This highlights some of the minutiae of Pokémon I don’t think you’d catch unless you have no life (like me):

I also made an embarrassing rookie mistake and forgot to trim the type columns for errant spaces, resulting in extra type combinations:

I also created a view that excluded Mega evolutions, as well as updated the database with 2 more monsters that were missing:

Now it was time to export this data back to Excel.

Originally I did the following the old-fashioned way — a series of row-by-row calculations within the spreadsheet itself. It was clunky, it was ugly, it made my budget laptop scream when running the calculations, and the file took forever to open because it was 38 megabytes. It worked, but I wasn’t satisfied.

This was slow and ugly.

I decided to use Excel Power Pivot and the data model. Power Pivot rocks, even if the interface is a nightmare. It lets you create a data model using data from various sources, which then allows you to run a variety of calculations you can’t do in Excel — all while taking up less space. My final file, visualizations and all, was barely over a megabyte.

To get my answer to which Pokémon and type combo were the most powerful, a simple summary of what types have the highest-stat Pokémon wouldn’t work because many Pokémon share the same base stat total.

Here we see the perfidious headache that is Pokémon forms.

Instead, I needed to get the z-score of each battle statistic in order to account for the variance across each one. That meant getting the mean and standard deviation for each stat, with the ensuing Standard Deviation formula set up in such a way that the results will update depending on what criteria the end user decides to filter by. Unfortunately, there’s no interesting way to show Power Pivot in a blog post, so we’ll just jump to the results. Keep in mind the images in this article will be woefully compressed. I highly recommend clicking on the link to the full workbook so you can see and interact with the original dashboards as intended.

First, let’s look at total type distribution throughout the series:

A quick glance shows us that Water and Normal types are the most popular by pretty significant margins. Additionally, Normal-Flying is the most common type combination. Here are some less impressive charts that weren’t murdered by compression:

Water is the most popular Primary typing, Flying the least.
Flying is the most popular secondary typing, it’s not even close.

So Normal-Flying is the most popular type combination, and Normal and Flying seem to have an inverse relationship with one another. Normal is one of the most popular primary typings, but the least popular secondary typing. Conversely, Flying is the most popular second typing, but the least popular primary typing. Just a neat observation.

Remember when I said Pokémon data is messy and Pokédex numbers are useless? A fun caveat of every chart in this post: I did not remove Pokémon forms. For analysis of stats, this isn’t a big issue. When tabulating type averages and totals, however, it makes things tricky. For example, there aren’t really 11 Ghost-Grass type Pokémon. There are, however, different-sized versions of the same two Ghost-Grass types. Each one has its own unique set of stats.

A similar case is Deoxys clogging up the ranks for Psychic types. What’s more, Deoxys’ forms all have specialized stat distributions that could affect statistical analysis:

That’s just the tip of the iceberg. There are other Pokémon with conditional or regional forms. It’s easy to justify removing the Mega evolutions; they’re a well-defined category of conditional transformations with high stats. It’s harder to justify removing the Deoxys forms or to just choose one Pumpkaboo or Gourgeist size, or to remove Zen-Mode Darmanitan. At that point it’s just nit-picking what stays and what goes.

Alright, so what typings are the most powerful? Using the median Z-score of base stat totals for each type combination as a measure of total statistical “strength,” here are the results:

All these years and Psychic and Dragon are still on top.

For anyone who plays the games, these results are unsurprising: Dragon and Psychic have the highest frequency of powerful type combos. Both types are commonly found on legendary Pokémon, and Dragon is both one of the rarer types and in general one of the most powerful by base stat total. With that in mind, what type combinations specifically are the most and least powerful?

Bug can’t catch a break.

Psychic and Dragon dominate the list of the strongest types, with Bug cowering in submission. However, it’s important to keep in mind that these lists include Legendaries, which often possess rare or unique type combinations. For example, the only Dragon-Ice type is Kyurem, a powerful legendary that has 3 different forms, 2 of which have very high base stat totals. Without legendaries, the list looks like this:

Bug still can’t catch a break.

Bug is still annihilated, but at least it gets a representative in the elite (thanks, Heracross!). Fire and Rock also have more room to flex their muscles without the giants stomping around.

Finally, what are the strongest and weakest overall Pokémon, based on total z-score strength?

Oops, all Ubers!

Okay, better question, what are the strongest non-legendary monsters?

This comes as a surprise to no one who’s played this game competitively.

And the ten weakest?

This is almost one-to-one with weakest by Base Stat Total.

Alright, so what have we learned? Even when adjusting for Legendaries and their absurd stats, Dragons hold the crown for the most powerful type and make up the majority of the ten strongest Pokémon when basing your ranking off of z-scores. Meanwhile, Bug types are overwhelmingly relegated to the bottom of the barrel. Obviously this doesn’t tell the whole story: Base stats and statistical analyses leave out other important aspects of a Pokémon’s usability. Things like movesets, typing, held items, and abilities all play arguably as important a role in determining a Pokémon’s viability (which is why nobody uses Slaking). Still, it’s interesting to note that most of the Pokémon in the top 10 have or at one time had a presence in the competitive battle scene.

Well, this was a lot longer than expected. If you’d like to play around with the results yourself, as well as some additional charts and tables I left out for the sake of brevity (yeah, I know), I highly recommend clicking the link to the Excel Dashboards I made down below. The scatterplot has interactive features that won’t work unless you download it to your desktop, but everything else should work fine:

--

--