PokeML: Understanding Principal Component Analysis through Pokemon Data

Patrick Martin
7 min read · Jul 12, 2021

Now that we have our Pokemon matrix from the last chapter, we can start doing some basic statistics on it. You can follow along by downloading the data and a Jupyter notebook here. The first things we might look at are conditional statistics: the statistics of one feature conditioned on the value of another feature.

The average Base Stat over all Pokemon of a certain Type (here, Water) is a conditional expectation. (Image from Bulbapedia)

Even experiments as simple as the conditional mean can be very informative. For example, the average Base Speed of Rock-type Pokemon is about 14 points lower than the total average, while Electric-type Pokemon are 18 points faster!

That there is great variability in the average Base Speed across different types is interesting!
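
As a minimal sketch of a conditional mean, the snippet below compares the average of one column against its average over the rows where a second, boolean column is true. The arrays `speed` and `is_electric` and all of their values are invented for illustration; they are not taken from the Pokemon matrix.

```python
import numpy as np

# Toy stand-ins for two columns of the Pokemon matrix (all values are made up).
speed = np.array([35.0, 110.0, 20.0, 90.0, 45.0, 130.0])
is_electric = np.array([0, 1, 0, 1, 0, 1], dtype=bool)

overall_mean = speed.mean()                  # mean over every row
electric_mean = speed[is_electric].mean()    # mean over rows where the condition holds
difference = electric_mean - overall_mean

print(f"overall {overall_mean:.1f}, conditional {electric_mean:.1f}, diff {difference:+.1f}")
```

The same boolean-mask pattern should work for any pair of columns, as long as the conditioning column can be turned into a true/false indicator.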

The issue is that there are too many of these pairs to check: in our 734-by-1550 matrix, there are 1550*1549 ordered pairs of features to compare (although a fraction of those will be uninformative due to the duplication of Base Stat information). With over two million ordered pairs of features, not only is that too many to parse through by hand, but it also becomes likely that some correlation will appear purely by chance.
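
To illustrate how chance correlations arise, here is a rough sketch using purely random, independent columns. The 734 rows match the matrix above, but the choice of 200 columns and the Gaussian noise are assumptions made only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 734, 200     # 200 columns is an arbitrary, smaller stand-in
noise = rng.standard_normal((n_rows, n_features))   # independent by construction

corr = np.corrcoef(noise, rowvar=False)   # (200, 200) matrix of pairwise correlations
np.fill_diagonal(corr, 0.0)               # ignore each feature's correlation with itself

n_unordered_pairs = n_features * (n_features - 1) // 2
max_abs_corr = np.abs(corr).max()

print(n_unordered_pairs, round(float(max_abs_corr), 3))
```

Even though no column is related to any other, the largest correlation among the thousands of pairs is noticeably far from zero, which is why checking every pair by hand invites false discoveries.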

Instead, let’s look for a few special feature combinations that particularly pique our interest. We ask two questions: whether we can reproduce our matrix with only a few feature combinations, and whether there are feature combinations that yield the same value for every Pokemon. It turns out that both questions can be answered using linear algebra!

The Singular Value Decomposition (SVD) of a matrix A consists of two collections of orthonormal vectors (uᵢ)ᵢ and (vᵢ)ᵢ and a collection of non-negative numbers (σᵢ)ᵢ in non-increasing order such that we can write A as a sum of rank-1 matrices:

A = Σᵢ σᵢ uᵢ vᵢ’

The Singular Value Decomposition of a matrix

This decomposition has an additional, crucial property: the unweighted reconstruction error (the sum of squared entrywise differences) of the first k terms is as small as possible; no other rank-k matrix does better.
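
The decomposition and its truncation can be sketched with NumPy's `np.linalg.svd`. The small random matrix `A` below is only a stand-in for the Pokemon matrix, and `k = 2` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))   # small random stand-in for the Pokemon matrix

# Thin SVD: columns of U are the u_i, rows of Vt are the v_i', s holds the sigma_i
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Summing all rank-1 terms sigma_i * u_i * v_i' rebuilds A
A_full = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))

# Keeping only the first k terms gives the rank-k truncation
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Squared reconstruction error: the sum of squared entrywise differences
err_sq = np.linalg.norm(A - A_k, "fro") ** 2
print(err_sq, np.sum(s[k:] ** 2))
```

The two printed numbers should agree: the squared error of the truncation equals the sum of the squares of the discarded singular values.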

There is a bit of a catch, however, which is that in requesting the uᵢ and vᵢ be orthogonal we implicitly ask for the features to be mean zero, and in asking the unweighted reconstruction error be minimized we implicitly ask that each of the features has unit variance. Centering and standardizing the features of our Pokemon matrix ‘pokemat’ is not too difficult:
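import numpy as np
# Center each feature (column) to mean zero
pokemean = np.mean(pokemat, axis=0)
pokemat_centered = pokemat - pokemean
# Scale each feature to unit variance (pokevar holds the standard deviations)
pokevar = np.std(pokemat_centered, axis=0)
pokemat_std = pokemat_centered / pokevar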
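
One caveat with this kind of standardization is that a feature with the same value for every row has zero standard deviation, so the division produces NaN. The helper below is a hypothetical sketch of one way to guard against that; the name `standardize` and the choice to leave constant columns at zero are assumptions, not part of the original notebook.

```python
import numpy as np

def standardize(mat):
    """Center columns to mean zero and scale to unit variance, skipping constant columns."""
    mat = np.asarray(mat, dtype=float)
    centered = mat - mat.mean(axis=0)
    std = centered.std(axis=0)
    # Constant columns have std 0; dividing by 1 leaves them at zero instead of NaN
    safe_std = np.where(std > 0, std, 1.0)
    return centered / safe_std

toy = np.array([[1.0, 0.0, 5.0],
                [0.0, 0.0, 7.0],
                [1.0, 0.0, 9.0]])   # the middle column is constant
toy_std = standardize(toy)
print(toy_std)
```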

Taking the SVD, we can look at the right singular vectors (the vᵢ) to see what features they care about. Each σᵢ uᵢ vᵢ’ summand represents a “principal component” of the data, hence this method is often called Principal Component Analysis. The uᵢ represent a “score” of how much each Pokemon belongs to each component (it is a multiple of the cosine of the angle between the Pokemon vector and vᵢ), and by multiplying our pokemat matrix on the left by uᵢ’, we can see what a stereotypical Pokemon of this component would be. We could alternatively analyze the entries of vᵢ for a similar analysis.
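
Here is a minimal sketch of how such a component summary could be computed. It assumes the SVD is taken of the standardized matrix and multiplies that same matrix by uᵢ’; the random data and the six feature labels are placeholders rather than the real Pokemon features.

```python
import numpy as np

rng = np.random.default_rng(1)
feature_names = ["HP", "Atk", "Def", "SpA", "SpD", "Spe"]   # illustrative labels only
X = rng.standard_normal((20, len(feature_names)))
X = (X - X.mean(axis=0)) / X.std(axis=0)    # centered and standardized

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Share of the total variance carried by each component
explained = s**2 / np.sum(s**2)

# Multiplying on the left by u_i' gives sigma_i * v_i': the "stereotypical" row
i = 0
stereotype = U[:, i] @ X

# Rank features by the magnitude of their entry, largest first
order = np.argsort(-np.abs(stereotype))
print(f"Explained variance: {explained[i]:.2%}")
for j in order[:3]:
    print(f"{feature_names[j]} : {stereotype[j]:+.3f}")
```

Because uᵢ’ times the decomposed matrix equals σᵢ vᵢ’, reading off the entries of vᵢ directly gives the same ranking of features, up to the scale factor σᵢ.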

Without further ado, what are the important features of the first principal component?

Explained variance:  1.60%
Types
Ground : -7.842690639669932
Rock : -7.427309867430017
Fighting : -6.711810973239125
...
Grass : 7.991340126650072
Fairy : 9.733883089100411
Psychic : 15.433455316346226
Evo: 0.6233045567308383
Stats
HP: -2.7464194907248984
Atk: -11.298376124782424
Def: -5.549751665421634
SpA: 7.224040378967937
SpD: 5.06510658441219
Spe: 0.5509736771876438
Abilities
Rock Head : -5.366632148394648
Mold Breaker : -4.583375128153399
Sheer Force : -4.576362112944537
Sturdy : -4.27359708807994
Swift Swim : -4.190268693604895
...
Healer : 5.367723614159558
Chlorophyll : 5.478950433731844
Frisk : 5.5382971667483325
Magic Guard : 6.597401331384066
Synchronize : 8.47350624096705
Moves
rocktomb : -13.646225131301534
stoneedge : -12.63027441284487
rockslide : -12.282327981594978
earthquake : -11.530355194925749
bulldoze : -11.232702455810752
...
trick : 16.657537410346166
trickroom : 16.796498775672784
psychic : 16.872067529716976
skillswap : 16.902851483352702
psyshock : 17.13967137265958
Largest features:
Move: psyshock : 17.13967137265958
Move: skillswap : 16.902851483352702
Move: psychic : 16.872067529716976
Move: trickroom : 16.796498775672784
Move: trick : 16.657537410346166
Move: storedpower : 16.118377397035715
Move: lightscreen : 16.07773481468094
Move: energyball : 15.665947638171573
Type: Psychic : 15.433455316346226
Move: dazzlinggleam : 15.14701664142972

This first component appears to heavily prefer Psychic types, as well as Grass and Fairy types, and disfavor Ground, Rock, and Fighting types. The component has higher Special Attack and Special Defense than the mean, while having lower Attack and Defense. The Abilities and Moves also reinforce this, with the component preferring moves that are Special or Status and Psychic-typed and disfavoring moves that are Physical and Rock- or Ground-typed. From a purely data-driven analysis we have discovered what any Pokemon fan will tell you: the primary division in Pokemon is the Physical/Special split!

Let’s take a look at the second component:

Explained variance:  1.26%
Types
Water : -18.343344582560572
Ice : -5.109397300953924
Rock : -3.868821527236046
...
Electric : 6.517038067430568
Fire : 6.811680923962266
Fighting : 9.581870685075659
Evo: -4.9444025216422425
Stats
HP: 2.103884825699211
Atk: 8.50298827038834
Def: -1.6591581964655318
SpA: 2.035036169380743
SpD: -0.26441867161077665
Spe: 10.286840438998459
Abilities
Swift Swim : -9.414848563938712
Water Absorb : -7.688068913052541
Damp : -6.975738175084524
Torrent : -6.474954322989725
Rain Dish : -5.849056523697185
...
Inner Focus : 4.160532791942758
Steadfast : 4.699784219214815
Guts : 4.771739344037478
Static : 5.178942222904696
Blaze : 5.913742466080827
Moves
watergun : -17.505184571571117
scald : -16.828757114621947
dive : -16.657356068198325
hail : -16.41303978782391
hydropump : -16.21926921790699
...
coaching : 10.370799382360444
reversal : 10.650091267194984
bulkup : 11.04593130157761
lowsweep : 11.516235648612527
thunderpunch : 12.048234455130169
Largest features:
Type: Water : -18.343344582560572
Move: watergun : -17.505184571571117
Move: scald : -16.828757114621947
Move: dive : -16.657356068198325
Move: hail : -16.41303978782391
Move: hydropump : -16.21926921790699
Move: waterfall : -15.823592607035485
Move: whirlpool : -15.424096570855415
Move: brine : -14.931194467442863
Move: waterpulse : -14.575918523918133

The Pokemon in this component are definitely not Water types, as can be seen from the Type components, the Abilities, and the Moves. Identifying what this component favors is a little trickier; certainly Fighting types are preferred, but so are Electric and Fire types. The Stats, Abilities, and Moves, however, appear to indicate a preference for “fighters”: Pokemon with reasonable offensive power that learn punching or kicking moves. One thing we can do to better identify this component is to ask which Pokemon align best with it, by looking at the entries when we multiply the pokemat on the right by v₂. Among the most disfavored Pokemon are Tentacool and Omanyte, while among the most favored are Cinderace and Hawlucha. It almost appears that the second most distinguishing feature of Pokemon is whether or not they have strong appendages!
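
Ranking rows by their alignment with a component can be sketched as follows. The placeholder names and random data are assumptions for illustration; with the real matrix, `names` would hold the Pokemon names in row order.

```python
import numpy as np

rng = np.random.default_rng(2)
names = np.array([f"mon_{n}" for n in range(10)])   # placeholder names, not real Pokemon
X = rng.standard_normal((10, 4))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

v2 = Vt[1]          # second right singular vector
scores = X @ v2     # one score per row; equals sigma_2 * u_2

ranking = np.argsort(scores)                 # ascending: most disfavored first
most_disfavored = names[ranking[:2]]
most_favored = names[ranking[-2:][::-1]]
print("disfavored:", most_disfavored, "favored:", most_favored)
```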

Another benefit of this Principal Component Analysis is that it allows for easy visualization of data. By projecting our Pokemon onto the first two singular vectors, we can create a scatterplot of our data.

We can moreover color each dot by the Type of the Pokemon in order to see how Type is represented on this plot, and also label certain Pokemon so we can see which is which.

Dots are colored approximately according to Type; there are too many Types to give a key.
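
The coordinates for such a scatterplot can be computed as in this sketch, which projects each row onto the first two right singular vectors. The random matrix stands in for the standardized Pokemon matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Project every row onto the first two right singular vectors
coords = X @ Vt[:2].T             # shape (50, 2)
x, y = coords[:, 0], coords[:, 1] # x = first component, y = second component
print(coords[:3])
```

These two columns can be handed to any plotting library. With matplotlib, for example, `plt.scatter(x, y, c=type_codes)` would color each dot by a numeric Type code, where `type_codes` is a hypothetical array with one entry per row.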

The point of visualization is to identify non-uniformities in the data. For example, the strong Special attackers tend to not have a strong inclination on the second ‘Water vs fighter’ component, while the Physical attackers do vary a lot in the second component.

It can also be useful to look at the smallest singular values. In the case of this matrix, there are nine very small singular values, explaining about 10⁻³⁰% of the variance, representing the fact that our matrix is not full rank. These vectors will represent combinations of features that identify a small class of Pokemon almost uniquely, with signs given to the features so that the values largely cancel out. These highlight Pokemon with unique features, like Magearna, Blissey, or Poison-types that don’t learn Toxic (like the Galarian Slowpoke line).
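
A rank-deficient matrix and its near-zero singular value can be sketched with a deliberately redundant column. The construction below is an assumption for illustration: the fourth feature is built as the sum of the first two, so one combination of features cancels to zero on every row.

```python
import numpy as np

rng = np.random.default_rng(4)
base = rng.standard_normal((12, 3))
# The fourth column is the sum of the first two, so the matrix has rank 3, not 4
A = np.column_stack([base, base[:, 0] + base[:, 1]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
rank = np.linalg.matrix_rank(A)

null_vec = Vt[-1]            # right singular vector of the smallest singular value
residual = A @ null_vec      # approximately zero for every row
print(rank, s[-1], np.round(null_vec, 3))
```

The recovered vector weights the first two columns with one sign and the fourth with the opposite sign, which is the cancellation pattern described above.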

Besides the approximately zero singular values, there are also the smallest “nonzero” singular values, which explain about 0.001% of the variance. Interestingly, these identify the Rotom Formes, pitting Frost against Fan in one component and Frost and Fan against Mow and Heat in another. Other small principal components identify evolutionary lines, with one component preferring Piloswine and Walrein while disfavoring Mamoswine and Sealeo.

One final analysis we can make of the singular value decomposition is to look at the distribution of the non-zero singular values. To do this, we can plot the singular values in order and look for abrupt changes in their decay. It is not uncommon for the singular values to follow a power-law or exponential decay (see Zipf’s Law and Benford’s Law). Exponential decay appears as a straight line on a semi-log plot and power-law decay as a straight line on a log-log plot, so these plots can help identify if and where the singular values deviate from either pattern.

Using linear, semi-log, and log-log plots can help identify indices of interest. Here, #30 (orange) appears to be distinctive.

In our dataset, it appears that there is a change in behavior around index 30, and so further investigations can be done to determine whether that implies a change in the meaning of the components.
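
As a numerical companion to the plots, one could fit a straight line in semi-log and log-log coordinates and compare how well each fits. The sketch below uses synthetic, exactly exponential values in place of real singular values, and the helper `line_fit_rmse` is a hypothetical name introduced here.

```python
import numpy as np

idx = np.arange(1, 101)
s = 50.0 * np.exp(-0.05 * idx)    # synthetic values with exactly exponential decay

def line_fit_rmse(x, y):
    """Root-mean-square residual of the least-squares straight line through (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    return float(np.sqrt(np.mean((y - (slope * x + intercept)) ** 2)))

semilog_rmse = line_fit_rmse(idx, np.log(s))           # straight if decay is exponential
loglog_rmse = line_fit_rmse(np.log(idx), np.log(s))    # straight if decay is a power law
print(semilog_rmse, loglog_rmse)
```

Fitting the same lines over a sliding window of indices, rather than the whole range, is one possible way to locate where the behavior changes.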

Despite being a relatively simple method, Principal Component Analysis through Singular Value Decomposition is extremely powerful. Our analysis of our Pokemon matrix has uncovered some crucial axes distinguishing Pokemon, from the known Physical-Special split to a more subtle Water-fighter spectrum. What else can you uncover in your datasets?

In the next part of this series, we’ll examine a way to refine our encoding of the Pokemon movesets: topic modeling!

Patrick Martin

I’m a mathematician and strategy gamer who enjoys looking for patterns in data and investigating what those patterns mean.