PokeML — An Introduction to Data Science… using Pokemon!

Patrick Martin
Jul 12, 2021

The world of data science involves organizing, processing, and interpreting large amounts of data, searching for trends, patterns, or other interesting structures. These investigations have myriad uses: identifying redundancies which allow for compressed representations, creating artificial features that assist downstream prediction tasks, or even simply learning what a ‘typical’ datapoint looks like!

Step up to your Rotom PC and let’s do some data science!

This series will demonstrate several data science techniques, from principal component analysis to topic modeling to clustering to neural networks. Often, data science tutorials will use either artificial data, in which you already know what you’re looking for, or some messy real-world dataset, in which the noise inherent in the real world requires you to really squint in order to identify trends. We are going to work in a middle ground: data from the Pokemon video games. In this first installment, we will create the Pokemon dataset; you can follow along by downloading the data and a Jupyter notebook here.

Pokemon has fairly complex mechanics, with a lot of variation across different Pokemon. Each Pokemon has a unique set of qualities: the Moves it can learn, its Types, its potential Abilities, and its Base Stats are the primary attributes that distinguish the over 730 different Pokemon and Pokemon forms available in the Sword and Shield games. The available variety in these qualities is also large, with over 650 different Moves, 18 Types, over 240 Abilities, and six Base Stats that take values between 1 and 255.

The designers of Pokemon had an enormous number of options when creating these creatures, and yet not every combination of Moves is learnable by some Pokemon (for example, no Pokemon learns both ‘Moonlight’ and ‘Electroweb’). It is then reasonable to expect that there is some inherent structure to how each Pokemon is designed, and this is what we will investigate!

The fundamental element of data science is the matrix. When I teach linear algebra to undergraduates, I introduce matrices as being any of four objects:

  • A bunch (m⋅n) of numbers arranged in an m-by-n rectangle
  • A bunch (m) of n-dimensional points (or vectors) stacked on top of each other as rows
  • A bunch (n) of m-dimensional points (or vectors) stacked next to each other as columns
  • A linear transformation from the n-dimensional vector space ℝⁿ to the m-dimensional vector space ℝᵐ
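As a small illustration of these four views, here is a sketch in plain Python using a made-up 2-by-3 matrix. The numbers and the helper name `apply` are assumptions for illustration only; they are not part of the Pokemon dataset.

```python
# A 2-by-3 matrix (m = 2, n = 3) with arbitrary illustrative values.
A = [
    [1, 2, 3],
    [4, 5, 6],
]

# View 1: m*n numbers arranged in a rectangle.
m, n = len(A), len(A[0])
count = sum(len(row) for row in A)

# View 2: m points in n-dimensional space, stacked as rows.
rows = [tuple(row) for row in A]

# View 3: n points in m-dimensional space, stacked as columns.
cols = list(zip(*A))

# View 4: a linear transformation from R^n to R^m (matrix-vector product).
def apply(matrix, v):
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

image = apply(A, [1, 0, -1])

print(count, rows, cols, image)
```

The same nested list serves all four roles; only the way we read it changes.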

The fact that these four ideas yield the same object is fundamental to both linear algebra and data science. What we have with our Pokemon dataset is a bunch of Pokemon, each described by the various attributes I listed above. If we can convert those attributes into vectors, then we can create a Pokemon matrix and begin our investigation!

Of course, most of our attributes are not given to us as numbers or vectors. Take the Pokemon Magikarp, for example:

A cute little fellow (Image from Serebii)

Magikarp
Type: Water
Moves: Bounce, Flail, Hydro Pump, Splash, Tackle
Abilities: Swift Swim, Rattled
Base Stats: 20 HP, 10 Atk, 55 Def, 15 SpA, 20 SpD, 80 Spe

The only numbers here are the Base Stats! To put this Pokemon into a vector space, we’ll have to think mathematically. Our assumption is that there is some vector space, P, in which these Pokemon live, and where distances between points correspond to how similar the Pokemon themselves are. The assumption of a vector space also implies that there are inherent features that can be combined to represent a Pokemon, and that these features have some scalar relationship. Our issue, however, is that we have no idea what these features are, or what the scalar relationships should be. Finding them is exactly our goal!

We will instead have to embed our Pokemon into some vector space V that is compatible with P; in other words, a space that we can later map to P once we’ve learned more about the data. In doing so, we still need to decide how addition and scalar multiplication should behave, and we are effectively going to punt on this question by encoding everything as a bit vector.

A bit vector is a vector consisting only of zeros and ones. We can describe a Pokemon’s movepool as a bit vector in this way: consider the vector space ℝ⁶⁶⁶ where each Move is paired with a standard basis element eᵢ — Bounce might be #323, corresponding to the vector with 665 zeros and a 1 in its 323rd spot, and Splash #65, corresponding to the vector with a 1 in its 65th spot and zeros elsewhere. A Pokemon’s movepool can then be described as the sum of all of the Moves it learns, making Magikarp’s movepool a vector with 661 zeros: a zero everywhere except in five coordinates, including the 323rd and the 65th.
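The movepool encoding described above can be sketched in a few lines of Python. The seven-move list here is a hypothetical stand-in for the full list of 666 Moves, so the indices do not match the #323 and #65 used in the text, and `encode_movepool` is an illustrative name rather than a function from the accompanying notebook.

```python
# Hypothetical, tiny move list; the real one has 666 entries.
MOVES = ["Bounce", "Electroweb", "Flail", "Hydro Pump",
         "Moonlight", "Splash", "Tackle"]

# Pair each Move with the index of its standard basis vector.
MOVE_INDEX = {move: i for i, move in enumerate(MOVES)}

def encode_movepool(movepool):
    """Return the sum of the basis vectors of the given Moves as a bit vector."""
    vec = [0] * len(MOVES)
    for move in movepool:
        vec[MOVE_INDEX[move]] = 1
    return vec

magikarp = encode_movepool(["Bounce", "Flail", "Hydro Pump", "Splash", "Tackle"])
print(magikarp)  # [1, 0, 1, 1, 0, 1, 1]
```

With the full move list, the same function would produce Magikarp’s length-666 vector with five ones and 661 zeros.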

The benefit of this encoding is that it places few restrictions on how we can later map V into P, so such a map remains free to respect the metric of P. Given two vector spaces there are many maps between them, but their flexibility is not unlimited. If two movepool vectors are equal, then no mapping can separate them (this is the “vertical line test”). Our choice of encoding means that two movepool vectors are equal only if the movepools are equal, alleviating this concern. Moreover, linear transformations of vector spaces must respect scalar multiplication, but no movepool vector is a scalar multiple of another, so we avoid this restriction entirely.

The final concern is that linear transformations must preserve linear combinations, and the number of Pokemon (734) is larger than the number of Moves (666); hence the Pokemon cannot be linearly independent in this space. Thankfully, by extending the bit vector to account for the Pokemon Types (18) and Abilities (243), the dimension increases to 927, where we can hope to have linear independence. This means that functions from our constructed vector space are less constrained in how they can move the Pokemon around.
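To see the dimension argument on a toy scale, here is a sketch that computes the rank of a small matrix by Gaussian elimination. The three "Pokemon" and their two "Moves" and two "Types" are invented for illustration, and `rank` is a hand-rolled helper, not something taken from the accompanying notebook.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    mat = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(mat[0]) if mat else 0
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        # Find a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(mat)) if mat[i][c] != 0), None)
        if pivot is None:
            continue
        mat[r], mat[pivot] = mat[pivot], mat[r]
        # Clear column c in every row below the pivot row.
        for i in range(r + 1, len(mat)):
            factor = mat[i][c] / mat[r][c]
            mat[i] = [a - factor * b for a, b in zip(mat[i], mat[r])]
        r += 1
        if r == len(mat):
            break
    return r

# Three toy Pokemon described by only two Moves: rank is at most 2,
# so the three rows cannot be linearly independent.
moves_only = [[1, 0], [0, 1], [1, 1]]

# The same three Pokemon with two made-up Type columns appended.
moves_and_types = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]

print(rank(moves_only), rank(moves_and_types))  # 2 3
```

Adding columns does not guarantee independence, but it raises the ceiling on the rank from the number of Moves to the total number of columns.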

We have yet to consider the Base Stats, however. These are given to us as numbers, but as data scientists we must ask ourselves if they behave like numbers in the data. Certainly, when it comes to Base Stats, Magikarp is more similar to Feebas (20 HP, 15 Atk, 20 Def, 10 SpA, 55 SpD, 80 Spe) than it is to Tyranitar (100 HP, 134 Atk, 110 Def, 95 SpA, 100 SpD, 61 Spe). But, by encoding the Base Stats as their numeric values, we also inadvertently enforce that Glalie (80 HP, 80 Atk, 80 Def, 80 SpA, 80 SpD, 80 Spe) is equidistant from Spinda (60 HP, 60 Atk, 60 Def, 60 SpA, 60 SpD, 60 Spe) and Mew (100 HP, 100 Atk, 100 Def, 100 SpA, 100 SpD, 100 Spe). On the other hand, encoding these as a bit vector of the unique values the stats can take makes it difficult to recognize the importance of certain numerical relationships among the stats, for example the total of the Base Stats or the maximum of Atk and SpA.
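These comparisons can be checked directly. The sketch below assumes Euclidean distance on the raw Base Stat values, which is one natural choice but not one the text commits to; `BASE_STATS` and `stat_distance` are illustrative names.

```python
import math

# Base Stats in the order HP, Atk, Def, SpA, SpD, Spe, copied from the text.
BASE_STATS = {
    "Magikarp": (20, 10, 55, 15, 20, 80),
    "Feebas": (20, 15, 20, 10, 55, 80),
    "Tyranitar": (100, 134, 110, 95, 100, 61),
    "Glalie": (80, 80, 80, 80, 80, 80),
    "Spinda": (60, 60, 60, 60, 60, 60),
    "Mew": (100, 100, 100, 100, 100, 100),
}

def stat_distance(a, b):
    """Euclidean distance between two Pokemon's raw Base Stat vectors."""
    return math.dist(BASE_STATS[a], BASE_STATS[b])

pairs = [("Magikarp", "Feebas"), ("Magikarp", "Tyranitar"),
         ("Glalie", "Spinda"), ("Glalie", "Mew")]
for a, b in pairs:
    print(f"{a} to {b}: {stat_distance(a, b):.2f}")
```

Under this metric Magikarp sits at distance 50 from Feebas and much farther from Tyranitar, while Glalie is the same distance (the square root of 2400, about 48.99) from both Spinda and Mew.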

The nice thing is, we can do both! We can also throw in some additional bit vectors to flag other attributes we might think are important, for example whether a Pokemon can evolve. Hence, we have constructed our Pokemon matrix: a 734-by-1550 matrix where for each of the 734 Pokemon we record

  • A length 18 bit vector of the Pokemon’s Types
  • A single bit for whether the Pokemon can evolve
  • The six numerical values of the Pokemon’s Base Stats
  • Six bit vectors of length 98, 107, 103, 96, 95, and 117 for the unique Base Stat values
  • A length 243 bit vector of the Pokemon’s Abilities
  • A length 666 bit vector of the Pokemon’s Moves
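As a quick sanity check on the column count, the block widths listed above can be added up and turned into column ranges. The dictionary keys are hypothetical labels, and the left-to-right ordering of the blocks simply follows the list; the accompanying notebook may lay the columns out differently.

```python
# Width of each block of columns, in the order listed in the text.
block_widths = {
    "types": 18,
    "can_evolve": 1,
    "base_stats_numeric": 6,
    "base_stats_bits": 98 + 107 + 103 + 96 + 95 + 117,
    "abilities": 243,
    "moves": 666,
}

total_columns = sum(block_widths.values())

# Half-open column range [start, stop) occupied by each block.
offsets = {}
start = 0
for name, width in block_widths.items():
    offsets[name] = (start, start + width)
    start += width

print(total_columns)     # 1550
print(offsets["moves"])  # (884, 1550)
```

The widths sum to 1550, matching the 734-by-1550 shape quoted for the Pokemon matrix.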

In later entries of this series, we’ll use this matrix and similar datasets to explore various data science techniques!

Check out Part 2, in which we explore Principal Component Analysis.


Patrick Martin

I’m a mathematician and strategy gamer who enjoys looking for patterns in data and investigating what those patterns mean.