PokeML: Understanding Topic Modeling through Pokemon Data

Patrick Martin · Published in Geek Culture · Jul 20, 2021 · 9 min read

Continuing our analysis of Pokemon, we recall that when we constructed the Pokemon matrix in Part 1, we encoded a Pokemon’s moveset as a bit vector. At the time, we justified this as a not-incorrect encoding, as it would accommodate the inherent structure of the moves. When we performed Principal Component Analysis in Part 2, this encoding sufficed for an initial understanding of the relationships among Pokemon. It is now time to analyze the Pokemon Moves themselves and search for structure among them.

We will make the following assumptions about how Moves are created. First, we assume that there exist various “aspects” of Moves that imply some of their properties; the meaning of these aspects, like the meaning of the Principal Components in the last chapter, is only interpretable through examination. Our hope is that they will align with our intuitive classes of Moves, for example that Double-Edge and Wood Hammer are similar in a way that Slam and Solar Beam are not. These aspects are probability distributions over features, assigning higher probability mass to features that better “belong” to the aspect (for example, having recoil damage).

Second, we assume that a Move is represented as a combination of these aspects, interpreted either as a weighted sum of them or as a probability distribution over the aspects.

This set-up is commonly seen in Natural Language Processing as a special type of topic model; the previously mentioned aspects of a Move are similar to topics in a document. When analyzing a topic model, and in particular a Latent Dirichlet Allocation (LDA) model, we assume that the meaning of a document can be inferred from its word cloud — often called a bag of words in this context — and that punctuation and word order can be discarded. In the LDA model, the words in a text document are assumed to be drawn from a mixture of probability distributions based on what topics the document is about: for each word, first a topic is chosen, then a word from that topic.
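To make that generative story concrete, here is a minimal sketch in Python with a made-up two-topic model over a four-word vocabulary; all of the numbers here are toy values, not output from our data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: 2 topics over a 4-word vocabulary.
vocabulary = ["recoil", "damage", "heal", "target"]
# Each row is one topic's distribution over words, P[word|topic].
topic_word = np.array([[0.5, 0.4, 0.0, 0.1],   # a "recoil" topic
                       [0.0, 0.1, 0.6, 0.3]])  # a "healing" topic
# This document is mostly about topic 0: P[topic|document].
doc_topics = np.array([0.8, 0.2])

# LDA's generative story: for each word slot, first draw a topic,
# then draw a word from that topic's distribution.
for _ in range(5):
    topic = rng.choice(2, p=doc_topics)
    word = rng.choice(vocabulary, p=topic_word[topic])
    print(topic, word)
```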

Pokemon Moves, like Thunderbolt, share various characteristics and differ in others. Topic modeling helps us find groups of characteristics that are meaningful.

In our case, we could expect there to be a topic of “priority moves”, in which the conditional probability ℙ[topic|move] for Moves like Aqua Jet, Quick Attack, and Shadow Sneak is large; or a topic of “Special moves with PP 10, Power at least 90, and Accuracy 100” that picks out Moves like Thunderbolt and Ice Beam.

Of course, as an unsupervised method, the topics identified from our model are not going to perfectly match how we would group Moves together as humans — and that’s ok! Our hope is that the topics will align well enough with what we expect and that they will moreover be useful for downstream tasks, which we’ll talk about in the next article!

Once we develop our topic model, we’ll discuss how to interpret the topics. But first, in order to discover these interconnections, we are going to need to do something better than encoding the moves as bit vectors of themselves; we’re going to need some more information.

Each move has several defining features, much like Pokemon do! For example, take all the descriptive information about the move Thunderbolt:

Name: Thunderbolt
Num: 85
Accuracy: 100
Base Power: 90
Category: Special
PP: 15
Priority: 0
Flags: Blocked by Protect, copyable by Mirror Move
Secondary: 10% chance of paralysis
Target: One target
Type: Electric
Contest type: Cool
Description: Has a 10% chance to paralyze the target
Short description: 10% chance to paralyze the target

If you liked the bit vector encoding before, you’re really going to love it now. We can encode a Move as a bit vector over all of its relevant fields: its Accuracy, Base Power, Category, PP, Priority, Flags, Secondary, Target, Type, and Description. For the non-description fields, we give each possible option its own bit, one for each of the 504 different mechanics a Move can have, from Hydro Pump’s 5 PP to Thousand Arrows’ ability to ignore Ground immunity.

The descriptions we handle similarly, using a method that is very common in Natural Language Processing: we encode each description as a bag of words, a generalization of the bit vector in which we count how many times each word appears. This process depends crucially on the definition of a word, a concept that is difficult to pin down both linguistically and practically. In our case, since we are working in English and with text that is not too ill-behaved, a word shall be a sequence of characters separated by whitespace and with the following special characters removed: ‘,.;:)(. In practice, it may be better to instead set a whitelist of characters, but for any fixed dataset deciding on a whitelist is equivalent to deciding on a blacklist; the implied whitelist for this dataset is letters, numbers, and the special characters %-/+*. With this definition, there are 1090 unique words, from ‘the’, appearing 1942 times, to the 281 words that appear only once, such as ‘jungle’.
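As a sketch, that tokenization rule might look like the following; lowercasing is my own assumption, since the text only specifies the character stripping:

```python
from collections import Counter

# The special characters the text strips out of each token.
REMOVE = "‘,.;:)("

def tokenize(description: str) -> list[str]:
    """Split on whitespace and strip the special characters above."""
    table = str.maketrans("", "", REMOVE)
    return [w for w in (tok.lower().translate(table)
                        for tok in description.split()) if w]

# A bag of words just counts the occurrences of each word.
print(Counter(tokenize("Has a 10% chance to paralyze the target")))
```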

The features of a Move are thus its defining effects and mechanics, as a bit vector, and the bag of words that appear in its description, yielding a 666 by 1490 matrix to train a topic model on. In the rest of this article, I’ll say ‘word’ to mean a column in this matrix; we won’t need to distinguish between feature names and words in the description from here on out, and this will align more closely with the usual description of a topic model.
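Assembling the full feature matrix might look something like this sketch, using scikit-learn’s DictVectorizer and CountVectorizer as stand-ins for the actual notebook code; the toy data here is made up:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: per-Move mechanics dicts and descriptions.
mechanics = [{"pp": 15, "type": "Electric"}, {"pp": 5, "type": "Water"}]
descriptions = ["10% chance to paralyze the target",
                "Has no secondary effect"]

# One-hot encode each (field, option) pair as its own bit...
mech_matrix = DictVectorizer(sparse=False).fit_transform(
    [{f"{k}: {v}": 1 for k, v in m.items()} for m in mechanics])
# ...and count the words in each description.
word_matrix = CountVectorizer().fit_transform(descriptions).toarray()

# Each row is one Move; columns are mechanics features, then words.
features = np.hstack([mech_matrix, word_matrix])
print(features.shape)
```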

One drawback of a topic model is that you do have to specify the number of topics beforehand. This can only be done in an ad hoc, exploratory fashion: let’s go with 20 topics for now. If we don’t like the output of this model, changing the number of topics will be the first knob we fiddle with.

The output of a topic model is primarily two matrices: a topic-words matrix (20×1490) whose rows are the distributions ℙ[word|topic], and a doc-topics matrix (666×20) whose rows are the distributions ℙ[topic|document]; the former helps you determine what the topics mean, and the latter tells you what topics are represented in each document.
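In scikit-learn, for instance, fitting the model and recovering these two matrices looks roughly like the following; I substitute a random count matrix for our real one:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the real 666x1490 Move-feature count matrix.
rng = np.random.default_rng(0)
features = rng.integers(0, 3, size=(666, 1490))

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(features)  # (666, 20); rows are P[topic|move]
# components_ holds unnormalized topic-word pseudo-counts; normalizing
# each row yields the distributions P[word|topic].
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```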

It is extremely common for topic model references to describe topics by words in decreasing ℙ[word|topic] order, read straight off the rows of the topic-words matrix; this is incorrect both philosophically and in practice. If we want to know what a topic consists of, the question we ask ourselves is “which words, when I see them, most strongly suggest this topic?” This is a statement of ℙ[topic|word].
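ℙ[topic|word] is not a direct output of the model, but it can be recovered from ℙ[word|topic] by Bayes’ rule. Here is a sketch continuing from the arrays above, using the average document mixture as a crude topic prior, which is one of several reasonable choices:

```python
import numpy as np

def p_topic_given_word(p_word_topic, p_doc_topic):
    """Bayes' rule: P[topic|word] is proportional to P[word|topic] * P[topic]."""
    p_topic = p_doc_topic.mean(axis=0)       # crude topic prior
    joint = p_word_topic * p_topic[:, None]  # (n_topics, n_words)
    return joint / joint.sum(axis=0, keepdims=True)

# Describe topic 2 by its top 10 words under P[topic|word]:
# np.argsort(-p_topic_given_word(topic_words, doc_topics)[2])[:10]
```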

In practice, we have the issue of what are called “stop words”: words that offer little semantic meaning (e.g. ‘the’, ‘of’, ‘or’). These are problematic for a topic model. First, these words have a generative structure that is different from that of the more contentful words; they are almost closer to punctuation in meaning. Second, because they are so common, ℙ[topic|stop word] is appreciable (roughly 1/20 here) for most topics. If a topic has words that strongly imply it, this is not an issue, but it is entirely possible for ℙ[topic|word] to be less than 1/20 for every contentful word, in which case the stop words dominate the ranking.

Indeed, let’s consider one topic from our model, listing the 10 words that maximize ℙ[word|topic] and the 10 Moves that maximize ℙ[topic|move]:

Topic  2
Words:
the : 0.08832
turn : 0.04595
move : 0.04128
on : 0.03569
and : 0.03432
this : 0.03338
is : 0.03051
if : 0.0286
user : 0.02693
a : 0.02279
Moves:
fly : 0.9875
dive : 0.98603
dig : 0.98603
frenzyplant : 0.97121
rockwrecker : 0.97121
gigaimpact : 0.97121
blastburn : 0.97031
hydrocannon : 0.97031
hyperbeam : 0.97031
prismaticlaser : 0.97031

Looking just at the words, it is not at all clear what this topic is attempting to describe! Yet, looking at the moves, we can tell that this topic is meaningful: it is clearly describing two-turn moves that require either a charging turn (Fly, Dig) or a recharging turn (Frenzy Plant, Hyper Beam). If we instead rank terms by ℙ[topic|word], we get a much better description:

Topic  2
Words:
charges: 0.940620
executes: 0.940620
herb: 0.940620
completes: 0.940620
flags: charge; 1: 0.940620
self: volatileStatus; mustrecharge: 0.913640
following: 0.913640
must: 0.913640
flags: recharge; 1: 0.913640
recharge: 0.837500
Moves:
fly : 0.9875
dive : 0.98603
dig : 0.98603
frenzyplant : 0.97121
rockwrecker : 0.97121
gigaimpact : 0.97121
blastburn : 0.97031
hydrocannon : 0.97031
hyperbeam : 0.97031
prismaticlaser : 0.97031
These are similarity groupings, not suggestions to make actual movesets!

This approach does have its downsides in general, however. It is not uncommon for text documents to have rare or misspelled words that are strongly associated with a certain topic. There are a couple of ways of dealing with this: one could exclude all words that do not occur a certain number of times (either from these wordlists or from the analysis altogether), or one could soften the ℙ[topic|word] estimate by including a confidence interval.

The Student’s t-test is a standard way of producing a confidence interval. It requires an estimate of the mean (the computed ℙ[topic|word]), the variance (ℙ[topic|word] behaves like a Bernoulli proportion, so its variance is approximately ℙ[topic|word](1-ℙ[topic|word])), and the number of observations (the total number of times the feature appears across all moves). With those computed, one could instead sort words based on the lower limit of the confidence interval for ℙ[topic|word].
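A sketch of that computation, treating each occurrence of a word as one observation; the Bernoulli approximation is mine:

```python
import numpy as np
from scipy import stats

def lower_conf_bound(p, n, confidence=0.95):
    """Lower limit of a t-based confidence interval for P[topic|word],
    given the point estimate p and n observations of the word."""
    se = np.sqrt(p * (1 - p) / n)
    t = stats.t.ppf(confidence, df=n - 1)
    return p - t * se

# A word seen 4 times with P[topic|word] = 0.94 gets penalized far
# more than one seen 400 times with the same point estimate.
print(lower_conf_bound(0.94, 4), lower_conf_bound(0.94, 400))
```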

In real-world datasets, stop words are usually dealt with by strict removal via stop word lists. I find this extremely unsatisfying philosophically. First, language is an ever-evolving tool, and so stop word lists are always going to be incorrect: ‘dis’ is rarely on a stop word list while ‘this’ nearly always is. Second, what is and is not a stop word depends strongly on the dataset and problem; ‘you’ is often considered a stop word, but I have found that it is very useful for analyzing topic models trained on Wikipedia, where it identifies pages about songs. Third, stop words are by their nature language-specific and thus expensive to create; if your methods require stop word removal and you are working in a language (or dialect) without a readily available stop word list, you’re out of luck.

It turns out that the improvement from the ℙ[word|topic] ranking to the ℙ[topic|word] ranking can sometimes be used to produce a data-driven (but still ad hoc!) definition of stop words. Stop words in particular have their rankings change drastically between these two systems, so a stop word list can be created by analyzing the changes in rankings.

One potential measurement of stop words tracks how much a word’s ranking falls when we move to the better ranking system. There’s a clear elbow in the plot of these RankChange values (orange), so we can decide to treat all subsequent words as stop words.
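The exact RankChange formula isn’t spelled out here, but one guess at how such a score could be computed from the two rankings:

```python
import numpy as np

def rank_change(p_word_topic, p_topic_word):
    """How far each word's rank falls when we move from the
    P[word|topic] ranking to the P[topic|word] ranking, taking the
    best case over topics. My guess at a RankChange-style score:
    stop words fall drastically in every topic."""
    # argsort of argsort turns scores into ranks (0 = top word)
    rank_old = np.argsort(np.argsort(-p_word_topic, axis=1), axis=1)
    rank_new = np.argsort(np.argsort(-p_topic_word, axis=1), axis=1)
    return (rank_new - rank_old).min(axis=0)

# Words past the elbow of large rank_change values are candidate
# stop words.
```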

In any case, the 20 topics of our model appear to be quite cramped. Moves like Aura Sphere, Bullet Punch, and Slam are all grouped into topic 1, and Moves like Trick and Water Shuriken are grouped in topic 9. What’s likely happening is that there simply aren’t enough topics to describe all the different classes of Moves that exist. We’ll round out this article by increasing the number of topics to 60, removing the stop words identified by that model, and then examining the topics in a new model run on the cleaned data.
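That pipeline, sketched end to end; the stop word indices below are placeholders for the ones a RankChange analysis would actually identify:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the Move-feature count matrix (666 x 1490).
rng = np.random.default_rng(0)
features = rng.integers(0, 3, size=(666, 1490))

# 1) Fit a wider model and identify stop words from it, e.g. with a
#    RankChange-style score as sketched above (hand-waved here).
lda = LatentDirichletAllocation(n_components=60, random_state=0).fit(features)
stop_word_columns = np.array([0, 5, 7])  # placeholder indices

# 2) Drop those columns and refit on the cleaned data.
cleaned = np.delete(features, stop_word_columns, axis=1)
final = LatentDirichletAllocation(n_components=60, random_state=0).fit(cleaned)
```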

Here are the descriptions of every tenth topic identified by our new model (check out the Jupyter notebook to explore the full model and try things out for yourself!):

Topic  0
Words:
raise : 0.73941
heal: 1 : 0.70799
heal: 2 : 0.65454
secondary: self; boosts; spe; 1 : 0.57579
secondary: self; boosts; spa; 1 : 0.44497
Moves:
chargebeam : 0.61229
fierydance : 0.40577
flatter : 0.38524
meteormash : 0.29398
doubleteam : 0.27497

Topic 10
Words:
basePower: 70 : 0.75706
secondary: volatileStatus; confusion : 0.74688
confuse : 0.47798
basePower: 75 : 0.35326
doubles : 0.15786
Moves:
dizzypunch : 0.94825
boltbeak : 0.91806
fishiousrend : 0.73784
diamondstorm : 0.67699
boltstrike : 0.64442

Topic 20
Words:
more : 0.5081
result : 0.44497
120 : 0.36646
weight : 0.28042
60 : 0.19581
Moves:
quash : 0.6746
heatcrash : 0.66714
electroball : 0.66439
heavyslam : 0.64363
gyroball : 0.29317

Topic 30
Words:
typeless : 0.70799
types : 0.32481
type : 0.26893
copied : 0.181
include : 0.14959
Moves:
reflecttype : 0.98388
soak : 0.36077
magicpowder : 0.35143
conversion : 0.24423
electrify : 0.22328

Topic 40
Words:
recoil : 0.8188
recoil: 100 : 0.65454
recoil: 33 : 0.65454
33% : 0.51437
less : 0.45537
Moves:
doubleedge : 0.96927
woodhammer : 0.96927
headcharge : 0.86724
submission : 0.85934
bravebird : 0.85476

Topic 50
Words:
boosts: def; 1 : 0.8188
boosts: spd; 1 : 0.77652
2 : 0.73822
stages : 0.72812
defense : 0.70986
Moves:
cosmicpower : 0.94825
amnesia : 0.94216
cottonguard : 0.93854
defendorder : 0.8102
calmmind : 0.75486

What do you think? In the next part of this series, we will use these topics to measure distances between Pokemon and identify clusters.
