PokeML: Understanding Topic Modeling through Pokemon Data

Patrick Martin · Published in Geek Culture · Jul 20, 2021 · 9 min read

Continuing our analysis of Pokemon, we recall that when we constructed the Pokemon matrix in Part 1, we encoded a Pokemon’s moveset as a bit vector. At the time, we justified this as a not-incorrect encoding, as it would accommodate the inherent structure of the moves. When we performed Principal Component Analysis in Part 2, this encoding sufficed for an initial understanding of the relationships among Pokemon. It is now time to analyze the Pokemon Moves themselves and search for structure among them.

We will make the following assumptions about how Moves are created. First, we assume that there exist various “aspects” of Moves that imply some of their properties; the meaning of these aspects, like the meaning of the Principal Components in the last chapter, is only interpretable through examination. Our hope is that they will align with our intuitive classes of Moves, for example that Double-Edge and Wood Hammer are similar in a way that Slam and Solar Beam are not. These aspects are probability distributions over features, assigning higher probability mass to features that better “belong” to the aspect (for example, having recoil damage).

Second, we assume that a Move is represented as a combination of these aspects, interpreted either as a weighted sum of them or as a probability distribution over the aspects.

This set-up is commonly seen in Natural Language Processing as a special type of topic model; the previously mentioned aspects of a Move are similar to topics in a document. When analyzing a topic model, and in particular a Latent Dirichlet Allocation (LDA) model, we assume that the meaning of a document can be inferred from its word cloud — often called a bag of words in this context — and that punctuation and word order can be discarded. In the LDA model, the words in a text document are assumed to be drawn from a mixture of probability distributions based on what topics the document is about: for each word, first a topic is chosen, then a word from that topic.
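To make that generative story concrete, here is a minimal sketch in Python with a made-up two-topic model over a four-word vocabulary; all of the numbers here are toy values, not output from our data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: 2 topics over a 4-word vocabulary.
vocabulary = ["recoil", "damage", "heal", "target"]
# Each row is one topic's distribution over words, P[word|topic].
topic_word = np.array([[0.5, 0.4, 0.0, 0.1],   # a "recoil" topic
                       [0.0, 0.1, 0.6, 0.3]])  # a "healing" topic
# This document is mostly about topic 0: P[topic|document].
doc_topics = np.array([0.8, 0.2])

# LDA's generative story: for each word slot, first draw a topic,
# then draw a word from that topic's distribution.
for _ in range(5):
    topic = rng.choice(2, p=doc_topics)
    word = rng.choice(vocabulary, p=topic_word[topic])
    print(topic, word)
```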

Pokemon Moves, like Thunderbolt, share various characteristics and differ in others. Topic modeling helps us find groups of characteristics that are meaningful.

In our case, we could expect there to be a topic of “priority moves”, in which the conditional probability ℙ[topic|move] for Moves like Aqua Jet, Quick Attack, and Shadow Sneak is large; or a topic of “Special moves with PP 10, Power at least 90, and Accuracy 100” that picks out Moves like Thunderbolt and Ice Beam.

Of course, as an unsupervised method, the topics identified from our model are not going to perfectly match how we would group Moves together as humans — and that’s ok! Our hope is that the topics will align well enough with what we expect and that they will moreover be useful for downstream tasks, which we’ll talk about in the next article!

Once we develop our topic model, we’ll discuss how to interpret the topics. But first, in order to discover these interconnections, we are going to need to do something better than encoding the moves as bit vectors of themselves; we’re going to need some more information.

Each move has several defining features, much like Pokemon do! For example, take all the descriptive information about the move Thunderbolt:

Name: Thunderbolt
Num: 85
Accuracy: 100
Base Power: 90
Category: Special
PP: 15
Priority: 0
Flags: Blocked by Protect, copyable by Mirror Move
Secondary: 10% chance of paralysis
Target: One target
Type: Electric
Contest type: Cool
Description: Has a 10% chance to paralyze the target
Short description: 10% chance to paralyze the target

If you liked the bit vector encoding before, you’re really going to love it now. We can encode a Move as a bit vector over all of its relevant fields: its Accuracy, Base Power, Category, PP, Priority, Flags, Secondary, Target, Type, and Description. For the non-description fields, we give each possible option its own bit, one for each of the 504 different mechanics a Move can have, from Hydro Pump’s 5 PP to Thousand Arrows’ ability to ignore Ground immunity.

The descriptions we handle similarly, using a method that is very common in Natural Language Processing: we encode each description as a bag of words, a generalization of the bit vector in which we count how many times each word appears. This process depends crucially on the definition of a word, a concept that is difficult to pin down both linguistically and practically. In our case, since we are working in English and with text that is not too ill-behaved, a word shall be a sequence of characters separated by whitespace and with the following special characters removed: ‘,.;:)(. In practice, it may be better to instead set a whitelist of characters, but for any fixed dataset deciding on a whitelist is equivalent to deciding on a blacklist; the implied whitelist for this dataset is letters, numbers, and the special characters %-/+*. With this definition, there are 1090 unique words, from ‘the’, appearing 1942 times, to the 281 words that appear only once, such as ‘jungle’.
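As a sketch, that tokenization rule might look like the following; lowercasing is my own assumption, since the text only specifies the character stripping:

```python
from collections import Counter

# The special characters the text strips out of each token.
REMOVE = "‘,.;:)("

def tokenize(description: str) -> list[str]:
    """Split on whitespace and strip the special characters above."""
    table = str.maketrans("", "", REMOVE)
    return [w for w in (tok.lower().translate(table)
                        for tok in description.split()) if w]

# A bag of words just counts the occurrences of each word.
print(Counter(tokenize("Has a 10% chance to paralyze the target")))
```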

The features of a Move are thus its defining effects and mechanics, as a bit vector, and the bag of words that appear in its description, yielding a 666 by 1490 matrix to train a topic model on. In the rest of this article, I’ll say ‘word’ to mean a column in this matrix; we won’t need to distinguish between feature names and words in the description from here on out, and this will align more closely with the usual description of a topic model.
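Assembling the full feature matrix might look something like this sketch, using scikit-learn’s DictVectorizer and CountVectorizer as stand-ins for the actual notebook code; the toy data here is made up:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: per-Move mechanics dicts and descriptions.
mechanics = [{"pp": 15, "type": "Electric"}, {"pp": 5, "type": "Water"}]
descriptions = ["10% chance to paralyze the target",
                "Has no secondary effect"]

# One-hot encode each (field, option) pair as its own bit...
mech_matrix = DictVectorizer(sparse=False).fit_transform(
    [{f"{k}: {v}": 1 for k, v in m.items()} for m in mechanics])
# ...and count the words in each description.
word_matrix = CountVectorizer().fit_transform(descriptions).toarray()

# Each row is one Move; columns are mechanics features, then words.
features = np.hstack([mech_matrix, word_matrix])
print(features.shape)
```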

One drawback of a topic model is that you do have to specify the number of topics beforehand. This can only be done in an ad hoc, exploratory fashion: let’s go with 20 topics for now. If we don’t like the output of this model, changing the number of topics will be the first knob we fiddle with.

The output of a topic model is primarily two matrices: a topic-words matrix (20×1490) whose rows are the distributions ℙ[word|topic], and a doc-topics matrix (666×20) whose rows are the distributions ℙ[topic|document]; the former helps you determine what the topics mean, and the latter tells you what topics are represented in each document.
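In scikit-learn, for instance, fitting the model and recovering these two matrices looks roughly like the following; I substitute a random count matrix for our real one:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the real 666x1490 Move-feature count matrix.
rng = np.random.default_rng(0)
features = rng.integers(0, 3, size=(666, 1490))

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(features)  # (666, 20); rows are P[topic|move]
# components_ holds unnormalized topic-word pseudo-counts; normalizing
# each row yields the distributions P[word|topic].
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```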

It is extremely common for topic model references to describe topics by words in decreasing ℙ[word|topic] order, read straight off the rows of the topic-words matrix; this is incorrect both philosophically and in practice. If we want to know what a topic consists of, the question we ask ourselves is “which words, when I see them, most strongly suggest this topic?” This is a statement of ℙ[topic|word].
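ℙ[topic|word] is not a direct output of the model, but it can be recovered from ℙ[word|topic] by Bayes’ rule. Here is a sketch continuing from the arrays above, using the average document mixture as a crude topic prior, which is one of several reasonable choices:

```python
import numpy as np

def p_topic_given_word(p_word_topic, p_doc_topic):
    """Bayes' rule: P[topic|word] is proportional to P[word|topic] * P[topic]."""
    p_topic = p_doc_topic.mean(axis=0)       # crude topic prior
    joint = p_word_topic * p_topic[:, None]  # (n_topics, n_words)
    return joint / joint.sum(axis=0, keepdims=True)

# Describe topic 2 by its top 10 words under P[topic|word]:
# np.argsort(-p_topic_given_word(topic_words, doc_topics)[2])[:10]
```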

In practice, we have the issue of what are called “stop words”: words that offer little semantic meaning (e.g. ‘the’, ‘of’, ‘or’). These are problematic for a topic model. First, these words have a generative structure that is different from that of the more contentful words; they are almost closer to punctuation in meaning. Second, because they are so common, ℙ[topic|stop word] is appreciable (roughly 1/20 here) for most topics. If a topic has words that strongly imply it, this is not an issue, but it is entirely possible for ℙ[topic|word] to be less than 1/20 for every contentful word, in which case the stop words dominate the ranking.

Indeed, let’s consider one topic from our model, listing the 10 words that maximize ℙ[word|topic] and the 10 Moves that maximize ℙ[topic|move]:

Topic  2
Words:
the : 0.08832
turn : 0.04595
move : 0.04128
on : 0.03569
and : 0.03432
this : 0.03338
is : 0.03051
if : 0.0286
user : 0.02693
a : 0.02279
Moves:
fly : 0.9875
dive : 0.98603
dig : 0.98603
frenzyplant : 0.97121
rockwrecker : 0.97121
gigaimpact : 0.97121
blastburn : 0.97031
hydrocannon : 0.97031
hyperbeam : 0.97031
prismaticlaser : 0.97031

Looking just at the words, it is not at all clear what this topic is attempting to describe! Yet, looking at the moves, we can tell that this topic is meaningful: it is clearly describing two-turn moves that require either a charging turn (Fly, Dig) or a recharging turn (Frenzy Plant, Hyper Beam). If we instead rank terms by ℙ[topic|word], we get a much better description:

Topic  2
Words:
charges: 0.940620
executes: 0.940620
herb: 0.940620
completes: 0.940620
flags: charge; 1: 0.940620
self: volatileStatus; mustrecharge: 0.913640
following: 0.913640
must: 0.913640
flags: recharge; 1: 0.913640
recharge: 0.837500
Moves:
fly : 0.9875
dive : 0.98603
dig : 0.98603
frenzyplant : 0.97121
rockwrecker : 0.97121
gigaimpact : 0.97121
blastburn : 0.97031
hydrocannon : 0.97031
hyperbeam : 0.97031
prismaticlaser : 0.97031
These are similarity groupings, not suggestions to make actual movesets!

This approach does have its downsides in general, however. It is not uncommon for text documents to have rare or misspelled words that are strongly associated with a certain topic. There are a couple of ways of dealing with this: one could exclude all words that do not occur a certain number of times (either from these wordlists or from the analysis altogether), or one could soften the ℙ[topic|word] estimate by including a confidence interval.

The Student’s t-test is a standard way of producing a confidence interval. It requires an estimate of the mean (the computed ℙ[topic|word]), the variance (ℙ[topic|word] behaves like a Bernoulli proportion, so its variance is approximately ℙ[topic|word](1-ℙ[topic|word])), and the number of observations (the total number of times the feature appears across all moves). With those computed, one could instead sort words based on the lower limit of the confidence interval for ℙ[topic|word].
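A sketch of that computation, treating each occurrence of a word as one observation; the Bernoulli approximation is mine:

```python
import numpy as np
from scipy import stats

def lower_conf_bound(p, n, confidence=0.95):
    """Lower limit of a t-based confidence interval for P[topic|word],
    given the point estimate p and n observations of the word."""
    se = np.sqrt(p * (1 - p) / n)
    t = stats.t.ppf(confidence, df=n - 1)
    return p - t * se

# A word seen 4 times with P[topic|word] = 0.94 gets penalized far
# more than one seen 400 times with the same point estimate.
print(lower_conf_bound(0.94, 4), lower_conf_bound(0.94, 400))
```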

In real-world datasets, stop words are usually dealt with by strict removal via stop word lists. I find this extremely unsatisfying philosophically. First, language is an ever-evolving tool, and so stop word lists are always going to be incorrect: ‘dis’ is rarely on a stop word list while ‘this’ nearly always is. Second, what is and is not a stop word depends strongly on the dataset and problem; ‘you’ is often considered a stop word, but I have found that it is very useful for analyzing topic models trained on Wikipedia, where it identifies pages about songs. Third, stop words are by their nature language-specific and thus expensive to create; if your methods require stop word removal and you are working in a language (or dialect) without a readily available stop word list, you’re out of luck.

It turns out that the improvement from the ℙ[word|topic] ranking to the ℙ[topic|word] ranking can sometimes be used to produce a data-driven (but still ad hoc!) definition of stop words. Stop words in particular have their rankings change drastically between these two systems, so a stop word list can be created by analyzing the changes in rankings.

One potential measurement of stop words tracks how much a word’s ranking falls when we move to the better ranking system. There’s a clear elbow in the plot of these RankChange values (orange), so we can decide to treat all subsequent words as stop words.
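The exact RankChange formula isn’t spelled out here, but one guess at how such a score could be computed from the two rankings:

```python
import numpy as np

def rank_change(p_word_topic, p_topic_word):
    """How far each word's rank falls when we move from the
    P[word|topic] ranking to the P[topic|word] ranking, taking the
    best case over topics. My guess at a RankChange-style score:
    stop words fall drastically in every topic."""
    # argsort of argsort turns scores into ranks (0 = top word)
    rank_old = np.argsort(np.argsort(-p_word_topic, axis=1), axis=1)
    rank_new = np.argsort(np.argsort(-p_topic_word, axis=1), axis=1)
    return (rank_new - rank_old).min(axis=0)

# Words past the elbow of large rank_change values are candidate
# stop words.
```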

In any case, the 20 topics of our model appear to be quite cramped. Moves like Aura Sphere, Bullet Punch, and Slam are all grouped into topic 1, and Moves like Trick and Water Shuriken are grouped in topic 9. What’s likely happening is that there simply aren’t enough topics to describe all the different classes of Moves that exist. We’ll round out this article by increasing the number of topics to 60, removing the stop words identified by that model, and then examining the topics in a new model run on the cleaned data.
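That pipeline, sketched end to end; the stop word indices below are placeholders for the ones a RankChange analysis would actually identify:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for the Move-feature count matrix (666 x 1490).
rng = np.random.default_rng(0)
features = rng.integers(0, 3, size=(666, 1490))

# 1) Fit a wider model and identify stop words from it, e.g. with a
#    RankChange-style score as sketched above (hand-waved here).
lda = LatentDirichletAllocation(n_components=60, random_state=0).fit(features)
stop_word_columns = np.array([0, 5, 7])  # placeholder indices

# 2) Drop those columns and refit on the cleaned data.
cleaned = np.delete(features, stop_word_columns, axis=1)
final = LatentDirichletAllocation(n_components=60, random_state=0).fit(cleaned)
```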

Here are the descriptions of every tenth topic identified by our new model (check out the Jupyter notebook to explore the full model and try things out for yourself!):

Topic  0
Words:
raise : 0.73941
heal: 1 : 0.70799
heal: 2 : 0.65454
secondary: self; boosts; spe; 1 : 0.57579
secondary: self; boosts; spa; 1 : 0.44497
Moves:
chargebeam : 0.61229
fierydance : 0.40577
flatter : 0.38524
meteormash : 0.29398
doubleteam : 0.27497

Topic 10
Words:
basePower: 70 : 0.75706
secondary: volatileStatus; confusion : 0.74688
confuse : 0.47798
basePower: 75 : 0.35326
doubles : 0.15786
Moves:
dizzypunch : 0.94825
boltbeak : 0.91806
fishiousrend : 0.73784
diamondstorm : 0.67699
boltstrike : 0.64442

Topic 20
Words:
more : 0.5081
result : 0.44497
120 : 0.36646
weight : 0.28042
60 : 0.19581
Moves:
quash : 0.6746
heatcrash : 0.66714
electroball : 0.66439
heavyslam : 0.64363
gyroball : 0.29317

Topic 30
Words:
typeless : 0.70799
types : 0.32481
type : 0.26893
copied : 0.181
include : 0.14959
Moves:
reflecttype : 0.98388
soak : 0.36077
magicpowder : 0.35143
conversion : 0.24423
electrify : 0.22328

Topic 40
Words:
recoil : 0.8188
recoil: 100 : 0.65454
recoil: 33 : 0.65454
33% : 0.51437
less : 0.45537
Moves:
doubleedge : 0.96927
woodhammer : 0.96927
headcharge : 0.86724
submission : 0.85934
bravebird : 0.85476

Topic 50
Words:
boosts: def; 1 : 0.8188
boosts: spd; 1 : 0.77652
2 : 0.73822
stages : 0.72812
defense : 0.70986
Moves:
cosmicpower : 0.94825
amnesia : 0.94216
cottonguard : 0.93854
defendorder : 0.8102
calmmind : 0.75486

What do you think? In the next part of this series, we will use these topics to measure distances between Pokemon and identify clusters.
