Predicting Pokémon Type Using Pokédex Entries
Note: You may access the full code on GitHub by clicking here.
For 25 years, the Pokémon series has been capturing the hearts of children (and adults) around the world, becoming a pop culture icon with incredible staying power. With nearly 900 unique Pokémon across manga, television, apparel, video and board games, trading cards — Pokémon merchandise runs the gamut. My first-ever video game was Pokémon Red; I wasn’t very good, but I enjoyed the little blurbs in the Pokédex that appeared every time I caught something new. Once I became an experienced trainer, I became interested in the type match-ups and imagined what different type combinations would look like. I’m sure I’m not alone in having been asked, “if you were a Pokémon, what type would you be?” Personally, I always considered myself an electric or ice type, but I began to think, could I find a type based on a little blurb of my own? With Natural Language Processing (NLP), I set out on my own journey of predicting Pokémon type through text.
This post will be walking through three different levels of techniques used in NLP:
- Simple one-hot encoding of tokens with supervised learning and neural network examples to predict the primary type.
- Neural network using Global Vector (GloVe) embeddings to predict the primary type.
- Transformers for multi-label sequence classification.
The goal is also to generate interest in data science, through the lens of Pokémon, since it is an approachable and popular topic.
Sourcing the Data
There are many great sites that keep track of Pokémon data, including Bulbapedia and Serebii. For ease of use, we will be scraping from Pokémon Database; the set-up allows us to get all entries across the 8 generations. The only downside to this is that it lumps Alola and Galar regional variants onto the same page as the original, and it is difficult to separate out. So for example, we will have a few entries for Alola’s Fire-Ghost Marowak labeled as Ground. We first import some of the common packages and libraries and set up our Pandas dataframe (libraries will be imported right before their relevant application to the code):
We use the BeautifulSoup library to scrape the site. Documentation can be found here, but the gist is that we look for unique ID and class names in the HTML. The HTML source of a webpage can be viewed manually by pressing Ctrl+U in Chrome.
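To make the idea concrete, here is a minimal BeautifulSoup sketch. The real Pokémon Database pages use their own ids and class names, so the `dex-flavor` id and `cell-med-text` class below are placeholders, and we parse an inline HTML snippet instead of fetching the live site.

```python
from bs4 import BeautifulSoup

# A miniature stand-in for a Pokédex-entry page; the real site's
# ids/classes differ, so treat these names as placeholders.
html = """
<table id="dex-flavor">
  <tr><td class="cell-med-text">A strange seed was planted on its back at birth.</td></tr>
  <tr><td class="cell-med-text">It can go for days without eating a single morsel.</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Look up the table by its id, then pull the text of each entry cell.
table = soup.find("table", id="dex-flavor")
entries = [td.get_text(strip=True) for td in table.find_all("td", class_="cell-med-text")]
```

The same `find`/`find_all` pattern works on the real page source once you identify the right ids in the HTML.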
Let’s take a look at the distribution of types. It is also possible to check if any types are missing, but all the data seems to be there.
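The distribution and missing-value checks are one-liners in pandas. The column names (`Type1`, etc.) and the tiny stand-in frame below are assumptions for illustration:

```python
import pandas as pd

# Toy frame standing in for the scraped Pokédex (column names assumed).
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle", "Pidgey", "Rattata"],
    "Type1": ["Grass", "Fire", "Water", "Normal", "Normal"],
})

type_counts = pokemon["Type1"].value_counts()  # distribution of primary types
n_missing = pokemon["Type1"].isna().sum()      # any Pokémon without a type?
```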
It seems that Water and Normal types are the most common, which is to be expected. Flying type is generally the secondary type, so there are only 7 Pokémon with a primary Flying type; this and other imbalances may pose a problem if we do not under- or over-sample. Oversampling is preferred here because we have fewer than 1,000 data points to begin with.
We can save this as a CSV file so that we do not need to scrape again.
pokemon = pd.read_csv('pokedex_db.csv')
Method 1: One Hot Encoding Tokens
The goal of this method is to show how text can be converted into normal features to use in ML models. In this case, everything will be binary features, meaning the value can either be 0 or 1.
Now that we have our data, we start encoding. For many supervised learning methods it would be fine to leave the types as strings, but we need to encode them as integer labels for neural networks. We can use an encoder such as OrdinalEncoder from the sklearn library.
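A minimal sketch of the encoding step with sklearn's OrdinalEncoder (note that it expects 2-D input and assigns integers in alphabetical order of category):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder wants a 2-D array: one row per Pokémon, one column of types.
types = np.array([["Grass"], ["Fire"], ["Water"], ["Fire"]])

enc = OrdinalEncoder()
labels = enc.fit_transform(types).ravel()  # Fire->0, Grass->1, Water->2 (alphabetical)
```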
Looking at the Flying type again, most players will know that a lot of bird-like Pokémon are actually Normal type first, with Flying as the secondary type. In fact, about a quarter of all Normal types are also Flying! To help type balance and prevent Normal entries from having lots of references to bird associations like “peck”, “wings”, etc., we make an adjustment to these Normal-Flying types to give Flying as the primary type.
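The Normal/Flying swap is a small pandas operation; a sketch, with column names assumed to be `Type1` and `Type2`:

```python
import pandas as pd

pokemon = pd.DataFrame({
    "Name": ["Pidgey", "Rattata", "Charizard"],
    "Type1": ["Normal", "Normal", "Fire"],
    "Type2": ["Flying", None, "Flying"],
})

# Promote Flying to the primary slot for Normal/Flying Pokémon only.
mask = (pokemon["Type1"] == "Normal") & (pokemon["Type2"] == "Flying")
pokemon.loc[mask, ["Type1", "Type2"]] = ["Flying", "Normal"]
```

Charizard keeps Fire as its primary type, since only Normal/Flying rows match the mask.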
Next, we will adjust the text to get it ready for tokenization. We make sure to remove the names of Pokémon. One of the apostrophes we remove is actually a non-keyboard character that appears in the scraping.
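A sketch of that cleanup with `re`. The exact substitutions in the post may differ; here we lowercase, strip each Pokémon's name along with any possessive ending (including the curly apostrophe `\u2019` from scraping), and keep only letters:

```python
import re

names = ["Marowak", "Cubone"]
entry = "Marowak\u2019s bone club is Cubone\u2019s keepsake."  # \u2019 = curly apostrophe

text = entry.lower()
for name in names:
    # Strip the name and any possessive ending ('s or curly-apostrophe s).
    text = re.sub(name.lower() + r"[\u2019']?s?", "", text)
text = re.sub(r"[^a-z\s]", " ", text)    # letters only
text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```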
We use the nltk package to prepare the one-hot encoding. We remove stopwords, which are common words in the English language, like “the” or “can”, that don’t add any separation between different types. The MultiLabelBinarizer creates the one-hot encoded arrays.
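A compact sketch of the tokenize-filter-binarize pipeline. The post uses nltk's stopword list; to stay self-contained, this version substitutes a tiny hand-written stopword set:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Stand-in stopword set; the post uses nltk.corpus.stopwords instead.
stopwords = {"the", "a", "is", "it", "can", "of", "for"}

entries = [
    "it can spray foam from its mouth",
    "the flame on its tail shows its mood",
]
# Tokenize each entry and drop stopwords; sets also deduplicate tokens.
token_sets = [
    {tok for tok in entry.split() if tok not in stopwords}
    for entry in entries
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(token_sets)  # one row per entry, one 0/1 column per word
```

Each column of `X` is a binary "does this entry contain this word?" feature, which is exactly the representation the supervised models below consume.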
The data is split 80:20 between training and test, with no validation set for these examples. It is stratified along the types so that the test set doesn’t get most of the Ice types, for example.
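The 80:20 stratified split is a single sklearn call; a sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # toy features
y = np.array([0] * 10 + [1] * 10)  # toy primary-type labels

# stratify=y keeps the class proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```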
Since there is an imbalance in primary types, we want to oversample. Without it, the models would predict Water and Normal types at much higher rates (predicting Water for everything is much more accurate than random guessing), and probably never predict Fairy or Flying at all. There are a few sampling options, but given that we have thousands of binarized features that are quite sparse, we write off SMOTE, a method of creating synthetic points by giving features values close to those found within the same label. Instead, we use random oversampling to duplicate data points in the minority classes.
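Random oversampling can be written by hand in a few lines. The post likely uses a dedicated oversampling helper; this self-contained sketch uses sklearn's `resample` to duplicate minority-class rows until every class matches the largest one:

```python
import numpy as np
from sklearn.utils import resample

X = np.array([[i] for i in range(10)])
y = np.array(["Water"] * 7 + ["Flying"] * 3)  # imbalanced toy labels

# Resample each class's row indices (with replacement) up to the max count.
classes, counts = np.unique(y, return_counts=True)
n_max = counts.max()
X_parts, y_parts = [], []
for cls in classes:
    idx = np.where(y == cls)[0]
    boot = resample(idx, replace=True, n_samples=n_max, random_state=0)
    X_parts.append(X[boot])
    y_parts.append(y[boot])
X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)
```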
Below is an example of a Bernoulli Naive Bayes model, which actually outperforms a neural network here. Test accuracy comes to 0.48, not bad when there are 18 possible labels. The neural network in the latter half of the code block requires 18 nodes in the final layer, one for each typing. The softmax activation exponentiates each of the 18 raw outputs and divides by their sum, turning them into a probability distribution. The 18 probabilities sum to one, but we only care about the highest of them.
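A sketch of the Bernoulli Naive Bayes step on synthetic 0/1 features (the real model is fit on the one-hot token matrix), plus the softmax normalization in plain numpy:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
# 0/1 token features, as produced by the one-hot step; 3 toy "types".
X_train = rng.integers(0, 2, size=(100, 30))
y_train = rng.integers(0, 3, size=100)

clf = BernoulliNB()
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)

# Softmax by hand: exponentiate raw scores, then normalize to sum to 1.
z = np.array([1.0, 2.0, 0.5])
probs = np.exp(z) / np.exp(z).sum()
```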
Method 2: GloVe Embeddings
The goal of this method is to show that each word of text can be represented as a vector of numbers, which can be fed easily in neural networks.
Words can be represented in a vector space, where distances are minimized when words are very similar. Additionally, vectors between pairs of words may be similar if they are analogies (the vector between “man” and “king” may be similar to the one between “woman” and “queen”). GloVe provides various sizes of vectors for word representation; these vectors are also called “word embeddings”. You can read more and download the embeddings on the GloVe site. It is important to note that GloVe does not account for context; only the words themselves are important, which means we will perform very similar tokenization steps as in the first method.
We will use glove-6b-200d for this, but we could have obtained higher scores by using more dimensions per token or the 42B data (1.9M vs. 400k vocabulary words).
Setup is slightly different this time, in that we will separate the entries, per Pokémon, per game. This means that Bulbasaur would have a couple dozen rows of data. This gives many more data points to work with.
To convert to embeddings, we use the GloVe file. Since the vast majority of entries have 20 or fewer tokens, we choose 20 as the sequence length. The model requires all input data to have the same shape, so shorter entries are padded and longer ones truncated.
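A sketch of both steps: parsing the GloVe text format (each line is a word followed by its vector components) and padding every entry to a fixed length of 20. The two-line "file" and the 4-dimensional vectors are stand-ins for the real 400k-word, 200-dimensional download:

```python
import numpy as np

EMBED_DIM = 4  # the post uses 200 (glove-6b-200d); 4 keeps the demo small

# Lines as they appear in a GloVe .txt file: word, then vector components.
glove_lines = [
    "sea 0.1 0.2 0.3 0.4",
    "storm 0.5 0.1 0.0 0.2",
]
embeddings = {}
for line in glove_lines:
    parts = line.split()
    embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

def entry_to_matrix(tokens, max_len=20):
    """Look up each token's vector, then pad/truncate to a fixed length."""
    vecs = [embeddings.get(t, np.zeros(EMBED_DIM, dtype="float32"))
            for t in tokens[:max_len]]
    while len(vecs) < max_len:
        vecs.append(np.zeros(EMBED_DIM, dtype="float32"))  # zero-pad short entries
    return np.stack(vecs)

mat = entry_to_matrix(["sea", "storm", "unknownword"])
```

Out-of-vocabulary tokens map to the zero vector here; other schemes (e.g. a random "unknown" vector) are also common.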
We once again oversample, though the type distribution is not quite the same, since each Pokémon appears in a different number of games. After balancing and splitting into train and test data, the neural network is ready for input. Because each data point is now a sequence of vectors rather than a flat feature vector, the network needs an input shape (and layers) that accommodate this two-dimensional structure.
Unfortunately, the accuracy from this model is not anywhere near the first model’s.
Method 3: BERT Transformers and Multi-label Prediction
The goal of the final method is to show how to use Transformers. If you’d like to read more about how they work, there are many articles on the topic, as well as the original BERT paper from the team at Google. Transformers are great for many different tasks, including sequence classification. Code for transformers can be found in both TensorFlow Hub and Hugging Face; we will use the transformers library from Hugging Face. Don’t forget to run !pip install transformers first!
Because we are going to do multi-label classification, we will have to use both the Type1 and Type2 columns. I also did a single-type prediction so that we can look at the confusion matrix at the end, but for now I will only focus on the code for the multi-label case.
The initial dataframe and regex replacements for this method are the same as in GloVe. Even better, we don’t need to remove any stopwords, because BERT works with sentence context! We can also keep the Pokémon names, since BERT will actually break apart individual words, instead of having each word as a token. Each token has an ID and, like GloVe, BERT creates embedding vectors, this time with a dimension of 768. The sentence context means that the word “jog” in “going for a jog” is represented differently than in “jog your memory” because of the context and part of speech.
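To see how BERT breaks words apart, here is a toy WordPiece-style tokenizer: greedy longest-match against a vocabulary, where continuation pieces carry a `##` prefix. The mini-vocabulary below is invented; BERT's real vocabulary has roughly 30,000 entries and its tokenizer has extra details this sketch omits:

```python
# Toy WordPiece-style subword tokenizer (greedy longest-match).
vocab = {"jog", "##ging", "##s", "charm", "##ander", "fire"}

def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces are prefixed with ##
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no matching piece: unknown token
        pieces.append(piece)
        start = end
    return pieces

tokens = wordpiece("jogging", vocab)
```

So "jogging" becomes the known stem plus a continuation piece, which is why unseen words and names can still be tokenized.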
Unlike the first two methods, we use PyTorch to build our neural network. This runs on a GPU (a CPU can be used, but it is not preferred), and not all computers have a GPU suited for this purpose. In this case, we use Google Colab, which lets us run an environment with a powerful GPU.
After setting up the GPU for use, we do tokenization and multi-label binarization. These steps give us our data and our labels so that we can later create the training and test sets.
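The multi-label binarization step can be sketched with sklearn: gather each Pokémon's one or two types into a list (skipping a missing Type2), then binarize. Column names are assumed:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

pokemon = pd.DataFrame({
    "Name": ["Charizard", "Pikachu"],
    "Type1": ["Fire", "Electric"],
    "Type2": ["Flying", None],  # single-typed Pokémon have no Type2
})

# One list of types per Pokémon, dropping the missing Type2.
type_sets = [
    [t for t in pair if pd.notna(t)]
    for pair in zip(pokemon["Type1"], pokemon["Type2"])
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(type_sets)  # rows with one or two 1s among the type columns
```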
Next we have to prep our training and test data. Like before, we will want to oversample, but the problem is much more difficult now that some data points have 2 labels! We need the sklearn-multilabel library to get the correct balance. This blows up our training set to over 100,000 data points. That means longer calculation times, but we should get a much nicer accuracy.
BERT requires attention masks, which tell the model which positions hold real tokens and which are just padding: the mask array is a 1 at every real (non-padding) token and a 0 elsewhere. (These are separate from the [MASK] tokens of BERT’s masked-language-model pre-training, where about 15% of tokens are hidden and the model must guess them from the surrounding context; imagine predicting “snow” in “Mary had a little lamb, whose fleece was white as [MASK].”)
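Building the mask from padded token-id sequences is a one-liner; a sketch, assuming the tokenizer pads with id 0 (the usual BERT convention):

```python
import numpy as np

PAD_ID = 0
# Token-id sequences already padded to a fixed length of 8.
input_ids = np.array([
    [101, 2023, 2003, 1037, 7279, 102, 0, 0],
    [101, 5292, 102, 0, 0, 0, 0, 0],
])

# Attention mask: 1 wherever there is a real token, 0 over padding.
attention_mask = (input_ids != PAD_ID).astype(int)
```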
Batching is used for the model. Here we set the size to 16, so each batch randomly selects 16 samples from the data to train on at a time. The model can’t use all of the data at once, because it can only hold so much data and hidden state in memory, and a small batch size also updates the gradients more frequently via a simple loss.backward() call. The other important hyperparameter is the learning rate, which determines how large each gradient step is. A rate that is too high means the local minimum could be jumped over and missed. We use 3e-5 as the rate here. Please check out the .ipynb file on GitHub to see the full code for training and evaluating the model.
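Batching and learning rate are easiest to see in a stripped-down setting. This is not the BERT training loop; it is a plain numpy mini-batch gradient descent on a one-parameter linear model, showing the same "shuffle, take a batch, compute gradient, step by the learning rate" cycle that PyTorch's `loss.backward()` plus `optimizer.step()` perform:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=256)  # true slope = 3

w = 0.0
lr = 0.1          # learning rate: too large and the step overshoots the minimum
batch_size = 16
for epoch in range(20):
    order = rng.permutation(len(X))       # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]  # one mini-batch of 16 samples
        pred = w * X[idx, 0]
        grad = 2 * np.mean((pred - y[idx]) * X[idx, 0])  # d(MSE)/dw on the batch
        w -= lr * grad                    # gradient step, scaled by the learning rate
```

After training, `w` should sit close to the true slope of 3; with a much larger learning rate, the same loop can diverge instead.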
Because a Pokémon can only ever have at most two types, the labels are not truly independent of one another. This means that to calculate test accuracy this time, we can’t simply use the model accuracy. If we did, we’d be guaranteed that at least 14 of the 18 labels match, giving a minimum accuracy close to 80%. We instead need to look at the accuracy of the predicted types only.
- If Pokémon has 1 type for true and predicted, accuracy is either 0 or 100%.
- If Pokémon has 2 types for true, accuracy gets 50% for each type predicted correctly.
- If Pokémon has 1 type for true, 2 for predicted, the initial thought was to award 89% when the true type is among the two predictions (roughly 1 − 2/18). However, this encourages predicting two types, whereas many Pokémon have only one, so we lower it to 0.67 to better match the number of types.
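The scoring rules above can be sketched as a small function (a direct transcription of the bullets, not necessarily the exact code in the notebook):

```python
def type_accuracy(true_types, pred_types):
    """Score one prediction using the rules above (sets of 1 or 2 types)."""
    true_types, pred_types = set(true_types), set(pred_types)
    hits = true_types & pred_types
    if len(true_types) == 2:
        return 0.5 * len(hits)        # 50% credit per correctly predicted type
    # Exactly one true type from here on.
    if len(pred_types) == 2:
        return 0.67 if hits else 0.0  # reduced credit for over-predicting a 2nd type
    return 1.0 if hits else 0.0       # one true, one predicted: all or nothing
```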
Additionally, we need a cutoff level on the probability to determine whether or not to include a second type. We use a cutoff of 0.01 here, which sounds very low, but resulted in the highest accuracy.
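A sketch of the cutoff logic: always take the most probable label, and add a second label only if its probability clears the threshold. The helper name is mine, not the notebook's:

```python
import numpy as np

def predict_types(probs, cutoff=0.01):
    """Top label always; a second label only if it clears the cutoff."""
    order = np.argsort(probs)[::-1]      # label indices, most probable first
    picked = [int(order[0])]
    if probs[order[1]] > cutoff:
        picked.append(int(order[1]))
    return picked

probs = np.array([0.90, 0.05, 0.003, 0.047])
```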
We checked to see if the model predicted the correct number of types as well, and it seems correct. The calculated accuracy was around 82%, much better than the first two methods!
BERT Results for Primary Typing Only
In comparison, the accuracy for BERT for predicting primary typing was around 84%, which we can see with this confusion matrix:
We can also look at the classification report, which shows us how accurate the single type model is by each type. It looks like it is best at predicting Electric-types correctly!
I hope this walkthrough of different NLP methods using Pokémon gave some inspiration and insight into how interesting (and fun) data science can be! Lastly, I also included a function in the code to run new sentences. You can describe something in text and see what type it would predict. The length is capped to small snippets, like a 140-character tweet. For example, I put in the description of the Blue-Eyes White Dragon card from Yu-Gi-Oh! (removing the word “dragon”), and it predicted a Dragon type!
Here are a few other examples; I threw in a few other cartoon references. Do they make sense?
You may access the code on GitHub here. Thank you for reading!