Finding Pokémon names in text using dictionaries and tagtog
By Jorge Campos
Today we are going to use tagtog to build a very simple pipeline that recognizes Pokémon names with a dictionary.
tagtog is a text annotation tool that lets you annotate text manually or automatically. Automatic annotations facilitate information retrieval, and they are powered by machine learning or dictionaries.
Dictionaries are simple controlled vocabularies, and yet a powerful resource when you have a well-defined list of items you want to recognize in text, especially if those items are identified with different names.
OK, let's get started. Where can I find a dictionary with Pokémon names? First, you should find the names somewhere and, if possible, all together. In this particular scenario, it is quite easy. There are many resources out there with such data.
At the time of writing this post, according to Bulbapedia, there are 807 Pokémon discovered across all 7 generations of the series. There you can find the names you will add to your dictionary. Simply copy and paste the tables into your favorite spreadsheet editor and clean up the data a bit so it meets the TSV (tab-separated values) dictionary format.
The format is very simple: each row represents one unique entity, in our case one Pokémon. The first value in each row is the Id of the entity (we will use the National Pokédex number). Each of the following values in the row is a name for this Pokémon. There are two types of names: recommended names and alternative names. The score assigned to the latter is lower.
In this example we will just use recommended names: the English and the trademarked Japanese names. You can download the resulting dictionary 🗂 here.
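To make the format concrete, here is a minimal Python sketch that writes a few rows in the layout described above: the Id first, then the names, separated by tabs. The file name `pokemon_dictionary.tsv`, the helper `write_dictionary` and the use of plain numbers as Ids are my own choices for illustration. The sketch only covers the Id-plus-names layout; it does not try to show how recommended and alternative names are told apart.

```python
import csv

# (National Pokédex number, English name, trademarked Japanese name)
ENTRIES = [
    ("1", "Bulbasaur", "Fushigidane"),
    ("4", "Charmander", "Hitokage"),
    ("7", "Squirtle", "Zenigame"),
    ("25", "Pikachu", "Pikachu"),
]


def write_dictionary(path, entries):
    """Write one entity per row: Id first, then its names, tab-separated."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for entity_id, *names in entries:
            # Drop duplicate names (Pikachu is spelled the same in both).
            unique_names = list(dict.fromkeys(names))
            writer.writerow([entity_id, *unique_names])


write_dictionary("pokemon_dictionary.tsv", ENTRIES)
```

Opening the resulting file in a spreadsheet editor is a quick way to check that every row starts with an Id and that no name ended up in the wrong column.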
Let's move to tagtog, the text annotation tool. If you don’t have an account yet, just sign up. The start plan is free. With it you can manually annotate all the text you want, train custom machine learning models, build dictionaries, and annotate automatically up to a certain number of requests per month. We will use this plan to build a very simple pipeline to recognize Pokémon names in text. Once you have an account, create a new project. I will name mine pokemonMiner. Don’t select any pre-trained model; we want to start from scratch 💪.
You can now define the types of entities you want to extract from text. In our case we will only use one entity type: pokemon. If you don't like the default color, pick one you like. It will be used to highlight Pokémon names found in text.
Now we will import a dictionary for the new entity type created. Go to the Dictionaries tab, create a dictionary and upload the dictionary file.
Well, let's try it out! Go to the Documents section and import some text.
Yeee! Names are now recognized automatically when you import text to your project in tagtog, either using the web interface or the API. This isn't plain string matching: slight morphological or grammatical changes in Pokémon names are taken into account for recognition. Notice that names are also normalized to National Pokédex Ids. That is fantastic, and we could finish here, but I would like to show you something else that might be useful.
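To picture what normalization means here, the toy sketch below maps a few surface forms to a single Id. It is not how tagtog matches text internally (this post does not describe that); it only lowercases each word and strips a possessive or plural ending. The names `NAME_TO_ID` and `find_pokemon` are hypothetical.

```python
import re

# Tiny in-memory dictionary: lowercase name -> National Pokédex number.
NAME_TO_ID = {
    "pikachu": "25",
    "mewtwo": "150",
    "mew": "151",
}


def find_pokemon(text):
    """Return (matched text, Id) pairs for known names found in text."""
    matches = []
    # A word, optionally followed by a possessive 's.
    for token in re.findall(r"[A-Za-zé]+(?:['’]s)?", text):
        base = re.sub(r"['’]s$", "", token.lower())
        # Try the word as-is, then without a plain plural "s".
        candidates = [base]
        if base.endswith("s"):
            candidates.append(base[:-1])
        for candidate in candidates:
            if candidate in NAME_TO_ID:
                matches.append((token, NAME_TO_ID[candidate]))
                break
    return matches


print(find_pokemon("Ash's Pikachu met two Mews and Mewtwo's clone."))
```

Each match keeps the text as it was written, while the Id stays the same for every variant of a name. That is the point of normalizing to Pokédex numbers.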
Let's say there is a new Pokémon discovered or you find out a new name for one of the creatures in your dictionary. You have two ways to update your dictionary:
- Replace your dictionary. This is the best option if you want to apply large changes. Just go to Settings > Dictionaries, download your current dictionary and apply the changes. When you are ready, replace the dictionary.
- Use the annotation editor. This is the easiest way to apply small changes. Update your dictionary in iterations while you or a group of people analyze the results coming from tagtog.
The latter is the most interesting option because you can update your dictionary interactively and continuously. For example, consider this text:
It was good to see his fellow cloned Pokémon again, and Mewtwo was glad they were happy. However, there was one visit that turned out to be less than ideal. And it happened when he and Mew teleported over a prairie. This was where they ended up when Mewtwo had teleported to where Pikatwo was.
He found it strange that Pikatwo had chosen to live so close to a human home, for there was a house only a few meters away from where he and Mew now floated.
Let's say you imported this piece of text, and you also want to recognize Pikatwo (Pokémon originally cloned from Ash’s Pikachu) and add this name to the Pikachu entity in your dictionary.
This is what you should do:
- Open the imported text in the text annotation editor.
- Highlight a mention of Pikatwo. Other entities using the same name are also highlighted.
- Click on the new entity and start typing Pikachu in the normalization box. Entries in your dictionary show up if their names contain the string typed.
- Select the dictionary entry you want to add this new name to.
- Press Enter or click on ⏎
From now on Pikatwo will be recognized in text and normalized to the same Id as Pikachu. If you would like to add it as a new entry, you could simply type the new Id in the normalization box.
Now if you download your dictionary, you will see how the names have been added.
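If you prefer to make the same kind of change offline in a downloaded dictionary, it could look roughly like the sketch below. It assumes the plain TSV layout described earlier (Id first, then names); the helper `add_name` and the sample rows are made up for illustration.

```python
import csv
import io


def add_name(tsv_text, entity_id, new_name):
    """Append new_name to the row for entity_id, or add a new row for it."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if row]
    for row in rows:
        if row[0] == entity_id:
            # Existing entry: add the name only if it is not there yet.
            if new_name not in row[1:]:
                row.append(new_name)
            break
    else:
        # No entry with this Id yet: create one.
        rows.append([entity_id, new_name])
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


downloaded = "25\tPikachu\n150\tMewtwo\n"
updated = add_name(downloaded, "25", "Pikatwo")
print(updated)
```

Calling `add_name` with an Id that is not in the file creates a new row, which mirrors typing a new Id in the normalization box.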
And that's all folks! Recap:
- Use dictionaries to bootstrap the recognition of controlled vocabularies.
- With tagtog you can handle dictionaries with ease: import, download and edit them interactively within the annotation text editor. It is dead easy to build a pipeline to recognize terms in controlled vocabularies automatically.
Now you know how to deal with dictionaries, and nothing stops you from building something great. Do you participate in any Pokémon forum? How awesome would it be to see an emoji with a specific Pokémon next to a name? Just use these images; they are already numbered using the same Ids we have used to build our dictionary. Import the text to tagtog using the API, and use the response (which contains the Ids recognized) to work out which image to display 🤘.
I hope this was useful. Thanks for reading and please let me know if you have any questions or feedback. You can find more tutorials here or by following our blog.
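As a rough sketch of the forum idea, the snippet below places an image tag after each recognized name. It does not call the tagtog API. It assumes you have already reduced the response to a list of (start offset, matched text, Id) tuples, which is my own simplification and not the actual response format. The base URL and the `<Id>.png` file naming are placeholders too.

```python
def image_for(pokedex_id, base_url="https://example.com/pokemon-icons"):
    """Build a (hypothetical) image URL for a National Pokédex number."""
    return f"{base_url}/{int(pokedex_id)}.png"


def decorate(text, mentions):
    """Insert an image tag after each recognized mention.

    mentions: list of (start offset, matched text, Id) tuples, an assumed
    simplification of whatever the API returns.
    """
    pieces = []
    cursor = 0
    for start, matched, pokedex_id in sorted(mentions):
        end = start + len(matched)
        # Copy the text up to the end of the mention, then add the image.
        pieces.append(text[cursor:end])
        pieces.append(f' <img src="{image_for(pokedex_id)}" alt="{matched}">')
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


post = "I just caught a Pikachu!"
mentions = [(post.index("Pikachu"), "Pikachu", "25")]
print(decorate(post, mentions))
```

Working from offsets rather than searching for the name again keeps the image next to the exact mention that was recognized, even when the same word appears several times.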
At 🍃tagtog.net we aim to democratize text analytics with our text annotation tool.