fastText word vectors demo @ EuroPython 2018
Two weeks ago, two members of our omni:us tech team attended the EuroPython 2018 Conference held in Edinburgh. With over 1100 attendees and more than 120 talks, this was a massive conference! As the name suggests, it covered a broad range of Python-related topics. This 7-day event included training sessions, keynotes and talks, as well as coding sprints.
For those interested, you can find videos of the talks here.
It was by far one of the more interesting conferences, with a variety of presentations revolving around NLP, deployment methods, and useful tools and techniques that tech teams all over the world could implement.
Each evening, one hour was reserved for spontaneous lightning talks. The idea of a lightning talk is that any attendee can volunteer to present on any topic, with a limit of 5 minutes per talk. Participation is straightforward: Write down your topic of choice in the morning, jump on stage in the evening. Genius! And we are proud to say that one of our very own scientific engineers participated in this spontaneous lightning presentation.
Marianne’s very enlightening presentation was titled Robust Word Vectors with fastText. It explains why training word vectors with the open-source library fastText is worth a shot.
fastText grew out of the need for tools that can produce more accurate classification results from large bodies of text. It is a library open-sourced by the Facebook AI Research (FAIR) Lab, with the goal of providing a more scalable and efficient solution for text representation and classification. Credits to FAIR for this flexible and useful tool. Check out their post on fastText here!
So, let’s get down to her presentation topic.
At its core, word vectors map a word to a vector of numbers. Word vectors are trained by iterating over large bodies of text — words that appear in similar contexts (like walk and go) end up with similar word vectors. For applications like text classification, looking up these vectors for input words is usually the first step! The problem is that new or misspelled words do not have any assigned word vector. And here comes fastText to fix it!
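The idea of "similar contexts, similar vectors" can be made concrete with a tiny sketch. The vectors below are made up purely for illustration (real ones come from training on a large corpus), and cosine similarity is one common way to compare them:

```python
import math

# Toy lookup table: in practice these vectors are learned from a large
# text corpus; the values here are invented for illustration only.
word_vectors = {
    "walk": [0.9, 0.1, 0.3],
    "go":   [0.8, 0.2, 0.3],
    "bank": [0.1, 0.9, 0.7],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Words that appear in similar contexts end up close together:
print(cosine_similarity(word_vectors["walk"], word_vectors["go"]))    # high
print(cosine_similarity(word_vectors["walk"], word_vectors["bank"]))  # lower
```

With real trained vectors, this kind of similarity lookup is exactly what powers "most similar word" queries.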
As opposed to the usual training method of retaining a single word vector per word, fastText splits each word into smaller parts (with a minimum of 3 characters) and trains one word vector per sub-word. fastText would, for example, split the word walk into the sub-words wal and alk, each with its own vector.
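The splitting step itself is just overlapping character n-grams. Here is a minimal sketch for a single n-gram length of 3; note that real fastText actually uses a range of n-gram lengths and adds word-boundary markers, which we omit here for simplicity:

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams (sub-words)."""
    if len(word) <= n:
        return [word]  # short words are kept whole
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("walk"))  # ['wal', 'alk']
```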
In case a word like walked was not part of the training text corpus, fastText can still come up with a word vector: At least the subwords wal and alk are known from training, so fastText returns a combination of their vectors.
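Building on the splitting idea above, an out-of-vocabulary vector can be sketched as a simple average of the known sub-word vectors. The sub-word vectors below are invented for illustration, and averaging is just one plausible way to combine them:

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character 3-grams."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Toy sub-word vectors, as if learned during training (values made up).
subword_vectors = {
    "wal": [0.6, 0.2],
    "alk": [0.4, 0.4],
    "lke": [0.5, 0.1],
    "ked": [0.3, 0.3],
}

def vector_for(word):
    """Average the vectors of all sub-words of `word` seen in training."""
    known = [subword_vectors[g] for g in char_ngrams(word) if g in subword_vectors]
    if not known:
        return None  # nothing at all is known about this word
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# "walked" never appeared in training, but its sub-words did:
print(vector_for("walked"))
```

This is why a never-seen or misspelled word still lands near its correctly spelled relatives: most of its sub-words are shared with them.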
One part of Marianne’s lightning talk was a live demo, in which she trained word vectors on a very small dataset consisting of the EuroPython 2018 talk descriptions. She then showed that fastText provides a meaningful word vector even for a misspelled word like Pythom, by listing the words whose vectors are most similar to it. Without fastText’s sub-word information, the words with the most similar vectors turn out not to be similar in meaning at all.
For more insights and detailed explanations, you can catch Marianne’s engaging talk here:
Here at omni:us, we work a lot with text that has been extracted from scanned documents using OCR (optical character recognition). OCR typically lets some character-level errors slip through. By using fastText word vectors in our NLP pipelines, we are able to compensate for these OCR errors.
fastText is a versatile tool which is easy to use and very helpful for what we do. But, more on our work next time!