Using SpaCy for Natural Language Processing
A guide to spaCy for everyone: from installation to training the model on your own data.
It’s easy to forget how powerful the human brain is. Not only do we understand language, we’ve also figured out how to teach computers to understand human language (a field called Natural Language Processing, or NLP) and made it easy for anyone to use. NLP is all around us now: if you use a computer or smartphone today, you’ve probably taken advantage of machine translation, sentence generation, and error correction.
People have even made it easier for normal users to look behind the curtain and watch the magic happen. Stick around even if you don’t know programming. Trust me, you won’t need it.
I considered using NLTK for this article, since it’s aimed towards research and teaching. But spaCy is closer to my heart, since I learned it on my own. SpaCy is free, open-source, designed for production use, and wonderfully documented.
NLTK, spaCy, and PyTorch all work with Python. If you’re a Java enthusiast, be sure to check out Apache OpenNLP afterwards.
Let’s get right into it
Here’s the colab where I ran all of this code: SpaCy for NLP
Go to File -> Save a copy in Drive
Now, feel free to play around as you wish! Ctrl+Enter executes the cell you’re in, and Runtime -> Run all will run all the cells for you.
So, what is SpaCy?
SpaCy is a library for Natural Language Processing that can process and “understand” large volumes of text. SpaCy does this through a variety of features. For the features we’ll be using today, you need to load a core statistical model. A statistical model means that the “decisions” spaCy makes are actually statistical predictions.
Although I’ll start at the very basics, by the end of this article, you’ll know how to train spaCy to make new predictions.
When we call spacy.load(‘en’), we’re loading the default English model, ‘en_core_web_sm’. The vocab length of the small model is only 478, compared to 1340242 for the large model. The small model is sufficient for preprocessing, so we’ll start there.
What’s a Vocab?
Strings take a lot of space to store, so spaCy stores them as hash values instead. Since the hash values hold no meaning on their own, the mapping between them and the original strings is stored in the Vocab.
The Vocab is sort of like a translator between the hash values and the words. Each entry in the vocab is a Lexeme object, which has a hash value, a text, and some other context-independent data, like whether the string consists of digits or alphabetic characters (is_digit or is_alpha).
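A quick sketch of the two-way lookup; a blank English pipeline is enough here, since no trained model is needed to inspect the vocab:

```python
import spacy

nlp = spacy.blank("en")

# Tokenizing a text adds its words to the vocab and string store.
doc = nlp("I love coffee")

# Words are stored as hashes; the string store converts both ways.
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash)                        # a 64-bit hash value
print(nlp.vocab.strings[coffee_hash])    # back to "coffee"

# Each vocab entry is a Lexeme with context-independent attributes.
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)   # orth is the hash
```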
What does the model do?
As soon as the model is called on a text, it puts the text through a preprocessing pipeline. A preprocessing pipeline is a series of steps (functions) that the model performs to “understand” the text and prepare it for further processing.
Doc is the object spaCy returns after preprocessing. Docs look like normal strings of text, but they’re not: a Doc object contains all the data found during preprocessing, such as the tokens, sentences, entities, and parts of speech found in the text.
Tokenization is performed before the rest of the steps in the pipeline. It splits the text into tokens: the smallest units of the string that carry semantic meaning, i.e. words and punctuation. The rest of the pipeline can be customized.
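Tokenization needs no trained model, so a blank pipeline is enough to see it in action (the example sentence is my own):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Dorothy lived in Kansas, didn't she?")

print([token.text for token in doc])
# Punctuation becomes its own token, and contractions are split:
# "didn't" becomes "did" + "n't".
```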
The tagger assigns a part of speech (POS) to every token. The POS tag indicates how a token functions in meaning as well as grammatically within the sentence. We all know common parts of speech, such as noun, pronoun, adjective, verb, adverb, etc. There are 36 different commonly accepted parts of speech in the English language.
The parser detects sentence boundaries. It also shows the dependencies the tokens have on each other; this is called dependency parsing. Unlike humans, spaCy cannot “instinctively” understand which words depend on others. However, it has been trained on a lot of data to predict dependencies between words.
The text output format for dependency parsing is quite difficult to understand.
DisplaCy is a visualiser that is useful for showing the dependencies between the tokens. The tokens are joined by arrows, and the type of dependency is labelled above each arrow.
The Spacy Visualiser is a great external tool that you can use to interactively understand the dependencies in a sentence.
Named Entity Recognition (NER)
Named entities are “real-world objects” that have been assigned a name: a place, a country, a product, or a work of art, for example. Roughly speaking, a proper noun. However, as you’ll see, NER relies on more than the POS of the token.
When NLP is used in production, it is often important to find the people and places in large volumes of text, which is why NER is included in the basic pipeline.
As you can see, Dorothy, who is most definitely a person, is not considered a named entity here. If you go back to where we printed the POS of the tokens, Dorothy is considered a proper noun. These discrepancies happen because spaCy is not taught rules. It has to make its own statistical predictions based on the data given to it.
You’ll also notice that ‘Uncle Harry’ and ‘Aunt Em’ are correctly identified as one person each. While a POS tag is assigned to each individual token, a named entity can span multiple tokens. Look at ‘Charles Dickens’ and ‘The French Revolution’.
Changing the pipeline
A very practical feature of spaCy is that it allows you to modify the pipeline. You can add steps (functions) that process the doc as you desire.
I’ve showcased this with a custom sentence parser that splits the sentence at commas. You can add any function you want here, even one that just prints out the length of the doc. We’ll look at a more practical use of changing the pipeline when we train the model.
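The comma-splitting sentence parser can be sketched like this, using the spaCy v3 component API (the component name and example sentence are my own); a blank pipeline is enough since we set the sentence boundaries ourselves:

```python
import spacy
from spacy.language import Language

@Language.component("comma_sentencizer")
def comma_sentencizer(doc):
    # Start a new "sentence" after every comma; mark all other tokens
    # explicitly as not starting a sentence.
    for i, token in enumerate(doc[:-1]):
        doc[i + 1].is_sent_start = token.text == ","
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("comma_sentencizer")

doc = nlp("I like apples, I like oranges, I like pears")
for sent in doc.sents:
    print(sent.text)
```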
SpaCy can also predict similarity. It returns the similarity between two objects on a scale of 0 (no similarity) to 1 (completely the same). For similarity, you’ll need either the large or the medium model.
The comparison is more complex than matching every single word. Texts are compared using word vectors, which are multi-dimensional meaning representations of words. I won’t get into the details here. While you’re playing around, you’ll find that short phrases fare better than long documents with irrelevant words.
If you want to see NLTK’s cosine similarity used on a larger document, check out this code where I used it to compare the similarity between two Jane Austen books.
Training the Model
This is the fun bit.
As you have seen, the named entity recognition is not the greatest.
NER is mainly used in applications where it’s important to find the names of people and places. What happens when you need spaCy to identify something else that’s necessary for your application?
If you worked with shelters, you might want to know when an animal is put up for adoption on social media. If you worked at a company, you’d want to know when a specific product is mentioned in an open-ended customer support form.
Let’s teach spaCy to recognise two new types of entities: products and animals.
We’ll provide spaCy with labelled training data that contains the exact position of each entity and its NER tag. Our training data is not very good for practical purposes: it is too small and only marks one entity per sentence. This may cause the model to learn in some unexpected ways.
Making changes to the training data (adding a new label type or adding more data) won’t affect how the code runs, so you can experiment with this as well!
At this point, explaining the code requires a bit more machine learning background. The code has comments that give an overview of what is happening. Essentially, we are teaching spaCy how to make statistical predictions based on the limited data we have provided.
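The training loop can be sketched as follows, using the spaCy v3 training API. The training sentences and their character offsets here are hypothetical stand-ins for the dataset in the Colab, and 20 epochs is an arbitrary choice:

```python
import random
import spacy
from spacy.training import Example

# Hypothetical training data: (text, {"entities": [(start, end, label)]}).
# The numbers are character offsets of the entity span within the text.
TRAIN_DATA = [
    ("My kitten is very playful", {"entities": [(3, 9, "ANIMAL")]}),
    ("I just bought a new laptop", {"entities": [(20, 26, "PRODUCT")]}),
    ("The horse galloped away", {"entities": [(4, 9, "ANIMAL")]}),
]

# Start from a blank pipeline with a fresh NER component.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Register every new entity label before training.
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # Wrap each (text, annotations) pair in an Example and update weights.
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
```

With data this small the predictions will be shaky; in practice you’d want many more sentences and multiple entities per sentence.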
While our training data is limited, we can still see that spaCy has learned how to identify (some) animals and products.
You can see here that spaCy does not just memorize words and sentences. No training sentence mentioned weak legs, but spaCy identifies ‘Horse’ as an animal. The second and third sentences both use the adjective “nice”, but the animal and the person are correctly identified. The trained model is better at identifying people. Although “snake” is identified in one sentence, it isn’t in the last one. Unfortunately, our model also unlearned ‘today’ as a date.
This is a basic introduction to NLP with spaCy, focusing mostly on the powerful preprocessing capabilities of the library. The more you play around with the code, the better you’ll understand how it works. I hope you learned something new today!
Coding is for everyone ❤