A few weeks ago I started working on a text summarisation project and needed a Natural Language Processing library with comprehensive features. The project had several potentially computationally expensive components where I wanted to try out different things, so I needed something flexible that was also fast enough to experiment with. The search brought me to spaCy. spaCy was very helpful and I decided to summarise its features in this two-part guide. Full code for the post can be found on GitHub. If you are new to NLP, consider reading this intro first; it explains many concepts referred to in this post.
Why I like spaCy:
- It is fast, thanks to Cython’s data structures and parallelism, so I can experiment with different approaches.
- It is accurate: there are three pretrained English models to choose from (small, medium and large). Since accuracy and speed vary with model size, this gives flexibility to balance the two for a given task.
- It is user-friendly: the documentation is good for most of the package and there is a built-in spacy.explain function for quick access.
- There is a free interactive four-hour course to get one started.
- It has a mild learning curve: it was possible to get started after taking the course and learn as needed on the go.
What it offers:
- Preprocessing: tokenisation, sentence segmentation, lemmatisation, stopwords
- Linguistic features: part of speech tags, dependency parsing, named entity recognition
- Visualisers for dependency trees and named entities
- Pre-trained word vectors and models
- Flexibility: can augment or replace any pipeline component or add new components such as TextCategorizer.
- Transfer learning with BERT-style pretraining
What could be better:
- The documentation becomes rather hard to navigate once one gets to training and pretraining of models.
- Some common NLP functionality is missing, such as scikit-learn-style vectorisers for term-document or TF-IDF matrices. Even though these are not necessary if you are training your models with spaCy, they are still handy if you want to combine spaCy with other tools.
First we need to install spaCy and download a pretrained model. There are three English language models (small, medium and large), as well as a model with only GloVe word vectors. The medium and large models also come with GloVe word vectors. All en_core_web_* models come with tokeniser, tagger, parser and entity recogniser components, but accuracy improves with model size. I will use the large model here. To get started, run the following commands in a terminal.
pip install spacy
python -m spacy download en_core_web_lg
Next we create an nlp object and load a model into it. The nlp object now has a tokeniser, tagger, parser and entity recogniser in its pipeline, and we can use it to process a text and get all of those features.
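As a minimal sketch of this step: the helper name load_pipeline and the example sentence are my own assumptions, not part of the original code. The helper falls back to a blank English pipeline (tokeniser only) when the en_core_web_lg package is not installed, so the snippet still runs, but the tagger, parser and entity recogniser are only present when the pretrained model loads.

```python
import spacy


def load_pipeline(name="en_core_web_lg"):
    """Hypothetical helper: load a pretrained pipeline if it is installed."""
    try:
        return spacy.load(name)
    except OSError:
        # The model package is missing: fall back to a tokeniser-only pipeline.
        return spacy.blank("en")


nlp = load_pipeline()

# Calling the nlp object on a string runs every pipeline component
# and returns a Doc holding the annotated tokens.
doc = nlp("spaCy was created by Explosion in Berlin.")

print(nlp.pipe_names)                  # components that ran (empty for a blank pipeline)
print([token.text for token in doc])   # the tokens of the text
```

With the pretrained model installed, nlp.pipe_names lists the tagger, parser and entity recogniser mentioned above.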
At this point some of the usual text preprocessing tasks are a breeze. The doc can be sliced with token indices to get single tokens or sequences of tokens (spans), and various token attributes such as text, lemma, index, POS and tag can be accessed. Some attributes extend to spans as well. Sentence segmentation is also available.
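Here is a small sketch of slicing and attribute access. I am assuming a blank English pipeline and a made-up sentence so that it runs without a model download; attributes such as token.lemma_, token.pos_ and token.tag_, and sentence segmentation via doc.sents, need the pretrained pipeline to be populated.

```python
import spacy

# Tokeniser-only pipeline (assumption: enough to illustrate slicing).
nlp = spacy.blank("en")
doc = nlp("spaCy makes text preprocessing a breeze.")

first_token = doc[0]   # indexing a Doc gives a single Token
span = doc[2:4]        # slicing a Doc gives a Span (a sequence of tokens)

for token in doc:
    # index, surface text, and two lexical flags that need no trained model
    print(token.i, token.text, token.is_alpha, token.is_stop)

print(span.text)
```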
The medium and large English models also come with GloVe vectors, which can be accessed through the .vector attribute of a token, span or doc. The vector of a span or doc is the average of the vectors of its tokens. spaCy also has a built-in similarity function.
The similarity function computes cosine similarity, that is, the cosine of the angle between two vectors. Cosine similarity ignores vector lengths and, in the two extreme cases, vectors pointing in the same direction have similarity 1, while vectors pointing in opposite directions have similarity -1. So the larger the value, the higher the degree of similarity.
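To make the averaging and the cosine formula concrete, here is a plain Python sketch. It does not call spaCy; the function names and the toy two-dimensional vectors are assumptions chosen for illustration, and the vectors are assumed to be non-zero.

```python
import math


def mean_vector(vectors):
    """Average equal-length vectors component by component."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# A "span" of two toy token vectors and its averaged vector.
token_vectors = [[1.0, 0.0], [0.0, 1.0]]
span_vector = mean_vector(token_vectors)

# Same direction but different lengths: similarity close to 1.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Opposite directions: similarity close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(span_vector, same, opposite)
```

The first comparison shows that length does not matter: the second vector is twice as long as the first, yet the similarity is still 1 up to floating point error.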
We can also retrieve linguistic features such as noun chunks, part of speech tags and dependency relations between tokens in each sentence. To understand what the various labels mean, such as the values of token.dep_, we can use spacy.explain(), which looks up the annotation specifications.
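A quick sketch of spacy.explain: it is a glossary lookup, so it needs no loaded model. The three labels below are ones I picked as examples of a dependency label, a fine-grained tag and an entity type.

```python
import spacy

# A dependency label, a fine-grained tag and an entity type.
labels = ["nsubj", "NNP", "GPE"]

for label in labels:
    # spacy.explain returns a short human-readable description of the label.
    print(label, "->", spacy.explain(label))
```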
token.tag_ carries extra morphological information compared with token.pos_. These attributes, as well as some others, come in pairs with and without a trailing underscore, corresponding to a unicode string and an integer value respectively. The .dep_ attribute holds dependency relations between words and is best understood via spaCy’s built-in customisable visualiser.
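The underscore convention can be sketched as follows. I am using token.orth and token.lower rather than pos or tag, because they are populated even in a blank pipeline; this choice and the example sentence are my assumptions, and pos/pos_, tag/tag_ and dep/dep_ follow the same pattern once a pretrained model has run.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is opening an office in London.")
token = doc[0]

# Without the underscore: an integer ID into the shared string store.
orth_id = token.orth
# With the underscore: the human-readable string.
orth_text = token.orth_

# The string store maps the integer ID back to its string.
round_trip = nlp.vocab.strings[orth_id]

print(orth_id, orth_text, round_trip)
print(token.lower, token.lower_)
```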
Retrieving and visualising named entities is done very conveniently in spaCy.
You can also pass custom options to the visualiser.
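Below is a sketch of the entity visualiser with custom options. To keep it runnable without a model download, I set the entities by hand on a blank pipeline; that is an assumption for illustration only, since with a pretrained model doc.ents is filled in by the entity recogniser. The sentence, labels and colour are made up.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is opening an office in London.")

# Hand-labelled entities (assumption): token 0 is "Apple", token 6 is "London".
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 6, 7, label="GPE")]

# Custom options: which entity types to show and what colour to use.
options = {"ents": ["ORG", "GPE"], "colors": {"ORG": "#ffd966"}}

# With jupyter=False the markup is returned as a string instead of displayed.
html = displacy.render(doc, style="ent", options=options, jupyter=False)
print(html)
```

Passing style="dep" instead renders the dependency tree, which requires a parsed doc.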
spaCy is very flexible. It is possible to add new pipeline components or replace existing ones. People have been building on top of spaCy and there are a myriad of packages in the spaCy universe. I will only mention two of the pipeline extensions, spacy_langdetect and neuralcoref, but there are many other packages worth spending time on.
Since a lot of text data comes from mostly uncurated sources such as the web, the language detector functionality comes in especially handy. For example, by combining spacy_langdetect with spaCy’s .lang_ attribute (which points to the language of the model of the nlp object used to process the text), one can ensure that the correct language model is used. Note that ._. is used to access extensions.
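The ._. mechanism itself can be sketched with spaCy’s own Doc.set_extension, without installing spacy_langdetect. The extension name detected_lang and the hard-coded value are hypothetical stand-ins for what a language detector component would write; they are not the spacy_langdetect API.

```python
import spacy
from spacy.tokens import Doc

# Register a custom attribute (hypothetical name, for illustration only).
# force=True lets the snippet be re-run without an "already registered" error.
Doc.set_extension("detected_lang", default=None, force=True)

nlp = spacy.blank("en")
doc = nlp("This is an English sentence.")

# Stand-in for a language detector's output (assumption).
doc._.detected_lang = "en"

# doc.lang_ is the language of the pipeline that processed the text.
model_matches_text = doc._.detected_lang == doc.lang_
print(doc.lang_, doc._.detected_lang, model_matches_text)
```

A check like model_matches_text could be used to route a text to a different language model when the two values disagree.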
Another common feature often necessary in text processing is Coreference Resolution, or linking together all phrases that mention the same thing. For example, in “I like ice cream because it is tasty” both “ice cream” and “it” refer to the same thing. Coreference Resolution is a key feature of language understanding and while humans are pretty good at it, machines are not. The neuralcoref extension integrates this functionality in spaCy as another pipeline component. We need to re-process doc objects that were created before adding new pipeline components to make neuralcoref available.
We see that ‘it’ is incorrectly resolved while the resolution for ‘he’ is correct, and the corresponding scores show that the model wasn’t as confident in its decision for ‘it’ as it was for ‘he’. We can tweak the greediness of the model to make it stricter about resolving coreferences. The default greediness is 0.5, and if we decrease it to 0.45 we see that the coreferences are now resolved correctly.
neuralcoref also comes with a visualisation client.
spaCy has almost all of the common preprocessing and linguistic features used for text processing. It is user-friendly and one can start using it with minimal initial preparation. There is a considerable ecosystem of additional packages built on top of spaCy. The documentation is very helpful for using these features and there is a really handy spacy.explain function to quickly access annotation specifications. Several of the features also come with built-in visualisers. We discuss BERT-style pretraining and pipeline component training examples in the second part of this tour.