A tour of awesome features of spaCy (part 1/2)

Nuszk
May 30 · 6 min read
  • It is accurate: there are three pretrained English models to choose from, small, medium and large. Accuracy and speed change accordingly, which gives you the flexibility to balance the two for a given task.
  • It is user-friendly: the documentation covers most of the package well and there is a built-in explain function for quick access to annotation definitions.
  • There is a free interactive four-hour course to get you started.
  • It has a mild learning curve: after taking the course it was possible to get started and pick up the rest on the go.
  • Linguistic features: part-of-speech tags, dependency parsing and named entity recognition
  • Visualisers for dependency trees and named entities
  • Pre-trained word vectors and models
  • Flexibility: can augment or replace any pipeline component or add new components such as TextCategorizer.
  • Transfer learning with BERT-style pretraining
  • There is some common NLP functionality missing, such as scikit-learn-style vectorisers for term-document or TF-IDF matrices. These are not necessary if you train your models with spaCy, but they are handy if you want to combine spaCy with other tools (see the sketch after this list).
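For instance, here is a rough sketch of pairing spaCy's tokenisation with scikit-learn's TfidfVectorizer. The helper function and sample texts are made up for illustration, and the model is the one downloaded in the next section.

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Only the tokeniser is needed here, so the other components are disabled
nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser", "ner"])

def spacy_tokenizer(text):
    # Lemmatise and drop stop words and punctuation (illustrative choices)
    return [tok.lemma_ for tok in nlp(text) if not tok.is_stop and not tok.is_punct]

vectoriser = TfidfVectorizer(tokenizer=spacy_tokenizer)
tfidf_matrix = vectoriser.fit_transform([
    "spaCy plays nicely with other tools.",
    "A TF-IDF matrix is easy to build this way.",
])
print(tfidf_matrix.shape)  # (2, number of distinct lemmas)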

Getting started

First we need to install spaCy and download a pretrained model. There are three English language models, small, medium and large, as well as a model that contains only GloVe word vectors; the medium and large models also include GloVe word vectors. All en_core_web_* models come with tokeniser, tagger, parser and entity recogniser components, but accuracy improves with model size. I will use the large model here. To get started, run the following commands in a terminal.

pip install spacy
python -m spacy download en_core_web_lg

Preprocessing

At this point some of the usual text preprocessing tasks are a breeze. A Doc can be sliced with token indices to get single tokens or sequences of tokens (spans), and various token attributes such as text, lemma, index, part-of-speech tag and fine-grained tag can be accessed. Some attributes extend to spans as well. Sentence segmentation is also available.
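A minimal sketch of what that looks like (the sample sentences are arbitrary):

import spacy

# Load the large English model downloaded above
nlp = spacy.load("en_core_web_lg")

doc = nlp("spaCy makes the usual preprocessing tasks a breeze. "
          "Slicing a Doc gives tokens and spans.")

token = doc[0]   # a single Token
span = doc[0:3]  # a Span covering the first three tokens

# Token attributes: text, lemma, index, coarse and fine-grained tags, ...
for tok in doc[:6]:
    print(tok.text, tok.lemma_, tok.i, tok.pos_, tok.tag_, tok.is_stop)

# Some attributes extend to spans as well
print(span.text, span.lemma_)

# Sentence segmentation
for sent in doc.sents:
    print(sent.text)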

Linguistic features

We can also retrieve linguistic features such as noun chunks, part-of-speech tags and dependency relations between tokens in each sentence. To understand what tags such as token.pos_, token.tag_ or token.dep_ mean, we can use spacy.explain(), which looks up the annotation specifications.
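A short sketch, reusing the nlp object loaded above (the example sentence is again arbitrary):

# Noun chunks, part-of-speech tags and dependency relations
doc = nlp("The quick brown fox jumps over the lazy dog.")

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_)

for tok in doc:
    print(tok.text, tok.pos_, tok.tag_, tok.dep_, tok.head.text)

# spacy.explain gives quick access to the annotation specifications
print(spacy.explain("nsubj"))  # nominal subject
print(spacy.explain("JJ"))     # adjective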

Dependency tree
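A tree like the one above can be drawn with the built-in displaCy visualiser; roughly:

from spacy import displacy

doc = nlp("The quick brown fox jumps over the lazy dog.")

# In a Jupyter notebook the tree is drawn inline
displacy.render(doc, style="dep", jupyter=True)

# From a plain script, serve it on http://localhost:5000 instead
# displacy.serve(doc, style="dep")

# style="ent" gives the named entity visualiser in the same way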

spaCy universe

spaCy is very flexible. It is possible to add new pipeline components or replace existing ones. People have been building on top of spaCy and there are myriad packages in the spaCy universe. I will only mention two of the pipeline extensions, spacy_langdetect and neuralcoref, but there are many other packages worth spending time with.
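As a taste, here is a rough sketch of plugging both into a spaCy 2.x-style pipeline. Both packages need to be installed separately, and the attribute names follow their respective docs.

import spacy
import neuralcoref
from spacy_langdetect import LanguageDetector

nlp = spacy.load("en_core_web_lg")

# Language detection as an additional pipeline component
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)

# Coreference resolution via neuralcoref
neuralcoref.add_to_pipe(nlp)

doc = nlp("My sister has a dog. She loves him.")

print(doc._.language)        # e.g. {'language': 'en', 'score': 0.99...}
print(doc._.has_coref)       # True
print(doc._.coref_clusters)  # e.g. [My sister: [My sister, She], a dog: [a dog, him]]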

Conclusion

spaCy has almost all of the common preprocessing and linguistic features used in text processing. It is user-friendly and one can start using it with minimal preparation. There is a considerable ecosystem of additional packages built on top of spaCy. The documentation is very helpful, the handy spacy.explain function gives quick access to annotation specifications, and several of the features come with built-in visualisers. We discuss BERT-style pretraining and pipeline component training examples in the second part of this tour.

Written by Nuszk
Eliiza-AI · AI & Machine Learning in Melbourne, Australia