Two minutes NLP — SpaCy cheat sheet

POS tagging, dependency parsing, NER, and sentence similarity

Fabio Chiusano
NLPlanet
3 min read · Jan 26, 2022



spaCy is a free, open-source library for advanced Natural Language Processing in Python. It is designed specifically for production use and helps build applications that process and understand large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

List of spaCy tasks

Here’s a list of NLP tasks that spaCy can perform.

List of tasks that spaCy can perform. Image from https://spacy.io/usage/spacy-101.

While some of spaCy’s features work independently, others require a trained pipeline to be loaded. spaCy currently offers trained pipelines for a variety of languages, each of which can be installed as an individual Python package. Here’s an example where we download the trained pipeline en_core_web_sm.

The trained pipeline you choose always depends on your use case and the texts you’re working with. For a general-purpose use case, the small, default pipelines (i.e. the ones ending in sm) are always a good start.

Tokenization

Tokenization consists of segmenting text into words, punctuation marks, and so on. This is done by applying rules specific to each language.
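Tokenization is one of the features that works without a trained pipeline, so this sketch uses a blank English pipeline created with spacy.blank. The sample sentence is arbitrary.

```python
import spacy

# A blank pipeline contains only the language-specific tokenizer rules.
nlp = spacy.blank("en")

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

# Each element of a Doc is a Token; .text is its verbatim string.
tokens = [token.text for token in doc]
print(tokens)
```

Note how the English rules keep "U.K." together as one token while splitting "$1" into a currency symbol and a number.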

POS Tagging

POS (Part of Speech) Tagging assigns each word in a text a part of speech, such as noun or verb, based on the definition of the word and its context.

The pos_ attribute contains the simple UPOS part-of-speech tag, whereas the tag_ attribute contains the detailed POS tag.

Dependency Parsing

Dependency Parsing assigns syntactic dependency labels that describe the relations between individual tokens, such as subject or object.

Stopwords

Stopwords are the most common words of a language, which are often ignored in NLP tasks because they usually contribute little meaning to a sentence.
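Stopword flags are lexical attributes, so this sketch assumes only a blank English pipeline. The sentence and the variable name content_words are illustrative.

```python
import spacy

nlp = spacy.blank("en")

# The default English stopword list shipped with spaCy.
stop_words = nlp.Defaults.stop_words
print(len(stop_words))

doc = nlp("This is a sentence about stopwords in natural language processing.")

# Keep only the tokens that are neither stopwords nor punctuation.
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_words)
```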

Lemmatization

Lemmatization assigns the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “dogs” is “dog”.

Named Entity Recognition (NER)

Named Entity Recognition refers to labeling named “real-world” objects in texts, like persons, companies, or locations.

Word embeddings

A word embedding is a learned representation of text (usually a vector of numbers) in which words with similar meanings have similar representations.

To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors and only include context-sensitive tensors. This means you can still use the similarity() methods to compare sentences and words, but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package, such as en_core_web_md or en_core_web_lg.

This is how you get word embeddings with spaCy.

Sentence similarity

With spaCy, you can compute similarities between sentences. This is done by averaging the word embeddings of the words in each sentence and then comparing the two averaged vectors with a similarity measure, which is cosine similarity by default.
