SpaCy for Natural Language Processing

by Sayali Moghe and Sanjan Das

Mar 15, 2022 · 6 min read

GETTING STARTED

SpaCy is an easy-to-use, free, open-source library designed for Natural Language Processing (NLP). It reads large volumes of unstructured text and analyzes it for further use, such as information extraction, preprocessing for deep learning tasks, and natural language understanding.

SpaCy was released in 2015 to provide natural language processing (NLP) software for production use. Until then, NLTK (Natural Language Toolkit) was the most popular choice for NLP tasks; however, it was primarily built for teaching and research, and like many other toolkits it emphasizes accuracy and flexibility over processing speed. SpaCy's focus on speed is a major reason it has been adopted for large-scale production systems.

End-to-end processing speed on raw unannotated text from Reddit, measured in Words Per Second (WPS). Notice the speed SpaCy provides in comparison to other libraries!

Using SpaCy

SpaCy is installed using pip, the Python package manager. Most features also need a pretrained pipeline such as en_core_web_sm, which is downloaded separately before it can be loaded:

pip install spacy                          # install the library (shell)
python -m spacy download en_core_web_sm    # download the small English pipeline (shell)
import spacy                               # then import it in Python

Some features of SpaCy are described below.

Tokenization

Tokenization is the fundamental first step in any NLP task. In tokenization, an unstructured piece of text is split into smaller elements such as words, punctuation marks, and white space; these elements are called “tokens”. Tokenization takes an unstructured textual document and gives it a structured format that the rest of the pipeline can work with.

# Tokenization with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Claire had her mind set on buying the Chanel bag from London")
for token in doc:
    print(token.text)

Part-of-Speech (POS) Tagging

Raw text cannot be understood until some grammatical sense is made of it, and SpaCy can do this: every token is assigned a part of speech, which downstream tasks can use when making predictions.

# Part of Speech tagging with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Claire had her mind set on buying the Chanel bag from London")
for token in doc:
    print(token.text, token.pos_)

POS tags cover a large range, some of which are listed in the figure below.

Examples of Parts-of-Speech used by SpaCy

Named Entity Recognition (NER)

Named entities are real-world objects that can be classified: a person, a location, an organization, or a geopolitical entity, for example. Each of these corresponds to a different entity type that can be recognized in a text. For the implementation of NER, spaCy provides a document property called “ents” which holds the entities recognized in the document.

# NER with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Claire had her mind set on buying the Chanel bag from London")
for entity in doc.ents:
    print(entity.text, entity.label_)

Under the hood, spaCy's statistical models use a word embedding strategy based on subword features and a deep convolutional neural network with residual connections, which lets the system work both fast and accurately.

Word Vectors and Similarity

The word embeddings used by spaCy allow different words to be compared to understand how similar they are. Common words like “dog” or “orange” are easily found in the pipeline's vocabulary and therefore have word vectors. But a rarer word such as “strenuous” might not exist in the vocabulary; in that case spaCy falls back to a vector of all zeros, which simply means the word has no learned representation. Note that the small pipeline (“en_core_web_sm”) does not ship with static word vectors; for similarity work, the medium or large pipelines (“en_core_web_md”, “en_core_web_lg”) should be used, because they include pretrained vectors.

# The small pipeline has no static word vectors, so load the medium one here
nlp = spacy.load("en_core_web_md")
doc = nlp("Claire had her mind set on buying the Chanel bag from London")
for token in doc:
    print(token.text, token.has_vector, token.vector_norm)
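
Since word vectors are available, similarity scores come almost for free. The minimal sketch below (assuming the medium pipeline en_core_web_md has been downloaded; the sentences are made up for illustration) compares two short texts and two individual words; the exact scores depend on the model version.

# Similarity sketch, assuming en_core_web_md is installed
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("Claire bought a designer handbag in London")
doc2 = nlp("She purchased an expensive purse while travelling")

# Doc.similarity uses the cosine of the averaged word vectors
print(doc1.similarity(doc2))

# The same comparison works for single words
print(nlp("bag").similarity(nlp("purse")))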

HOW CAN SpaCy HELP

In a movie streaming scenario, there is plenty of textual information such as movie abstracts, reviews, and transcriptions. There are also many text labels attached to each title, such as actor names, the production company, and director names.

Consider the example of movie abstracts. We could build a pipeline in spaCy that generates an embedding representation for each movie. A simple vector-similarity measure such as cosine similarity could then be used to identify movies that are similar to each other on the basis of their abstracts, as shown in the sketch below.
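
As a rough sketch of that idea (the movie titles and abstracts below are made-up placeholders, and the medium pipeline is assumed because it ships with word vectors), each abstract can be reduced to its averaged word vector and compared with cosine similarity:

import spacy
import numpy as np

nlp = spacy.load("en_core_web_md")   # medium pipeline: includes word vectors

# Hypothetical abstracts, used only for illustration
abstracts = {
    "Movie A": "A young wizard discovers a hidden school of magic.",
    "Movie B": "An orphan learns sorcery at a secret academy.",
    "Movie C": "A detective hunts a serial killer through the city.",
}

# Doc.vector is the average of the token vectors in the abstract
vectors = {title: nlp(text).vector for title, text in abstracts.items()}

def cosine(a, b):
    # plain cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["Movie A"], vectors["Movie B"]))   # similar plots, higher score
print(cosine(vectors["Movie A"], vectors["Movie C"]))   # unrelated plots, lower score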

The movie streaming data could also contain features like the overview of the movie or descriptions of song lyrics. These textual features help determine the attributes of a movie, and in a content-based recommendation model they are very important: they make it possible to link one movie's attributes to another's and measure how similar the two titles are. To extract this information, natural language processing is applied to the text, including Named Entity Recognition and Part-of-Speech Tagging.

ADVANTAGES AND LIMITATIONS OF SpaCy

ADVANTAGES

  • Easy to use: SpaCy comes with pre-built models for common NLP tasks such as classification, named entity recognition, and part-of-speech tagging. This is very useful for developers who want to focus on product development with just enough machine learning involved, rather than getting into the nitty-gritty of NLP.
  • Very fast: SpaCy is written in Cython under the hood, which makes it faster than pure Python implementations. This makes it suitable for large scale production systems.
  • Easy model updates: SpaCy's pretrained pipelines are versioned packages, so when a newer and more accurate model is released, upgrading is as simple as downloading it again with spacy download, and spacy validate reports which installed pipelines are out of date. Developers therefore spend very little effort on model versioning.

LIMITATIONS

  • Less customizable: as SpaCy's focus is on ease of use and speed, it leaves relatively little room for customizing the underlying NLP techniques and models. For example, you can't easily build a classifier that takes text, numerical, and image data at the same time to produce a classification.
  • Internals are opaque: users often report that the internal implementations are opaque and hard to work with, so the tool would not suit someone who wants to modify the code base for a custom task.

SpaCy IN A MOVIE RECOMMENDATION SYSTEM

Consider a movie recommendation system like Netflix. We would need to log everything a user watches, as well as the reviews they give on movies, to gain a better understanding of the user. When the user looks for something new to watch, we would like to suggest movies that keep them happy with the recommendation system. Each movie comes with a lot of metadata, including its IMDb rating, title, genre, abstract, transcripts, etc.

Named Entity Recognition, as the name suggests, is used to determine the objects in the text which can be classified as a person, a location, or an organization.

# Run NER on each movie overview, assuming `data` is a DataFrame with an 'overview' column
NLP = spacy.load("en_core_web_sm")
for i in range(len(data['overview'])):
    raw_text = NLP(data['overview'][i])
    ner = []                        # entities found in this movie's overview
    for word in raw_text.ents:
        ner.append(word.text)

For example, running the snippet above on a movie's overview returns the named entities mentioned in it.

Not all of these entities are useful but some of them can definitely be used to describe the movie.
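
To make that concrete, here is a small sketch (the overview text is made up for illustration) that keeps only the entity labels most likely to describe a movie, such as people, places, and organizations:

import spacy

nlp = spacy.load("en_core_web_sm")

# Made-up overview text, for illustration only
overview = ("Set in Paris, the film follows Amelie, a shy waitress who "
            "quietly changes the lives of the people around her.")

doc = nlp(overview)

# Keep only entity types that say something about the movie itself
KEEP_LABELS = {"PERSON", "GPE", "ORG", "WORK_OF_ART"}
useful = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in KEEP_LABELS]
print(useful)   # e.g. [('Paris', 'GPE'), ('Amelie', 'PERSON')]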

Similar work can be done with POS tagging. For the same overview, every word is assigned a tag. Here too, not all parts of speech are important: determiners like “the” or conjunctions like “and” are very common when describing a film, but they don't provide any details about the movie. For this task, tokens that correspond to verbs, adverbs, and adjectives are kept while the rest are discarded.

# POS-tag each overview and keep only verbs, adjectives, and adverbs
NLP = spacy.load("en_core_web_sm")
for i in range(len(data['overview'])):
    raw_text = NLP(data['overview'][i])
    pos = []                        # content-bearing words in this movie's overview
    for word in raw_text:
        if word.pos_ in ("VERB", "ADJ", "ADV"):
            pos.append(word.text)

In the same way, running the snippet above on an overview returns only the verbs, adjectives, and adverbs that describe the movie.
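
Putting the two steps together, a content-based profile for each movie could combine its named entities with its content-bearing words. The sketch below assumes, as in the earlier snippets, a DataFrame named data with an 'overview' column; the sample rows are made up.

import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

# Assumed input, matching the snippets above: a DataFrame with an 'overview' column
data = pd.DataFrame({"overview": [
    "A retired hitman seeks revenge on the gangsters who betrayed him.",
    "Two estranged sisters reunite for one last summer at the family lake house.",
]})

def text_features(text):
    # Combine named entities with verbs, adjectives, and adverbs into one keyword set
    doc = nlp(text)
    entities = {ent.text.lower() for ent in doc.ents}
    keywords = {tok.lemma_.lower() for tok in doc if tok.pos_ in {"VERB", "ADJ", "ADV"}}
    return entities | keywords

data["features"] = data["overview"].apply(text_features)
print(data["features"].iloc[0])

These per-movie keyword sets could then feed a simple overlap- or TF-IDF-based similarity measure between titles in a content-based recommender.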

CONCLUSION

SpaCy gives regular developers a great way to include NLP models in their applications without spending much on researching and developing those models themselves. With SpaCy, a team can focus more on the actual objective of the project.
