Deep Dive into spaCy: Techniques and Tips

Rakesh Rajpurohit
8 min read · Oct 1, 2023


spaCy is an open-source library for advanced natural language processing in Python. It is designed specifically for production use, which means it’s not only powerful but also fast and efficient. spaCy is widely used in academia and industry for various NLP tasks, such as tokenization, part-of-speech tagging, named entity recognition, and more.

Installing spaCy

You can install spaCy using pip, Python's package manager. Open your terminal or command prompt and run the following command:

pip install spacy

Once spaCy is installed, you’ll need to download a language model. spaCy provides pre-trained models for multiple languages. To download the English language model, run the following command:

python -m spacy download en_core_web_sm

Great! You now have spaCy installed and ready to use.

Basic Functionality

Let’s explore some of the basic functionalities of spaCy:

Tokenization

Tokenization is the process of splitting text into individual words or tokens. spaCy makes tokenization a breeze. Here’s a quick example:

import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")
# Process a text
text = "Hello, spaCy is amazing!"
doc = nlp(text)
# Print tokens
for token in doc:
    print(token.text)

You’ll see that spaCy has split the text into the tokens “Hello”, “,”, “spaCy”, “is”, “amazing”, and “!”. Note that punctuation marks become tokens of their own.

Part-of-Speech Tagging

Part-of-speech (POS) tagging assigns grammatical categories to words in a text. spaCy can do this effortlessly:

# Part-of-speech tagging
for token in doc:
    print(f"{token.text}: {token.pos_}")

You’ll get output like “Hello: INTJ,” “spaCy: PROPN,” and “is: AUX” (spaCy follows the Universal Dependencies scheme, which tags copular “be” as an auxiliary).

Named Entity Recognition (NER)

Named entity recognition is the process of identifying named entities in text, such as names of people, places, and organizations:

# Named entity recognition
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

This will extract entities like “Apple Inc.” (ORG) and “Steve Jobs” (PERSON).
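
For a quick visual check of what the model found, spaCy’s built-in displaCy visualizer highlights each entity span with its label:

from spacy import displacy

# Renders highlighted entities inline in a Jupyter notebook;
# use displacy.serve(doc, style="ent") to view them in the browser instead.
displacy.render(doc, style="ent", jupyter=True)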

Training Custom NER Models

While spaCy’s pre-trained models are powerful, they may not cover domain-specific entities. Training a custom NER model lets you teach spaCy to recognize entities relevant to your use case. Let’s go through the process with an example.

Gathering and Annotating Data

To train a custom NER model, you need labeled data where entities are annotated. Here’s an example of labeled data in spaCy’s training format:

TRAIN_DATA = [
    ("Apple is headquartered in Cupertino, California.", {"entities": [(0, 5, "ORG"), (26, 47, "LOC")]}),
    ("Elon Musk is the CEO of SpaceX.", {"entities": [(0, 9, "PERSON"), (24, 30, "ORG")]}),
    # Add more annotated examples here
]

In this example, we’ve annotated “Apple” as an organization (ORG) and “Cupertino, California” as a location (LOC) in the first sentence. The numbers are character offsets into the raw text, and they must line up exactly with token boundaries for spaCy to use the span.
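
Because off-by-one offsets are easy to introduce, it’s worth validating annotations before training. Here’s a minimal check built on Doc.char_span, which returns None whenever a span doesn’t align with token boundaries:

import spacy

nlp = spacy.blank("en")

for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            print(f"Misaligned span ({start}, {end}) in: {text!r}")
        else:
            print(f"OK: {span.text} -> {label}")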

Training the Custom NER Model

Next, you’ll train the custom NER model using your annotated data:

import spacy
import random
from spacy.training.example import Example

# Load spaCy's blank English model
nlp = spacy.blank("en")
# Create an NER component in the pipeline
ner = nlp.add_pipe("ner")
# Add entity labels to the NER component
ner.add_label("ORG")
ner.add_label("LOC")
ner.add_label("PERSON")
# Initialize the model weights before training
nlp.initialize()
# Training loop
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], drop=0.5)

In this code, we load a blank English model, add an NER component, and register the entity labels we want to recognize (including PERSON, since the training data contains it). After initializing the weights, we shuffle the training data each epoch and update the model.
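
With only a handful of examples the model won’t generalize well, but a quick check on unseen text shows whether training had any effect at all:

# Try the freshly trained model on a new sentence
doc = nlp("Google is headquartered in Mountain View, California.")
for ent in doc.ents:
    print(ent.text, ent.label_)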

Evaluating NER Performance

Once you’ve trained your custom NER model, it’s crucial to evaluate its performance. spaCy provides built-in evaluation capabilities:

# Evaluate the NER model
from spacy.training.example import Example

evaluate_data = [
    ("Apple Inc. is based in Cupertino, California.", {"entities": [(0, 10, "ORG"), (23, 44, "LOC")]}),
    ("Tesla, Inc. operates in Palo Alto.", {"entities": [(0, 11, "ORG"), (24, 33, "LOC")]}),
    # Add more evaluation examples here
]
# nlp.evaluate expects Example objects, not raw (text, annotations) tuples
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in evaluate_data
]
scores = nlp.evaluate(examples)
print(scores)

This code evaluates the NER model’s performance on the provided evaluation data. The returned dictionary includes precision, recall, and F1-score for the entity recognizer (the ents_p, ents_r, and ents_f keys), which you can use to assess the model’s accuracy.

Text Classification in spaCy

Let’s dive right into text classification with spaCy. In this example, we’ll create a simple text classifier to distinguish between movie reviews as positive or negative.

Data Preparation

First, we need labeled data for training our text classifier. Here’s an example dataset:

TRAIN_DATA = [
    ("This movie is fantastic!", {"cats": {"positive": 1, "negative": 0}}),
    ("Worst film I've ever seen.", {"cats": {"positive": 0, "negative": 1}}),
    # Add more labeled examples here
]

Each example consists of a text review and a dictionary indicating whether it’s positive (1) or negative (0).

Training the Text Classifier

Let’s train our text classifier using spaCy:

import spacy
import random
from spacy.training.example import Example

# Load spaCy's blank English model
nlp = spacy.blank("en")
# Create a text classification component in the pipeline
textcat = nlp.add_pipe("textcat")
# Add class labels to the text classification component
textcat.add_label("positive")
textcat.add_label("negative")
# Initialize the model weights before training
nlp.initialize()
# Training loop
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], drop=0.5, losses=losses)
    print(losses)

In this code, we load a blank English model, add a text classification component, and specify the class labels (“positive” and “negative”). After initializing the weights, we shuffle the training data each epoch and update the model, printing the loss after each pass.

Making Predictions

Now that we have a trained text classifier, we can use it to make predictions on new text data:

# New text data
new_text = "This movie exceeded my expectations!"

# Process the new text
doc = nlp(new_text)
# Get the text classification scores
scores = doc.cats
# Determine the predicted class
predicted_class = max(scores, key=scores.get)
print(f"Predicted class: {predicted_class}")

In this example, we process the new text, obtain the class scores (“positive” and “negative”), and determine the predicted class based on the highest score.
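
For more than a handful of documents, nlp.pipe processes texts as a stream and is considerably faster than calling nlp on each string individually:

# Classify a batch of reviews efficiently
reviews = [
    "An instant classic.",
    "I want those two hours of my life back.",
]
for doc in nlp.pipe(reviews):
    predicted = max(doc.cats, key=doc.cats.get)
    print(f"{doc.text!r} -> {predicted}")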

Advanced NLP Techniques with spaCy

We’re diving deeper into advanced Natural Language Processing (NLP) techniques with spaCy. We’ll cover topics like dependency parsing, coreference resolution, and text summarization.

Dependency Parsing with spaCy

Dependency parsing is the process of analyzing the grammatical structure of a sentence to determine the relationships between words. spaCy makes dependency parsing straightforward:

import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")
# Process a sentence
sentence = "The cat chased the mouse."
doc = nlp(sentence)
# Print the dependency tree
for token in doc:
    print(f"{token.text} --{token.dep_}--> {token.head.text}")

This code processes a sentence and prints the dependency relationships between words. For example, “cat” is the subject of the verb “chased,” so it has a dependency relationship of “nsubj” (nominal subject) with “chased.”
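
The parse is easier to read as a tree. spaCy’s built-in displaCy visualizer can draw it as an arc diagram, rendered inline in a notebook or served in the browser:

from spacy import displacy

# Renders an arc diagram of the dependency parse in a Jupyter notebook;
# use displacy.serve(doc, style="dep") to view it in the browser instead.
displacy.render(doc, style="dep", jupyter=True)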

Coreference Resolution

Coreference resolution is the task of determining when two or more expressions in a text refer to the same entity. Note that coreference is not part of spaCy’s core pipeline: the en_coref_md model below comes from the third-party neuralcoref extension, which only works with spaCy 2.x (modern spaCy offers an experimental component via the spacy-experimental package instead):

import spacy

# Load an English model with coreference resolution
# (requires the neuralcoref extension and spaCy 2.x)
nlp = spacy.load("en_coref_md")
# Process a text with coreference resolution
text = "John is a software engineer. He is very skilled at programming."
doc = nlp(text)
# Print resolved coreference clusters
for cluster in doc._.coref_clusters:
    print(cluster)

In this example, the coreference component identifies that “He” refers to “John.”

Text Summarization

Text summarization is the process of generating a concise and coherent summary of a longer text. While spaCy doesn’t provide built-in text summarization, you can use other libraries and techniques in combination with spaCy to accomplish this task.
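
A common do-it-yourself approach is extractive summarization: score each sentence by the frequency of its content words and keep the top few. Here’s a minimal sketch built on spaCy’s tokenizer and sentence segmenter (the scoring scheme is an illustrative choice, not a spaCy API):

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def summarize(text, n_sentences=2):
    doc = nlp(text)
    # Count content words, skipping stop words and punctuation
    freq = Counter(
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and not token.is_stop
    )
    # Score each sentence by the frequency of its content words
    scored = [
        (sum(freq[t.lemma_.lower()] for t in sent if t.is_alpha), sent)
        for sent in doc.sents
    ]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n_sentences]
    # Restore original sentence order for readability
    top.sort(key=lambda pair: pair[1].start)
    return " ".join(sent.text for _, sent in top)

print(summarize("spaCy is fast. It is written in Python and Cython. "
                "Many teams use spaCy in production. The weather was nice."))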

Building a Simple Chatbot

Let’s build a basic chatbot using spaCy to handle user queries. In this example, our chatbot will understand greetings and provide responses accordingly.

Setting Up the Chatbot

We’ll start by creating a Python script for our chatbot. Ensure you have spaCy and its language model installed. We’ll use spaCy for natural language understanding.

import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")
# Define a dictionary of greetings and responses
greetings = {
    "hello": "Hello! How can I assist you today?",
    "hi": "Hi there! How can I help you?",
    "hey": "Hey! How can I assist you?",
}
# Function to generate chatbot responses
def chatbot_response(user_input):
    user_input = user_input.lower()
    if user_input in greetings:
        return greetings[user_input]
    return "I'm sorry, I don't understand that. How can I assist you?"
# Main loop
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    response = chatbot_response(user_input)
    print("Chatbot:", response)

In this script, we’ve loaded spaCy’s English model and defined a dictionary of greetings and responses. The chatbot_response function lowercases the user input, checks it against the greeting dictionary, and falls back to a default reply. Note that this lookup is an exact string match, so spaCy isn’t doing any real work yet; see the sketch below.
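
To make the matching less brittle, spaCy’s tokenizer and lemmatizer can spot a greeting even when it’s embedded in a longer sentence. A small sketch of that idea, reusing the nlp and greetings objects from the script above:

def chatbot_response(user_input):
    doc = nlp(user_input)
    # Match on lemmas so "Hello there!" or "hey bot" still hits a greeting
    for token in doc:
        lemma = token.lemma_.lower()
        if lemma in greetings:
            return greetings[lemma]
    return "I'm sorry, I don't understand that. How can I assist you?"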

A Comprehensive Example

The example below demonstrates various spaCy functionalities in a single script: tokenization, part-of-speech tagging, dependency parsing, lemmatization, sentence boundary detection (SBD), named entity recognition (NER), entity linking (EL), text similarity, rule-based matching, the spaCy NLP pipeline, training a text classification model, and serialization.

import spacy
from spacy.matcher import Matcher
from spacy.training.example import Example

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple Inc. is a technology company headquartered in Cupertino, California. " \
"It was founded by Steve Jobs in 1976. " \
"The company is known for its innovative products like the iPhone."

# Tokenization and Part-of-Speech Tagging
doc = nlp(text)
print("Tokenization and Part-of-Speech Tagging:")
for token in doc:
    print(f"{token.text} ({token.pos_})")

# Dependency Parsing and Lemmatization
print("\nDependency Parsing and Lemmatization:")
for token in doc:
    print(f"{token.text} ({token.dep_} -> {token.head.text}), Lemma: {token.lemma_}")

# Sentence Boundary Detection (SBD)
print("\nSentence Boundary Detection:")
for sent in doc.sents:
    print(sent)

# Named Entity Recognition (NER)
print("\nNamed Entity Recognition (NER):")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

# Entity Linking (EL)
# Note: en_core_web_sm ships without an entity_linker component or knowledge
# base, so kb_id_ is empty here; a trained EntityLinker is needed for real IDs.
print("\nEntity Linking (EL):")
for ent in doc.ents:
    if ent.kb_id_:
        print(f"{ent.text} ({ent.kb_id_} -> {ent.label_})")

# Text Similarity
# Note: en_core_web_sm has no word vectors, so similarity is approximated from
# context tensors; use en_core_web_md or en_core_web_lg for meaningful scores.
text1 = "Apple is a technology company."
text2 = "Microsoft specializes in software."
doc1 = nlp(text1)
doc2 = nlp(text2)
similarity = doc1.similarity(doc2)
print(f"\nText Similarity: {similarity:.2f}")

# Rule-based Matching
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "technology"}, {"LOWER": "company"}]
matcher.add("TechCompany", [pattern])
matches = matcher(doc)
print("\nRule-based Matching:")
for match_id, start, end in matches:
    matched_text = doc[start:end].text
    print(f"Match found: {matched_text}")

# SpaCy NLP Pipeline
print("\nSpaCy NLP Pipeline:")
print(nlp.pipe_names)

# Training a Text Classification Model
TRAIN_DATA = [
    ("This product is amazing!", {"cats": {"positive": 1, "negative": 0}}),
    ("I'm not satisfied with this service.", {"cats": {"positive": 0, "negative": 1}})
]

textcat = nlp.add_pipe("textcat")
textcat.add_label("positive")
textcat.add_label("negative")

# Initialize only the new component, then train it with the rest of the
# pipeline disabled so the pre-trained components are left untouched
train_examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]
textcat.initialize(lambda: train_examples, nlp=nlp)
optimizer = nlp.create_optimizer()
with nlp.select_pipes(enable=["textcat"]):
    for epoch in range(10):
        for example in train_examples:
            nlp.update([example], sgd=optimizer, drop=0.5)

new_text = "This product exceeded my expectations!"
doc = nlp(new_text)
scores = doc.cats
predicted_class = max(scores, key=scores.get)

print("\nText Classification:")
print(f"Predicted class: {predicted_class}")

# Serialization
nlp.to_disk("spacy_model")
loaded_nlp = spacy.load("spacy_model")
loaded_doc = loaded_nlp(new_text)
print("\nSerialization:")
print("Loaded Doc Tokens:")
for token in loaded_doc:
    print(f"{token.text} ({token.pos_})")

Combining BERT with spaCy

This simulates a real-life application where you might want to sanity-check or enrich entity recognition with context-aware embeddings from BERT. Each entity spaCy recognizes is compared against the full text using BERT embeddings, and you can adjust the similarity threshold to control how readily entities are marked as linked, depending on your application and data.

import spacy
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Load the BERT model and tokenizer
bert_model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
bert_model = BertModel.from_pretrained(bert_model_name)

# Define a text to process
text = "Apple Inc. is a technology company headquartered in Cupertino, California. " \
       "It was founded by Steve Jobs in 1976. " \
       "The company is known for its innovative products like the iPhone."

# Run spaCy's pipeline to get the entities
doc = nlp(text)

def bert_embed(piece):
    """Mean-pool BERT's token embeddings into a single vector for the input."""
    input_ids = tokenizer.encode(piece, add_special_tokens=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert_model(input_ids)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# Embedding for the whole text
text_embedding = bert_embed(text)

# Combine spaCy NER and BERT embeddings: compare each entity's BERT embedding
# with the embedding of the full text. (Both vectors must come from BERT;
# spaCy's small 96-dim vectors are not directly comparable to BERT's 768-dim.)
for ent in doc.ents:
    ent_embedding = bert_embed(ent.text)
    similarity = F.cosine_similarity(text_embedding, ent_embedding, dim=0)
    if similarity.item() > 0.75:  # Adjust similarity threshold as needed
        print(f"Entity: {ent.text}, Type: {ent.label_}, Similarity: {similarity.item():.2f} (Linked)")
    else:
        print(f"Entity: {ent.text}, Type: {ent.label_}, Similarity: {similarity.item():.2f} (Not Linked)")

Reference:
https://spacy.io/usage/spacy-101#features
