SpaCy Basics: The Importance of Tokens in Natural Language Processing

Christopher Lewis
Published in Analytics Vidhya · May 12, 2021

What is Tokenization?

Before processing a natural language, we want to identify the words that constitute a string of characters. That’s why tokenization is a foundational step in Natural Language Processing. This process is important because the meaning of the text can be interpreted through analysis of the words present in the text. Tokenization is the process of breaking apart original text into individual pieces (tokens) for further analysis. Tokens are pieces of the original text; they are not broken down into a base form. In this blog, we will be using the spaCy library to tokenize some created text documents to help understand the meaning of the text by examining the relationship between the tokens. SpaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

Installing spaCy

To use the spaCy library, we must first install it. We can install spaCy using either conda or pip. Make sure you have administrative rights so you can successfully install the specific spaCy language package. For more information, please visit https://spacy.io/usage/.

1. Installing spaCy with conda or pip:

conda install -c conda-forge spacy
OR
pip install -U spacy

2. Installing the language package (make sure you have admin rights):

python -m spacy download en_core_web_sm

Setting Up A Jupyter Notebook

Once we have successfully installed the spaCy library and the language package, we can open a Jupyter Notebook and access spaCy! The first thing we need to do is import the spaCy library:

# Import spaCy and load the language library
import spacy

Once we’ve downloaded and installed a trained pipeline, we can load it via spacy.load(). This will return a Language object containing all components and data needed to process text. The variable that references the Language object is typically called nlp.

nlp = spacy.load('en_core_web_sm')

We can also view the components within the NLP pipeline by saying:

for item in nlp.pipeline:
    print(item)

Each item is a (name, component) tuple — these are the objects inside the NLP pipeline.

Creating A Text Document

Using the NLP pipeline, we can pass in a Unicode string that contains whatever words we want to represent our new document:

doc = nlp(u"Here is our new fancy document. It's not very complex, but it will get the job done.")

Viewing Each Token Within the Document

Understand that tokens are the basic building blocks of a document object. Everything that helps us comprehend the meaning of the text is derived from each token object and its relationship with other tokens. To view each token in our document, we can say:

for token in doc:
    print(token.text)

We can also view the number of tokens within the document by simply saying len(doc). Besides viewing the actual text of a token, we can also view the part of speech and syntactic dependency each token has. Below, we will create a simple function that allows us to see each token’s text, part of speech, and syntactic dependency:

def doc_breakdown(doc):
    # Print each token's text, part-of-speech tag, and syntactic dependency
    for token in doc:
        print(f'{token.text:<12} {token.pos_:<8} {token.dep_}')

When we run this on our doc variable:

doc_breakdown(doc)

It returns the text, part-of-speech tag, and syntactic dependency for each token in the document. Note that even though our document object technically contains a sequence of tokens, it does not support item reassignment (we cannot say doc[0] = 'New'). Also, if we are not sure what a certain tag or label stands for, we can take advantage of the spacy.explain() function by passing in the label we want identified.

We can also simply pass in the string of the tag or label we want spaCy to explain:

spacy.explain('advmod')  # 'adverbial modifier'

Named Entity Objects

Named entity objects take tokens to the next level. If we check the contents of the nlp pipeline, we see that it contains an 'ner' component: the Named Entity Recognizer. This allows the nlp pipeline to recognize that certain words are organization names, locations, monetary values, dates, etc. Named entities are accessible through the .ents property of a Document object. Let's create a new document and try it out! Let's use this string as our new document:

“Tesla Company will pay $750,000 and build a solar roof to settle dozens of air-quality violations at its Fremont Factory.”

# Creating a new document
doc2 = nlp(u"Tesla Company will pay $750,000 and build a solar roof to settle dozens of air-quality violations at its Fremont factory.")

# Print each entity, its label, and an explanation of the label
for ent in doc2.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))

Above we can see the tokens within the document that spaCy recognizes as named entities and the tags spaCy gives to each entity.

Using displacy With the Experimental Jupyter Parameter

One final thing we will touch on before ending this blog is the displacy module within the spaCy library. Displacy is a built-in dependency visualizer that lets us check our model’s predictions. We can pass in one or more Document objects and start a web server, export HTML files, or even view the visualization directly from a Jupyter Notebook. Since we are using a Jupyter Notebook for this blog, we will be viewing our visualizations directly from our notebook. To use displacy, we will import the module from the spaCy library:

from spacy import displacy

The method in displacy we are going to focus on is render(). This method requires a Document object and supports two visualization styles: 'ent' and 'dep'. The 'ent' style highlights the named entities within the document based on their label type.

The Entity Style

Let’s use doc2 with displacy to visualize the named entities within the document:

displacy.render(doc2, style='ent', jupyter=True)

In the above visualization, we see the named entities within the document, their labels, and a color associated with each label type. Notice that the MONEY and CARDINAL labels are the same color. Let's say we want our cardinal and money entities to be visually different. We can pass a dictionary of options into the render() method.

The first dictionary we create is called colors, which contains key-value pairs of entity labels and the desired color to associate with those labels. Since we want to create different colors for MONEY and CARDINAL entities, we include their names as the keys in the colors dictionary. We can give the MONEY entity a light green color, and give the CARDINAL entity a linear-gradient of yellow to orange (to display how complex you can get with these color combinations). After that, we create another dictionary called options. This dictionary will have a key called ‘colors’ and the value will be the colors dictionary. The options dictionary is then put into the options parameter of the render() method.

Note that we could provide custom colors for every entity, and we could also limit which entities are visualized by passing in a key of 'ents' whose value is a list of the target entity labels.

The Dependency Style

The ‘dep’ style creates a dependency plot that visualizes part-of-speech tags and syntactic dependencies for the tokens within the document object. Let’s create a new document to visualize a dependency plot:

doc3 = nlp(u"SpaCy Basics: The Importance of Tokens in Natural Language Processing")

Now let’s run our new document through the displacy.render() method:

displacy.render(doc3, style='dep', jupyter=True)

Same as before, we can customize our image with an options dictionary to make it pop more! Let's add a background color, change the font, and shrink the visual distance between tokens (the 'dep' style accepts options such as 'bg', 'font', and 'distance'):

options = {'bg': '#09a3d5', 'font': 'Arial', 'distance': 90}
displacy.render(doc3, style='dep', jupyter=True, options=options)

This concludes the blog. I hope you enjoyed it and maybe learned something new! If you have any questions, please feel free to reach out!


I am an aspiring Data Scientist and Data Analyst skilled in Python, SQL, Tableau, Computer Vision, Deep Learning, and Data Analytics.