spaCy Basics

Shreya Khandelwal
3 min read · Jul 19, 2023


spaCy is an open-source library for natural language processing in Python. It extracts information from text and can process large volumes of data efficiently. spaCy provides trained language models for many different languages.

We will cover an introduction to:

  • Loading the language library
  • Building a pipeline object
  • Tokenization
  • Part-of-speech (POS) tagging
  • Syntactic dependencies

The very first step is to import spaCy and load a language model. Here we use the small English model, en_core_web_sm.

import spacy
nlp = spacy.load('en_core_web_sm')
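
If the model is not installed yet, spacy.load will raise an OSError. The model can be downloaded from the command line first:

pip install -U spacy
python -m spacy download en_core_web_sm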

Next, we create a Doc object by calling the nlp pipeline on raw text. This runs a series of operations on the text, such as tokenization and part-of-speech tagging. The u prefix marks the string as Unicode; in Python 3 it is optional, since all strings are Unicode.

doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

The pipeline parses the entire string into individual tokens. We can iterate over them and inspect their attributes:

  • token.text returns the text of the token
  • token.pos_ returns the part-of-speech tag of the token
  • token.dep_ returns the syntactic dependency label of the token

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)

OUTPUT:
Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj
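
If a tag abbreviation is unfamiliar, spacy.explain returns a human-readable description. A quick aside:

# Look up what a POS tag or dependency label means
print(spacy.explain('PROPN'))   # proper noun
print(spacy.explain('nsubj'))   # nominal subject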

The pipeline object applies a series of components to the text, such as the tagger, parser, and named entity recognizer. We can inspect them:

nlp.pipeline

OUTPUT:
[('tagger', <spacy.pipeline.Tagger at 0x237cb1e8f98>),
('parser', <spacy.pipeline.DependencyParser at 0x237cb2852b0>),
('ner', <spacy.pipeline.EntityRecognizer at 0x237cb285360>)]

To get just the names of the pipeline components:

nlp.pipe_names

OUTPUT:
['tagger', 'parser', 'ner']
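
As a side note, components can be switched off temporarily when they are not needed, which speeds up processing. A minimal sketch, assuming spaCy v3 (older versions use nlp.disable_pipes instead of nlp.select_pipes):

# Run the pipeline without the parser and NER components
with nlp.select_pipes(disable=['parser', 'ner']):
    doc_fast = nlp(u'Tesla is looking at buying U.S. startup for $6 million')
    print([token.pos_ for token in doc_fast])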

Tokenization

The first step in processing text is to split up all the component parts (words & punctuation) into “tokens”. We’ll go into much more detail on tokenization in an upcoming article.

doc_2 = nlp(u"Tesla isn't     looking into startuos anymore.")

for token in doc_2:
    print(token.text, token.pos_, token.dep_)

OUTPUT:
Tesla PROPN nsubj
is AUX aux
n't PART neg
SPACE dep
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct
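
Notice that the run of extra spaces between "isn't" and "looking" is kept as a single whitespace token instead of being thrown away. A quick check of the token count confirms this:

# The whitespace run counts as one token, so doc_2 has nine tokens,
# matching the nine lines of output above
print(len(doc_2))

OUTPUT:
9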

We can also retrieve an individual token by index:

doc_2[0]

OUTPUT:
Tesla

Part-of-Speech (POS) Tagging

After tokenization, each token is assigned a part-of-speech tag. Tokens are categorized into classes such as PROPN, AUX, and SPACE depending on their meaning and context.

doc[0].pos_

OUTPUT:
'PROPN'
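
Alongside the coarse pos_ label, every token also carries a fine-grained tag_, and spacy.explain works on these too. A small sketch (the exact tags depend on the model version):

# Coarse-grained vs. fine-grained part-of-speech tags
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))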

Dependencies

Dependency labels describe the syntactic relation between words in a sentence, e.g. aux, nsubj, prep, punct.

doc_2[0].dep_

OUTPUT:
'nsubj'
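
The dependency structure is much easier to read as a diagram. spaCy ships with the displaCy visualizer, which renders the parse inline in a Jupyter notebook (outside a notebook, displacy.serve starts a local web server instead):

from spacy import displacy

# Draw the dependency arcs for the first example sentence
displacy.render(doc, style='dep', jupyter=True)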

Sentences

We can perform several operations on sentences, such as checking whether a token starts a sentence or splitting a document into its sentences.

doc_3 = nlp(u"This is the first sentence. This is the seocnd sentence. This is the third sentence")

for sentence in doc_3.sents:
    print(sentence)

OUTPUT:
This is the first sentence.
This is the second sentence.
This is the third sentence
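
Note that doc_3.sents is a generator, so it cannot be indexed directly; wrapping it in a list allows random access. A small sketch:

# sents is a generator; materialize it as a list for indexing
sentences = list(doc_3.sents)
print(len(sentences))   # 3
print(sentences[1])     # This is the second sentence.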

Checking the start of a sentence

The is_sent_start attribute checks whether the token at a particular index starts a sentence. spaCy offers many similar token attributes.

doc_3 = nlp(u"This is the first sentence. This is the seocnd sentence. This is the third sentence")

print(doc_3[6].is_sent_start)
print(doc_3[4].is_sent_start)

OUTPUT:
True
False

Spans

We can slice a document and get a portion of its text using index notation: doc[start:stop].

doc_4 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

life_quote = doc_4[16:30]
print(life_quote)

OUTPUT:
"Life is what happens to us while we are making other plans"
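
The slice is not a plain string but a Span object, a lightweight view into the original Doc. A quick check:

# Slicing a Doc yields a Span, not a str
print(type(life_quote))   # <class 'spacy.tokens.span.Span'>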

Find my source code for this article.

About Me

I’m Shreya Khandelwal, Data Scientist at IBM. Feel free to connect with me on LinkedIn!

Follow me on Medium for regular updates on similar topics.
