spaCy vs NLTK: Code and Result Comparison of Basic NLP Operations

Asha Ponraj · Published in Analytics Vidhya · Apr 17, 2021

In this article, we will explore the code for basic NLP operations using both NLTK and spaCy.

NLTK

NLTK is an open-source library that is well suited for teaching, and working in, computational linguistics using Python.

It also offers industrial-strength components.

spaCy

spaCy is an open-source library for advanced NLP in Python.

It is designed specifically for production use and can handle large volumes of text, whereas NLTK and CoreNLP were created mainly for teaching and research purposes.

spaCy provides advanced NLP techniques that are widely used in complex applications such as text summarization, text to speech, domain-specific NER, Q&A, emotion detection, etc.

I am planning to explore one-by-one and share it with you in a series of posts.

First Releases

[Image by Author: first-release comparison of NLTK and spaCy]

It is widely mentioned in blog posts and articles that spaCy is faster, has almost all the features provided by other libraries (NLTK, CoreNLP, etc.), and delivers more or less similar accuracy.

In this article we are going to analyze and compare code for the most basic NLP operations in spaCy and NLTK.

We are not going to compare the speed and accuracy of these libraries; however, knowing the code and results from both may help in future research.

#SPACY
import spacy

#NLTK
import nltk
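
Both libraries need their models and data fetched once before the snippets below will run. A minimal setup sketch (resource names as they were around the time of writing; adjust for your installed versions):

# spaCy's small English model is installed from the command line:
#   python -m spacy download en_core_web_sm
# NLTK fetches its data packages at runtime:
nltk.download('punkt')                       # word/sentence tokenizers
nltk.download('stopwords')                   # the English stopword list
nltk.download('averaged_perceptron_tagger')  # default POS tagger
nltk.download('maxent_ne_chunker')           # named-entity chunker
nltk.download('words')                       # word corpus used by the chunker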

Word Tokenization

In spaCy:

text = "He is a 43 year old gentleman who is referred for consultation by Dr. Tamil Buhari.  About a week ago he slipped on the driveway at home and sustained an injury to his left ankle.  He was seen at My-City Hospital and was told he had a fracture.  He was placed in an air splint and advised to be partial weight bearing, and he is using a cane.  He is here for routine follow-up."nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)

Output:

['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']

In NLTK:

from nltk.tokenize import word_tokenize
print(word_tokenize(text))

Output:

['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']
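
Although the two token lists look identical, the libraries return different things: spaCy yields rich Token objects carrying lemmas, flags, and POS information, while NLTK's word_tokenize returns plain Python strings. A small illustrative sketch:

doc = nlp("Samsung has expanded overseas.")
for token in doc:
    print(token.text, token.lemma_, token.is_punct)

print(type(word_tokenize("Samsung has expanded overseas.")[0]))  # <class 'str'>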

Sentence Tokenization

In spaCy:

text = "He is a 43 year old gentleman who is referred for consultation by Dr. Tamil Buhari.  About a week ago he slipped on the driveway at home and sustained an injury to his left ankle.  He was seen at My-City Hospital and was told he had a fracture.  He was placed in an air splint and advised to be partial weight bearing, and he is using a cane.  He is here for routine follow-up."

doc = nlp(text)
for sent in doc.sents:
    print(sent)

Output:

He is a 43 year old gentleman who is referred for consultation by Dr. Tamil Buhari.
About a week ago he slipped on the driveway at home and sustained an injury to his left ankle.
He was seen at My-City Hospital and was told he had a fracture.
He was placed in an air splint and advised to be partial weight bearing, and he is using a cane.
He is here for routine follow-up.

In NLTK:

from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))

Output:

['He is a 43 year old gentleman who is referred for consultation by Dr. Tamil Buhari.', 'About a week ago he slipped on the driveway at home and sustained an injury to his left ankle.', 'He was seen at My-City Hospital and was told he had a fracture.', 'He was placed in an air splint and advised to be partial weight bearing, and he is using a cane.', 'He is here for routine follow-up.']
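
Note that doc.sents yields spaCy Span objects rather than plain strings; if you want a list directly comparable to NLTK's output, take .text on each span. A minimal sketch:

spacy_sents = [sent.text for sent in doc.sents]
print(spacy_sents)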

Stopword Removal

In spaCy:

text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """

doc = nlp(text)

text_without_sw = [token.text for token in doc if not token.is_stop]
print(text_without_sw)

Output:

['outlay', 'home', '.', 'surprise', ',', '.', 'Samsung', 'expanded', 'overseas', ',', 'South', 'Korea', 'host', 'factories', 'research', 'engineers', '.']
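
Under the hood, spaCy's English stopword list is a plain Python set that you can inspect directly; a small sketch (the exact size varies by spaCy version):

from spacy.lang.en.stop_words import STOP_WORDS
print(len(STOP_WORDS))         # size of the list in the installed spaCy version
print("most" in STOP_WORDS)    # True: this is why 'most' disappeared above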

In NLTK:

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

# note: str.split() keeps punctuation attached to words ("home.", "there,")
words = text.split()
text_without_sw = [token for token in words if token not in stop_words]
print(text_without_sw)

Output:

['Most', 'outlay', 'home.', 'No', 'surprise', 'there,', 'either.', 'While', 'Samsung', 'expanded', 'overseas,', 'South', 'Korea', 'still', 'host', 'factories', 'research', 'engineers.']
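
The two results differ for two reasons: str.split() leaves punctuation glued to the words ('home.', 'there,'), and NLTK's stopword list is all lowercase, so capitalized words such as 'Most' and 'While' survive. A sketch that brings the NLTK result closer to spaCy's, using word_tokenize and a case-insensitive check:

# use the real tokenizer and lowercase before the membership test
words = word_tokenize(text)
text_without_sw = [token for token in words if token.lower() not in stop_words]
print(text_without_sw)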

POS Tagging

In spaCy:

text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """

doc = nlp(text)

tokens_with_POS = [token.text + " - " + token.pos_ for token in doc]
print(tokens_with_POS)

Output:


['Most - ADJ', 'of - ADP', 'the - DET', 'outlay - NOUN', 'will - AUX', 'be - VERB', 'at - ADP', 'home - NOUN', '. - PUNCT', 'No - DET', 'surprise - NOUN', 'there - ADV', ', - PUNCT', 'either - ADV', '. - PUNCT', 'While - SCONJ', 'Samsung - PROPN', 'has - AUX', 'expanded - VERB', 'overseas - ADV', ', - PUNCT', 'South - PROPN', 'Korea - PROPN', 'is - AUX', 'still - ADV', 'host - NOUN', 'to - ADP', 'most - ADJ', 'of - ADP', 'its - PRON', 'factories - NOUN', 'and - CCONJ', 'research - NOUN', 'engineers - NOUN', '. - PUNCT']

In NLTK:

from nltk.tag import pos_tag

sent = nltk.word_tokenize(text)
sent = pos_tag(sent)
print(sent)

Output:

[('Most', 'JJS'), ('of', 'IN'), ('the', 'DT'), ('outlay', 'NN'), ('will', 'MD'), ('be', 'VB'), ('at', 'IN'), ('home', 'NN'), ('.', '.'), ('No', 'DT'), ('surprise', 'NN'), ('there', 'RB'), (',', ','), ('either', 'DT'), ('.', '.'), ('While', 'IN'), ('Samsung', 'NNP'), ('has', 'VBZ'), ('expanded', 'VBN'), ('overseas', 'RB'), (',', ','), ('South', 'NNP'), ('Korea', 'NNP'), ('is', 'VBZ'), ('still', 'RB'), ('host', 'VBN'), ('to', 'TO'), ('most', 'JJS'), ('of', 'IN'), ('its', 'PRP$'), ('factories', 'NNS'), ('and', 'CC'), ('research', 'NN'), ('engineers', 'NNS'), ('.', '.')]
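
Note that spaCy's token.pos_ uses the coarse universal POS tags, while NLTK defaults to the fine-grained Penn Treebank tagset (spaCy exposes those too, as token.tag_). For a rougher apples-to-apples comparison, NLTK can map its output to a universal tagset; a sketch (requires the universal_tagset resource):

nltk.download('universal_tagset')
print(nltk.pos_tag(nltk.word_tokenize(text), tagset='universal'))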

Named Entity Recognition

NER using spaCy:

text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Output:

Samsung 69 76 ORG
South Korea 100 111 GPE
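
spaCy also ships a small built-in visualizer, displacy, which highlights the detected entities; a quick sketch (render returns the markup as an HTML string, or draws inline in a Jupyter notebook):

from spacy import displacy
html = displacy.render(doc, style="ent", jupyter=False)  # entity-highlighted HTML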

NER using NLTK:

Download NLTK words package:

nltk.download('words')

Output:

[nltk_data] Downloading package words to
[nltk_data] C:\Users\aasha\AppData\Roaming\nltk_data...
[nltk_data] Package words is already up-to-date!

Using ne_chunk, pos_tag:

The pos_tag function takes a tokenized sentence and returns each word with its POS tag (noun, verb, etc.).

ne_chunk takes a POS-tagged sentence and returns a tree object whose subtrees are labeled with entity types such as PERSON, GPE (location), and ORGANIZATION.

import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
sent = "What is the weather in Chicago today?"
print(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))))

Output:

(S
  What/WP
  is/VBZ
  the/DT
  weather/NN
  in/IN
  (GPE Chicago/NNP)
  today/NN
  ?/.)

You can see that the named entities are tagged: the location Chicago is tagged as GPE.

Complete code to extract Named entity using ne_chunk and pos_tag:

for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

Output:

PERSON Samsung
GPE South Korea

Note that NLTK tags Samsung as PERSON here, while spaCy tagged it as ORG.
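
If you only need to know that a span is an entity, without trusting the predicted type, ne_chunk also accepts binary=True, which labels every detected chunk simply as NE; a sketch:

for sent in nltk.sent_tokenize(text):
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=True)
    for chunk in tree:
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))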

Conclusion:

Now we have seen how the basic NLP operations can be done in both NLTK and spaCy. Comparing the code side by side is a useful and easy way to understand the basic features of the two libraries.

We will see more about NLP techniques and their applications in this series.

Thank you for reading our article; we hope you enjoyed it. 😊 Try all these techniques and play with words.

Like to support? Just click the like button ❤️.

Happy Learning! 👩‍💻

Originally published at https://devskrol.com on April 17, 2021.
