Natural Language Processing using spaCy in Python (Part 1)

Aravind CR · Published in Analytics Vidhya · Mar 31, 2020 · 9 min read

This blog post gives you a brief introduction to spaCy, a Python library used for NLP.

It covers various preprocessing and feature-extraction techniques using spaCy, an open-source NLP library written in Python and Cython and designed for advanced natural language processing.

Before feeding data to machine learning algorithms, we need to process it into a form those algorithms can understand.

Before diving directly into NLP, it's important to know…

What is Text Analysis?

Text analysis is the technique of gathering useful information from text. Its purpose is to create structured data out of free-text content. Text analysis is closely related to other terms such as Text Mining, Text Analytics, and Information Extraction (IE).

Garbage in, Garbage out (GIGO)

GIGO is one of the most important aspects of machine learning, and even more so when dealing with textual data. Garbage in, garbage out means that if we feed in poorly formatted data, we are likely to get poor results.

More data usually leads to better predictions, but this is not always true in text analysis, where more data can produce nonsensical results. Stop words are a good example: these words are often removed from text before analysis. Similarly, we remove words with very high frequency in the body of text, as well as words that appear only once or twice; such words are unlikely to be useful for text analysis.

NLP techniques can also help us build tools that assist businesses and enterprises; for example, chatbots are becoming increasingly common on major websites. This is largely due to a subfield of machine learning called deep learning, where we use algorithms and structures inspired by the structure of the human brain. These algorithms and structures are referred to as neural networks.

spaCy is also one of the fastest NLP frameworks in Python.

Install spaCy using the command:

pip3 install spacy

SpaCy’s Language models

One of spaCy's most interesting features is its language models. A language model is a statistical model that lets us perform NLP tasks such as POS-tagging and NER-tagging.

Download these models using:

spacy download en # English model

spacy download de # German model

spacy download xx # multi-language model

Download a specific model for your spaCy installation:

python3 -m spacy download en_core_web_sm

Run the following command in your Python shell:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is nlp with spaCy')

We will be using spaCy's language model to help us with preprocessing. Before getting into the preprocessing steps, let's understand what happens when you run this:

doc = nlp('This is nlp with spaCy')

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which we also refer to as the pipeline.

Fig. Default pipeline

Tokenizing is the task of splitting a sentence into meaningful segments called tokens. These segments can be words, punctuation, numbers, or other special characters that are the building blocks of a sentence.
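As a quick illustration (reusing the nlp object loaded above), you can print the tokens spaCy produces for our example sentence:

doc = nlp('This is nlp with spaCy')
# each token is one of the building blocks described above
print([token.text for token in doc])
# ['This', 'is', 'nlp', 'with', 'spaCy']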

spaCy's default pipeline also performs rule-based matching. This annotates the text with more information and adds value during preprocessing.

Preprocessing:

With spaCy, stop words are easy to identify; each token has an is_stop attribute, which lets us know whether the word is a stop word or not.

We can add our own stop words to the list of stop words.

stop_words = ['say', 'said', 'it', 'an', 'none', 'all', 'saying']
for stopword in stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

Stop words can also be added using :

from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS) # SpaCy's default stop words
STOP_WORDS.add("your_additional_stopwords_here")

If you have noticed, say, said, and saying provide the same information; grammatical differences aside, keeping only one of them won't affect the results.

Stemming and lemmatization are popular techniques. Stemming generally involves stripping suffixes, but it can produce meaningless words; for example, removing the suffix from laziness gives lazi. Stemming also does not rely on parts of speech. Lemmatization, on the other hand, converts a word into its root word (lemma).
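A minimal sketch of the contrast, assuming NLTK's PorterStemmer for the stemming half (spaCy itself does not ship a stemmer, so that part is shown only for comparison):

from nltk.stem import PorterStemmer
import spacy

# stemming: crude suffix stripping, can yield non-words
stemmer = PorterStemmer()
print(stemmer.stem('laziness'))  # typically 'lazi'

# lemmatization with spaCy: maps each word to a proper root form
nlp = spacy.load('en_core_web_sm')
doc = nlp('The geese were flying south out of laziness.')
print([(token.text, token.lemma_) for token in doc])
# e.g. geese -> goose, were -> be, flying -> fly, laziness -> laziness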

In spaCy, the lemmatized form of a word is accessed with the .lemma_ attribute.

Let's check the sentence "Real-time training during global emergencies is critical for effective preparedness and response.":

doc = nlp('Real-time training during global emergencies is critical for effective preparedness and response.')
sentence = []
for w in doc:
    # if it's not a stop word or punctuation mark, add it to our article!
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num:
        # we add the lemmatized version of the word
        sentence.append(w.lemma_)
print(sentence)

By using the .is_stop, .is_punct, and .like_num attributes, we removed the parts of the sentence we did not need.

Output would be:

['real', 'time', 'training', 'global', 'emergency', 'critical', 'effective', 'preparedness', 'response']

We can further decide whether or not to remove words based on the use case.

Vectorizing text, transformations, and n-grams

We can think of vectors as a way of projecting words onto a mathematical space while preserving the information they provide. In machine learning, such a vector is called a feature vector, as each value corresponds to some feature, and these features are used to make predictions.

Some common concepts for these vector representations are…

Bag-of-words (BOW)

It is one of the most straightforward ways of representing a sentence as a vector. For example:

P1: "The dog sat near the door."
P2: "The bird likes grains."

Following the same preprocessing steps mentioned above, the sentences become:

P1: "dog sat near door."
P2: "birds like grains."

The vocabulary would be the unique words from the sentences.

vocab = ['dog', 'sat', 'near', 'door', 'birds', 'like', 'grains']

We can think of this as mapping each word in the vocabulary to a number.

The BOW model uses word frequencies to construct vectors. Our sentences now look like this:

P1: [1, 1, 1, 1, 0, 0, 0]
P2: [0, 0, 0, 0, 1, 1, 1]

There is 1 occurrence of the word dog and 0 occurrences of the words birds, like, and grains in the first sentence, and so on.
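If you'd rather not build these count vectors by hand, scikit-learn's CountVectorizer does the bookkeeping for you. This is a sketch outside the spaCy pipeline; note that CountVectorizer orders its vocabulary alphabetically, so the columns differ from the hand-built vocab above.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['dog sat near door', 'birds like grains']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# learned vocabulary and one count vector per sentence
print(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn
print(X.toarray())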

TF-IDF

Term frequency-inverse document frequency (TF-IDF) is widely used in search engines to find relevant documents for a query. Imagine you have a search engine and someone searches for Ronaldo. The results will be displayed in order of relevance: the most relevant sports articles are ranked higher because TF-IDF gives the word Ronaldo a higher score.

TF(t) = (number of times term t appears in a document) / (total number of terms in the document)

IDF(t) = log_e(total number of documents / number of documents with term t in it)

TF-IDF is simply the product of these two factors, TF and IDF.

TF-IDF makes rare words more prominent and effectively down-weights common words like is, or, an, and that, which may appear many times but carry little information.
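As a rough sketch of how this is usually computed in practice, scikit-learn's TfidfVectorizer combines both steps. Its default IDF is smoothed and the resulting vectors are normalized, so the numbers differ slightly from the textbook formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the dog sat near the door',
        'the bird likes grains',
        'the dog likes the bird']
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# words that appear in every document (like 'the') get lower weights than rarer words
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))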

N-grams

An n-gram is a contiguous sequence of n items in the text.

The bi-gram is one of the most popular n-grams. Removing stop words is necessary before running a bigram model on your corpus; otherwise meaningless bi-grams could be formed.

For example: Machine Learning, Artificial Intelligence, New Delhi, Data Analytics, and Big Data could be word pairs produced by bi-grams.
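A minimal, plain-Python way to list the candidate bi-grams from a token list (for illustration only; in practice a model such as gensim's Phrases is often used to keep only statistically frequent pairs like machine_learning):

def bigrams(tokens):
    # pair each token with the token that follows it
    return list(zip(tokens, tokens[1:]))

print(bigrams(['machine', 'learning', 'on', 'big', 'data']))
# [('machine', 'learning'), ('learning', 'on'), ('on', 'big'), ('big', 'data')]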

If you want to know more about n-grams, refer to the link (ngrams).

POS-Tagging

Part-of-speech tagging is the process of tagging words in textual input with their appropriate parts of speech.

POS-tagging with spaCy

This is one of the core features loaded into the pipeline.

import spacy
nlp = spacy.load('en_core_web_sm')
sent1 = nlp('Washing your hands with soap and water or using alcohol-based hand rub kills viruses that may be on your hands.')
sent2 = nlp('Antibiotics do not help, as they do not work against viruses.')
sent3 = nlp('Marie took out her rather suspicious and fishy cat to go fish for fish.')
for token in sent2:
    print(token.text, token.pos_, token.tag_)

output:

Antibiotics NOUN NNS
do AUX VBP
not PART RB
help VERB VB
, PUNCT ,
as SCONJ IN
they PRON PRP
do AUX VBP
not PART RB
work VERB VB
against ADP IN
viruses NOUN NNS
. PUNCT .

As you can see, the words are tagged with appropriate parts of speech.

One important note: some words can be either a noun or a verb depending on context. Do try out sentences where a word acts as a noun in one sentence and a verb in another, and check whether it is tagged with the appropriate part of speech. For example: 'Marie took out her rather suspicious and fishy cat to go fish for fish.'
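For instance, you can run sent3 from the snippet above through the same loop; ideally the first fish (after go) is tagged as a verb and the second as a noun, though small models can occasionally get ambiguous cases like this wrong:

for token in sent3:
    print(token.text, token.pos_, token.tag_)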

If you want to train your own pos-tagger, do check the link below….

Training spaCy’s Statistical Models.

NER-tagging

NER stands for Named Entity Recognition. A named entity is a real-world object with a proper name, for example India, Sunil Chetri, or Google. Here India is a country and is identified as a GPE (geopolitical entity), Sunil Chetri is a PER (person), and Google is an ORG (organization).

This process involves finding named entities (people, places, organizations, etc.) in a chunk of text and classifying them into a predefined set of categories.

spaCy itself offers a predefined set of entity types. NER-tagging is usually not the end result; it ends up being helpful for further tasks.

New Delhi is the capital of India. An NER-tagger would recognise New Delhi as a place (GPE), as well as India. Named entities can differ based on context.

import spacy
nlp = spacy.load('en_core_web_sm')
sentence = 'Uber eats India was acquired by Zomato for $350 million'
doc = nlp(sentence)
for token in doc:
    print(token.text, token.ent_type_)

Output:

Uber
eats
India GPE
was
acquired
by
Zomato ORG
for
$ MONEY
350 MONEY
million MONEY

For words that were not identified as named entities, an empty string is returned. For example, in the above output eats and acquired do not refer to a particular entity.
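If you only care about the entities themselves rather than a label per token, spaCy also exposes the recognized spans directly on doc.ents:

# iterate over the entity spans detected in the same doc
for ent in doc.ents:
    print(ent.text, ent.label_)
# expected output (entity spans rather than per-token tags), e.g.:
# India GPE
# Zomato ORG
# $350 million MONEY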

We can use displaCy to visualise our text:

from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
Fig. Visualising named entities

Dependency parsing

Parsing can be understood as a way of analyzing or breaking up a sentence to understand its structure. Parsing in NLP is determining the syntactic structure of text by analyzing its constituent words based on an underlying grammar.

As the name suggests, dependency parsing refers to understanding the structure of a sentence via the dependencies between its words. When a sentence is dependency parsed, it gives us information about the relationships between the words in that sentence.

Parsers break up a sentence into a subject (a noun phrase) and a predicate (a verb phrase). A dependency parser considers the verb as the head of the sentence, and all dependencies are built around it.

Example: The dog is faster than the cat.

Fig. Visualizing dependency parsing

The dog is a noun phrase marked as nsubj, which refers to the subject of the sentence. acomp means adjectival complement, a phrase that modifies an adjective or adds to its meaning. The word than is a preposition, and pobj stands for object of the preposition, which is the cat.
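These relations can also be read programmatically from each token's .dep_ and .head attributes (a small sketch using the same sentence):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The dog is faster than the cat.')
for token in doc:
    # token, its dependency label, and the word it attaches to
    print(token.text, token.dep_, token.head.text)
# e.g. dog -> nsubj (head: is), faster -> acomp (head: is), cat -> pobj (head: than)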

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('displaCy uses JavaScript, SVG and CSS.')
displacy.serve(doc, style='dep')

The above code will start a web server. You can see the visualization by opening http://127.0.0.1:5000 in your browser.

Conclusion:

This blog gives a brief idea of basic preprocessing techniques and feature-extraction techniques such as BOW, TF-IDF, n-grams, POS-tagging, NER-tagging, and dependency parsing in an NLP pipeline. The next parts of this blog will cover topic modelling, text summarization, clustering, and various word-embedding techniques.

References:

-> spaCy documentation:

-> Some parts of this blog were written with reference to the book Natural Language Processing and Computational Linguistics by Bhargav Srinivasa-Desikan.

— — — — — — — — — — — Thank you — — — — — — — — — — — — — —
