Introduction to NLP (Natural Language Processing) With spaCy Using Python

Divyesh Dharaiya
Python Concepts
Published in
17 min read · Oct 16, 2020

Introduction:

I am going to explain some NLP basics here. Advanced concepts like tokenization, stemming, lemmatization, etc. will be covered in our advanced NLP blogs. I will be using the spaCy library to explain the NLP basics. spaCy is written in Python and is used for advanced NLP (Natural Language Processing), including vocabulary matching.

I will start with a few introductory remarks that compare spaCy with common libraries such as NLTK, along with a general discussion of what natural language processing actually is.

spaCy for Natural Language Processing:

Setting up spaCy

  • Before we set up spaCy and download the language library, let's quickly discuss what spaCy actually is. spaCy is an open-source library for advanced natural language processing in Python. It is designed to handle NLP tasks effectively, with the most efficient implementation of common tasks and algorithms.
  • That means for many natural language processing tasks, spaCy has only one implementation: the most efficient algorithm currently available. You often don't have the option to choose between algorithms for a particular task; spaCy simply gives you the single most efficient option available.
  • NLTK (Natural Language Toolkit) is another library you may have already heard of, and it is a very popular open-source library. It was initially released in 2001, so it is much older than spaCy, which was released in 2015. It also provides many functionalities that include less efficient implementations, and that's probably one of the main differences between spaCy and NLTK.

NLTK vs spaCy:

  • spaCy is a lot quicker and more effective, at the expense of the user not having the option to pick a particular algorithmic implementation. For most use cases that really doesn't matter, because you care about the result, not about using a particular form of the algorithm. For that reason spaCy is much faster than NLTK: NLTK has a variety of implementations for a lot of common tasks, while spaCy just defaults to the most efficient one currently available, and it's also much newer.
  • Remember that spaCy does exclude some pre-made models for certain applications that NLTK includes, for example "Sentiment Analysis", which is normally simpler to perform with NLTK. So, in this blog, due to spaCy's state-of-the-art approach and efficiency, we're going to focus on using spaCy where it really matters, but we will also introduce and use NLTK where it's easier for certain tasks.
  • spaCy can tokenize, tag, and parse documents. If you compare its speed to NLTK, you will find that NLTK takes about 20 times longer to tokenize a document and about 443 times longer to tag it, and it doesn't even perform parsing. So we can already tell that spaCy is much more efficient than NLTK, up to roughly 400 times more efficient.
  • If you want more facts and figures on the performance of spaCy versus other libraries, you can check out this link (https://spacy.io/usage/facts-figures); it compares the capabilities of spaCy against other libraries, including things like neural network models: spaCy is built and ready to perform state-of-the-art neural network analysis, while NLTK really isn't suited for that.

spaCy Installation Process:

  • If you're on macOS or Linux, open your terminal. If you're on Windows, open either your Anaconda Prompt or your Command Prompt, depending on how you downloaded and installed Python and Anaconda.
  • If you're on a Windows computer, make sure you run it as an administrator. You also want to make sure a firewall isn't blocking you, because you have to download the language library from the Internet.
  • From a command prompt run as administrator, you probably just want to be located in your C:\ folder.
  • Now make sure you actually install spaCy. There are several ways to do this. If you're using Anaconda, you can type the command below in your command prompt:

c:\> conda install -c conda-forge spacy

This command will install spaCy for you from the Anaconda packaging system.

  • Once the installation is done, you need to download, from the command line, the language library that spaCy needs; that language library is a big reason why spaCy can operate so efficiently.
  • To do this, double-check that you have full administrative rights to download things onto the computer. You may also need to make sure a firewall isn't blocking your ability to download from the Internet.
  • Next, from the command line, type the command below:

c:\> python -m spacy download en

Here, "en" stands for English.

  • What this is going to do is automatically download the English language library from the command line. On success, you should see something like a "Linking successful" message.
  • It will say that you can now load the model using spacy.load('en'). This will load up our English model. There are various versions of the English model; you can check out the spaCy documentation for more details.
  • So there are two things we need to make sure of: that we have spaCy installed, and that we have set up the spaCy language library.

Anatomy of Natural Language Processing

  • So what is NLP, or natural language processing? For a formal definition: according to Wikipedia, NLP is an area of Artificial Intelligence and Computer Science concerned with the effective interactions between computers and humans, specifically how to program a computer to process and analyze large amounts of natural language data.
  • To put this in simpler terms: when performing some sort of analysis, you usually have lots of data that is numerical, and that's really convenient; for example, things like share prices, physical measurements, and quantifiable categories. Computers are truly adept at handling direct numerical data and numerical analysis. In natural language processing, however, we are working with text data, i.e. natural language data.
  • As humans, it's really easy for us to tell that there is a wealth of information inside a text document like a PDF file, an e-mail, or a book. There is plenty of information that we can access without much effort because, as human beings, we have learned to read natural language and understand it easily. A computer, however, needs specialized processing strategies and programs to understand the raw text of natural language.
  • Natural language data is text data that is highly unstructured and can also be in different languages, not just English. However, I am dealing only with the English language here. NLP attempts to use an assortment of different processing techniques to create some kind of structure out of raw text data. I will describe some of these basic techniques here; they are built into libraries such as spaCy and NLTK.
  • Some example use cases of natural language processing are taking a raw text email and classifying it as spam versus a legitimate email, or taking a raw text movie review and performing sentiment analysis, i.e. being able to tell whether the review is positive or negative.
  • We can also do things like analyze trends from written customer feedback forums, or understand text commands, such as when you talk to your smartphone and say something like "Hey Google, play a specific song", or "Hey Siri" or "Hey Alexa", which in turn requires natural language processing to convert that raw text data into something the computer can comprehend.
  • Natural language processing is constantly evolving, and great strides are made every month. In this blog, I am going to focus on the fundamental ideas that all state-of-the-art techniques are based on.


Usage of the spaCy library

There are a few key steps for working with spaCy, which I am going to mention here.

  • The first step is loading the language library. Remember, you need to have installed the language library as I mentioned in the spaCy setup process. Once you've loaded that language library, we build the pipeline object, and from that pipeline object we can use tokens, perform part-of-speech tagging, and inspect various token attributes.
  • As I mentioned, spaCy works with a pipeline object. The main idea is that there is an nlp function, created from spaCy, that automatically takes in raw text and performs a series of operations that tag, parse, and describe the text data. Those operations include tokenization, parsing, part-of-speech tagging, and so on.
  • My purpose here is to make you understand these concepts, i.e. the pipeline object and its series of operations. The concepts in this area include tokenization, part-of-speech (POS) tagging, stemming, lemmatization, and a lot more, and I am going to introduce those terms in this blog.
  • I am going to use a Jupyter notebook here, and I highly encourage you to try out the notebook on your own. I'm going to open up a new, untitled notebook. You can also try the online Jupyter notebook for instant development.
  • Click File -> New Notebook and you will get a new notebook page where you can write your code.
  • The first thing we need to do is actually import spaCy, so we write import spacy, and then we're going to load the language library. We have to write:
  • nlp = spacy.load('en_core_web_sm'). Here 'en' is for English and 'core_web_sm' stands for the core, small version of the English language library.
  • Then we hit Enter. Don't worry if this takes a long time the very first time you run it; it usually takes a while. It's a fairly large library, but that is also part of what makes spaCy so efficient: a lot of what it runs on top of is already preloaded into this library.
  • The next step is to create a doc object, or document object. We'll create a variable called "doc" and pass in a Unicode string, which means we prefix the string with a u: u"I am looking for a better U.S. job opportunity with an annual package of 15LPA".
  • What's actually going to happen here is that, using the language library we just loaded, spaCy is going to parse this entire string into separate components for us, known as "tokens". Each of these little words becomes a "token", and I can iterate through this document object.
  • Then I can print out each token, and there are various attributes I can grab from each token. For example, I can grab the token text, which is the raw text of the token. Notice that it's smart enough to treat the capital U and S with the dots ("U.S.") as a single token.
  • spaCy is actually smart enough to realize that when we write something like capital U dot S dot (U.S.), we're talking about the country.
  • spaCy can also tell that something like Amazon is the name of a company. So we're going to print the token here, and then we're also going to print out some more attributes. Let's go ahead and print out token.pos_, which stands for part of speech.
  • Even though a Doc is processed (for example, split into individual words and annotated), it actually holds all the information of the original content, such as the whitespace characters. You can always get the offset of a token into the original string, or reconstruct the original string by joining the tokens along with their trailing whitespace.

Output:

  • When we run that, we see numbers like 95, 92, and 93. Later on, I will show that each of these numbers corresponds to a part of speech, like an adverb, a verb, a noun, a conjunction, etc. If you want the readable part-of-speech (POS) tag, you have to use "pos_", and it will tell you what part of speech it is.

Output is:

  • So it is smart enough to know that "I" is a pronoun, "opportunity" is a noun, "am" is a verb, "looking" is a verb, "U.S." is a proper noun, and so on. It is even smart enough to realize that "15LPA" is a number. spaCy knows a lot of information about each of these tokens. You can also do things like print(token.dep_), which gives you even more information.
  • "dep" here stands for syntactic dependency. The output will be:
  • Hopefully you can see from the above output how incredible the capabilities of spaCy and natural language processing are at grabbing a lot of information from a simple string. It recognizes the role of "I" even though it is capitalized at the start of a sentence. It also understood that the dots in "U.S." don't separate it: it's a single entity and a single token. As I dive deeper into spaCy, you are going to see what each of these abbreviations means and how spaCy arrives at them.

Tagger, Parser, and NER:

  • I will also show you here how spaCy interprets the last token, "15LPA": it is able to understand that this is some sort of quantity of money. Now I am going to explain the pipeline object. After importing the spacy module in the cell above, we loaded a model and named it nlp.
  • That line of code is known as loading a model, and we called the result nlp. Next, we created a doc, or document object, by applying this model nlp to our text. spaCy also builds a companion vocab object for the vocabulary. The doc object we created holds the processed text, and that's really the focus of my discussion. Off of this nlp object, we can call .pipeline. When we run nlp, our text enters a processing pipeline: it first breaks down the text and then performs the series of operations of tagging, parsing, and describing the data we passed in as input. We can see the output here as:
  • The basic NLP pipeline is a "tagger", a "parser", and then "ner", which stands for named entity recognizer. You can learn about each of these in a lot more detail in our advanced NLP blogs.

You can also get just the component names by calling "nlp.pipe_names".

Basics of Tokenization using Spacy:

  • You can play around with the various attributes here. The first one I want to discuss quickly is tokenization. The very first step in processing any text is to split it up into all its parts. Basically, this converts the words and punctuation into tokens, and these tokens are then annotated inside the doc object to contain the descriptive information.

So, let me give you here an example for better understanding:

  • I'm going to create another document, "doc2", and again pass in a Unicode string, meaning it starts with a u: u"Walmart isn't looking into startup companies anymore".
  • We run that and print out the tokens. Here, I am going to print out the same information I printed last time: the text, the part of speech, and the syntactic dependency, as below:
  • If you run that, you can see that "Walmart" is a proper noun and the noun subject. spaCy is also able to understand that [is] and [n't] form a contraction: it keeps them as separate tokens while knowing the relationship between the two parts. You can see the output below:
  • So it's really advanced, what spaCy is doing here. Again, notice that "isn't" has been split into two tokens, and spaCy recognizes both the root verb "is" and the negation "n't" attached to it. Notice additionally that the period (.) at the end of the sentence and any extended whitespace are assigned their own tokens. So if I were to put a lot of extended whitespace into the input text and run this again, that whitespace would become a token in spaCy, and the full stop (.) is punctuation.
  • In the previous example we iterated over every token inside this document object (doc2), but we can also use indexing to grab tokens individually. If I take my document object, I can use indexing to grab the very first token, and by default it returns that token's text. Just as before, on that token object I can ask for attributes like the part of speech, and it returns PROPN, which stands for proper noun.
  • If you check the spaCy basics notebook, you can learn more about part-of-speech tagging and dependencies, additional token attributes, and so on. In the previous examples, you saw that there are parts of speech like PROPN for proper noun, and then verb, noun, and so on.
  • Finally, each token has its syntactic dependency, and that, as I already mentioned, can be retrieved with "doc2[0].dep_".
  • The last thing I want to mention is that there are lots of other token attributes. So far, we've seen parts of speech, dependencies, and a couple more. If you look at the spaCy basics documentation, you can find an entire list of the different tags and descriptions. For example, ".text" is the original word text, which in this case would be "Walmart". If you want the lemma, that gives you the lemmatization, or base form, of the word; here the result is essentially just the lowercase form ("walmart").
  • You can see in the above image that there are many tags available, along with their usages. We have this whole variety of tags you can call on any of these tokens, and spaCy did all of this automatically the very minute we passed the text into nlp. Again, it does this based on the language library we loaded ("en_core_web_sm"), which is why loading took some time, but that's also what makes spaCy so efficient: it loads up the library for us first.
  • Large doc objects can sometimes be hard to work with. A span is a slice of a doc object, defined by a start and a stop index. I'm going to copy and paste a sample paragraph and pass it into nlp; essentially, we're just passing in a really long string. So I have this really large document, and what I may want to do is just grab a span of it.

For Ex:

  • From the paragraph in the example above, I want to select an individual span: [The great enemy of clear language is insincerity]. I have to write the start and end index for this, as below:
  • I have also printed that quote in the program. The output after selecting the span [33:41] is as below:
  • This is now a span of the overall document, because the document is quite large and maybe we're only interested in this particular quote inside it. What's really interesting is that even though we're only grabbing a section of the large document, spaCy is smart enough to know that this "quote" variable is a span.
  • If you check the type of this "quote" variable, you can see that spaCy does a lot of work under the hood to understand that this is a span, unlike the entire document: when we checked the type of doc3, it understood that that was the whole document.
  • So when you take a slice out of the document, spaCy is smart enough to understand that it's a span of a larger document. Certain tokens inside a document, or doc object, may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through "doc.sents".
  • Let me show you a simple example where I pass three sentences to nlp inside a document. spaCy does a lot of work for us here: it actually understands and separates each of these sentences throughout the document.

Output:

  • It's really incredible: spaCy can take in a raw string and completely understand things like parts of speech, named entity recognition, token attributes, where sentences start and end, and a lot more.


Conclusion:

Here, I have explained the most important concepts of Natural Language Processing in Python and have given various examples showing the use of the spaCy library for NLP. spaCy is really incredible at just taking in a raw string and completely understanding things like parts of speech, named entity recognition, token attributes, and where sentences start and end, and it does all of this efficiently for us.

FAQs:

Subject: Natural Language Processing and Usage of the spaCy Library for NLP

  • What do you mean by the term Natural Language Processing?
  • How do you tokenize a sentence using spaCy in NLP?
  • How do you select a specific span in a long string/paragraph using spaCy?
  • What is "POS (Part-of-Speech) Tagging" in NLP?
  • Give any two examples of real-time applications of NLP.
  • What is the difference between the NLTK and spaCy libraries?


Divyesh is working as a freelance Marketing Consultant, specializing in blogging, editing, and providing various digital marketing services.