NLP Libraries and Pretrained models
This article is the third in the series Everything to get started with NLP. I hope you have covered the first two; they give you the starting point for the world of NLP. Here are the links for those two:
1. Everything to get started with NLP.
2. Sentiment Classifier using Tfidf.
In this article, we will look at NLP libraries and pre-trained models. This will be very interesting, so buckle up and let's get started.
The world of NLP is enormous, and it takes a huge amount of time and resources to get even a little hold of it. Taming it is altogether a different ball game.
That was the scenario a few years back, but not anymore, thanks to pre-built NLP libraries and models that make our lives easier in the gigantic world of NLP. Researchers have invested years of their lives in building these libraries and models.
And since they have already put a lot of effort into designing benchmark models and libraries for us, instead of building things from scratch to solve a similar NLP problem, we should use a pre-trained model or library on our dataset. You have already seen an example of this with scikit-learn, a Python library for machine learning, in the Sentiment Analysis (previous) article.
A lot of research has happened, and is still going on, in NLP. So naturally, there are a lot of models and libraries already in existence. Some of them are very good and some not so much. To make it easier to choose which library or model is right for your task, I will explain the most popular ones.
This will be just an overview of the libraries and models; please follow the links for complete information.
1. NLP Libraries
There are a lot of NLP libraries available. Here are the top five.
NLTK
- NLTK is at the forefront of the NLP world. It is fairly mature and has been in development since the early 2000s. People use NLTK extensively when working on language processing in Python.
- NLTK is written with very good coding practices and is fairly easy to use. On top of that, it comes with downloadable datasets that make life easy.
- It supports lots of tasks such as classification, tokenization, stemming, tagging, and parsing. Developers who are new to NLP and Python generally start with NLTK.
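To make "tokenization" concrete before we get to the library APIs, here is a minimal regex-based tokenizer in plain Python. This is only an illustration of the idea, not how NLTK implements it (NLTK's `word_tokenize` is considerably smarter about contractions and punctuation):

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens -- a rough sketch of
    what library tokenizers such as nltk.word_tokenize produce."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Chennai is a coastal city.")
print(tokens)  # ['Chennai', 'is', 'a', 'coastal', 'city', '.']
```

Real tokenizers handle many more edge cases (abbreviations, clitics like "isn't", URLs), which is exactly why we reach for a library instead of a regex.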
spaCy
- spaCy is arguably the second most famous library for NLP tasks. It is easy to use, intuitive and very powerful; it bills itself as "industrial-strength natural language processing".
- Despite being relatively new, spaCy offers one of the fastest syntactic parsers available today. On top of that, since the toolkit is written in Cython, it is quite fast.
- spaCy supports almost all common NLP features, for example NER and tokenization. On top of that, it supports 49+ languages, pre-trained word vectors, POS tagging, and more.
Gensim
- Gensim is a fairly specialized library for unsupervised semantic modelling. It is highly optimized, so you can expect high performance.
- When it comes to semantic analysis and topic modelling, Gensim is the go-to library. It is fast, scalable and very efficient.
- Gensim uses NumPy extensively, which makes it very fast. On top of speed, NumPy also keeps memory usage highly optimized. Due to these two factors, Gensim can handle huge amounts of data.
scikit-learn
- scikit-learn is by far the most important Python library for machine learning. You have already seen it in use in the Sentiment Analysis article.
- This library provides various tools for text preprocessing, some of which we have used already.
Polyglot
- Polyglot is primarily designed for multilingual applications and is thus quite different from the other libraries we have discussed so far. Though it also provides the typical NLP features, its multilinguality makes it stand apart.
- It provides features to deal with many languages simultaneously. Other libraries have such features too, but not as accurate and advanced as Polyglot's.
- Polyglot's backend also depends on NumPy, so it is very fast as well.
Pros and Cons of these libraries
How to use these libraries
Now that we have seen the pros and cons of these libraries and compared them, let us see how they work and how we can use them.
We will see how to install each library along with a short example of how to use it. We have already seen how scikit-learn works, so we need not discuss it again. For NLTK, spaCy and Polyglot, we will take a random text and see how they perform Named Entity Recognition (NER), i.e. how well they can identify whether a word is an entity or not. For Gensim, we will see an example of document similarity. For all other usages of these libraries, please refer to each library's documentation.
First, we will do NER (Named Entity Recognition) using NLTK, spaCy and Polyglot.
Please note that I am using Jupyter notebook and Python 3.
```python
# We will take this paragraph to perform NER using all three libraries.
text = '''But Google is starting from behind. The company made a late push into hardware, and Apple's Siri, available on iPhones, and Amazon's Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption. I was born in India on 23/03/1996. Chennai is a coastal city.'''
```
NLTK

```python
# If having any trouble, go to this link: https://pypi.org/project/nltk/
!pip install nltk

# importing tokenizer, POS tagger and NER chunker
from nltk import word_tokenize, pos_tag, ne_chunk

# performing tokenization followed by POS tagging, then extracting NER
print(ne_chunk(pos_tag(word_tokenize(text))))
```
The output is in the form of (token, POS tag) pairs. You can see that NLTK converted the whole text into tokens, performed POS tagging, and then extracted the named entities. All the named entities are in parentheses.
For example, (GPE India/NNP) means that India is a GPE (geo-political entity, i.e. a location) and its POS tag is NNP (proper noun, singular).
There are a lot of POS tags; to learn about all of them, please refer to this StackOverflow answer.
spaCy

```python
# If having any trouble, go to this link: https://spacy.io/usage
!pip install -U spacy

# To use spaCy you must have a language model. The model that we need
# depends on our language and also on our usage.
# Please refer to the above spaCy installation link for full details.

# We will download the English model for our usage.
!python -m spacy download en

# Downloading English's small statistical model. Medium and large
# models are also available.
!python -m spacy download en_core_web_sm

# importing spaCy and English's small statistical model, and loading it
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

# performing NER over the same text
doc = nlp(text)
for entity in doc.ents:
    print(entity.text, entity.label_)
```
When we perform NER in spaCy, only the NER output is shown; all the other details are hidden. The output of spaCy's NER is in the form of (token, NER tag) pairs. For example, 23/03/1996 DATE means that 23/03/1996 is a date and DATE is its NER tag.
Polyglot

```python
# installing Polyglot and other important modules. All the other
# three modules must be present for polyglot to work.
# If having any trouble, go to this link: https://pypi.org/project/polyglot/
!pip install PyICU
!pip install pycld2
!pip install morfessor
!pip install polyglot

# Downloading other requirements like POS tagger, embeddings and NER tagger
!polyglot download pos2.en
!polyglot download embeddings2.en
!polyglot download ner2.en

# performing NER over the same text
from polyglot.text import Text

poly_text = Text(text)
print(poly_text.entities)
```
Polyglot gives only the named entities as output and hides all other details. The output is in the form NER-Tag(['token']). For example, I-ORG(['Google']) means that Google is an entity and I-ORG is its NER tag.
Gensim

For our example, we will take nine documents and find their similarity with a sample document.
```python
# If having any trouble, go to this link: https://pypi.org/project/gensim/
!pip install gensim

# importing gensim
import gensim

# First, let's create a small corpus of nine documents and twelve features.
# From the sentiment analysis example we are familiar with how documents
# are converted into vectors. This is the same: each document is a list of
# (feature_id, weight) pairs, and the total number of features is 12.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 1.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

# Now we will convert the vectors in our corpus into tf-idf vectors.
# We have already seen tf-idf vectors in the sentiment analysis example.
from gensim import models
tfidf = models.TfidfModel(corpus)

# Now we will create a sample document to calculate similarity with all
# other documents in our corpus. We apply tf-idf vectorization over this
# document as well; in the output you can see its tf-idf vector.
sample_doc = [(0, 1), (4, 1)]
print(tfidf[sample_doc])

# Let us transform the whole corpus via tf-idf and index it, in
# preparation for finding similarity.
from gensim import similarities
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)

# Now calculate the similarity of our sample document against every
# document in the corpus:
sims = index[tfidf[sample_doc]]
print(list(enumerate(sims)))
```
The output is a list of (document number, similarity) tuples. For example, the first document (index=0) has a similarity score of 0.466 = 46.6%, the second document has a similarity score of 19.1%, and so on.
Thus, according to TfIdf document representation and cosine similarity measure, the most similar to our sample_doc is document no. 4, with a similarity score of 77.8%. Note that in the TfIdf representation, any documents which do not share any common features with sample_doc at all (documents no. 5–9) get a similarity score of 0.0.
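Under the hood, these scores are just cosine similarities between tf-idf vectors. Here is a stdlib-only sketch of the measure (an illustration of what Gensim computes, not Gensim's own implementation), using sparse vectors written as `{feature_id: weight}` dicts:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as
    {feature_id: weight} dicts: dot product divided by the norms."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# A document sharing no features with the query scores exactly 0,
# which is why documents 5-9 above got 0.0; an identical document scores 1.
print(cosine_similarity({0: 1.0, 4: 1.0}, {9: 1.0}))          # -> 0.0
print(cosine_similarity({0: 1.0, 4: 1.0}, {0: 1.0, 4: 1.0}))  # -> 1.0
```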
scikit-learn

Since we have already seen scikit-learn's usage extensively in the last article, I will not be covering it here.
End Notes on Libraries
Please note that these libraries don't always work as nicely as their documentation suggests. You need to try different libraries to check which one fulfills your goal.
For example, in spaCy Chennai is not an entity but in NLTK it is. In NLTK 23/03/1996 is not a date but in spaCy it is.
If no library fulfills your requirements, you can train one on your own dataset. Training can take a lot of time and resources, but it can help a lot.
2. Transfer Learning and NLP Pretrained models
An alternative to training a library on your own dataset is fine-tuning a pre-trained model.
Table of Contents
- What is transfer learning?
- What is a pre-trained model?
- Why is it important?
- How to use them?
- What are the best models available?
Let us cover these topics one by one.
- What is Transfer Learning?
Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task. It’s like learning something from someone and then utilizing the knowledge for some other task.
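The idea can be sketched with a toy example in plain Python: a 1-D linear model trained on one task is reused as the starting point for a related task, so far fewer training steps are needed. (This is only an illustration of the warm-start principle; real transfer learning reuses the weights of deep networks in the same way.)

```python
def train(xs, ys, w=0.0, steps=100, lr=0.01):
    """Fit y ~ w*x by gradient descent on mean squared error,
    starting from an initial weight w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

# Task A: learn y = 2x from scratch (many steps).
w_pretrained = train([1, 2, 3], [2, 4, 6])

# Task B: y = 2.1x, a task similar to A.
# Transfer learning: start from the pre-trained weight instead of 0,
# so a handful of steps is enough to adapt.
w_finetuned = train([1, 2, 3], [2.1, 4.2, 6.3], w=w_pretrained, steps=10)
print(round(w_pretrained, 2), round(w_finetuned, 2))
```

Starting from `w_pretrained`, ten steps get the model close to the new task's optimum; starting from zero, those same ten steps would leave it far away.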
- What is a Pre-Trained model?
A pre-trained model is a model created by someone else to solve a similar problem. The most important part of such a model is what it has already learned from its training. So instead of starting from scratch to solve a similar problem, we can use its learning as a starting point.
- Why is it important?
The researchers and authors of a particular model have invested months, sometimes years, to train it and refine it to the level of benchmark results. So why repeat the same steps when they have already been done? The only thing we need to do is fine-tune the model for our specific use case.
For example, if you want to build a self-driving car, you can spend years building a decent image-recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was trained on ImageNet data, to identify objects in images.
- How to use them?
Please refer to this blog, which explains it very nicely. You can take that as an example of the pre-trained-model idea in action: using someone else's knowledge instead of starting from scratch.
- What are the best models available?
ULMFiT, Transformer, Google's BERT, and Transformer-XL.
For an overview of all of them, please go through this link. There you will also find more resources to get a complete hold on pre-trained models.
I have tried to give you insights into the enormous world of NLP. I hope you got an idea of how things work in NLP, and I wish that you take this knowledge and explore further.
That is all. Thanks for reading. Happy Learning.