Top Open Source NLP Tools for Python
NLP covers a wide range of tasks, and that range has produced a wide variety of libraries, especially in the Python ecosystem. Here, I will briefly review the strengths of each library and show some snippets for a subset of the NLP tasks each library can handle.
spaCy: Super Easy to Use NLP Tool
spaCy is an open source tool for a variety of industry-ready NLP tasks. It specifically aims to be easy to use and to include only proven, up-to-date research models for each task. It is maintained by Ines Montani and Matthew Honnibal, the founders of explosion.ai. You can read more on spaCy’s features here, but for now I will demonstrate a few of them.
NER
Getting named entities is pretty straightforward: you can access the `ents` attribute on the doc object.
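As a minimal sketch of reading `ents`, the helper below runs a pipeline over a sentence and collects each entity's text and label. The names `extract_entities` and `demo`, the example sentence, and the choice of the `en_core_web_sm` model are my own assumptions for illustration, not part of spaCy itself.

```python
import spacy


def extract_entities(nlp, text):
    """Run the pipeline on `text` and return (entity text, label) pairs."""
    doc = nlp(text)
    # doc.ents holds the named-entity spans found by the pipeline
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo(model_name="en_core_web_sm"):
    # Assumption: the model was installed beforehand with
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load(model_name)
    text = "Apple is opening a new office in Jakarta next year."  # made-up sentence
    for ent_text, label in extract_entities(nlp, text):
        print(ent_text, label)
```

Calling `demo()` prints one entity per line. Which entities come back depends on the model you load, so treat the output as something to inspect rather than a fixed result.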
POS Tagger
Each token has also been POS-tagged by spaCy; the tags are available through each token's `pos_` and `tag_` attributes.
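A small sketch of reading those tags could look like this. The helper name `pos_tags` is hypothetical; it only loops over the tokens of a doc and collects the coarse tag (`pos_`) and the fine-grained tag (`tag_`).

```python
import spacy


def pos_tags(nlp, text):
    """Return (token text, coarse POS, fine-grained tag) for every token."""
    doc = nlp(text)
    return [(token.text, token.pos_, token.tag_) for token in doc]
```

For example, `pos_tags(spacy.load("en_core_web_sm"), "I like green tea")` returns one tuple per token, assuming that model is installed. With a pipeline that has no tagger (such as `spacy.blank("en")`), the tag fields come back as empty strings.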
Dependency Parser
Here, I use the `render` function instead of `serve`, because `serve` starts a separate web server (by default on port 5000) instead of statically rendering the current doc.
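A rough sketch of the static rendering route is below. The helper names `dependency_svg` and `save_dependency_svg` and the output file name are assumptions of mine; the only spaCy call involved is `displacy.render` with `style="dep"`.

```python
from spacy import displacy


def dependency_svg(doc):
    """Return the dependency tree of `doc` as SVG markup."""
    # jupyter=False makes render() return the markup as a string
    # instead of trying to display it inline in a notebook
    return displacy.render(doc, style="dep", jupyter=False)


def save_dependency_svg(doc, path="dependency.svg"):
    """Write the rendered tree to a file that any browser can open."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(dependency_svg(doc))
    return path
```

Inside a Jupyter notebook you can call `displacy.render(doc, style="dep")` directly and the tree shows up in the output cell. The doc should come from a pipeline with a parser, otherwise there are no arcs to draw.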
License: MIT
Gensim: Topic Modeling and Word Embedding
Gensim brands itself as "topic modelling for humans", and it’s true. It gives you an easy API for loading a corpus and doing topic modeling. It has also been widely used to train word embeddings such as Word2Vec and FastText, and it can load pretrained vectors such as GloVe.
Word Embedding
Creating a word embedding is straightforward. First you tokenize the sentences, then pass the tokens to the word embedding constructor. I will provide an example of training an unsupervised word2vec model with gensim.
Now it’s time to visualize this word2vec model. I will use TSNE from scikit-learn to project the vectors into 2 dimensions, so a scatter plot makes it easy to see what the model has learned.
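One way to sketch the projection and the plot is shown below. The helper names `project_2d` and `plot_words`, the perplexity heuristic, and the output file name are my own assumptions; the plotting part relies on matplotlib, which is imported only when the plot is requested.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_2d(vectors, seed=0):
    """Project word vectors of shape (n_words, dim) down to (n_words, 2)."""
    vectors = np.asarray(vectors, dtype=np.float32)
    n = len(vectors)
    # perplexity has to stay below the number of samples,
    # so shrink it for small vocabularies (heuristic, not a tuned value)
    perplexity = max(1.0, min(30.0, (n - 1) / 3.0))
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(vectors)


def plot_words(words, coords, path="word2vec_tsne.png"):
    """Scatter the 2D points, label each one with its word, save to `path`."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(coords[:, 0], coords[:, 1])
    for word, (x, y) in zip(words, coords):
        ax.annotate(word, (x, y))
    fig.savefig(path)
    plt.close(fig)
    return path
```

With a trained gensim 4.x model, the inputs would be `words = model.wv.index_to_key` and `coords = project_2d(model.wv[words])`, followed by `plot_words(words, coords)`.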
It seems this simple model has pulled the hate words away from the other words!
Topic Modeling (using LDA)
Gensim is also known for its easy LDA interface. You need to build a `Dictionary` from a list of tokens and a `bow` (bag of words) instance from it.
License: GNU LGPLv2.1. (See here)
Facebook’s fastText: Word Embedding
Facebook has recently published its official Python library as the module `fasttext`. You need to install fasttext version > 0.8 (I used `fasttext==0.9.1`).
License: MIT
Doccano: Text Annotation Tools
Doccano is one of the best data annotation tools, in my personal experience. It is fully open source! I would consider it an alternative to prodi.gy from explosion.ai. It currently has 3 main annotation functions: Named Entity Recognition (NER), Sentiment Analysis / Text Classification, and Sequence to Sequence (for example, Machine Translation).
You can upload and download (Export Data) your data as CSV and JSON files, and it’s pretty straightforward!
It also has a dashboard that displays annotation progress and label distributions.
Overall, Doccano has been my personal choice for annotating text data, especially for text classification! In addition, it allows for collaboration because it runs as a web server.
License: MIT