Top Open Sourced NLP Tools for Python

Arie Pratama Sutiono
4 min read · Jul 3, 2019

NLP covers a wide range of tasks, and that variety has produced a large number of libraries, especially in the Python ecosystem. Here I briefly review the strengths of each library and show snippets for a subset of the NLP tasks each one can handle.

SpaCy: Super Easy To Use NLP Tool

SpaCy is an open source tool for a variety of industry-ready NLP tasks. It specifically aims to be easy to use and ships only proven, recent research models for each task. It is maintained by Ines Montani and Matthew Honnibal, the founders of explosion.ai. You can read more about SpaCy’s features in its documentation, but for now I will demonstrate a few of them.

Importing SpaCy + Example Corpus

NER

Getting named entities is pretty straightforward: you can access the `ents` attribute on the doc variable.

Code to Show Named Entities
NER Result

POS Tagger

Each token has also been POS-tagged by SpaCy; the tag is available as an attribute on the token.

Code to POS Tag
Sample Output For POS Tagging Code

Dependency Parser

Code to Show Tokens’ Dependencies
Result of Spacy Dependency Parsing

Here, I use displaCy's render function instead of serve, because serve starts a separate web server (on port 5000 by default) instead of statically rendering the current doc.

License: MIT

Gensim: Topic Modeling and Word Embedding

Gensim brands itself as topic modeling for humans, and it’s true. It offers an easy API for loading a corpus and doing topic modeling. It is also widely used to train word embeddings such as Word2Vec and to work with pre-trained vectors such as GloVe.

Word Embedding

Creating a word embedding is straightforward. First you tokenize the sentences, then pass them to the word embedding constructor. I will show an example of how to train an unsupervised word2vec model with Gensim.

Now it’s time to visualize this word2vec model. I will use TSNE from scikit-learn to project the vectors into 2 dimensions, so that a scatter plot makes it easy to see what the model has learned.

Word2Vec Visualization Result

It seems this simple model has pulled the hate words away from the other words!

Topic Modeling (using LDA)

Gensim is also known for its easy LDA interface. You need to build a Dictionary from the lists of tokens and then a bow (bag-of-words) corpus from it.

License: GNU LGPLv2.1 (see the Gensim repository for details)

Facebook’s fastText: Word Embedding

Facebook recently published its official Python module, fasttext. You need to install fasttext version > 0.8 (I used fasttext==0.9.1).

License: MIT

Doccano: Text Annotation Tools

Doccano is one of the best data annotation tools, in my personal experience. It is fully open source! I would consider it an alternative to prodi.gy from explosion.ai. It currently has 3 main annotation functions: Named Entity Recognition (NER), Sentiment Analysis / Text Classification, and Machine Translation.

Doccano’s NER Annotation
Doccano’s Sentiment Analysis / Text Classification
Doccano’s Machine Translation

You can upload and download (Export Data) your data as CSV and JSON files. It’s pretty straightforward!

Doccano’s Interface for Viewing the Entire Datasets

It also has a dashboard that displays annotation progress and label distributions.

Doccano’s Interface For Annotation Progress and Label Distributions

Overall, Doccano has been my personal choice for annotating text data, especially for text classification! In addition, it allows collaboration, because it runs as a web server.

License: MIT
