7 Amazing Open Source NLP Tools to Try With Notebooks in 2019

As previously highlighted in my Beyond Word Embeddings Series, 2019 is going to be an exciting year for natural language processing. Here are my favorite NLP toolkits, all of which you can start experimenting with in Azure Notebooks.

The Azure Notebook Service offers free interactive computing and project management in the browser, and it can be linked to remote GPU Data Science Virtual Machine (DSVM) compute using an Azure Subscription. I’ve included an open source notebook that contains installation instructions and a hello world example for each of these toolkits.

1. NLTK

Tagline: NLTK — the Natural Language Toolkit — is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing.

Favorite Features: Lexical Corpus Integration (WordNet, Stopwords, etc.), Tokenization, Sentiment Analysis
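As a quick taste of the tokenization feature, here is a minimal sketch that uses only the parts of NLTK that work without downloading any corpus data. The helper name `tokenize_and_stem` and the sample sentence are my own for illustration; the corpus-backed features (WordNet, stopwords, sentiment lexicons) additionally need their data fetched with `nltk.download` first.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def tokenize_and_stem(text):
    """Split text into word/punctuation tokens and stem each token.

    tokenize_and_stem is a hypothetical helper name, not part of NLTK.
    """
    stemmer = PorterStemmer()
    # wordpunct_tokenize is regex based, so it needs no downloaded models.
    tokens = wordpunct_tokenize(text)
    stems = [stemmer.stem(token) for token in tokens]
    return tokens, stems


tokens, stems = tokenize_and_stem("Hello world, this is NLTK.")
print(tokens)  # ['Hello', 'world', ',', 'this', 'is', 'NLTK', '.']
print(stems)
```

The regex tokenizer is a deliberately simple choice here; for real projects you would likely prefer `nltk.word_tokenize`, which requires the tokenizer data to be downloaded.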

2. spaCy

Tagline: spaCy is a library for advanced Natural Language Processing in Python and Cython. spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 30+ languages.

Favorite Features: Syntactic Parser, Named Entity Recognition, Tokenization, Speed, Extensible Pipeline Interface, displaCy visualization
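To try the tokenizer without downloading a model, a rough sketch like the following should work. It assumes a blank English pipeline created with `spacy.blank`, which only tokenizes; the parser and named entity recognizer need a pre-trained pipeline (for example `en_core_web_sm`) installed and loaded with `spacy.load`. The function name `tokenize` is my own.

```python
import spacy


def tokenize(text):
    """Tokenize text with a blank English pipeline (tokenizer only).

    Assumption: no pre-trained model is installed, so we avoid spacy.load.
    """
    nlp = spacy.blank("en")
    doc = nlp(text)
    return [token.text for token in doc]


print(tokenize("spaCy splits text into tokens."))
# ['spaCy', 'splits', 'text', 'into', 'tokens', '.']
```

In a real application you would create the `nlp` object once and reuse it, since building the pipeline is the expensive step.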

3. AllenNLP

Tagline: An Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.

Favorite Features: Question Answering, Semantic Role Labeling, Within-Document Coreference, Textual Entailment, Text to SQL

4. Stanford NLP

Tagline: The Stanford NLP Group’s official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server.

Favorite Features: Extensive Language Support for Tokenization, Parsing, and Named Entity Extraction, including Hebrew, Arabic, Finnish, Basque, and more.

5. Intel NLP Architect

Tagline: NLP Architect is an open-source Python library for exploring state-of-the-art deep learning topologies and techniques for natural language processing and natural language understanding.

Favorite Features: Intent Extraction, Term Set Expansion, Machine Reading Comprehension, and the only working Python-based, sieve-based Cross-Document Coreference system that I know of.

6. Flair

Tagline: Flair allows you to apply our state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification.

Favorite Features: Easy-to-use pretrained BERT and Flair embeddings

7. Gensim

Tagline: Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

Favorite Features: Topic Modeling, my favorite LDA implementation

There you go. This should be more than enough to get you started on your next big NLP project.

I hope this helps you get started on your NLP journey. Feel free to comment below with your ideas.

Next Steps

If the field of NLP interests you and you would like to learn more about how these frameworks work behind the scenes, check out my Beyond Word Embeddings Series below.

If you have any questions, comments, or topics you would like me to discuss, feel free to follow me on Twitter. If there is a tool you feel I missed, please let me know in the comments below.

About the Author

Aaron (Ari) Bornstein is an avid AI enthusiast with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.