Awesome NLP — 21 popular NLP libraries of 2022

The landscape of NLP libraries

Fabio Chiusano
NLPlanet
7 min read · Jan 24, 2022

In this article I list today's most widely used NLP libraries, with a brief description of each. Each has specific strengths and weaknesses depending on the use case, so all of them are worth knowing for a data scientist who specializes in NLP.

Descriptions of each library are extracted from their GitHub repositories.

List of popular NLP libraries in 2022. Image by the author.

Top NLP libraries

Here is the list of top libraries, sorted by their number of GitHub stars.

Hugging Face Transformers

  • 57.1k GitHub stars.
  • Transformers provides thousands of pre-trained models to perform tasks on different modalities such as text, vision, and audio. These models can be applied to text (text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages), images (image classification, object detection, and segmentation), and audio (speech recognition and audio classification). Transformer models can also perform tasks on several modalities combined, such as table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.
  • Currently updated.

spaCy

  • 22.2k GitHub stars.
  • spaCy is a free open-source library for Natural Language Processing in Python and Cython. It’s built on the very latest research and was designed from day one to be used in production environments. spaCy comes with pre-trained pipelines and currently supports tokenization and training for 60+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification, multi-task learning with pre-trained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment, and workflow management. spaCy is commercial open-source software, released under the MIT license.
  • Currently updated.
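
To give a feel for the tokenization support mentioned above, here is a minimal sketch. It assumes spaCy v3 is installed and uses a blank English pipeline, so no pre-trained model needs to be downloaded. The example sentence is made up for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, no pre-trained model required.
nlp = spacy.blank("en")
# Rule-based, punctuation-driven sentence splitter added as a pipeline component.
nlp.add_pipe("sentencizer")

doc = nlp("spaCy splits text into tokens. It also finds sentence boundaries.")

tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]

print(tokens)
print(sentences)
```

Tagging, parsing, and named entity recognition need one of the pre-trained pipelines, which are installed separately.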

Fairseq

  • 15.1k GitHub stars.
  • Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. It provides reference implementations of various sequence modeling papers.
  • Currently updated.

Jina

  • 13.9k GitHub stars.
  • Jina is a neural search framework to build state-of-the-art and scalable neural search applications in minutes. Jina allows building solutions for indexing, querying, understanding multi-/cross-modal data such as video, image, text, audio, source code, PDF.
  • Currently updated.
  • Thanks Christoph E. for suggesting Jina!

Gensim

  • 12.8k GitHub stars.
  • Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the NLP and information retrieval (IR) community. Gensim has efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP), or word2vec deep learning.
  • Currently updated.

Flair

  • 11.2k GitHub stars.
  • Flair is a powerful NLP library. Flair allows you to apply state-of-the-art NLP models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification, with support for a rapidly growing number of languages. Flair has simple interfaces that allow you to use and combine different word and document embeddings, including Flair embeddings, BERT embeddings, and ELMo embeddings. The framework builds directly on PyTorch, making it easy to train your own models and experiment with new approaches using Flair embeddings and classes.
  • Currently updated.

AllenNLP

  • 10.8k GitHub stars.
  • An Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks. It provides a broad collection of existing model implementations that are well documented and engineered to a high standard, making them a great foundation for further research. AllenNLP offers a high-level configuration language to implement many common approaches in NLP, such as transformer experiments, multi-task training, vision+language tasks, fairness, and interpretability. This allows experimentation on a broad range of tasks purely through configuration, so you can focus on the important questions in your research.
  • Currently updated.

NLTK

  • 10.4k GitHub stars.
  • NLTK (the Natural Language Toolkit) is a suite of open-source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
  • Currently updated.
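
A small sketch of the text processing side of NLTK, assuming the package is installed. I deliberately use only components that work without downloading extra corpora (a regular-expression tokenizer, the Porter stemmer, n-grams, and a frequency distribution). The sentence is an invented example.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

text = "The tagger was tagging words while the parser was parsing sentences"

# Split on runs of word characters; this tokenizer needs no downloaded data.
tokens = RegexpTokenizer(r"\w+").tokenize(text.lower())

# Reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Adjacent token pairs and raw token counts.
bigrams = list(ngrams(tokens, 2))
freq = FreqDist(tokens)

print(stems)
print(bigrams[:3])
print(freq.most_common(2))
```

Resources such as WordNet or the pre-trained taggers have to be fetched separately with NLTK's downloader before use.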

CoreNLP

  • 8.3k GitHub stars.
  • Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize and interpret dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases or word dependencies, and indicate which noun phrases refer to the same entities.
  • Currently updated.

Pattern

  • 8.1k GitHub stars.
  • Pattern is a web mining module for Python. It has tools for data mining: web services (Google, Twitter, Wikipedia), web crawler, and HTML DOM parser. It has several Natural Language Processing models: part-of-speech taggers, n-gram search, sentiment analysis, and WordNet. It implements Machine Learning models: vector space model, clustering, classification (KNN, SVM, Perceptron). Pattern can also be used for Network Analysis: graph centrality and visualization.
  • Last update 2 years ago.

TextBlob

  • 8k GitHub stars.
  • TextBlob is a Python library for processing textual data. It provides a simple API for diving into common Natural Language Processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. TextBlob stands on the giant shoulders of NLTK and Pattern and plays nicely with both.
  • Currently updated.

Hugging Face Tokenizers

  • 5.2k GitHub stars.
  • This library provides an implementation of today’s most used tokenizers, with a focus on performance and versatility.
  • Currently updated.

Haystack

  • 3.8k GitHub stars.
  • Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want to perform Question Answering or semantic document search, you can use the State-of-the-Art NLP models in Haystack to provide unique search experiences and allow your users to query in natural language. Haystack is built in a modular fashion so that you can combine the best technology from other open-source projects like Huggingface’s Transformers, Elasticsearch, or Milvus.
  • Currently updated.

Snips NLU

  • 3.6k GitHub stars.
  • Snips NLU is a Python library that allows the extraction of structured information from sentences written in natural language. Anytime a user interacts with an AI using natural language, their words need to be translated into a machine-readable description of what they meant. The NLU (Natural Language Understanding) engine of Snips NLU first detects what the intention of the user is (a.k.a. the intent), then extracts the parameters (called slots) of the query.
  • Last update 2 years ago.
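
The two-step idea described above (detect the intent first, then extract the slots) can be sketched in plain Python. This is a toy illustration and not the Snips NLU API: the keyword rules, the regular expressions, and the names `INTENT_KEYWORDS`, `SLOT_PATTERNS`, `ParseResult`, and `parse` are all hypothetical, whereas a real engine learns both steps from training data.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical hand-written rules standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "turn_on_light": ("turn on", "switch on"),
    "get_weather": ("weather", "forecast"),
}

# Hypothetical patterns standing in for a trained slot filler.
SLOT_PATTERNS = {
    "room": re.compile(r"\b(kitchen|bedroom|living room)\b"),
    "city": re.compile(r"\bin ([A-Z][a-z]+)\b"),
}


@dataclass
class ParseResult:
    intent: Optional[str]
    slots: Dict[str, str] = field(default_factory=dict)


def parse(utterance: str) -> ParseResult:
    """Step 1: pick the first intent whose keyword occurs. Step 2: fill slots."""
    lowered = utterance.lower()
    intent = next(
        (name for name, keywords in INTENT_KEYWORDS.items()
         if any(keyword in lowered for keyword in keywords)),
        None,
    )
    slots = {}
    for slot_name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[slot_name] = match.group(1)
    return ParseResult(intent, slots)


print(parse("Turn on the light in the kitchen"))
print(parse("What is the weather in Paris"))
```

The output is the kind of machine-readable description the paragraph refers to: an intent name plus a dictionary of slot values.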

NLP Architect

  • 2.8k GitHub stars.
  • NLP Architect is an open-source Python library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing and Natural Language Understanding Neural Networks. It is designed to be flexible and easy to extend, to allow easy and rapid integration of NLP models in applications, and to showcase optimized models.
  • Currently updated.

PyTorch-NLP

  • 2k GitHub stars.
  • PyTorch-NLP is a library of basic utilities for PyTorch NLP. It extends PyTorch to provide you with basic text data processing functions.
  • Currently updated.

Polyglot

  • 1.9k GitHub stars.
  • Polyglot is a natural language pipeline that supports massive multilingual applications: Tokenization (165 Languages), Language Detection (196 Languages), Named Entity Recognition (40 Languages), Part of Speech Tagging (16 Languages), Sentiment Analysis (136 Languages), Word Embeddings (137 Languages), Morphological analysis (135 Languages), and Transliteration (69 Languages).
  • Last update 3 years ago.

TextAttack

  • 1.8k GitHub stars.
  • TextAttack is a Python framework for adversarial attacks, data augmentation, and model training in NLP.
  • Currently updated.

Word Forms

  • 513 GitHub stars.
  • Word Forms can accurately generate all possible forms of an English word. It can conjugate verbs and pluralize singular nouns. It can also connect different parts of speech, e.g. noun to adjective, adjective to adverb, noun to verb, etc.
  • Last update 1 year ago.

Rosetta

  • 420 GitHub stars.
  • Rosetta is a privacy-preserving framework based on TensorFlow. It integrates with mainstream privacy-preserving computation technologies, including cryptography, federated learning, and trusted execution environment. Rosetta reuses the APIs of TensorFlow and allows the transfer of traditional TensorFlow codes into a privacy-preserving manner with minimal changes.
  • Currently updated.

Honorable Mentions

I list here some data science libraries that are not specific to NLP but which are nevertheless often used in NLP projects.

scikit-learn

  • 48.6k GitHub stars.
  • Scikit-learn (also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
  • Currently updated.
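
As an illustration of how scikit-learn typically shows up in NLP projects, here is a minimal text classification sketch: TF-IDF features fed into a linear support-vector machine. The tiny training set and its labels are invented for the example, so treat it as a pattern and not as a benchmark.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy data: three positive and three negative reviews.
train_texts = [
    "I loved this movie",
    "what a great film",
    "fantastic acting and story",
    "I hated this movie",
    "what a terrible film",
    "awful acting and boring story",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Vectorize the text with TF-IDF, then fit a linear SVM on the vectors.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(train_texts, train_labels)

new_texts = ["a great and fantastic movie", "a terrible and boring film"]
predictions = model.predict(new_texts).tolist()
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the same preprocessing at training and prediction time.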

pandas

  • 32.4k GitHub stars.
  • pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language.
  • Currently updated.
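
In NLP projects pandas is mostly used to hold and inspect labeled text data. A minimal sketch, with an invented four-row dataset, that computes a token count per text and the average length per label:

```python
import pandas as pd

# Invented toy dataset of labeled texts.
reviews = pd.DataFrame(
    {
        "text": [
            "I loved this movie",
            "What a terrible film",
            "Great acting",
            "Boring and far too long",
        ],
        "label": ["pos", "neg", "pos", "neg"],
    }
)

# Whitespace tokenization via the vectorized string accessor.
reviews["n_tokens"] = reviews["text"].str.split().str.len()

# Average number of tokens per label.
mean_length = reviews.groupby("label")["n_tokens"].mean()

print(reviews)
print(mean_length)
```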
