#0to1 NLP Episode-1: Introduction and NLP libraries

Pranav Pattarkine
5 min read · Aug 15, 2020

--

What is Natural Language?

Natural language is simply human language, such as English or French, whereas computer languages include C, Python, and many others. Programming languages are constructed for specific purposes, while natural languages have evolved over the years through everyday use. Although natural language follows certain grammar rules, it is not rigidly bound by them: it incorporates slang, sarcasm, modern abbreviations, and so on. Natural language can take many forms, including text, speech, and even sign language.
Natural language must be processed before a machine can understand it; hence, Natural Language Processing (NLP).

Natural Language Processing

NLP (Natural Language Processing), in the simplest terms, is the interaction between computers and humans using natural language. More broadly, it can be defined as building computing tools for the automatic manipulation of natural language, such as speech and text. The ultimate objective of NLP is to read, decipher, and understand human languages in a way that is valuable, and to perform useful operations with language: translation, chatbots, question answering, text summarization, speech-to-text and its inverse text-to-speech, sentiment analysis, and more.

Generally, we take Natural Language Processing in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it can be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves “understanding” complete human utterances, at least to the extent of being able to give useful responses to them. It is a collective term for the automatic computational processing of human languages, covering both algorithms that take human-produced text as input and algorithms that produce natural-looking text as output.
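The “simple extreme” above, counting word frequencies, takes only a few lines of standard-library Python. Here is a minimal sketch using a made-up sentence; no NLP library is required:

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"

# Lowercase and split on whitespace: the crudest possible tokenizer.
tokens = text.lower().split()

# Count how often each token appears.
freqs = Counter(tokens)

print(freqs.most_common(2))  # → [('the', 3), ('sat', 2)]
```

Comparing such frequency profiles across two documents is already a (very rough) stylometric signal, which is exactly the “simple” end of the NLP spectrum.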

Structured and Unstructured data

The main difference between structured and unstructured data is that structured data usually comes in tabular format, i.e., it can be arranged in rows and columns and stored in relational databases. It can also simply be numbers, dates, IDs, etc. Unstructured data, by contrast, cannot be arranged in rows, columns, and relational databases. Audio files, emails, word-processing files, and collections of articles all fall into the unstructured category.

Structured data is preferred in tasks such as text classification and machine translation, where labeled data is required; unstructured data is preferred in tasks such as question answering and language modeling.
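To make the distinction concrete, here is a small illustration with hypothetical data, using only the standard library. The same facts can live in a structured table, where every field is directly addressable, or in free text, where extracting a field requires actual language processing:

```python
import csv
import io

# Structured: tabular rows with named columns, parsed mechanically.
table = "id,name,age\n1,Ada,36\n2,Alan,41\n"
rows = list(csv.DictReader(io.StringIO(table)))
print(rows[0]["name"])  # every field is directly addressable by column name

# Unstructured: the same facts buried in free text. There is no "age"
# column to look up; recovering it is an NLP problem (entity extraction).
blob = "Ada, who is 36, met Alan, who recently turned 41."
```

This is why classic database tooling handles the first form trivially, while the second form is the raw material NLP systems are built for.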

Challenges in NLP

Natural language is hard to learn and highly ambiguous. To appreciate the difficulty, consider that an average adult needs around 6–7 months to begin learning a new language, while a machine is expected to learn one in a single pass. Even after learning, the language keeps evolving, and detecting the true meaning of a sentence is genuinely hard: how do you run sentiment analysis on a sarcastic review? Though language has certain governing rules, raw data does not necessarily follow them. Modern slang and abbreviations prove especially troublesome when processing natural language.

The one thing we don’t want to do here is skip the basic concepts of NLP and jump straight to Text Classification and Text Summarization. In this series, we’ll try to cover as many topics as possible, including:

  • Text Pre-processing
  • Neural Networks
  • Context-free Word Embeddings
  • Transformer
  • Context-based Word Embeddings
  • Text Summarization
  • Text Classification
  • Question Answering (QA)
  • GLUE Benchmark

A decade ago, only experts with knowledge of statistics, machine learning, and linguistics could tackle heavy NLP tasks, but in recent years, thanks to various NLP libraries, solving NLP problems has become much easier. In this article, we’ll look at the most popular NLP libraries; they will be compared in subsequent articles in the context of each article’s task. So let’s get started.

Notable NLP Libraries

There are many NLP libraries out there, but these few are worth highlighting. You don’t need to study every library in detail, but you should know their advantages and disadvantages.

  • NLTK: The Natural Language Toolkit is probably the most famous NLP library, providing easy-to-use interfaces to over 50 corpora and lexical resources, 9 stemmers, and dozens of algorithms to choose from. Its weaknesses are that it is slow compared to other libraries and somewhat complicated to learn and use.
  • spaCy: spaCy is known as a state-of-the-art library, shipping only the best algorithm for each task and sparing you the stress of choosing among algorithms. It is designed explicitly for production use: it lets you build applications that process and understand huge volumes of text. Because it is implemented in Cython, spaCy is lightning fast, and it supports tokenization for more than 49 languages.
  • Stanford CoreNLP: Stanford CoreNLP is a suite of production-ready natural language analysis tools. Since CoreNLP is written in Java, it requires Java to be installed on your machine. However, it offers programming interfaces for many popular languages, including Python. The library provides a wide range of functionality and is fast and accurate, which is why many organizations use CoreNLP in production.
  • TextBlob: TextBlob is built on NLTK and another package known as Pattern, and provides a very straightforward API for all common (and some less common) NLP tasks. While TextBlob does nothing particularly new or exciting, it makes working with text very enjoyable and removes a lot of barriers. The library provides built-in functions for text classification and sentiment analysis.
  • Gensim: Gensim is a Python library designed specifically for “topic modeling, document indexing, and similarity retrieval with large corpora.” All algorithms in Gensim are memory-independent with respect to corpus size, so it can process input larger than RAM. Even though it is written in pure Python, Gensim is fast and memory-efficient.
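As a tiny taste of one of the libraries above, here is NLTK’s Porter stemmer, which works out of the box with no extra corpus downloads (assuming NLTK itself is installed, e.g. via pip). Stemming, crudely chopping suffixes off words, is one of the text pre-processing steps this series will cover:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The Porter algorithm strips common English suffixes heuristically,
# so the output is a "stem", not necessarily a dictionary word.
for word in ["running", "flies", "easily"]:
    print(word, "->", stemmer.stem(word))  # e.g. "running" -> "run"
```

Note that stems like the one produced for “flies” are not real words; that is expected behavior for a rule-based stemmer, and one of the trade-offs the later pre-processing article will discuss.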

This was a basic introduction to NLP and the libraries that provide NLP functionality. Subsequent articles will dive deeper into each of the topics listed above.

So let’s get started on our NLP journey!
