Natural Language Processing: Advanced Techniques ~ In-Depth Analysis

Published in Analytics Vidhya, May 2, 2021


What is NLP?

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. The goal is for computers to process or “understand” natural language to perform tasks like Language Translation and Question Answering.

With the rise of voice interfaces and chatbots, NLP is one of the most important technologies of the information age and a crucial part of artificial intelligence. Fully understanding and representing the meaning of language is an extremely difficult goal. Why? Because human language is quite special.

The field of artificial intelligence has always envisioned machines being able to mimic the functioning and abilities of the human mind. Language is considered one of the most significant achievements of humans, one that has accelerated the progress of humanity. So, it is not a surprise that there is plenty of work being done to integrate language into the field of artificial intelligence in the form of Natural Language Processing (NLP). Today we see the work being manifested in the likes of Alexa and Siri.

NLP primarily comprises natural language understanding (human to machine) and natural language generation (machine to human). This article will mainly deal with natural language understanding (NLU). In recent years there has been a surge in unstructured data in the form of text, videos, audio, and photos. NLU aids in extracting valuable information from text such as social media data, customer surveys, and complaints.

What’s special about human language? A few things actually:

  • Human language is a system specifically constructed to convey the speaker/writer’s meaning. It’s not just an environmental signal but a deliberate communication. Besides, it uses an encoding that little kids can learn quickly; it also changes.
  • Human language is mostly a discrete/symbolic/categorical signaling system, presumably because of greater signaling reliability.
  • The categorical symbols of a language can be encoded as a signal for communication in several ways: sound, gesture, writing, images, etc. Human language is capable of being any of those.
  • Human languages are ambiguous (unlike programming and other formal languages); thus there is a high level of complexity in representing, learning, and using the linguistic, situational, contextual, word, and visual knowledge that human language draws on.

Why study NLP?

There’s a fast-growing collection of useful applications derived from this field of study. They range from simple to complex. Below are a few of them:

  • Spell Checking, Keyword Search, Finding Synonyms.
  • Extracting information from websites such as product price, dates, location, people, or company names.
  • Classifying: reading level of school texts, positive/negative sentiment of longer documents.
  • Machine Translation.
  • Spoken Dialog Systems.
  • Complex Question Answering.

Indeed, these applications have been used abundantly in the industry: from search (written and spoken) to online advertisement matching; from automated/assisted translation to sentiment analysis for marketing or finance/trading; and from speech recognition to chatbots/dialog agents (automating customer support, controlling devices, ordering goods).

Deep Learning

Most of these NLP technologies are powered by Deep Learning, a subfield of machine learning. Deep Learning only started to gain momentum again in the early 2010s, mainly due to these circumstances:

  • Larger amounts of training data.
  • Faster machines and multicore CPU/GPUs.
  • New models and algorithms with advanced capabilities and improved performance: More flexible learning of intermediate representations, more effective end-to-end joint system learning, more effective learning methods for using contexts and transferring between tasks, as well as better regularization and optimization methods.

Most machine learning methods work well because of human-designed representations and input features and weight optimization to best make a final prediction. On the other hand, in deep learning, representation learning attempts to automatically learn good features or representations from raw inputs. Manually designed features in machine learning are often over-specified, incomplete, and take a long time to design and validate. In contrast, deep learning’s learned features are easy to adapt and fast to learn.

Deep Learning provides a very flexible, universal, and learnable framework for representing the world for visual and linguistic information. Initially, it resulted in breakthroughs in fields such as speech recognition and computer vision. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. These models can often be trained with a single end-to-end model and do not require traditional, task-specific feature engineering.

Text Embeddings

In traditional NLP, we regard words as discrete symbols, which can then be represented by one-hot vectors. A vector’s dimension is the number of words in the entire vocabulary. The problem with words as discrete symbols is that there is no natural notion of similarity for one-hot vectors. Thus, the alternative is to learn to encode similarity in the vectors themselves. The core idea is that a word’s meaning is given by the words that frequently appear close-by.
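
To make the similarity problem concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the helper names are illustrative assumptions, not part of any library.

```python
# Toy vocabulary; the words are illustrative assumptions.
vocab = ["hotel", "motel", "banana"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product, used here as a simple similarity score."""
    return sum(a * b for a, b in zip(u, v))

# Any two distinct one-hot vectors are orthogonal, so a related pair
# (hotel, motel) scores exactly the same as an unrelated pair (hotel, banana).
hotel_motel = dot(one_hot("hotel"), one_hot("motel"))
hotel_banana = dot(one_hot("hotel"), one_hot("banana"))
print(hotel_motel, hotel_banana)
```

Both scores are zero, which is why the vectors themselves have to be learned if we want them to carry similarity.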

Text Embeddings are real-valued vector representations of strings. We build a dense vector for each word, chosen so that it’s similar to vectors of words that appear in similar contexts. Word embeddings are considered a great starting point for most deep NLP tasks. They allow deep learning to be effective on smaller datasets, as they are often the first inputs to a deep learning architecture and the most popular way of transfer learning in NLP. The most popular names in word embeddings are Word2vec by Google (Mikolov) and GloVe by Stanford (Pennington, Socher, and Manning). Let’s delve deeper into these word representations.

In Word2vec, we have a large corpus of text in which every word in a fixed vocabulary is represented by a vector. We then go through each position t in the text, which has a center word c and context words o. Next, we use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa). We keep adjusting the word vectors to maximize this probability.
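
As a rough illustration of "the probability of o given c", the sketch below turns dot-product similarities into probabilities with a softmax over the vocabulary. The two-dimensional vectors and the tiny vocabulary are made-up assumptions; real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hypothetical vectors: one table for center words, one for context words.
center_vecs = {"king": [0.9, 0.1]}
context_vecs = {
    "queen": [0.8, 0.2],
    "crown": [0.7, 0.0],
    "banana": [-0.5, 0.9],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def prob_context_given_center(o, c):
    """Softmax over dot products: exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = {w: math.exp(dot(vec, center_vecs[c])) for w, vec in context_vecs.items()}
    return scores[o] / sum(scores.values())

probs = {w: prob_context_given_center(w, "king") for w in context_vecs}
print(probs)
```

Training nudges the vectors so that words actually observed near "king" receive a larger share of this probability mass.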

For efficient training of Word2vec, we can eliminate very high-frequency words that carry little meaning (such as a, the, of, then…) from the dataset. This helps improve model accuracy and training time. Additionally, we can use negative sampling: for every input, we update the weights for all the correct labels, but only for a small number of incorrect labels.

Word2vec has 2 model variants worth mentioning:

Skip-Gram: We consider a context window containing k consecutive terms. We take the word at the center of the window and train a neural network that receives this center word and predicts the surrounding context words. Therefore, if 2 words repeatedly share similar contexts in a large corpus, their embedding vectors will be close to each other.

CBOW and SkipGram

Continuous Bag of Words: We take lots and lots of sentences in a large corpus. Every time we see a word, we take its surrounding words. Then we input the context words to a neural network and predict the word in the center of this context. Each set of context words together with its center word forms one training instance, and thousands of such instances make up the dataset for the neural network. We train the neural network and finally, the encoded hidden layer output represents the embedding for a particular word. It so happens that when we train this over a large number of sentences, words in similar contexts get similar vectors.
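
The difference between the two variants is easiest to see in the training examples they produce. The sketch below assumes the standard formulation, in which Skip-Gram predicts each context word from the center word and CBOW predicts the center word from its context. The function names and the toy sentence are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """One (center, context) pair per context word inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """One (context words, center) example per position in the text."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the cat sat on the mat".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg[:3])
print(cb[1])
```

Skip-Gram yields many small examples per position, while CBOW yields one example per position with the whole context as input.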

One grievance with both Skip-Gram and CBOW is that they’re both window-based models, meaning the co-occurrence statistics of the corpus are not used efficiently, resulting in suboptimal embeddings.

The GloVe model seeks to solve this problem by capturing the meaning of one word embedding with the structure of the whole observed corpus. To do so, the model trains on global co-occurrence counts of words and makes efficient use of these statistics by minimizing a least-squares error. As a result, it produces a word vector space with a meaningful substructure, in which vector distance preserves the similarities between words.
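
The two ingredients, global co-occurrence counts and a weighted least-squares error, can be sketched as follows. This is a simplified illustration and not the reference implementation: it uses plain counts without distance weighting, and the defaults x_max=100 and alpha=0.75 are the values commonly quoted from the GloVe paper. The toy corpus is an assumption.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weights rare pairs and caps the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between (w_i . w_j + biases) and log(x_ij)."""
    pred = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (pred - math.log(x_ij)) ** 2

corpus = [["ice", "is", "cold"], ["steam", "is", "hot"]]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("ice", "is")], counts[("ice", "cold")])
```

The full objective sums `glove_pair_loss` over every pair with a non-zero count and minimizes it with respect to the vectors and biases.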

Besides these 2 text embeddings, there are many more advanced models developed recently, including FastText, Poincare Embeddings, sense2vec, Skip-Thought, Adaptive Skip-Gram.

Machine Translation

Machine Translation is the classic test of language understanding. It consists of both language analysis and language generation. Big machine translation systems have huge commercial use, as the global language industry is worth $40 billion per year. To give you some notable examples:

  • Google Translate goes through 100 billion words per day.
  • Facebook uses machine translation to translate text in posts and comments automatically, to break language barriers, and allow people around the world to communicate with each other.
  • eBay uses Machine Translation tech to enable cross-border trade and connect buyers and sellers around the world.
  • Microsoft brings AI-powered translation to end-users and developers on Android, iOS, and Amazon Fire, whether or not they have access to the Internet.
  • Systran became the 1st software provider to launch a Neural Machine Translation engine in more than 30 languages back in 2016.

In a traditional Machine Translation system, we have to use a parallel corpus: a collection of texts, each of which is translated into one or more languages other than the original.

For example, given the source language “f” (e.g. French) and the target language “e” (e.g. English), we need to build multiple statistical models, including a probabilistic formulation using Bayes’ rule, a translation model p(f|e) trained on the parallel corpus, and a language model p(e) trained on the English-only corpus.
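
Under Bayes’ rule, the best English sentence is the one that maximizes p(f|e) · p(e), since p(f) is the same for every candidate. The sketch below scores three candidate translations this way. All probabilities are made-up numbers for illustration, and a real system would search over a vast candidate space instead of a fixed list.

```python
import math

# Hypothetical model outputs for the French input f = "le chat".
translation_model = {"the cat": 0.6, "cat the": 0.6, "the dog": 0.05}   # p(f | e)
language_model = {"the cat": 0.02, "cat the": 0.0001, "the dog": 0.02}  # p(e)

def score(e):
    """log p(f|e) + log p(e); log space avoids underflow on long sentences."""
    return math.log(translation_model[e]) + math.log(language_model[e])

best = max(translation_model, key=score)
print(best)
```

The translation model alone cannot separate "the cat" from "cat the"; the language model is what prefers fluent English word order.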

Needless to say, this description skips hundreds of important details. The approach requires a lot of human feature engineering, consists of many different and independent machine learning problems, and overall is a very complex system.

Neural Machine Translation is the approach of modeling this entire process via one big artificial neural network, typically a Recurrent Neural Network (RNN).

An RNN is a stateful neural network: it has connections between passes, that is, connections through time. Neurons are fed information not just from the previous layer but also from themselves from the previous pass. This means that the order in which we feed the input and train the network matters: feeding it “Donald” and then “Trump” may yield different results compared to feeding it “Trump” and then “Donald”.
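
A one-neuron recurrent network is enough to see why order matters. In this minimal sketch the weights and the numeric inputs (standing in for two word embeddings) are arbitrary assumptions.

```python
import math

def rnn_final_state(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Scalar RNN: the new state depends on the current input and the old state."""
    h = h0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

forward = rnn_final_state([1.0, -1.0])   # e.g. "Donald" then "Trump"
backward = rnn_final_state([-1.0, 1.0])  # e.g. "Trump" then "Donald"
print(forward, backward)
```

The two inputs are the same as a set, yet the final states differ because each step feeds on the state left behind by the previous one.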

Recurrent Neural Network for Machine Translation

Standard Neural Machine Translation is an end-to-end neural network where the source sentence is encoded by an RNN called the encoder, and the target words are predicted using another RNN known as the decoder. The RNN encoder reads a source sentence one symbol at a time and then summarizes the entire source sentence in its last hidden state. The RNN decoder starts from this summary and generates the translated version one word at a time; both networks are trained jointly with back-propagation. Amazingly, Neural Machine Translation went from a fringe research activity in 2014 to the widely adopted leading way to do Machine Translation in 2016. So what are the big wins of using Neural Machine Translation?

  1. End-to-end training: All parameters in Neural Machine Translation are simultaneously optimized to minimize a loss function on the network’s output.
  2. Distributed representations share strength: Neural Machine Translation has better exploitation of word and phrase similarities.
  3. Better exploration of context: Neural Machine Translation can use a much bigger context — both source and partial target text — to translate more accurately.
  4. More fluent text generation: Deep learning text generation is of much higher quality than that of the traditional statistical approach.

One big problem with RNNs is the vanishing (or exploding) gradient problem where, depending on the activation functions used, information rapidly gets lost over time. Intuitively, this wouldn’t be much of a problem because these are just weights and not neuron states, but the weights through time are actually where the information from the past is stored; if the weight reaches a value of 0 or 1,000,000, the previous state won’t be very informative. As a consequence, RNNs will experience difficulty in memorizing previous words very far away in the sequence and are only able to make predictions based on the most recent words.
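
The effect can be reproduced with a scalar RNN. By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor per time step, so it shrinks or grows geometrically. The weights, step count, and function names below are illustrative assumptions.

```python
import math

def tanh_gradient_through_time(w_h, steps, x=0.0, h0=0.5):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w_h * h_{t-1} + x)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h + x)
        grad *= w_h * (1.0 - h * h)  # recurrent weight times tanh derivative
    return grad

def linear_gradient_through_time(w_h, steps):
    """Same quantity for a linear recurrence h_t = w_h * h_{t-1}."""
    return w_h ** steps

vanishing = tanh_gradient_through_time(w_h=0.5, steps=50)
exploding = linear_gradient_through_time(w_h=1.5, steps=50)
print(vanishing, exploding)
```

After 50 steps the first gradient is far too small to carry a learning signal back to the early inputs, while the second has grown by many orders of magnitude.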

Long short-term memory (LSTM) networks try to combat the vanishing / exploding gradient problem by introducing gates and an explicitly defined memory cell. Each neuron has a memory cell and three gates: input, output, and forget. The function of these gates is to safeguard the information by stopping or allowing the flow of it.

  • The input gate determines how much of the information from the previous layer gets stored in the cell.
  • The output gate takes the job on the other end and determines how much of the next layer gets to know about the state of this cell.
  • The forget gate seems like an odd inclusion at first but sometimes it’s good to forget: if it’s learning a book and a new chapter begins, it may be necessary for the network to forget some characters from the previous chapter.
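
The three gates can be sketched as a single scalar LSTM step. This follows the commonly used LSTM formulation with one weight per connection. The parameter names and the `make_params` helper are hypothetical, and real implementations use weight matrices and vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; returns the new hidden and cell state."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g      # keep part of the old memory, add new content
    h = o * math.tanh(c)        # expose part of the memory to the next layer
    return h, c

def make_params(bi=0.0, bf=0.0, bo=0.0):
    """Hypothetical helper: all weights zero, so only the gate biases act."""
    p = {k: 0.0 for k in ("wi", "ui", "wf", "uf", "wo", "uo", "wg", "ug", "bg")}
    p.update(bi=bi, bf=bf, bo=bo)
    return p

# Forget gate wide open, input gate closed: the memory cell is preserved.
_, kept = lstm_step(1.0, 0.0, 0.7, make_params(bi=-20.0, bf=20.0))
# Forget gate closed: the old memory is wiped.
_, wiped = lstm_step(1.0, 0.0, 0.7, make_params(bi=-20.0, bf=-20.0))
print(kept, wiped)
```

Because the cell state is carried forward by multiplication with the forget gate and not squashed at every step, information can survive over many time steps when the gate stays close to 1.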

LSTMs can learn complex sequences, such as writing like Shakespeare or composing primitive music. Note that each of these gates has a weight to a cell in the previous neuron, so they typically require more resources to run. LSTMs are currently very hip and have been used a lot in machine translation. Besides that, the LSTM is the default model for most sequence labeling tasks that have lots and lots of data.

LSTM Architecture

Gated recurrent units (GRU) are a slight variation on LSTMs and are also used in Neural Machine Translation. They have one less gate and are wired slightly differently: instead of an input, output, and a forget gate, they have an update gate and a reset gate. This update gate determines both how much information to keep from the last state and how much information to let in from the previous layer.

GRU Architecture

The reset gate functions much like the forget gate of an LSTM, but it’s located slightly differently. GRUs always send out their full state; they don’t have an output gate. In most cases, they function very similarly to LSTMs, with the biggest difference being that GRUs are slightly faster and easier to run (but also slightly less expressive). In practice, these tend to cancel each other out, as you need a bigger network to regain some expressiveness, which in turn cancels out the performance benefits. In some cases where extra expressiveness is not needed, GRUs can outperform LSTMs.
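
For comparison with the LSTM, here is a scalar sketch of one GRU step with an update gate z and a reset gate r. It uses the convention h_new = (1 - z) * h_prev + z * candidate; some references swap the roles of z and 1 - z. Parameter names and the `make_params` helper are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU; the full state is returned, with no output gate."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand

def make_params(bz=0.0, br=0.0):
    """Hypothetical helper: gates driven only by their biases."""
    return {"wz": 0.0, "uz": 0.0, "bz": bz,
            "wr": 0.0, "ur": 0.0, "br": br,
            "wh": 1.0, "uh": 1.0, "bh": 0.0}

# Update gate closed: the previous state passes through unchanged.
kept = gru_step(1.0, 0.3, make_params(bz=-20.0))
# Update gate open and reset gate closed: the state is rebuilt from the input alone.
overwritten = gru_step(1.0, 0.3, make_params(bz=20.0, br=-20.0))
print(kept, overwritten)
```

A single update gate plays the role that the input and forget gates share in an LSTM, which is where the saving of one gate comes from.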


Besides these 3 major architectures, there have been further improvements in neural machine translation systems over the past few years. Below are the most notable developments:

  • Sequence to Sequence Learning with Neural Networks proved the effectiveness of LSTM for Neural Machine Translation. It presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. The method uses a multilayered LSTM to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
  • Neural Machine Translation by Jointly Learning to Align and Translate introduced the attention mechanism in NLP. Acknowledging that the use of a fixed-length vector is a bottleneck in improving the performance of Neural Machine Translation (NMT), the authors propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as hard segments explicitly.
  • Convolutional over Recurrent Encoder for Neural Machine Translation augments the standard RNN encoder in NMT with additional convolutional layers to capture the wider context in the encoder output.
  • Google built its own NMT system, called Google’s Neural Machine Translation, which addresses many issues in accuracy and ease of deployment. The model consists of a deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder.
  • Instead of using Recurrent Neural Networks, Facebook AI Researchers use convolutional neural networks for sequence-to-sequence learning tasks in NMT.

Dialogue and Conversations

A lot has been written about conversational AI, and a majority of it focuses on vertical chatbots, messenger platforms, business trends, and startup opportunities (Amazon Alexa, Apple Siri, Facebook M, Google Assistant, Microsoft Cortana). AI’s capability of understanding natural language is still limited. As a result, creating fully automated, open-domain conversational assistants has remained an open challenge. Nonetheless, the work shown below serves as a great starting point for people who want to seek the next breakthrough in conversational AI.

AI Assistants

Researchers from Montreal, Georgia Tech, Microsoft, and Facebook built a neural network that is capable of generating context-sensitive conversational responses. This novel response generation system can be trained end-to-end on large quantities of unstructured Twitter conversations. A Recurrent Neural Network architecture is used to address sparsity issues that arise when integrating contextual information into classic statistical models, allowing the system to take into account previous dialog utterances. The model shows consistent gains over both context-sensitive and non-context-sensitive Machine Translation and Information Retrieval baselines.

Encode Decoder Mechanism

Developed in Hong Kong, Neural Responding Machine (NRM) is a neural-network-based response generator for short-text conversation. It takes the general encoder-decoder framework. First, it formalizes the generation of response as a decoding process based on the latent representation of the input text, while both encoding and decoding are realized with Recurrent Neural Networks. The NRM is trained with a large amount of one-round conversation data collected from a microblogging service. Empirical study shows that NRM can generate grammatically correct and content-wise appropriate responses to over 75% of the input text, outperforming state-of-the-art methods in the same setting.

Encoder-Decoder Machine Translation.

Last but not least, Google’s Neural Conversational Model is a simple approach to conversational modeling. It uses the sequence-to-sequence framework. The model converses by predicting the next sentence given the previous sentence(s) in a conversation. The strength of the model is that it can be trained end-to-end and thus requires far fewer hand-crafted rules.

The model can generate simple conversations given a large conversational training dataset. It can extract knowledge from both a domain-specific dataset, and from a large, noisy, and general domain dataset of movie subtitles. On a domain-specific IT help-desk dataset, the model can find a solution to a technical problem via conversations. On a noisy open-domain movie transcript dataset, the model can perform simple forms of common sense reasoning.

Sentiment Analysis

Human communication isn’t just words and their explicit meanings. Instead, it’s nuanced and complex. You can tell based on the way a friend asks you a question whether they’re bored, angry, or curious. You can tell based on word choice and punctuation whether a customer is getting furious, even in a completely text-based conversation.

You can read an Amazon review for a product and understand whether the reviewer liked or disliked it even if they never directly said so.

For computers to truly understand the way humans communicate every day, they need to understand more than the objective definitions of words; they need to understand our sentiments, what we really mean. Sentiment analysis is this process of interpreting the meaning of larger text units (entities, descriptive terms, facts, arguments, stories) by the semantic composition of smaller elements.

The traditional approach to sentiment analysis is to treat a sentence as a bag of words and to consult a curated list of “positive” and “negative” words to determine the sentiment of that particular sentence. This would require hand-designed features to capture the sentiment, which is extremely time-consuming and unscalable.
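
A minimal sketch of this bag-of-words approach is shown below. The two word lists are tiny, made-up stand-ins for a curated sentiment lexicon.

```python
# Hypothetical miniature lexicons; real ones contain thousands of entries.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "boring"}

def lexicon_sentiment(sentence):
    """Count positive and negative words, ignoring word order entirely."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

a = lexicon_sentiment("I love this great phone")
b = lexicon_sentiment("This movie was not good")
print(a, b)
```

The second sentence is labeled positive because the word "good" is counted while the negation is ignored, which illustrates why word order and composition matter for sentiment.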

Modern deep learning approaches can model morphology, syntax, and logical semantics; for sentiment analysis, the most effective of them is the Recursive Neural Network. As the name implies, the main assumption behind Recursive Neural Nets is that recursion is a natural way of describing language. Recursion is useful in disambiguation, helpful for some tasks to refer to specific phrases, and works extremely well for tasks that use a grammatical tree structure.

Sentiment’s word cluster.

Recursive Neural Networks are perfect for settings that have a nested hierarchy and an intrinsic recursive structure. If we think about a sentence, doesn’t this have such a structure?

Take the sentence “A big crowd violently attacks the unarmed police.” First, we break apart the sentence into its respective Noun Phrase and Verb Phrase: “A big crowd” and “violently attacks the unarmed police.” But there’s a noun phrase within that verb phrase, right? It splits into “violently attacks” and the noun phrase “the unarmed police.” Seems pretty recursive.

The syntactic rules of language are highly recursive. So we take advantage of that recursive structure with a model that respects it! Another added benefit of modeling sentences with Recursive Neural Networks is that we can now input sentences of arbitrary length. This was a huge head-scratcher for using Neural Nets in NLP, and it used to require very clever tricks to make the input vectors of sentences the same size even though the sentences themselves differ in length.

The Standard RNN is the most basic version of a Recursive Neural Network. It has a max-margin structure prediction architecture that can successfully recover such structure both in complex scene images as well as sentences. It’s used to provide a competitive syntactic parser for natural language sentences from the Penn Treebank.

For your reference, the Penn Treebank is the first large-scale treebank dataset, composed of 2,499 stories selected for syntactic annotation from a three-year Wall Street Journal (WSJ) collection of 98,732 stories. Additionally, the standard RNN outperforms alternative approaches for semantic scene segmentation, annotation, and classification.

However, the standard RNN captures neither the full syntactic nor semantic richness of linguistic phrases. The Syntactically Untied RNN, otherwise known as Compositional Vector Grammar (CVG), is a major upgrade that addresses this issue. It uses a syntactically untied recursive neural network that learns syntactic-semantic and compositional vector representations. The model is fast to train and implemented as efficiently as the standard RNN. It learns a soft notion of headwords and improves performance on the types of ambiguities that require semantic information.

Another evolution is the Matrix-Vector RNN, which is capable of capturing the compositional meaning of even much longer phrases. The model assigns a vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases. This matrix-vector RNN can learn the meaning of operators in propositional logic and natural language.

As a result, the model obtains state-of-the-art performance on three different experiments:

  • Predicting fine-grained sentiment distributions of adverb-adjective pairs.
  • Classifying sentiment labels of movie reviews.
  • Classifying semantic relationships such as cause-effect or topic-message between nouns using the syntactic path between them.
Recursive Neural Tensor Network

The most powerful RNN model for sentiment analysis developed thus far is the Recursive Neural Tensor Network, which has a tree structure with a neural net at each node. This model can be used for boundary segmentation to determine which word groups are positive and which are negative. The same applies to sentences as a whole. When trained on the Sentiment Treebank, this model outperformed all previous methods on several metrics by more than 5%. At the time of its publication, it was the only model that could accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.

Question Answering

The idea of a Question Answering (QA) system is to extract information, directly from documents, conversations, online searches, and elsewhere, that will meet a user’s information needs. Rather than make the user read through an entire document, a QA system prefers to give a short and concise answer. Nowadays, a QA system can combine very easily with other NLP systems like chatbots, and some QA systems even go beyond the search of text documents and can extract information from a collection of pictures.

In fact, most NLP problems can be considered question answering problems. The paradigm is simple: we issue a query, and the machine responds. By reading through a document, or a set of instructions, an intelligent system should be able to answer a wide variety of questions. So naturally, we’d like to design a model that can be used for general QA.

A powerful deep learning architecture, known as the Dynamic Memory Network (DMN), has been developed and optimized specifically for QA problems. Given a training set of input sequences (knowledge) and questions, it can form episodic memories, and use them to generate relevant answers. The architecture has the following components:

  • The Semantic Memory Module (analogous to a knowledge base) consists of pre-trained GloVe vectors that are used to create sequences of word embeddings from input sentences. These vectors will act as inputs to the model.
  • The Input Module processes the input vectors associated with a question into a set of vectors termed facts. This module is implemented using a Gated Recurrent Unit. The GRU enables the network to learn if the sentence currently under consideration is relevant or has nothing to do with the answer.
  • The Question Module processes the question word by word and outputs a vector using the same GRU as the input module, and the same weights. Both facts and questions are encoded as embeddings.
  • The Episodic Memory Module receives the fact and question vectors extracted from the input and encoded as embeddings. This uses a process inspired by the brain’s hippocampus, which can retrieve temporal states that are triggered by some response, like sights or sounds.
  • Finally, the Answer Module generates an appropriate response. By the final pass, the episodic memory should contain all the information required to answer the question. This module uses another GRU, trained with a cross-entropy classification error against the correct sequence; its output can then be converted back to natural language.
Dynamic Memory Network(DMN)

DMN not only did extremely well for QA tasks but also outperformed other architectures for sentiment analysis and part-of-speech tagging. Since its inception, there have been major improvements to Dynamic Memory Networks to further improve their accuracy on question answering tasks, including:

  • Dynamic Memory Networks for Visual and Textual Question Answering is basically DMN being applied to images. Its memory and input modules are upgraded to be able to answer visual questions. This model improves the state of the art on many benchmark Visual Question Answering datasets without supporting fact supervision.
  • Dynamic Coattention Networks for Question Answering addresses the problem of recovering from local maxima corresponding to incorrect answers. It first fuses co-dependent representations of the question and the document to focus on relevant parts of both. Then, a dynamic pointing decoder iterates over potential answer spans. This iterative procedure enables the model to recover from initial local maxima corresponding to incorrect answers.

Text Summarization

It’s very difficult for human beings to manually summarize large documents of text. Text summarization is the problem in NLP of creating short, accurate, and fluent summaries for source documents. It’s become an important and timely tool for assisting and interpreting text information in today’s fast-growing information age. With push notifications and article digests gaining more and more traction, the need for intelligent and accurate summaries of long pieces of text has been growing every day.

How does it work?

Automatic summarization of text works by first calculating the word frequencies for the entire text document. Then, the 100 most common words are stored and sorted. Each sentence is then scored based on how many high-frequency words it contains, with higher frequency words being worth more. Finally, the top X sentences are taken and sorted based on their position in the original text.
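
The steps above can be sketched in a few lines of Python. This is a simplified illustration under some assumptions: sentences are split on end punctuation, words are lowercased alphabetic tokens, and no stop words are removed. The function name and the sample text are made up.

```python
import re
from collections import Counter

def summarize(text, num_sentences=2, top_words=100):
    """Frequency-based extractive summary, returned in original sentence order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Keep only the most common words and their counts.
    freq = dict(Counter(words).most_common(top_words))

    def score(sentence):
        # Each high-frequency word contributes its count to the sentence score.
        return sum(freq.get(w, 0) for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original position order
    return " ".join(sentences[i] for i in chosen)

text = ("Cats sleep a lot. Cats chase mice and cats climb trees. "
        "Dogs bark. Cats purr when cats are happy.")
summary = summarize(text, num_sentences=2)
print(summary)
```

In practice one would usually drop stop words before counting, since otherwise words like "the" dominate the frequency table and longer sentences are favored.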

By keeping things simple and for a general purpose, the automatic text summarization algorithm can function in a variety of situations that other implementations might struggle with, such as documents containing foreign languages or unique word associations that aren’t found in standard English language corpora.

There are two fundamental approaches to text summarization: extractive and abstractive. The former extracts words and word phrases from the original text to create a summary. The latter learns an internal language representation to generate more human-like summaries, paraphrasing the intent of the original text.

The methods in extractive summarization work by selecting a subset. This is done by extracting the phrases or sentences from the actual article to form a summary. LexRank and TextRank are well-known extractive summarization algorithms. Both of them use a variation of the Google PageRank algorithm.

  • LexRank is an unsupervised graph-based algorithm that uses IDF-modified Cosine as the similarity measure between two sentences. This similarity is used as the weight of the graph edge between two sentences. LexRank also incorporates an intelligent post-processing step that makes sure top sentences chosen for the summary are not too similar to each other.
  • TextRank is a similar algorithm to LexRank with a few enhancements, such as using lemmatization instead of stemming, incorporating Part-Of-Speech tagging and Named Entity Resolution, extracting key phrases from the article, and extracting summary sentences based on those phrases. Along with a summary of the article, TextRank also extracts meaningful key phrases from the article.
Text Translation

Models for abstractive summarization fall under the larger umbrella of deep learning. There have been certain breakthroughs in text summarization using deep learning. Below are some of the most notable published results by some of the biggest companies in the field of NLP:

  • Facebook’s Neural Attention is a neural network architecture that utilizes a local attention-based model capable of generating each word of the summary conditioned on the input sentence.
  • Google Brain’s Sequence-to-Sequence model follows an encoder-decoder architecture. The encoder is responsible for reading the source document and encoding it to an internal representation. The decoder is a language model responsible for generating each word in the output summary using the encoded representation of the source document.
  • IBM Watson uses a similar Sequence-to-Sequence model, but with attention and bidirectional recurrent neural network features.

Attention Mechanism

Attention Mechanisms in Neural Networks are loosely based on the visual attention mechanism found in humans. Human visual attention is well-studied, and while different models exist, all of them essentially come down to being able to focus on a certain region of an image in “high resolution” while perceiving the surrounding image in “low resolution,” and then adjusting the focal point over time.

Imagine you’re reading a whole essay: instead of going through each word or character sequentially, you subconsciously focus on a few sentences of the highest information density and filter out the rest. Your attention effectively hierarchically captures contextual information, such that it’s sufficient for decision-making while reducing overheads.

So why is this important? Encoder-decoder models built on LSTM or GRU units rely on reading a complete sentence and compressing all the information into a fixed-length vector. This requires sophisticated feature engineering based on the statistical properties of text. Squeezing a sentence with hundreds of words into a single vector of that size will almost certainly lead to information loss, inadequate translation, and similar problems.

Attention Mechanism

With an attention mechanism, we no longer try to encode the full source sentence into a fixed-length vector. Rather, we allow the decoder to attend to different parts of the source sentence at each step of the output generation. We let the model learn what to attend to based on the input sentence and what it has produced so far.

According to the image above from Effective Approaches to Attention-Based Neural Machine Translation, blue represents encoder and red represents decoder, so we can see that the context vector takes all cells’ outputs as input to compute the probability distribution of source language words for every single word the decoder wants to generate. By utilizing this mechanism, the decoder can capture global information rather than solely infer based on one hidden state.
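
To make the context vector concrete, here is a minimal sketch of one decoding step. It assumes dot-product scoring between the decoder state and each encoder state, which is only one of several possible scoring functions, and it uses made-up two-dimensional toy vectors. The names `attend`, `encoder_states`, and `decoder_state` are illustrative, not taken from the paper.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    # Score each encoder state against the current decoder state (dot product).
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Normalize the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Context vector = attention-weighted sum of all encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy values (assumptions): one encoder state per source word.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

The decoder would repeat this step for every output word, each time with its updated state, so the weights shift across the source sentence as generation proceeds.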

Besides Machine Translation, the attention model works on a variety of other NLP tasks. In Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, the authors apply attention mechanisms to the problem of generating image descriptions. They use a Convolutional Neural Network to encode the image and a Recurrent Neural Network with attention mechanisms to generate a description. By visualizing the attention weights, they interpret what the model is looking at while generating a word:

Attention Neural Image Caption.

Attention does come at a cost, however. We need to calculate an attention value for each combination of input and output words. If you have a 100-word input sequence and generate a 100-word output sequence, that would be 10,000 attention values. If you do character-level computations and deal with sequences consisting of hundreds of tokens, the above mechanisms can become prohibitively expensive.
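
The arithmetic above can be checked with a short sketch. The helper name `attention_value_count` is hypothetical; it simply counts one attention value per pair of input position and output step, ignoring any batching or multiple attention layers.

```python
def attention_value_count(input_len, output_len):
    # One attention value per (output step, input position) pair.
    return input_len * output_len


for n in (100, 500, 1000):
    print(n, attention_value_count(n, n))
# 100 10000
# 500 250000
# 1000 1000000
```

When input and output lengths grow together, the count grows quadratically: ten times longer sequences mean a hundred times more attention values.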

Natural Language Processing Obstacles

It should be noted that in each of the 7 NLP techniques, researchers have had to deal with a variety of obstacles: limits of the algorithms, scalability of the models, and a vague understanding of human language. The good news is that the development of this field resembles a giant open-source project: researchers keep building better models to solve the existing problems and sharing their results with the community. Here are the major obstacles in NLP that have been addressed thanks to recent academic research progress:

  • There is no single model architecture with consistent state-of-the-art results across tasks. For example, in Question Answering, we have Strongly Supervised End-to-End Memory Networks; in Sentiment Analysis, we have Tree-LSTMs; and in Sequence Tagging, we have Bidirectional LSTM-CRF. The Dynamic Memory Network could perform well consistently across multiple domains.
  • A powerful approach in machine learning is multi-task learning, which shares representations between related tasks to enable the model to generalize better on the original task. However, fully-joint multitask learning is hard, as it’s usually restricted to lower layers, useful only if tasks are related (often hurts performance if tasks are not related), and has the same decoder/classifier in the proposed model. In A Joint Many-Task Model: Growing a NN for Multiple NLP Tasks, the authors pre-define a hierarchical architecture consisting of several NLP tasks as a joint model for multi-task learning. The model includes character n-grams and short-circuits as well as a state-of-the-art, purely feedforward parser, capable of performing dependency parsing, multi-sentence tasks, and joint training.
  • Zero-shot learning is the ability to solve a task despite not having received any training examples of that task. There aren’t many models capable of doing zero-shot learning for NLP, as answers can only be predicted if they were seen during training and as part of the softmax function. To tackle this obstacle, the authors of Pointer Sentinel Mixture Models have combined a standard LSTM softmax with Pointer Networks in a mixture model. The pointer networks help with rare words and long-term dependencies, while the standard softmax can refer to words that are not in the input.
  • Another challenge is the problem of duplicate word representations, where different encodings for the encoder and decoder in a model result in duplicate parameters/meanings. The simplest solution for this is to tie word vectors together and train single weights jointly, as demonstrated in Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling.
  • Another big obstacle is that Recurrent Neural Networks, the basic building block of many deep NLP techniques, are quite slow compared to Convolutional Neural Nets or Feedforward Neural Nets. Quasi-Recurrent Neural Networks take the best parts of RNNs and CNNs to enhance the training speed, using convolutions for parallelism across time and element-wise gated recurrence for parallelism across channels. The authors report that this approach is faster than comparable LSTM-based models while matching or exceeding their accuracy on language modeling and sentiment analysis.
  • Finally, designing neural network architectures by hand is slow and requires a lot of expertise. What if we could use AI to find the right architecture for any problem? Architecture search is the process of using machine learning to automate the design of artificial neural networks, and Neural Architecture Search with Reinforcement Learning from Google Brain is the most viable solution developed so far. The authors use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.
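
As an illustration of the mixture idea in the Pointer Sentinel bullet above, the sketch below blends a vocabulary softmax with a pointer distribution over words in the recent context. It is a simplification under stated assumptions: the gate value is passed in directly, whereas the paper derives it from the model, and all probabilities are made-up toy numbers. The function name `mixture_distribution` and its parameters are hypothetical.

```python
def mixture_distribution(p_vocab, pointer_weights, context_words, gate):
    """Blend a vocabulary softmax with a pointer distribution.

    p_vocab: dict mapping word -> probability from the standard softmax
    pointer_weights: attention weight per position in context_words (sums to 1)
    gate: weight in [0, 1] given to the vocabulary softmax (an input here)
    """
    mixed = {w: gate * p for w, p in p_vocab.items()}
    for word, weight in zip(context_words, pointer_weights):
        # Repeated context words accumulate pointer mass, and words missing
        # from the vocabulary still receive probability.
        mixed[word] = mixed.get(word, 0.0) + (1.0 - gate) * weight
    return mixed


# Toy values (assumptions): "Fed" is not in the softmax vocabulary.
p_vocab = {"the": 0.6, "bank": 0.3, "<unk>": 0.1}
context_words = ["Fed", "chair", "Fed"]
pointer_weights = [0.5, 0.2, 0.3]

mixed = mixture_distribution(p_vocab, pointer_weights, context_words, gate=0.4)
print(mixed)
```

Because the pointer component copies from the context, a rare word such as "Fed" can be predicted even though the softmax alone would assign it no probability.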


So there you go! I showed you a basic rundown of the major natural language processing techniques that can help a computer extract, analyze, and understand useful information from a single text or sequence of texts.

From machine translation that connects humans across cultures to conversational chatbots that help with customer service; from sentiment analysis that deeply understands a human’s mood to attention mechanisms that can mimic our visual attention, the field of NLP is too expansive to cover completely, so I’d encourage you to explore it further, whether through online courses, blog tutorials, or research papers.