Published in Modern NLP

Variety Of Encoders In NLP

A history of feature engineering for text


Encoding text is at the heart of understanding language. If we know how to represent words, sentences and paragraphs with small vectors, all our problems are solved!

Having one generalised model to semantically represent text in a compressed vector is the holy grail of NLP 👻

What does encoding text mean?

When we encode a variable-length text to a fixed-length vector, we are essentially doing feature engineering. If we use language models or embedding modules, we are also doing dimensionality reduction.

As I discussed in one of my previous posts on transfer learning, there are two approaches to modelling: fine-tuning and feature extraction. In this post, I will discuss the various ways of encoding text (feature extraction) with deep learning, which can then be used for downstream tasks. You can read about the advantages of the feature extraction methodology in this post.

Suppose you have this sentence, “I love travelling to beaches.”, and you are working on a classification project. If your vocabulary is huge, it becomes difficult to train the classifier. This happens when you use a TF-IDF vectorizer: every text becomes a sparse vector with one dimension per vocabulary word.
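To make the sparsity concrete, here is a minimal TF-IDF sketch in plain Python. The three example texts and the smoothed idf formula are my own assumptions for illustration (libraries differ in the exact formula); the point is only that the vector length equals the vocabulary size and most entries are zero.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return (vocabulary, one TF-IDF vector per text) for a list of strings."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each word occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency times a smoothed idf (assumed formula, for illustration).
        vec = [
            counts[w] / len(toks) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w in vocab
        ]
        vectors.append(vec)
    return vocab, vectors

# Hypothetical toy corpus.
texts = [
    "i love travelling to beaches",
    "i love reading papers about language models",
    "beaches are crowded in summer",
]
vocab, vectors = tfidf_vectors(texts)
nonzero = sum(1 for x in vectors[0] if x != 0)
print(len(vocab), "dimensions,", nonzero, "non-zero in the first vector")
```

With a real corpus the vocabulary runs into tens of thousands of words, so almost every entry of every vector is zero.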

With embeddings like GloVe you can get a dense vector of 100 dimensions for every word. But the problem with a model like GloVe is that it cannot handle OOV (out-of-vocabulary) words and cannot deal with polysemy, where a word has many possible meanings depending on context.

So a better approach is to use a model like ELMo or USE (Universal Sentence Encoder) to encode text. ELMo builds its word representations from characters, so it can handle unseen words, and its vectors depend on context, so it can deal with polysemy. USE encodes the whole sentence at once. This means that the vector we get for every word/sentence will encapsulate its meaning in context.

Once we have a fixed vector for a word/sentence, we can do anything with it. This is what the feature extraction approach is: create the features once and then do any downstream task. We can try out different classification models and tune their hyperparameters. We can also create a semantic search or recommendation engine.

Now, the real question is what are the different models available for encoding text? Is there a model that works for everything or is it task dependent?

Evaluation of sentence embeddings in downstream and linguistic probing tasks

So I was reading this paper and it opened Pandora’s box for me. Ideally, we want an embedding model which gives us the smallest embedding vector and works great for the task. The smaller the embedding size, the less compute is required for training as well as inference.

As you can see, there is huge variation in embedding size, from 300 to 4800. Intuitively, the larger the vector, the more information it can hold! But is that actually true? Let’s see how they perform on the tasks.

Different embedding models and their vector sizes

Classification tasks

The authors tried out different classification tasks, as shown below, to understand the performance of these models. For the linguistic probing tasks, an MLP was used with a single hidden layer of 50 neurons and no dropout, trained with the Adam optimizer and a batch size of 64.

(The exception is the Word Content (WC) probing task, for which logistic regression was used since it provided consistently better results.)

Classification tasks

From the results we can see that the different ELMo embeddings perform really well on classification tasks. USE and InferSent also come out on top on some of the tasks. The difference between the best and the second best is around 2%. As expected, Word2Vec and GloVe do not top any task, but they are still within about 3% of the best.

The thing to note here: ELMo has a vector size of 1024, USE has 512 and InferSent has 4096. So if somebody actually has to put a system into production, their first choice will be USE and then maybe ELMo.

Results for classification tasks.

Semantic relatedness tasks

Then they try out the embeddings on semantic relatedness and textual similarity tasks. This time the USE (Transformer) model is a clear winner. If we neglect InferSent, whose embedding is 8x bigger than USE's, USE is far ahead of the others.

This makes USE a clear choice for semantic search and similar-question tasks.

BTW, when should we use USE (DAN) and when USE (Transformer)? The compute time of USE (DAN) is O(n) in the length of the text, while it is O(n²) for USE (Transformer). So if you are dealing with long texts, you might want to go with USE (DAN).
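A rough way to see the gap is to count the basic operations each encoder needs as the text grows. This is only a back-of-the-envelope sketch with hypothetical helper names; it ignores constants, layers and heads, and counts one "step" per token for averaging and one per token pair for self-attention.

```python
def averaging_steps(n_tokens):
    # A DAN-style encoder reads each token vector once before the feedforward layers.
    return n_tokens

def self_attention_steps(n_tokens):
    # Self-attention scores every token against every token.
    return n_tokens * n_tokens

for n in (10, 100, 1000):
    print(n, averaging_steps(n), self_attention_steps(n))
```

Going from 100 to 1000 tokens multiplies the averaging cost by 10 but the attention cost by 100, which is why long texts favour DAN.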

Linguistic probing tasks

Next, they show results for the linguistic probing tasks, some of which are fairly esoteric. In this case, ELMo seems to rule the world!

BShift (bi-gram shift) task: the goal is to identify whether two consecutive tokens within the sentence have been inverted, as in “This is my Eve Christmas”.
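Here is a minimal sketch of how BShift-style training examples could be generated. The function names and the 0/1 labelling are my own assumptions, not the paper's code: each sentence yields one untouched copy and one copy with a random adjacent pair swapped.

```python
import random

def bigram_shift(tokens, rng):
    """Return a copy of tokens with one random pair of adjacent tokens swapped."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    shifted = list(tokens)
    shifted[i], shifted[i + 1] = shifted[i + 1], shifted[i]
    return shifted

def make_bshift_examples(sentences, seed=0):
    """Label 0 = original order, label 1 = one bigram inverted."""
    rng = random.Random(seed)
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        examples.append((tokens, 0))
        examples.append((bigram_shift(tokens, rng), 1))
    return examples

examples = make_bshift_examples(["This is my Christmas Eve"])
for tokens, label in examples:
    print(label, " ".join(tokens))
```

A probing classifier is then trained on the sentence embeddings to predict the label; if two neighbouring tokens happen to be identical, the swap is a no-op and such examples should be filtered out.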

The differences are huge between ELMo and non-ELMo models.

Information retrieval tasks

In the caption-image retrieval task, image and language features are jointly evaluated with the objective of ranking a collection of images with respect to a given caption (image retrieval, text2image) or ranking captions with respect to a given image (caption retrieval, image2text).

InferSent is a clear winner in this one. ELMo comes second.

We can say that ELMo is a badass model for sure 😀

Universal Sentence Encoder

As we have seen, USE is a great production-level model, so let's discuss it a bit. I will not talk about ELMo as there are many articles on it.

There are 2 models available for USE

  • Transformer
  • DAN(Deep Averaging Network)

The encoder takes as input a lowercased PTB tokenized string and outputs a 512 dimensional vector as the sentence embedding. Both the encoding models are designed to be as general-purpose as possible. This is accomplished by using multi-task learning whereby a single encoding model is used to feed multiple downstream tasks.


The Transformer variant uses the transformer architecture, which creates context-aware representations for every token. The sentence embedding is created by element-wise addition of the embeddings of all tokens.


DAN is a controversial modelling methodology because it disregards the order of the words. The embeddings of the words (GloVe vectors in the original DAN paper) are first averaged together and then passed through a feedforward deep neural network to produce sentence embeddings.

The model makes use of a deep network to amplify the small differences in embeddings that might come from just one word like good/bad. It performs great most of the time but experiments show it fails at double negation like “not bad” because the model strongly associates ‘not’ with negative sentiment. Have a look at the last example.

Failures at double negation

This makes USE (DAN) a great model for classifying news articles into categories, but it might cause problems in sentiment classification, where words like ‘not’ can change the meaning.
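The DAN idea discussed above (average the word vectors, then push the average through a feedforward network) fits in a few lines. This is a toy sketch: the 3-dimensional word vectors and the layer weights are made-up numbers, where a real DAN would use learned embeddings and trained weights. It shows why word order cannot matter to this kind of encoder.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, not real embeddings).
EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "bites": [0.0, 0.8, 0.3],
    "man": [0.2, 0.1, 0.9],
}

# One hidden layer with made-up weights; a trained DAN would learn these.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
B = [0.0, 0.1]

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def feedforward(x):
    """A single tanh layer: tanh(W x + b)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W, B)
    ]

def dan_encode(sentence):
    vectors = [EMBEDDINGS[tok] for tok in sentence.lower().split()]
    return feedforward(average(vectors))

a = dan_encode("dog bites man")
b = dan_encode("man bites dog")
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(same)
```

Both sentences get the same embedding even though they mean different things, because averaging throws the order away before the network sees anything.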

What do you learn from context?

The fact that a model like DAN is as good as the transformer raises a question: are our models really taking word order into account, and is ordering as important as we thought?

Let’s discuss what we learn from context. In this paper, the authors try to understand where these contextual representations improve over conventional word embeddings.

Tasks taken for evaluation

The authors introduce a suite of “edge probing” tasks designed to probe the sub-sentential structure of contextualized word embeddings. These tasks are derived from core NLP tasks and encompass a range of syntactic and semantic phenomena.

They use the tasks to explore how contextual embeddings improve on their lexical (context-independent) baselines. They focus on four recent models for contextualized word embeddings: CoVe, ELMo, OpenAI GPT, and BERT.

ELMo, CoVe, and GPT all follow a similar trend (Table 2), showing the largest gains on tasks which are considered to be largely syntactic, such as dependency and constituent labeling, and smaller gains on tasks which are considered to require more semantic reasoning, such as SPR and Winograd.

How much information is carried over long distances (several tokens or more) in the sentence?

To estimate information carried over long distances (several tokens or more), the authors extend the lexical baseline with a convolutional layer, which allows the probing classifier to use local context. As shown in Figure 2, adding a CNN of width 3 (±1 token) closes 72% (macro average over tasks) of the gap between the lexical baseline and full ELMo; this extends to 79% if we use a CNN of width 5 (±2 tokens).

This suggests that while ELMo does not encode these phenomena as efficiently, the improvements it does bring are largely due to long-range information.

The CNN models and the orthonormal encoder perform best with nearby spans, but fall off rapidly as token distance increases. (The model can access only embeddings within given spans, such as a predicate-argument pair, and must predict properties, such as semantic roles, which typically require whole-sentence context.)

The full ELMo model holds up better, with performance dropping only 7 F1 points between d = 0 tokens and d = 8, suggesting the pretrained encoder does encode useful long-distance dependencies.

Findings of the paper

First, in general, contextualized embeddings improve over their non-contextualized counterparts largely on syntactic tasks (e.g. constituent labeling) in comparison to semantic tasks (e.g. coreference), suggesting that these embeddings encode syntax more so than higher-level semantics.

Second, the performance of ELMo cannot be fully explained by a model with access to local context, suggesting that the contextualized representations do encode distant linguistic information, which can help disambiguate longer-range dependency relations and higher-level syntactic structures.

A Simple but Tough-to-Beat Baseline for Sentence Embeddings

Now that we know contextual models can be beaten, what are some easy tricks to beat them?

If DAN proves that averaging word embeddings is enough to get great results, what if we could find a smart weighting scheme? This paper shows us how to represent the sentence as a weighted average and then use PCA/SVD to further refine the embedding.

They write

“We modify this theoretical model, motivated by the empirical observation that most word embedding methods, since they seek to capture word co-occurrence probabilities using vector inner product, end up giving large vectors to frequent words, as well as giving unnecessarily large inner products to word pairs, simply to fit the empirical observation that words sometimes occur out of context in documents.

These anomalies cause the average of word vectors to have huge components along semantically meaningless directions. Our modification to the generative model of (Arora et al., 2016) allows “smoothing” terms, and then a max likelihood calculation leads to our SIF reweighting.”

Here the weight of a word w is a / (a + p(w)), where a is a parameter and p(w) is the (estimated) word frequency. They call this smooth inverse frequency (SIF).
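A quick numeric check of the formula helps build intuition. The word probabilities below are hypothetical numbers I picked for illustration, and a = 1e-3 is the value this post quotes as the usual setting in the recipe further down.

```python
def sif_weight(p_w, a=1e-3):
    """Smooth inverse frequency weight a / (a + p(w))."""
    return a / (a + p_w)

# Hypothetical unigram probabilities, for illustration only.
toy_probabilities = {"the": 0.05, "beach": 1e-4, "snorkelling": 1e-6}
weights = {word: sif_weight(p) for word, p in toy_probabilities.items()}
for word, weight in weights.items():
    print(word, round(weight, 4))
```

Frequent words get a weight close to 0 and rare words a weight close to 1, so the weighted average is dominated by the informative words.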

Using the weights, they compute a weighted average for each sentence and then remove the projections of the average vectors on their first singular vector (“common component removal”).

Interesting sentence in the paper: “Simple RNNs can be viewed as a special case where the parse tree is replaced by a simple linear chain.”

SIF weighting

This is the recipe for computing SIF embeddings:

  • Compute the frequencies of all the words of the corpus.
  • Then, given a hyper-parameter a (usually set to 1e-3) and a set of pre-trained word embeddings, compute the weighted average above for each of your texts/sentences.
  • Finally, use SVD to remove the 1st component from these averages and get fresh sentence embeddings. Removing the 1st component is like removing the most common information, as that component captures the most information about the average embedding.
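The three steps above can be sketched in plain Python. This is a toy version under my own assumptions, not the authors' code: word frequencies are estimated from the same tiny corpus, the word vectors are made-up 3-dimensional numbers, and the first singular vector is approximated with power iteration instead of a library SVD.

```python
import math
from collections import Counter

def sif_embeddings(sentences, word_vectors, a=1e-3, n_iter=100):
    """Return (sentence_embeddings, first_singular_vector)."""
    tokenized = [[t for t in s.lower().split() if t in word_vectors]
                 for s in sentences]
    # Step 1: estimate word frequencies p(w) from the corpus.
    counts = Counter(t for toks in tokenized for t in toks)
    total = sum(counts.values())
    dim = len(next(iter(word_vectors.values())))

    # Step 2: weighted average with weights a / (a + p(w)).
    averages = []
    for toks in tokenized:
        vec = [0.0] * dim
        for t in toks:
            weight = a / (a + counts[t] / total)
            for i, x in enumerate(word_vectors[t]):
                vec[i] += weight * x
        averages.append([x / max(len(toks), 1) for x in vec])

    # Step 3a: approximate the first right singular vector u of the matrix
    # of averages X by power iteration on X^T X.
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        proj = [sum(x * ui for x, ui in zip(row, u)) for row in averages]
        new_u = [sum(p * row[i] for p, row in zip(proj, averages))
                 for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in new_u))
        if norm == 0.0:
            break
        u = [x / norm for x in new_u]

    # Step 3b: remove each average's projection on u.
    result = []
    for row in averages:
        dot = sum(x * ui for x, ui in zip(row, u))
        result.append([x - dot * ui for x, ui in zip(row, u)])
    return result, u

# Made-up word vectors and sentences, for illustration only.
word_vectors = {
    "i": [0.1, 0.2, 0.1],
    "love": [0.7, 0.1, 0.3],
    "hate": [-0.6, 0.1, 0.2],
    "beaches": [0.2, 0.9, 0.4],
    "mountains": [0.3, 0.8, -0.2],
}
sentences = ["I love beaches", "I hate beaches", "I love mountains"]
embeddings, component = sif_embeddings(sentences, word_vectors)
print(len(embeddings), len(embeddings[0]))
```

After the last step every sentence embedding is orthogonal to the removed component, so whatever all sentences shared along that direction is gone.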

My understanding is that removing the 1st component is like removing the ‘mean’ from the compressed vector! What we are left with is what is unique about the sentence rather than the complete information 🤔

The results are fantastic and they beat sophisticated methods like DAN and LSTM. 🤯

Below are the same results I posted up earlier for SST tasks and SIF is nailing this stuff 😝

Their contribution

For GloVe vectors, using smooth inverse frequency weighting alone improves over unweighted average by about 5%, using common component removal alone improves by 10%, and using both improves it by 13%.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

What is the state of the art??? 😝

In this paper, the authors report that if we do semantic search with BERT, finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours). The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.

Finding the pair with the highest similarity in a collection of n = 10,000 sentences requires n·(n−1)/2 = 49,995,000 inference computations with BERT.
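The pair count is simply "n choose 2", which is easy to verify. The per-pair time below is my own back-of-the-envelope division of the ~65 hours quoted in this post by the number of pairs, not a figure from the paper.

```python
def n_pairs(n):
    """Number of unordered pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2

pairs = n_pairs(10_000)
print(pairs)  # 49995000

# Spreading ~65 hours over all pairs gives the implied cost of one comparison.
seconds_per_pair = 65 * 3600 / pairs
print(round(seconds_per_pair * 1000, 1), "ms per pair")
```

A few milliseconds per pair sounds harmless, but the quadratic number of pairs is what makes the total explode.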

Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.

This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.

Main Idea

  • Rather than running all A-B pairs through the model and getting a score, it is better to train a model that generates similar embeddings for similar sentences. With this approach, once a model is trained for an appropriate task, we only need to create the embedding for every sentence once.
  • Every time we get a query, we compute the cosine similarity of the query with all the precomputed sentence embeddings, which is linear in time and can happen very quickly with a library like FAISS.
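Here is a minimal brute-force version of that search in plain Python. The embeddings are made-up 2-dimensional vectors standing in for real sentence embeddings, and the function names are hypothetical; a vector search library would replace the linear scan for large collections.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def build_index(embeddings):
    """Normalise once, so that a dot product equals cosine similarity."""
    return [normalize(v) for v in embeddings]

def search(index, query_embedding, top_k=3):
    """Return the top_k (position, cosine similarity) pairs, best first."""
    q = normalize(query_embedding)
    scores = [(sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(index)]
    scores.sort(reverse=True)
    return [(i, s) for s, i in scores[:top_k]]

# Precompute the index once; the vectors are toy stand-ins for sentence embeddings.
index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = search(index, [2.0, 0.1], top_k=2)
print(hits)
```

The expensive encoder runs once per sentence at indexing time and once per query; everything else is cheap vector arithmetic.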

To make that nice encoder, they trained a dual encoder with tied weights: a siamese network!

The results set a new state of the art, with considerable gains on some datasets, though not on SICK-R.

Correlations between Word Vector Sets

This paper was released very recently in Oct 2019. The authors investigate the application of statistical correlation coefficients to sets of word vectors as a method for computing semantic textual similarity (STS). It is surprising to see USE showing higher statistical correlation than BERT models.

Also max and min-pooled vectors consistently outperform mean-pooled vectors when all three representations are compared with Pearson correlation.
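As I read it, the comparison works by pooling the word vectors of each sentence into one vector and then computing the Pearson correlation between the two pooled vectors across dimensions. Below is a small sketch of that reading; the function names and the toy 4-dimensional word vectors are my own assumptions, not the paper's code.

```python
import math

def pool(word_vectors, how):
    """Pool a list of word vectors dimension by dimension: 'max', 'min' or 'mean'."""
    fn = {"max": max, "min": min, "mean": lambda col: sum(col) / len(col)}[how]
    return [fn(col) for col in zip(*word_vectors)]

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / (sx * sy)

# Toy word vectors for two short sentences (hypothetical values).
sentence_a = [[0.1, 0.9, 0.3, 0.0], [0.4, 0.2, 0.8, 0.1]]
sentence_b = [[0.2, 0.8, 0.4, 0.1], [0.3, 0.1, 0.9, 0.0]]

for how in ("max", "min", "mean"):
    score = pearson(pool(sentence_a, how), pool(sentence_b, how))
    print(how, round(score, 3))
```

Unlike cosine similarity, Pearson correlation centres each vector first, so a constant offset shared by all dimensions does not inflate the score.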

Does this mean USE is better suited for semantic search? 🤔

Mean Pearson correlation on STS tasks

BERT, ELMo, USE and InferSent Sentence Encoders: The Panacea for Research-Paper Recommendation?

Till now we have been comparing conventional vs. deep. But what if we could leverage both! 👻

Using sentence embeddings on large corpora seems hardly feasible in a production recommender system, which needs to return recommendations within a few seconds or less.

The authors report that BM25 queries took around 5 milliseconds to retrieve up to 100 results. The extra time taken to calculate embeddings and rerank 20, 50 and 100 titles with the different models is shown below. USE (DAN) is the fastest, taking around 0.02 seconds to rerank 20 or 50 titles, and 0.03 seconds to rerank 100 titles.

As you can see USE(DAN) is blazing fast!

Reranking time in seconds

Finally, BERT and SciBERT using bert-as-server are the slowest in reranking 100 titles, taking around 4.0 seconds. This means that they could not be used for real-time reranking of recommendations unless more computing resources (e.g., GPU or TPU) were provided.

Best approach

  • Use Apache Lucene’s BM25 to retrieve a list of top-20, 50 or 100 recommendation candidates.
  • Get sentence embeddings of top-k and calculate cosine similarity score with the query embedding.
  • Perform a linear combination of the normalized BM25 scores and the semantic similarity scores from the sentence embeddings, summing them with uniform weights of 0.5, to generate the final ranked recommendations.
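The reranking step can be sketched as follows. The BM25 scores and embeddings are made-up inputs, the function names are hypothetical, and min-max scaling is my assumption for the normalization (the post only says the BM25 scores are normalized); the retrieval itself is assumed to have happened already.

```python
import math

def min_max(scores):
    """Scale scores to [0, 1]; assumed normalisation, the post does not name one."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(candidates, query_embedding, weight=0.5):
    """candidates: list of (doc_id, bm25_score, embedding) from the first stage."""
    bm25 = min_max([c[1] for c in candidates])
    sims = [cosine(query_embedding, c[2]) for c in candidates]
    combined = [weight * b + (1 - weight) * s for b, s in zip(bm25, sims)]
    ranked = sorted(zip(combined, (c[0] for c in candidates)), reverse=True)
    return [(doc_id, score) for score, doc_id in ranked]

# Toy top-3 candidates from a keyword search (hypothetical scores and vectors).
candidates = [
    ("a", 10.0, [1.0, 0.0]),
    ("b", 8.0, [0.0, 1.0]),
    ("c", 2.0, [0.7, 0.7]),
]
ranking = rerank(candidates, query_embedding=[0.0, 1.0])
print([doc_id for doc_id, _ in ranking])
```

Document "b" overtakes "a" here: its keyword score is slightly lower, but its embedding matches the query much better, which is exactly the effect the hybrid score is meant to capture.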


My main reason for writing this was to throw light on how to choose an existing model for our problem. We have a variety of models, methodologies and tasks. Selecting a model without questioning it can lead to over-engineering when a simple model like USE (DAN) could have served the purpose. Sometimes a CNN might do what ELMo can 😃

Want to know all about semantic search? Find various approaches for semantic search over here!

More readings

SentEval: An Evaluation Toolkit for Universal Sentence Representations

Pratik Bhavsar


NLP & Semantic search engineer | Now writing on | | @nlpguy_
