When are Contextual Embeddings Worth Using?

Contextual embeddings from BERT are expensive, and might not bring value in all situations. Let’s figure out when that’s the case!

Viktor Karlsson
Aug 1, 2020 · 4 min read


Working with state-of-the-art models like BERT, or any of its descendants, is not for the resource-limited or budget-constrained researcher or practitioner. Pre-training BERT-base alone, a model that could almost be considered small by today's standards, took more than 4 days on 16 TPU chips, which would cost several thousand dollars. This does not even take into account further fine-tuning or eventual serving of the model, both of which only add to the total cost.

Instead of trying to figure out ways of creating smaller Transformer models, which I’ve explored in previous articles, it would be valuable to take a step back and ask: when are contextual embeddings from Transformer-based models actually worth using? In what cases would it be possible to reach similar performance with less computationally expensive, non-contextual embeddings like GloVe, or maybe even random embeddings? Are there characteristics of the datasets that could indicate when this would be the case?

These are some of the questions that Arora et al. answer in “Contextual Embeddings: When Are They Worth It?”. This article will provide an overview of their study and highlight their main findings.


The study is divided into two parts: the first examines the effect of training data volume, and the second examines the linguistic characteristics of the datasets.

Training data volume

The authors find that training data volume plays a key role in determining the relative performance of GloVe and random embeddings when compared to BERT. The non-contextual embeddings quickly improved with more training data and were often able to perform within 5–10% of BERT when all available data were used.

On the other hand, the authors found that in some cases the contextualized embeddings could be trained with up to 16 times less data while still matching the best performance achieved by the non-contextualized embeddings. This presents a tradeoff between the cost of inference (compute and memory) and that of labeling data, or as Arora et al. put it:

“ML practitioners may find that for certain real-world tasks the large gains in efficiency [when using non-contextual embeddings] are well worth the cost of labelling more data.” (Arora et al.)
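To make the tradeoff concrete, here is a minimal sketch of how one could read a data-reduction factor off two learning curves. The scores below are made-up placeholders, not results from the paper, and `smallest_matching_fraction` is a hypothetical helper name.

```python
def smallest_matching_fraction(curve, target):
    """Return the smallest training-data fraction whose score reaches target, or None."""
    for fraction in sorted(curve):
        if curve[fraction] >= target:
            return fraction
    return None


# Hypothetical learning curves: fraction of training data -> task score.
# These numbers are illustrative only and do not come from Arora et al.
non_contextual = {1 / 16: 0.60, 1 / 4: 0.72, 1.0: 0.80}
contextual = {1 / 16: 0.81, 1 / 4: 0.86, 1.0: 0.90}

# Best score the non-contextual embeddings reach with all available data.
best_non_contextual = max(non_contextual.values())

# Smallest fraction at which the contextual model matches that score.
fraction = smallest_matching_fraction(contextual, best_non_contextual)
reduction_factor = 1 / fraction

print(f"Contextual embeddings match with {reduction_factor:.0f}x less data")
```

With real learning curves, a large reduction factor suggests that labeling budget is the bottleneck, while a small one suggests that the cheaper embeddings may be good enough.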

Linguistic characteristics of the dataset

The study of training data volume made it clear that contextual embeddings could perform significantly better than non-contextualized ones for some tasks, while in other cases these differences were much smaller. These results motivated the authors to figure out if it would be possible to find and quantify linguistic properties that would indicate when this is the case.

To this end, they defined three metrics to quantify the characteristics of each dataset and the items within it. These metrics were, by design, not given a single definition but instead encode the intuition of which characteristics affect model performance. This allows them to be interpreted, and subsequently rigorously defined, for the task being studied. In the list below, I therefore share the authors’ proposed metrics with example definitions for a Named Entity Recognition dataset:

  1. Complexity of text structure. The number of tokens spanned by each entity indicates complexity. “George Washington” spans two tokens.
  2. Ambiguity in word usage. The number of different labels each token is assigned in the training dataset. “Washington” can be assigned person, location and organisation, which requires its context to be taken into consideration.
  3. Prevalence of unseen words. The inverse of the number of times a token appeared. If the previous sentence was our dataset, the word “of” would be assigned the value 1/2 = 0.5.
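The three example definitions above can be sketched in a few lines of Python. This is one possible reading of the metrics for a toy NER dataset, not the authors' implementation; the function names and the simplified label scheme (no B-/I- prefixes) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy NER data: each sentence is a list of (token, label) pairs.
# Simplified labels without B-/I- prefixes; "O" means "not an entity".
train = [
    [("George", "PER"), ("Washington", "PER"), ("visited", "O"), ("Washington", "LOC")],
    [("Washington", "ORG"), ("announced", "O"), ("new", "O"), ("rules", "O")],
]


def entity_span_lengths(sentence):
    """Complexity: tokens spanned by each entity (runs of equal non-O labels)."""
    lengths, current, previous = [], 0, "O"
    for _, label in sentence:
        if label != "O" and label == previous:
            current += 1  # the current entity continues
        else:
            if current:
                lengths.append(current)  # close the previous entity
            current = 1 if label != "O" else 0
        previous = label
    if current:
        lengths.append(current)
    return lengths


def label_ambiguity(sentences):
    """Ambiguity: number of distinct labels each token receives in the data."""
    labels = defaultdict(set)
    for sentence in sentences:
        for token, label in sentence:
            labels[token].add(label)
    return {token: len(seen) for token, seen in labels.items()}


def inverse_frequency(tokens):
    """Unseen-word prevalence: 1 / (number of times the token appeared)."""
    counts = Counter(tokens)
    return {token: 1 / count for token, count in counts.items()}


spans = entity_span_lengths(train[0])
ambiguity = label_ambiguity(train)
rarity = inverse_frequency("The inverse of the number of times a token appeared".split())

print(spans)                    # "George Washington" spans two tokens
print(ambiguity["Washington"])  # PER, LOC and ORG
print(rarity["of"])             # "of" appears twice
```

A per-item score could then be an aggregate (for example the mean or maximum) of these token-level values; which aggregate to use is a design choice left open here.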

These metrics were used to score each item in the datasets, which allowed the authors to split each dataset into a “difficult” and an “easy” partition. This enabled them to compare embedding performance on these two partitions from the same dataset.

Animation of the metric calculation and how it was used to evaluate the performance of the two kinds of embedding models.

If these metrics were non-informative, the difference in performance would be the same on both partitions. Fortunately, that is not what the authors found. Rather, they observed that in 30 out of 42 cases the difference between contextual and non-contextual embeddings was larger on the difficult partition than on the easy one.
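The partition-and-compare procedure can be sketched as follows. This assumes a split at the median score and uses accuracy as the performance measure; both are assumptions made for illustration, and the scores and correctness flags are hypothetical placeholders rather than data from the paper.

```python
from statistics import median


def split_by_difficulty(scores):
    """Return (easy, difficult) index lists, split at the median score (an assumption)."""
    threshold = median(scores)
    easy = [i for i, score in enumerate(scores) if score <= threshold]
    difficult = [i for i, score in enumerate(scores) if score > threshold]
    return easy, difficult


def accuracy(correct, indices):
    """Fraction of the selected items that a model got right."""
    return sum(correct[i] for i in indices) / len(indices)


def performance_gap(correct_contextual, correct_non_contextual, indices):
    """Contextual minus non-contextual accuracy on the selected items."""
    return accuracy(correct_contextual, indices) - accuracy(correct_non_contextual, indices)


# Hypothetical per-item difficulty scores and correctness flags (1 = correct).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
correct_contextual = [1, 1, 1, 1, 1, 1, 1, 0]
correct_non_contextual = [1, 1, 1, 0, 1, 0, 0, 0]

easy, difficult = split_by_difficulty(scores)
gap_easy = performance_gap(correct_contextual, correct_non_contextual, easy)
gap_difficult = performance_gap(correct_contextual, correct_non_contextual, difficult)

# An informative metric should give a larger gap on the difficult partition.
print(f"gap on easy: {gap_easy:.2f}, gap on difficult: {gap_difficult:.2f}")
```

Note that a partition can be empty if all scores are equal, in which case `accuracy` would divide by zero; a real implementation would need to handle that case.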

This means that these metrics can be used as a proxy for when contextual embeddings from models like BERT will outperform non-contextual ones! It might, however, be more useful the other way around: to indicate when non-contextual embeddings from GloVe are sufficient to reach state-of-the-art performance.


In “Contextual Embeddings: When Are They Worth It?”, Arora et al. highlight key characteristics of the dataset which indicate when contextual embeddings are worth using. First, training dataset volume determines the potential usefulness of non-contextualized embeddings: the more data available, the closer they come to contextual ones. Secondly, the characteristics of the dataset also play an important role. The authors defined three metrics (complexity of text structure, ambiguity in word usage, and prevalence of unseen words) which help us understand the potential benefits that contextual embeddings might bring.

If you found this summary helpful in understanding the broader picture of this particular research paper, please consider reading my other articles! I’ve already written a bunch and more will definitely be added. 👋🏼🤖


