State of the Art: Pretraining for Question Answering & Info Retrieval

Gan Yun Tian
Dec 5, 2022 · 12 min read


Photo by Luca Bravo on Unsplash

Retrieval is one of the most important enablers for business applications today. It can be used in a wide variety of areas such as Search, Chatbots, Recommendation, Duplicate detection, Classification, Anomaly detection, and even Cross-Language/Data-Modality retrieval.

  • Note that the code implementations in this article are still a work in progress.
  • Nonetheless, 70% of the code is complete (pending tests) and 100% of the written parts is done!

This series, a continuation of my previous article, surveys some of the current state of the art in the Machine Learning field of Question Answering and Information Retrieval, otherwise known as Dense Retrieval (DR). Code implementations are included as well.

In this article specifically, I focus on the pretraining stage of DR. Before we proceed, if you are unsure what a Dense Retriever is, or need Python code to use alongside the techniques in this article, you can refer to my introductory article, a precursor to DR, or this article about the earliest DR technique.

A unifying & general framework for training Neural Dense Retrievers

DR Techniques surveyed in this series of articles are grouped into 4 stages. Doing so allows one to plug and play different techniques across the stages.

Pretraining stage & tasks for DR

Today, BERT-based language models are commonly used for DR but, unfortunately, they perform badly out of the box. Research attributes this to the external similarity measures used at retrieval time, which do not pair well with Masked Language Modelling (MLM): the MLM pretraining used in PLMs produces attention weights that are too far off and insufficient for single-stage similarity computation. This results in a need for 1) pretraining tasks catered specifically to DR.

A customized batching stage for DR

Random batches (which have a higher chance of containing uninformative samples) train DR models that perform worse than those trained on 2) specifically selected samples within a similar time frame. This observation builds upon the existing use of negatives in the form of triples (query, positive, negative) for DR training.

Different forms of DR training

The application of existing ideas from other fields similarly gave rise to 3) different flavors of DR training processes. Such advances allowed users to trade off criteria such as efficiency, space, and performance.

DR validation

Finally, to 4) validate DR models’ performance, researchers test them over the entire corpus or subsets of the test data. Such practices have an inherent limitation as there is often a gap between the training and real-world settings (in real-world scenarios, the number of passages to be considered per query is many times larger). One could therefore construct a specialized set for validation in order to better correlate testing and real-world results.

Essentially, DR techniques can be grouped into 4 stages. Next, I will cover the problems underlying the pretraining stage of DR and derive some lessons useful for DR pretraining.

Pretraining stage: the problems with using Pretrained Language Models for DR

Existing pretraining techniques such as MLM are unsuitable for DR

Generally, pretraining techniques such as Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) are used to build Pretrained Language Models (PLMs). MLM masks a word and then predicts it, while NSP classifies whether two concatenated sentence inputs (separated by a [SEP] token) originally appeared together (somewhat similar to DR). Even though these techniques have allowed language models (LMs) to understand the context of a passage, they still fail to prepare LMs well enough for DR.

To predict masked tokens, a model can use information from all other tokens. But visualized attention maps show that MLM-trained models rely on only a few tokens. This lazy behavior can carry over to the [CLS] token used for DR, which is bad because point C) below shows that semantics and token aggregation are important for DR.

Thus, pretraining tasks must mimic inference

It should be noted, though, that Cross-Encoders, which encode two passages together (as opposed to Bi-Encoders, more commonly used for first-stage DR), actually perform well for DR. This is because Cross-Encoders have it easy: they rely on the NSP task, which is very similar to DR. This means that A) pretraining tasks that mimic DR are important for performance. Where the Cross-Encoder falls short, however, is efficiency. To compute a similarity matrix between every pair of query and candidate passage, a Cross-Encoder incurs a heavy quadratic computational and memory cost over the number of pairs, on top of the attention cost over the longer, concatenated inputs that come from encoding two texts simultaneously.

Cross-Encoders fall back on the NSP task for DR, but cross-encoding is very inefficient.
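To make the cost argument concrete, here is a small, purely illustrative sketch (the corpus sizes are made up, and `encode`/`cross_encode` are hypothetical stand-ins for real encoders) contrasting the number of forward passes each architecture needs:

```python
# Back-of-the-envelope comparison: a Cross-Encoder needs one forward pass per
# (query, passage) pair, while a Bi-Encoder encodes each text once and scores
# all pairs with a cheap matrix product.
n_queries, n_passages = 1_000, 1_000_000

# Bi-Encoder: encode everything once, then a similarity matmul.
#   q_emb = encode(queries)        # (n_queries, dim), computed online
#   p_emb = encode(passages)       # (n_passages, dim), usually pre-computed offline
#   scores = q_emb @ p_emb.T       # (n_queries, n_passages)
bi_encoder_forwards = n_queries + n_passages          # ~1e6 forward passes

# Cross-Encoder: every pair must be concatenated and re-encoded from scratch.
#   scores[i, j] = cross_encode(query_i + " [SEP] " + passage_j)
cross_encoder_forwards = n_queries * n_passages       # ~1e9 forward passes

print(bi_encoder_forwards, cross_encoder_forwards)
```

This is why Cross-Encoders are usually reserved for re-ranking a small candidate set rather than first-stage retrieval.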

Untrained DR models will rely on MLM-trained weights/behavior

Thus, B) Bi-Encoders used for DR rely only on behaviors learned during the word-level MLM pretraining task (which is bad), not NSP.

MLM-trained models focus too much on irrelevant aspects

In fact, research on the attention layers showed that MLM trains each of them to learn a different aspect of language, such as part of speech, dependency relations, subject-verb-object structure, etc. These aspects unfortunately proved to be irrelevant for DR, and later C) research showed empirically that semantics alone reigns supreme.

Finetuning can help overwrite undesirable pretrained behaviors

Luckily, as with most Neural Network (NN) tasks, D) fine-tuning a vanilla model (pretrained on MLM without other techniques) can help to overwrite the MLM behavior to some extent. We can infer this because neural networks are universal approximators and because the fine-tuning task often better mimics actual DR training and real-world inference. Results may vary due to underlying randomness, though, and performance will tend not to reach the state of the art.

A good pretraining procedure is important

Now, it makes sense that finetuning helps, since the attention layers consist of adjustable weights that change to suit the target task when finetuned. Specifically, in models finetuned for DR, the [CLS] token aggregates and places more focus on text semantics. This implies that E) a large, high-quality dataset and ample pretraining/finetuning time are key for the model to eventually learn such attention weights.

Figure from paper. The Y axis is a simplified attention behavior (higher value = higher aggregation, semantics) while X is the layer index. (a) is an MLM pretrained Bert (performed poorly for DR). (c) is a pretraining technique for DR (performs better). This shows that attention layers need to be pretrained specifically for DR.

Train on in-domain data to avoid a fundamental weakness of NNs

Similarly, in another study, a model pretrained on a synthetic in-domain PAQ dataset (65M query-positive pairs with mined negatives, similar to the training/real-world inference task) attained good performance on the NaturalQuestions dataset without fine-tuning on it. This observation reiterates point E) and also shows F) that in-domain pretraining data is crucial. Unfortunately, F) is seldom feasible in real-world scenarios: besides the training data, there is often a lack of in-domain, large, and high-quality datasets. This is where techniques from other stages, i.e. the Batching or Training stage, can help. More on that in later articles.

Pretraining stage: other hard problems persist!

Now, while most of the problems facing DR pretraining have been identified and solved, transfer learning with pretrained DR models has yet to bring the huge performance gains commonly seen in most other tasks. This problem is exacerbated when there is a domain mismatch between the pretraining and inference/real-world data. As a result, DR models are tedious to retrain and keep up to date as a dataset constantly changes. Additionally, in the absence of large, new, in-domain datasets and large training batch sizes, specific data samples such as Hard Negatives (more about them in the batching stage) are needed. This complicates matters for real-world usage, though there is increasing research in these areas.

The core problems facing DR, though it should be noted that Continual Learning is also a problem for most/all Neural tasks.

Next up, let's review some of the code used for DR pretraining techniques.

Implementing state of the art pretraining techniques

In the following section, I will implement state-of-the-art DR pretraining techniques, starting from the earliest technique in 2019 and ending with the most recent in 2022. Each of these techniques solves some of the problems we identified earlier.

Pretraining stage: 1) Inverse Cloze Task

The Inverse Cloze Task (ICT) is one of the earliest DR pretraining techniques. It closely mimics the inference task and even mitigates the MLM attention problem to some extent. Essentially, ICT works as follows: from a passage with N sentences, ICT randomly samples a sentence i ~ [1, N] as the query. The query is paired with another sentence k != i from the same passage, which serves as the positive sample. The pair's similarity is then maximized against the in-batch negatives with a contrastive cross-entropy loss: L = -(1/B) * Sum_i log[ exp(sim(q_i, p_i)) / Sum_j exp(sim(q_i, p_j)) ].

ICT: The authors use Wikipedia as an example, with two disjoint passages: q1 is the query and p1 is the positive passage; together they form a training pair.

Note that ICT, as a self-supervised technique, is a good choice if you have many passages beyond your training data. ICT only defines the data source and the loss function, which means other techniques covered in this article can be combined with ICT in a plug-and-play manner. If you are interested, I implemented ICT in my previous article.
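For quick reference, here is a minimal, self-contained sketch of the in-batch ICT loss described above. The encoder itself is omitted: `query_vecs`/`passage_vecs` stand in for its [CLS] outputs, and the temperature value is an assumption.

```python
# A minimal sketch of the ICT objective with in-batch negatives.
import torch
import torch.nn.functional as F

def ict_loss(query_vecs: torch.Tensor, passage_vecs: torch.Tensor, temperature: float = 1.0):
    """query_vecs, passage_vecs: (batch, dim); row i of each forms a positive pair."""
    scores = query_vecs @ passage_vecs.T / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)   # diagonal entries are the positives
    return F.cross_entropy(scores, labels)                        # other rows act as negatives

# Example with random tensors standing in for encoder outputs:
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(ict_loss(q, p))
```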

Finally, beyond ICT, a recent paper proposed using an ICT pretraining corpus from the same domain as your training data to avoid distribution shift, which is one of DR models' weaknesses.

2) Condenser & coCondenser

Condenser is a technique introduced in 2020, after the authors realized that PLMs' attention weights are unsuitable for DR. Intrinsically, Condenser still utilizes the MLM pretraining task, but it makes changes that help adapt the MLM behavior to DR.

In greater detail, Condenser makes minor architectural changes for DR pretraining. Specifically, the attention layers of a BERT-based model are split into two groups: Early (Eqn 6 below) and Late (Eqn 7). A few Multi-Head Attention blocks are then added to the end of the model as a head, whose input consists of the [CLS] output token from the Late group and the word tokens from the Early group (Eqn 8).

Condenser_head is the extension made to a Bert-based model.

The MLM loss is then applied to the head's output, which led to improvements over other DR models on two benchmarks back in 2020.

where W h_i^cd is the Condenser head's final representation of the masked token (h_i^cd) projected by the MLM matrix W, and x_i is the token id of the true word

In the problems section, I said that MLM does not prepare models well for DR. The difference in Condenser is that h^early (the representations of all tokens, of size passage_length × dim) is used together with h_cls^late. This implies that the earlier attention layers were the culprit in MLM-pretrained models (since adding h_cls^late helped improve performance). Possibly, their individual word tokens (h^early) did not aggregate information from most other tokens, but instead focused on only a few, perhaps because they rely on learned parts of speech or grammatical/syntactic patterns that are irrelevant for DR. This reinforces our original point about MLM training PLMs to learn behavior that is irrelevant for DR. That said, we can now implement Condenser as such:

code removed
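The original snippet is not included above; as a stand-in, here is a minimal sketch of a Condenser-style model on top of a Hugging Face BERT backbone. The 6/6 layer split, the 2-layer head, and the untied MLM projection are my own simplifying assumptions (the paper reuses the pretrained MLM head instead).

```python
import torch
import torch.nn as nn
from transformers import BertModel
from transformers.models.bert.modeling_bert import BertLayer

class Condenser(nn.Module):
    def __init__(self, model_name="bert-base-uncased", split_layer=6, n_head_layers=2):
        super().__init__()
        self.backbone = BertModel.from_pretrained(model_name)
        cfg = self.backbone.config
        self.split_layer = split_layer
        # Condenser head: a few extra attention blocks used only during pretraining.
        self.head = nn.ModuleList([BertLayer(cfg) for _ in range(n_head_layers)])
        self.mlm_proj = nn.Linear(cfg.hidden_size, cfg.vocab_size)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        h_early = out.hidden_states[self.split_layer]   # early-group token states (Eqn 6)
        h_late = out.last_hidden_state                  # late-group states (Eqn 7)
        # Head input: late [CLS] + early word tokens (Eqn 8).
        head_states = torch.cat([h_late[:, :1], h_early[:, 1:]], dim=1)
        # Additive attention mask for the extra BertLayers.
        ext_mask = (1.0 - attention_mask[:, None, None, :].to(head_states.dtype)) * -1e9
        for layer in self.head:
            head_states = layer(head_states, attention_mask=ext_mask)[0]
        return self.mlm_proj(head_states)               # logits for the MLM loss on masked tokens
```

The MLM cross-entropy is then computed on these logits for the masked positions only; as noted below, the head is dropped after pretraining.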

It should be noted that the Condenser head is merely there to drive the attention layers toward behavior that is beneficial for DR. Hence, the Condenser head is not required during inference.

Thereafter, in 2021, with the success of Condenser, the original authors sought to mimic the DR inference task more closely and to derive a smoother loss and gradient by increasing the batch size. In a bid to further improve performance, a successor to the Condenser technique was introduced.

Named coCondenser, it adds to the MLM pretraining loss a contrastive cross-entropy loss similar to ICT, whereby the [CLS] token from the head is used to compute a similarity measure. I implemented this loss in my previous article.

The total loss is the MLM loss on both the query and positive sample, plus the contrastive loss L^co computed from the similarity matrix matmul(q_CLS, p_CLS^T).

An additional improvement is the use of Gradient Caching, a technique from metric/contrastive learning used to overcome memory constraints. It enables the use of large batch sizes, and the key lies in separating the partial derivatives into terms that can be computed chunk by chunk and then combined into the final gradient. The technique can be succinctly explained via:

Eqn 8 is coCondenser's combined loss; Eqn 9 is the gradient of the contrastive loss w.r.t. h_i,j (the [CLS] representation). Thus, v_i,j is a vector of the same dimension as the [CLS] token.

Here, Eqn 8 denotes the expectation of the contrastive and MLM losses. Meanwhile, Eqn 9 is the part where a forward pass with no gradient is performed: the gradient of the contrastive loss w.r.t. each [CLS] representation is computed and stored as a variable v_i,j of shape (1 × dim, i.e. 768). Thereafter,

Eqn 10 denotes the full partial derivative of the model's loss. Since Eqn 10 is costly to compute by itself, and we already have the first part of the equation (the boxed term from Eqn 9), we can use it to form the final gradient in Eqn 11 (my implementation approximates this step because of shape mismatches that are unfortunately hard for me to resolve manually at the moment).

Then, in Eqn 12, we add the MLM gradient for the current sample onto the previous vector to get back a gradient of the full matrix shape. We can implement gradient caching as such:
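Since the original snippet is not shown here, below is a minimal sketch of what such a two-pass forward_backward helper might look like, loosely following Eqns 9-12. `encode_cls`, `contrastive_loss`, and `mlm_loss` are placeholder names for functions you would supply yourself, not APIs from any library.

```python
import torch

def forward_backward(model, chunks, contrastive_loss, mlm_loss):
    """Gradient caching over a large batch that has been split into small `chunks`."""
    # Pass 1 (Eqn 9): encode every chunk WITHOUT building a graph, cache the [CLS] reps.
    with torch.no_grad():
        cached = [model.encode_cls(c) for c in chunks]            # each: (chunk_size, dim)
    reps = torch.cat(cached).requires_grad_(True)                 # (batch_size, dim)

    # Gradient of the contrastive loss w.r.t. each cached rep -> the vectors v_i,j.
    contrastive_loss(reps).backward()
    v_chunks = reps.grad.split([c.size(0) for c in cached])

    # Pass 2 (Eqns 10-12): re-encode each chunk WITH a graph, push the cached gradient
    # through it via a dot product, and add the chunk's own MLM loss. Parameter
    # gradients accumulate chunk by chunk in model.parameters().grad.
    total = 0.0
    for chunk, v in zip(chunks, v_chunks):
        fresh = model.encode_cls(chunk)
        surrogate = (fresh * v).sum() + mlm_loss(model, chunk)    # dot product + MLM term
        surrogate.backward()
        total += surrogate.item()                                 # logged value, not the exact loss
    return total
```

A real implementation would share one forward pass between the [CLS] and MLM terms; the sketch keeps them separate for clarity.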

Notice that in the forward_backward function (for Eqn 11), the gradient w.r.t. the parameters is obtained by backpropagating the dot product of the [CLS] representation and its cached gradient; in my implementation this is only an approximation of the true gradient. We can then use gradient caching in our training loop as such:

code removed
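Since the original loop is not shown either, here is a hedged sketch of how the forward_backward helper above could sit inside a training loop. It assumes the `model`, `contrastive_loss`, and `mlm_loss` objects from the sketch above, a `dataloader` that yields large batches of token ids, and arbitrary chunk size and learning rate.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step, big_batch in enumerate(dataloader):
    optimizer.zero_grad()
    # Split the large batch into GPU-sized chunks; gradient caching lets us train
    # as if the whole batch were processed at once, without the memory cost.
    chunks = torch.split(big_batch, 32)
    loss = forward_backward(model, chunks, contrastive_loss, mlm_loss)
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: surrogate loss {loss:.4f}")
```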

Next up, the last technique that I will cover is the use of an auto-encoding architecture for pretraining.

3) Masked Auto-Encoding / Decoding (CoT-MAE)

Introduced most recently, in 2022, this technique trains the attention layers to focus solely on aggregating and extracting semantic information from a passage.

Specifically, the Contextual Masked Auto-Encoder (CoT-MAE) combines ICT, MLM, and a reconstruction loss via a weakly parameterized decoder. This combination of techniques helped CoT-MAE achieve top performance on 3 public datasets (as of Nov 2022). The combination can be described with the following image:

Extracted from the CoT-MAE paper. On the left (a), CoT-MAE samples text spans in a fashion similar to ICT. At (b), MLM is applied to the encoder, and the context span is reconstructed by the decoder.

In (a), training pairs of a query and a positive, denoted T_A and T_B (blue and green boxes), are sampled in a manner similar to ICT. A minor difference is that there can be small gaps between the neighboring passages.

In (b), T_A and T_B are masked with probabilities of 45% and 15%, respectively, before they are fed into the encoder/decoder. The encoder, which takes in T_A, seeks to minimize the MLM loss. The encoded T_A is also passed to the decoder as context, and the decoder uses it together with the masked T_B to compute a loss based on how well it reconstructs T_B. That said, we can implement CoT-MAE as such:

code removed
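The original snippet is omitted here as well; below is a minimal sketch of the CoT-MAE idea that reuses the BertLayer-head pattern from the Condenser sketch earlier. The 2-layer decoder, the shared MLM projection, and the argument names are my own assumptions; the `*_labels` tensors follow the Hugging Face MLM convention of -100 for unmasked positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel
from transformers.models.bert.modeling_bert import BertLayer

class CoTMAE(nn.Module):
    def __init__(self, model_name="bert-base-uncased", n_dec_layers=2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        cfg = self.encoder.config
        # Weakly parameterized decoder: just a couple of transformer blocks.
        self.decoder = nn.ModuleList([BertLayer(cfg) for _ in range(n_dec_layers)])
        self.mlm_proj = nn.Linear(cfg.hidden_size, cfg.vocab_size)

    def mlm_loss(self, logits, labels):
        return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                               ignore_index=-100)

    def forward(self, a_ids, a_mask, a_labels, b_ids, b_mask, b_labels):
        # Encoder side: plain MLM on the masked span T_A.
        enc = self.encoder(a_ids, attention_mask=a_mask)
        loss_enc = self.mlm_loss(self.mlm_proj(enc.last_hidden_state), a_labels)

        # Decoder side: reconstruct masked T_B, conditioned on T_A's [CLS] as context.
        cls_a = enc.last_hidden_state[:, :1]                      # (batch, 1, dim)
        b_emb = self.encoder.embeddings(input_ids=b_ids)          # reuse encoder embeddings
        dec_states = torch.cat([cls_a, b_emb[:, 1:]], dim=1)
        ext_mask = (1.0 - b_mask[:, None, None, :].to(dec_states.dtype)) * -1e9
        for layer in self.decoder:
            dec_states = layer(dec_states, attention_mask=ext_mask)[0]
        loss_dec = self.mlm_loss(self.mlm_proj(dec_states), b_labels)
        return loss_enc + loss_dec
```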

After training, we only use the encoder. Also, notice that I am simply reusing my Condenser code from 2). This mirrors a strategy used in many machine learning papers, where existing ideas are combined and, with a bit of a twist here and there, a new technique is born.

And with that, I have implemented a total of 3 state-of-the-art pretraining techniques for DR (from 2019, 2020/2021, and 2022).

Conclusion and what's to come next

In the first part of this article, I explained the problems facing Dense Retrieval, in particular the unsuitability of vanilla BERT-based models for DR. These findings directed us to focus on pretraining language models specifically for DR. Finally, I ended the article with code implementations of 3 key research developments that moved the DR field forward.

As always, feel free to ask any questions in the comments or via private notes. If my article has helped you in any way, please leave a clap. You can also follow me for more such articles in the future. In my next article, I will talk about the next stage(s) of DR training: selecting informative samples for training, and more if time and length permit.
