ColBERT: A complete guide

Varun Bhardwaj
7 min read · Aug 19, 2022


Me: BERT, can you please find me a Document Retrieval Model?
BERT: Yes sure, here is your State Of The Art (SOTA) ColBERT model.
Me: What’s so special about ColBERT?
BERT: Let’s use this blog to understand what makes ColBERT so interesting.

Since ColBERT is likely to stay around for quite some time, in this blog post we are going to understand it by answering these six questions:

  1. What is a Document Retriever?
  2. Why was ColBERT needed?
  3. What is the core idea behind it?
  4. What is its architecture and how does it work?
  5. How do we train the ColBERT model?
  6. How to build end-to-end models with ColBERT as a Retriever model?

1. What is a Document Retriever?

Recent progress in Natural Language Processing (NLP) is driving fast-paced advances in Information Retrieval (IR), largely owing to the fine-tuning of deep language models (LMs) for document ranking. A document retriever, in simple terms, is a machine learning model that ranks documents with some scoring heuristic and retrieves the best-ranked documents from the pool.

Many NLP applications, such as open-domain question-answering systems and web search engines, use document retrievers in their end-to-end pipelines.

2. Why was ColBERT needed?

ColBERT proved to be a major breakthrough that substantially improved the performance of document retrieval models. In prior BERT-based approaches, the gain in effectiveness came with an enormous increase in computational cost, making retrieval slow; we will see why that cost was so high later in this blog. Models that didn't build on BERT, such as TF-IDF-based rankers, were computationally cheap but performed unsatisfactorily.

ColBERT impressively deals with this trade-off by introducing a late interaction architecture that independently encodes the query and the document using BERT as the base model and then employs a cheap yet powerful interaction step that models their fine-grained similarity.

Uh-oh, didn’t follow? Let’s move ahead for now; things will become clear.

3. What is the core idea behind it?

Figure 2: Schematic diagrams illustrating query–document matching paradigms in neural Information Retrieval (IR). The figure contrasts existing approaches (sub-figures (a), (b), and (c)) with the proposed late interaction paradigm (sub-figure (d)).

ColBERT (Contextualized Late interaction over BERT) reconciles efficiency and contextualization, hence the name. In ColBERT, the query and the document text are first tokenized and then separately encoded into contextual embeddings using BERT (the base model can be changed, e.g., to RoBERTa or mBERT); in the original paper, a single BERT model is shared between the query and document encoders, with special [Q] and [D] marker tokens distinguishing the two input types. Contextual embeddings are simply the vectors that the BERT model outputs, one per token. The two sets of encodings (one set for query q and another for document d) then interact through a cheap similarity computation that produces a relevance score for each query-document pair. The document achieving the highest relevance score for a query gets the best (lowest-numbered) rank, and so on down the pool. In this way, we rank the pool of documents.
Figure 2 illustrates this and the other approaches to calculating relevance scores.

Figure 2(a): Representation-focused rankers independently compute an embedding for q and another for d, and estimate relevance as a single similarity score (say, cosine similarity) between the two vectors.

Figure 2(b): Interaction-focused rankers, instead of summarizing q and d into individual embeddings, model word-level and phrase-level relationships across q and d and match them using a deep neural network (such as a CNN).

Figure 2(c): This model belongs to a more powerful interaction-based paradigm, which models the interactions between words within as well as across the query and the document at the same time, as in BERT’s transformer architecture.

Figure 2(d): By isolating the encoding of the document from that of the query, it’s possible to pre-compute document encodings offline, significantly reducing the computational load per query.

It’s observed that the fine-grained matching of interaction-based models and the pre-computation of document representations from representation-based models can be combined by retaining, yet judiciously delaying, the query–document interaction. This delay reduces the computational overhead by a significant margin, making the retrieval process swift.

Talking numbers, ColBERT delivers over a 170× speedup relative to existing BERT-based retrieval models while maintaining comparable effectiveness.

4. Architecture

Figure 3: The general architecture of ColBERT given a query q and a document d.

The query q and the document d are first split into tokens, and a pre-trained embedding matrix maps each token to an initial vector. Different tokenization methods can be used: WordPiece is the default, but SentencePiece, byte-level, or n-gram tokenization are also options.

These token sequences are separately passed into the BERT-based encoder to generate contextualized representations. Let Eq and Ed be the contextualized encodings produced for the query and the document, respectively. There is a ‘model’ attribute for changing the base model; by default it is “bert-base-uncased”. You can check out more about the different BERT-based models here.
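To ground this, here is a minimal sketch of producing per-token contextual embeddings with Hugging Face transformers and the default “bert-base-uncased” backbone. This is only an approximation: ColBERT’s actual encoder also prepends the [Q]/[D] marker tokens and applies a linear projection to shrink the embedding dimension, both omitted here for brevity.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch: per-token contextual embeddings from the default backbone.
# ColBERT's real encoder also adds [Q]/[D] markers and a linear projection.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def encode_tokens(text: str) -> torch.Tensor:
    """Return an (n_tokens, 768) matrix of L2-normalized token embeddings."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_tokens, 768)
    emb = hidden.squeeze(0)
    return emb / emb.norm(dim=-1, keepdim=True)     # cosine-ready vectors
```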

Figure 4: Late-interaction mechanism of ColBERT. Magenta-coloured cells represent which document encoding has the highest similarity with the corresponding query encoding.

Using Eq and Ed, ColBERT computes the relevance score between q and d via late interaction, which is defined as a sum of maximum-similarity (MaxSim) operators. In particular, for each vector v ∈ Eq we find its maximum cosine similarity (any similarity metric can be used) with the vectors in Ed, and combine the outputs via summation. Here, the vectors are simply the contextualized encodings of the tokens fed to the BERT model.

Intuitively, for each query embedding, the model searches over all the document’s encodings and quantifies the strength of the match: it calculates the similarity between that query encoding and every document encoding, then keeps only the largest of these scores (the MaxSim). Given these per-term scores, it estimates document relevance by summing the matching evidence across all query terms.
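To make the computation concrete, here is a minimal PyTorch sketch of the MaxSim late interaction, reusing the encode_tokens helper from the sketch above. Because the embeddings are L2-normalized, a single matrix product yields all pairwise cosine similarities.

```python
def maxsim_score(E_q: torch.Tensor, E_d: torch.Tensor) -> float:
    """Late-interaction relevance of document d for query q.

    E_q: (n_q, dim) query token embeddings, L2-normalized
    E_d: (n_d, dim) document token embeddings, L2-normalized
    """
    sim = E_q @ E_d.T                          # (n_q, n_d) cosine similarities
    return sim.max(dim=1).values.sum().item()  # MaxSim per query token, summed

# Ranking a small pool of documents (best score first):
# E_q = encode_tokens("who wrote hamlet")
# ranked = sorted(docs, key=lambda d: maxsim_score(E_q, encode_tokens(d)),
#                 reverse=True)
```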

If the query has fewer than a pre-defined number of tokens Nq, we pad it with BERT’s special [MASK] tokens up to length Nq; otherwise, we truncate it to the first Nq tokens. (In the case of truncation, the tokenizer can also return the overflowing tokens along with the output.) The paper calls this padding scheme query augmentation.
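Here is a toy illustration of this padding/truncation step on an already-tokenized query; Nq = 32 below is just a common choice, not a requirement.

```python
def pad_or_truncate(query_tokens: list[str], n_q: int = 32) -> list[str]:
    # Query augmentation: short queries are padded with [MASK] tokens,
    # which BERT re-contextualizes into soft query-expansion terms;
    # long queries are cut to their first n_q tokens.
    if len(query_tokens) < n_q:
        return query_tokens + ["[MASK]"] * (n_q - len(query_tokens))
    return query_tokens[:n_q]
```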

5. Training ColBERT with weak supervision

Figure 5: Weak-supervision technique of training ColBERT

ColBERT is trained on triplets of the following form:
<query, positive_document, negative_document>

a) query: Query for which we want to retrieve a document.

b) positive_document: Document which is relevant to the query and can plausibly contain the answer to the query.

c) negative_document: Document which is not relevant to the query and can’t plausibly contain the answer to the query.

We initially use a naive retrieval model to rank the documents with its own heuristic; the usual choice is BM25, which ranks documents with a TF-IDF-style technique. This existing retrieval model is then used to collect the top-k passages for every training query, and a simple heuristic sorts these passages into positive (+ve) and negative (−ve) examples, which are used to train another, more effective retriever. Applying this process iteratively yields a robust, trained ColBERT model.

We obtain the triplets from the current retriever: its top-k ranked documents serve as the positive documents, while lower-ranked ones serve as negatives; k is a hyperparameter whose value can be tuned. We use these triplets to train ColBERT again in the same fashion. This process is repeated for a few rounds (roughly three), and we finally get a trained ColBERT model. A schematic sketch of the loop follows.
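The sketch below shows the shape of the loop; bm25_rank, train_colbert, and colbert_ranker are hypothetical stand-ins for the real training machinery, not ColBERT’s actual API.

```python
def relevance_guided_training(queries, corpus, rounds=3, k=10):
    # Round 0 starts from a naive TF-IDF-family ranker (e.g., BM25).
    rank_fn = bm25_rank                          # hypothetical helper
    model = None
    for _ in range(rounds):
        triplets = []
        for q in queries:
            ranked = rank_fn(q, corpus)          # docs sorted best-first
            positives, negatives = ranked[:k], ranked[k:]
            triplets += [(q, p, n) for p in positives for n in negatives]
        model = train_colbert(triplets)          # hypothetical trainer
        rank_fn = colbert_ranker(model)          # re-label with the new model
    return model
```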

6. How to build end-to-end models with ColBERT as a Retriever model?

I’ll try to give a glimpse of an open-domain question-answering model that uses ColBERT as the retriever and XLM-RoBERTa as the reader; a code sketch follows the steps below.

Step 1: Create a pool of documents.
Step 2: Pass the pool of documents, along with the query, to a pre-trained retriever model.
Step 3: The retriever ranks the pool of documents by similarity score.
Step 4: Parse the top-k documents into paragraphs.
Step 5: Pass each of these paragraphs, along with the query, to the reader model, in our case XLM-RoBERTa.
Step 6: Get the answer from the reader model.
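Putting it together, here is a hedged sketch of steps 2–6. It reuses encode_tokens and maxsim_score from the earlier sketches in place of a full pre-trained ColBERT, elides the paragraph splitting of step 4 for brevity, and assumes “deepset/xlm-roberta-large-squad2” as one publicly available XLM-RoBERTa reader checkpoint (swap in your own).

```python
from transformers import pipeline

# Reader: an extractive QA model (assumed checkpoint, replace as needed).
reader = pipeline("question-answering",
                  model="deepset/xlm-roberta-large-squad2")

def answer(query: str, documents: list[str], k: int = 3) -> str:
    # Steps 2-3: score and rank the pool with late interaction.
    E_q = encode_tokens(query)
    ranked = sorted(documents,
                    key=lambda d: maxsim_score(E_q, encode_tokens(d)),
                    reverse=True)
    # Steps 5-6: read the top-k documents and keep the highest-scoring span.
    candidates = [reader(question=query, context=d) for d in ranked[:k]]
    best = max(candidates, key=lambda c: c["score"])
    return best["answer"]
```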

To learn more about the reader model, check out:
RoBERTa model detailed overview

References:
https://arxiv.org/abs/2004.12765
https://github.com/stanfordnlp/ColBERT-QA
https://arxiv.org/abs/2007.00814
