ColBERT — A Late Interaction Model For Semantic Search

Zachariah Zhang
7 min read · Nov 17, 2022


TLDR — Single-vector embedding search models are efficient, but they create an information bottleneck which limits performance. In this review we introduce a class of models, such as ColBERT, which represent text using multiple embeddings. The literature shows that these models transfer much better to new domains and achieve better overall performance. Search becomes more complicated with this approach, but it still enjoys the benefit of encoding document representations at index time.

Introduction

Embedding-based search has seen an explosion of growth over the last few years. This has led much of the academic research to focus on building better and better embedding models using more data, larger architectures, or more sophisticated training methodologies.

However, there are some key geometric limitations to these approaches when representing documents that contain many different kinds of information. In this review I will dive deeper into what these limitations are and discuss some alternative representations. Specifically, we will look at the class of late interaction models, which represent text similarity as a pairwise similarity between contextualized token embeddings.

Issues with Embedding Based Search

Text is complicated and colorful, and oftentimes the text we represent contains a variety of information. This is especially true when searching over a corpus of long documents.

Let’s consider the example of indexing the Wikipedia page for the movie Forrest Gump. Given an embedding for this page, we need it to be “close” to the embeddings of the queries it would likely be relevant to. The movie covers a wide range of events, including Forrest’s time playing football, his time in Vietnam, and his time on the shrimping boat. The page also covers information about the movie itself, such as dates, cast, the director, etc.

It is difficult to make the document embedding for Forrest close to all of the queries because there is a wide range of potential queries.

This leads to a situation in which the document representation cannot be close to all of the queries it needs to serve. These queries carry different semantic information, so there is pressure during training not to collapse them all to a single point.

We can also look at this effect mathematically. Let’s consider the cosine similarity between a query q and a document d. We can expand d as an average of its constituent token embeddings d_i (if the model uses the [CLS] token for pooling, it is instead a weighted average of the last-layer representations, but the simple average reads more cleanly). By distributing the inner product over this sum, we see that the cosine similarity between query and document is equivalent to an average of the similarities to the individual document tokens.
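Concretely, here is a minimal sketch of the algebra, assuming simple mean pooling over unit-normalized token embeddings d_i (a simplification of most real pooling schemes):

cos(q, d) = (q · d) / (||q|| ||d||),   where  d = (1/n) Σ_i d_i

By linearity of the inner product,

q · d = q · ( (1/n) Σ_i d_i ) = (1/n) Σ_i (q · d_i)

so, up to the normalization by ||d||, the score is the average of the query’s similarity to every document token. Irrelevant tokens are not ignored; they dilute the score.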

More intuitively, this illustrates that when we match a query to a document, the query needs to match the whole document. In the above example, if I edited the Wikipedia page by adding another paragraph on particle physics, all of the queries would become less relevant.

Some Alternative Representations

Cross Encoders

One known solution to this problem is using cross-encoder based re-rankers. These models condition their similarity evaluation on the query and document together, which allows the model to selectively ignore information it doesn’t need. The downside is that they can be computationally infeasible at scale, as we need to pass every candidate document through BERT at inference time.

https://www.sbert.net/examples/applications/cross-encoder/README.html
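As a minimal sketch of cross-encoder re-ranking with the sentence-transformers library (the checkpoint name below is one of the publicly available MS MARCO cross-encoders and is only an example):

from sentence_transformers import CrossEncoder

# The query and each candidate document are scored jointly in a single forward pass.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "who was the lead singer of led zeppelin?"
candidates = [
    "Led Zeppelin were an English rock band formed in London in 1968.",
    "Forrest Gump is a 1994 American comedy-drama film.",
]

# Because the model sees query and document together, it can attend only to
# the evidence that is relevant to the query.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

The cost is that every candidate requires a full forward pass at query time, which is why cross-encoders are typically used only to re-rank a small candidate set.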

Sparse Models

Another approach is to represent text using sparse vectors. Many readers may be familiar with more traditional IR approaches like BM25, which focus on TF-IDF style term matching between query and document. Notice that this approach doesn’t suffer from the same issues described in the previous section, as only the evidence that pertains to the query is considered during scoring. While it has its own issues, BM25 can still provide strong results, especially on problems with little training data.
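As a quick illustration, here is a minimal BM25 sketch using the rank_bm25 package with naive whitespace tokenization (real systems add stemming, stopword removal, and so on):

from rank_bm25 import BM25Okapi

docs = [
    "Led Zeppelin were an English rock band formed in London in 1968.",
    "Forrest Gump is a 1994 American comedy-drama film.",
]
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

# Only the query terms that actually appear in a document contribute to its score.
query_tokens = "who was the lead singer of led zeppelin".lower().split()
print(bm25.get_scores(query_tokens))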

An exciting line of research has been on learned sparse models such as SPLADE. These models combine the power of contextual language models with the representation benefits of sparse models, and have been shown to generalize much better to new domains than dense embedding models. I write about this in more depth in Sparse Representations of Text for Search.

SPLADEv2 example. Note that the representation is able to both capture the contextual importance of terms and expand the vocabulary.
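To make this concrete, here is a minimal sketch of SPLADE-style encoding with Hugging Face transformers, assuming a public checkpoint such as naver/splade-cocondenser-ensembledistil is available (the pooling below follows the SPLADE recipe of a log-saturated ReLU followed by a max over the sequence):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "who was the lead singer of led zeppelin?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)

# One non-negative weight per vocabulary term; most entries are zero.
weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)
top = weights.topk(10)
terms = tokenizer.convert_ids_to_tokens(top.indices.tolist())
print(list(zip(terms, [round(v, 2) for v in top.values.tolist()])))

The highest-weighted terms typically include both terms from the input and related vocabulary the model has expanded to.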

In Combining Embedding and Keyword Based Search for Improved Performance we also see how these sparse models complement existing embedding models.

Late Interaction Model

Late interaction models are a class of models for semantic similarity in which similarity is modeled through token-level interactions between the query and the document, rather than through a single vector embedding.

By representing text as a collection of token embeddings rather than a single vector, these models combine much of the expressiveness of cross encoders with the computational benefit of being able to store document representations offline. The Poly-encoder was one of the seminal works in this area, but in this post I will focus on ColBERT, as it has been shown to be one of the most effective approaches.

ColBERT — Late Interaction Models

ColBERT first encodes the query and document using BERT to get contextualized embeddings for each token. The model scores query-document similarity by taking, for each query token, its maximum cosine similarity across the document tokens, and summing these maxima over the query (shown below). Applying this score to a training triplet <q, d+, d->, we can learn the model in the same way as any other semantic search model.

ColBERT scoring function. For each query token we take its max similarity against the document and sum across query terms.
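Here is a minimal sketch of this MaxSim scoring in PyTorch, assuming the query and document have already been encoded into L2-normalized token embedding matrices (the BERT encoder itself is omitted):

import torch

def colbert_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # q_emb: (num_query_tokens, dim), d_emb: (num_doc_tokens, dim), both L2-normalized,
    # so the dot products below are cosine similarities.
    sim = q_emb @ d_emb.T                          # (num_query_tokens, num_doc_tokens)
    max_per_query_token = sim.max(dim=1).values    # best-matching document token per query token
    return max_per_query_token.sum()               # sum over query tokens

# Toy usage with random unit vectors standing in for real token embeddings.
q = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(40, 128), dim=-1)
print(colbert_score(q, d))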

Note that this function is only affected by the terms within the document that are most relevant to the query, which gives the desired “information ignoring” effect. I attempt to illustrate this in a hastily constructed Colab notebook which pieces together elements of the Tensor2Tensor Intro and this pretrained model.

who was the lead singer of led zeppelin?

Led Zeppelin were an English rock band formed in London in 1968.
The group comprised vocalist Robert Plant, guitarist Jimmy Page,
bassist/keyboardist John Paul Jones, and drummer John Bonham

We see that ColBERT performs a kind of soft token matching between query and document, where “singer” is aligned to “vocalist”.

ColBERTv2 uses this same architecture but leverages a more modern, distillation-based training methodology, so I show results from that paper here. We see that ColBERT achieves state-of-the-art performance on the MS MARCO benchmark.

In addition, the authors show SOTA out-of-domain performance as well. Their model is able to successfully adapt to search tasks in new domains.

Performing Search with ColBERT

While ColBERT can achieve impressive performance, there are some extra considerations in how we index and perform search to make it computationally feasible.

Indexing

ColBERT needs to store an embedding for each token in the corpus, rather than one per document. Done naively, this can turn millions of embeddings into billions. You might imagine that many of these embeddings would be very similar if we stored embeddings for all tokens. The authors instead estimate embedding centroids from a sample of the corpus and store each document as the set of centroid IDs that its tokens are closest to.
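A rough sketch of this centroid idea using scikit-learn’s k-means (this simplifies ColBERTv2’s actual index, which also stores compressed residuals; here we only learn a codebook and record centroid IDs):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# A sample of token embeddings from the corpus, shape (num_sampled_tokens, dim).
rng = np.random.default_rng(0)
sampled_token_embs = rng.standard_normal((10_000, 128)).astype("float32")  # stand-in data

# Learn a codebook of centroids from the sample.
kmeans = MiniBatchKMeans(n_clusters=256, random_state=0).fit(sampled_token_embs)

# Each document is then stored as the centroid IDs of its token embeddings,
# which is far cheaper than storing the full floating-point vectors.
doc_token_embs = rng.standard_normal((120, 128)).astype("float32")  # one document's tokens
centroid_ids = kmeans.predict(doc_token_embs)                        # (120,) small integers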

Retrieval

At query time, the query is encoded and each query token is mapped to its closest centroid. An inverted index lookup then retrieves the most similar candidates, which are re-ranked using their exact scores.
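Putting these pieces together, a highly simplified retrieval loop might look like the following (centroid_to_docs is a hypothetical inverted index from centroid ID to document IDs, doc_embeddings holds each document’s token embedding matrix, and colbert_score is the MaxSim function sketched earlier; all vectors are assumed normalized so dot product stands in for cosine similarity):

import torch

def retrieve(query_emb, centroids, centroid_to_docs, doc_embeddings, top_k=10):
    # 1) Map each query token embedding to its nearest centroid.
    nearest = (query_emb @ centroids.T).argmax(dim=1)          # (num_query_tokens,)

    # 2) Gather candidate documents from the inverted index.
    candidates = set()
    for cid in nearest.tolist():
        candidates |= centroid_to_docs.get(cid, set())

    # 3) Re-rank the candidates with their exact MaxSim scores.
    scored = [(doc_id, colbert_score(query_emb, doc_embeddings[doc_id]).item())
              for doc_id in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]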

Conclusion

None of this is to say that single-embedding search is the wrong approach. There are many other considerations, such as efficiency and, importantly, the fact that much of the open-source tooling gravitates towards this type of solution. In addition, there are many other levers one can pull to increase performance, such as hard negative mining or distillation, that don’t involve uprooting an existing architecture.

My aim here is to illustrate some of the geometric constraints of single-vector embeddings and to highlight some alternative ways of thinking about how we represent text. Late interaction models are one such approach, and they have shown very promising results by avoiding the information bottleneck of single embeddings. These approaches can be read as a more flexible version of keyword-based matching.

