From Cranfield Experiments to Neural Information Retrieval

rejasupotaro
16 min read · Dec 30, 2019


Introduction

When I was working in industry, I saw SOTA models published every week across various machine learning fields while I struggled with a complex search application made up of lots of parameters and heuristics. I naturally came to wonder, “How can I incorporate recent research advances into practical applications?”. I had this question in my mind for a while, and I ended up going to university to study information retrieval. This article is a note taken during my sabbatical leave. I have summarized what I’ve learned in the following order.

  1. Information Retrieval and its Evolution
  2. Challenges in Modern Applications
  3. Neural Networks for Information Retrieval
  4. From Research to Production
  5. Conclusion

1. Information Retrieval and its Evolution

The study of information retrieval dates back to the 1960s, when the Cranfield experiments were conducted. In 1968, Gerard Salton, who formed and led a large information retrieval group and invented the vector space model, defined information retrieval as a field concerned with the generation, storage, classification, analysis, and search of information. Roughly 60 years have passed since the field’s beginnings. The form of applications has changed radically, from libraries and legal and medical information systems to desktop search, social media, mobile search, question answering, and chatbots. Academic interests have also changed along with the surrounding environment, as seen in the slides below created by Bruce Croft.

https://ciir.cs.umass.edu/downloads/essir2019/doing-ir-research-croft.pdf

Essentially, information retrieval is the activity of obtaining information relevant to a user’s query from within large collections. The most classical style of information retrieval is so-called Boolean retrieval. Since this approach does not rank the retrieved documents, an alternative approach that predicts the degree of relevance was proposed, in which relevance scores were estimated probabilistically. Around the same time, it was shown that taking term frequency into account improves performance, and this ranked retrieval approach was continuously refined over the following decades.

Another alternative was the vector space model, where both documents and queries are represented as vectors in Euclidean space, so that the inner product of two vectors can be used to measure their similarity. Learning representations of queries and documents came to be studied even more actively after the breakthrough of neural networks.
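
To make this concrete, here is a minimal sketch of vector-space scoring (a toy vocabulary with raw term counts; real systems typically use TF-IDF or learned weights, so treat this as an illustration rather than any particular system’s implementation):

```python
import numpy as np

# Toy vocabulary and term-count vectors for two toy documents and a query.
vocab = ["quick", "brown", "fox", "news", "channel"]
doc1 = np.array([1.0, 1.0, 1.0, 0.0, 0.0])   # "quick brown fox ..."
doc2 = np.array([0.0, 0.0, 1.0, 2.0, 1.0])   # "fox news ... news channel"
query = np.array([0.0, 0.0, 1.0, 1.0, 0.0])  # query: "fox news"

def cosine(a, b):
    """Cosine similarity: the inner product normalized by vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(query, doc1))  # lower: only "fox" overlaps with the query
print(cosine(query, doc2))  # higher: both "fox" and "news" overlap
```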

The field was further advanced by the introduction of probability theory. Robertson defined the probability ranking principle, which states how to optimally rank documents based on probabilistic measures with respect to defined evaluation measures. The original probabilistic model did not include term frequency, and a number of researchers worked to incorporate it in an effective way.

For a better understanding of recent trends, let me show how classical retrieval systems work. Suppose that we have the two documents below.

  • Doc 1: The quick brown fox jumps over the lazy dog
  • Doc 2: Fox News is the number one cable news channel
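
As a minimal sketch (plain Python, no stemming; note that the table below drops common stopwords such as “the” and “is”, while this sketch keeps them), building an inverted index from these two documents might look like this:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Fox News is the number one cable news channel",
}

# Map each term to the set of document ids that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def boolean_and(query):
    """Boolean retrieval: documents containing every query term (unranked)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(boolean_and("fox"))    # {1, 2}
print(boolean_and("quick"))  # {1}
print(boolean_and("news"))   # {2}
```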

The inverted index looks like this.

+---------+-------+-------+
| Term    | Doc 1 | Doc 2 |
+---------+-------+-------+
| quick   |   x   |       |
| brown   |   x   |       |
| fox     |   x   |   x   |
| jumps   |   x   |       |
| over    |   x   |       |
| lazy    |   x   |       |
| dog     |   x   |       |
| number  |       |   x   |
| one     |       |   x   |
| cable   |       |   x   |
| news    |       |   x   |
| channel |       |   x   |
+---------+-------+-------+

From the table, it can be seen that the system returns Doc 1 for the query “quick”, Doc 2 for “news”, and both for “fox”. The system then ranks the candidates in some way to make the results more useful for users. As mentioned above, probabilistic models are based on the probability ranking principle.

If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.

It says to simply rank all documents in decreasing order of p(r = 1|d, q) to maximize the utility.

Many methods have been proposed to estimate relevance. BM25 (short for Best Matching 25) is probably the best-known, as it has been the default scoring algorithm in Lucene and Elasticsearch. In the 21st century, language modeling approaches have received more attention; there, the probability that a document’s language model generates the query is interpreted as the likelihood that the document is relevant. These methods rest on the strong assumptions below.

  1. Given relevance, terms are statistically independent.
  2. The presence of a term in a document depends on relevance only when that term is present in the query.

These assumptions may not hold in practice: (1) implies that “hot dog” and “dog hot” are the same, and (2) does not account for semantically similar terms such as synonyms.
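
For concreteness, here is a minimal BM25 sketch over the two toy documents from earlier (the textbook formula with default parameters, not the exact Lucene/Elasticsearch implementation). Note that the final score is just a sum of independent per-term contributions, which is assumption 1 in action:

```python
import math
from collections import Counter

docs = {
    1: "the quick brown fox jumps over the lazy dog".split(),
    2: "fox news is the number one cable news channel".split(),
}
N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N
df = Counter(term for d in docs.values() for term in set(d))  # document frequency

def bm25(query, doc, k1=1.2, b=0.75):
    """Textbook BM25: a sum of independent per-term contributions."""
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue  # absent terms contribute nothing
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

for doc_id, doc in docs.items():
    print(doc_id, round(bm25("fox news", doc), 3))
```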

For these reasons, applying neural networks, which can potentially capture both lexical and semantic features, has become a trend in the community. The figure below shows the proportion of papers related to neural information retrieval at SIGIR, indicating how popular the topic has become.

https://www.microsoft.com/en-us/research/publication/introduction-neural-information-retrieval/

Over the past few years, a number of neural information retrieval models have been proposed, such as DSSM (2013), CDSSM (2014), DUET (2017), DRMM (2017), PACRR (2017), Co-PACRR (2017), KNRM (2017), Conv-KNRM (2018), DeepCT (2019), CEDR (2019), HCAN (2019), and more. Related to this trend, Google announced that Google Search has started using a contextualized language model called BERT (Understanding searches better than ever before). Modern information retrieval approaches can take context into account by using stopwords, whereas traditional methods dropped them because stopwords such as “for” or “to” were considered noise when estimating relevance. As we can see, the surrounding environment and technology trends are being updated day by day.

2. Challenges in Modern Applications

In this section, we take a look at issues in modern applications. These are the problems I needed to deal with when I was in industry.

Spelling Mismatch

Words have different forms. In the example in the previous section, “jumps” is indexed as-is because I simply put in tokenized terms without stemming or lemmatizing, but users might issue a query in a different form. Exact lexical matching falls short in this respect. In Japanese, people use three different letter sets, plus different ways of writing, plus the Latin alphabet, plus ... For example, “狐”, “きつね”, “キツネ”, “フォックス”, “fox”, … all have the same meaning but are written differently (I didn’t deliberately pick a special case; this happens with all words). This kind of issue is not specific to Japanese; each language has its own linguistic quirks. Spanish is spoken widely across countries but vocabulary differs by region, there are Traditional and Simplified Chinese, Greeklish in Greek, etc. In addition, language proficiency and input devices affect spelling: I make a lot of spelling mistakes in English, and people who are not familiar with mobile devices tend to make a lot of typos. Search engines misinterpret intent when the spelling is not correct.

Google Search is smart enough to understand my intention

Lack of Semantic Understanding

Natural languages are complex and ambiguous because they reflect human activities. It is important to capture words that are not lexically close but are semantically similar. One way to mitigate the issue is query expansion: adding semantically identical or similar words by referring to a knowledge base or to embeddings. However, maintaining a large knowledge base is time-consuming, and introducing embeddings may produce unexpected search results.

Granularity is another concern when applying query expansion. “hot dog” could become “warm puppy” with term-level expansion, whereas with query-level expansion recall would decrease, since fewer expansions are found for the combined query.
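
As an illustration of term-level expansion with embeddings, here is a sketch: the gensim calls are real, but the vector file name, the thresholds, and the expansion policy are placeholder assumptions rather than a recommended setup.

```python
from gensim.models import KeyedVectors

# Hypothetical path to pre-trained vectors; any word2vec-format file would do.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def expand(query, topn=3, min_sim=0.7):
    """Term-level expansion: add nearest neighbors of each query term."""
    expanded = []
    for term in query.split():
        expanded.append(term)
        if term in vectors:
            expanded += [w for w, sim in vectors.most_similar(term, topn=topn)
                         if sim >= min_sim]
    return expanded

# "hot dog" may well expand to something like ["hot", "warm", "dog", "puppy", ...],
# which is exactly the failure mode described above.
print(expand("hot dog"))
```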

In contrast to synonyms, there are words with multiple meanings. The two occurrences of “bank” below have different meanings even though the sentences share the same structure. This is a famous example in NLP.

  • I arrived at the bank after crossing the street.
  • I arrived at the bank after crossing the river.

The example below may be more familiar to us. I see the adorable red creature every time I search for a contextualized word embedding.

Interestingly, the order of terms affects the interpretation. According to Google, people associate “渋谷 凛” (“Shibuya Rin”) with an anime character, but when reversed to “凛 渋谷” (“Rin Shibuya”), it becomes a ramen restaurant in Shibuya (which looks right to me).

Going back to the inverted index I showed, “fox” matches both Doc 1 and Doc 2, but when the query “fox” (the animal) is given, Doc 1 would be more relevant than Doc 2, since “Fox News” is a proper noun and might be somewhat distant from the user’s intent. But is our intuition correct? If so, how can we encode this knowledge in the ranking algorithm?

Ambiguous Queries

Sometimes queries do not directly express the user’s information need, such as “dishes for Christmas”, “hotels with a nice view”, and “relaxing music”. This issue is more difficult to solve naively because the number of related terms for an ambiguous query can be large; hence, maintaining such a hierarchy by hand could require near-infinite effort. Furthermore, question answering on the web is even harder, as semantic distance may not be a good measure, though recent Google Search has addressed the issue.

Query: “Method estimating the hierarchy from photos in the communist bloc” Answer: “Kremlinology”

Note that collecting weird search examples is my hobby.

Ranking Semi-Structured Complex Documents

Modern search applications typically need to retrieve complex documents.

Documents may contain several fields such as name, image, description, price, ratings, etc. The question is how to rank those documents while taking multiple fields/media into account.

Simply concatenating different text fields would not be a good idea, because the amount of information conveyed varies by field, e.g. a title is denser than a description. Instead, you can give each field a different weight, but it can be difficult to find good weights. Furthermore, it is unclear how to combine non-text fields.
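
One simple baseline is a weighted linear combination of per-field scores. The sketch below uses a placeholder term-overlap score and made-up weights just to show the shape of the approach; it does not address non-text fields, and finding good weights remains the hard part.

```python
# Hypothetical per-field weights; in practice these must be tuned or learned.
FIELD_WEIGHTS = {"title": 3.0, "description": 1.0, "keywords": 2.0}

def field_score(query, field_text):
    """Placeholder lexical score for a single field (e.g. BM25 in a real system)."""
    query_terms = set(query.lower().split())
    return sum(1 for term in field_text.lower().split() if term in query_terms)

def document_score(query, doc):
    """Weighted sum of per-field scores over the fields the document has."""
    return sum(weight * field_score(query, doc.get(field, ""))
               for field, weight in FIELD_WEIGHTS.items())

doc = {"title": "Countertop wine fridge",
       "description": "A compact beverage refrigerator for small kitchens"}
print(document_score("wine fridge", doc))
```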

Combining Ranking with Context

When a query is “pizza”, which document should be more relevant?

  1. How to make pizza: Put the flour into a large bowl, then stir in the yeast and salt …
  2. Come fare la pizza: Mettiamo in una boule la farina setacciata e uniamovi il lievito di birra sbriciolato …

In this case, relevance cannot be judged from the given information alone. The user may be English- or Italian-speaking, and the estimate can become more precise only if we are given more context.

Search results of a query “Football”

I’m sure that Google knows I’m Japanese from all the evidence it has, but Google Search keeps showing me Chinese articles that I can’t read (I guess for performance reasons).

統計学 (Japanese), 統計學 (Traditional Chinese), 统计学 (Simplified Chinese)

The evaluation of ranking is subjective. Good search results seem to vary by individual.

3. Neural Networks for Information Retrieval

It appears that there are two types of issues: matching and ranking. In this section, we look at recent studies to see how researchers have attempted to solve these issues by applying neural networks to information retrieval.

Matching

Matching is the problem of determining which documents to pass on to the ranking stage. We want to include as many relevant documents as possible while keeping the candidate set reasonably small, because ranking models are usually computationally expensive.

Semantic Matching in Product Search

If the query is “countertop wine fridge”, it would be reasonable to include the item “beverage mini refrigerator” in the candidate set. [2019] Semantic Product Search proposed a semantic matching system that captures the semantic meaning of queries.

https://wsdm2019-dapa.github.io/slides/05-YiweiSong.pdf

Capturing the semantics of text is a perennial and well-studied problem in NLP. An outcome of NLP research, embeddings, is now used everywhere in information retrieval to capture semantic features, for example in [2016] A Dual Embedding Space Model for Document Ranking.

In Amazon’s paper, they split words at different granularities and combine them to make queries robust to spelling variations. This approach looks similar to FastText. In this way, the query “artistic iphone 6s case” is represented as below.

  • Word Unigram: [“artistic”, “iphone”, “6s”, “case”]
  • Word Bigram: [“artistic#iphone”, “iphone#6s”, “6s#case”]
  • Character Trigram: [“#ar”, “art”, “rti”, “tis”, “ist”, “sti”, “tic”, “ic#”, “c#i”, “#ip”, “iph”, “pho”, “hon”, “one”, “ne#”, “e#6”, “#6s”, “6s#”, “s#c”, “#ca”, “cas”, “ase”, “se#”]
https://wsdm2019-dapa.github.io/slides/05-YiweiSong.pdf

Then, the retrieved vectors are averaged. Combining different granularities seems to help preserve semantic meaning. They also compared their approach with LSTMs and GRUs, but averaging performs similarly or slightly better than recurrent units with significantly less training time. That may be because, unlike in web search, queries are relatively short and consist of a set of keywords rather than natural language. The best approach seems to depend on the characteristics of the queries and documents.
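
Here is a rough sketch of that tokenization scheme as I understand it from the slides (my own reimplementation, not the authors’ code); in the actual model, each n-gram would be looked up in an embedding table and the resulting vectors averaged into a single query vector.

```python
def word_ngrams(tokens, n):
    """Word n-grams joined with '#', e.g. 'artistic#iphone' for bigrams."""
    return ["#".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_trigrams(tokens):
    """Character trigrams over the query with '#' as the word boundary."""
    padded = "#" + "#".join(tokens) + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

tokens = "artistic iphone 6s case".split()
grams = word_ngrams(tokens, 1) + word_ngrams(tokens, 2) + char_trigrams(tokens)
print(grams)

# With a (hypothetical) embedding table `emb` mapping n-gram -> vector, the
# query representation would be the average of the n-gram vectors:
# query_vec = np.mean([emb[g] for g in grams if g in emb], axis=0)
```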

Another interesting point is that they combined the existing lexical matching system with the new semantic matching system when running online experiments. The reason is not mentioned in the paper, but I guess they wanted to minimize the risk of performance degradation.

https://wsdm2019-dapa.github.io/slides/05-YiweiSong.pdf

Semantic Matching in Web Search

Methods using pre-trained language models have begun to demonstrate SOTA results in information retrieval tasks. The figure below shows the benchmark on TREC Robust04, a standard benchmark for measuring the performance of ad-hoc retrieval systems. The outstanding jump since 2018 was achieved by utilizing contextualized language models.

https://paperswithcode.com/sota/ad-hoc-information-retrieval-on-trec-robust04

Ranking

After generating candidates, we want to rank items based on whatever rationale we believe in. Learning-to-rank is the subfield of information retrieval focused on ranking items, and recently researchers have attempted to use neural models for the task. The difference between existing learning-to-rank approaches and neural ranking approaches is that the former rely on hand-crafted features while the latter learn representations. Neural ranking models are known to achieve better scores thanks to this property, as shown in the slides below.

http://bendersky.github.io/res/TF-Ranking-ICTIR-2019.pdf

I picked out some papers related to my interests in ranking problems.

Ranking Multi-Field Items

In experimental settings, it has usually been assumed that documents consist of only one text field. However, practical search applications need to retrieve complex documents consisting of short text, numerical values, keyword sets, images, audio, etc. One known method for combining multiple fields is BM25F, an extension of BM25. [2017] Neural Ranking Models with Multiple Document Fields introduces a method for handling multi-field items in neural models, called NRM-F after BM25F.

The framework consists of three components: instance representation, field-level aggregation, and document-level aggregation.

https://www.microsoft.com/en-us/research/publication/neural-ranking-models-multiple-document-fields/

In naive implementations, the absence of important fields strongly affects scoring. To minimize the impact, they introduced techniques called field-level masking and field-level dropout.
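
A minimal sketch of the field-level dropout idea as I understand it: during training, an entire field’s representation is zeroed out with some probability so the model cannot over-rely on any single field. The shapes and rates below are illustrative assumptions, not the paper’s values.

```python
import numpy as np

rng = np.random.default_rng(0)

def field_level_dropout(field_vectors, drop_prob=0.2, training=True):
    """field_vectors: array of shape (num_fields, dim).
    With probability drop_prob, a field's whole vector is zeroed during training."""
    if not training:
        return field_vectors
    keep = rng.random(field_vectors.shape[0]) >= drop_prob
    return field_vectors * keep[:, None]

fields = rng.normal(size=(4, 8))  # e.g. title, description, keywords, brand
print(field_level_dropout(fields))
```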

Personalized Ranking

Even if you built a perfect information retrieval system that returns documents perfectly relevant to a given query, NDCG would not reach 1.0, because relevance is subjective, as mentioned earlier. There are latent factors beyond the query that indicate the user’s information need, which results in the gap shown below.

https://web.stanford.edu/class/cs276/handouts/personalization-lecture-1-per-page.pdf

In product search such as Amazon’s, it is known that users’ behavioral history and economic conditions affect decision making. [2019] A Zero Attention Model for Personalized Product Search discusses a method exploiting this fact. It is also known that personalization can have a negative impact on search; for that reason, they devised a method that applies personalization automatically only when it is effective.

Contextualized Ranking

Though the literature says that personalization is useful in product search, is it always true that long-term preferences help achieve a short-term goal? Say, is recommending a Star Wars T-shirt to Star Wars fans effective when they are browsing for clothes? It is also worth noting that personalization cannot be applied in applications where the rate of guest users (who do not have an account) is high.

[2019] Leverage Implicit Feedback for Context-aware Product Search focuses on optimizing search results using short-term preferences. Unlike in web search, users click many products for comparison before making a decision. To utilize these signals, they reformulated the problem as a dynamic ranking problem and use clicked items to re-rank the items on the next page.

4. From Research to Production

Many studies have shown that neural models outperform existing methods on benchmarks, but there are challenges to overcome when deploying them to production.

Effectiveness vs. Efficiency

In practice, search systems have performance constraints. For that reason, current neural ranking models are implemented as multistage rankers. [2019] Recommending What Video to Watch Next: A Multitask Ranking System describes the algorithm behind YouTube recommendations, a real-world example of a large application. They first filter videos by similarity, then rank them with a pointwise loss for performance reasons (in theory, a listwise loss is better for optimizing ranking metrics: An Analysis of the Softmax Cross Entropy Loss for Learning-to-Rank with Binary Relevance). I had imagined that Google had a tremendous number of supercomputers computing a listwise loss over trillions of items in 1 millisecond with secret algorithms, but that was not the case, which made me relieved (?). It shows that no one can disregard efficiency when building a model.
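
To make the distinction concrete, here is a minimal NumPy sketch of the two loss flavors (binary relevance; my own simplification rather than YouTube’s implementation): the pointwise loss treats each item independently, while the listwise softmax cross entropy makes items in the same list compete for probability mass.

```python
import numpy as np

def pointwise_logistic_loss(scores, labels):
    """Binary cross entropy: each (item, label) pair is treated independently."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def listwise_softmax_loss(scores, labels):
    """Softmax cross entropy over the list: scores compete against each other."""
    log_softmax = scores - np.log(np.sum(np.exp(scores)))
    return -np.sum(labels * log_softmax) / max(np.sum(labels), 1)

scores = np.array([2.0, 0.5, -1.0])   # model scores for one result list
labels = np.array([1.0, 0.0, 0.0])    # binary relevance labels
print(pointwise_logistic_loss(scores, labels), listwise_softmax_loss(scores, labels))
```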

When reading papers, I noticed that there are several different experimental settings.

Setting (1) aims to add potentially related items to the candidate set. If the system successfully captures semantic meaning, the number of query reformulations would decrease; if it does not perform well, “warm puppy” could show up when you search for “hot dog”. Setting (2) ranks all items, which may lead to high effectiveness but would be intractable in practice. In (3), we are given a generated candidate set and asked to re-rank it, which aims to replace the existing ranker; however, the candidate set could still be too large to handle. Setting (4) re-ranks the top-k candidates already ordered by a prior ranker. Since the size is fixed, predicting load is easier; this is a practical setting, but it limits the potential of the ranking model. It appears that there is a trade-off between effectiveness and efficiency.

Speaking of the trade-off, in multi-stage information retrieval systems, if the matching layer prioritizes recall, irrelevant documents may be included in the candidate set, so higher quality is required in the ranking layer. Conversely, if the ranking layer does not prioritize accuracy (say, it orders by published date or something), the search results become noisy unless the matching layer filters out irrelevant items perfectly. This trade-off is known to be difficult to optimize because of the disjoint nature of the two stages. Setting (5) is the idea of building an end-to-end system to deal with this trade-off and the performance issues. Let’s look at the effort in this direction a bit more in the next section.

Query Term Independence Assumption

Traditional relevance estimation methods make a strong assumption: query terms are independent. Under this assumption, documents can be filtered using an inverted index, since absent terms do not affect a document’s relevance. For example, if the query is “Tchaikovsky masterpiece”, the relevance score is an accumulation of per-term contributions, e.g. “Tchaikovsky”: 10 + “masterpiece”: 6 = 16; there is no interaction between terms. But is that true for queries like “hot dog” or “what is the meaning of 42”? Recent research has shown that using contextualized language models outperforms the previous state of the art ([2019] Passage Re-ranking with BERT), which implies that terms are dependent to some extent. If we cannot assume independence, we may not be able to rely on inverted indexes to filter documents.
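
This is exactly what makes inverted-index scoring cheap: each term’s contribution can be accumulated independently over its posting list, as in the sketch below (the per-term scores are made up to match the arithmetic above).

```python
from collections import defaultdict

# Hypothetical per-term, per-document contributions (e.g. precomputed BM25 parts).
postings = {
    "tchaikovsky": {"doc1": 10.0, "doc3": 7.0},
    "masterpiece": {"doc1": 6.0, "doc2": 4.0},
}

def term_at_a_time(query_terms):
    """Accumulate each term's contribution independently; absent terms add nothing."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, contribution in postings.get(term, {}).items():
            scores[doc_id] += contribution
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(term_at_a_time(["tchaikovsky", "masterpiece"]))  # doc1 scores 16.0
```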

Neural Index Construction

[2018] From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing proposed a standalone neural search system that introduces a sparsity property to learn a latent sparse representation for each query and document. This representation can be used as an inverted index, which dramatically reduces the computational cost.

I implemented the system and visualized the inverted indexes.

Original Inverted Index
Neural Inverted Index (Regularization Term: 0.001)
Neural Inverted Index (Regularization Term: 0.0005)

It can be seen that sparsity can be controlled by the regularization term, much as in sparse autoencoders. However, two questions remain: how to choose the optimal regularization term, and how to apply this to models that require complex interactions.
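
For intuition, here is a minimal PyTorch-style sketch of the sparsity mechanism (illustrative only; the paper’s architecture and training objective are more involved): an L1 penalty on the latent activations, weighted by the regularization term, pushes most dimensions to zero so the representation can back an inverted index.

```python
import torch

torch.manual_seed(0)

# Toy encoder mapping a dense input to a high-dimensional latent "term" space
# that plays the role of the inverted-index vocabulary.
encoder = torch.nn.Sequential(torch.nn.Linear(64, 1024), torch.nn.ReLU())
l1_weight = 0.001  # the regularization term controlling sparsity, as in the figures above

x = torch.randn(32, 64)          # a batch of toy query/document inputs
latent = encoder(x)              # ReLU keeps the representation non-negative
sparsity_penalty = l1_weight * latent.abs().mean()

# During training, this penalty would be added to the relevance loss:
# loss = relevance_loss + sparsity_penalty
print("fraction of zero entries:", (latent == 0).float().mean().item())
```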

Measuring the Impact of the Assumption

By the way, I was wondering how plausible the query term independence assumption actually is. [2019] Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks measured the impact of the assumption on three neural models: BERT, Duet, and CKNRM. Surprisingly, Duet and CKNRM suffered no statistically significant adverse effect on ranking effectiveness. Though BERT’s performance degraded, the drop in MRR might be small enough. I found this result quite interesting, as it shows the potential for deep models to be applied to large collections.

Training Data-Hungry Models

It can be difficult to collect a sufficient amount of training data, as neural information retrieval systems are data-hungry. [2017] Neural Ranking Models with Weak Supervision proposed training a model on documents ranked by BM25, so that we can obtain a virtually unlimited amount of data. [2018] Joint Modeling and Optimization of Search and Recommendation showed that a search model can be trained jointly with a recommendation model, as their goals are essentially the same (as Belkin and Croft stated in 1992). It shows the possibility that we could improve one model if we already have the other.
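
A sketch of how such weak supervision could be set up (my own simplification of the idea: an unsupervised ranker labels unlimited query-document pairs, and the neural model is then trained on those pseudo-labels; the sampling policy here is a placeholder):

```python
import random

def weak_training_triples(queries, collection, score_fn, num_negatives=4):
    """Build pseudo-labeled (query, positive, negative) triples from an
    unsupervised scorer such as the BM25 sketch shown earlier (score_fn)."""
    triples = []
    for query in queries:
        ranked = sorted(collection, key=lambda doc: score_fn(query, doc), reverse=True)
        positive = ranked[0]                   # top-ranked document as pseudo-positive
        tail = ranked[len(ranked) // 2:]       # bottom half as pseudo-negative pool
        negatives = random.sample(tail, k=min(num_negatives, len(tail)))
        triples += [(query, positive, negative) for negative in negatives]
    return triples

# A neural ranker would then be trained on these triples with a pairwise or
# listwise loss, without any human relevance judgments.
```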

5. Conclusion

In this article, we have skimmed the surface of recent trends in the information retrieval community. I want to highlight three things below.

  1. The surrounding environment is changing day by day.
  2. Existing assumptions may or may not be correct.
  3. The best approach varies from application to application.

I do not think all search applications need a neural model. Or rather, I would say that Elasticsearch + BM25 + heuristics works surprisingly well. Managing neural information retrieval models still requires a lot of effort, but I hope the cost will decrease and the search experience will become much better in the future.
