Enhancing Information Retrieval via Semantic and Relevance Matching

Suryakant Pandey
May 8 · 10 min read
Information retrieval

Introduction

Data is the gold of the 21st century. Every day we create quintillions of bytes of data. Information is the processed and refined form of data that carries logical meaning. Whether we use a search engine, look for a product on an e-commerce website, or search for articles, products or people, information retrieval (IR) is everywhere and has become an integral part of our daily lives.

Understanding Information retrieval

In computer science, information retrieval is defined as the process of obtaining, from a large collection of documents, the relevant documents that satisfy an information need. In the most common case, the information need is expressed as a query string (e.g. a Google search) and the retrieved information is also text (e.g. the search results on Google).

Sample IR system

The core task in IR is to first find the documents that match a query (the retrieval stage) and then rank the matched documents (the ranking stage). Matching happens between the query and each document in the collection; because the collection can be very large (billions of documents), the matching logic has to be efficient. Ranking is then based on the content relevance of a document to the query, the performance metrics of the document, and the user context.

In this article, we discuss algorithms that aim to find the most relevant documents for a query based on their textual content as well as their semantic meaning. We cover how matching a document with a query has evolved and how to use matching signals in ranking to increase the relevance of results.

Classification of Matching algorithms

The diagram below captures the important algorithms for matching in IR. At a high level, we have the following approaches.

The next level of classification is based on whether the algorithm is neural-network based or not. Neural-network-based algorithms are further classified into representation based (the query and the documents are first translated into an embedding space individually, without knowledge of each other, and then matched, at the risk of losing exact-match signals) and interaction based (the interaction between query terms and document terms is modelled explicitly, so exact-match signals are not lost).
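To make the difference between the two neural families concrete, here is a minimal sketch. The three-dimensional word vectors in EMB are hand-picked and hypothetical, and mean pooling stands in for a learned encoder; a real model would learn both.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for illustration only.
EMB = {
    "cheap":  [1.0, 0.0, 0.0],
    "budget": [0.9, 0.1, 0.0],
    "phone":  [0.0, 1.0, 0.0],
    "mobile": [0.1, 0.9, 0.0],
    "case":   [0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(tokens):
    """Average the word vectors of a token list (stand-in for an encoder)."""
    vecs = [EMB[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def representation_score(query, doc):
    """Representation based: encode each side on its own, then compare once."""
    return cosine(mean_vector(query), mean_vector(doc))

def interaction_matrix(query, doc):
    """Interaction based: compare every query term with every document term."""
    return [[cosine(EMB[q], EMB[d]) for d in doc] for q in query]

query = ["cheap", "phone"]
doc = ["budget", "phone", "case"]
rep = representation_score(query, doc)
inter = interaction_matrix(query, doc)
```

In the interaction matrix, the entry for "phone" against "phone" is exactly 1.0, so the exact match stays visible to whatever model consumes the matrix. In the representation score, that signal is blended into a single number.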

Classification of Matching algorithms in IR

We now discuss each of them in terms of key points. We have tried to cover breadth, omit implementation details, and distill the novel idea of each paper into key takeaways.

Basic Bag of Words Model

Term-Document matrix to enable efficient retrieval
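A term-document matrix is usually stored sparsely as an inverted index, which maps each term to the documents that contain it. The sketch below assumes whitespace tokenisation and a toy collection; the function names build_inverted_index and retrieve are ours, for illustration only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    """Return ids of documents containing at least one query term."""
    matches = set()
    for term in query.lower().split():
        # Only the posting lists of the query terms are touched,
        # so the rest of the collection is never scanned.
        matches |= index.get(term, set())
    return matches

docs = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red wine glass",
}
index = build_inverted_index(docs)
results = retrieve(index, "red shoes")
```

Lookup cost depends on the length of the posting lists for the query terms, not on the size of the collection, which is why this structure keeps the retrieval stage fast.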

Pseudo-relevance feedback
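The idea can be sketched as follows: run the query, assume the top results are relevant, borrow their most frequent new terms, and search again. This is a simple frequency heuristic under toy assumptions (term-count scoring, a three-document collection, hypothetical function names), not a specific published weighting scheme.

```python
from collections import Counter

def score(query_terms, doc_terms):
    """Toy relevance score: number of query-term occurrences in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in query_terms)

def search(query_terms, docs):
    """Rank documents that match at least one query term, best first."""
    scored = [(score(query_terms, terms), doc_id) for doc_id, terms in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]

def expand_query(query_terms, docs, top_k=1, n_terms=1):
    """Add the most frequent new terms from the top_k results to the query."""
    top = search(query_terms, docs)[:top_k]
    pool = Counter(
        t for doc_id in top for t in docs[doc_id] if t not in query_terms
    )
    return list(query_terms) + [t for t, _ in pool.most_common(n_terms)]

docs = {
    "d1": "puppy dog training for a young dog".split(),
    "d2": "dog food reviews".split(),
    "d3": "cat food reviews".split(),
}

first_pass = search(["puppy"], docs)
expanded = expand_query(["puppy"], docs)
second_pass = search(expanded, docs)
```

Here the first pass only finds d1. The expansion picks up "dog" from d1, and the second pass also returns d2, which never mentions "puppy". That is the recall gain, and it comes with the risk of drifting away from the user's intent if the top results were not actually relevant.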

Sequential Dependence Model
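The proximity intuition behind this model can be illustrated with two hand-written features: exact bigram matches for adjacent query terms, and co-occurrence of those terms within a small window. The sketch only computes the raw counts; the function names and the window size of 4 are assumptions for illustration, and combining such features with unigram scores into a final ranking function is left out.

```python
def ordered_matches(query, doc):
    """Count adjacent query-term pairs that appear as an exact bigram in doc."""
    doc_bigrams = list(zip(doc, doc[1:]))
    return sum(doc_bigrams.count(pair) for pair in zip(query, query[1:]))

def unordered_matches(query, doc, window=4):
    """Count window positions in doc that contain both terms of an adjacent query pair."""
    total = 0
    n_windows = max(1, len(doc) - window + 1)
    for a, b in zip(query, query[1:]):
        for start in range(n_windows):
            span = doc[start:start + window]
            if a in span and b in span:
                total += 1
    return total

query = ["new", "york"]
adjacent = "flights to new york city".split()
nearby = "york has many new restaurants".split()
far_apart = "new cars are sold in old york".split()
```

All three documents contain both query terms, so a pure bag-of-words match cannot tell them apart. The ordered feature fires only for the first document and the unordered feature fires for the first two, which gives the ranker evidence that the third is a weaker match.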

Composite Match Autocompletion (COMMA)

Latent semantic Analysis

SVD factorisation to learn topics
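As a small sketch of how SVD can surface topics, the example below factorises a toy term-document count matrix and keeps the two largest singular values. The matrix, the vocabulary and the choice of k = 2 are made up for illustration.

```python
import numpy as np

# Rows are terms, columns are documents (toy counts).
terms = ["dog", "puppy", "bone", "stock", "market"]
A = np.array([
    [2, 0, 1, 0],   # dog
    [0, 2, 1, 0],   # puppy
    [1, 1, 1, 0],   # bone
    [0, 0, 0, 2],   # stock
    [0, 0, 0, 1],   # market
], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent topics to keep
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

raw_sim = cosine(A[:, 0], A[:, 1])             # "dog" document vs "puppy" document, term space
latent_sim = cosine(doc_vecs[0], doc_vecs[1])  # same pair, topic space
cross_sim = cosine(doc_vecs[0], doc_vecs[3])   # pet document vs finance document
```

In term space the "dog" document and the "puppy" document share only the word "bone", so their similarity is low. After projection onto the top two topics they land on the same pet direction, while the finance document stays orthogonal to both.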

Latent Dirichlet Allocation

Semantic similarity with in-session queries

Content relevance model on engagement rate

Twitter Content Relevance

MERCURE System

Deep Structured Semantic model (DSSM)

Deep semantic similarity model
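One input step that DSSM is known for is word hashing: each word is wrapped in boundary markers and broken into letter trigrams, so the network sees a small, fixed vocabulary of trigrams instead of an open vocabulary of words. A minimal sketch of that step, assuming lowercase whitespace tokenisation (the function names are ours):

```python
from collections import Counter

def letter_trigrams(word):
    """Break a word into letter trigrams, using '#' as the boundary marker."""
    marked = f"#{word.lower()}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def hash_text(text):
    """Bag of letter trigrams for a query or document title."""
    bag = Counter()
    for word in text.split():
        bag.update(letter_trigrams(word))
    return bag

trigrams = letter_trigrams("good")
overlap = hash_text("running shoes") & hash_text("runing shoe")
```

Because misspellings and inflections share most of their trigrams with the original word, the two texts in the last line still overlap heavily. The neural layers that map these bags into the semantic space are omitted here.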

Deep Relevance Matching Model

Doc2Query

Doc2Query

K-NRM (Kernel based neural ranking model)

K-NRM model architecture
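The central piece of K-NRM is kernel pooling: a matrix of query-term to document-term cosine similarities is summarised by a set of RBF kernels, each counting "soft" matches around a different similarity level. The sketch below covers only that pooling step. The similarity matrix is made up, a single sigma is shared by all kernels as a simplification, and the embedding layer and the learned ranking layer are left out.

```python
import math

def kernel_pooling(M, mus, sigma=0.1):
    """Turn a similarity matrix into one soft-match feature per kernel.

    M    : list of rows, one per query term, holding cosine similarities
           to each document term.
    mus  : kernel centres, e.g. 1.0 for (near-)exact matches.
    """
    features = []
    for mu in mus:
        total = 0.0
        for row in M:
            # Soft term frequency: how many document terms sit near mu.
            soft_tf = sum(math.exp(-((m - mu) ** 2) / (2 * sigma ** 2)) for m in row)
            # Log-sum over query terms; the floor avoids log(0).
            total += math.log(max(soft_tf, 1e-10))
        features.append(total)
    return features

# Hypothetical similarities: 2 query terms x 3 document terms.
M = [
    [1.0, 0.2, 0.0],
    [0.1, 0.9, 0.3],
]
mus = [1.0, 0.5, 0.0]
features = kernel_pooling(M, mus)
```

The kernel centred at 1.0 behaves like an exact-match counter, while the others pick up weaker, semantic matches. A ranking layer can then learn how much weight each similarity level deserves.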

Semantic Product Search

Generating embedding for document and query
Neural network for semantic product search

DocBERT

BERT contextualised embedding

Deep Contextualized Term Weighting framework (DeepCT)

DeepCT in action

Summary

  1. The bag-of-words model has been very effective (exact term matching is a very important relevance signal) and efficient (with inverted indexing, fetching documents with matching terms is very fast and does not require iterating over every document).
  2. The bag-of-words model has limitations. It doesn't understand human language (a document for a non-Chinese phone will match the query "chinese phone") and doesn't handle vocabulary mismatch (e.g. a document about a puppy will not match a query for dog). Term importance is decided by term frequency and not by semantic understanding.
  3. We can increase recall by using a mix of the top documents and the user-entered query as feedback to return more documents.
  4. Modelling phrase importance and term importance in the document (P(Document | Phrase), P(Document | Term)) can increase relevance.
  5. If terms that are close together in the query are also close together in the document, there is strong evidence in favour of relevance.
  6. Matching the query not just against the document title but also against category and facets, and deriving ranking signals from matches in these fields, can help increase relevance.
  7. Semantic matching and relevance matching differ. Semantic matching asks whether the meaning is the same and treats query and document alike. Relevance matching relies on exact-match signals and query term importance, and accounts for queries usually being shorter than documents.
  8. Use user engagement on a document for a query as a measure of relevance.
  9. The neural retrieval models discussed (DSSM, DRMM, K-NRM, semantic product search) try to capture the following:

Feedback

Questions? Comments? Contact: LinkedIn, Instagram

MLearning.ai
