Hybrid Search on Azure AI Search for Retrieval Augmented Generation (RAG): a more effective search

Lydia AREZKI
4 min read · Dec 13, 2023


The retrieval step in retrieval augmented generation (RAG) applications is usually implemented using vector search. This approach finds the most relevant passages of a set of documents by relying solely on semantic similarity.

In this article, we’ll explore a set of novel features available on Azure AI Search, namely hybrid search and semantic ranking, and some of the best practices to optimize the relevance of the retrieved documents. Hybrid search and semantic ranking bring additional capabilities that complement and build on vector search to effectively improve relevance. This is especially true for Generative AI scenarios where applications use the RAG pattern, though these conclusions apply to many general search use cases as well.

Retrieval and ranking:

Modern search engines follow a two-step pattern:

Step 1 - Retrieval: In this step, we aim to find all relevant documents in the index that satisfy the search criteria, typically across a large number of documents. These candidates are scored to pick the top n documents to pass on to the ranking step. Azure AI Search supports:

  • Keyword search: traditional full-text search based on terms or keywords. The document content is broken into key terms through text analysis. Retrieval relies on inverted indexes for fast lookup and a probabilistic model for scoring.
  • Vector search: documents are converted from text to vector representations using an embedding model. Retrieval is performed by generating a query embedding and finding the documents whose vectors are closest to the query’s. OpenAI’s text-embedding-ada-002 is a very common model for generating embeddings.
  • Hybrid search: Combines both keyword search and vector retrieval. A fusion step is necessary to select the best results from each technique. Azure AI Search currently uses Reciprocal Rank Fusion (RRF) to produce a single result set.
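
To illustrate the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked result lists. The constant k = 60 is a common default from the RRF literature and is an assumption here, as are the document IDs; the exact constant Azure AI Search uses internally is not covered in this article.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF).
# Assumption: k = 60, a value commonly used in the RRF literature;
# the constant used inside Azure AI Search may differ.

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores = {}
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score means a better combined rank.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked outputs of a keyword query and a vector query.
keyword_results = ["doc3", "doc1", "doc7"]
vector_results = ["doc1", "doc5", "doc3"]

print(reciprocal_rank_fusion([keyword_results, vector_results]))
# doc1 and doc3 rise to the top because both retrievers agree on them.
```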

Step 2 - Ranking: recomputes a better-quality relevance score for the top n documents returned by step 1. At this stage, we can only reorder the retrieved documents; if the ideal document was missed in step 1, there is no way to recover it. This step is important for RAG applications to make sure the best results end up in the top positions.

  • Semantic ranker: Azure AI Search uses semantic ranking, a deep learning-based model adapted from Microsoft Bing, to rerank the top 50 documents. It uses the query and the documents together to produce the scores.
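
To make this concrete, here is a minimal sketch of a hybrid query with semantic ranking using the azure-search-documents Python SDK (version 11.4 or later) and the OpenAI Python client. The endpoint, key, index name, field names (content, contentVector) and semantic configuration name are placeholder assumptions, not values from this article; adapt them to your own index.

```python
# Minimal sketch: hybrid (keyword + vector) query with semantic ranking
# on Azure AI Search. Assumes azure-search-documents >= 11.4 and openai >= 1.x.
# Endpoint, key, index name, field names and the semantic configuration
# name below are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import OpenAI

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-query-key>"),
)
openai_client = OpenAI()  # assumes OPENAI_API_KEY is set

question = "How do I rotate my storage account keys?"

# 1. Embed the query with the same model used to embed the documents.
embedding = openai_client.embeddings.create(
    model="text-embedding-ada-002", input=question
).data[0].embedding

# 2. Hybrid search: keyword retrieval via search_text, vector retrieval via
#    vector_queries; Azure AI Search fuses the two result sets with RRF.
#    query_type="semantic" applies the semantic ranker on top.
results = search_client.search(
    search_text=question,
    vector_queries=[
        VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="contentVector")
    ],
    query_type="semantic",
    semantic_configuration_name="<your-semantic-config>",
    top=5,  # RAG applications typically keep only the top 3 to 5 chunks
)

for doc in results:
    print(doc.get("@search.reranker_score"), doc["content"][:80])
```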

What is the best search strategy for Gen AI/RAG applications?

According to Microsoft, based on a study (linked below) conducted on both customer indexes and academic benchmarks, the best strategy appears to be the full hybrid approach: keyword + vector search, followed by the semantic ranker.

Let’s dive into why this is the case:

  • Keyword and vector retrieval approach search from different perspectives, which yields complementary capabilities.
  • Vector search prioritizes meaning. It is less sensitive to misspellings, synonyms, and phrasing differences, and it can work in cross-lingual scenarios. It captures more of the intent behind the query.
  • Keyword search is useful because it prioritizes matching specific, important words that might be diluted in an embedding. It works well for exact textual matches, such as email addresses or product names.
  • Last but not least, semantic ranking is an important refining step, since generative AI scenarios typically only use the top 3 to 5 results as their context.
[Figure: Percentage of queries where high-quality chunks are found in the top 1 to 5 results, compared across search configurations]

Best practices:

Here are some best practices when implementing search on Azure AI Search, especially for Generative AI scenarios where applications use the RAG pattern.

Best practice 1: Use hybrid search + the semantic ranker for optimal results.

Best practice 2: Do not use large chunks.

Chunking solves 3 problems for Generative AI applications:

  1. Respect the context-window limit when you have long documents: splitting them into limited-length passages makes it possible to pass them to the LLM.
  2. Chunking provides a mechanism for the most relevant passages of a given document to be ranked first.
  3. Vector search has a per-model limit to how much content can be embedded into each vector.

Embedding models must compress all the semantic content of a passage into a limited number of floating-point numbers (e.g. Ada-002 uses 1,536 dimensions). If you encode a long passage with multiple topics into a single vector, important nuance can get lost. Using large chunks reduces retrieval performance.

Best practice 3: Use overlapping chunks.

Some overlap is always beneficial, so that there is shared context between the chunks. A good rule of thumb is to use 10 to 25% overlap.
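
As an illustration of best practices 2 and 3 together, here is a minimal sketch of a token-based chunker with overlap, using tiktoken. The 512-token chunk size and 15% overlap are assumed example values within the ranges discussed above, not settings prescribed by Azure AI Search.

```python
# Minimal sketch of fixed-size, overlapping chunking with tiktoken.
# Assumptions: 512-token chunks with ~15% overlap (within the 10-25%
# rule of thumb above); the encoding matches text-embedding-ada-002.
import tiktoken

def chunk_text(text, chunk_size=512, overlap_ratio=0.15):
    """Split text into chunks of roughly chunk_size tokens with overlap."""
    encoding = tiktoken.encoding_for_model("text-embedding-ada-002")
    tokens = encoding.encode(text)
    step = int(chunk_size * (1 - overlap_ratio))  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        chunks.append(encoding.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Dummy document for illustration only.
document = "Azure AI Search supports keyword, vector, and hybrid retrieval. " * 200
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(chunk))
```

Each chunk would then be embedded and indexed as its own document in Azure AI Search, so the most relevant passage of a long source document can be retrieved and ranked on its own.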

Sources:
