Azure Vector Search: A new way of information retrieval in diverse data types

Vishal Padma
Published in Version 1
6 min read · Jan 17, 2024
Image generated using DALL-E 3

Introduction

Information retrieval has become one of the most important tasks when working with large document collections. Large data sets are valuable, but only if we can retrieve exactly the data we need. As data grows in volume and complexity, plain keyword search is no longer enough: we need to retrieve relevant data even when the query expresses an idea in different words than the document does, that is, contextual search.

That is where Azure vector search comes in. It is an information retrieval method in which both the document content and the user query are represented as vectors (embeddings) instead of plain text. An embedding model generates the vector representation of the data, which can be text, images, audio, or video. Because everything is in vector form, a query can locate a match in the vector space even if the original content is in a different medium or a different language than the query.

Process Flow

Figure: Process flow for Azure AI Search using vector search

There are two parts to this process. In the first part, we pre-process the input documents and store their embeddings in the index along with other details. In the second part, we query the documents using vector search.

Data Pre-processing: All input documents are scanned in this step. In our case the content was plain text, so we used Form Recognizer to extract the text from the documents. An embedding model then converted the extracted text into vector embeddings, which were stored in the index under the embedding field.
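The pre-processing step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline we ran: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into buckets and captures no meaning), the field names `id`, `content`, and `embedding` are assumed, and the sample page text is made up.

```python
import hashlib
import math

EMBEDDING_DIM = 64  # assumed toy dimension; real embedding models use far more


def toy_embed(text: str, dim: int = EMBEDDING_DIM) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # Map each token to a stable bucket so the same word always lands in the same slot.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def build_index_records(pages: list[dict]) -> list[dict]:
    """Attach an 'embedding' field to each extracted page, ready to be stored in an index."""
    return [
        {"id": page["id"], "content": page["content"], "embedding": toy_embed(page["content"])}
        for page in pages
    ]


# Made-up sample text standing in for the output of the extraction step.
pages = [
    {"id": "page-1", "content": "Automated test runs are scheduled nightly"},
    {"id": "page-7", "content": "Estimate emissions for each building"},
]
records = build_index_records(pages)
print(records[0]["id"], len(records[0]["embedding"]))
```

In a real setup, `toy_embed` would be replaced by a call to the chosen embedding model, and the records would be uploaded to the Azure AI Search index rather than kept in memory.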

Data Retrieval: The user query is collected from the application and converted into a vector in an encoding step, using the same embedding model that encoded the documents. The encoded query is sent to the index on Azure AI Search for a similarity search, and Azure AI Search returns the documents with the highest similarity scores (using k-nearest neighbors, kNN). The results are sorted by similarity score, and we can limit them further, for example by keeping only the top 10.
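To make the retrieval step concrete, here is a small sketch of an exhaustive k-nearest-neighbors search. It assumes cosine similarity as the metric and uses tiny hand-written 3-dimensional vectors in place of real embeddings; the service performs this search for you, so the function names here are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_search(query_vector: list[float], records: list[dict], k: int = 10) -> list[tuple[float, str]]:
    """Score every record against the query and return the k best as (score, id) pairs."""
    scored = [(cosine_similarity(query_vector, r["embedding"]), r["id"]) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return scored[:k]


# Hand-written toy vectors; real embeddings come from the embedding model.
records = [
    {"id": "doc-a", "embedding": [1.0, 0.0, 0.0]},
    {"id": "doc-b", "embedding": [0.7, 0.7, 0.0]},
    {"id": "doc-c", "embedding": [0.0, 0.0, 1.0]},
]
query_vector = [0.9, 0.1, 0.0]
top_results = knn_search(query_vector, records, k=2)
for score, doc_id in top_results:
    print(f"{doc_id}: {score:.3f}")
```

Here `doc-a` ranks above `doc-b` because its vector points in almost the same direction as the query, and `doc-c` is dropped by the `k=2` cut-off.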

Capabilities Of Vector Search

1. Vector search for text: The standard method, in which we encode the text from the documents using any embedding model. The user query is then encoded as a vector with the same model and used to retrieve the documents.

2. Vector search across different data types (multi-modal): We can vectorize data of different types, such as images, text, audio, and video, and search for similarity across all of them.

3. Multi-lingual search: Supports documents in multiple languages through the use of a multi-lingual embedding model. This way we can search a single vector space to find relevant documents irrespective of their language.

4. Hybrid search: Operates at the field level, so a single query can combine vector fields and searchable text fields. The vector and text queries run concurrently, and their results are merged into a single response. For greater precision, you can add semantic ranking, which refines the results through L2 reranking using the same language models that power Bing.

5. Filtered vector search: The query request can combine a vector query with filter expressions. Filter criteria include or exclude search documents and can be applied to text fields, numeric fields, and metadata. The search engine can apply the filter either before or after the vector query executes.
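The difference between filtering before and after the vector query (item 5) is easy to miss, so here is a small sketch of both orders. It is an in-memory illustration only: the `language` field, the `mode` parameter, and the sample vectors are assumptions made for this example, not the service's API.

```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_vector_search(
    query_vector: list[float],
    records: list[dict],
    predicate: Callable[[dict], bool],
    k: int,
    mode: str = "pre",
) -> list[str]:
    """Return ids of matching records, filtering before ('pre') or after ('post') the top-k cut."""
    def by_similarity(candidates: list[dict]) -> list[dict]:
        return sorted(
            candidates,
            key=lambda r: cosine_similarity(query_vector, r["embedding"]),
            reverse=True,
        )

    if mode == "pre":
        # Filter first, then take the k most similar of what remains.
        return [r["id"] for r in by_similarity([r for r in records if predicate(r)])[:k]]
    if mode == "post":
        # Take the k most similar overall, then drop those that fail the filter.
        return [r["id"] for r in by_similarity(records)[:k] if predicate(r)]
    raise ValueError(f"unknown mode: {mode!r}")


records = [
    {"id": "doc-a", "language": "en", "embedding": [1.0, 0.0]},
    {"id": "doc-b", "language": "fr", "embedding": [0.9, 0.1]},
    {"id": "doc-c", "language": "en", "embedding": [0.0, 1.0]},
]


def is_english(record: dict) -> bool:
    return record["language"] == "en"


query_vector = [1.0, 0.05]
pre_results = filtered_vector_search(query_vector, records, is_english, k=2, mode="pre")
post_results = filtered_vector_search(query_vector, records, is_english, k=2, mode="post")
print(pre_results, post_results)
```

With these vectors, pre-filtering returns two English documents, while post-filtering returns only one, because the French document took a slot in the top 2 before the filter removed it. This trade-off is worth keeping in mind when choosing the filter order.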

Why is Azure vector search better than the standard Azure search method?

Vectors provide a ground-breaking solution to conventional keyword-based search limitations, excelling in understanding the contextual meaning of words and phrases through machine learning models. This advanced approach not only surpasses lexical analysis and individual query term matching but also ensures more relevant results by grasping user intent, even when exact terms are absent. Additionally, the versatility of vector search extends to diverse content types like images and videos, enabling innovative search experiences such as multi-modal and cross-language search in multi-lingual applications.

Our primary objective in utilizing Azure AI search was to ensure our chatbot could provide users with accurate information for any given query. Initially, we retrieved the top 10 documents based on a straightforward keyword frequency match from user queries. However, we encountered an issue where the intended document was not consistently among the top 10 results, and in some cases, not even within the top 100. This led to the chatbot delivering apologies to users, with no flexibility to modify the request query or add parameters for improved results.

This challenge prompted our exploration of the Vector search methodology within Azure AI search. While the overall process resembled our initial approach, it involved a few additional steps. First, we created an index to store the vector encodings of input documents. Subsequently, we extracted and converted data from input documents into vector encodings, storing them in the designated embedding fields within the index. Once these foundational steps were completed, we could initiate queries to retrieve relevant documents.

Our focus was to evaluate how well vector search comprehends contextual queries. To test this, we created a query using “Automatic Examination,” a term closely related to “Automated test,” to see whether the search results improved. In vector search, page 1 consistently ranked in the top 5, while in Azure AI Search with the simple query method it never made it into the top 5 results.

Query = “What automatic examination?”

Figures: text in document; vector search top 5 results; Azure AI Search top 5 results.

We tried many different questions to see whether the results improved. For instance, we created a query using “Forecast energy,” a term closely related to “Estimate emissions.” With the vector search method, page 7 ranked first in the top 5, while in Azure AI Search with the simple query method it ranked fourth.

Query = “forecast energy savings”

Figures: text in document; vector search top 5 results; Azure AI Search top 5 results.
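Comparisons like the two above can be automated with a simple rank check. The sketch below is a hypothetical helper, not the tooling we used: the ranks of `page-7` (first and fourth) mirror the second experiment, while the other page ids in the lists are placeholders.

```python
from typing import Optional


def rank_of(expected_id: str, ranked_ids: list[str]) -> Optional[int]:
    """Return the 1-based position of expected_id in the results, or None if it is absent."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == expected_id:
            return position
    return None


def hit_at_k(expected_id: str, ranked_ids: list[str], k: int = 5) -> bool:
    """True if the expected document appears within the first k results."""
    return rank_of(expected_id, ranked_ids[:k]) is not None


# Placeholder result lists; only the positions of page-7 reflect the experiment above.
vector_results = ["page-7", "page-2", "page-5", "page-1", "page-9"]
keyword_results = ["page-3", "page-2", "page-5", "page-7", "page-1"]

for name, results in [("vector", vector_results), ("keyword", keyword_results)]:
    print(name, rank_of("page-7", results), hit_at_k("page-7", results, k=5))
```

Running the same check over a list of test queries gives a quick way to compare search methods without inspecting each result list by hand.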

Note: We found the vector search and hybrid search results to be almost identical. This is due to the particular data being processed and retrieved; the results may differ for other data.
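For context on how a hybrid query merges its two result lists: Azure AI Search documents Reciprocal Rank Fusion (RRF) as the merging method. The sketch below shows the general RRF formula, assuming the commonly used constant of 60; the page ids and rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists: each list contributes 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up rankings from a keyword query and a vector query over the same index.
keyword_ranking = ["page-4", "page-1", "page-9"]
vector_ranking = ["page-1", "page-7", "page-4"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank well in both lists rise to the top. When the keyword and vector rankings largely agree, the fused order stays close to the vector order, which is one plausible reason hybrid and vector results can look so alike.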

Conclusion

In conclusion, traditional keyword-centric searches are proving inadequate when faced with complex datasets. The utilization of Azure vector search, which encapsulates data and user queries as vectors, introduces a feasible solution for information retrieval that surpasses the limitations of conventional methods.

The two-part process flow of Azure AI Search, data pre-processing followed by retrieval, underpins the method’s effectiveness. Using Form Recognizer and an embedding model, input documents are transformed into vector embeddings and stored in an index. User queries are then converted into vectors and run against the Azure AI Search index, which returns the documents with the highest similarity scores. Azure vector search stands out not only by overcoming the shortcomings of keyword-centric methods but also by working across various content types, offering a more contextually aware search experience. Vector search is available as part of all Azure AI Search tiers at no additional cost.

About the Author
Vishal Padma is an Associate Consultant at Version 1 AI Labs.
