Hybrid Search — Amalgamation of Sparse and Dense vector representations for Active Content Discovery

Maithri Vm
13 min read · May 14, 2023


Hybrid Search — Uniting meaning of data with metadata to leverage deeper context

Robust information retrieval systems are used by almost every app to search efficiently through the millions of products and pieces of content hosted on the platform and surface the most relevant items to users. With numerous competing products, apps take the urgency of capturing users’ attention more seriously than ever before. As a result, content or service discovery has become the core of many businesses, as I discussed with numerous examples in my previous post.

The main focus of this post is active content discovery, enabled by cutting-edge technologies tightly integrated with the product. I will briefly discuss proven strategies for achieving content discovery with various tools and frameworks, while emphasizing the importance of efficiently leveraging context, which is crucial at its core.

Though there is no one-size-fits-all design approach for most software solutions, the choice of frameworks plays a key role in decisions that directly impact the end-user experience. The responsibility for active review and enhancement, leading to incremental updates, lies solely with the product teams. From my past experience working with different systems, I have come across different maturity levels of products when it comes to achieving active content discovery:

  1. Exact search with Traditional IR systems : In conventional IR systems, item metadata is typically stored in a structured database such as SQL, in its original format. The search algorithm resembles querying SQL databases with operations like “LIKE” and “IN”. Though this can serve the purpose, it is rarely performance-efficient and generally fails to account for either the user’s or the content’s context, and so falls short of a rich experience. Besides, retrieval is limited to exact title/topic matches. I will not delve into the details of this approach, as it has already proven obsolete.
  2. Keyword based search with sparse vector search engines : At the next maturity level, a robust search engine can be employed, with data stored in vector form to offer efficient indexing and retrieval. This comes with all the sophistication of advanced schemas involving hierarchical relationships as well as distributed computing. Features like type-ahead suggestions, signal-boosted discovery for relevance ranking, faceting of result categories, spellchecking and MLT make it all the more compelling a choice for product developers.
  3. Semantic search with Vector databases : The core idea of vector databases is to capture the semantic relationships between query terms and content to discern deeper meaning. When both the user query and the content are “deeply understood” as natural language, the system can deliver the items most relevant to the user’s query. Needless to say, this goes miles ahead in achieving relevance. In essence, semantic search impressively caters to “what the user actually meant” over “how exactly the user phrased the query”. This approach also dovetails with natural language interfaces like chatbots and voice AI.

Achieving Contextualisation : Though all the techniques above promise to capture the context of the content, the onus of managing the user’s context is in the hands of the product teams. A standard app analytics engine integrated with the user’s clickstream offers greater personalisation, and can also compute advanced statistics (crowd-sourced from likes and shares) for each item in the repository, supporting cohort context (by demography in consumer apps and by organisational context in enterprise apps).

Let’s get to technical design :

These are some insights that I have gained from my experience while experimenting and evolving search systems with various tools and techniques for our online learning app over the years.

a. Keyword based search / Sparse vector search — Dominant search engines like Solr and Elasticsearch operate by building a Lucene index from the item metadata and offer robust indexing and retrieval. The richer the metadata, the better the discoverability. Under the hood, the Lucene index maintains sparse vector representations of the data in the form of TF-IDF term statistics and searches efficiently across large volumes of data with ranking algorithms like BM25, which matches the terms in the user’s query against the terms in the stored index. While Elasticsearch and Solr have clearly dominated the market over the years, hosted solutions like Pinecone have been making larger waves since the onset of GPT.
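To make the sparse ranking concrete, here is a minimal, illustrative BM25 scorer over a toy corpus in pure Python. It follows the common textbook formula with the usual defaults (k1 = 1.2, b = 0.75); it is a sketch of the idea, not Lucene’s exact implementation:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with BM25 (toy version)."""
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    # document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term never occurs: contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * num / den
        scores.append(score)
    return scores

docs = [
    "machine learning for beginners".split(),
    "advanced cooking techniques".split(),
    "deep learning and machine vision".split(),
]
print(bm25_scores("machine learning".split(), docs))
```

Note how the second document, which shares no terms with the query, scores zero: a sparse ranker can only reward literal term overlap.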

  • Advantages of this approach -

Faster indexing and retrieval, a simple data schema of key/value pairs (including hierarchical relationships), advanced search across multiple fields and, most important of all, signal boosting are some of the many benefits of a search engine. Signal boosting is essential for relevance ranking, as it can dynamically emphasise key attributes of the data as applicable to the context. Search engines also offer various semantic features by harnessing NLP tools built into the engine, e.g. autocompletion / type-ahead suggestions, query expansion and search result faceting. Given the vector representation at the core, computing document-to-document similarity is also possible; this is popularly known as MLT (more like this).

Search engines facilitate leveraging context at various levels. With a basic analytics engine (clickstream tracking) on the platform, apps can infer the context of the content, the context of the user, and the context implied by crowd behaviour, and apply all three to craft relevance ranking that makes the whole search experience more personalised and relevant to each user.

  • Caveats of traditional search engine -

The data representations are sparse! The underlying representation lacks meaning beyond the finite dataset in the index and carries little sense of real-life context. It fails to comprehend the real meaning of the data, since each word/value is treated as a numerical position in that limited vector space. Although search engines attempt to address semantics with various NLP plugins, this is still not comparable to the transformer-based contextualisation that the dense counterpart offers. This is where I argue that the context of the data (both the user’s query and the content) is not fully captured by the sparse vector approach.

b. Modern Semantic search / dense vector search

At the other end of the spectrum from sparse vector representations lie dense vector representations. Popularly known as vector embeddings, these numeric representations successfully capture the deeper meaning of the data: they use a large number of dimensions to capture various aspects of the information with the power of transformers. Transformers have the remarkable ability to encode a given piece of text into a multitude of dimensions; simply put, they represent the text with many different real-life attributes by employing deep learning architectures for NLP.

The ability to capture the semantic relation between the terms helps compare and surface the relevant results even in the absence of exact words matching the query.
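As an illustration, dense retrieval boils down to nearest-neighbour comparison of embedding vectors, typically by cosine similarity. The tiny 4-dimensional vectors below are hand-made stand-ins for real transformer embeddings, which usually have hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 4-d embeddings standing in for real transformer outputs;
# in practice these come from a model such as BERT or a sentence encoder.
emb = {
    "how do I learn data science": [0.9, 0.1, 0.3, 0.0],
    "beginner guide to machine learning": [0.8, 0.2, 0.4, 0.1],
    "best pasta recipes": [0.0, 0.9, 0.0, 0.8],
}

query = emb["how do I learn data science"]
for text, vec in emb.items():
    print(f"{cosine(query, vec):.2f}  {text}")
```

Even though the query and the machine learning course share no words, their vectors point in a similar direction, so the course scores far higher than the unrelated item.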

Whether a product uses generic embeddings (with a world view trained on wiki texts) or domain-specific embeddings (trained on a large body of domain literature, say medical, legal or finance) depends on the business context of the product. To quote an example, the query “apple market trends” may mean something totally different to a finance app than to an e-commerce app. Ultimately, the whole purpose of text embeddings is to furnish deeper contextualisation.

Owing to the computationally intensive indexing and retrieval, there are relatively few engines that offer dense vector search. FAISS, Pinecone, Vespa.ai and Solr 9 are among the dominant options currently available. Most of them accept data encoded with various transformer models, such as BERT, Hugging Face sentence embeddings and OpenAI embeddings, to name a few.

  • Advantages of Semantic Search —

With the ability to comprehend the “meaning” of the data, vector databases can effectively understand user queries expressed in natural language and surface the most relevant results by matching similar intent in the index. This goes miles beyond the sparse vector approach, which mainly relies on keyword/term matching. As discussed earlier, this method also integrates easily with natural language interfaces like chatbots.

Given that embeddings require distinct data representations and demand high computational resources, indexing/retrieval efficiency is crucial, which leads most vector databases to use specialised index structures such as HNSW.
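For intuition, here is the exact brute-force nearest-neighbour search that structures like HNSW are built to approximate. HNSW trades a little recall for sub-linear query time by navigating a layered proximity graph instead of scanning every vector, as this O(n) baseline does:

```python
import math

def knn_brute_force(query, vectors, k=2):
    """Exact k-nearest-neighbour search by scanning every vector (O(n)).
    HNSW-style indexes approximate this result without the full scan."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # rank all vector ids by distance to the query, keep the closest k
    ranked = sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))
    return ranked[:k]

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [5.0, 5.0]]
print(knn_brute_force([0.0, 0.1], vectors))
```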

Since real-world sense is implicit in the vector embeddings, the context of the user’s query is interlaced with the equally rich context of the content! The surfaced results make it apparent to end users that the app understands their questions and context more deeply than ever before. With the advent of Large Language Models (LLMs), it is already evident that semantics is the future: technology is maturing at a fast pace to harness the actual essence of the content and surface it to the user efficiently, with the ability to focus on the micro level of the content.

  • Caveats of dense vector database engine -

Specifically applicable to textual data (least relevant to structured/semi-structured data) : Dense vector embeddings are designed for textual data expressed in natural language. Data expressed in semi-structured format (like metadata in key-value pairs) doesn’t automatically fit into dense embeddings. As mentioned earlier, we worked around this by rendering the important metadata in natural language and appending it to the target text.
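A minimal sketch of that workaround. The field names and phrasing below are our own illustrative choices, not a standard schema; the idea is simply to render key/value metadata as natural-language sentences before embedding:

```python
def metadata_to_text(item: dict) -> str:
    """Flatten key/value metadata into a natural-language string that can
    be appended to the title/description before embedding."""
    parts = [item.get("title", ""), item.get("description", "")]
    if item.get("topics"):
        parts.append("This course covers " + ", ".join(item["topics"]) + ".")
    if item.get("cost") == 0:
        parts.append("It is free.")
    if item.get("certified"):
        parts.append("It offers a certificate on completion.")
    return " ".join(p for p in parts if p)

course = {
    "title": "From Data Mining to Machine Learning",
    "topics": ["data science", "machine learning"],
    "cost": 0,
    "certified": True,
}
print(metadata_to_text(course))
```

The resulting sentence is what gets embedded, so queries like “free certification courses on data science” can now match on cost and certification, not just the title.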

I presume you’re already aware of the way standard search engines operates and affluent with sparse and dense vector representations. I recommend the book — AI Powered Search and James Brigg’s youtube channel to develop larger understanding of search systems and semantic search respectively. (links are given at the end of this post).

After considering the advantages of each approach, it becomes clear that we want the best of both worlds. To accomplish this, it is common to manage semantic embeddings separately from the remaining item metadata. As a result, the vector database containing the semantic embeddings and the store holding the item metadata often have to be handled in isolation, with custom solutions built to draw value from both.

c. Hybrid search, where sparse vectors meet dense vectors to offer a holistic approach

A thorough examination of the advantages and disadvantages of sparse and dense vector search engines makes it evident that remarkable results can be attained by combining the two. This approach, widely recognised as hybrid search, unites the strengths of both the sparse and dense vector approaches, allowing them to work in harmony.

Typically, the content discovery in a hybrid search process is accomplished through a two-step approach -

  1. Candidate Retrieval — This is similar to the goal of recall (how many of the relevant items are fetched). This step aims to retrieve most of the items relevant to the user’s query/context.
  2. Relevance Ranking — This is akin to the objective of precision (how many of the fetched items are relevant). It fine-tunes the retrieved results to match the individual user’s context more accurately.
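The two steps above can be sketched as a small pipeline. The `retrieve` and `rank` functions below are toy stand-ins (simple word overlap) for a real dense retriever and a signal-boosted ranker:

```python
def hybrid_search(query, retrieve, rank, top_k=100, final_k=10):
    """Two-step hybrid search: broad candidate retrieval (recall-oriented)
    followed by relevance ranking (precision-oriented)."""
    candidates = retrieve(query, top_k)   # e.g. dense vector search
    ranked = rank(query, candidates)      # e.g. signal-boosted re-ranking
    return ranked[:final_k]

# Toy stand-ins for the two stages:
CORPUS = {
    1: "intro to data science",
    2: "pasta making",
    3: "machine learning for data engineers",
}

def retrieve(query, k):
    # pretend this is an ANN lookup; here: keep docs sharing any query word
    qs = set(query.split())
    return [i for i, t in CORPUS.items() if qs & set(t.split())][:k]

def rank(query, cands):
    # pretend this is a boosted re-rank; here: sort by word overlap
    qs = set(query.split())
    return sorted(cands, key=lambda i: -len(qs & set(CORPUS[i].split())))

print(hybrid_search("data science courses", retrieve, rank))  # → [1, 3]
```

The key design point is that the ranker only ever sees what the retriever produced, so retrieval must stay deliberately broad.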

Let’s get to the implementation details :

Developers can execute a search pipeline that performs the above steps serially. In an earlier version of our hybrid search engine, we built such a pipeline with Apache Solr, managing the dense vector embeddings with FAISS. The list below highlights the high-level steps involved in the content discovery process, using content discovery in an online learning platform as the example:

  1. Index content metadata in Solr — title, description, topic tags, author, validity and various other meta tags for an online course.
  2. Build a dense vector index on FAISS with BERT representations — The data to embed could be the title and description of the content. We can also explicitly append additional domain-specific contextual attributes if the embedding is generated from off-the-shelf models.
  3. Since the data in the vector db is agnostic to its corresponding metadata, the link between the two has to be established separately, i.e. the embedding id (FAISS id in our case) for each item is stored in Solr along with the rest of its metadata. Alternatively, this mapping can be maintained in any other backend database.
  4. The search process could involve :

a. Query Preprocessing — generate the query embedding using the same transformer model as the one used to build the FAISS index.

Context of the query — The query embedding transforms the user input, expressed in natural language, into a large vector representation in the real-world/business domain. These additional dimensions contribute to the context of the user’s query. For example, “free certification courses on data science” conveys course cost, availability of certification and topic category, but would seldom directly match any course title/description! This is the kind of context that semantics can infer, retrieving courses with titles like “Guide to Understanding and Implementing Data Analytics”, “From Data Mining to Machine Learning” and “Real-World Applications and Case Studies for Data Analytics”.

Explicit context of the user/content — High-level content filters explicitly applied by the user help refine the target candidates, e.g. content format, region/language applicable to the user, relevance of the course to the user’s job role.

b. Candidate Retrieval — This step aims to elicit almost all the items relevant to the explicitly stated context, whether via the query, via filters, or both.

Though this is subjective, we used dense vector retrieval with the query embedding. In my observation, the user’s query captures their main intent, hence the choice to hit the FAISS index for candidate retrieval. This decision is contextual to the app; the developer has to make an informed choice between retrieving from the vector db or the search engine, as it depends entirely on the data architecture.

c. Relevance Ranking — This step goes a level further to refine the results by personalising them to each user. Accepting the output of the retriever, it mainly accounts for the user’s implicit context.

Implicit context of the user and content — This covers the user’s preferences that the app infers from previous interactions. This is where an analytics engine helps, by drawing signals from the content a user liked in the past, commented on, shared with the community, or engaged with even more implicitly, such as enrolling in a course and taking related actions.

Additional context can be mined from cues like the user’s favourite topics and subtopics (which can tell data engineering apart from data visualisation), affinity to particular course sources, preferred content format, length of courses, etc.

Once these implicit signals are established, the search algorithm can convert them into sophisticated Solr queries for relevance ranking. Signal boosting with the edismax query parser yields great benefit, with control over the weighting scores. The ranking algorithm then refines the results precisely, based on relevance scores computed from the signal weights. Most importantly, this Solr operation must be performed on the candidate results fetched in the previous step, i.e. the FAISS ids of the documents returned by the retriever.
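A sketch of what such a re-ranking request might look like, expressed as Solr query parameters. `defType`, `fq` and `bq` are standard edismax-related parameters, but the field names (`faiss_id`, `topic`) and boost weights are illustrative assumptions from our own schema, not Solr defaults:

```python
def build_rerank_query(user_topics, faiss_ids):
    """Build Solr edismax params that re-rank only the candidate documents
    returned by the retriever. Field names are illustrative, not a fixed schema."""
    # restrict scoring to the retriever's candidates
    fq = "faiss_id:(" + " OR ".join(str(i) for i in faiss_ids) + ")"
    # boost documents matching the user's implicit topic preferences
    bq = " ".join(f"topic:{t}^{w}" for t, w in user_topics.items())
    return {
        "defType": "edismax",
        "q": "*:*",
        "fq": fq,
        "bq": bq,
        "rows": 10,
    }

params = build_rerank_query({"data-science": 3.0, "python": 1.5}, [42, 7, 913])
print(params["fq"])  # faiss_id:(42 OR 7 OR 913)
print(params["bq"])
```

The `fq` clause is what ties the two engines together: Solr never scores anything outside the FAISS candidate set.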

d. Post-processing — Data transformations as expected by the APIs, like attaching thumbnails and applying pagination. I won’t go into detail here, since this is very specific to the app and also straightforward.

e. Test and Evaluate — This hybrid/augmented search yielded phenomenal improvements to the quality of the search results. There are various evaluation methods and metrics discussed in the book mentioned above. In the case of our app,

  • The intuitiveness of the results was evident, as users explored the search results more. A systematic way to measure this is to use search analytics to quantify user engagement with search results.
  • By analysing the logs, we could observe that users were actively rephrasing their search queries and exploring the results within a session. This clearly indicates that end users recognised the value of the contextual information carried by their search queries.

This impressive feat convinced us that a few unavoidable overheads (listed below) were really worth the effort. Indeed, there were non-ignorable challenges in dealing with dual systems to manage sparse and dense vector engines:

  1. Custom scoring algorithm — One key observation with hybrid search was that semantic search outperformed traditional keyword search on longer queries. Moreover, because retrieval and ranking happened on different engines, the retriever’s confidence score was not on a comparable scale for the re-ranker to consider. This led us to configure dynamic weights for the content filters (applied to the candidate results from the retriever) based on query length, making the final Solr query a lot more complex! We ended up designing a custom scoring algorithm that accounts for query length, retrieval score and Solr ranking score to yield the final score.
  2. Dual search engine overhead — There was increased latency for new content indexing as well as ongoing content updates, since the content metadata and the embedding had to be indexed to Solr and FAISS respectively, and the faiss_id then linked back to the corresponding record in Solr.
  3. Maintenance overhead — Syncing data between two disparate systems needed extra care to avoid inconsistencies and orphan records. There was an additional caveat in our case with older versions of Solr (7.x): atomic updates to Solr records with nested relationships were not well optimised in Solr 7. If we had to rebuild the FAISS embedding for a given piece of content with a new faiss_id, we had to reconstruct the Solr document and reindex it. Fortunately, this issue has been resolved in later versions of Solr.
  4. Search performance overhead — We observed that the FAISS index was resource-intensive, and queries took longer than usual after node restarts.
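The custom scoring idea from point 1 can be sketched as a simple blend. The constants and the query-length heuristic below are illustrative, not our production values, and both input scores are assumed to be normalised to [0, 1]:

```python
def final_score(query: str, retrieval_score: float, solr_score: float) -> float:
    """Blend the dense retriever's score with the Solr ranking score.

    Longer natural-language queries carry more semantic context, so the
    dense side gets more weight. The 0.3 / 0.1 / 0.8 constants are
    illustrative and would be tuned per app; both inputs are assumed
    normalised to [0, 1].
    """
    q_len = len(query.split())
    alpha = min(0.3 + 0.1 * q_len, 0.8)  # semantic weight grows with query length
    return alpha * retrieval_score + (1 - alpha) * solr_score

# A one-word keyword query leans on Solr; a long natural query leans on FAISS.
print(final_score("python", 0.9, 0.5))
print(final_score("free certification courses on data science", 0.9, 0.5))
```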

Though we managed to work around the situations listed above, we continued to mature and evolve the search solution. We were even more excited to discover the latest hybrid search abilities of Solr 9, which accommodates dense vector embeddings and lets us craft an elegant search solution leveraging the best of both worlds! Pinecone is an alternative framework with similar capabilities, though at the moment it is available only as a hosted solution.

Stay tuned for the next instalment of this series, where we will uncover the powerful synergy created by combining sparse and dense vector approaches with Solr 9. Don’t miss how we met all the stated objectives while alleviating the inherent overheads.

💬 I invite you to participate by sharing your thoughts, experiences, and questions in the comments section below the blog post. Have you encountered any challenges or successes with implementing Hybrid Search? I’m excited to hear your perspectives and engage in a lively discussion with fellow readers.


Maithri Vm

Passionate thinker, critical mind and Gen AI & NLP practitioner