Exploring Semantic Search Using Embeddings and Vector Databases with some popular Use Cases

Pankaj Pandey
6 min read · Aug 10, 2023



In the world of modern information retrieval, the way we search for and discover relevant information has evolved significantly. Traditional keyword-based searches often struggle to capture the nuances of language and context, leading to less accurate results. This is where semantic search, powered by embeddings and vector databases, comes into play. In this blog post, we will delve into the concept of semantic search, explore the importance of embeddings and vector databases, and highlight some of the best use cases with descriptive examples.

Understanding Semantic Search:

Semantic search aims to understand the meaning behind words and phrases, allowing search engines to provide more contextually relevant results. Unlike traditional keyword-based searches that rely on exact matches, semantic search takes into account the relationship between words, their contextual significance, and even the intent behind the query. This is achieved through the use of embeddings and vector databases.

Embeddings:

Embeddings are numerical representations of words or phrases in a continuous vector space. These representations capture semantic relationships between words by placing similar words closer together in the vector space. Popular techniques like Word2Vec, GloVe, and FastText generate such embeddings through training on large text corpora. For example, in a word embedding space, words like ‘king’ and ‘queen’ would be positioned close to each other due to their semantic relationship.
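To make this concrete, the sketch below uses a toy embedding table (the vectors are illustrative placeholders, not trained values) and measures closeness with cosine similarity, the same metric used in the examples later in this post. Words with related meanings get similar vectors, so their cosine similarity is high:

```python
import numpy as np

# Toy 3-dimensional embeddings (illustrative placeholders, not trained values).
# In practice these would come from a model such as Word2Vec, GloVe or FastText.
embeddings = {
    'king':   np.array([0.8, 0.6, 0.1]),
    'queen':  np.array([0.7, 0.7, 0.1]),
    'banana': np.array([0.1, -0.2, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 for vectors pointing the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings['king'], embeddings['queen']))   # high: related words
print(cosine(embeddings['king'], embeddings['banana']))  # low: unrelated words
```

With real trained embeddings the same comparison works in hundreds of dimensions, but the principle is identical.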

Vector Databases:

Vector databases, also known as similarity search databases, are designed to efficiently store and query vector representations. These databases can rapidly identify similar vectors, making them ideal for semantic search tasks. Instead of exhaustively comparing the query against every stored item, vector databases use specialized indexes (often approximate nearest-neighbor structures) to retrieve only the vectors most similar to the query vector. This significantly reduces the computational load and speeds up the search process.
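Conceptually, a vector database stores embeddings alongside their ids and answers "give me the k nearest neighbors of this query vector". The brute-force sketch below (pure NumPy, with made-up placeholder vectors) shows that retrieval step; production systems such as FAISS accelerate exactly this operation with approximate nearest-neighbor indexes:

```python
import numpy as np

# A tiny in-memory "vector store": ids alongside their embeddings.
# The vectors are made-up placeholders standing in for real model output.
ids = ['doc_a', 'doc_b', 'doc_c', 'doc_d']
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])

def search(query, k=2):
    # Normalize rows so that a dot product equals cosine similarity.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = np.asarray(query) / np.linalg.norm(query)
    scores = v @ q                       # one similarity score per stored vector
    top = np.argsort(scores)[::-1][:k]   # indices of the k best matches
    return [(ids[i], float(scores[i])) for i in top]

print(search([0.85, 0.15, 0.05]))  # nearest neighbors of the query
```

A real vector database replaces the linear scan in `search` with an index, so the query cost grows far slower than the number of stored vectors.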

Key Use Cases of Semantic Search Using Embeddings and Vector Databases:

1. E-Commerce Recommendations:

Semantic search enhances product recommendations by understanding customer preferences beyond direct keyword matches. By analyzing previous purchases and browsing behavior, e-commerce platforms can recommend items that align with the user’s style and taste. For instance, a customer searching for “comfortable running shoes” might receive recommendations based on the semantic context of their query.

from sklearn.metrics.pairwise import cosine_similarity

# Sample product embeddings (already trained)
product_embeddings = {
    'running shoes': [0.8, 0.6, -0.2],
    'comfortable shoes': [0.7, 0.5, -0.3],
    'sneakers': [0.6, 0.7, -0.1]
}

user_query = 'comfortable running shoes'
query_embedding = [0.75, 0.6, -0.25]  # Generated using the same embedding method

similar_products = []
for product, embedding in product_embeddings.items():
    similarity = cosine_similarity([query_embedding], [embedding])[0][0]
    similar_products.append((product, similarity))

similar_products.sort(key=lambda x: x[1], reverse=True)
recommended_products = [product for product, _ in similar_products]

print("Recommended products:", recommended_products)

In this example, we explore how semantic search enhances e-commerce recommendations. The code snippet demonstrates how product embeddings, which are numerical representations capturing semantic relationships between products, can be used to recommend items based on user queries. The code calculates cosine similarity between the query’s embedding and the embeddings of available products. The result is a list of recommended products that closely align with the user’s query, even if the exact keywords don’t match. This approach enables more personalized and accurate product recommendations, leading to increased customer satisfaction and sales.

2. Content Discovery:

Media platforms can leverage semantic search to help users discover relevant articles, videos, or music. By analyzing the content they’ve engaged with previously, the platform can suggest related content with similar themes or subject matter. For instance, a user reading about space exploration might be directed to articles about astronomy, rocket technology, and cosmic discoveries.

from sklearn.metrics.pairwise import cosine_similarity

# Sample content embeddings (already trained)
content_embeddings = {
    'space exploration': [0.8, -0.3, 0.6],
    'astronomy': [0.7, -0.2, 0.8],
    'rocket technology': [0.6, -0.1, 0.7]
}

user_interests = ['space exploration', 'cosmic discoveries']
# Aggregated embedding of the user's interests, generated using the same embedding method
user_embedding = [0.75, -0.25, 0.7]

related_content = []
for content, embedding in content_embeddings.items():
    similarity = cosine_similarity([user_embedding], [embedding])[0][0]
    related_content.append((content, similarity))

related_content.sort(key=lambda x: x[1], reverse=True)
suggested_content = [content for content, _ in related_content]

print("Suggested content:", suggested_content)

This example showcases the application of semantic search for content discovery on media platforms. The provided code snippet illustrates how content embeddings can be employed to suggest related content to users. By calculating cosine similarity between the user’s interests and the embeddings of available content, the system identifies content that matches the user’s preferences, even if the content doesn’t contain the exact same keywords. This approach enhances user engagement by offering them content that aligns with their preferences and interests, expanding their horizons and keeping them engaged on the platform.

3. Enterprise Search:

In large organizations, employees often struggle to find specific documents or information. Semantic search improves this process by understanding the context of queries and returning documents related to the intended topic. A search for “project deadlines” could yield documents discussing timelines, task assignments, and completion dates.

from sklearn.metrics.pairwise import cosine_similarity

# Sample document embeddings (already trained)
document_embeddings = {
    'project deadlines': [0.6, 0.7, -0.2],
    'task assignments': [0.5, 0.6, -0.3],
    'completion dates': [0.4, 0.8, -0.1]
}

user_query = 'project timelines and deadlines'
query_embedding = [0.55, 0.65, -0.25]  # Generated using the same embedding method

relevant_documents = []
for document, embedding in document_embeddings.items():
    similarity = cosine_similarity([query_embedding], [embedding])[0][0]
    relevant_documents.append((document, similarity))

relevant_documents.sort(key=lambda x: x[1], reverse=True)
matching_documents = [document for document, _ in relevant_documents]

print("Matching documents:", matching_documents)

In this scenario, the focus is on improving enterprise search within large organizations. The code snippet demonstrates how semantic search can assist employees in finding relevant documents and information. By generating embeddings for user queries and comparing them to the embeddings of available documents, the system retrieves documents related to the user’s intent. This approach reduces the effort required to sift through numerous documents manually and ensures that employees can quickly access the information they need to perform their tasks effectively.

4. Medical Diagnosis:

In the medical field, semantic search assists doctors in diagnosing diseases and finding relevant research. By analyzing patient symptoms, medical history, and the latest scientific literature, doctors can make more accurate diagnoses and treatment recommendations. For instance, a doctor researching treatment options for a rare condition can quickly access related studies and cases.

from sklearn.metrics.pairwise import cosine_similarity

# Sample medical condition embeddings (already trained)
condition_embeddings = {
    'diabetes': [0.7, 0.6, 0.5],
    'heart disease': [0.6, 0.7, 0.4],
    'asthma': [0.5, 0.4, 0.6]
}

user_symptoms = ['high blood sugar', 'fatigue']
# Aggregated embedding of the reported symptoms, generated using the same embedding method
user_embedding = [0.65, 0.55, 0.5]

relevant_conditions = []
for condition, embedding in condition_embeddings.items():
    similarity = cosine_similarity([user_embedding], [embedding])[0][0]
    relevant_conditions.append((condition, similarity))

relevant_conditions.sort(key=lambda x: x[1], reverse=True)
potential_conditions = [condition for condition, _ in relevant_conditions]

print("Potential conditions:", potential_conditions)

This example highlights how semantic search can aid medical professionals in diagnosing diseases and finding relevant research. The code snippet showcases the utilization of condition embeddings to identify potential medical conditions based on patient symptoms. By comparing the embeddings of user-reported symptoms with those of known medical conditions, the system suggests potential diagnoses that closely match the patient’s situation. This approach empowers doctors with additional insights and relevant information to make informed medical decisions and recommend appropriate treatments.

5. Legal Research:

Legal professionals often require comprehensive research on specific legal cases or precedents. Semantic search helps lawyers find relevant legal documents and cases by understanding the context of their queries. Searching for “landlord responsibilities in property disputes” could lead to documents discussing relevant laws, regulations, and case outcomes.

from sklearn.metrics.pairwise import cosine_similarity

# Sample legal topic embeddings (already trained)
legal_topic_embeddings = {
    'landlord responsibilities': [0.7, 0.5, -0.2],
    'property disputes': [0.6, 0.6, -0.3],
    'tenant rights': [0.5, 0.4, -0.1]
}

user_query = 'legal aspects of property disputes'
query_embedding = [0.65, 0.55, -0.25]  # Generated using the same embedding method

relevant_topics = []
for topic, embedding in legal_topic_embeddings.items():
    similarity = cosine_similarity([query_embedding], [embedding])[0][0]
    relevant_topics.append((topic, similarity))

relevant_topics.sort(key=lambda x: x[1], reverse=True)
matching_topics = [topic for topic, _ in relevant_topics]

print("Matching legal topics:", matching_topics)

In this use case, semantic search is applied to legal research. The provided code snippet illustrates how legal topic embeddings can assist legal professionals in finding relevant legal documents and information. By generating embeddings for user queries related to legal topics and comparing them with the embeddings of available legal topics, the system retrieves documents and resources that pertain to the user’s query. This approach streamlines legal research processes by ensuring that lawyers can quickly access cases, laws, and regulations relevant to their cases, ultimately supporting their decision-making and argumentation.

Conclusion:

Semantic search, powered by embeddings and vector databases, has revolutionized the way we search for information. By moving beyond keywords and focusing on context and relationships, it delivers more accurate and relevant results. The examples above collectively showcase its versatility across e-commerce, content discovery, enterprise search, medicine, and law, enabling more accurate, context-aware, and personalized information retrieval.

Please note that the above code snippets are simplified demonstrations of how semantic search using embeddings and vector databases could work. In a real-world scenario, you would use well-established libraries such as gensim for generating embeddings and faiss for vector similarity search at scale. Also, the example embeddings are illustrative placeholders and should be derived from a model trained on actual data.


Pankaj Pandey

Expert in software technologies with proficiency in multiple languages, experienced in Generative AI, NLP, Bigdata, and application development.