Hello LLM: Building a Semantic Search Engine with ChromaDB

Dagang Wei
Feb 13, 2024


Image generated with DALL-E

This article is part of the series Hello LLM.

Introduction

In the realm of artificial intelligence (AI), data representation plays a pivotal role. Gone are the days when we relied solely on keywords and simple numerical data. Today, the concept of embeddings is revolutionizing the way AI systems understand and process information. Let’s explore the world of embeddings and how they lead us to the powerful concept of vector databases.

What are Embeddings?

Imagine trying to teach a computer the meaning of the word “love.” It’s a complex concept, isn’t it? Embeddings provide a solution. An embedding is a numerical representation of a word, phrase, image, or even a whole piece of code. These numerical representations, or vectors, carry the essence of the original data point within them. In the case of “love,” the vector might capture elements of relationships, emotion, and positive sentiment. This vector sits within a vast, multi-dimensional space. The magic lies in this space:

  • Proximity Means Similarity: Words with similar meanings or associations — “affection,” “adoration,” “devotion” — will have embedding vectors that reside close to the vector for “love”.
  • Context Matters: Depending on the training data, embeddings can pick up on different shades of love — romantic, familial, platonic. Words and concepts associated with that type of love will cluster near it.
  • Beyond Synonyms: Embeddings aren’t just about direct synonyms. Words that evoke the feelings or actions tied to love, like “cherish” or “protect,” might also be situated nearby.

Embeddings give AI systems the ability to calculate similarity. With vectors in hand, computers can determine which concepts are more closely related by measuring the distance between their corresponding vectors. This opens up incredible possibilities across many AI applications.
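
To make this concrete, here is a minimal sketch of measuring similarity between word embeddings, assuming the sentence-transformers package is installed and using the all-MiniLM-L6-v2 model (the same model that appears in the ChromaDB example later in this article). The words and scores are illustrative; exact numbers depend on the model.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

words = ["love", "affection", "devotion", "spreadsheet"]
vectors = model.encode(words)  # one embedding vector per word

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for word, vector in zip(words[1:], vectors[1:]):
    print(f"similarity(love, {word}) = {cosine_similarity(vectors[0], vector):.3f}")

In practice, “affection” and “devotion” should score noticeably higher against “love” than an unrelated word like “spreadsheet” does.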

Embeddings in Action

Here’s how embeddings are transforming the world of AI:

  • Search Engines: Ever notice how search engines understand your intent beyond specific keywords? Embeddings power this change, helping retrieve results that are thematically related to your query even if they don’t contain the exact same words.
  • Recommendation Systems: Recommendation systems in streaming platforms and online stores analyze your preferences with the help of embeddings. Products, movies, and music with similar embedding vectors are likely to be suggested to you.
  • Image Recognition: Ever notice how your phone can group your photos by person? Embedding-based similarity measurements let AI models recognize similarities between faces, objects, and scenes within images.
  • Chatbots: Modern chatbots understand the nuances of human language due in part to embeddings. This allows them to go beyond rigid word matching and understand the underlying meaning of what you’re saying.

Introducing Vector Databases

As AI applications rely more heavily on embeddings, traditional databases start to fall short. Here’s where vector databases swoop in to save the day! Vector databases are purpose-built to:

  • Store Embeddings: They are designed to efficiently store and manage massive collections of embedding vectors.
  • Search by Similarity: The magic of vector databases lies in their ability to perform blazing-fast similarity searches. Think of finding the most similar images in a collection of millions, or identifying relevant documents based on a conceptual query.

Why Vector Databases Matter

Vector databases are the key to unlocking the full potential of embedding-powered AI applications:

  • Scalability: Vector databases handle massive collections of embeddings with ease, something traditional databases struggle to do.
  • Speed: Fast similarity searches allow for real-time recommendations, fraud detection, and other complex use cases.
  • Unlocking New Possibilities: The ease of similarity-based operations in vector databases enables entirely new AI applications that were previously impractical.

How Vector Databases Work

Let’s dive deeper into the mechanics of how vector databases work, focusing on indexing, querying, and the technologies that make them efficient and scalable.

Indexing in Vector Databases

The primary challenge that vector databases address is the efficient indexing of high-dimensional data. Traditional indexing methods, effective for structured data like integers and strings, falter with the high dimensionality and unstructured nature of vector data. To overcome this, vector databases employ specialized indexing strategies:

  • Tree-based Indexing: Techniques such as KD-trees and Ball trees partition the vector space into regions, organizing vectors in a way that reflects their spatial distribution. This structure allows the database to quickly eliminate large portions of the dataset that are unlikely to contain the query’s nearest neighbors.
  • Hashing-based Indexing: Locality-sensitive hashing (LSH) is another popular method where vectors are hashed in such a way that similar items are more likely to be placed into the same “buckets.” This reduces the dimensionality of the problem by limiting searches to relevant buckets (a toy version is sketched after this list).
  • Quantization: Methods like product quantization divide the vector space into a finite number of regions, each represented by a centroid vector. Vectors are then approximated by their nearest centroid, significantly reducing storage requirements and speeding up distance computations.
  • Graph-based Indexing: Some vector databases use navigable small world (NSW) graphs or hierarchical navigable small world (HNSW) graphs, where vectors are nodes in a graph. These methods ensure that each node is linked to its nearest neighbors, facilitating efficient traversal during queries.
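
As a rough illustration of the hashing-based approach, here is a toy version of random-hyperplane LSH in plain NumPy. It is a didactic sketch, not the implementation any particular vector database uses; the dimensions and vectors are made up for the example.

import numpy as np

rng = np.random.default_rng(42)
dim, num_planes = 8, 4

# Each random hyperplane contributes one bit to the hash key.
planes = rng.normal(size=(num_planes, dim))

def lsh_bucket(vector):
    # The sign pattern of the projections is the bucket key; similar vectors
    # tend to land on the same side of most hyperplanes, hence in the same bucket.
    bits = (planes @ vector) > 0
    return "".join("1" if b else "0" for b in bits)

base = rng.normal(size=dim)
similar = base + 0.05 * rng.normal(size=dim)  # a small perturbation of base
unrelated = rng.normal(size=dim)              # an independent random vector

print(lsh_bucket(base), lsh_bucket(similar), lsh_bucket(unrelated))

At query time, only the vectors whose bucket key matches (or nearly matches) the query’s key need to be compared, which is what “limiting searches to relevant buckets” means in practice.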

Querying Process

The querying process in vector databases is designed to efficiently find the “nearest neighbors” to a given query vector, which are the vectors in the database that are most similar to the query. This process involves:

  • Distance Metrics: The similarity between vectors is typically measured using distance metrics such as cosine similarity or Euclidean distance. The choice of metric depends on the specific application and the nature of the data.
  • Search Algorithms: Depending on the indexing method, different algorithms are used to traverse the index and identify the nearest neighbors. For tree-based methods, this might involve traversing down the tree, while graph-based methods involve navigating the graph’s nodes.
  • Approximate Nearest Neighbor (ANN) Searches: Given the computational expense of exact nearest neighbor searches in high-dimensional spaces, vector databases often resort to ANN searches. These searches sacrifice a degree of accuracy for significant improvements in speed and resource efficiency, providing “good enough” results much faster. The exact, brute-force search that ANN methods approximate is sketched after this list.
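
For contrast, here is a brute-force nearest-neighbor search: score every stored vector against the query with cosine similarity and sort. The data is random and purely illustrative; real vector databases avoid this kind of full scan.

import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 32))  # 1,000 stored vectors, 32 dimensions each
query = rng.normal(size=32)

# Cosine similarity between the query and every stored vector.
scores = database @ query / (np.linalg.norm(database, axis=1) * np.linalg.norm(query))

top_k = np.argsort(-scores)[:3]  # indices of the 3 most similar vectors
print(top_k, scores[top_k])

This linear scan is fine for a thousand vectors but becomes prohibitively slow at billions, which is exactly the gap the indexing structures above are designed to close.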

Performance and Scalability

Vector databases are designed to be highly scalable, capable of handling billions of vectors. This scalability is achieved through efficient indexing, parallel processing, and distributed architectures that can spread the workload across multiple machines or nodes in a cluster. Advanced vector databases also employ techniques like dynamic indexing, where the index is continuously optimized based on the queries it receives, further enhancing performance.

The efficiency of vector databases in handling high-dimensional vector data fundamentally changes how we approach storing and querying unstructured data in AI applications. By employing sophisticated indexing methods and optimized querying processes, these databases enable fast, accurate, and scalable searches across vast datasets. This capability is indispensable in a world where the volume and complexity of data continue to grow exponentially, powering everything from personalized recommendations to advanced natural language understanding and beyond.

Example: Semantic Search Engine with ChromaDB

Now, let’s build a semantic search engine with ChromaDB. The code is available in this Colab notebook.

import chromadb
from chromadb.utils import embedding_functions

# --- Set up variables ---
CHROMA_DATA_PATH = "chromadb_data/" # Path where ChromaDB will store data
EMBED_MODEL = "all-MiniLM-L6-v2" # Name of the pre-trained embedding model
COLLECTION_NAME = "demo_docs" # Name for our document collection

# --- Connect to ChromaDB ---
import os
os.environ["ALLOW_RESET"] = "TRUE" # Enable resetting the ChromaDB client
client = chromadb.PersistentClient(path=CHROMA_DATA_PATH) # Create a ChromaDB client
# Clean up the DB, only for testing to avoid warnings when reinserting docs.
client.reset()

# --- Set up embedding function ---
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)  # Use a Sentence Transformer model for generating embeddings

# --- Create (or retrieve) the collection ---
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,  # Assign the embedding function to the collection
    metadata={"hnsw:space": "cosine"}   # Use cosine distance for the HNSW similarity index
)

# --- Prepare documents for storage ---
documents = [
    "The latest iPhone model comes with impressive features and a powerful camera.",
    "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
    "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
    "The American Revolution had a profound impact on the birth of the United States as a nation.",
    "Regular exercise and a balanced diet are essential for maintaining good physical health.",
    "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
    "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
    "Startup companies often face challenges in securing funding and scaling their operations.",
    "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
]

genres = [
    "technology",
    "travel",
    "science",
    "food",
    "history",
    "fitness",
    "art",
    "climate change",
    "business",
    "music",
]

# --- Add documents to the ChromaDB collection ---
collection.add(
    documents=documents,
    ids=[f"id{i}" for i in range(len(documents))],  # Generate unique document IDs
    metadatas=[{"genre": g} for g in genres]        # Associate genre metadata with each document
)

# --- Perform a search using a query ---
q1 = "Find me some delicious food!"
q2 = "I am looking to buy a new Phone."
queries = [q1, q2]
query_results = collection.query(
    query_texts=queries,
    n_results=2,  # Retrieve the top 2 results
)

# --- Print the results ---
for i, q in enumerate(queries):
    print(f'Query: {q}')
    print('Results:')
    for j, doc in enumerate(query_results['documents'][i]):
        print(f'{j+1}. {doc}')
    print('\n')
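
If everything is wired up correctly, the food query should surface the Italian pizza document and the phone query should surface the iPhone document as their top results, even though neither query shares its key terms with the matching document. That is the “semantic” part of semantic search: matches are found by embedding proximity rather than keyword overlap.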

Conclusion

Embeddings and vector databases stand at the forefront of a revolution in how AI systems understand and interact with the world. By moving beyond literal keywords to encode meaning and relationships, embeddings unlock powerful similarity-based analysis. With vector databases built for storing and managing vast collections of embeddings, we’re only beginning to imagine the possibilities. From smarter search engines to AI that understands the underlying intent of our words, this technology will push the boundaries of what machines can comprehend.
