ChromaDB Vector Embeddings Powered Smart Search
A unified search engine that searches across my textual (blog) and visual (YouTube) content.
I built a smart search feature (code) for my website using ChromaDB that seamlessly searches for content across all my blog posts and YouTube videos.
When someone visits my website wanting to learn some programming concept — let’s say Google Analytics — I want to serve them both blogs and videos I created on the topic.
Using a strict text-based search functionality, like the ones offered by WordPress, I cannot do that.
Now I can with this custom solution. In this blog post, I will tell you exactly how.
Two Phases
I will break down the solution into 2 distinct pieces:
- Ingesting Content into ChromaDB Offline
- Serving Search Results Online
Whenever I publish a new blog post or upload a new video, I want to ingest the content into my ChromaDB database that houses the underlying vector embeddings, as well as other metadata (title, thumbnail, time_created, etc).
When a user comes to my website and searches for something like mysql, I will have an online API endpoint serve recommended content in a friendly UI.
Let’s start with the first part.
ChromaDB Vector Embeddings
ChromaDB is one of the best open-source search and retrieval databases for AI applications.
It can do full text searches, metadata searches, and the best part — I can store everything locally.
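As a rough illustration of those two search modes, the sketch below builds the arguments for a ChromaDB `query` call that combines a metadata filter (`where`) with a full-text filter (`where_document`). The helper name `build_filtered_query` is hypothetical, and the `content_category_id` field is borrowed from the sample response later in this post; whether it is stored as collection metadata is an assumption.

```python
def build_filtered_query(query_string, category_id=None, must_contain=None, n_results=5):
    """Build keyword arguments for Collection.query (hypothetical helper)."""
    kwargs = {"query_texts": [query_string], "n_results": n_results}
    if category_id is not None:
        # Metadata search: keep only items whose metadata field equals this value.
        kwargs["where"] = {"content_category_id": category_id}
    if must_contain is not None:
        # Full-text search: keep only items whose stored document contains this substring.
        kwargs["where_document"] = {"$contains": must_contain}
    return kwargs


# Usage against a real collection (requires the chromadb package):
# results = collection.query(**build_filtered_query("indexing", category_id=1, must_contain="MySQL"))
```

Both filters narrow the candidate set; the vector search then ranks whatever is left.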
For starters, here’s the system architecture.
There’s a nightly batch job that processes all new blog posts and YouTube videos. What does "process" mean exactly?
For every piece of content, the workflow is very similar:
- Convert to text (for videos, I use the transcript)
- Format metadata (title, thumbnail_url, time_created, description, etc)
- Generate vector embeddings using OpenAI (you can do the same using a local embedding model)
- Store embedding + metadata in ChromaDB
ChromaDB utility functions make the process very smooth. You can do all four steps with very few lines of code.
This offline process ensures that every night the database is up to date with vector embeddings and metadata for every piece of content I have ever uploaded.
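The four steps above could be sketched roughly as follows. This is not the code from my repository: the `ContentItem` class, the metadata key names, the `./chroma_data` path, the `content` collection name, and the `text-embedding-3-small` model are all assumptions made for illustration. The ChromaDB calls used (`PersistentClient`, `get_or_create_collection`, `OpenAIEmbeddingFunction`, `upsert`) are part of its Python client.

```python
from dataclasses import dataclass


@dataclass
class ContentItem:
    """One blog post or video, already converted to text (hypothetical shape)."""
    content_id: str           # stable ID, e.g. a post slug or a YouTube video ID
    text: str                 # blog body or video transcript
    title: str
    url: str
    thumbnail: str
    description: str
    time_created: str         # ISO-8601 timestamp kept as a string
    content_category_id: int  # numeric content type, as seen in the sample response


def to_chroma_batch(items):
    """Split items into the parallel lists that Collection.upsert expects."""
    return {
        "ids": [item.content_id for item in items],
        "documents": [item.text for item in items],
        "metadatas": [
            {
                "title": item.title,
                "url": item.url,
                "thumbnail": item.thumbnail,
                "description": item.description,
                "time_created": item.time_created,
                "content_category_id": item.content_category_id,
            }
            for item in items
        ],
    }


def ingest(collection, items):
    """Upsert items so that re-running the nightly job does not duplicate them."""
    if not items:
        return 0
    # The collection's embedding function turns each document into a vector.
    collection.upsert(**to_chroma_batch(items))
    return len(items)


def open_collection(path="./chroma_data", name="content"):
    """Open a local, persistent collection that embeds text with OpenAI."""
    # Imported here so the helpers above work without chromadb installed.
    import os
    import chromadb
    from chromadb.utils import embedding_functions

    embedder = embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"],
        model_name="text-embedding-3-small",  # assumed model choice
    )
    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name, embedding_function=embedder)
```

Using `upsert` rather than `add` is a design choice for this sketch: if the job sees the same ID twice, the existing entry is updated instead of raising a duplicate error.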
Serving Search Results to Users
Let’s say a user comes to my website and searches for MySQL Optimizations.
Ideally, I want to show them the following:
- Any **blog post** that talks about MySQL Optimizations
- Any **YouTube video** that talks about MySQL Optimizations
There are two key factors to consider:
- I want to do a semantic search, not a text-match search. In other words, I want it to take into account the meaning and theme of each piece of content, not only search for the strings mysql or optimizations.
- I want to search across both my written and visual content.
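To make the difference concrete, here is a small sketch that compares a substring match with a nearest-vector ranking. The three-dimensional vectors are hand-picked toy values, not real embeddings, the second title is made up, and cosine distance is just one common metric (the one ChromaDB uses depends on how the collection is configured).

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means closer in meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings", invented purely for illustration.
catalog = {
    "Relational Database Indexing": [0.9, 0.1, 0.0],
    "Google Analytics Basics": [0.0, 0.2, 0.9],  # hypothetical title
}
query_text = "mysql optimizations"
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the query

# Text-match search: a title matches only if it contains one of the query words.
text_matches = [
    title for title in catalog
    if any(word in title.lower() for word in query_text.split())
]

# Semantic search: rank every title by distance to the query vector.
ranked = sorted(catalog, key=lambda title: cosine_distance(query_vector, catalog[title]))

print(text_matches)  # no title contains "mysql" or "optimizations"
print(ranked[0])     # the indexing video is still the closest match
```

The text match finds nothing because neither word appears in a title, while the vector ranking still surfaces the database video.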
Here’s a visual version of how I display the search results.
Here are the results:
You can see that I am presenting to the user both my blog posts and YouTube videos.
Under the hood, here’s how the system is designed:
I have an API route that acts as the entry-point for the search functionality.
Request:
API GET
/recommendation?query=<mysql>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Response:
{
  "data": [
    {
      "title": "Row vs Column Databases | System Design | Databases",
      "url": "https://www.youtube.com/watch?v=57bPWExdFes",
      "description": "In this video, we go through the two distinctly different types of databases:\n\n1. Row Oriented DB (MySQL, Postgres)\n2. Column Oriented DB (Redshift, Snowflake)\n\nWe go over how data is stored in disk in both these databases and what kind of queries are suited for each of them. \n\nContact me at irtizahafiz9@gmail.com\nPersonal website: https://irtizahafiz.com\nInstagram: https://www.instagram.com/irtiza.hafiz/",
      "thumbnail": "https://i.ytimg.com/vi/57bPWExdFes/hqdefault.jpg",
      "content_category_id": 1,
      "score": 1.2814242839813232
    },
    {
      "title": "Relational Database Indexing | Simple, Compound, B-Trees",
      "url": "https://www.youtube.com/watch?v=ECG36-O4-NI",
      "description": "We take a closer look at database indexes in relational databases - simple & compound indexes, if & when indexes are used, what data structures hold these indexes, and the drawbacks of having too many indexes. \n\nNotes & References: https://doc.clickup.com/45016410/d/h/1axtau-1702/a845ee8db2d5b2d\n\n🥹 If you found this helpful, follow me online here:\n\n✍️ Blog https://irtizahafiz.medium.com\n👨💻 Website https://irtizahafiz.com\n📲 Instagram https://www.instagram.com/irtiza.hafiz/\n\n0:00 Agenda\n01:15 What is an Index?\n04:50 How does an index look?\n07:30 Default Index\n08:43 Adding an Index\n09:50 DB Engine does not use index\n12:00 Compound Index\n14:40 Order of Index Matters\n16:30 B-Trees\n18:15 Drawbacks of having too many indexes\n\n#database #mysql #postgres",
      "thumbnail": "https://i.ytimg.com/vi/ECG36-O4-NI/hqdefault.jpg",
      "content_category_id": 1,
      "score": 1.3015165328979492
    }
  ]
}
With ChromaDB, here’s how I am doing a vector search.
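def get_top_recommendations(query_string: str):
    # content_collection is the ChromaDB collection populated by the nightly job.
    results = content_collection.query(
        query_texts=[query_string],
        n_results=5,
    )
    # query() returns one list per query text; we sent one, so take index 0.
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return documents, metadatas, distances

It’s that easy! This is what happens behind the scenes: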
- ChromaDB uses the collection's embedding function (OpenAI, in my case) to generate an embedding for the user query
- It does a **vector search** to compare that embedding with the embeddings of each of my YouTube videos and blog posts
- It returns the 5 closest matches, that is, the results with the smallest distance to the query. A lower score means a better match.
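One way the query result could be turned into the JSON response shown earlier is sketched below. The function names are hypothetical, and I am assuming that the response fields map one-to-one onto metadata stored with each item and that `score` is simply the distance ChromaDB returns.

```python
def build_response(results):
    """Convert a ChromaDB query result into the {"data": [...]} response shape."""
    # Each key holds one list per query text; a single query means index 0.
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    data = []
    for metadata, distance in zip(metadatas, distances):
        data.append({
            "title": metadata.get("title"),
            "url": metadata.get("url"),
            "description": metadata.get("description"),
            "thumbnail": metadata.get("thumbnail"),
            "content_category_id": metadata.get("content_category_id"),
            "score": distance,  # assumed: score is the raw distance, lower is closer
        })
    return {"data": data}


def recommend(collection, query_string, n_results=5):
    """Run the vector search and shape the result for the API route."""
    results = collection.query(
        query_texts=[query_string],
        n_results=n_results,
        include=["metadatas", "distances"],  # the document text is not needed here
    )
    return build_response(results)
```

Keeping `build_response` separate from the query makes it easy to test the response shape without a database or an OpenAI key.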
You can find the full code on GitHub.
Closing Thoughts
With this architecture, I can serve related content to my users that spans both text (blogs) and visual (video) content.
The data is stored locally and all new content is processed nightly. Pretty cool stuff!
If you are still reading, I hope you found it valuable and it was worth your time.
For similar content, check out my YouTube channel or follow me here on Medium.
If you would like to get a copy of the illustration or extra notes, join my newsletter and I will email you a copy.
