Stories by Priya Singh on Medium

Building Recommendation Systems with Vector Search

Priya Singh — Fri, 22 May 2026 07:19:47 GMT

Last week I was debugging a recommendations pipeline that looked fine in a notebook and felt mediocre in the product. The model was not broken. The problem was that our retrieval layer was still thinking in keywords while the user behavior was much messier than that.

A user might read three articles about deployment latency, save one post about prompt evaluation, and ignore five posts about generic AI trends. If I reduce that behavior to tags, I lose a lot of signal. If I represent the content and user profile as vectors, I can retrieve items that are semantically close even when the words do not match exactly.

That is the part of recommendation systems I wish more teams discussed: the model is only half the story. The retrieval substrate matters.

The Three Recommendation Modes I Actually Use

Most recommendation systems start with one of three patterns.

The first is popularity-based recommendation. This is the “trending now” shelf. It is simple, explainable, and surprisingly hard to beat for cold-start users. If I know nothing about a visitor, showing globally popular items is a reasonable baseline.

The second is content-based recommendation. This uses item features: title, description, category, image embedding, transcript, metadata, or any other representation of the item. If a user reads an article about stream processing, I can recommend other content with similar meaning.

The third is collaborative filtering. This uses behavior. Users who clicked, watched, purchased, or saved similar things become signals for one another.

In practice, I rarely deploy only one of these. The fix was simpler than I expected: combine them, then make the retrieval layer fast enough that the product can use the system in real time.

Why Vectors Help

Traditional content-based systems often rely on manually assigned tags or keyword overlap. That works until the user searches for “slow checkout” and the most relevant item says “payment latency.” A keyword system sees different words. An embedding model can put both concepts near each other.

Vector embeddings let us represent text, images, audio, and user profiles as dense numerical vectors. Once the items are embedded, recommendation becomes a similarity search problem:

1. Build an item vector from content and metadata.

2. Build a user vector from recent interactions.

3. Search for nearby item vectors.

4. Rerank with business rules, freshness, diversity, and availability.

This is where a recommender system starts to feel less like a static rules engine and more like a retrieval pipeline.

A Minimal Content-Based Recommender

Here is a stripped-down version of the pattern I usually prototype first. It averages recent item vectors to build a user vector, then searches for the nearest items.

import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm

def build_user_vector(recent_item_vectors, weights=None):
    if weights is None:
        weights = np.ones(len(recent_item_vectors))

    weighted = np.array([
        normalize(vec) * weight
        for vec, weight in zip(recent_item_vectors, weights)
    ])
    return normalize(weighted.sum(axis=0))

def recommend(vector_store, user_vector, seen_ids, limit=20):
    candidates = vector_store.search(
        vector=user_vector,
        top_k=limit * 3,
        filters={"status": "published"}
    )

    results = []
    for item in candidates:
        if item["id"] in seen_ids:
            continue
        results.append(item)
        if len(results) == limit:
            break

    return results

This is not enough for production, but it is enough to test whether semantic similarity is useful for the product. The first thing I measure is not model accuracy. I measure whether a product manager can look at the results and say, “Yes, these belong together.”

The Cold Start Problem Does Not Go Away

One thing I learned the hard way: vector search improves retrieval, but it does not magically solve cold start.

For new users, I still need fallback shelves:

• popular items by geography or segment

• editorially curated items

• recently trending items

• onboarding interests

• context from the current page or query

For new items, I can embed the content immediately and make it eligible for content-based retrieval before it has interaction history. That is one of the nicest parts of vector-based recommendations. A new article, video, or product can be recommended based on what it is, not only who has clicked it.

Collaborative filtering still becomes more useful as interactions accumulate. My usual approach is to blend both:

• content similarity for fast item cold start

• collaborative signals for mature inventory

• business rules for availability and safety

• diversity constraints so the shelf does not show ten near-duplicates
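A minimal sketch of that blend, assuming each candidate already carries a content-similarity score and a collaborative score in the 0 to 1 range. The function names, the dictionary keys, and the `ramp` parameter are hypothetical, and the linear hand-off is only one reasonable choice, not a tuned formula.

```python
def blend_score(content_sim, collab_score, interaction_count, ramp=50):
    """Blend content similarity with a collaborative score.

    Items with few interactions lean on content similarity; as the
    interaction count approaches `ramp` (a hypothetical tuning knob),
    the collaborative score takes over.
    """
    collab_weight = min(interaction_count / ramp, 1.0)
    return (1.0 - collab_weight) * content_sim + collab_weight * collab_score


def rank_candidates(candidates):
    """Apply the availability rule, then sort by blended score."""
    eligible = [c for c in candidates if c["available"]]
    return sorted(
        eligible,
        key=lambda c: blend_score(
            c["content_sim"], c["collab_score"], c["interactions"]
        ),
        reverse=True,
    )
```

Diversity is deliberately left out here; it is easier to reason about as a separate reranking pass over the sorted list.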

Where The Vector Database Fits

A vector database becomes useful when the catalog is too large or dynamic for a local in-memory index. The database stores item embeddings, metadata, and sometimes multiple vectors per item. For example, a product might have one vector for title text, one for image, and one for historical user interactions.

The important production detail is filtering. Recommendation queries almost always need constraints:

• only active inventory

• only content the user is allowed to see

• exclude items already consumed

• filter by locale, category, price, or availability

• boost recent or high-quality items

If the system retrieves similar items first and filters later, it can waste most of the candidate set. That is why I care about vector search with metadata filtering, not just nearest-neighbor math.
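A toy illustration of the difference, using brute-force cosine search over a small in-memory list. A real vector database applies the filter inside the index rather than scanning, so treat this only as a picture of the failure mode; the helper names and the `active` flag are made up for the example.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def top_k(query, items, k):
    return sorted(items, key=lambda it: cosine(query, it["vector"]), reverse=True)[:k]


def post_filter(query, items, k, predicate):
    """Search first, filter afterwards: can return fewer than k items."""
    return [it for it in top_k(query, items, k) if predicate(it)]


def pre_filter(query, items, k, predicate):
    """Filter first, then search only the eligible items."""
    return top_k(query, [it for it in items if predicate(it)], k)


catalog = [
    {"id": 1, "vector": [1.0, 0.0], "active": False},
    {"id": 2, "vector": [0.9, 0.1], "active": False},
    {"id": 3, "vector": [0.5, 0.5], "active": True},
    {"id": 4, "vector": [0.0, 1.0], "active": True},
]
is_active = lambda it: it["active"]
late = post_filter([1.0, 0.0], catalog, 2, is_active)
early = pre_filter([1.0, 0.0], catalog, 2, is_active)
```

Here the two nearest items are inactive, so the post-filtered shelf comes back empty while the pre-filtered one still returns two active items.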

The Tradeoff: Personalization vs. Diversity

My first vector recommendation prototype was too literal. If a user read one article about RAG evaluation, the next shelf became a wall of RAG evaluation posts. The similarity score was working, but the product experience was narrow.

The fix was to rerank for diversity after retrieval:

1. Retrieve 100 candidates by vector similarity.

2. Remove seen or unavailable items.

3. Group candidates by topic or source.

4. Limit how many results can come from the same cluster.

5. Mix in a few exploration items.

This reduced pure similarity a little, but the shelf became more useful. Users do not always want twenty versions of the same thing.
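The capping part of that rerank (steps 3 and 4) can be sketched as a greedy pass. This assumes the candidates arrive sorted by similarity and already carry a `cluster` label; both the key name and the default quota are illustrative, and the exploration step is not shown.

```python
from collections import Counter


def diversify(candidates, limit, max_per_cluster=2):
    """Walk candidates in similarity order and skip any item whose
    cluster has already used up its quota."""
    picked = []
    per_cluster = Counter()
    for item in candidates:  # assumed sorted, best match first
        cluster = item["cluster"]
        if per_cluster[cluster] >= max_per_cluster:
            continue
        picked.append(item)
        per_cluster[cluster] += 1
        if len(picked) == limit:
            break
    return picked
```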

Production Notes

For real-time recommendations, latency budget matters. If the page needs to render in 300 ms, the recommendation service cannot spend 250 ms on embedding and search. I usually precompute item vectors and update user vectors asynchronously. The online path should mostly be retrieval, filtering, and reranking.

Batching also matters. Embedding every click synchronously is expensive and brittle. I prefer to stream events into a queue, aggregate recent behavior in short windows, and update user vectors every few minutes unless the product truly requires instant adaptation.
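One way to fold a window of events into a stored user vector is an exponential moving average. This is a swapped-in technique for illustration, not the plain average used in the prototype earlier, and the `decay` value is an assumption to tune per product.

```python
def update_user_vector(current, batch_vectors, decay=0.8):
    """Fold one window of item vectors into the stored user vector.

    new = decay * current + (1 - decay) * mean(batch)
    An empty window leaves the vector unchanged.
    """
    if not batch_vectors:
        return list(current)
    count = len(batch_vectors)
    mean = [sum(v[i] for v in batch_vectors) / count for i in range(len(current))]
    return [decay * c + (1.0 - decay) * m for c, m in zip(current, mean)]
```

If the vector store expects unit-length vectors, re-normalize the result before searching with it.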

Finally, monitor the boring metrics:

• retrieval latency p95 and p99

• empty result rate

• duplicate recommendation rate

• click-through by shelf type

• diversity by category or cluster

• stale inventory returned by the service

The model team may care about offline ranking metrics, but the product will feel the operational metrics first.

What I Would Build First

If I were starting from scratch, I would not build the most complex recommender. I would build a content-based vector retriever, add a popularity fallback, and layer in collaborative signals later. That gives the team a working baseline, handles new items reasonably well, and creates enough logs to learn what users actually respond to.

Recommendation systems are not one algorithm. They are retrieval, ranking, feedback, and product constraints glued together. Vectors make the retrieval layer much more flexible, but the system still needs careful design around cold start, filtering, latency, and diversity.

How to Evaluate RAG Applications

Priya Singh — Fri, 15 May 2026 19:56:00 GMT

The Moment I Realized My RAG Was Confidently Wrong

Last month I shipped a knowledge-base chatbot for an internal team. It looked great in demos — fluent answers, fast responses, everyone loved it. Then a product manager asked it a straightforward question about our API limits and got a response that was factually wrong but sounded completely authoritative.

The retrieved documents were correct. The answer was wrong. And I had no systematic way to tell how often this was happening across thousands of queries.

That’s when I went deep on RAG evaluation. Here’s what I learned about actually measuring whether your pipeline is working — not just whether it feels like it’s working.

The Three Metrics That Matter

When you treat your RAG system as a black box, you have three things to work with: the user’s query, the retrieved chunks, and the generated response. The relationship between these three elements tells you almost everything you need to know.

Context Relevance measures whether the retrieval step is pulling the right documents. If your chunks don’t contain the information needed to answer the query, even the best LLM can’t save you. Low context relevance is the single most common failure mode I’ve seen in production RAG systems — and it’s almost always a chunking or embedding problem, not a generation problem.

Faithfulness checks whether the generated answer actually reflects what’s in the retrieved documents. This is your hallucination detector. A low faithfulness score means the model is making things up or mixing in information from its parametric memory instead of sticking to the provided context.

Answer Relevance evaluates whether the response actually addresses the question. You can have perfect retrieval and faithful generation, but if the answer is incomplete or off-topic, the user still doesn’t get what they need.

One thing I learned the hard way: these three metrics are not independent. A problem in context relevance cascades downstream — bad retrieval leads to unfaithful generation leads to irrelevant answers. When debugging, always start with retrieval.

How to Score These Automatically

Here’s what actually happened when I tried to evaluate manually: I spent two hours scoring 50 query-response pairs and realized this would never scale. For a production system handling thousands of queries daily, manual evaluation is a non-starter.

The fix was simpler than I expected: use an LLM as a judge. The LLM-as-a-Judge approach, where you use a capable model like GPT-4 to score responses, reaches about 80% agreement with human raters. That sounds low until you realize that two human raters typically don’t agree much more than that on subjective assessments.

Here’s how I set this up for answer relevance:

eval_prompt = """
Rate how well this response answers the question.
Score 0-10 (0 = completely irrelevant, 10 = perfect answer).

Question: {question}
Response: {response}

Score:
"""

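import re

def evaluate_answer_relevance(question, response, llm_client):
    result = llm_client.complete(
        eval_prompt.format(question=question, response=response)
    )
    # Replies like "Score: 8" or "8/10" break int(); take the first integer.
    match = re.search(r"\d+", result)
    if match is None:
        raise ValueError(f"no score in judge output: {result!r}")
    return int(match.group())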

The key to making LLM-as-a-Judge work reliably is prompt engineering. Position bias is real — LLMs pay more attention to content at the beginning and end of long prompts. I use chain-of-thought prompting to force the model to reason through its score before committing to a number.
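A sketch of what a chain-of-thought judge prompt and its parser might look like. The prompt wording and the "Score: N" convention are my own assumptions, not a standard; the point is that once the judge writes reasoning first, the parser has to look for the final score line instead of treating the whole reply as a number.

```python
import re

COT_EVAL_PROMPT = """
You are grading a response to a question.
First explain your reasoning in two or three sentences.
Then, on the final line, write "Score: N" where N is an integer from 0 to 10.

Question: {question}
Response: {response}
"""


def parse_judge_score(judge_output):
    """Return the last 'Score: N' in the judge output, or None if it is
    missing or outside the 0-10 range."""
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 0 <= score <= 10 else None
```

Returning None instead of raising lets the caller count unparseable judgments as their own metric.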

The Ground Truth Question

You might have noticed none of the above metrics require ground-truth answers. That’s intentional — annotating ground truth is expensive and time-consuming.

But when you do have ground truth, you unlock more precise evaluation. You can measure retrieval recall (what fraction of relevant documents did we actually retrieve?) and answer correctness (does the generated answer match the known-correct answer?).
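With labeled relevant documents per query, retrieval recall is a few lines of set arithmetic. A minimal sketch, assuming documents are identified by hashable IDs; precision is included because the two are usually read together.

```python
def retrieval_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant_ids)) / len(retrieved)
```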

Here’s the practical shortcut: use an LLM to generate synthetic ground truth from your knowledge documents. Tools like Ragas and LlamaIndex have built-in generators that produce question-answer pairs from your corpus. The generated questions won’t be perfect, but they’re good enough for regression testing and comparing pipeline configurations.

White-Box Evaluation: Testing Components Separately

When I can see inside the pipeline, I evaluate each component independently. This is where you find the actual bottleneck.

[Embedding model](https://zilliz.com/blog/choosing-the-right-embedding-model-for-your-data?utm_campaign=mediumkoc) evaluation: Use information retrieval metrics — context recall (did we find the right chunks?) and context precision (are the retrieved chunks actually relevant?). The MTEB benchmark is the standard leaderboard, but be cautious: since the benchmark datasets are public, some models may be overfit. Always validate on data from your actual domain.

Reranker evaluation: Measure how much the reranker improves the ordering of retrieved results. Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) are the key metrics here. A reranker that doesn’t meaningfully improve NDCG over the base retrieval isn’t worth the latency cost.
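A small sketch of NDCG for comparing the order before and after reranking. It uses the linear-gain form (relevance divided by log2 of rank + 1) and assumes graded relevance labels are already available for each returned position; the example labels are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance labels in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return 0.0 if ideal == 0 else dcg(relevances) / ideal


base_order = [0, 1, 1]      # relevance of results as the retriever ranked them
reranked_order = [1, 1, 0]  # same results after a hypothetical reranker
gain = ndcg(reranked_order) - ndcg(base_order)
```

A positive `gain` on a held-out query set is the number to weigh against the reranker's added latency.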

LLM evaluation: For simple factual questions, you can use deterministic metrics like ROUGE-L and token overlap against ground truth. For open-ended questions, fall back to LLM-as-a-Judge faithfulness scoring.
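For the deterministic side, here is a dependency-free sketch of token-overlap F1 and an LCS-based ROUGE-L F-score. It tokenizes on whitespace and lowercases, which is cruder than the stemming and tokenization in established ROUGE packages, so the numbers will not match those tools exactly.

```python
from collections import Counter


def token_f1(prediction, reference):
    """F1 over the bag of tokens shared by prediction and reference."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L style F-score built on the LCS of the two token sequences."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```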

The Evaluation Tools I Actually Use

After trying most of the options out there, here’s where I’ve landed:

• Ragas: Best for black-box RAG evaluation. Clean interface, good metric coverage, works with any RAG framework. The synthetic test data generation is a time-saver.

• TruLens-Eval: Good integration with LangChain and LlamaIndex if you’re already in those ecosystems. The browser-based monitoring is helpful for tracking evaluation over time.

• DeepEval: Lightweight, fast. Good for CI/CD integration when you want evaluation as part of your deployment pipeline.

The common pattern across all of these: they use LLM-as-a-Judge under the hood and charge you API tokens accordingly. Budget for evaluation costs — on a large test set, evaluation can cost more than the actual inference.

What I Do Differently Now

After going through this, I changed my RAG development workflow. Before deploying any pipeline change, I run a standardized evaluation suite: 200 synthetic questions generated from the knowledge base, scored across all three metrics. If context relevance drops below 0.7 or faithfulness drops below 0.8, the change doesn’t ship.
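The gate itself is simple enough to sketch. The metric names and the dictionary shape are assumptions about how the evaluation suite reports its results; the two floors are the ones quoted above, and answer relevance is tracked but not gated here.

```python
def mean_scores(per_question):
    """Average each metric over a list of per-question score dicts."""
    names = per_question[0].keys()
    return {n: sum(q[n] for q in per_question) / len(per_question) for n in names}


def should_ship(scores, floors=None):
    """Return (ok, failures), where failures maps each metric that fell
    below its floor to the observed score."""
    floors = floors or {"context_relevance": 0.7, "faithfulness": 0.8}
    failures = {
        name: scores[name] for name, floor in floors.items() if scores[name] < floor
    }
    return not failures, failures
```

In CI, a failing gate would print `failures` and exit non-zero so the change cannot merge silently.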

It’s not perfect — there are failure modes these metrics miss, especially around nuance and completeness. But it catches the catastrophic failures, which is what matters most in production.

Next, I’m looking into building a continuous evaluation loop that samples live traffic and flags quality regressions in real time. The challenge is doing this without burning through GPT-4 tokens at scale — I’m experimenting with smaller, fine-tuned judge models to bring the cost down.

How to Evaluate RAG Applications was originally published in GoPenAI on Medium.

Why Your Vector Search Returns 10 Results From the Same Document

Priya Singh — Wed, 13 May 2026 10:12:33 GMT

Milvus Week: Array of Structs and MAX_SIM

Last week I was building a ColBERT-style retrieval system for a client’s internal knowledge base, and I hit the exact problem that probably every ML engineer who’s worked with multi-vector search knows: the vector database kept returning multiple chunks from the same document instead of giving me a ranked list of unique documents. I’d fetch top-10 results, and six of them would be fragments of the same article. The whole post-processing layer I built — grouping by doc_id, deduplicating, reranking — felt exactly like what a database should be handling natively.

Then I saw the Milvus 2.6.4 release notes and the Array of Structs + MAX_SIM combination. I spent a weekend digging into it. Here’s what actually happened.

The Core Problem: Embeddings Are Not Entities

The architectural gap has always been the same. Most vector databases treat each embedding as an isolated row. But real applications operate on entities — documents, products, videos, scenes. When you chunk a document into 20 paragraphs and embed each one, you have 20 rows in your index. Ask for top-5, and you might get 5 rows from the same document.

I’ve patched this problem four different ways across different projects:

• Grouping by metadata field after retrieval

• Setting a max-per-document cap and re-querying if needed

• Running a separate reranking model that penalizes redundancy

• Using ColBERT’s late interaction but still handling dedup manually

All of these work. None of them are satisfying. You’re pushing application-layer logic into a problem that should be solved at retrieval time.
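For reference, the first of those patches usually looks something like this. It is a minimal sketch that assumes each hit is a dict with `doc_id` and `score` keys, which is a hypothetical shape rather than any particular client's response format.

```python
def dedupe_by_document(hits, limit):
    """Collapse chunk-level hits into one entry per document, keeping
    each document's best-scoring chunk, then rank documents by it."""
    best = {}
    for hit in hits:
        current = best.get(hit["doc_id"])
        if current is None or hit["score"] > current["score"]:
            best[hit["doc_id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:limit]
```

The catch is that you have to over-fetch chunks to end up with enough distinct documents, and how much to over-fetch depends on the data.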

The use cases where this shows up are consistent:

• RAG knowledge bases: articles are chunked into paragraph embeddings, so the search engine returns scattered fragments instead of the complete document

• E-commerce recommendation: a product has multiple image embeddings, and your system returns five angles of the same item rather than five unique products

• Video platforms: videos are split into clip embeddings, but search results surface slices of the same video rather than a single consolidated entry

• ColBERT / ColPali-style retrieval: documents expand into hundreds of token or patch-level embeddings, and your results come back as tiny pieces that still require merging

Array of Structs: One Entity, One Row

Milvus 2.6.4 introduces an Array of Structs field type. A single record now holds an ordered list of Struct elements, where each Struct follows the same predefined schema — it can contain vectors, strings, scalar fields, whatever belongs to that sub-element.

Here’s what a document record looks like with this structure:

{
  'id': 0,
  'title': 'Walden',
  'title_vector': [0.1, 0.2, 0.3, 0.4, 0.5],
  'author': 'Henry David Thoreau',
  'year_of_publication': 1854,
  'chunks': [
    {
      'text': 'When I wrote the following pages...',
      'text_vector': [0.3, 0.2, 0.3, 0.2, 0.5],
      'chapter': 'Economy',
    },
    {
      'text': 'I would fain say something, not so much...',
      'text_vector': [0.7, 0.4, 0.2, 0.7, 0.8],
      'chapter': 'Economy'
    }
  ]
}

The chunks field is the Array of Structs field. Every paragraph that belongs to this entity lives inside one row. No more 1:N explosion of rows per document.

This is the right data model for almost every multi-vector use case I encounter:

• RAG knowledge bases: entire document (all chunks) as one record

• E-commerce: all product images as one record

• Video search: all clip embeddings as one record

• ColPali document search: all patch embeddings as one record

MAX_SIM: Entity-Level Scoring That Makes Sense

The new field type alone wouldn’t be enough. You still need a scoring mechanism that operates at the entity level, not the individual-vector level. That’s what MAX_SIM provides.

When you query with MAX_SIM, Milvus compares each query vector against every vector stored in the entity’s Array of Structs field and keeps the maximum similarity for that query vector. With a single query vector, that maximum is the entity’s score; with several (token vectors, for example), the per-vector maxima are summed. Each entity is ranked on that single score, so there are no duplicate-filled result sets and no complex post-processing.

The Milvus docs walk through a concrete example worth understanding. Say you search for “Machine Learning Beginner Course,” which gets tokenized into three vectors: machine learning, beginner, course. Now you have two candidate documents:

• doc_1: “Introduction Guide to Deep Neural Networks with Python”

• doc_2: “Advanced Guide to LLM Paper Reading”

For doc_1, the per-token best matches (using cosine similarity in the [0,1] range) are:

• machine learning → deep neural networks (0.9)

• beginner → introduction (0.8)

• course → guide (0.7)

• Sum = 2.4

For doc_2:

• machine learning → LLM (0.9)

• beginner → guide (0.6)

• course → guide (0.8)

• Sum = 2.3

doc_1 wins, which is the intuitive result — it’s more of an introductory guide.
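The arithmetic is easy to reproduce by hand. In this sketch each row holds the similarities of one query token to every vector in a document. Only the per-row maxima come from the example above; the other entries are made-up filler so the rows have something to take a maximum over.

```python
def max_sim(similarities):
    """similarities[i][j] is the similarity of query vector i to document
    vector j. Keep each query vector's best match, then sum them."""
    return sum(max(row) for row in similarities)


# Rows: machine learning, beginner, course.
# Columns for doc_1: introduction, guide, deep neural networks, python.
doc_1 = [
    [0.2, 0.3, 0.9, 0.4],
    [0.8, 0.5, 0.1, 0.2],
    [0.4, 0.7, 0.2, 0.1],
]
# Columns for doc_2: advanced, guide, LLM, paper, reading.
doc_2 = [
    [0.2, 0.3, 0.9, 0.3, 0.2],
    [0.1, 0.6, 0.2, 0.2, 0.3],
    [0.2, 0.8, 0.1, 0.4, 0.5],
]

ranking = sorted(
    [("doc_1", max_sim(doc_1)), ("doc_2", max_sim(doc_2))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Up to floating-point rounding, the scores come out at 2.4 and 2.3, so `ranking` puts doc_1 first.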

Three things to note about how MAX_SIM behaves:

1. Semantic, not lexical. “Machine learning” scores high against “deep neural networks” despite zero shared tokens. The scoring lives entirely in embedding space, making it robust to synonyms and paraphrases.

2. Length-agnostic. doc_1 has 4 vectors, doc_2 has 5. MAX_SIM doesn’t care — it matches each query vector to the best available candidate within each entity, regardless of how many exist.

3. Every query token contributes. The sum ensures that a document that matches well on some tokens but poorly on others doesn’t unfairly dominate. Lower-quality matches directly reduce the overall score.

Setting This Up in Milvus: What the Code Looks Like

Here’s how you’d define a collection schema with an Array of Structs field and set up retrieval with MAX_SIM:

from pymilvus import MilvusClient, DataType

# If Milvus Lite rejects the struct field, point this at a 2.6.4+ server URI instead.
client = MilvusClient("milvus.db")

# Define the schema
schema = client.create_schema(
    auto_id=False,
    enable_dynamic_field=True
)

# Entity-level fields
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("title", DataType.VARCHAR, max_length=512)

# Schema shared by every element of the array.
# The struct-schema calls and the "chunks[text_vector]" path follow the
# Milvus 2.6 docs as I read them; verify against your pymilvus version.
struct_schema = MilvusClient.create_struct_field_schema()
struct_schema.add_field("text", DataType.VARCHAR, max_length=4096)
struct_schema.add_field("text_vector", DataType.FLOAT_VECTOR, dim=768)
struct_schema.add_field("chapter", DataType.VARCHAR, max_length=256)

# Array of Structs field for multi-vector storage
schema.add_field(
    "chunks",
    DataType.ARRAY,
    element_type=DataType.STRUCT,
    struct_schema=struct_schema,
    max_capacity=1000  # assumed upper bound on chunks per entity
)

# Index params: HNSW on the nested vector field, scored with MAX_SIM
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="chunks[text_vector]",
    index_type="HNSW",
    metric_type="MAX_SIM_COSINE"
)

client.create_collection(
    collection_name="documents",
    schema=schema,
    index_params=index_params
)

One production consideration worth flagging: with large entities — documents with hundreds of chunks — the memory layout per record changes significantly compared to single-vector schemas. I’d recommend starting with a conservative estimate of average chunks-per-entity and monitoring memory consumption during index build, especially if you’re running Milvus on memory-constrained nodes.

Design Tradeoffs I’m Still Thinking About

Array of Structs + MAX_SIM solves the grouping and deduplication problem cleanly, but it’s not a universal drop-in replacement.

When it works extremely well:

• ColBERT and ColPali retrieval, where you’re doing late interaction across many token or patch vectors

• Document retrieval where you want entity-level ranking from the start

• E-commerce and media, where a “result” is always a single product or video

Where I’d think twice:

• If your chunks need to surface individually in the response (you want the specific paragraph, not just the document), you still need to identify the best matching chunk post-retrieval. MAX_SIM tells you which entity wins, not which internal vector was the best match. You’d need a second pass for chunk-level answers.

• Write-heavy pipelines where entities are frequently updated. The field type doesn’t change Milvus’s segment behavior, but it’s worth testing your specific update pattern before committing.
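The second pass for chunk-level answers can be sketched in a few lines, assuming the winning entity comes back with its chunk vectors included. The record shape mirrors the Walden example above, and the brute-force scan is fine here because it only touches one entity's chunks.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def best_chunk(query_vector, entity, vector_key="text_vector"):
    """Return the chunk inside one entity that is closest to the query."""
    return max(
        entity["chunks"],
        key=lambda chunk: cosine(query_vector, chunk[vector_key]),
    )
```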

One thing I learned the hard way on a previous RAG project: if your chunking strategy produces wildly variable chunk counts per document — some docs have 3 chunks, others have 300 — the entity-level scores aren’t directly comparable. Normalize or filter by entity size if that matters for your recall metrics.

Where This Fits in a Practical RAG Stack

For the ColBERT-style setup I was building, I’m migrating to Array of Structs with MAX_SIM as the retrieval layer. The change that matters most in production: eliminating the deduplication pass that was running after every vector search call. In my setup, that post-processing step was adding roughly 40–80ms of latency per query depending on the result set size. With entity-level retrieval built into the database, that cost disappears.

The pattern I’m moving to:

1. At index time: one record per entity, all chunk vectors stored in the Array of Structs field

2. At query time: late-interaction scoring via MAX_SIM, entity-level ranked results returned directly

3. Final step: fetch the stored chunk text fields from the winning entity to build the LLM context window

No intermediate grouping. No dedup. No reranking middleware for deduplication purposes. Just retrieval that returns what the application actually needs.

This is the kind of database-level primitive that makes the application stack simpler. I’ll be writing a follow-up once I’ve run this in a real traffic environment and can share actual recall and latency numbers comparing it to my current post-processing approach.

Beyond the Black Box: Building Class Activation Maps in PyTorch from Scratch

Priya Singh — Wed, 13 May 2026 08:01:22 GMT

When you’re shipping deep learning models into production — especially for high-stakes applications like medical imaging or autonomous vehicles — accuracy isn’t the only thing that matters. Interpretability becomes just as critical. The model can’t just “be right”; it needs to show its work.

This is where Class Activation Mapping (CAM) comes in. It’s a simple yet powerful way to make convolutional neural networks (CNNs) a bit less of a black box, and I’ve found it incredibly useful for both debugging and demoing models.

Let’s walk through what CAM does, why it matters, and how you can implement it from scratch using PyTorch — without relying on wrappers or high-level explainability libraries.

Why Interpretability Matters in Vision Models

CNNs have become the go-to tool for image classification, object detection, and segmentation. But despite their predictive power, they’re hard to trust blindly — especially in mission-critical applications. Here’s a quick example:

Imagine you’ve built a model that classifies road signs for a self-driving car. It flags a stop sign — but is it focusing on the sign, or the red car parked next to it? Without interpretability, you’d never know.

CAM addresses this by showing which parts of the image contributed most to a classification. You get a heatmap overlay on the image that essentially answers: “Why did the model think this was a stop sign?”

The Core Idea Behind CAM

Let’s get a bit technical. CAM is only applicable to a specific kind of CNN architecture — where the model ends with a global average pooling (GAP) layer, followed by a fully connected (FC) layer.

Here’s how it works under the hood:

1. The CNN processes the input image and outputs a set of feature maps from the last convolutional layer.

2. The GAP layer averages each feature map into a single scalar.

3. The FC layer multiplies these scalars by learned weights to produce class scores.

The key insight: if you know the weights in the FC layer for a specific class (say, “zebra”), you can multiply those weights back into the original feature maps to see which spatial regions were most responsible for the prediction.

This gives you a class-specific heatmap — aka, the CAM.
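A quick numerical check of why this works, on toy random data rather than a real network. Because both averaging and the FC layer are linear, applying the class weights to the feature maps first and averaging afterwards gives the same class score (ignoring the bias term). The shapes and values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((4, 3, 3))  # 4 toy feature maps, each 3x3
class_weights = rng.random(4)         # FC weights for one class, bias ignored

# Route 1: GAP, then FC, as in the forward pass.
pooled = feature_maps.mean(axis=(1, 2))
class_score = class_weights @ pooled

# Route 2: weight the maps first (this is the CAM), then average it.
cam = np.tensordot(class_weights, feature_maps, axes=1)  # shape (3, 3)
same = np.isclose(class_score, cam.mean())
```

So the CAM is the class score before spatial averaging, which is why its large values point at the regions that drove the prediction.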

Visual Example

Take this heatmap output for an image containing a zebra and a car:

Heatmap: A Class Activation Map highlights the regions around the zebra and car as more significant than other image parts.

The model is clearly focusing on the zebra. That’s what we want to see: the model’s attention aligns with our human intuition.

A Ground-Up Implementation with PyTorch

Let’s build CAM from scratch using PyTorch and a pre-trained ResNet18. Here’s what we’ll do:

1. Use a hook to capture the last convolutional feature maps.

2. Extract the weights for the predicted class from the final FC layer.

3. Compute a weighted sum of the feature maps using those weights.

4. Normalize and upsample the result to create a heatmap.

Step 1: Load and Set Up the Model

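import numpy as np
import cv2
import torch
from torch.nn import functional as F
from torchvision import models, transforms

# pretrained=True is deprecated; the weights enum needs torchvision 0.13+.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.eval()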

Step 2: Register a Forward Hook to Grab Feature Maps

activation = {}

def get_activation(name):
    # Returns a hook that stores the layer output under the given name.
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook

model.layer4.register_forward_hook(get_activation('final_conv'))

Step 3: Load and Transform the Input Image

image_path = "path_to_your_image.png"
image = cv2.imread(image_path)
orig_image = image.copy()
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
input_tensor = transform(image).unsqueeze(0)

Step 4: Forward Pass and Fetch Class Index

outputs = model(input_tensor)
class_idx = F.softmax(outputs, dim=1).argmax().item()

Step 5: Retrieve Feature Maps and Weights

feature_maps = activation['final_conv'][0]
weights = model.fc.weight[class_idx].detach().numpy()

Step 6: Generate the CAM

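def compute_cam(feature_maps, weights):
    # feature_maps: (channels, h, w) array; weights: (channels,) FC weights for the class
    nc, h, w = feature_maps.shape
    cam = weights.dot(feature_maps.reshape((nc, h * w)))
    cam = cam.reshape(h, w)
    cam = cam - np.min(cam)
    cam = cam / (np.max(cam) + 1e-8)  # avoid dividing by zero on a flat map
    cam = np.uint8(255 * cam)
    # cv2.resize takes (width, height)
    return cv2.resize(cam, (image.shape[1], image.shape[0]))

cam = compute_cam(feature_maps.numpy(), weights)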

Step 7: Overlay CAM on Original Image

heatmap = cv2.applyColorMap(cam, cv2.COLORMAP_JET)
overlay = heatmap * 0.3 + orig_image * 0.5
cv2.imshow('CAM Result', overlay.astype(np.uint8))
cv2.waitKey(0)

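You should see the original image with the heatmap blended on top; with COLORMAP_JET, red marks the regions that contributed most to the predicted class.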

CAM in the Wild: Lessons from Real Projects

When I first added CAM to a medical image classifier, I was shocked to find that it consistently latched onto image corners — areas with hospital watermarks, not pathology. That alone saved us weeks of debugging.

In another project involving drone footage, CAM revealed that the model was biased toward shadows when identifying “moving vehicles.” Without the visualization, we wouldn’t have caught that misbehavior until it hit production.

In both cases, CAM was my early warning system.

Limitations and When to Reach for Grad-CAM

Now, CAM is great — but it’s not flexible. You need a GAP layer before the FC layer, which many modern architectures don’t have. If you’re using something more customized or want generalization across architectures, you’ll want Grad-CAM instead.

Grad-CAM works by computing gradients of the class score with respect to feature maps. It doesn’t require architectural changes, so it’s a drop-in solution for most use cases.

There’s also Grad-CAM++, which improves localization when multiple objects are present.

Final Thoughts: Use CAM to Build Trust

Interpretability tools like CAM are essential when you’re putting models in production — especially in domains where debugging is hard and stakes are high.

If you’re building your own stack, self-hosting your own model inference (like with Triton) and vector search (I often reach for Milvus for scalable, GPU-accelerated retrieval), overlaying CAM visualizations can be an easy way to monitor whether your vision model is still behaving as expected post-deployment.

CAM might not be the fanciest interpretability method around anymore, but for getting started — and for injecting some transparency into your CNNs — it’s a rock-solid foundation.

LangChain Is Not a Framework — It’s a Wiring Diagram for LLM Systems

Priya Singh — Thu, 30 Apr 2026 13:26:01 GMT

Last week I was helping a friend prototype an internal knowledge assistant for their legal team. They had a working prompt, a decent embedding model, and a pile of PDFs. “I just need to connect them,” they said. Twenty minutes into their codebase, I realized what they actually needed was not a better model — it was a wiring diagram. That is exactly what LangChain turned out to be for them, and for most of the production systems I have built over the past year.

There is a common misconception that LangChain is a framework you adopt wholesale, like Django or Rails. It is not. It is closer to an integration layer — a set of conventions for connecting prompts, models, retrievers, memory, and tools into something that behaves like a system instead of a notebook cell.

This post is my honest take on LangChain after shipping multiple RAG applications and agent-style systems in production. Where it helped, where it got in the way, and how I actually use it day to day.

The Problem LangChain Solves

Calling an LLM is easy. Building a system around one is not.

The moment you move beyond a single prompt and a single response, you run into real architectural questions. How do you feed context from a vector database into your prompt? How do you chain a summarization step before a generation step? How do you let the model decide which tool to call, and then route the result back into the conversation?

You can wire all of this yourself. I have done it plenty of times. But LangChain gives you a vocabulary for these patterns. It makes implicit decisions explicit: this is the retriever, this is the prompt template, this is the chain that ties them together. Even when I end up rewriting LangChain prototypes into custom code later, the architecture it forces me to think through usually survives.

Chains Are About Separation, Not Complexity

The core abstraction in LangChain is the chain — a sequence of operations where the output of one step feeds into the next. This sounds trivial until you realize how many production LLM systems jam everything into a single prompt and hope for the best.

Here is what actually happens in a real system. Your user asks a question. You need to retrieve relevant documents, reformat them into context, build a prompt, send it to the model, parse the response, maybe check for AI hallucination, and return a structured answer. Each of those steps has different failure modes, different latency profiles, and different caching strategies.

Chains force you to separate these concerns. That separation is what makes it possible to benchmark one step without running the whole pipeline, to cache retrieval results independently of generation, and to swap out your model without touching your retrieval logic.
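
A rough sketch of what that separation buys you, written in plain Python rather than LangChain: each step is a named function, so it can be timed or called on its own. The step functions below are hypothetical stand-ins, not real retrieval or model calls.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]

def run_pipeline(steps: List[Step], payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run steps in order and record wall-clock seconds per step."""
    timings: Dict[str, float] = {}
    for name, fn in steps:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings

# Hypothetical stand-ins for real retrieval, prompting, and generation
def retrieve(question: str) -> dict:
    return {"question": question, "docs": ["Tokens expire after 15 minutes."]}

def build_prompt(data: dict) -> str:
    return "Context: " + " ".join(data["docs"]) + "\nQuestion: " + data["question"]

def generate(prompt: str) -> str:
    return "stub answer for a prompt of " + str(len(prompt)) + " characters"

steps = [("retrieve", retrieve), ("build_prompt", build_prompt), ("generate", generate)]
answer, timings = run_pipeline(steps, "How long do tokens last?")
```

Because `retrieve` is an ordinary function, you can benchmark or cache it without ever touching `generate`.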

One thing I learned the hard way: do not over-chain. My first LangChain project had twelve steps in the chain, including three that were basically no-ops. Debugging was a nightmare. Now I keep chains to five steps or fewer and handle edge cases outside the chain.

Retrieval Is Where LangChain Earns Its Keep

LangChain becomes genuinely useful when retrieval enters the picture. And retrieval is where most of the actual engineering work lives in a RAG system.

The pattern is straightforward. You chunk your documents, generate vector embeddings for each chunk, store them in a vector database, and then at query time you embed the question, search for similar chunks, and feed those chunks into the LLM as context. LangChain wraps this entire flow in a RetrievalQA chain that handles the plumbing.

What I appreciate about this abstraction is the retriever interface. It does not care whether your backend is Milvus, Pinecone, pgvector, or a flat file. You implement one method that takes a query and returns documents (get_relevant_documents() in older releases; newer ones have you subclass BaseRetriever, override _get_relevant_documents(), and call the retriever through invoke()), and the rest of the chain just works. In practice, I have used Milvus for most of my production systems because it handles the scale I need (millions of vectors with sub-10ms p99 latency), but the point is that LangChain does not lock you in.
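
The idea behind that interface can be sketched without LangChain at all. The `Retriever` protocol and the toy keyword backend below are assumptions made for illustration; they are not LangChain classes.

```python
from typing import List, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> List[str]: ...

class KeywordRetriever:
    """Toy backend: ranks documents by how many query words they contain."""
    def __init__(self, docs: List[str]) -> None:
        self.docs = docs

    def retrieve(self, query: str, k: int) -> List[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(d.lower().split())), d) for d in self.docs]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked[:k] if score > 0]

def build_context(retriever: Retriever, query: str, k: int = 2) -> str:
    # Depends only on the interface, so the backend can be swapped freely
    return "\n".join(retriever.retrieve(query, k))

docs = ["login uses oauth tokens", "billing runs nightly", "oauth tokens expire hourly"]
context = build_context(KeywordRetriever(docs), "how do oauth tokens work")
```

Swapping in a vector-backed class with the same `retrieve` method would leave `build_context` untouched, which is the whole point of the abstraction.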

Here is a minimal example that reflects how I actually start RAG prototypes:

# Import paths follow the split-package layout (langchain-openai, langchain-milvus).
# RetrievalQA is a legacy chain: on LangChain 1.x, import it from langchain_classic.chains instead.
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_milvus import Milvus
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings()
vectorstore = Milvus(
    collection_name="knowledge_base",
    embedding_function=embeddings,
    connection_args={"uri": "http://localhost:19530"},
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    retriever=retriever,
    chain_type="stuff",
)
# invoke() replaces the deprecated run(); RetrievalQA returns its answer under "result"
answer = qa.invoke({"query": "How does our authentication flow work?"})["result"]

Nothing fancy. But it is readable, debuggable, and easy to evolve. That matters more than cleverness when you are iterating on retrieval quality at 2am.

Agents: Powerful but Easy to Misuse

LangChain’s agent abstractions get a lot of attention. Agents let the model decide which tool to call, observe the result, and decide what to do next. This is genuinely powerful for certain use cases — research assistants, data exploration tools, anything where the task is not fully known upfront.

They are also one of the easiest ways to build something fragile.

Here is what actually happened the first time I deployed an agent in production: it worked perfectly on our test queries, then a customer asked a slightly ambiguous question and the agent entered a loop — calling the same search tool four times with progressively worse reformulations, burning through tokens and returning garbage. The fix was simpler than I expected: I added a max-iteration cap and a confidence check after each tool call.
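
A minimal sketch of that kind of guard, assuming the tool returns a result together with a confidence score. The tool, the rewriter, and both thresholds are hypothetical placeholders rather than LangChain APIs.

```python
from typing import Callable, List, Tuple

def run_agent(
    query: str,
    search: Callable[[str], Tuple[str, float]],
    reformulate: Callable[[str], str],
    max_iterations: int = 3,
    min_confidence: float = 0.6,
) -> Tuple[str, List[str]]:
    """Call the tool until a result is confident enough or the cap is hit."""
    tried: List[str] = []
    best_result, best_score = "", -1.0
    for _ in range(max_iterations):
        if query in tried:  # identical reformulation: stop instead of looping
            break
        tried.append(query)
        result, score = search(query)
        if score > best_score:
            best_result, best_score = result, score
        if score >= min_confidence:
            break
        query = reformulate(query)
    return best_result, tried

# Hypothetical tool and rewriter, for illustration only
def fake_search(q: str) -> Tuple[str, float]:
    return ("doc about " + q, 0.9 if "refund policy" in q else 0.2)

def fake_reformulate(q: str) -> str:
    return q + " policy"

result, attempts = run_agent("refund", fake_search, fake_reformulate)
```

The loop keeps the best result seen so far, so hitting the cap still returns something usable instead of the last and possibly worst reformulation.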

I only reach for agents now when the workflow genuinely requires dynamic routing. If you know the steps upfront, use a chain. Chains are predictable. Agents are flexible. Pick the one that matches your problem.

Memory: Less Is Usually More

LangChain offers memory abstractions — conversation buffers, summary memory, entity memory. They are often misused.

The trap is persisting everything. In a customer support bot I built last year, we initially stored the entire conversation history in memory and fed it all back into every prompt. After about fifteen turns, the context window was full of irrelevant early messages, costs were climbing, and response quality had degraded noticeably.

Here is what I actually do now. I treat memory as a sliding window — last five turns maximum, with a separate summary buffer that condenses older context into a single paragraph. For anything that needs to persist beyond the session, I write it to the vector database and retrieve it on demand. This keeps token costs flat and retrieval relevant.
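
The sliding window plus summary buffer can be sketched in a few lines. The class name and the naive summarizer are assumptions for illustration; a real summarizer would call an LLM to condense evicted turns.

```python
from collections import deque
from typing import Callable, Deque, List

class WindowedMemory:
    """Keeps the last `max_turns` turns verbatim and folds older ones into a summary."""
    def __init__(self, summarize: Callable[[str, str], str], max_turns: int = 5) -> None:
        self.recent: Deque[str] = deque()
        self.summary = ""
        self.max_turns = max_turns
        self.summarize = summarize

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        while len(self.recent) > self.max_turns:
            evicted = self.recent.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def as_prompt_context(self) -> str:
        parts: List[str] = []
        if self.summary:
            parts.append("Summary of earlier conversation: " + self.summary)
        parts.extend(self.recent)
        return "\n".join(parts)

# Placeholder summarizer: a real one would call an LLM to condense the text
def naive_summarize(summary: str, evicted_turn: str) -> str:
    return (summary + " " + evicted_turn[:40]).strip()

memory = WindowedMemory(naive_summarize, max_turns=5)
for i in range(1, 8):
    memory.add_turn("turn " + str(i))
```

After seven turns, the prompt context holds one summary line plus the five most recent turns, so its size stays roughly flat no matter how long the session runs.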

The same principle applies to Multimodal RAG systems where your context is not just text but images and tables. Memory bloat gets worse when you are dealing with unstructured data across multiple modalities — another reason to be aggressive about what you keep and what you discard.

What I Actually Measure

When evaluating a LangChain-based system, I do not care about demo impressiveness or how many components are chained together. I measure four things:

• Retrieval precision at k=5 — are the chunks actually relevant?

• End-to-end latency p95 — can a user wait this long?

• Token cost per query — can we afford this at 10x current traffic?

• Answer faithfulness — does the response actually follow the retrieved context, or is the model making things up?
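
The first two metrics are simple enough to compute by hand. Here is a small sketch, assuming you have labeled relevant chunk IDs for a set of test queries and a list of measured latencies; the sample data is made up.

```python
import math
from typing import List, Sequence, Set

def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant.
    Divides by the number of chunks actually returned, which may be fewer than k."""
    top_k = list(retrieved_ids)[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

p_at_5 = precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "z"})
latencies_ms = [120, 135, 150, 160, 180, 200, 220, 260, 310, 900]
p95 = percentile(latencies_ms, 95)
```

With only ten samples, the nearest-rank p95 is the slowest request, which is a reminder to collect enough traffic before trusting a tail-latency number.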

LangChain helps with iteration speed on all of these, but it does not optimize any of them by default. You still need to tune your chunking strategy, pick the right embedding model, and configure your vector index parameters. The abstraction layer just makes it faster to experiment.

When I Skip LangChain

LangChain is not always the right tool. I skip it when:

• The application is a single prompt with no retrieval — just call the API directly.

• Latency is critical and I need full control over every network call.

• The team is experienced enough to build the plumbing themselves and LangChain’s abstractions would add indirection without adding clarity.

• I need fine-grained streaming control that LangChain’s abstractions make awkward.

LangChain shines during exploration and early system design. Mature systems sometimes outgrow it, and that is completely fine. The architecture it helped you discover is the real value — not the library itself.

What Comes Next

I have been experimenting with LangGraph for workflows that need conditional branching and cycles — things that plain chains cannot express cleanly. But that is a different post. For now, if you are building your first RAG system or trying to bring structure to an LLM prototype that has gotten out of hand, LangChain is still the fastest way to get from “it works in a notebook” to “it works in production.”

Just remember: it is a wiring diagram, not the system itself. The quality of your retrieval, your embeddings, and your prompts still determines whether the thing actually works.

LangChain vs LangGraph

Priya Singh — Mon, 27 Apr 2026 09:16:31 GMT

Last week I was rebuilding a document Q&A pipeline that had outgrown its original design. What started as a clean RAG chain — retrieve context, stuff it into a prompt, get an answer — had turned into something with retry logic, a verification step that called the LLM a second time, and a branching path where certain queries got routed to a different retriever entirely. I had duct-taped it together with nested if-statements and callbacks, and it was becoming painful to debug. That’s when I sat down and properly evaluated whether LangChain alone was still the right tool, or whether LangGraph — the graph-based orchestration layer from the same team — was what I actually needed.

If you’re at a similar crossroads, here’s what I learned from living with both in production.

What LangChain Actually Does Well

LangChain is middleware. It sits between your model and your application and gives you a library of connectors and abstractions so you don’t have to write boilerplate for every integration. Need to load PDFs, split them into chunks, embed them, store them in a vector database, and query them with semantic search? LangChain has components for each of those steps, and they plug together with a consistent interface.

The core orchestration mechanism is called LCEL (LangChain Expression Language). You pipe components together in a sequence — retriever, prompt template, model, output parser — and it handles the data flow. For straightforward pipelines, this works well. I’ve shipped several RAG services where LangChain was the right level of abstraction: connect to a Milvus instance for vector similarity search, feed results into a prompt, return the answer. Done in under a hundred lines.

Where LangChain earns its keep is the component library. Document loaders for dozens of formats. Text splitters that understand markdown structure. Vector embeddings wrappers for OpenAI, Cohere, HuggingFace, and others. Model interfaces that let you swap providers without rewriting your chain. If your use case fits the pattern of “connect things in a line and run them,” LangChain is genuinely productive.

Where LangChain Starts to Creak

The trouble starts when your workflow isn’t linear. My Q&A pipeline needed to:

1. Retrieve candidate documents

2. Check whether the retrieved context was actually relevant (a lightweight classifier)

3. If not, reformulate the query and try again — up to two retries

4. If relevant, generate an answer

5. Run a fact-check pass against the original documents to reduce AI hallucination

6. Return the answer with confidence metadata

Steps 1 through 3 form a loop, and the choice between step 3 and step 4 is a conditional branch. LangChain’s Memory components can hold simple conversational context, but they weren’t designed for tracking “which retry am I on” or “did the verification pass.” I could make it work with custom callbacks and state variables stuffed into the chain’s metadata, but I was fighting the abstraction rather than using it.

This is the honest assessment: LangChain is good at pipelines, not at workflows. A pipeline is a sequence. A workflow has branches, loops, retries, and state that persists across steps. If you need the latter, you need something built for it.

What LangGraph Brings to the Table

LangGraph models your application as a directed graph. Each node is an action: calling an LLM, querying a database, running a classifier, formatting output. Edges define how control flows between nodes, and they can be conditional. You get loops, branching, retries, and parallel execution as first-class concepts rather than hacks.

The critical difference is state management. LangGraph maintains a centralized state object that every node can read from and write to. It supports rollbacks and history, so you can inspect exactly what happened at each step. When my verification node decided the retrieval was bad, it could increment a retry counter in the state, modify the query, and route back to the retrieval node, all expressed declaratively in the graph definition.

Here’s a simplified version of what the refactored pipeline looked like:

from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class PipelineState(TypedDict):
    query: str
    documents: List[str]
    answer: str
    is_relevant: bool
    retry_count: int

def retrieve(state: PipelineState) -> dict:
    # Call your vector store here (e.g., a Milvus retriever via LangChain)
    docs = retriever.invoke(state["query"])
    return {"documents": [d.page_content for d in docs]}

def check_relevance(state: PipelineState) -> dict:
    # Lightweight classifier or LLM call to judge relevance
    score = relevance_classifier(state["query"], state["documents"])
    return {"is_relevant": score > 0.7}

def reformulate_query(state: PipelineState) -> dict:
    new_query = query_rewriter.invoke(state["query"])
    return {"query": new_query, "retry_count": state["retry_count"] + 1}

def generate_answer(state: PipelineState) -> dict:
    answer = rag_chain.invoke({
        "context": "\n".join(state["documents"]),
        "question": state["query"]
    })
    return {"answer": answer}

def route_after_relevance(state: PipelineState) -> str:
    if state["is_relevant"]:
        return "generate"
    if state["retry_count"] < 2:
        return "reformulate"
    return "generate"  # Give up retrying, answer with what we have

graph = StateGraph(PipelineState)
graph.add_node("retrieve", retrieve)
graph.add_node("check_relevance", check_relevance)
graph.add_node("reformulate", reformulate_query)
graph.add_node("generate", generate_answer)

graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "check_relevance")
graph.add_conditional_edges("check_relevance", route_after_relevance, {
    "generate": "generate",
    "reformulate": "reformulate",
})
graph.add_edge("reformulate", "retrieve")
graph.add_edge("generate", END)

app = graph.compile()

result = app.invoke({
    "query": "How do I configure HNSW index parameters?",
    "documents": [],
    "answer": "",
    "is_relevant": False,
    "retry_count": 0,
})

Every step is a plain function. The graph definition is separate from the logic. You can look at the graph structure and understand the flow without reading through implementation details. That separation is what I was missing.

A Design Tradeoff That Bit Me

One thing I didn’t expect: LangGraph’s state management adds overhead that matters at scale. Every node invocation serializes and deserializes the state object. For my pipeline, the state included retrieved document texts — sometimes 15–20 chunks of 500 tokens each. Serializing that on every node transition added measurable latency, roughly 40–60ms per step on a modestly-sized state.

My fix was to store document references (IDs) in the graph state rather than full text, and only hydrate the content when a node actually needed it. This meant adding a lightweight cache layer, but it cut per-step overhead significantly. The lesson: keep your LangGraph state lean. Treat it like you’d treat a database row, not a dumping ground for intermediate artifacts.
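
A sketch of the lean-state idea, assuming a simple in-process cache. The document store, the IDs, and the node function below are hypothetical; in a real system the lookup would go to your vector database or a key-value cache.

```python
from functools import lru_cache
from typing import Dict, List, TypedDict

# Hypothetical document store; in practice this would be your vector DB or a KV cache
DOC_STORE: Dict[str, str] = {
    "doc-1": "HNSW builds a layered proximity graph.",
    "doc-2": "efConstruction trades build time for recall.",
}

class LeanState(TypedDict):
    query: str
    document_ids: List[str]  # references only, not full chunk text

@lru_cache(maxsize=1024)
def load_document(doc_id: str) -> str:
    # Called only by nodes that need the text; repeated lookups hit the cache
    return DOC_STORE[doc_id]

def generate_answer(state: LeanState) -> dict:
    context = "\n".join(load_document(doc_id) for doc_id in state["document_ids"])
    return {"answer": "Answer drafted from " + str(len(context)) + " characters of context"}

state: LeanState = {"query": "How do I tune HNSW?", "document_ids": ["doc-1", "doc-2"]}
update = generate_answer(state)
```

The state that moves between nodes now carries a couple of short strings per document instead of several thousand tokens of text.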

The other tradeoff is complexity. For a straightforward RAG pipeline — retrieve, prompt, answer — LangGraph is overkill. You’re defining nodes, edges, state schemas, and routing functions for something that LCEL handles in five lines. I’ve seen teams adopt LangGraph prematurely because it feels more “serious,” then spend days debugging graph definitions for workflows that are fundamentally linear. Use it when you actually need branching or state. Not before.

Multi-Agent Coordination

Where LangGraph genuinely shines is multi-agent setups. I’ve been experimenting with a system where one agent handles retrieval and answer generation, a second agent handles fact-checking, and a third handles query understanding and routing. Each agent is a subgraph with its own internal logic, and the parent graph coordinates them.

This pattern — sometimes called Agentic RAG — is where the graph abstraction pays for itself. Agents can run in parallel where their inputs are independent. The parent graph manages shared state, handles timeouts, and defines fallback behavior. Trying to build this with plain LangChain chains and callbacks would be a nightmare of spaghetti logic.

LangGraph also integrates with LangSmith for debugging, which becomes essential when you have multiple agents making decisions. Being able to trace which node fired, what state it saw, and what it produced is the difference between a debuggable system and a black box.

Production Considerations

A few things I’ve learned deploying both in production:

Latency budgets matter. Each node transition in LangGraph has overhead. For user-facing applications with tight latency requirements, count your nodes carefully. A graph with eight nodes and three conditional branches will be slower than a single LCEL chain, even if the actual LLM calls are identical. Profile early.

Batching is easier with LangChain. If you’re processing bulk documents — say, embedding and indexing thousands of pages into Zilliz Cloud — LangChain’s batch interfaces are more mature. LangGraph is designed for single-request workflows, not batch ETL.

State persistence needs planning. LangGraph supports checkpointing state to external storage, which is critical for long-running conversations or multi-turn interactions. But you need to choose your backend (Redis, Postgres, etc.) and handle serialization yourself. It’s not plug-and-play.

Testing graph logic separately from node logic is the single best practice I can recommend. Write unit tests for each node function in isolation, then write integration tests for the graph routing. If you mix the two, debugging becomes miserable.
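
Routing functions are plain Python, so they can be tested without compiling a graph or calling a model. A minimal sketch using the standard library's unittest, with the routing rule from the pipeline above repeated so the example is self-contained:

```python
import unittest

def route_after_relevance(state: dict) -> str:
    # Same routing rule as the pipeline above, repeated so the test is self-contained
    if state["is_relevant"]:
        return "generate"
    if state["retry_count"] < 2:
        return "reformulate"
    return "generate"

class RouteAfterRelevanceTest(unittest.TestCase):
    def test_relevant_context_goes_to_generate(self):
        self.assertEqual(route_after_relevance({"is_relevant": True, "retry_count": 0}), "generate")

    def test_irrelevant_context_retries(self):
        self.assertEqual(route_after_relevance({"is_relevant": False, "retry_count": 1}), "reformulate")

    def test_retry_budget_exhausted_falls_back_to_generate(self):
        self.assertEqual(route_after_relevance({"is_relevant": False, "retry_count": 2}), "generate")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(RouteAfterRelevanceTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Node functions get the same treatment with their dependencies stubbed out, and only the integration tests need a compiled graph.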

When to Use Which

Use LangChain alone when your workflow is a pipeline: data in, steps in sequence, result out. RAG over a vector index, summarization chains, simple chatbots with memory. It’s productive, well-documented, and has integrations for practically everything.

Use LangGraph when your workflow has loops, branches, retries, or multiple agents making decisions. Customer service bots that escalate based on sentiment. Research assistants that iteratively refine their searches. Any system where “what happens next” depends on “what just happened.”

Use both together — and this is what I ended up doing. LangChain provides the component library: retrievers, model wrappers, document loaders, embedding functions. LangGraph provides the orchestration: how those components interact, when they retry, how state flows between them. The retriever node in my graph uses a LangChain retriever under the hood. The LLM calls go through LangChain’s model interface. LangGraph just manages the flow.

That’s the practical answer. Not one or the other — the right layer for the right job.

Building Interactive AI Chatbots with Vector Search

Priya Singh — Thu, 23 Apr 2026 09:06:01 GMT

Last week I was helping a fintech client migrate their support chatbot from a simple FAQ matcher to something that could actually hold a context-aware conversation. The old system would crumble the moment someone asked a follow-up question or phrased something slightly differently than the training data. “Why can’t I see my recent transaction?” would work fine, but “Where did my money go?” would send it into a loop of generic responses.

The breakthrough came when I stopped thinking about it as a keyword-matching problem and started treating it as a vector search problem. Here’s what actually happened when I rebuilt it.

The Core Architecture: Why Vector Databases Changed Everything

Traditional chatbots rely on exact matches or simple pattern recognition. You build a library of intents, map keywords to responses, and hope users phrase things the way you anticipated. It breaks down fast in production.

A vector database solves this by converting both your knowledge base and user queries into numerical representations — vectors — that capture semantic meaning. When someone asks “Where did my money go?”, the system doesn’t look for the word “transaction”. It looks for *concepts* that are mathematically similar in high-dimensional space.
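
To see what "mathematically similar" means, here is a toy cosine-similarity calculation. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math
from typing import Dict, List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real models produce far more dimensions
kb: Dict[str, List[float]] = {
    "Why is my recent transaction missing?": [0.9, 0.1, 0.2],
    "How do I reset my password?": [0.1, 0.9, 0.1],
}
query_vector = [0.8, 0.2, 0.3]  # pretend embedding of "Where did my money go?"

best_match = max(kb, key=lambda text: cosine_similarity(query_vector, kb[text]))
```

The query shares no keywords with the transaction article, yet its vector points in nearly the same direction, so that article wins.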

Here’s the simplest example I can show. Let’s say a user asks about Vietnamese restaurants nearby. The flow looks like this:

from openai import OpenAI
from pymilvus import connections, Collection
# Connect to vector database
connections.connect(host="localhost", port="19530")
collection = Collection("restaurant_kb")
collection.load()  # a collection must be loaded before it can be searched
# Uses the openai>=1.0 client, which reads OPENAI_API_KEY from the environment
client = OpenAI()
# User query
user_query = "What are the best Vietnamese restaurants near me?"
# Generate embedding for query
query_embedding = client.embeddings.create(
    input=user_query,
    model="text-embedding-ada-002"
).data[0].embedding
# Search for similar vectors
# nprobe applies to IVF-style indexes; the metric must match the one the index was built with
search_params = {"metric_type": "COSINE", "params": {"nprobe": 10}}
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=5,
    output_fields=["name", "cuisine", "location", "rating"]
)
# Extract top results
for hit in results[0]:
    print(f"{hit.entity.get('name')} - {hit.entity.get('cuisine')} - {hit.distance}")

The query gets vectorized, the database finds the five closest matches in vector space, and the chatbot can now synthesize a response like “Based on your location and preferences, Pho 79 and Saigon Bistro are highly rated options within 2 miles.”

What makes this powerful isn’t just the search — it’s the *memory*. The chatbot can store previous interactions as vectors too, so when the user follows up with “What about Italian instead?”, the system understands the context without needing explicit session management.

What I Learned Building This in Production

The first version I deployed had terrible latency. Generating embeddings on every query was eating 200–300ms per request, and our SLA was 500ms total. I ended up batching non-urgent updates and caching common queries, but the real fix was simpler than I expected: pre-computing embeddings for the knowledge base and only generating them on-the-fly for user input.
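
Caching common queries can be as simple as memoizing the embedding call on a normalized form of the text. In this sketch `embed_remote` is a placeholder for the real embedding API, and the normalization rule is an assumption.

```python
from functools import lru_cache
from typing import Tuple

def embed_remote(text: str) -> Tuple[float, ...]:
    # Placeholder for a real embedding API call, which is the slow part
    return tuple(float(ord(ch) % 7) for ch in text[:8])

@lru_cache(maxsize=10_000)
def embed_query(normalized_text: str) -> Tuple[float, ...]:
    # Tuples are returned so cached values cannot be mutated by callers
    return embed_remote(normalized_text)

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

first = embed_query(normalize("Where is my refund?"))
second = embed_query(normalize("  where is my   REFUND? "))
stats = embed_query.cache_info()
```

Both inputs normalize to the same string, so the second call is served from the cache and never reaches the embedding service.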

One thing I learned the hard way: not all embedding models are created equal for conversational data. I started with a general-purpose BERT model and got mediocre results because it wasn’t trained on dialogue patterns. Switching to a domain-tuned transformer (fine-tuned on customer support transcripts) cut our false positive rate in half.

The tradeoff was model size and inference cost. The general BERT model was 110MB and ran locally; the fine-tuned one was 340MB and needed GPU inference. For this client, the accuracy gain was worth it, but I’ve had other projects where a lighter model made more sense because they were optimizing for response time over precision.

Handling the Messy Reality of User Input

Users don’t type clean, grammatically correct sentences. They abbreviate, misspell, use slang, or paste entire error messages into the chat. I had to layer in several NLP techniques to normalize input before vectorization:

• Text normalization: lowercase, strip extra whitespace, expand contractions

• Synonym expansion: map “acc” to “account”, “txn” to “transaction”

• Fallback mechanisms: if vector similarity score is below a threshold (I used 0.65 for cosine similarity), trigger a clarification prompt instead of guessing
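
The fallback rule in the last bullet might look like the sketch below. It assumes search hits arrive as (answer text, cosine similarity) pairs; the clarification message is a placeholder.

```python
from typing import List, Tuple

CLARIFY_MESSAGE = "I want to make sure I understand. Could you rephrase or add a bit more detail?"

def choose_response(hits: List[Tuple[str, float]], threshold: float = 0.65) -> str:
    """hits are (answer_text, cosine_similarity) pairs returned by the vector search."""
    if not hits:
        return CLARIFY_MESSAGE
    best_text, best_score = max(hits, key=lambda hit: hit[1])
    if best_score < threshold:
        return CLARIFY_MESSAGE  # too uncertain: ask instead of guessing
    return best_text

confident = choose_response([("Your balance is shown on the Accounts tab.", 0.82)])
uncertain = choose_response([("Wire fees are listed in the fee schedule.", 0.41)])
```

The right threshold depends on the embedding model and the metric, so 0.65 is a starting point to tune against real queries, not a constant to copy.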

Here’s a snippet of the preprocessing pipeline I used:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# One-time setup: nltk.download("stopwords") and nltk.download("wordnet")
def preprocess_query(text, remove_stopwords=True):
    # Lowercase and collapse extra whitespace
    text = re.sub(r"\s+", " ", text.lower().strip())
    # Expand common abbreviations, keeping a plural "s" ("txns" -> "transactions")
    abbreviations = {
        "acc": "account",
        "txn": "transaction",
        "bal": "balance",
        "stmt": "statement"
    }
    for abbr, full in abbreviations.items():
        text = re.sub(r"\b" + abbr + r"(s?)\b", full + r"\1", text)
    # Strip punctuation (except apostrophes) so tokens can match the stopword list
    text = re.sub(r"[^\w\s']", "", text)
    tokens = text.split()
    # Remove stopwords (optional; depends on the embedding model)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in tokens if w not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return " ".join(tokens)
# Example
raw_query = "Why can't I see my recent txns?"
clean_query = preprocess_query(raw_query)
print(clean_query)  # "txns" becomes "transaction"; which other words survive depends on the NLTK stopword list

I debated whether to remove stopwords before embedding. Turns out it depends on your model — transformer-based embeddings often handle stopwords fine, but older bag-of-words approaches benefit from stripping them out. I A/B tested both and kept stopwords in because the transformer model used positional context.

Scaling to Real Traffic

The prototype ran on a single Milvus instance on a 4-core VM. That worked fine for internal testing, but production traffic spiked unpredictably — 50 concurrent users one hour, 500 the next. I needed horizontal scaling without rewriting the entire stack.

I migrated to Zilliz Cloud, which is a managed version of Milvus. The main wins were:

• Auto-scaling: it spins up replicas during traffic spikes and scales down overnight

• Caching: frequently queried vectors are cached at the edge, shaving 50–100ms off latency

• No ops overhead: I don’t have to tune index parameters or manage sharding myself

The latency improvement was measurable. Median query time dropped from 280ms to 120ms, and P95 went from 600ms to 250ms. Part of that was network proximity (their cluster was closer to our app servers), but the built-in query optimization was the bigger factor.

Retrieval-Augmented Generation: The Secret Weapon

The real magic happened when I wired the vector search layer into a RAG pipeline. Instead of just retrieving similar documents and showing them to the user, I fed the top results into a large language model as context for generation.

Here’s the workflow:

1. User asks: “What’s the fee for international wire transfers?”

2. Vector search pulls the top 3 relevant KB articles (fee schedules, wire transfer guide, international banking FAQ)

3. Those articles get passed as context to GPT-4

4. The model generates a natural-language answer grounded in those docs

Let me show you exactly how I wired this up:

from openai import OpenAI
from pymilvus import connections, Collection

# Uses the openai>=1.0 client, which reads OPENAI_API_KEY from the environment
client = OpenAI()
connections.connect(host="localhost", port="19530")

def rag_chatbot(user_query):
    # Step 1: Vectorize query
    query_embedding = client.embeddings.create(
        input=user_query,
        model="text-embedding-ada-002"
    ).data[0].embedding
    # Step 2: Search vector database
    collection = Collection("support_kb")
    collection.load()  # a collection must be loaded before it can be searched
    results = collection.search(
        data=[query_embedding],
        anns_field="embedding",
        param={"metric_type": "COSINE", "params": {"nprobe": 16}},
        limit=3,
        output_fields=["title", "content"]
    )
    # Step 3: Build context from top results
    context = "\n\n".join([
        f"Document: {hit.entity.get('title')}\n{hit.entity.get('content')}"
        for hit in results[0]
    ])
    # Step 4: Generate response with LLM
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful banking assistant. Answer based only on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content
# Example
answer = rag_chatbot("What's the fee for international wire transfers?")
print(answer)

The key constraint here is: answer only from the provided context. This reduces AI hallucination: the model is far less likely to make up fee amounts or policies that don’t exist in your KB, although a prompt instruction alone is not a guarantee. I set temperature to 0.3 to keep responses factual and consistent.

One thing that didn’t work as expected: I initially passed all KB articles as context (thinking more data = better answers). The LLM got overwhelmed and started cherry-picking random sentences. Limiting to the top 3 most relevant docs gave much cleaner, more focused responses.

Conversation Design: Making It Feel Human

The technical stack is only half the battle. If the chatbot sounds robotic or can’t handle conversational flow, users bail.

I built in a few tricks to make interactions feel more natural:

• Context tracking: store the last 3 turns of conversation as vectors, so the model understands references like “that one” or “the second option”

• Personality tuning: adjusted the system prompt to match the brand voice (this client wanted professional but friendly)

• Empathy markers: if the user expresses frustration (“This is ridiculous”), the system detects negative sentiment and routes to a human agent instead of attempting another automated response

Here’s the sentiment check I added:

from textblob import TextBlob

def check_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity < -0.3:  # Negative sentiment threshold
        return "escalate"
    return "continue"
# Example
user_message = "This is ridiculous, why can't you just tell me the fee?"
if check_sentiment(user_message) == "escalate":
    print("Routing to human agent...")

Crude but effective. I’ve seen more sophisticated sentiment models (BERT-based classifiers), but for this use case the extra complexity wasn’t justified.

Practical Challenges I Hit

Privacy and Security: Users paste sensitive info into chat — account numbers, SSNs, transaction IDs. I had to implement PII redaction before storing conversation history. I used a regex-based scrubber for obvious patterns (9-digit SSNs, 16-digit card numbers) and flagged edge cases for manual review.
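
A regex-based scrubber along those lines might look like the following. The patterns and placeholder tokens are illustrative assumptions; production redaction usually needs more care, such as Luhn checks for card numbers.

```python
import re

# Illustrative patterns only; real card and SSN detection needs more care
SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")

def redact_pii(text: str) -> str:
    # Redact cards first so the SSN pattern only sees what is left
    text = CARD_PATTERN.sub("[REDACTED_CARD]", text)
    text = SSN_PATTERN.sub("[REDACTED_SSN]", text)
    return text

message = "My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
redacted = redact_pii(message)
```

Run this before anything is written to conversation history or embedded, so the raw values never reach storage.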

Multi-Modal Input: Some users tried to upload screenshots of error messages or PDFs of statements. The initial design only handled text. I extended it to extract text from images (using Tesseract OCR) and parse PDFs, then vectorized the extracted content. This added latency (OCR is slow), so I made it async — the chatbot responds with “Processing your document, one moment…” while the job runs in the background.

Consistency Across Channels: The chatbot ran on web, mobile app, and SMS. Each channel had different character limits and formatting constraints. I built a response adapter layer that truncated or reformatted answers based on the channel, but it was messier than I’d like. Lesson learned: design for the most constrained channel first (SMS) and scale up from there.

What I’d Do Differently Next Time

If I were starting this project today, I’d prototype with a simpler rule-based fallback layer *before* jumping into vector search. There are categories of queries (password resets, account lockouts) that don’t benefit from semantic search — they just need fast, deterministic routing. I could’ve saved a week by handling those upfront and reserving vector search for ambiguous or knowledge-heavy queries.
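
Such a rule-based layer can sit in front of vector search as a cheap first pass. The patterns, replies, and fallback function in this sketch are hypothetical placeholders.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical deterministic intents; patterns and replies are placeholders
RULES: List[Tuple[re.Pattern, str]] = [
    (re.compile(r"\b(reset|forgot)\b.*\bpassword\b"), "Here is the password reset link."),
    (re.compile(r"\b(locked out|account locked)\b"), "Let's unlock your account."),
]

def match_rule(message: str) -> Optional[str]:
    lowered = message.lower()
    for pattern, reply in RULES:
        if pattern.search(lowered):
            return reply
    return None

def answer(message: str, semantic_fallback: Callable[[str], str]) -> str:
    # Fast path first; only ambiguous or knowledge-heavy queries reach vector search
    return match_rule(message) or semantic_fallback(message)

def fake_vector_search(message: str) -> str:
    return "semantic answer for: " + message

fast = answer("I forgot my password", fake_vector_search)
slow = answer("Where did my money go?", fake_vector_search)
```

Queries that match a rule never pay for an embedding call, which also keeps the most common support flows deterministic.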

I’d also spend more time on observability. We didn’t have good visibility into *why* certain queries performed poorly until I added logging for similarity scores, retrieved documents, and LLM token usage. Once I had that data, it was obvious where the bottlenecks were (spoiler: half of them were poorly indexed KB articles with vague titles).

Next Steps If You’re Building This

1. Start small: Pick one domain (support, product recommendations, internal Q&A) and build a focused knowledge base. Trying to be conversational about everything at once is a recipe for mediocrity.

2. Measure retrieval quality separately from generation quality: Your chatbot might give bad answers because the vector search returned irrelevant docs, not because the LLM failed. Instrument both stages independently.

3. Invest in data preprocessing: Garbage in, garbage out. If your KB articles are full of jargon, inconsistent terminology, or outdated info, no amount of vector magic will fix it. Clean your data first.

4. Test with real users early: Internal QA teams won’t phrase questions like actual customers. Get a beta group on it ASAP and watch session replays to see where the model fumbles.
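For point 2, one simple way to score retrieval on its own is recall@k over a small hand-labeled set of queries. This is a sketch under assumptions: the function names and the input shape (lists of document IDs) are mine, and it says nothing about generation quality, which needs its own evaluation.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k=5):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Each pair: what the vector search returned, and what a human marked as relevant.
runs = [(["kb-1", "kb-7", "kb-3"], ["kb-1", "kb-9"]), (["kb-2"], ["kb-2"])]
print(mean_recall_at_k(runs, k=3))
```

If this number is low, fix indexing and chunking before touching prompts or the LLM.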

Vector-powered chatbots aren’t perfect, but they’re the closest I’ve come to building something that feels genuinely conversational at scale. The combination of semantic search and retrieval-augmented generation bridges the gap between rigid FAQ bots and fully custom generative AI agents. And once you crack the architecture, the hard part shifts from “Can we do this?” to “How do we scale this sustainably?” — which is exactly where you want to be.

Blog: LangChain 1.0 and Milvus: How to Build Production-Ready Agents with Real Long-Term Memory

Priya Singh — Tue, 21 Apr 2026 02:44:46 GMT

Last week I was debugging an agent that kept “forgetting” things. We’d ask it to recall a decision from three sessions ago — the kind of thing that would be obvious to any human support engineer — and it confidently stated it had no record of it. The data existed somewhere. But our agent couldn’t reach it. The reasoning was fine; the memory architecture was not.

That pushed me to do a proper audit of our stack. We were running an older LangChain setup with its Chain-based patterns, and a lot of the production problems we’d been fighting — context overflows, state loss between restarts, boilerplate code every time we swapped model providers — traced back to design choices baked into that older version.

Here’s what I found when I actually dug into LangChain 1.0 and paired it with Milvus for persistent memory.

Why the Chain-Based Design Was a Problem in Production

The original Chain-based design in LangChain 0.x worked well for prototypes. Wire up a SimpleSequentialChain, add a prompt template and an LLM call, and something works in half an hour. That's genuinely useful when you're validating an idea.

But chains are rigid. They define a fixed execution path. The moment your logic needs to branch — retry with different context, choose a different tool based on an intermediate result — you’re fighting the framework. I’ve seen teams end up with deeply nested custom chains that nobody could debug. Others bypassed LangChain entirely and called the API directly.

The other issue was production control. Chains had no built-in concept of middleware or execution hooks. PII redaction? Write it yourself. Token limit handling? Your problem. Human-in-the-loop approval? Figure it out and wire it manually.

LangChain 1.0: The ReAct Loop as the Default

The core shift in LangChain 1.0 is committing fully to the ReAct pattern: Reason → Tool Call → Observe → Decide. The team analyzed production agent implementations across the ecosystem and found that successful agents converge on this loop regardless of use case. So they made it the standard.

The entry point is create_agent():

from langchain.agents import create_agent

agent = create_agent(
    model="openai:gpt-4o",
    tools=[search_knowledge_base, query_crm],
    system_prompt="You are a support agent. Use the tools to answer customer questions accurately."
)

result = agent.invoke({
    "messages": [{"role": "user", "content": "What's the status of order #12345?"}]
})

Three parameters. A working agent. Under the hood this runs on LangGraph, which gives you state persistence, interruption recovery, and streaming without writing any of that infrastructure yourself.

One thing worth noting: the model parameter accepts either a string identifier or a pre-instantiated model object. In production, you'll often need the object form — you may need to configure timeouts, retry settings, or API keys that aren't the defaults. Pass the object.

Middleware: Where Production Control Lives

What made LangChain 1.0 immediately useful for our team was the Middleware system. It exposes hooks at strategic points in the ReAct loop — before model calls, after tool responses, at termination — so you can inject logic without modifying core agent code.

PII detection is one of the prebuilt options. We use it to redact sensitive fields before they reach third-party models:

from langchain.agents import create_agent
from langchain.agents.middleware import PIIMiddleware

agent = create_agent(
    model="gpt-4o",
    tools=[],
    middleware=[
        PIIMiddleware("email", strategy="redact", apply_to_input=True),
        PIIMiddleware("credit_card", strategy="mask", apply_to_input=True),
        PIIMiddleware("api_key", detector=r"sk-[a-zA-Z0-9]{32}", strategy="block"),
    ],
)

Summarization is the other one I use constantly. When conversation history approaches token limits, it automatically condenses older messages:

from langchain.agents.middleware import SummarizationMiddleware

agent = create_agent(
    model="gpt-4o",
    tools=[weather_tool, crm_tool],
    middleware=[
        SummarizationMiddleware(
            model="gpt-4o-mini",
            max_tokens_before_summary=4000,
            messages_to_keep=20,
        ),
    ],
)

Here’s the design tradeoff that matters: summarization reduces token usage but loses precision. Summaries flatten detail. For domains where specific facts matter — exact figures, previous commitments, specific case notes — you need a complementary store that preserves raw details even after context gets compressed. That’s where Milvus comes in.

Tool retry with configurable exponential backoff is also worth setting up early:

from langchain.agents.middleware import ToolRetryMiddleware

agent = create_agent(
    model="gpt-4o",
    tools=[database_tool, search_tool],
    middleware=[
        ToolRetryMiddleware(
            max_retries=3,
            backoff_factor=2.0,
            initial_delay=1.0,
            max_delay=60.0,
            jitter=True,
        ),
    ],
)

Add jitter=True. Without it, multiple agent instances will all retry a failed service at the same moment and you'll amplify the problem instead of recovering from it.
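To see why jitter matters, here is a standalone sketch of exponential backoff with "full jitter". It is not LangChain's internal implementation; the parameter names only mirror the configuration above for readability, and the uniform-jitter strategy is an assumption.

```python
import random


def backoff_delays(max_retries=3, initial_delay=1.0, backoff_factor=2.0,
                   max_delay=60.0, jitter=True, rng=None):
    """Return the wait (in seconds) before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential growth, capped at max_delay.
        delay = min(initial_delay * backoff_factor ** attempt, max_delay)
        if jitter:
            # Full jitter: pick uniformly in [0, delay] so instances spread out.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(jitter=False))  # every instance waits the same schedule
print(backoff_delays(jitter=True))   # each instance gets its own schedule
```

Without jitter, every instance that failed at the same moment computes an identical schedule, which is exactly the synchronized retry burst described above.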

Wiring Up Long-Term Memory with Milvus

The summarization tradeoff I mentioned above is real. Once you summarize a long session, detailed context — specific resolutions, exact numbers, prior decisions — gets compressed or dropped. An agent trying to recall something from a past session can’t reach it.

The fix is pairing short-term context management with a proper long-term memory layer backed by vector search. I used Milvus as the vector database for this. The langchain_milvus package wraps it as a standard VectorStore:

from langchain.agents import create_agent
from langchain_milvus import Milvus
from langchain_openai import OpenAIEmbeddings
from langchain.agents.middleware import SummarizationMiddleware
from langgraph.checkpoint.memory import InMemorySaver

long_term_memory = Milvus(
    embedding=OpenAIEmbeddings(),
    collection_name="agent_memory",
    connection_args={"uri": "http://localhost:19530"}
)

agent = create_agent(
    model="openai:gpt-4o",
    tools=[
        long_term_memory.as_retriever().as_tool(
            name="recall_memory",
            description="Retrieve relevant historical context and past decisions"
        ),
        query_crm,
    ],
    checkpointer=InMemorySaver(),
    middleware=[
        SummarizationMiddleware(
            model="openai:gpt-4o-mini",
            max_tokens_before_summary=4000,
        )
    ],
    system_prompt="You have access to historical context. Use recall_memory when you need to retrieve past interactions."
)

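The pattern: short-term context lives in LangGraph’s checkpointer (fast, in-session), while important interactions get vectorized and stored in Milvus for cross-session recall. When the agent needs something from a past session, it calls recall_memory, which runs a semantic search against the Milvus collection and returns the most relevant chunks. Note that InMemorySaver keeps checkpoints in process memory only, so it is fine for local testing but loses state on restart; for production, swap in a persistent checkpointer.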

One thing I learned the hard way: be deliberate about what you write to long-term memory. We initially stored every message, which flooded retrieval with noise. The signal degraded fast. We switched to writing only resolved interactions, key decisions, and explicitly stated user preferences. Retrieval quality improved noticeably.
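The write-side filter can be as simple as an allow-list applied before anything is embedded. The event kinds and the `MemoryEvent` type below are hypothetical stand-ins for whatever your agent emits; the sketch only shows the gating step, not the Milvus write itself.

```python
from dataclasses import dataclass

# Hypothetical event kinds; the categories mirror the ones described above.
DURABLE_KINDS = {"resolved_interaction", "key_decision", "user_preference"}


@dataclass
class MemoryEvent:
    kind: str  # e.g. "chat_message", "key_decision"
    text: str


def select_for_long_term(events):
    """Keep only events worth persisting; raw chat turns are dropped."""
    return [e for e in events if e.kind in DURABLE_KINDS and e.text.strip()]


events = [
    MemoryEvent("chat_message", "hi, are you there?"),
    MemoryEvent("key_decision", "Customer approved the upgrade to the annual plan."),
    MemoryEvent("user_preference", "Prefers email over phone."),
]
print([e.kind for e in select_for_long_term(events)])
```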

Structured Output Without the Per-Provider Boilerplate

This is a smaller win but a real one. Before LangChain 1.0, getting structured output from an agent meant writing provider-specific code. OpenAI has a native JSON mode; other models require tool-call workarounds. Every model switch meant rewriting adapters.

Now you define a Pydantic schema and pass it as response_format:

from langchain.agents import create_agent
from pydantic import BaseModel, Field

class TicketSummary(BaseModel):
    issue_category: str = Field(description="Category of the support issue")
    resolution: str = Field(description="How the issue was resolved")
    follow_up_required: bool = Field(description="Whether follow-up action is needed")

agent = create_agent(
    model="openai:gpt-4o",
    tools=[query_crm],
    response_format=TicketSummary,
    system_prompt="After resolving an issue, return a structured summary."
)

LangChain detects whether the model supports native structured output and selects the enforcement strategy automatically. Switch from GPT-4o to another model and this code doesn’t change.

LangChain vs LangGraph: Choosing the Right Layer

A question that comes up: when do you use create_agent() versus building directly in LangGraph?

create_agent() covers the majority of standard agent scenarios — a single agent that reasons, calls tools, and returns a result. LangGraph becomes necessary when you need custom state machines: agent A handles step one, passes state to agent B for step two, with conditional routing based on intermediate results. That's outside what create_agent() is designed for.

The practical thing is that they’re complementary. You can start with create_agent() and introduce LangGraph for the specific parts of your workflow that need finer control. There's no need to choose one and commit upfront.

Production Considerations Before You Ship

A few things that matter once you leave local testing:

Embedding model versioning: If you vectorize new documents with a different model version than what your existing index was built with, retrieval quality silently degrades. Version-lock your embedding model and record which version each Milvus collection was indexed with. This sounds obvious until you’ve lost a week debugging retrieval regressions to a model version bump.
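One lightweight way to enforce this is a guard that runs before every write or query. The registry dict, the tag format, and the function name are assumptions for illustration; in practice the tag could live in collection metadata or deployment config.

```python
# Hypothetical registry mapping collection name -> embedding model tag.
COLLECTION_MODELS = {"agent_memory": "embed-model-v1"}


def check_embedding_model(collection: str, model_tag: str) -> None:
    """Raise if a write or query would mix embedding models within one collection."""
    expected = COLLECTION_MODELS.get(collection)
    if expected is None:
        # First write: remember which model built this index.
        COLLECTION_MODELS[collection] = model_tag
    elif expected != model_tag:
        raise ValueError(
            f"{collection} was indexed with {expected}, got {model_tag}; reindex first"
        )


check_embedding_model("agent_memory", "embed-model-v1")  # passes silently
```

Failing loudly here is the point: a version mismatch otherwise shows up only as slowly degrading retrieval.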

Milvus deployment mode: For development, Milvus Lite runs in-process with no server required. For production, you need a standalone or distributed deployment — or a managed option like Zilliz Cloud if you’d rather not handle the operational overhead. The connection args change; your application code doesn’t.

Retrieval latency in the loop: Each tool call adds a round trip. If your agent calls Milvus on every turn, you’ll feel the latency accumulate. Keep frequently-needed context in short-term memory and only fall back to Milvus retrieval when in-session context doesn’t cover the query. Profile your agent’s tool call patterns early.

The combination of LangChain 1.0’s structured ReAct loop, composable middleware, and Milvus for durable long-term memory covers most of what we needed to move from reactive firefighting to building reliable features on top of a stable agent architecture. The memory problem that started this investigation is solved — and the production controls we’d been writing from scratch are now just configuration.

Locality Sensitive Hashing in the Real World: When Approximation Beats Perfection

Priya Singh — Thu, 16 Apr 2026 06:31:38 GMT

I didn’t learn Locality-Sensitive Hashing (LSH) from textbooks.

I learned it the hard way — when a similarity system that worked beautifully on 1 million items collapsed under 80 million.

At that scale, “exact” stops being elegant and starts being expensive.

LSH isn’t fashionable anymore. It doesn’t get the same attention as neural embeddings or large language models. But if you’ve ever built large-scale retrieval, recommendation, or deduplication systems, you know this truth:

Approximation is often the only thing standing between you and a production outage.

In this post, I want to talk about LSH the way engineers actually encounter it:

not as a theory, but as a trade-off — one that still matters deeply in modern AI systems, including RAG pipelines, multimodal retrieval, and hybrid search stacks.

The Problem That Forces You to Care About LSH

Let’s start with the problem LSH exists to solve.

You have:

High-dimensional data (text embeddings, image vectors, audio features)
A distance metric (cosine similarity, Jaccard, Euclidean)
A requirement to find “similar” items quickly

What you don’t have:

Time to compute exact distances between everything

Brute-force nearest neighbor search scales linearly with the number of stored vectors, and it pays that cost on every query. That’s fine at 100K vectors. It’s unacceptable at 100M.

This is why systems lean on approximate methods — LSH being one of the earliest and most influential.

If embeddings are new to you, this glossary entry on vector embeddings gives a good grounding before we go further.

LSH Intuition (Without the Math Wall)

Here’s the mental model I use.

Instead of comparing vectors directly, LSH asks:

“Can I hash similar things into the same bucket with high probability?”

The trick is that the hash functions are locality-sensitive:

Similar inputs → same hash bucket (likely)
Dissimilar inputs → different buckets (likely)

This flips the problem:

From “search everything”
To “search only the buckets that matter”

You’re trading perfect recall for speed and scalability — and doing it consciously.

The First Time I Used LSH in Production

My first real encounter with LSH was in a near-duplicate detection pipeline:

Millions of user-generated documents
Heavy redundancy
Tight latency requirements

Exact similarity was overkill. We didn’t need the best match — we needed a good enough candidate set.

LSH delivered:

Massive reduction in comparison count
Predictable latency
Easy horizontal scaling

But it also forced us to confront trade-offs early, which is why I still respect it as a system design tool.

Common LSH Variants (And When I’d Actually Use Them)

MinHash (Set Similarity)

If your data looks like:

Token sets
Shingles
Binary features

MinHash is still excellent for estimating Jaccard similarity. I’ve used it for:

Document deduplication
Web crawl cleanup
Feature overlap analysis
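For readers who have not seen it, here is a small MinHash sketch using only the standard library. Seeding MD5 to simulate a family of hash functions is a shortcut I am assuming for illustration; production libraries use faster hash families, and the token sets must be non-empty.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    # A seeded MD5 stands in for a family of independent hash functions.
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=256):
    """One minimum per seeded hash function over a non-empty token set."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of signature positions that agree approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = {str(i) for i in range(100)}       # tokens 0..99
doc_b = {str(i) for i in range(50, 150)}   # tokens 50..149, true Jaccard = 50/150
print(estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

The signatures have a fixed length regardless of document size, which is what makes comparing millions of documents tractable.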

Random Projection LSH

This one comes up more in vector spaces:

Hash via random hyperplanes
Preserve cosine similarity

It’s conceptually simple and surprisingly effective when embeddings are noisy but structured.
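The reason this preserves cosine similarity is that a single random hyperplane puts two vectors on the same side with probability 1 - θ/π, where θ is the angle between them. The sketch below checks that empirically with pure Python; the dimensions, plane count, and noise level are arbitrary choices of mine.

```python
import math
import random


def sign_bits(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, vec)) > 0) for p in planes)


def expected_bit_agreement(u, v):
    """Probability one random hyperplane keeps u and v together: 1 - theta/pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))
    return 1.0 - math.acos(cos) / math.pi


rng = random.Random(0)
dim, num_planes = 16, 2000
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
u = [rng.gauss(0, 1) for _ in range(dim)]
v = [x + 0.3 * rng.gauss(0, 1) for x in u]  # a noisy copy of u

bits_u, bits_v = sign_bits(u, planes), sign_bits(v, planes)
observed = sum(a == b for a, b in zip(bits_u, bits_v)) / num_planes
print(f"observed={observed:.3f} expected={expected_bit_agreement(u, v):.3f}")
```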

Why You Rarely See LSH Alone Anymore

Modern systems often combine:

LSH for coarse filtering
Dense retrieval or re-ranking for precision

LSH doesn’t compete with embeddings — it complements them.

LSH vs Modern ANN Indexes: An Honest Comparison

Here’s the question I get a lot:

“Why use LSH when we have HNSW, IVF, and graph-based indexes?”

Short answer: you usually shouldn’t — unless your constraints demand it.

LSH shines when:

You need extreme simplicity
Memory usage must be predictable
Data distribution changes frequently
You want fast rebuilds and stateless shards

Graph-based ANN shines when:

You want high recall
You can afford memory
Data is relatively stable

Some systems — Milvus, for example — focus more on graph and IVF-based ANN rather than LSH, because they optimize for high-recall vector similarity at scale. That’s a design choice, not a universal rule.

Where LSH Still Shows Up in Modern RAG Systems

Even in Retrieval-Augmented Generation pipelines, LSH ideas sneak back in.

In large RAG systems:

First-stage retrieval favors recall
Later stages favor precision

LSH-like hashing can be used as:

A pre-filter before vector search
A routing mechanism for sharded indexes
A cheap candidate generator

If you’re new to RAG architectures, this overview of Retrieval-Augmented Generation provides useful context.

A Small, Practical LSH Example (Python)

Here’s a minimal example using random projection LSH for cosine similarity.

This isn’t a full system — but it captures the core idea.

import numpy as np

def random_hyperplane_hash(vectors, num_planes=10):
    dim = vectors.shape[1]
    planes = np.random.randn(num_planes, dim)
    projections = np.dot(vectors, planes.T)
    return (projections > 0).astype(int)

# Example vectors
vectors = np.random.randn(1000, 128)

# Hash into buckets
hash_codes = random_hyperplane_hash(vectors)

# Group by bucket
buckets = {}
for idx, code in enumerate(map(tuple, hash_codes)):
    buckets.setdefault(code, []).append(idx)

print(f"Number of buckets: {len(buckets)}")

In practice, you’d:

Use multiple hash tables
Tune the number of planes
Combine this with downstream scoring

LSH alone is rarely the end of the pipeline.
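To make the "multiple hash tables" point concrete, here is a dependency-free sketch. It keeps the planes for each table so that queries are hashed exactly the way indexed vectors were. The class name, defaults, and seed are assumptions; candidates still need exact rescoring downstream.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Several independent hash tables; planes are stored so queries hash the same way."""

    def __init__(self, dim, num_tables=8, num_planes=10, seed=0):
        rng = random.Random(seed)
        self.tables = []
        for _ in range(num_tables):
            planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
            self.tables.append((planes, defaultdict(list)))

    @staticmethod
    def _code(vec, planes):
        return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, vec)) > 0) for p in planes)

    def add(self, idx, vec):
        for planes, buckets in self.tables:
            buckets[self._code(vec, planes)].append(idx)

    def candidates(self, query):
        """Union of the query's bucket across all tables; rescore these downstream."""
        found = set()
        for planes, buckets in self.tables:
            found.update(buckets.get(self._code(query, planes), []))
        return found


rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
index = HyperplaneLSH(dim=16)
for i, vec in enumerate(vectors):
    index.add(i, vec)
print(len(index.candidates(vectors[0])), "candidates out of", len(vectors))
```

More tables raise recall at the cost of memory; more planes per table shrink buckets and raise precision. That is the tuning knob the list above refers to.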

Cost, Scale, and Why Approximation Still Matters

One lesson I’ve learned repeatedly: approximation is a cost control mechanism.

Every retrieval decision affects:

CPU/GPU usage
Latency
LLM token consumption

In RAG systems, poor retrieval inflates prompt size and model calls. Tools like the RAG cost calculator make this painfully clear.

LSH’s philosophy — reduce the search space early — aligns well with cost-aware system design.

Multimodal Data Makes LSH Relevant Again

As soon as you introduce:

Images
Audio
Video embeddings

Your embedding distributions get messier.

In multimodal pipelines, coarse hashing can help route queries to the right subspace before expensive similarity search. This is especially relevant in systems discussed in multimodal RAG.

LSH won’t give you perfect results — but it can dramatically reduce waste.

LSH in Agentic and Voice-Based Systems

In agentic systems, especially voice assistants, latency spikes are deadly.

When building voice-driven RAG agents, you often need:

Fast intent routing
Lightweight candidate generation
Predictable response times

Hash-based filtering can be a quiet enabler here, especially in early decision stages. You can see how these ideas surface in more complex pipelines like those described in this guide on building a voice assistant with agentic RAG.

When I Would Not Use LSH

Let’s be clear — LSH is not a silver bullet.

I wouldn’t use it when:

You need very high recall
Data is low-dimensional and small
Graph-based ANN is feasible
Ranking quality is more important than speed

LSH trades accuracy for speed. If that trade-off doesn’t serve your system goals, skip it.

Final Thoughts: Why LSH Still Deserves Respect

LSH taught the industry something important long before neural embeddings were popular:

You don’t need perfect similarity — just useful similarity.

Even today, that lesson shows up everywhere:

Approximate nearest neighbor search
Two-stage retrieval pipelines
Cost-aware RAG systems

LSH may not be the star of modern AI stacks, but its ideas are everywhere. And if you’re designing systems under real-world constraints — latency, scale, cost — it’s still worth understanding deeply.

Not because it’s trendy.

Because it works.

Picking the Right Embedding Model for Your RAG Pipeline: What I’ve Learned the Hard Way

Priya Singh — Thu, 16 Apr 2026 05:48:24 GMT

Last year, I spent three weeks debugging a RAG chatbot that kept returning eerily confident but completely wrong answers. The retrieval metrics looked fine on paper. The LLM was GPT-4. The vector database was solid. So what was the problem?

The embedding model. I had grabbed the first one off a tutorial, plugged it in, and never questioned it again. Turns out it was wildly mismatched for the domain.

That experience forced me to actually understand what I was putting at the front of my RAG pipeline. In this post I want to share what I now know about embedding model selection — covering the MTEB leaderboard, SBERT models, and how I think about the tradeoffs between open-source and proprietary options.

How Embedding Models Fit Into a RAG Pipeline

Before getting into model comparisons, let me make sure the mental model is clear, because I’ve seen a lot of confusion here.

RAG works in three stages:

Stage 1 — Ingestion: You run every document chunk through an embedding model to produce a vector embedding — a fixed-size numerical representation of the chunk’s semantic content. These vectors get stored in a vector database like Milvus.

Stage 2 — Retrieval: When a user asks a question, you embed that question using the exact same model, then run a nearest-neighbor search to find the most semantically similar chunks.

Stage 3 — Generation: You inject the top-K retrieved chunks into an LLM prompt as context, and the model answers the question based on your domain knowledge rather than just its training data.

The critical constraint here: your query embedding and your document embeddings live in the same vector space only if they were produced by the same model. Swap one without reindexing the other and your retrieval quality collapses completely.

There’s a research-backed reason to keep your top-K retrieval focused, by the way. The Lost in the Middle paper showed that LLM answer quality degrades when too many retrieved chunks are stuffed into the context window. Keep retrieval tight and relevant rather than broad and noisy.

Understanding SBERT: The Architecture Behind Most Embedding Models

Most embedding models you’ll encounter are built on SBERT — Sentence-BERT. It’s worth understanding what makes it different from vanilla BERT.

BERT was designed to understand individual tokens in context. SBERT extends this by training the model to produce a single, fixed-size vector that represents the meaning of an entire sentence. It does this using siamese network training: two sentences are encoded separately, and the model is trained to place semantically similar sentences near each other in vector space.

The practical consequence: SBERT understands that “the cat sat on the mat” and “the mat sat on the cat” mean different things. Basic BERT wouldn’t reliably catch that. For retrieval tasks, where you’re matching questions to semantically related passages, that sentence-level understanding is what makes SBERT work.

LLMs like GPT-4 are built on the decoder side of the transformer architecture — optimized for generation. Embedding models are built on the encoder side — optimized for representation. They are fundamentally different tools serving different purposes in the same pipeline.

How to Actually Use the MTEB Leaderboard

The HuggingFace MTEB Leaderboard is the industry standard for comparing embedding models, but most people use it wrong.

MTEB evaluates models on 8 task types (retrieval, clustering, classification, semantic textual similarity, and more) spanning 58 datasets. When you’re building a RAG pipeline, you care about one column: Retrieval Average, measured as NDCG@10 (Normalized Discounted Cumulative Gain over the top 10 results). This metric weights higher-ranked results more heavily, which aligns with how RAG actually works: the top few retrieved chunks carry most of the weight.
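If NDCG@10 is unfamiliar, a few lines make the rank discount visible. This sketch uses the linear-gain form (some implementations use 2^rel - 1 instead) and, as a simplification, computes the ideal ordering from the retrieved list only rather than from every judged document.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each retrieved chunk, in ranked order (1 = relevant, 0 = not).
print(ndcg_at_k([1, 0, 0]))  # relevant chunk ranked first
print(ndcg_at_k([0, 0, 1]))  # same chunk ranked third scores lower
```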

My workflow when picking a model:

Sort the MTEB leaderboard descending by the Retrieval column
Filter to models that fit within my memory and latency budget
Pick the smallest model that achieves acceptable retrieval scores
Run my own evals on a sample of real domain queries — because MTEB is known to have overfitting issues for some models

That last step is non-negotiable. I’ve been burned by models that score well on MTEB but perform poorly on technical or domain-specific text. The leaderboard is a starting point, not a final answer.

The Six Models I’ve Worked With in Production

Let me walk through the models I’ve actually used, with real notes on where each one shines and where it falls short.

Creator | Model | Embedding Dim | Context Length | Open Source | MTEB Retrieval Score
BAAI | bge-base-en-v1.5 | 768 | 512 tokens | Yes | 53
BAAI | bge-base-zh-v1.5 | 768 | 512 tokens | Yes | 69
VoyageAI | voyage-2 | 1024 | 4K tokens | No | n/a
VoyageAI | voyage-code-2 | 1536 | 16K tokens | No | n/a
OpenAI | text-embedding-3-small | 512–1536 | 8K tokens | No | 62
OpenAI | text-embedding-3-large | 256–3072 | 8K tokens | No | 65

BAAI/bge-base-en-v1.5 and bge-base-zh-v1.5

These are my go-to models when I’m prototyping or when I need to keep infrastructure costs near zero. They’re available on HuggingFace, run fine on CPU, and have no API call costs.

The 512-token context window is the real constraint. If your documents chunk naturally under that limit — which most paragraph-level chunking strategies do — you won’t notice it. But if you’re working with long technical passages that are hard to split cleanly, you’ll hit truncation issues that silently degrade retrieval quality.

For bilingual deployments, bge-base-zh-v1.5 is the most practical option I've found for Chinese-language content. The MTEB score of 69 on Chinese benchmarks is genuinely strong.

VoyageAI’s voyage-2 and voyage-code-2

I started using VoyageAI’s models after seeing a case study that showed significantly better NDCG@10 on technical documentation retrieval compared to the ada-002 generation.

voyage-2 is trained on conversational and dialog data, which means it handles question-to-passage matching better than general-purpose models for certain domains. In my experience with customer support RAG systems, it noticeably outperformed bge on short, intent-heavy queries.

voyage-code-2 is where things get interesting. It's trained specifically on code data with a 16K context window — the longest of any model in this group. For a code search or documentation RAG use case, that context window means you can embed entire functions or long docstrings without chunking at awkward boundaries. VoyageAI reports a 14% higher recall rate on code retrieval tasks, and in my own testing that number felt credible.

The downside: these are proprietary, API-only models. No self-hosting, and you’re dependent on their availability and pricing.

OpenAI text-embedding-3-small and text-embedding-3-large

These replaced ada-002 and the improvement is meaningful — higher MTEB retrieval scores, better multilingual performance, and lower pricing.

The most interesting engineering decision OpenAI made here is Matryoshka Representation Learning. Instead of training at a single fixed dimension, the model learns representations at multiple scales simultaneously. The practical result: you can request a shorter embedding through the dimensions parameter with surprisingly small accuracy loss, as long as documents and queries use the same dimension.

For text-embedding-3-large, going from 3072 dimensions to 256 drops the MTEB retrieval score from 65 to 62, roughly a 5% relative drop in exchange for a 12x reduction in storage and memory. For high-throughput applications where you're storing tens of millions of vectors, that tradeoff is often worth taking.
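The storage side of that tradeoff is simple arithmetic, sketched below together with manual truncation. Assumptions: vectors are stored as float32, index overhead is ignored, the 50 million figure is just an example, and truncating then re-normalizing is only meaningful for models trained with Matryoshka-style objectives.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw float32 vector storage in GB, ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9


full = index_size_gb(50_000_000, 3072)
small = index_size_gb(50_000_000, 256)
print(f"{full:.1f} GB vs {small:.1f} GB ({full / small:.0f}x smaller)")
```

If you truncate yourself rather than passing dimensions to the API, apply the same truncation to both stored vectors and query vectors.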

In my own RAG testing on Milvus technical documentation, text-embedding-3-small at dim=256 produced answers that were indistinguishable from dim=1536 for the majority of queries. The edge cases where higher dimensionality mattered were nuanced disambiguation questions — the kind that represent maybe 5% of real user traffic.

End-to-End Code: RAG with OpenAI Embeddings and Milvus

Here’s the full pipeline I use. This connects to Zilliz Cloud (managed Milvus), but the same code works with a self-hosted Milvus instance.

Step 1 — Connect:

from pymilvus import connections, utility
import os
from dotenv import load_dotenv

load_dotenv()
TOKEN = os.getenv("ZILLIZ_API_KEY")
CLUSTER_ENDPOINT = "https://in03-xxxx.api.gcp-us-west1.zillizcloud.com:443"
connections.connect(
    alias='default',
    uri=CLUSTER_ENDPOINT,
    token=TOKEN,
)

Step 2 — Define your embedding model:

from openai import OpenAI

openai_client = OpenAI()
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 512  # Using reduced dim - good enough for most use cases

Step 3 — Create the collection:

from pymilvus import MilvusClient

COLLECTION_NAME = "my_rag_collection"
mc = MilvusClient(uri=CLUSTER_ENDPOINT, token=TOKEN)
mc.create_collection(
    COLLECTION_NAME,
    EMBEDDING_DIM,
    consistency_level="Eventually",
    auto_id=True,
    overwrite=True
)

Step 4 — Chunk, embed, and insert:

# Assumes `chunks` is a list of LangChain Document objects
chunk_list = []
for chunk in chunks:
    # Embed each chunk with the same model and dimension used at query time
    response = openai_client.embeddings.create(
        input=chunk.page_content,
        model=EMBEDDING_MODEL,
        dimensions=EMBEDDING_DIM
    )
    embeddings = response.data[0].embedding

    # Append inside the loop so every chunk is kept, not just the last one
    chunk_list.append({
        'vector': embeddings,
        'chunk': chunk.page_content,
        'source': chunk.metadata.get('source', ''),
    })

# Insert all rows in one call, then flush
mc.insert(COLLECTION_NAME, data=chunk_list, progress_bar=True)
mc.flush(COLLECTION_NAME)

Step 5 — Query and generate:

SAMPLE_QUESTION = "What do the parameters for HNSW mean?"

response = openai_client.embeddings.create(
    input=SAMPLE_QUESTION,
    model=EMBEDDING_MODEL,
    dimensions=EMBEDDING_DIM
)
query_embeddings = response.data[0].embedding
results = mc.search(
    COLLECTION_NAME,
    data=[query_embeddings],
    output_fields=["chunk", "source"],
    limit=3,
    consistency_level="Eventually"
)
context = [r['entity']['chunk'] for r in results[0]]
contexts_combined = ' '.join(context)
llm_response = openai_client.chat.completions.create(
    messages=[
        {"role": "system", "content": f"Answer using only the context below. Context: {contexts_combined}"},
        {"role": "user", "content": SAMPLE_QUESTION}
    ],
    model="gpt-3.5-turbo",
    temperature=0.1,
)
print(llm_response.choices[0].message.content)

Full working notebook is in the Milvus bootcamp on GitHub.

My Decision Framework

After running a lot of these experiments, here’s how I now decide:

Prototyping / cost-sensitive / self-hosted: Start with bge-base-en-v1.5. It's free, fast, and good enough to validate your pipeline architecture.
Production English/multilingual chatbot: text-embedding-3-small with reduced dimensionality is hard to beat on the cost-quality curve.
High-stakes domain RAG (legal, medical, technical docs): Evaluate voyage-2 seriously. The MTEB score doesn't capture everything; domain-specific retrieval quality can surprise you.
Code search / developer tooling: voyage-code-2 with its 16K context window is worth the API dependency.
Extreme memory constraints or billions of vectors: text-embedding-3-large at dim=256 — the Matryoshka approach is genuinely clever engineering.

One thing I’ll keep repeating: always run your own evals on a slice of real production queries. MTEB is a proxy. Your actual retrieval quality on your actual data is the only number that matters.

I’m always interested in comparing notes on RAG infrastructure. If you’re doing something interesting with embedding model selection or fine-tuning, find me in the Milvus Discord or check out the vector database benchmark leaderboard for performance comparisons across setups.

Picking the Right Embedding Model for Your RAG Pipeline: What I’ve Learned the Hard Way was originally published in GoPenAI on Medium, where people are continuing the conversation by highlighting and responding to this story.