LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA

Build Multi-Index Advanced RAG Apps

How to implement multi-index queries to optimize your RAG retrieval layer.

Paul Iusztin
Decoding ML

--

→ the 12th out of 12 lessons of the LLM Twin free course

What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM.

Image by DALL-E

Why is this course different?

By finishing the “LLM Twin: Building Your Production-Ready AI Replica” free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.

What will you learn to build by the end of this course?

You will learn how to architect and build a real-world LLM system from start to finish — from data collection to deployment.

You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.

The end goal? Build and deploy your own LLM twin.

The architecture of the LLM twin is split into 4 Python microservices:

  1. the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)
  2. the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)
  3. the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML’s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet’s model registry. (deployed on Qwak)
  4. the inference pipeline: load and quantize the fine-tuned LLM from Comet’s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet’s prompt monitoring dashboard. (deployed on Qwak)
LLM Twin system architecture [Image by the Author]

Alongside the 4 microservices, you will learn to integrate 3 serverless tools: Comet ML as your ML platform, Qdrant as your vector DB, and Qwak as your ML infrastructure.

Who is this for?

Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using LLMOps good principles.

Level: intermediate

Prerequisites: basic knowledge of Python, ML, and the cloud

How will you learn?

The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system.

Also, it includes 2 bonus lessons on how to improve the RAG system.

You can read everything at your own pace.

→ To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Lesson 12: Build Multi-Index Advanced RAG Apps

This article will teach you how to implement multi-index structures for building advanced RAG systems.

To implement our multi-index collections and queries, we will leverage Superlinked, a vector compute engine highly optimized for working with vector data, offering solutions for ingestion, embedding, storage, and retrieval.

To better understand how Superlinked queries work, we will gradually present how to build a complex query that uses two vector indexes, adds filters based on the metadata extracted using an LLM, and returns only the top K most similar documents to reduce network I/O overhead.

Ultimately, we will dig into how Superlinked can help us implement and optimize various advanced RAG methods, such as query expansion, self-query, filtered vector search and rerank.

As this article is part of the LLM Twin course, before we start, here is the essential context you need to follow this lesson (which you can also read independently if you want to):

  • In Lesson 11, we implemented the real-time RAG ingestion pipeline (using Bytewax) and server (using Superlinked).
  • In Lesson 5, we presented 4 advanced RAG algorithms in depth and how to implement them.
Figure 1: RAG ingestion pipeline and server

Now, let’s move on to Lesson 12, our current lesson.

1. Exploring the multi-index RAG server

We are using Superlinked to implement a powerful vector compute server. With just a few lines of code, we can implement a fully-fledged RAG application exposed as a REST API web server.

When using Superlinked, you define your chunking, embedding, and query strategy declaratively (similar to building a graph), making it extremely easy to implement an end-to-end workflow.

Let’s explore the core steps in how to define a RAG server using Superlinked ↓

First, you have to define the schema of your data, which in our case are the post, article, and repository schemas:

from superlinked import schema, String

@schema
class PostSchema:
    content: String
    platform: String
    ...  # Other fields

@schema
class RepositorySchema:
    content: String
    platform: String
    ...

@schema
class ArticleSchema:
    content: String
    platform: String
    ...

post = PostSchema()
article = ArticleSchema()
repository = RepositorySchema()

You can quickly define an embedding space based on one or more schema attributes. An embedding space is defined by the following properties:

  • the field to be embedded;
  • a model used to embed the field.

For example, this is how you can define an embedding space for a piece of text, more precisely, over the article's content field:

from superlinked import TextSimilaritySpace, chunk

articles_space_content = TextSimilaritySpace(
    text=chunk(article.content, chunk_size=500, chunk_overlap=50),
    model=settings.EMBEDDING_MODEL_ID,
)

Notice that we also wrapped the article's content field with the chunk() function that automatically chunks the text before embedding it.

The model can be any embedding model available on HuggingFace or SentenceTransformers. For example, we used the following EMBEDDING_MODEL_ID:

from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    EMBEDDING_MODEL_ID: str = "sentence-transformers/all-mpnet-base-v2"

    REDIS_HOSTNAME: str = "redis"
    REDIS_PORT: int = 6379

settings = Settings()

Superlinked also supports defining an embedding space for categorical variables, such as the article’s platform:

from superlinked import CategoricalSimilaritySpace

articles_space_plaform = CategoricalSimilaritySpace(
    category_input=article.platform,
    categories=["medium", "superlinked"],
    negative_filter=-5.0,
)

Along with text and categorical embedding spaces, Superlinked also supports numerical and temporal variables (see the Number [7] and Recency [6] spaces in its documentation).
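For illustration, numerical and temporal spaces could look roughly like the sketch below, following the Number [7] and Recency [6] space documentation. The article.likes and article.created_at fields are hypothetical additions to our schema, and the simplified import style mirrors the snippets above, so exact names may differ in your Superlinked version:

from datetime import timedelta

from superlinked import Mode, NumberSpace, PeriodTime, RecencySpace

# Hypothetical numerical space: favor articles with more likes ([7], MinMax mode).
articles_space_likes = NumberSpace(
    number=article.likes,  # hypothetical schema field
    min_value=0,
    max_value=10_000,
    mode=Mode.MAXIMUM,
)

# Hypothetical temporal space: favor recently published articles [6].
articles_space_recency = RecencySpace(
    timestamp=article.created_at,  # hypothetical schema field
    period_time_list=[PeriodTime(timedelta(days=365))],
)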

Multi-index structures

Now, we can combine the two embedding spaces defined above into a multi-index structure:

from superlinked import Index

article_index = Index(
    [articles_space_content, articles_space_plaform],
    fields=[article.author_id],
)

The first attribute is a list with references to the text and categorical embedding spaces. At the same time, the fields parameter contains a list of all the fields to which we want to apply filters when querying the data. These steps will optimize retrieval and filter operations to run at low latencies.

Note that when defining an Index in Superlinked, we can add as many embedding spaces as we like, as long as they originate from the same schema (in our case, ArticleSchema): the minimum is one, and the maximum is all the schema's fields.

…and, voilà!

We defined a multi-index structure that supports weighted queries in just a few lines of code.

Using Superlinked and its embedding space and index architecture, we can easily index different data types (text, categorical, number, temporal) into a multi-index structure that offers tremendous flexibility in how we interact with the data.

The following section will show you how to query the multi-index collection defined above. But first, let’s wrap up with the Superlinked RAG server.

To do so, let’s define a connector to a Redis Vector DB:

from superlinked import RedisVectorDatabase

vector_database = RedisVectorDatabase(
    settings.REDIS_HOSTNAME,
    settings.REDIS_PORT,
)

…and ultimately define a RestExecutor that wraps up everything from above into a REST API server:

from superlinked import (
    RestDescriptor,
    RestExecutor,
    RestQuery,
    RestSource,
    SuperlinkedRegistry,
)

article_source = RestSource(article)
repository_source = RestSource(repository)
post_source = RestSource(post)

executor = RestExecutor(
    sources=[article_source, repository_source, post_source],
    indices=[article_index, repository_index, post_index],
    queries=[
        RestQuery(RestDescriptor("article_query"), article_query),
        RestQuery(RestDescriptor("repository_query"), repository_query),
        RestQuery(RestDescriptor("post_query"), post_query),
    ],
    vector_database=vector_database,
)
SuperlinkedRegistry.register(executor)

Based on all the queries defined in the RestExecutor class, Superlinked will automatically generate endpoints that can be called through HTTP requests.

In Lesson 11, we showed in more detail how the RAG Superlinked server works, how to set it up, and how to interact with its query endpoints.

2. Understanding the data ingestion pipeline

Before we understand how to build queries for our multi-index collections, let’s have a quick refresher on how the vector DB is populated with article, post, and repository documents.

The data ingestion workflow is illustrated in Figure 2. During the LLM Twin course, we implemented a real-time data collection system in the following way:

  1. We crawl the data from the internet and store it in a MongoDB data warehouse.
  2. We use CDC to capture CRUD events on the database and send them as messages to a RabbitMQ queue.
  3. We use a Bytewax streaming engine to consume and clean the events from RabbitMQ in real time.
  4. Ultimately, the data is ingested into the Superlinked server through HTTP requests.
  5. As seen before, the Superlinked server does the heavy lifting, such as chunking, embedding, and loading all the ingested data into a Redis vector DB.
  6. We implemented a vector DB retrieval client that queries the data from Superlinked through HTTP requests.
  7. The vector DB retrieval will be used within the final RAG component, which generates the final response using the retrieved context and an LLM.

Note that whenever we crawl a new document from the Internet, we repeat steps 1–5, resulting in a vector DB synced with the external world in real-time.

Figure 2: The RAG data ingestion pipeline and Superlinked server
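To make step 4 above more tangible, here is a minimal sketch of what pushing a cleaned document into the Superlinked server could look like. The ingestion route and the payload fields are assumptions for illustration only; Lesson 11 contains the actual Bytewax sink implementation:

import httpx

base_url = "http://localhost:8080"  # assumed address of the Superlinked server
ingest_url = f"{base_url}/api/v1/ingest/article_schema"  # route name is an assumption

# Hypothetical payload: a cleaned article produced by the Bytewax pipeline.
document = {
    "id": "article-123",
    "content": "Cleaned article text...",
    "platform": "medium",
    "author_id": "145",
}

response = httpx.post(
    ingest_url,
    headers={"Accept": "*/*", "Content-Type": "application/json"},
    json=document,
    timeout=600,
)
response.raise_for_status()  # the server chunks, embeds, and stores the document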

If you want to see the full implementation of the steps above, you can always check out the rest of the course’s lessons for free, starting with Lesson 1.

But now that we have an intuition of how the Redis vector DB is populated with the data used for RAG, let's see the true power of Superlinked and build some queries to retrieve data.

3. Writing complex multi-index RAG queries using Superlinked

Let’s take a look at the complete article query we want to define:

from superlinked import Param, Query

article_query = (
    Query(
        article_index,
        weights={
            articles_space_content: Param("content_weight"),
            articles_space_plaform: Param("platform_weight"),
        },
    )
    .find(article)
    .similar(articles_space_content.text, Param("search_query"))
    .similar(articles_space_plaform.category, Param("platform"))
    .filter(article.author_id == Param("author_id"))
    .limit(Param("limit"))
)

If it seems like a lot, let's break it into smaller pieces and start at the beginning.

What if we want to make a basic query that finds the most relevant articles solely based on the similarity between the query and the content of an article?

In the code snippet below, we define a query based on the article’s index to find articles that have the embedding of the content field most similar to the search query:

article_query = (
    Query(article_index)
    .find(article)
    .similar(articles_space_content.text, Param("search_query"))
)

As seen in the Exploring the multi-index RAG server section, plugging this query into the RestExecutor class automatically creates an API endpoint accessible through POST HTTP requests.

In Figure 3, we can observe all the available endpoints automatically generated by Superlinked.

Figure 3: Screenshot of the Swagger UI [4] generated automatically based on the Superlinked queries.

Thus, after starting the Superlinked server, which we showed how to do in Lesson 11, you can access the query as follows:

import httpx

url = f"{base_url}/api/v1/search/article_query"
headers = {"Accept": "*/*", "Content-Type": "application/json"}

data = {
    "search_query": "Write me a post about Vector DBs and RAG.",
}
response = httpx.post(
    url, headers=headers, json=data, timeout=600
)
print(response.json()["obj"])

As you can observe, all the attributes wrapped by the Param() class within the query are expected as parameters within the POST request, such as the Param(“search_query”), which represents the user’s query.

Quite intuitive, right?

Now… What happens behind the scenes?

After the endpoint is called, the Superlinked server processes the search query based on the articles_space_content embedding text space, which defines how to chunk and embed a text.

Thus, the same will happen to the search query: it will be chunked and embedded.

Using the computed query embedding, it will search the vector space based on the article’s content and retrieve the most similar documents:

articles_space_content = TextSimilaritySpace(
    text=chunk(article.content, chunk_size=500, chunk_overlap=50),
    model="sentence-transformers/all-mpnet-base-v2",
)

Multi-index query

Now that we understand the basics of how a Superlinked query works, let’s add another layer of complexity and create a multi-index query based on the article’s content and platform:

article_query = (
    Query(
        article_index,
        weights={
            articles_space_content: Param("content_weight"),
            articles_space_plaform: Param("platform_weight"),
        },
    )
    .find(article)
    .similar(articles_space_content.text, Param("search_query"))
    .similar(articles_space_plaform.category, Param("platform"))
)

We added two things.

The first one is another similar() function call, which configures the second embedding space we want to use for the query: articles_space_plaform.

Now, when making a query, Superlinked will use the embedding of both fields to search for relevant information:

  • the search query
  • the article’s platform

But how do we configure which one is more important?

Here, the second thing we added kicks in: the weights parameter of the Query(weights={…}) class.

Using the weights dictionary, we can assign a different weight to each embedding space to configure its importance within a particular query.

Let’s better understand this with an example:

data = {
    "search_query": "Write me a post about Vector DBs and RAG.",
    "platform": "medium",
    "content_weight": 0.9,  # 90%
    "platform_weight": 0.1,  # 10%
}
response = httpx.post(
    url, headers=headers, json=data, timeout=600
)

In the previous example, we set the content weight to 90% and the platform weight to 10%, which means that the article's content will have the biggest impact on the query while still favoring articles from the requested platform.

By playing with these weights, we tweak the impact of each embedding space on our query.
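To build intuition for what these weights do, here is a small, purely illustrative sketch of weighted multi-space scoring. Superlinked's internal scoring and normalization may differ, so treat this as a mental model rather than its actual implementation:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_score(query_vecs: dict, doc_vecs: dict, weights: dict) -> float:
    # Each embedding space contributes its own similarity, scaled by its weight.
    return sum(
        weights[space] * cosine(query_vecs[space], doc_vecs[space])
        for space in weights
    )

# Toy example mirroring the request above: content similarity dominates (90%),
# platform similarity only nudges the ranking (10%).
weights = {"content": 0.9, "platform": 0.1}
query_vecs = {"content": np.random.rand(768), "platform": np.random.rand(8)}
doc_vecs = {"content": np.random.rand(768), "platform": np.random.rand(8)}
print(combined_score(query_vecs, doc_vecs, weights))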

Now, let's add the final pieces of the query: the filter() and limit() functions:

article_query = (
    Query(
        article_index,
        weights={
            articles_space_content: Param("content_weight"),
            articles_space_plaform: Param("platform_weight"),
        },
    )
    .find(article)
    .similar(articles_space_content.text, Param("search_query"))
    .similar(articles_space_plaform.category, Param("platform"))
    .filter(article.author_id == Param("author_id"))
    .limit(Param("limit"))
)

The author_id filter helps us retrieve documents only from a specific author, while the limit function controls how many items we want to retrieve.

For example, if we find 10 similar articles but the limit is set to 3, the Superlinked server will return at most 3 documents, thus reducing network I/O between the server and the client:

data = {
    "search_query": "Write me a post about Vector DBs and RAG.",
    "platform": "medium",
    "content_weight": 0.9,  # 90%
    "platform_weight": 0.1,  # 10%
    "author_id": 145,
    "limit": 3,
}
response = httpx.post(
    url, headers=headers, json=data, timeout=600
)

That’s it! We can further optimize our retrieval step by experimenting with other multi-index configurations and weights.

4. Exploring the 4 advanced RAG optimization techniques

In Lesson 5, we explored 4 popular advanced RAG techniques to improve the accuracy of our generative AI system.

As a quick reminder, there are 3 main types of advanced RAG techniques:

  • Pre-retrieval optimization [ingestion]: tweak how you create the chunks
  • Retrieval optimization [retrieval]: improve the queries to your vector DB
  • Post-retrieval optimization [retrieval]: process the retrieved chunks to filter out the noise
Figure 4: Advanced RAG optimization options

Now, let’s explore the 4 methods initially implemented in Lesson 5 and understand how they can be integrated into our new architecture:

  1. Query expansion (retrieval)
  2. Self query (retrieval)
  3. Filtered vector search (retrieval)
  4. Rerank (post-retrieval)

By incorporating these 4 advanced RAG optimization techniques, we will better understand where Superlinked shines most.

Important → On the ingestion side, Superlinked handles everything: chunking, embedding, and loading the data into a vector DB, as detailed in Lesson 11.

Figure 5: Advanced RAG architecture

Query expansion (retrieval)

To implement query expansion, you use an LLM to generate multiple queries based on your user's initial query.

These queries will contain multiple perspectives of the initial query.

Thus, when embedded, they hit different areas of your embedding space that are still relevant to the initial question.

Does Superlinked help here? Not really, as you have to expand your query before calling Superlinked.
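Although Superlinked doesn't do this for you, query expansion is easy to bolt on before calling the server. Below is a minimal sketch (not the course's implementation, which lives in Lesson 5) that uses OpenAI's chat API to generate query variants and then calls the article_query endpoint once per variant; the model name and prompt are illustrative:

import httpx
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_query(query: str, n: int = 3) -> list[str]:
    # Ask an LLM for paraphrases that cover different angles of the question.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f"Generate {n} alternative phrasings of this search query, "
            f"one per line, without numbering:\n{query}",
        }],
    )
    variants = completion.choices[0].message.content.strip().splitlines()
    return [query] + [v.strip() for v in variants if v.strip()]

# Call the Superlinked endpoint once per expanded query and merge the results.
results = []
for expanded in expand_query("Write me a post about Vector DBs and RAG."):
    response = httpx.post(
        f"{base_url}/api/v1/search/article_query",  # base_url as in the snippets above
        headers={"Accept": "*/*", "Content-Type": "application/json"},
        json={"search_query": expanded},
        timeout=600,
    )
    results.extend(response.json()["obj"])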

Self query (retrieval)

What if you could extract the tags within the query and use them alongside your vector search?

That is what self-query is all about!

You use an LLM to extract critical metadata fields for your business use case (e.g., tags, author ID, number of comments, likes, shares, etc.)

In our custom solution, we are extracting just the author ID. Thus, a zero-shot prompt engineering technique will do the job.

Does Superlinked help here? Unfortunately, no, as you have to apply a self-query before calling the Superlinked server.

But… self-queries work hand-in-hand with filtered vector searches, which we will explain in the next section.
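As a rough illustration of the idea, a zero-shot self-query step could look like the sketch below: an LLM pulls the author ID out of the user's question, and the extracted value is then sent as the author_id parameter of the Superlinked query. The prompt and model name are illustrative only; Lesson 5 shows the course's actual implementation:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_author_id(user_query: str) -> str | None:
    # Zero-shot metadata extraction: reply with the author id only, or NONE.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": "Extract the author id from the query below. "
            "Reply with the id only, or NONE if there is no author id.\n"
            f"Query: {user_query}",
        }],
    )
    answer = completion.choices[0].message.content.strip()
    return None if answer.upper() == "NONE" else answer

author_id = extract_author_id("Write me a post about Vector DBs by author 145.")
# The extracted value is then passed as the "author_id" field of the POST payload,
# where it feeds the .filter(article.author_id == Param("author_id")) clause.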

Filtered vector search (retrieval)

This is a fancy name for applying a standard filter on your metadata before (or after) doing your vector search, hence “filtered vector search”.

Does Superlinked help here? Yes! This is where Superlinked shines, allowing you to quickly index structured fields alongside your vector index (or multi-index) and filter on them at query time.

article_index = Index(
    [articles_space_content, articles_space_plaform],
    fields=[article.author_id],
)

article_query = (
    Query(article_index)
    ...
    .filter(article.author_id == Param("author_id"))
)

Thus, you can implement optimal queries tailored to your data with a few lines of code.

Rerank (post-retrieval)

Rerank is used to filter out the noise from your retrieved documents.

For example, you retrieved N documents from your vector DB using Superlinked. However, you want to be prudent about your context size, so you use a rerank model to score the relevancy of all the retrieved documents relative to your query.

Then, based on the rerank score, you pick only the top K (where K < N) documents as your final items to build up the context.

Does Superlinked help here? Unfortunately, it doesn’t support cross-encoder models [3] for reranking.

But they are just at the beginning of their journey. Supporting reranking makes a lot of sense, so we speculate that they will add it along with other functionality that optimizes the retrieval component of a RAG system (or any other AI application that works with embeddings).
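Until then, reranking can easily be done client-side on top of the documents returned by Superlinked. Here is a minimal sketch using a cross-encoder from Sentence Transformers [3]; the model name and the assumption that each retrieved item exposes its content as plain text are ours:

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly, which is slower
# than bi-encoder retrieval but more precise [3].
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], keep_top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:keep_top_k]]

# 'documents' would be the content fields of the N items returned by the
# article_query endpoint; only the top K survive into the final context.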

In this article, we briefly discussed the 4 advanced RAG methods implemented in our course. Check out Lesson 5 for a detailed explanation of each method.

5. Is Superlinked OP for building RAG and other vector-based apps?

Superlinked has incredible potential to build scalable vector servers to ingest and retrieve your data based on operations between embeddings.

Figure 6: Screenshot from Superlinked’s landing page

As you’ve seen in Lesson 11 and Lesson 12, in just a few lines of code, we’ve:

  • implemented clean and modular schemas for our data;
  • chunked and embedded the data;
  • added embedding support for multiple data types (text, categorical, numerical, temporal);
  • implemented multi-index collections and queries, allowing us to optimize our retrieval step;
  • added connectors for multiple vector DBs (Redis, MongoDB, etc.);
  • optimized filtered vector search.

The truth is that Superlinked is still a young Python framework.

But as it grows, it will become more stable and introduce even more features, such as rerank, making it an excellent choice for implementing your vector search layer.

If you are curious, check out Superlinked to learn more about them.

Conclusion

Within this article, you’ve learned how to implement multi-index collections and queries for advanced RAG using Superlinked.

To better understand how Superlinked queries work, we gradually presented how to build a complex query that:

  • uses two vector indexes;
  • adds filters based on the metadata extracted with an LLM;
  • returns only the top K elements to reduce network I/O overhead.

Ultimately, we looked into how Superlinked can help us implement and optimize various advanced RAG methods, such as query expansion, self-query, filtered vector search and rerank.

This was the last lesson of the LLM Twin course. I hope you learned a lot during these sessions.

Decoding ML is grateful that you are here to expand your production ML knowledge from our resources.

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Enjoyed This Article?

Join the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems. Every week. For FREE

References

Literature

[1] Your LLM Twin Course — GitHub Repository (2024), Decoding ML’s GitHub Organization

[2] Understand Text Similarity Spaces (2024), Superlinked’s Documentation

[3] Retrieve & Re-Rank, Sentence Transformers Documentation

[4] Swagger UI, FastAPI documentation

[5] Understanding Categorical Similarity Space (2024), Superlinked’s Documentation

[6] Understanding Recency Spaces (2024), Superlinked’s Documentation

[7] Understand Number Spaces — MinMax Mode (2024), Superlinked’s Documentation

Images

If not otherwise stated, all images are created by the author.

--

Paul Iusztin
Decoding ML

Senior ML & MLOps Engineer • Founder @ Decoding ML ~ Content about building production-grade ML/AI systems • DML Newsletter: https://decodingml.substack.com