LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA
Architect scalable and cost-effective LLM & RAG inference pipelines
Design, build, and deploy a RAG inference pipeline using LLMOps best practices.
→ the 9th out of 12 lessons of the LLM Twin free course
What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM.
Why is this course different?
By finishing the “LLM Twin: Building Your Production-Ready AI Replica” free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.
Why should you care? 🫵
→ No more isolated scripts or Notebooks! Learn production ML by building and deploying an end-to-end production-grade LLM system.
What will you learn to build by the end of this course?
You will learn how to architect and build a real-world LLM system from start to finish — from data collection to deployment.
You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.
The end goal? Build and deploy your own LLM twin.
The architecture of the LLM twin is split into 4 Python microservices:
- the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)
- the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)
- the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML’s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet’s model registry. (deployed on Qwak)
- the inference pipeline: load and quantize the fine-tuned LLM from Comet’s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet’s prompt monitoring dashboard. (deployed on Qwak)
Alongside the 4 microservices, you will learn to integrate 3 serverless tools: Comet ML as your ML platform, Qdrant as your vector DB, and Qwak as your ML infrastructure.
Who is this for?
Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using LLMOps good principles.
Level: intermediate
Prerequisites: basic knowledge of Python, ML, and the cloud
How will you learn?
The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system.
Also, it includes 2 bonus lessons on how to improve the RAG system.
You can read everything at your own pace.
→ To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.
Costs?
The articles and code are completely free. They will always remain free.
But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs.
The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan. Qwak offers a few hours of free computing. Thus, we did our best to keep costs to a minimum.
For the other serverless tools (Qdrant, Comet), we will stick to their freemium version, which is free of charge.
Meet your teachers!
The course is created under the Decoding ML umbrella by:
- Paul Iusztin | Senior ML & MLOps Engineer
- Alex Vesa | Senior AI Engineer
- Alex Razvant | Senior ML & MLOps Engineer
🔗 Check out the code on GitHub [1] and support us with a ⭐️
Lessons
→ Quick overview of each lesson of the LLM Twin free course.
The course is split into 12 lessons. Every Medium article will be its own lesson:
- An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin
- The Importance of Data Pipelines in the Era of Generative AI
- Change Data Capture: Enabling Event-Driven Architectures
- SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG — in Real-Time!
- The 4 Advanced RAG Algorithms You Must Know to Implement
- The Role of Feature Stores in Fine-Tuning LLMs
- How to fine-tune LLMs on custom datasets at Scale using Qwak and CometML
- Best Practices When Evaluating Fine-Tuned LLMs
- Architect scalable and cost-effective LLM & RAG inference pipelines
- How to evaluate your RAG pipeline using the RAGAs Framework
- [Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code
- [Bonus] Build Multi-Index Advanced RAG Apps
To better understand the course’s goal, technical details, and system design → Check out Lesson 1
Let’s start with Lesson 9 ↓↓↓
Lesson 9: Architect scalable and cost-effective LLM & RAG inference pipelines
In Lesson 9, we will focus on implementing and deploying the inference pipeline of the LLM twin system.
First, we will design and implement a scalable LLM & RAG inference pipeline based on microservices, separating the ML and business logic into two layers.
Secondly, we will use Comet ML to integrate a prompt monitoring service to capture all input prompts and LLM answers for further debugging and analysis.
Ultimately, we will deploy the inference pipeline to Qwak and make the LLM twin service available worldwide.
→ Context from previous lessons. What you must know.
This lesson is part of a more extensive series in which we learn to build an end-to-end LLM system using LLMOps best practices.
In Lesson 4, we populated a Qdrant vector DB with cleaned, chunked, and embedded digital data (posts, articles, and code snippets).
In Lesson 5, we implemented the advanced RAG retrieval module to query relevant digital data. Here, we will learn to integrate it into the final inference pipeline.
In Lesson 7, we used Qwak to build a training pipeline to fine-tune an open-source LLM on our custom digital data. The LLM weights are available in a model registry.
In Lesson 8, we evaluated the fine-tuned LLM to ensure the production candidate behaves accordingly.
So… what must you know from all of this?
Don’t worry. If you don’t want to replicate the whole system, you can read this article independently of the previous lessons.
Here is all you have to know. We assume you already have:
- a Qdrant vector DB populated with digital data (posts, articles, and code snippets)
- a vector DB retrieval module to do advanced RAG
- a fine-tuned open-source LLM available in a model registry from Comet ML
→ In this lesson, we will focus on gluing everything together into a scalable inference pipeline and deploying it to the cloud.
1. The architecture of the inference pipeline
Our inference pipeline contains the following core elements:
- a fine-tuned LLM
- a RAG module
- a monitoring service
Let’s see how to hook these into a scalable and modular system.
The interface of the inference pipeline
As we follow the feature/training/inference (FTI) pipeline architecture, the communication between the 3 core components is clear.
Our LLM inference pipeline needs 2 things:
- a fine-tuned LLM: pulled from the model registry
- features for RAG: pulled from a vector DB (which we modeled as a logical feature store)
This perfectly aligns with the FTI architecture.
→ If you are unfamiliar with the FTI pipeline architecture, we recommend you review Lesson 1’s section on the 3-pipeline architecture.
Monolithic vs. microservice inference pipelines
Usually, the inference steps can be split into 2 big layers:
- the LLM service: where the actual inference is being done
- the business service: domain-specific logic
We can design our inference pipeline in 2 ways.
Option 1: Monolithic LLM & business service
In a monolithic scenario, we implement everything into a single service.
Pros:
- easy to implement
- easy to maintain
Cons:
- harder to scale horizontally based on the specific requirements of each component
- harder to split the work between multiple teams
- unable to use different tech stacks for the two services
Option 2: Different LLM & business microservices
The LLM and business services are implemented as two different components that communicate with each other through the network, using protocols such as REST or gRPC.
Pros:
- each component can scale horizontally individually
- each component can use the best tech stack at hand
Cons:
- harder to deploy
- harder to maintain
Let’s focus on the “each component can scale individually” part, as this is the most significant benefit of the pattern. Usually, LLM and business services require different types of computing. For example, an LLM service depends heavily on GPUs, while the business layer can do its job with only a CPU.
As the LLM inference takes longer, you will often need more LLM service replicas to meet the demand. But remember that GPU VMs are really expensive.
By decoupling the 2 components, you will run only what is required on the GPU machine and not block the GPU VM with other computing that can quickly be done on a much cheaper machine.
Thus, by decoupling the components, you can scale horizontally as required, with minimal costs, providing a cost-effective solution to your system’s needs.
Microservice architecture of the LLM twin inference pipeline
Let’s understand how we applied the microservice pattern to our concrete LLM twin inference pipeline.
As explained in the sections above, we have the following components:
- A business microservice
- An LLM microservice
- A prompt monitoring microservice
The business microservice is implemented as a Python module that:
- contains the advanced RAG logic, which calls the vector DB and GPT-4 API for advanced RAG operations;
- calls the LLM microservice through a REST API, using the prompt built from the user’s query and the retrieved context;
- sends the prompt and the answer generated by the LLM to the prompt monitoring microservice.
As you can see, the business microservice is light. It glues all the domain steps together and delegates the computation to other services.
The end goal of the business layer is to act as an interface for the end client. In our case, as we will ship the business layer as a Python module, the client will be a Streamlit application.
However, you can quickly wrap the Python module with FastAPI and expose it as a REST API to make it accessible from the cloud.
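For example, a minimal FastAPI wrapper could look like the sketch below; the endpoint path, request model, and import path of the LLMTwin module are assumptions, not code from the course repository:

from fastapi import FastAPI
from pydantic import BaseModel

from llm_twin import LLMTwin  # import path assumed

app = FastAPI()
llm_twin = LLMTwin()

class GenerateRequest(BaseModel):
    query: str
    enable_rag: bool = True

@app.post("/generate")
def generate(request: GenerateRequest) -> dict:
    # Delegate to the business module, which calls the RAG retriever,
    # the LLM microservice, and the prompt monitoring service.
    return llm_twin.generate(query=request.query, enable_rag=request.enable_rag)

If you save this as api.py, you can serve it locally with uvicorn api:app and call it from any HTTP client.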
The LLM microservice is deployed on Qwak. This component is focused solely on hosting and calling the LLM. It runs on powerful GPU-enabled machines.
How does the LLM microservice work?
- It loads the fine-tuned LLM twin model from Comet’s model registry [2].
- It exposes a REST API that takes in prompts and outputs the generated answer.
- When the REST API endpoint is called, it tokenizes the prompt, passes it to the LLM, decodes the generated tokens to a string and returns the answer.
That’s it!
The prompt monitoring microservice is based on Comet ML’s LLM dashboard. Here, we log all the prompts and generated answers into a centralized dashboard that allows us to evaluate, debug, and analyze the accuracy of the LLM.
Remember that a prompt can get quite complex. When building complex LLM apps, the prompt usually results from a chain containing other prompts, templates, variables, and metadata.
Thus, a prompt monitoring service, such as the one provided by Comet ML, differs from a standard logging service. It allows you to quickly dissect the prompt and understand how it was created. Also, by attaching metadata to it, such as the latency of the generated answer and the cost to generate the answer, you can quickly analyze and optimize your prompts.
2. The training vs. the inference pipeline
Before diving into the code, let’s quickly clarify the difference between the training and inference pipelines.
Along with the apparent reason that the training pipeline takes care of training while the inference pipeline takes care of inference (Duh!), there are some critical differences you have to understand.
The input of the pipeline & How the data is accessed
Do you remember our logical feature store based on the Qdrant vector DB and Comet ML artifacts? If not, consider checking out Lesson 6 for a refresher.
The core idea is that during training, the data is accessed from an offline data storage in batch mode, optimized for throughput and data lineage.
Our LLM twin architecture uses Comet ML artifacts to access, version, and track all our data.
The data is accessed in batches and fed to the training loop.
During inference, you need an online database optimized for low latency. As we directly query the Qdrant vector DB for RAG, that fits like a glove.
During inference, you don’t care about data versioning and lineage. You just want to access your features quickly for a good user experience.
The data comes directly from the user and is sent to the inference logic.
The output of the pipeline
The training pipeline’s final output is the trained weights stored in Comet’s model registry.
The inference pipeline’s final output is the predictions served directly to the user.
The infrastructure
The training pipeline requires more powerful machines with as many GPUs as possible.
Why? During training, you batch your data and have to hold in memory all the gradients required for the optimization steps. Because of the optimization algorithm, the training is more compute-hungry than the inference.
Thus, more computing and VRAM result in bigger batches, which means less training time and more experiments.
The inference pipeline can do the job with less computation. During inference, you often pass a single sample or smaller batches to the model.
If you run a batch pipeline, you will still pass batches to the model but don’t perform any optimization steps.
If you run a real-time pipeline, as we do in the LLM twin architecture, you pass a single sample to the model or do some dynamic batching to optimize your inference step.
Are there any overlaps?
Yes! This is where the training-serving skew comes in.
During training and inference, you must carefully apply the same preprocessing and postprocessing steps.
If the preprocessing and postprocessing functions or hyperparameters don’t match, you will end up with the training-serving skew problem.
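As a quick illustration (not code from the repository), the most robust way to avoid the skew is to share the exact same preprocessing function between the two pipelines:

def clean_text(text: str) -> str:
    # The same normalization is applied at training and at serving time.
    return " ".join(text.strip().split()).lower()

# Training pipeline: applied to every sample of the fine-tuning dataset.
training_sample = clean_text("  How do I deploy an LLM   microservice? ")

# Inference pipeline: applied to the user's query before building the prompt.
user_query = clean_text("How do I deploy an LLM microservice?")

assert training_sample == user_query  # identical preprocessing → no skew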
Enough with the theory. Let’s dig into the RAG business microservice ↓
3. Settings Pydantic class
First, let’s understand how we defined the settings to configure the inference pipeline components.
We used pydantic_settings and inherited its BaseSettings class.
This approach lets us quickly define a set of default settings variables and load sensitive values such as the API KEY from a .env file.
from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ... # Settings.

    # CometML config
    COMET_API_KEY: str
    COMET_WORKSPACE: str
    COMET_PROJECT: str = "llm-twin-course"

    ... # More settings.

settings = AppSettings()
All the variables called settings.* (e.g., settings.COMET_API_KEY) come from this class.
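For reference, here is what a matching .env file could look like (the values are placeholders you must replace with your own credentials):

COMET_API_KEY=your-comet-api-key
COMET_WORKSPACE=your-comet-workspace
COMET_PROJECT=llm-twin-course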
4. The RAG business module
We will define the RAG business module under the LLMTwin class. The LLM twin logic is directly correlated with our business logic.
We don’t have to introduce the word “business” in the naming convention of the classes. The split we presented so far is only meant to enforce a clear separation of concerns between the LLM and business layers.
Initially, within the LLMTwin class, we define all the clients we need for our business logic ↓
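The snippet below is a minimal sketch of that initialization, inferred from how the clients are used later in the generate() method; the exact class names, import paths, and constructor arguments in the repository may differ:

from qwak_inference import RealTimeClient  # Qwak's real-time inference client

# Import paths and the QWAK_DEPLOYMENT_MODEL_ID setting are assumptions.
from llm_twin.prompt_templates import InferenceTemplate
from llm_twin.monitoring import PromptMonitoringManager
from settings import settings

class LLMTwin:
    def __init__(self) -> None:
        # Prompt template factory (builds templates with or without a RAG context section).
        self.template = InferenceTemplate()
        # Client that calls the LLM microservice deployed on Qwak.
        self.qwak_client = RealTimeClient(model_id=settings.QWAK_DEPLOYMENT_MODEL_ID)
        # Wrapper around Comet ML's prompt monitoring API.
        self.prompt_monitoring_manager = PromptMonitoringManager()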
Now let’s dig into the generate() method, where we:
- call the RAG module;
- create the prompt using the prompt template, query and context;
- call the LLM microservice;
- log the prompt, prompt template, and answer to Comet ML’s prompt monitoring service.
Now, let’s look at the complete code of the generate() method. It’s the same thing as what we presented above, but with all the nitty-gritty details.
class LLMTwin:
    def __init__(self) -> None:
        ...

    def generate(
        self,
        query: str,
        enable_rag: bool = True,
        enable_monitoring: bool = True,
    ) -> dict:
        prompt_template = self.template.create_template(enable_rag=enable_rag)
        prompt_template_variables = {
            "question": query,
        }

        if enable_rag is True:
            retriever = VectorRetriever(query=query)
            hits = retriever.retrieve_top_k(
                k=settings.TOP_K,
                to_expand_to_n_queries=settings.EXPAND_N_QUERY,
            )
            context = retriever.rerank(
                hits=hits,
                keep_top_k=settings.KEEP_TOP_K,
            )
            prompt_template_variables["context"] = context

            prompt = prompt_template.format(question=query, context=context)
        else:
            prompt = prompt_template.format(question=query)

        input_ = pd.DataFrame([{"instruction": prompt}]).to_json()
        response: list[dict] = self.qwak_client.predict(input_)
        answer = response[0]["content"][0]

        if enable_monitoring is True:
            # Custom metadata attached to the logged prompt (illustrative example).
            metadata = {"model": settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE}
            self.prompt_monitoring_manager.log(
                prompt=prompt,
                prompt_template=prompt_template.template,
                prompt_template_variables=prompt_template_variables,
                output=answer,
                metadata=metadata,
            )

        return {"answer": answer}
Let’s look at how our LLM microservice is implemented using Qwak.
5. The LLM microservice
As the LLM microservice is deployed on Qwak, we must first inherit from the QwakModel class and implement some specific functions.
- initialize_model(): where we load the fine-tuned model from the model registry at serving time
- schema(): where we define the input and output schema
- predict(): where we implement the actual inference logic
Note: The build() function contains all the training logic, such as loading the dataset, training the LLM, and pushing it to a Comet experiment. To see the full implementation, consider checking out Lesson 7, where we detailed the training pipeline.
Let’s zoom into the implementation and the life cycle of the Qwak model.
The schema() method is used to define what the input and output of the predict() method look like. This will automatically validate the structure and types of the predict() method’s inputs and outputs. For example, the LLM microservice will throw an error if the variable instruction is a JSON instead of a string.
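As a quick illustration (not code from the repository), here is the payload shape this schema accepts versus one it rejects:

import pandas as pd

# Valid: "instruction" is a plain string, as declared in schema().
valid_input = pd.DataFrame([{"instruction": "Write a post about RAG."}]).to_json()

# Invalid: "instruction" is a nested JSON object instead of a string, so the
# LLM microservice would reject the request at validation time.
invalid_input = pd.DataFrame(
    [{"instruction": {"text": "Write a post about RAG."}}]
).to_json()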
The other Qwak-specific methods are called in the following order:
- __init__() → when deploying the model
- initialize_model() → when deploying the model
- predict() → on every request to the LLM microservice
>>> Note that these methods are called only during serving time (and not during training).
Qwak exposes your model as a RESTful API, where the predict() method is called on each request.
Inside the prediction method, we perform the following steps:
- map the input text to token IDs using the LLM-specific tokenizer
- move the token IDs to the provided device (GPU or CPU)
- pass the token IDs to the LLM and generate the answer
- extract only the generated tokens from the generated_ids variable by slicing it using the shape of the input_ids
- decode the generated_ids back to text
- return the generated text
Here is the complete code for the implementation of the Qwak LLM microservice:
class CopywriterMistralModel(QwakModel):
    def __init__(
        self,
        use_experiment_tracker: bool = True,
        register_model_to_model_registry: bool = True,
        model_type: str = "mistralai/Mistral-7B-Instruct-v0.1",
        fine_tuned_llm_twin_model_type: str = settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE,
        dataset_artifact_name: str = settings.DATASET_ARTIFACT_NAME,
        config_file: str = settings.CONFIG_FILE,
        model_save_dir: str = settings.MODEL_SAVE_DIR,
    ) -> None:
        self.use_experiment_tracker = use_experiment_tracker
        self.register_model_to_model_registry = register_model_to_model_registry
        self.model_save_dir = model_save_dir
        self.model_type = model_type
        self.fine_tuned_llm_twin_model_type = fine_tuned_llm_twin_model_type
        self.dataset_artifact_name = dataset_artifact_name
        self.training_args_config_file = config_file

    def build(self) -> None:
        # Training logic
        ...

    def initialize_model(self) -> None:
        self.model, self.tokenizer, _ = build_qlora_model(
            pretrained_model_name_or_path=self.model_type,
            peft_pretrained_model_name_or_path=self.fine_tuned_llm_twin_model_type,
            bnb_config=self.nf4_config,
            lora_config=self.qlora_config,
            cache_dir=settings.CACHE_DIR,
        )
        self.model = self.model.to(self.device)

        logging.info(f"Successfully loaded model from {self.model_save_dir}")

    def schema(self) -> ModelSchema:
        return ModelSchema(
            inputs=[RequestInput(name="instruction", type=str)],
            outputs=[InferenceOutput(name="content", type=str)],
        )

    @qwak.api(output_adapter=DefaultOutputAdapter())
    def predict(self, df) -> pd.DataFrame:
        input_text = list(df["instruction"].values)
        input_ids = self.tokenizer(
            input_text, return_tensors="pt", add_special_tokens=True
        )
        input_ids = input_ids.to(self.device)

        generated_ids = self.model.generate(
            **input_ids,
            max_new_tokens=500,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        answer_start_idx = input_ids["input_ids"].shape[1]
        generated_answer_ids = generated_ids[:, answer_start_idx:]
        decoded_output = self.tokenizer.batch_decode(generated_answer_ids)[0]

        return pd.DataFrame([{"content": decoded_output}])
Where the settings used in the code above have the following values:
class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ... # Other settings.

    DATASET_ARTIFACT_NAME: str = "posts-instruct-dataset"
    FINE_TUNED_LLM_TWIN_MODEL_TYPE: str = "decodingml/llm-twin:1.0.0"
    CONFIG_FILE: str = "./finetuning/config.yaml"
    MODEL_SAVE_DIR: str = "./training_pipeline_output"
    CACHE_DIR: Path = Path("./.cache")
The most important one is the FINE_TUNED_LLM_TWIN_MODEL_TYPE setting, which reflects what model and version to load from the model registry.
Access the code 🔗 here ←
The final step is to look at Comet’s prompt monitoring service. ↓
6. Prompt monitoring
Comet makes prompt monitoring straightforward. There is a single API call where you connect to your project and workspace and send the following (a minimal sketch follows the list below):
- the prompt and LLM output
- the prompt template and variables that created the final output
- your custom metadata specific to your use case — here, you add information about the model, prompt token count, token generation costs, latency, etc.
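Under the hood, that single call looks roughly like the sketch below, assuming Comet’s comet_llm package; the repository’s PromptMonitoringManager wraps a call of this kind, and the project name and metadata keys shown here are assumptions:

import comet_llm

from settings import settings  # the AppSettings instance defined earlier (import path assumed)

def log_prompt_to_comet(prompt: str, template: str, variables: dict, answer: str) -> None:
    # One call sends the prompt, its template and variables, the LLM output,
    # and any custom metadata to Comet's prompt monitoring dashboard.
    comet_llm.log_prompt(
        api_key=settings.COMET_API_KEY,
        workspace=settings.COMET_WORKSPACE,
        project=f"{settings.COMET_PROJECT}-monitoring",  # e.g., "llm-twin-course-monitoring"
        prompt=prompt,
        prompt_template=template,
        prompt_template_variables=variables,
        output=answer,
        metadata={"model": settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE},
    )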
Let’s look at the logs in Comet ML’s LLMOps dashboard.
Here is how you can quickly access them ↓
- log in to Comet (or create an account)
- go to your workspace
- access the project with the “LLM” symbol attached to it. In our case, this is the “llm-twin-course-monitoring” project.
Note: Comet ML provides a free version which is enough to run these examples.
This is how Comet ML’s prompt monitoring dashboard looks. Here, you can scroll through all the prompts that were ever sent to the LLM. ↓
You can click on any prompt and see everything we logged programmatically using the PromptMonitoringManager class.
Besides what we logged, adding various tags and the inference duration can be valuable.
7. Deploying and running the inference pipeline
Qwak makes the deployment of the LLM microservice straightforward.
During Lesson 7, we fine-tuned the LLM and built the Qwak model. As a quick refresher, we ran the following CLI command to build the Qwak model, where we used the build_config.yaml file with the build configuration:
poetry run qwak models build -f build_config.yaml .
After the build is finished, we can make various deployments based on the build. For example, we can deploy the LLM microservice using the following Qwak command:
qwak models deploy realtime \
--model-id "llm_twin" \
--instance "gpu.a10.2xl" \
--timeout 50000 \
--replicas 2 \
--server-workers 2
We deployed two replicas of the LLM twin. Each replica has access to a machine with a single A10 GPU. Also, each replica has two workers running on it.
🔗 More on Qwak instance types ←
Two replicas with two workers each result in 4 copies of the LLM microservice running in parallel, ready to serve our users.
You can scale the deployment to more replicas if you need to serve more clients. Qwak provides autoscaling mechanisms triggered by listening to the consumption of GPU, CPU or RAM.
To conclude, you build the Qwak model once, and based on it, you can make multiple deployments with various strategies.
You can quickly close the deployment by running the following:
qwak models undeploy --model-id "llm_twin"
We strongly recommend closing down the deployment when you are done, as GPU VMs are expensive.
To run the LLM system with a predefined prompt example, you have to run the following Python file:
poetry run python main.py
Within the main.py file, we call the LLMTwin class, which calls the other services as explained during this lesson.
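Here is a minimal sketch of what main.py does; the import path and the example query are illustrative:

from llm_twin import LLMTwin  # import path assumed

if __name__ == "__main__":
    llm_twin = LLMTwin()
    result = llm_twin.generate(
        query="Write a LinkedIn post about RAG inference pipelines.",
        enable_rag=True,
    )
    print(result["answer"])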
Note: The → complete installation & usage instructions ← are available in the README of the GitHub repository.
🔗 Check out the code on GitHub [1] and support us with a ⭐️
Conclusion
Congratulations! You are close to the end of the LLM twin series.
In Lesson 9 of the LLM twin course, you learned to build a scalable inference pipeline for serving LLMs and RAG systems.
First, you learned how to architect an inference pipeline by understanding the difference between monolithic and microservice architectures. We also highlighted the difference in designing the training and inference pipelines.
Secondly, we walked you through implementing the RAG business module and LLM twin microservice. Also, we showed you how to log all the prompts, answers, and metadata for Comet’s prompt monitoring service.
Ultimately, we showed you how to deploy and run the LLM twin inference pipeline on the Qwak AI platform.
In Lesson 10, we will show you how to evaluate the whole system by building an advanced RAG evaluation pipeline that analyzes the accuracy of the LLM’s answers relative to the query and context.
See you there! 🤗
🔗 Check out the code on GitHub [1] and support us with a ⭐️
Enjoyed This Article?
Join the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems. Every week. For FREE ↓
References
Literature
[1] Your LLM Twin Course — GitHub Repository (2024), Decoding ML GitHub Organization
[2] Add your models to Model Registry (2024), Comet ML Guides
Images
If not otherwise stated, all images are created by the author.