LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA

The LLM-Twin Free Course on Production-Ready RAG applications.

Learn how to build a full end-to-end LLM & RAG production-ready system. Learn about and code along each component.

Alex Razvant
Decoding ML

--

→ overview of all the lessons within the LLM Twin free course

Image by DALL-E

Why is this course different?

By finishing the “LLM Twin: Building Your Production-Ready AI Replica” free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself, powered by LLMs, vector DBs, and LLMOps best practices.

Why should you care? 🫵

→ No more isolated scripts or notebooks! Learn production ML by building and deploying an end-to-end production-grade LLM system.

What’s in it for you?

If you’re intrigued by Generative AI, LLMs, or RAG and want to learn, step by step, how to architect, build, and deploy a real-world LLM system, this course is the perfect choice.

You’ll learn to design, code, and deploy each component, from data collection to deployment. You’ll also learn to leverage MLOps best practices across the lifecycle, such as Experiment Tracking, Model Registry, Dataset Versioning, Prompt Monitoring, and more.

The end goal? Build and deploy your own LLM twin.

Costs

The course code repository and attached materials (e.g., articles on Medium) are completely free and will remain that way.

Regarding tooling costs, we did our best to keep them at zero or close to it and to cover the entire development and deployment within the free/freemium tiers.

However, if you plan to run the whole deployment stack, keep the following key points in mind:

  • AWS offers accessible plans for new customers. With a first-time account, you can get up to $300 in free credits, valid for 6 months. For more details, consult the AWS Offerings page.
  • Qwak [2] also offers a generous plan of up to 100 QPU per month for a year, which is more than enough to set up and run the LLM-Twin components that use Qwak. Once the 100 QPU limit is reached, pricing switches to a pay-as-you-go model of $1.2 per additional QPU. For more details, consult the Qwak Pricing [7] and Qwak Compute [8] pages.

The System Design

The architecture of the LLM twin is split into 4 Python microservices:

  1. the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize, and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern; a sketch of such a queue message follows this list. (deployed on AWS)
  2. the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real time. (deployed on AWS)
  3. the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML’s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet’s model registry. (deployed on Qwak)
  4. the inference pipeline: load and quantize the fine-tuned LLM from Comet’s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet’s prompt monitoring dashboard. (deployed on Qwak)
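As referenced in the data collection pipeline above, here is a minimal, hypothetical sketch of the kind of change event the CDC component could publish to the queue for the feature pipeline to consume. The model and field names (`ChangeEvent`, `entry_id`, `type`, `author_id`, `content`) are illustrative assumptions, not the exact schema used in the course; `model_dump_json()` assumes Pydantic v2 (v1 uses `.json()`).

```python
from pydantic import BaseModel


class ChangeEvent(BaseModel):
    """Hypothetical CDC message pushed to the queue when a document
    is inserted into the NoSQL DB by the data collection pipeline."""

    entry_id: str   # MongoDB document id, serialized as a string
    type: str       # e.g. "posts", "articles", or "repositories"
    author_id: str  # owner of the crawled content
    content: dict   # the cleaned, normalized payload


# Serialize an event before publishing it to the message queue.
event = ChangeEvent(
    entry_id="65f1c0ffee",
    type="articles",
    author_id="alex",
    content={"title": "Hello, LLM Twin", "text": "..."},
)
message_body = event.model_dump_json()  # ready to be sent to RabbitMQ
```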

Audience

We recommend this course for MLEs, DEs, DSs, or SWEs who want to learn to engineer production-ready LLM systems using sound LLMOps principles.
Level: intermediate
Prerequisites: basic knowledge of Python, ML, and basic cloud provider tooling.

Meet your teachers!

The course is created under the Decoding ML umbrella by Paul Iusztin, Alex Vesa, and Alex Razvant.

🔗 Check out the code on GitHub [1] and support us with a ⭐️

The LLM-Twin Free Course

This course teaches you how to design, build, and deploy a production-ready LLM-RAG system. It covers all the components: system design, data ingestion, the streaming pipeline, the fine-tuning pipeline, the inference pipeline, production monitoring, and more.

What is the project about?

We’re building a RAG system able to write content in your unique style. We scrape the posts, articles, and code snippets you have previously written to construct the knowledge base, generate a dataset to fine-tune a capable and efficient open-source LLM, and then interconnect the components into a full end-to-end deployment with evaluation and post-deployment monitoring.

The Course Lessons

The course is split into 12 lessons. Every Medium article covers one entire lesson.

In the first set of lessons, we discuss the architecture and prepare the data-related components: web scraping, vector retrieval, and stream processing.

After that, we present ML-centered concepts such as LLM fine-tuning, dataset preparation, versioning, model validation, and registration.

In the last set, we continue with building the RAG pipeline, evaluating the fine-tuned LLM, evaluating the RAG pipeline, and deploying the system.

Let’s go through each lesson and briefly describe what’s covered.

Lesson 1: Presenting the Architecture

We present and describe each component, the tooling used, and the intended implementation workflow. The first lesson prepares the ground by offering a broad overview of each component and design consideration.

We recommend you start here.

🔗 Lesson 1: An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin

LLM twin system architecture [Image by the Author]

Lesson 2: Data Pipelines

In this lesson, we’ll start by explaining what a data pipeline is and the key concepts of data processing and streaming, and then dive into the data scraping and processing logic. We go into detail about what type of data we’re fetching and how we further enhance and store it.

Each concept is accompanied by a code block explaining the implementation logic.
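The course implements dedicated crawlers for each platform. Purely as a rough illustration of the ETL idea (not the course’s actual crawler code), here is a minimal sketch that fetches a page with requests, cleans the text with BeautifulSoup, and upserts it into MongoDB with pymongo; the URL, database, and collection names are placeholder assumptions.

```python
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient


def crawl_and_store(url: str, mongo_uri: str = "mongodb://localhost:27017") -> None:
    # Extract: fetch the raw HTML of an article/post.
    html = requests.get(url, timeout=10).text

    # Transform: strip markup and normalize whitespace.
    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(soup.get_text(separator=" ").split())

    # Load: upsert the cleaned document into a NoSQL collection.
    client = MongoClient(mongo_uri)
    collection = client["llm_twin"]["articles"]  # placeholder DB/collection names
    collection.update_one(
        {"link": url},
        {"$set": {"link": url, "content": text}},
        upsert=True,
    )


crawl_and_store("https://example.com/some-article")
```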

🔗 Lesson 2: The Importance of Data Pipelines in the Era of Generative AI

Lesson 2: The Data Collection Pipeline [Image by author]

Lesson 3: Change Data Capture and Data Processing

In this lesson, we’re showcasing the CDC (Change Data Capture) integration within the LLM-Twin data pipeline. We show how to set up MongoDB, the CDC approach for event-driven processing, RabbitMQ for message queuing, and efficient low-latency database querying using the MongoDB oplog.

In the end, we’re explaining how to build and deploy the entire workflow.
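As a minimal sketch of this flow, and assuming a local MongoDB replica set and RabbitMQ instance, the snippet below uses MongoDB change streams (which are built on top of the oplog) via pymongo and forwards each insert to a queue with pika. The database, collection, and queue names are placeholders, not the course’s exact configuration.

```python
import json

import pika
from bson import json_util
from pymongo import MongoClient

# Connect to MongoDB (change streams require a replica set) and RabbitMQ.
mongo = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="cdc_queue", durable=True)

# Watch for inserts on the collection and forward each change event.
with mongo["llm_twin"]["articles"].watch(
    [{"$match": {"operationType": "insert"}}]
) as stream:
    for change in stream:
        body = json.dumps(change["fullDocument"], default=json_util.default)
        channel.basic_publish(exchange="", routing_key="cdc_queue", body=body)
```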

🔗 Lesson 3: Change Data Capture and Data Processing

Lesson 3: Event-Driven Processing using RabbitMQ, CDC, and MongoDB (Image by Author)

Lesson 4: Efficient Data Streaming Pipelines

In this lesson, we’ll focus on the feature pipeline. Here, we showcase how we ingest the data gathered in the previous lesson and how we built a stream-processing workflow that fetches raw samples, structures them using Pydantic models, then cleans, chunks, encodes, and stores them in our Qdrant vector database.

In this lesson, you’ll see the power and simplicity of Bytewax in action and its seamless out-of-the-box integration with the Qdrant vector DB.
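To give a taste of that, here is a minimal, hypothetical Bytewax dataflow using the operators API from Bytewax 0.18+ (older releases use a method-based API). It reads from an in-memory test source, applies stand-in cleaning and embedding steps, and prints the results instead of writing to Qdrant; the step names and helper functions are illustrative assumptions.

```python
import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource


def clean(raw: str) -> str:
    # Stand-in for the real cleaning step (strip noise, normalize whitespace).
    return " ".join(raw.split())


def embed(text: str) -> tuple[str, list[float]]:
    # Stand-in for the real embedding step (e.g., a sentence-transformers model).
    return text, [float(len(text))]


flow = Dataflow("llm_twin_feature_pipeline")
raw_stream = op.input("input", flow, TestingSource(["  raw   post one ", "raw post two"]))
cleaned = op.map("clean", raw_stream, clean)
embedded = op.map("embed", cleaned, embed)
# In the course this sink would write to Qdrant; here we just print.
op.output("out", embedded, StdOutSink())

# Run with: python -m bytewax.run this_module:flow
```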

🔗 Lesson 4: SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG — in Real-Time!

Lesson 4: Efficient Data Streaming Pipelines using Bytewax and Qdrant Vector DB. (Image by Author)

Lesson 5: Advanced RAG Optimization Techniques

In this lesson, we’ll go more in depth and showcase a few advanced techniques we’ve employed to increase the similarity and accuracy of the samples retrieved from our Qdrant vector database. We’ll present hybrid search, self-query, query expansion, and vector re-ranking within the vector retrieval stage.

The contents of this lesson could make a significant difference between a naive RAG application and a production-ready one.
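To give a flavor of one of these techniques, here is a simplified query-expansion sketch (not the course’s exact implementation): an LLM generates paraphrases of the user query, each variant is embedded and searched against Qdrant, and the merged results are deduplicated. The embedding model, collection name, and prompt are assumptions.

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

llm = OpenAI()                                      # reads OPENAI_API_KEY from the env
qdrant = QdrantClient("localhost", port=6333)
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model


def expand_query(query: str, n: int = 3) -> list[str]:
    # Ask an LLM for paraphrases of the original query.
    response = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Generate {n} diverse paraphrases of: '{query}'. One per line.",
        }],
    )
    variants = response.choices[0].message.content.strip().splitlines()
    return [query] + variants[:n]


def retrieve(query: str, collection: str = "articles", k: int = 5) -> list:
    hits = []
    for variant in expand_query(query):
        vector = encoder.encode(variant).tolist()
        hits += qdrant.search(collection_name=collection, query_vector=vector, limit=k)
    # Deduplicate by point id and keep the best score per document.
    best = {}
    for hit in hits:
        if hit.id not in best or hit.score > best[hit.id].score:
            best[hit.id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:k]
```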

🔗 Lesson 5: The 4 Advanced RAG Algorithms You Must Know to Implement

Lesson 5: Advanced RAG Optimization Techniques. (Image by Author)

Lesson 6: Dataset preparation for LLM fine-tuning

In this lesson, we’ll discuss the core concepts to consider when creating task-specific custom datasets to fine-tune LLMs. Creating datasets manually is a tedious and time-consuming process; to avoid that, we’ll showcase a widely used technique for generating LLM fine-tuning datasets: knowledge distillation.

We’ll use the cleaned data from our vector database and engineer specific prompt templates, using the GPT-3.5-Turbo API to generate our custom dataset. We’ll then discuss data versioning and lineage, a key concept in MLOps, by versioning our datasets as artifacts on Comet ML.
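Here is a rough sketch of the knowledge-distillation idea, assuming the OpenAI Python SDK (v1) and a placeholder prompt template; the course’s actual prompt engineering and dataset format differ.

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You generate fine-tuning data. Based on the following content, write one "
    "instruction a user might give and the ideal answer, as JSON with keys "
    "'instruction' and 'answer'.\n\nContent:\n{content}"
)


def distill(cleaned_documents: list[str]) -> list[dict]:
    """Use a teacher model (GPT-3.5-Turbo) to generate instruction/answer pairs."""
    samples = []
    for doc in cleaned_documents:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(content=doc)}],
        )
        # Assumes the model returned valid JSON; production code would validate this.
        samples.append(json.loads(response.choices[0].message.content))
    return samples

# The resulting samples would then be logged as a versioned dataset artifact
# (e.g., on Comet ML) for the training pipeline to consume.
```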

🔗 Lesson 6: The Role of Feature Stores in Fine-Tuning LLMs

Lesson 6: Generate custom datasets using Knowledge Distillation.

Lesson 7: Fine-tuning LLMs on custom datasets

In this lesson, we’re discussing the training pipeline.
We’ll show how to implement a fine-tuning workflow for a Mistral-7B-Instruct model using the custom dataset we versioned previously.

We’ll present in depth the key concepts one must know when fine-tuning an LLM, including quantization, LoRA adapter versioning, and PEFT, as well as how to use the Qwak platform and the Qwak CLI tool, while covering the entire workflow.
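To make those concepts concrete, here is a minimal QLoRA-style setup with Hugging Face transformers and peft. It is a sketch, not the course’s exact training code: the base model variant, hyperparameters, and target modules are illustrative assumptions.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed base model variant

# Quantization: load the base model in 4-bit (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# PEFT: attach small trainable LoRA adapters instead of updating all weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of params are trainable
```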

🔗 Lesson 7: How to fine-tune LLMs on custom datasets at Scale using Qwak and CometML

Lesson 7: Fine-tuning LLMs on custom datasets using Qwak and CometML. (Image by Author)

Lesson 8: Evaluating the fine-tuned LLM

In this lesson, we’re discussing one of the core concepts of ML: evaluation.
We’ll present the evaluation workflow we’ve implemented for our use case, covering common techniques for quantitative and qualitative evaluation (e.g., perplexity, ROUGE, BLEU), and showcase the full process of assessing the model’s performance using GPT-3.5-Turbo and custom-engineered evaluation templates. Further, we’ll monitor each prompt and LLM chain using Comet ML LLM.
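As a small illustration of both sides, here is a sketch that computes ROUGE-L with the rouge-score package and runs a simplified LLM-as-a-judge check with GPT-3.5-Turbo; the judging prompt is a stand-in for the course’s custom evaluation templates.

```python
from openai import OpenAI
from rouge_score import rouge_scorer


def rouge_l(reference: str, prediction: str) -> float:
    # Quantitative check: lexical overlap between the reference and the prediction.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure


def judge(question: str, answer: str) -> str:
    # Qualitative check: ask a stronger model to grade the answer.
    client = OpenAI()
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Rate the answer from 1 to 5 for relevance and correctness, then explain briefly."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(rouge_l("The cat sat on the mat.", "A cat is sitting on the mat."))
```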

🔗 Lesson 8: Best Practices When Evaluating Fine-Tuned LLMs

Lesson 8: Evaluating the quality of our custom fine-tuned LLM. (Image by Author)

Lesson 9: Deploying the Inference Pipeline Stack

In this lesson, we’ll showcase how to design and implement the LLM & RAG inference pipeline based on a set of decoupled Python microservices. First, we’ll split the ML and business logic into two components and describe each in turn.

Then, we’ll integrate the monitoring functionality using Comet ML to capture every pass through the inference pipeline. In the end, we’ll showcase how to package and deploy the inference pipeline on Qwak as a scalable and reproducible system.
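The course deploys the LLM microservice on Qwak; purely to illustrate the ML/business split, here is a generic sketch in which the business microservice is a small FastAPI app that retrieves context, builds the prompt, and calls a separately deployed LLM microservice over REST. The endpoint path, service URL, and retrieval stub are placeholder assumptions, not the course’s actual code.

```python
import requests
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
LLM_SERVICE_URL = "http://llm-microservice:8000/generate"  # placeholder endpoint


class Query(BaseModel):
    question: str


def retrieve_context(question: str) -> list[str]:
    # Stand-in for the RAG retrieval step (vector search against Qdrant).
    return ["<retrieved chunk 1>", "<retrieved chunk 2>"]


@app.post("/ask")
def ask(query: Query) -> dict:
    # Business logic: retrieval and prompt building live in this service ...
    context = "\n".join(retrieve_context(query.question))
    prompt = f"Answer using the context below.\n{context}\n\nQuestion: {query.question}"
    # ... while the heavy ML logic (the fine-tuned LLM) runs in its own service.
    response = requests.post(LLM_SERVICE_URL, json={"prompt": prompt}, timeout=30)
    return {"answer": response.json().get("answer", "")}
```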

🔗 Lesson 9: Architect scalable and cost-effective LLM & RAG inference pipelines

Lesson 9: Architecting the LLM & RAG inference pipeline. (Image by Author)

Lesson 10: RAG Pipeline Evaluation

In this lesson, we’re covering RAG evaluation, which is of great importance. If no proper evaluation metrics are monitored or techniques are used, the RAG system might underperform and hallucinate badly.

Here, we’ll describe the workflow of evaluating RAG pipelines using the powerful RAGAs framework, which is widely used in development and production to steer and improve system results. We’ll show how to install and set it up, compose the expected RAGAs evaluation format, and capture evaluation scores, which are attached to full LLM execution chains and logged to Comet ML LLM [3].
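Here is a minimal sketch of such a RAGAs run, assuming a ragas 0.1.x-style API; exact metric imports and column names (e.g., ground_truth vs. ground_truths) vary between releases.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One evaluation sample: the user question, the generated answer,
# the retrieved contexts, and the reference (ground-truth) answer.
samples = {
    "question": ["What does the LLM Twin course teach?"],
    "answer": ["It teaches how to build a production-ready LLM & RAG system."],
    "contexts": [["The LLM Twin course covers data pipelines, fine-tuning, RAG, ..."]],
    "ground_truth": ["Building an end-to-end production-ready LLM & RAG system."],
}

results = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)  # per-metric scores that can be logged to Comet ML LLM
```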

🔗 Lesson 10: Evaluating RAG Systems using the RAGAs Framework

Lesson 10: Evaluating the RAG pipeline. (Image by Author)

Lesson 11: Refactor and improve the RAG ingestion pipeline

In this lesson, we will refactor the RAG ingestion pipeline using Superlinked, a framework specialized in vector computing for information retrieval.

We will take the ingestion pipeline implemented in Lesson 4 and swap the chunking, embedding, and vector DB logic with Superlinked.

Why Superlinked? Here are its core benefits:

  • A clear way to define a data structure aggregating your structured and unstructured attributes.
  • Out-of-the-box multi-indexing.
  • A clean Python SDK that results in a concise, easy-to-maintain codebase.
  • An out-of-the-box vector compute server that you can scale independently from the rest of your system.
  • An intuitive interface for complex vector search queries.

🔗 Lesson 11: Build a scalable RAG ingestion pipeline using 74.3% less code

Lesson 11: The RAG feature pipeline architecture after refactoring. (Image by Author)

Lesson 12: Build Multi-Index Advanced RAG Apps

This lesson will teach you to implement multi-index structures for building advanced RAG systems.

To implement our multi-index collections and queries, we will leverage Superlinked (the same as in Lesson 11), a vector compute engine highly optimized for working with vector data. It offers solutions for ingestion, embedding, storage and retrieval.

To better understand how Superlinked queries work, we will gradually present how to build a complex query that uses two vector indexes, adds filters based on the metadata extracted using an LLM, and returns only the top K most similar documents to reduce network I/O overhead.

Ultimately, we will dig into how Superlinked can help us implement and optimize various advanced RAG methods, such as query expansion, self-query, filtered vector search and rerank.

🔗 Lesson 12: Build Multi-Index Advanced RAG Apps

Tooling

Throughout the course, you’ll get the chance to learn and use the following:

  1. Comet ML: the ML Platform
    Trusted by ML teams at Uber, Netflix, Mobileye, and Etsy, Comet ML is an amazing platform widely known and used within the MLOps workflow.
    We’ll use the Model Registry, Experiment Tracking, and LLM Monitoring features from Comet ML, as these are key for any ML product that reaches production.
  2. Qdrant: the Vector Database
    The core of each RAG application is a knowledge base used as the context source to enrich the LLM’s response. Qdrant is the leading open-source vector database and similarity search engine, designed to handle high-dimensional vectors for performant, massive-scale AI applications.
  3. Qwak: the ML infrastructure
    Effective, fast-to-deploy approaches for ML products define your production success. Qwak encompasses everything you need to deliver AI applications at speed, from idea to high scale.
    We’ll use Qwak to offload the heavy workflow of fine-tuning and serving our LLM, leveraging Qwak’s easy-to-iterate and cost-effective approach.
  4. Bytewax: the streaming engine
    To ensure we have fresh data in real time, we need a powerful approach to data streaming. We haven’t found a better tool for that than Bytewax.
    Coupling the simplicity of Python with the performance of Rust, Bytewax allows us to build real-time streaming pipelines that ingest any type of data.
  5. Superlinked: the vector compute engine
    In the last two bonus lessons, we will show you how to refactor the RAG ingestion and retrieval pipelines (using Superlinked) to improve the indexing and retrieval steps of the RAG logic for better accuracy. Superlinked will also help us keep the code clean and concise, reducing it by 74.3%.

Our Recommendations

Our 4-month journey of hard work putting this course together, harnessing our best insights as ML engineers, ends here.

Our motivation to develop and publish this course was the lack of beginner or intermediate-level resources on how to architect a production-ready Generative AI solution from start to finish.

Loving what we do, we’ve planned, designed, and published this free course so that it can serve as a solid roadmap for anyone looking to learn about LLMs, RAG, the best open-source tooling, and the best MLOps/LLMOps practices.

This course is not a one-way take: we’ve planned it as a set of microservices, such that each of its modules can be revisited and adapted to specific use cases.

We recommend that you walk through each lesson, read its contents, and then execute the attached code interactively.

If you have any questions related to any of the lessons, feel free to:

  1. Support us with a ⭐️ and leave an issue on our GitHub Repository
  2. Follow us and leave a comment on the specific article on Medium Decoding ML
  3. Follow us on LinkedIn and message us if you have questions
    - Paul Iusztin | Senior ML & MLOps Engineer
    - Alex Vesa | Senior AI Engineer
    - Alex Razvant | Senior ML & MLOps Engineer
  4. Subscribe to our free Substack Newsletter and contact us there.

Ending Notes

Thank you for taking part in this journey and learning, hands-on, how to implement a full end-to-end production-ready LLM & RAG system that deploys your content-writing twin.

The ML Community has been awesome.
We thank everyone who took part in this, supported us with a star on our GitHub repository, subscribed to our Decoding ML Newsletter, followed us on LinkedIn and Medium, and overall showed interest in the LLM-Twin course.

We’ve received massive support from the community via likes, shares, claps on Medium, and messages thanking us for this contribution, and we appreciate them all greatly!

The course wouldn’t have been possible without the massive support from the great teams at Bytewax [6], Qdrant [5], Comet ML [4], Qwak [2], and Superlinked [9], who not only developed excellent tools but have helped us whenever we had any questions.

References

[1] LLM Twin GitHub Repository, 2024, Decoding ML GitHub Organization

[2] Qwak, 2024, The Qwak.ai Platform Landing Page

[3] Comet ML LLM, The Comet ML LLM Platform

[4] Comet ML, The Comet ML Landing Page

[5] Qdrant, The Qdrant Vector Database Landing Page

[6] Bytewax, The Bytewax Landing Page

[7] Qwak Pricing, The Qwak MLOps Platform Pricing

[8] Qwak Compute, Qwak Instance Sizes Documentation

[9] Superlinked, Superlinked Landing Page
