Stories by Tam Nguyen on Medium

9 Methods to Enhance the Performance of a LLM RAG Application

Tam Nguyen — Mon, 20 Nov 2023 02:48:25 GMT

It is easy to prototype your first LLM RAG (Retrieval Augmented Generation) application, e.g. using this chat-langchain template with the architecture below.

Image by LangChain

But it is hard to make it work well. In this article, I gather and share some approaches to enhance the performance of an LLM RAG application.

For more details, please refer to the references I mention in each section.

1. Store message histories and user feedback

Chat histories and user feedback are important for application analytics. We will use them in a later section.

In the schema above, one collection has multiple embeddings. One user can have many chat sessions, and in each chat session we store the messages (between human and AI) and their analytical information, such as generated questions (condensed questions or questions after query transformations), retrieved chunks with their distance scores, and user feedback.
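The session and message part of this schema can be sketched with plain dataclasses. This is only an illustration: the class and field names below are assumptions made for the example, not the actual table definitions, and feedback is assumed to be stored as 1 (good) or 0 (bad).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RetrievedChunk:
    chunk_id: str
    distance: float  # distance score returned by the vector store


@dataclass
class Message:
    role: str  # "human" or "ai"
    content: str
    generated_question: Optional[str] = None  # condensed or transformed query
    retrieved_chunks: list[RetrievedChunk] = field(default_factory=list)
    feedback: Optional[int] = None  # assumed encoding: 1 = good, 0 = bad


@dataclass
class ChatSession:
    session_id: str
    user_id: str
    messages: list[Message] = field(default_factory=list)

    def liked_pairs(self) -> list[tuple[str, str]]:
        """Return (question, answer) pairs whose AI answer got good feedback."""
        pairs = []
        for prev, cur in zip(self.messages, self.messages[1:]):
            if prev.role == "human" and cur.role == "ai" and cur.feedback == 1:
                pairs.append((prev.content, cur.content))
        return pairs
```

A helper like liked_pairs is one way to later pull well-rated question and answer pairs out of the stored history.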

2. Start evaluating your application

A naive RAG app can face several challenges, e.g. bad retrieval (low precision, low recall, outdated information) and bad response generation (hallucination, irrelevance, toxicity/bias).

Before improving it, we need a way to measure its performance. We can use the following ragas metrics for the evaluation.

Because some scores need ground_truth, there are two ways to create the labeled evaluation dataset:

Human-annotated labeled dataset: we can use real questions and answers that received good user feedback, taken from the chat_analysis table in the first method.
Generated labeled dataset (good for cold start): prompt GPT-4 Turbo to generate questions from one or multiple chunks to get pairs of question and doc chunks. Then run these pairs of question and context through GPT-4 Turbo to generate the answers.

Finally, we can run this labeled dataset through the pre-defined metrics above, using GPT-4 Turbo as an evaluator/judge.
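Before wiring up an LLM judge, the retrieval side can be checked with much simpler numbers. The sketch below is not one of the ragas metrics; it is a plain precision and recall at k over a labeled dataset, assuming each row lists the chunk ids that should be retrieved (the row layout and the retrieve callable are assumptions made for the example).

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall of the top-k retrieved chunk ids."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(dataset, retrieve, k=3):
    """Average precision/recall at k.

    dataset:  list of {"question": str, "relevant_ids": [chunk ids]}
    retrieve: callable mapping a question to a ranked list of chunk ids
    """
    if not dataset:
        return {"precision": 0.0, "recall": 0.0}
    scores = [
        precision_recall_at_k(retrieve(row["question"]), row["relevant_ids"], k)
        for row in dataset
    ]
    return {
        "precision": sum(p for p, _ in scores) / len(scores),
        "recall": sum(r for _, r in scores) / len(scores),
    }
```

Tracking these two numbers per release gives a cheap baseline to compare against the LLM-judged scores.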

For better observability, we should also use LLMOps platforms such as LangSmith or MLflow, or integrate DeepEval into CI/CD pipelines.

3. Multi-vector retriever

When splitting documents for retrieval, there are often conflicting desires:

We may want small chunks, so that their embeddings most accurately reflect their meaning. If a chunk is too long, its embedding can lose meaning.
We also want chunks long enough to retain context. Splitting a document repeatedly by separators and chunk_size sometimes breaks the context unexpectedly. It is also hard to recombine the chunks in the right order to form a meaningful document for the prompt context.

The approaches below balance precise embeddings and context retention by splitting documents into smaller chunks for embedding while retrieving larger pieces of text, or even the whole original document, for the prompt context. This is feasible because many LLMs now support long context windows, e.g. GPT-4 Turbo supports 128,000 tokens.

Image by TheAiEdge

Parent Document Retriever

Instead of indexing entire documents, the data is divided into parent documents and smaller child documents split from them.
Child documents are indexed for better representation of specific concepts, while parent documents are retrieved to ensure context retention.
Sometimes the full documents are too big to retrieve as is. In that case, first split the raw documents into larger chunks, then split those into smaller chunks. We index the smaller chunks, but on retrieval we return the larger chunks (still not the full documents).
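The parent and child idea can be sketched in a few lines of plain Python. This is a toy illustration, not the LangChain implementation: chunks are split by word count, and a word-overlap score stands in for embedding similarity. The class and function names are made up for the example.

```python
import re


def split_text(text, chunk_size):
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def overlap_score(query, text):
    """Toy similarity: Jaccard overlap of word sets (stand-in for embeddings)."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q | t) if q | t else 0.0


class ParentChildIndex:
    def __init__(self, parent_size=40, child_size=10):
        self.parent_size = parent_size
        self.child_size = child_size
        self.parents = []   # larger chunks, returned to the prompt
        self.children = []  # (child_text, parent_index), used for search

    def add_document(self, text):
        for parent in split_text(text, self.parent_size):
            parent_idx = len(self.parents)
            self.parents.append(parent)
            for child in split_text(parent, self.child_size):
                self.children.append((child, parent_idx))

    def retrieve(self, query, k=1):
        """Search the small chunks, return the k distinct parents they belong to."""
        ranked = sorted(self.children, key=lambda c: overlap_score(query, c[0]), reverse=True)
        parent_ids = []
        for _, parent_idx in ranked:
            if parent_idx not in parent_ids:
                parent_ids.append(parent_idx)
            if len(parent_ids) == k:
                break
        return [self.parents[i] for i in parent_ids]
```

The query only has to match a small child chunk, yet the prompt receives the surrounding parent chunk with its fuller context.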

Hypothetical Questions

Documents are processed to generate potential questions they might answer.
These questions are then indexed for better representation of specific concepts, while parent documents are retrieved to ensure context retention.

Summaries

Instead of indexing the entire document, a summary of the document is created and indexed.
Similarly, the parent document is retrieved in the application.

4. Query transformations

The original query is not always optimal for retrieval, especially in the real world: the user often doesn’t provide the full context and thinks about the question from a specific angle.

Query transformation rewrites the user’s question before passing it to the embedding model. Below are a few variations of query transformation methods and their sample prompt implementations. They all use an LLM to generate one or more new queries.
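As one illustration, a multi-query style transformation might look like the sketch below. The prompt wording and the function names are assumptions made for this example, and the LLM is passed in as a plain callable (prompt in, text out) so that no specific client library is implied.

```python
from typing import Callable, List

# Hypothetical prompt; tune the wording for your own model.
MULTI_QUERY_PROMPT = (
    "Rewrite the question below in {n} different ways to improve document "
    "retrieval. Return one rewrite per line.\n\nQuestion: {question}"
)


def multi_query(question: str, llm: Callable[[str], str], n: int = 3) -> List[str]:
    """Return the original question plus up to n distinct LLM rewrites."""
    prompt = MULTI_QUERY_PROMPT.format(n=n, question=question)
    rewrites = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    seen, queries = set(), []
    for query in [question, *rewrites[:n]]:
        if query.lower() not in seen:  # drop case-insensitive duplicates
            seen.add(query.lower())
            queries.append(query)
    return queries


def retrieve_union(queries, retrieve, k=3):
    """Run every query through `retrieve` and merge ids in first-seen order."""
    merged = []
    for query in queries:
        for doc_id in retrieve(query)[:k]:
            if doc_id not in merged:
                merged.append(doc_id)
    return merged
```

Each rewritten query is embedded and searched separately, and the union of the results is passed on to the later stages.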

We can also combine multiple query transformation techniques to get the best result e.g.

5. Query construction for retrieval optimization

Self-querying

Remember the metadata column in the embedding table above? We can include additional information such as author, genre, rating, the date it was written, and any other information about the document beyond the text itself. We can define a schema and store the metadata in a structured way alongside the vector representation.

With the database metadata schema, we use an LLM to construct a structured query from the question to filter the document chunks. At the same time, the question is also converted into its vector representation for the similarity search. This kind of hybrid retrieval approach is likely to become more and more common as RAG becomes a more widely adopted strategy.

Time-weighted retriever

In some cases, the information contained in the documents is only relevant if it is recent enough. A time-weighted retriever retrieves data based on a hybrid score that combines semantic similarity and the age of the document. The scoring algorithm can be:

semantic_similarity + (1.0 - decay_rate) ^ hours_passed
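The formula translates directly into code. In the sketch below the function names and the default decay_rate of 0.01 are assumptions chosen for illustration; only the scoring expression itself comes from the formula above.

```python
def time_weighted_score(semantic_similarity, hours_passed, decay_rate=0.01):
    """semantic_similarity + (1.0 - decay_rate) ^ hours_passed"""
    if not 0.0 <= decay_rate <= 1.0:
        raise ValueError("decay_rate must be between 0 and 1")
    return semantic_similarity + (1.0 - decay_rate) ** hours_passed


def rank_by_time_weighted_score(docs, decay_rate=0.01):
    """Sort (doc_id, semantic_similarity, hours_passed) tuples, best first."""
    return sorted(
        docs,
        key=lambda doc: time_weighted_score(doc[1], doc[2], decay_rate),
        reverse=True,
    )
```

With a decay_rate close to 0 the recency term stays near 1 for every document, so ranking falls back to semantic similarity. With a decay_rate close to 1 the recency term vanishes after the first hour, so it only boosts documents that are less than about an hour old.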

Other query constructions

Image by LangChain

6. Document selection optimization

Re-ranking

After first-stage retrieval (lexical/keyword-based search or semantic/embedding-based search), re-ranking acts as a second stage that orders the retrieved documents by relevance score.

Image by Cohere

Maximal Marginal Relevance (MMR)

Sometimes we retrieve more than we actually need, and several similar documents can capture the same information. The MMR metric penalizes redundant information.

The re-ranking is an iterative process: at each step we measure the similarity of each candidate vector to the query and to the vectors already selected, then pick the vector that is similar to the query but dissimilar to those already selected.
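The iteration can be sketched in plain Python as follows. It uses cosine similarity and the usual trade-off parameter (called lambda_mult here, an illustrative name): 1.0 means pure relevance, lower values penalize redundancy more.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 on the first round).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

For example, with a query [1, 0] and documents [1, 0], [0.99, 0.14] and [0.6, 0.8], a low lambda_mult skips the near-duplicate second document in favor of the third, while lambda_mult=1.0 keeps the two most similar documents.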

7. Context Optimization

Now that we have selected the correct documents to answer the question, we need to figure out how we will pass data as context for the LLM to answer the question.

Image by LangChain

Stuffing is the simplest method to pass data to the language model. It concatenates the text of the documents and passes it to the prompt.
MapReduce implements a multi-stage summarization. It is a technique for summarizing large pieces of text by first summarizing smaller chunks of text and then combining those summaries into a single summary. Instead of summarization, we can also iterate through the documents to extract the information likely to answer the question.
Refine method is an alternative method to deal with large document summarization. It works by first running an initial prompt on a small chunk of data, generating some output. Then, for each subsequent document, the output from the previous document is passed in along with the new document, and the LLM is asked to refine the output based on the new document.
Map-rerank strategy iterates through each document and tries to answer the question along with a score on how well the question was answered. We can then pick the answer with the highest score.
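Three of these strategies can be sketched with the LLM passed in as a plain callable. The prompt strings and function names below are assumptions made for illustration, not the prompts any particular library uses.

```python
def stuff(question, docs, llm):
    """Stuffing: concatenate all documents into one prompt."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def refine(question, docs, llm):
    """Refine: carry the running answer forward, one document at a time."""
    answer = ""
    for doc in docs:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {doc}\n"
            f"Question: {question}\nRefine the existing answer."
        )
    return answer


def map_rerank(question, docs, answer_with_score):
    """Map-rerank: answer per document, keep the highest-scored answer.

    answer_with_score(question, doc) must return an (answer, score) tuple.
    """
    if not docs:
        raise ValueError("map_rerank needs at least one document")
    results = [answer_with_score(question, doc) for doc in docs]
    return max(results, key=lambda result: result[1])[0]
```

Stuffing costs one LLM call but is bounded by the context window. Refine and map-rerank cost one call per document, and refine cannot be parallelized because each call depends on the previous answer.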

https://medium.com/media/202078b529c0068095c17f2f534c93d6/href

8. Multimodal RAG

When dealing with semi-structured or unstructured data, e.g. tables, text, and images, we might need a multimodal LLM and/or multimodal embeddings. Below are some options:

Image by LangChain

Use multimodal embeddings (such as CLIP) to embed images and text together. Retrieve both using similarity search, keeping links to the raw images in a docstore. Pass the raw images and text chunks to a multimodal LLM for synthesis.
Use a multimodal LLM (such as GPT-4V, LLaVA, or Fuyu-8B) to produce text summaries from images. Embed and retrieve the text summaries using a text embedding model, and reference the raw text chunks or tables from a docstore for answer synthesis. From there, either exclude the images from the docstore and pass the image text summaries to an LLM, or keep the raw images, tables, and text and use a multimodal LLM for synthesis.

9. Agents

Last but not least, a RAG app does not have to be limited to answering questions from documents: we can give the LLM app multiple tools, or route questions between multiple datastores. An agent uses the LLM to choose a sequence of actions to take to solve a problem.

The Agent can consist of some key components:

The Agent Core: This is the central component of the agent responsible for making decisions. It is powered by the LLM and a prompt that includes the agent’s personality, background context, and prompting strategies.
Tools and Toolkits: Tools are functionalities that the agent can access and utilize to perform specific tasks. Toolkits are collections of related tools that work together to accomplish specific objectives. There are two important design considerations around tools: giving the agent access to the right tools and describing the tools in a way that is most helpful to the agent.

There are some types of agents we should first start with:

Conclusion

I suggest reading through all of the methods above and the other RAG strategies from OpenAI, then picking the ones most relevant to your use case. You can also combine multiple approaches to get the best result. For example, the first architecture can be turned into the one below.

If you find this article useful, please give it a clap and share it with your friends. Also, kindly check out the generative_ai repo, which contains some generative AI techniques, and follow me on LinkedIn.

Thanks for reading!

Book Review: Simplifying Data Engineering and Analytics with Delta

Tam Nguyen — Sun, 24 Jul 2022 04:16:49 GMT

Building a data engineering and analytics platform is hard. It contains not only OLAP systems and data pipelines, but also many other aspects that we need to care about, such as data modeling, data governance, data observability, data quality and data lineage.

We may also get confused when managing both a data warehouse and a data lake; at this point I tend toward the Lakehouse architecture. I wrote articles about building Delta Lake data pipelines with code, but to build a comprehensive Lakehouse platform including all the data components mentioned above, I recommend reading the book “Simplifying Data Engineering and Analytics with Delta”.

Author: Anindita Mahapatra — a Solutions Architect at Databricks

What is the book about?

There are 3 sections in the book:

Section 1: Data Engineering Principles

It talks about some big data design patterns and best practices for data modeling. Delta as a file format is also introduced here, with its advantages over pure Parquet.

There are some useful theoretical data system designs that may help you in interviews 😸, such as Spark distributed processing, stages of data modeling, metadata management, data formats and Lakehouse characteristics.

Section 2: Building end-to-end Delta Pipelines

First, it covers the differences between Lambda and Kappa architectures, then introduces how Spark unifies batch and streaming with Delta, including Spark Structured Streaming code examples and concepts such as windows and watermarking.

Second, it walks through common data pattern scenarios such as handling CDC and SCD in Delta Lake with merge queries. It also explains how a Delta table works with the transaction log, commits and checkpoints, and how to optimize tables with Z-order and file size.

Third, this section gives useful tips for exploratory data analysis (EDA), such as data profiling, data drift, data anonymity, statistical analysis, class imbalance, data skew and the main types of joins.

Next come machine learning pipelines and the challenges of ML development. We are then introduced to the ML model, MLOps, and how Delta helps in feature engineering, model training, inferencing and monitoring pipelines.

The last topics in this section are the DaaS and DaaP (data as a product) concepts in the context of a data mesh, with some illustrations of Delta with unstructured data and Delta Sharing.

Section 3: Delta Pipeline Operation in Production

There are some factors that we need to care about in the production environment: scalability, high availability, and RTO and RPO in different DR strategies. This section shares how Delta helps in DR. Then it talks about guaranteeing data quality, with validation examples in DLT. It also mentions continuous training (CT) and continuous monitoring (CM), which are additional steps besides CI/CD in ML pipelines.

Performance and cost are also big concerns in a data platform, and Delta can optimize them through compaction, data skipping, Z-order, partition pruning, dynamic file pruning, optimum file size, Delta caching, etc.

Last but not least, provisioning a multi-tenant infrastructure for data democratization via policies and processes, capacity planning, monitoring and data sharing are things that we need to look at as DevOps/MLOps engineers.

Summary

This book is easy to read for all data practitioners. It shares real-world use cases, demonstrated with PySpark example code blocks and the latest features of Delta Lake. It is also well structured, moving from basic theoretical data principles to building, optimizing and maintaining practical data pipelines and machine learning pipelines.

You can preorder the book here.

Road to Lakehouse — Part 2: Ingest and process data from Kafka with CDC and Delta Lake’s CDF

Tam Nguyen — Sun, 23 Jan 2022 14:58:12 GMT

This is the second part of the Data Lakehouse and data pipelines implementation in Delta Lake. The source code GitHub repositories are at the end of this article. For the high-level pipeline architecture, please take a look at the first part.

Road to Lakehouse — Part 1: Delta Lake data pipeline overview

Delta Lake Data Pipeline

Raw Ingestion

I divide the Kafka data into 2 categories: event data, which comes from the backend application, and cdc data, which is generated by Debezium. Below is the main PySpark job.

https://medium.com/media/d33a7c08d092d7eb3f379a4fcce20599/href

Ingest data from Kafka

To ingest the data from Kafka, we just need to specify credentials and topic name.
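As a rough sketch of what that looks like with Spark Structured Streaming's Kafka source: the helper names are made up for this example, and it assumes SASL/PLAIN authentication over TLS on open-source Spark, which may differ from the author's cluster setup. The option keys themselves are the standard Kafka source options.

```python
def kafka_source_options(bootstrap_servers, topic, username=None, password=None,
                         starting_offsets="earliest"):
    """Build the option map for Spark's Kafka source."""
    options = {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }
    if username and password:
        # Assumption: the cluster uses SASL/PLAIN over TLS.
        options.update({
            "kafka.security.protocol": "SASL_SSL",
            "kafka.sasl.mechanism": "PLAIN",
            "kafka.sasl.jaas.config": (
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                f'username="{username}" password="{password}";'
            ),
        })
    return options


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame with the raw key/value/offset columns."""
    return spark.readStream.format("kafka").options(**options).load()
```

Keeping the options in a plain dict makes it easy to load the credentials from a secret store and to reuse the same job for every topic.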

https://medium.com/media/70555aff2641e3d4f9d8fe4beade7c2e/href

Process the Kafka data

We need to get the schema of the topic first.

https://medium.com/media/9d74c94fabfee184f21340516ce8675b/href

In the case of CDC, below is an example of a MongoDB payload schema.

https://medium.com/media/f6cfad2967194d73b20690dc4d7bed3e/href

Once we have the schema, it is easy to get the data as plain text.

https://medium.com/media/c7f545d43bf465761bf96436bac9d923/href

Load the processed data into a raw area

Finally, let’s write the stream to a Delta table in the raw zone. Spark stores Kafka offsets in the checkpoint location for failure recovery.

https://medium.com/media/a2c453470100cad596ee1b5371d770dd/href

Refined zone

While there are several ways to transform data between layers inside the Delta Lake, such as using dbt or Delta Live Tables, we can leverage a built-in property of Delta tables, the Change Data Feed (CDF), to merge changes into the next area in near real-time without any additional cost.

Below is the main PySpark job, which extracts changes from the raw table, then processes and loads them into the refined table.

https://medium.com/media/3a68102e7a29e64f28e75fe4f595c919/href

Read change feed from raw tables

https://medium.com/media/8634510f8eeeaa48e5927c0f24122854/href
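Reading the change feed can be sketched as follows. The readChangeFeed and startingVersion options and the _change_type and _commit_version columns are part of Delta Lake's CDF; the helper names, and the plain-Python apply_changes function that mimics how the changes land in the next table, are assumptions made for illustration.

```python
def cdf_read_options(starting_version=0):
    """Options that switch a Delta read into Change Data Feed mode."""
    return {"readChangeFeed": "true", "startingVersion": str(starting_version)}


def read_change_feed(spark, table_name, starting_version=0):
    """Stream row-level changes from a Delta table that has CDF enabled."""
    return (
        spark.readStream.format("delta")
        .options(**cdf_read_options(starting_version))
        .table(table_name)
    )


def apply_changes(target, changes, key="id"):
    """Apply CDF rows (as dicts) to a dict keyed by `key`, oldest commit first."""
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        change_type = row["_change_type"]
        if change_type == "update_preimage":
            continue  # the old value of an update; only the postimage matters
        if change_type == "delete":
            target.pop(row[key], None)
        else:  # insert or update_postimage
            target[row[key]] = {k: v for k, v in row.items() if not k.startswith("_")}
    return target
```

In a real job the same logic is usually expressed as a Delta merge inside a foreachBatch function rather than in plain Python.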

Process the CDF

We can do some transformations, such as flattening nested JSON or exploding an array, before loading the data into the refined layer.
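To make the two transformations concrete, here they are on plain Python dictionaries. This is only an illustration of the row-level effect; in the actual pipeline the same result would come from Spark column expressions, and the function names and the underscore separator are assumptions.

```python
def flatten(record, parent_key="", sep="_"):
    """Turn nested dicts into a single level: {"a": {"b": 1}} -> {"a_b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def explode(record, array_key):
    """Emit one row per array element; rows with an empty array are dropped."""
    return [{**record, array_key: item} for item in (record.get(array_key) or [])]
```

Flattening keeps one row per input record, while exploding multiplies rows, so the order in which the two are applied affects the shape of the refined table.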

https://medium.com/media/ef2d76accbd1a47f50b0298131b9451c/href

Conclusion

The streaming data pipeline above is a good starting point for building further business-level tables. In a real project you might need more complicated operations such as upserting or joining multiple tables, but the general idea of using Spark Structured Streaming and CDF is still valid. There are many more things to do with Delta Lake, such as using the SQLAlchemy ORM to query data, visualizing data with Streamlit, and building ML workflows with Databricks Feature Store, AutoML and MLflow. Hopefully we can discuss them in the next article.

Source code

ML workflow with Airflow, MLflow and SageMaker

Tam Nguyen — Thu, 26 Aug 2021 03:18:02 GMT

In MLOps, there are many tools to support the management of data, workflows, models, etc.

We can start with a simple ML workflow using the following platforms:

Airflow for the run orchestration.
MLflow for the experiment tracking and organization.
SageMaker for job training, hyperparameter tuning, model serving and production monitoring.

ML workflow (Image by author)

As for the Airflow and MLflow setups, we can deploy them on any infrastructure (K8s, ECS, etc.) with metadata stored in RDS.

We will use Airflow as a scheduler, so we don’t need a complex worker architecture; all the computation jobs will be handled by SageMaker and other AWS services.

MLflow provides 4 main features related to the ML lifecycle: a central model registry, model deployment, project code management and experiment tracking.

The workflow starts by determining what is observed, what should be predicted, and how performance and error metrics need to be optimized.

The business problem is then framed as a machine learning problem, followed by these steps:

1. Data acquisition: ingesting data from sources, including data collection, data integration and data quality checking.
2. Data pre-processing: handling missing data, outliers, long tails, etc.
3. Feature engineering: running experiments with different features; adding, removing and changing features.
4. Data transformation: standardizing data and converting it into a format compatible with the training algorithms.
5. Job training: the training parameters, metrics, etc. are tracked in MLflow. We can also run SageMaker Hyperparameter Optimization with many training jobs, then search the metrics and params in MLflow to compare them and find the best version of a model with minimal effort.
6. Model evaluation: analyzing model performance based on predicted results on test data.
7. If business goals are met, the model is registered in SageMaker Inference Models. We can also register the model in MLflow.
8. Getting predictions in any of the following ways:

Using SageMaker Batch Transform to get predictions for an entire dataset.
Setting up a persistent endpoint to get one prediction at a time using SageMaker Inference Endpoints.

9. Monitoring and debugging the workflow, and re-training with data augmentation.

For data processing, feature engineering and model evaluation, we can use several AWS services:

EMR: provides a Hadoop ecosystem cluster with pre-installed Spark, Flink, etc. We should use a transient cluster to process the data and terminate it when everything is done.
Glue job: provides serverless Apache Spark and Python environments. Glue has supported Spark 3.1 since August 2021.
SageMaker Processing jobs: run in containers, with many prebuilt images supporting data science. They also support Spark 3.

Data accessing

All data stored in S3 can be queried via Athena with metadata from Glue data catalog.
We can also ingest the data into SageMaker Feature Store in batches directly to the offline store.

Sample workflow

Dataset: Kaggle Retail Data Analytics
For full stages, please refer to this GitHub repo

Training and hyperparameter tuning jobs

AWS SageMaker can be cost-effective when training on EC2 spot instances.
To log the training parameters and metrics in MLflow, we should use the SageMaker script mode with the sample training script below.

https://medium.com/media/bf05f62024a66c97166a7585e42289c6/href

And we can specify the SageMaker estimator and hyperparameter ranges for the tuning jobs.

https://medium.com/media/e6f58f26d6becbb5d4b3192db59e4b9d/href

Then let’s take a look at different models’ metrics with different parameters in the MLflow.

MLflow experiment tracking (Image by author)

Conclusion

This basic ML workflow can be a good starting point; we can then expand it by adding other pieces such as data versioning, a CI/CD process, an online feature store, etc.

For more details, please take a look at the GitHub repo below. All constructive comments are welcome!

References

GitHub - tam159/mlops: ML Ops: Machine Learning Operations