GEN AI @ Haleon

Challenges in implementing Gen AI products

And the alternative solutions you can try

Dr. Varshita Sher
Trusted Data Science @ Haleon

Disclaimer: This article assumes familiarity with basic Gen AI concepts and terminology.

A robot working on a laptop (image by Alexandra_Koch on Pixabay)

Haleon, as a leading healthcare company, focuses on Generative AI due to the immense value it brings in processing and understanding large volumes of textual data. With vast amounts of feedback from people discussing the products we sell on social media and a plethora of internal company documents, generative AI plays a pivotal role. It enables us to efficiently analyze this data, extract meaningful insights, and enhance decision-making processes. As we build these Gen AI-driven products, we adhere to responsible AI guidelines to ensure safe and fair use, though we face certain challenges that are discussed below.

Introduction

The proliferation of orchestration libraries like LangChain, CrewAI, and LlamaIndex has empowered data scientists to swiftly develop a working proof of concept (POC) that demonstrates the capabilities of complex foundation models and data pipelines. This enables faster iterations and more efficient validation of ideas. For example, you can create a simple pipeline to classify tweet sentiment or set up a RAG (Retrieval-Augmented Generation) pipeline to review content against policy documents.
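
To make this concrete, here is a minimal sketch of the kind of POC pipeline described above: a tweet-sentiment classifier built with LangChain's expression language. The model name and prompt are illustrative, and it assumes the langchain-openai package is installed and OPENAI_API_KEY is set.

```python
# A minimal sketch of a POC-style sentiment pipeline using LangChain's LCEL syntax.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "Classify the sentiment of the tweet as positive, negative or neutral. "
               "Reply with a single word."),
    ("user", "{tweet}"),
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # illustrative model name

# Prompt -> model -> plain-string output, chained with the pipe operator.
sentiment_chain = prompt | llm | StrOutputParser()

print(sentiment_chain.invoke({"tweet": "Loving the new toothpaste flavor!"}))
```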

However, in most cases, the post-POC phase encounters unforeseen challenges that hinder the transition from POC to a fully developed data product. Let’s examine some of these challenges and explore potential solutions for each.

1. The Gen AI solution is very costly

Even with the advent of open-source models like DBRX and Mixtral, which are relatively cheap compared to closed-source ones like the GPT family, some tasks require a significant number of tokens, which can lead to substantially higher costs even with the cheaper models. It is therefore advisable to focus on minimizing token usage regardless of the model's price.

A straightforward solution is to enlist a prompt engineering expert to optimize the system prompt. You might be surprised at how some of the newer LLMs can operate effectively with minimal guidance and fewer few-shot examples in the prompts.

Additionally, consider whether parts of your pipeline can benefit from a strategic selection of LLMs. For instance, if you’re handling a simpler instruction task, you could use an SLM (Small Language Model) such as Microsoft’s Phi. For more complex tasks like Agentic RAGs, it’s advisable to use more advanced LLMs like Claude or GPT-4.
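
As a rough illustration of this routing idea, the sketch below maps task types to models of different capability. The model names and the call_llm() helper are placeholders for whichever provider or deployment you actually use.

```python
# A minimal sketch of routing tasks to differently priced models by complexity.
MODEL_BY_TASK = {
    "simple_instruction": "phi-3-mini",  # cheap SLM for easy, well-scoped tasks
    "agentic_rag":        "gpt-4",       # capable LLM for multi-step reasoning
}

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: swap in your provider's SDK call here."""
    raise NotImplementedError

def run_task(task_type: str, prompt: str) -> str:
    # Fall back to the capable model if the task type is unknown.
    model = MODEL_BY_TASK.get(task_type, "gpt-4")
    return call_llm(model, prompt)
```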

You can also consider LLM response caching, where the responses to frequent or repetitive prompts are stored temporarily and reused. This caching mechanism can significantly reduce cost and latency, and improve overall system performance, by serving previously generated responses instead of making a fresh request each time.
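
A hand-rolled cache can be as simple as the sketch below, which keys responses on the model and prompt. call_llm() is again a placeholder, and orchestration libraries such as LangChain ship their own cache integrations if you prefer not to roll your own.

```python
# A minimal sketch of in-memory LLM response caching keyed on (model, prompt).
import hashlib

_cache: dict[str, str] = {}

def cached_llm_call(model: str, prompt: str) -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(model, prompt)  # only pay for a fresh call on a miss
    return _cache[key]
```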

If you are implementing RAG, ensure you cap the number and size of relevant documents appropriately, as all these need to be processed by the LLM. Avoid using excessively large chunks or overlapping chunks. Opt for vector databases that allow metadata filtering and Exact Nearest Neighbour (ENN) to pre-filter the results and ensure relevancy. ENN guarantees retrieval of the absolute closest vectors to your query, eliminating the accuracy limitations inherent in Approximate Nearest Neighbours (ANN).
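
As an illustration, here is a minimal sketch of a capped, metadata-filtered retriever using Chroma via LangChain. It assumes the langchain-chroma and langchain-openai packages; the collection name, filter field, and value of k are illustrative, and whether exact or approximate search is used is ultimately a property of how your vector database and index are configured.

```python
# A minimal sketch of capping retrieval and pre-filtering by metadata.
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(
    collection_name="policy_docs",            # hypothetical collection
    embedding_function=OpenAIEmbeddings(),
)

retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,                                      # hard cap on chunks sent to the LLM
        "filter": {"doc_type": "marketing_policy"},  # metadata pre-filter
    }
)
docs = retriever.invoke("Can we claim clinical efficacy in a tweet?")
```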

Additionally, if you are using an orchestration library like LangChain, be mindful of the multiple LLM calls made under the hood to achieve the end result. For instance, dissecting LangChain's RetrievalQA chain reveals that it delegates to a combine-documents chain: a StuffDocumentsChain by default, or a MapReduceDocumentsChain if you select the map_reduce chain type. These chains use the same model supplied during initialization and ship with fairly long default system prompts, and the map_reduce variant makes an additional LLM call for every retrieved chunk before combining the intermediate answers. Consequently, the token count can be significantly higher than if you implemented RAG from scratch.

It can therefore pay to choose a more capable (and expensive) model for the per-chunk map step, which has to extract relevant answers from the retrieved chunks (a complex task), and a cheaper model for the final combine step, which only needs to merge the intermediate answers into a comprehensive response (a simpler task).
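
A sketch of this idea using LangChain's legacy chain classes is shown below; these constructors have changed across LangChain releases, so treat it as an outline and check the version you have installed. The model names, prompts, and the retrieved_docs variable are illustrative.

```python
# A minimal sketch of a map_reduce-style QA chain with two differently priced models.
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

map_llm = ChatOpenAI(model="gpt-4", temperature=0)            # complex: extract answers per chunk
combine_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # simple: merge intermediate answers

map_prompt = PromptTemplate.from_template(
    "Using the context below, answer the question.\nContext: {context}\nQuestion: {question}"
)
combine_prompt = PromptTemplate.from_template(
    "Combine these partial answers into one final answer.\nPartial answers: {context}\nQuestion: {question}"
)

map_chain = LLMChain(llm=map_llm, prompt=map_prompt)
combine_docs_chain = StuffDocumentsChain(
    llm_chain=LLMChain(llm=combine_llm, prompt=combine_prompt),
    document_variable_name="context",
)
chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=ReduceDocumentsChain(combine_documents_chain=combine_docs_chain),
    document_variable_name="context",
)
# answer = chain.invoke({"input_documents": retrieved_docs, "question": "..."})
```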

Note: If you are sending numerous few-shot examples as part of few-shot prompting, it might indicate that you need to consider fine-tuning. This can help you avoid sending these examples with every single API call. For instance, if you have many guidelines and examples for classifying tweets based on your company’s marketing policy, fine-tuning the model would allow you to only send the sentence to be classified, eliminating the overhead of including examples each time.
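
For illustration, the sketch below converts labelled examples into training records in the OpenAI chat fine-tuning JSONL format (other providers expect broadly similar structures); the guideline text and labels are made up.

```python
# A minimal sketch of turning few-shot examples into fine-tuning records.
import json

examples = [
    {"tweet": "Try our new toothpaste, now 50% off!", "label": "promotional"},
    {"tweet": "Brushing twice a day keeps cavities away.", "label": "educational"},
]

with open("finetune_train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Classify the tweet per our marketing policy."},
                {"role": "user", "content": ex["tweet"]},
                {"role": "assistant", "content": ex["label"]},
            ]
        }
        f.write(json.dumps(record) + "\n")

# After fine-tuning on this file, each API call only needs the tweet itself,
# not the full guidelines and examples.
```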

Be aware that fine-tuning introduces additional overhead costs, including hosting the fine-tuned model, and expenses related to GPUs for training and inference. This option should be carefully evaluated before committing to ensure it aligns with your cost and performance requirements.

2. The Gen AI solution is very slow

Yes, the POC you built showcased the solution, but no one likes waiting more than a couple of seconds for the response to come back from an LLM API call.

So what can you do?

To improve response times, consider parallelizing your code. For example, if your goal is to summarize 10 documents, you can run those calls concurrently because each API call is independent of the others. Python's asyncio module can help you achieve this, and some orchestration libraries offer built-in support for parallel or batch execution.
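
Here is a minimal sketch of that pattern using asyncio with the OpenAI Python SDK; the model name and documents are illustrative, and an OPENAI_API_KEY is assumed to be set.

```python
# A minimal sketch of summarizing documents concurrently with asyncio.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def summarize(doc: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": f"Summarize the following document:\n\n{doc}"}],
    )
    return response.choices[0].message.content

async def summarize_all(docs: list[str]) -> list[str]:
    # Each API call is independent, so fire them all off and await them together.
    return await asyncio.gather(*(summarize(d) for d in docs))

summaries = asyncio.run(summarize_all(["doc one ...", "doc two ...", "doc three ..."]))
```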

If you are using a Pay-As-You-Go model with your LLM provider, remember that your Tokens Per Minute (TPM) quota is served from capacity shared with other customers on the platform rather than reserved for you alone. This can impact your throughput during peak periods, potentially leading to longer wait times or, in the worst case, Error Code 429, which indicates too many requests within a short timeframe. This can disrupt your workflow and cause delays in your project timeline. (Implementing exponential backoff logic can help manage the frequency of retry attempts, but it adds to overall latency, since the wait time grows with every consecutive retry.)
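
A common way to implement that backoff is with the tenacity library, as in the sketch below; the model name and retry bounds are illustrative, and the OpenAI SDK's RateLimitError stands in for whatever 429-style exception your provider raises.

```python
# A minimal sketch of exponential backoff on rate-limit (HTTP 429) errors.
from openai import OpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = OpenAI()

@retry(
    retry=retry_if_exception_type(RateLimitError),        # only retry rate-limit errors
    wait=wait_exponential(multiplier=1, min=2, max=60),   # 2s, 4s, 8s, ... capped at 60s
    stop=stop_after_attempt(6),                           # give up after six attempts
)
def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```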

To mitigate these issues, consider upgrading to paid PTUs (Provisioned Throughput Units) that offer guaranteed rate limits. This allows you to optimize your codebase to maximize use of the context window and TPMs. Additionally, explore load balancing strategies to distribute requests across multiple instances, ensuring more consistent performance and reducing the likelihood of encountering rate limits or delays.

If speed is a critical concern, consider switching to quantized models. Quantization stores the model's weights at lower precision (for example 8-bit or 4-bit instead of 16- or 32-bit), so computations are simpler and demand less memory bandwidth and processing power, resulting in faster inference and quicker responses. The main trade-off is a potential slight decrease in accuracy, which in most cases is negligible.
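
As an example, the sketch below loads a 4-bit quantized open-source model with Hugging Face transformers and bitsandbytes; the model name is illustrative, and a CUDA-capable GPU with the bitsandbytes package installed is assumed.

```python
# A minimal sketch of loading and querying a 4-bit quantized model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed and stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Summarize: customers love the new mouthwash.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```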

3. The Gen AI solution is hallucinating a lot

One of the significant challenges observed with LLMs is their tendency to hallucinate. In some of the latest models, this hallucination isn’t related to answer accuracy or relevance but rather concerns adherence to formatting instructions. LLMs often struggle to consistently produce output in the requested format.

For example, when requesting a response in JSON format, the model might comply correctly only 80% of the time. The remaining 20% poses a risk of breaking the code if proper error handling isn’t implemented.

A straightforward approach is to retry each time the formatting instructions are not followed. However, this can result in considerable token wastage and increased latency.

Alternatively, a more involved but effective solution is to manually inspect the deviant 20% of responses and catalog the ways in which they break the format. Custom parsers can then be developed to handle these discrepancies, reducing the need for repeated retries and conserving tokens.
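
A lightweight version of this idea is sketched below: a lenient parser that salvages a JSON object from imperfect output, wrapped in a capped retry so the occasional slip neither breaks the pipeline nor burns many tokens. call_llm() is the same placeholder used in the earlier sketches.

```python
# A minimal sketch of lenient JSON parsing with a capped retry.
import json
import re

def parse_json_response(text: str) -> dict:
    """Try strict parsing first, then salvage the first {...} block in the text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)  # e.g. strip ```json fences or chatter
        if match:
            return json.loads(match.group(0))
        raise

def get_structured_answer(prompt: str, max_retries: int = 2) -> dict:
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return parse_json_response(call_llm("gpt-4o-mini", prompt))
        except ValueError as err:       # json.JSONDecodeError is a ValueError
            last_error = err            # only retry when even the lenient parse fails
    raise last_error
```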

4. How do I trust the outputs from the Gen AI products?

This issue lacks a straightforward solution. Even after trying various approaches in development and user acceptance testing (UAT), issues may still arise in production. However, this doesn’t negate the importance of conducting evaluations. It’s crucial to begin planning how to evaluate LLM outputs specific to your use case — whether it’s translation, Q&A, classification, etc. — from the outset and research relevant metrics to quantify results (e.g., METEOR for translations, accuracy for classification).

One approach is to utilize LLM-based evaluators such as QAG metrics for RAG-based solutions. These are easy to set up and do not require a ground-truth response, but they can be resource-intensive. Alternatively, a more traditional method involves gathering a dataset of questions and their ground-truth responses and scoring the generated answers against them on metrics like relevance and coherence.
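
To illustrate the LLM-based route, the sketch below implements a crude QAG-style faithfulness check for a RAG answer: break the answer into claims, ask an LLM whether each claim is supported by the retrieved context, and report the supported fraction. The prompts are illustrative, call_llm() is the placeholder from the earlier sketches, and dedicated libraries such as deepeval or ragas provide far more rigorous implementations.

```python
# A minimal sketch of a QAG-style faithfulness score for a RAG answer.
def faithfulness_score(answer: str, context: str) -> float:
    # Step 1: extract the claims made in the answer.
    claims = [c.strip() for c in call_llm(
        "gpt-4o-mini",
        f"List the factual claims in the text below, one per line.\n\n{answer}",
    ).splitlines() if c.strip()]

    if not claims:
        return 0.0

    # Step 2: check each claim against the retrieved context with a yes/no verdict.
    supported = 0
    for claim in claims:
        verdict = call_llm(
            "gpt-4o-mini",
            f"Context:\n{context}\n\nIs the following claim supported by the context? "
            f"Answer strictly 'yes' or 'no'.\nClaim: {claim}",
        )
        supported += verdict.strip().lower().startswith("yes")

    return supported / len(claims)
```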

For the most rigorous evaluation, consider manual UAT-based testing with human-in-the-loop, which remains essential despite its time constraints.

Hopefully, being aware of these challenges in advance helps you get a head start and avoid potential pitfalls. Generative AI is a nascent field, with rapid advancements and ongoing updates shaping its landscape, and some of the challenges discussed today may become obsolete in the near future. Until then, it is essential to stay informed, adapt strategies accordingly, and continually innovate to harness the full potential of this evolving technology.
