Best practices for LLM optimization for call and message compliance: prompt engineering, RAG, and fine-tuning

From 80% to 98%: how we enhanced LLM model accuracy for compliance in medical marketing and sales calls

Simon Greenman
Jun 25, 2024 · 25 min read

(Alec Coyle-Nicolas, Simon Greenman, Salus AI)


Executive Summary

Large Language Models (LLMs) are increasingly used to ensure compliance with regulatory standards, legal contracts, product and service standards, and internal guidelines. These models analyze vast amounts of unstructured data, including voice calls, messages, and emails, to assess adherence to compliance mandates. In call center compliance, LLMs are being tested for their effectiveness at monitoring conversations to ensure agents follow the appropriate rules.

Many engineers start with standard prompt design and engineering for compliance use cases, which often leads to a performance plateau at around 80% accuracy. To surpass this ceiling, an industry-wide playbook of optimization techniques is emerging. At Salus AI, we explored five key techniques to enhance model performance and share our findings in this post. We experimented with:

  1. Prompt engineering: optimizing the number of questions per prompt and the size of the context window
  2. Prompt design: ensuring the appropriate directness, or indirectness, of language in the prompts
  3. Pre-processing of input text: implementing speaker diarization and labelling in call transcripts to give clearer signals to the models
  4. Retrieval-Augmented Generation (RAG): breaking text into smaller chunks to optimize queries and costs
  5. Fine-tuning models on labelled datasets containing examples of compliant and non-compliant conversations

These techniques were tested over a three-month project aimed at optimizing LLM performance for compliance of marketing calls for premium health screening services. We analyzed thousands of anonymized calls where agents spoke to prospective consumers who had opted in to receive information about screening services and were informed that calls were recorded. These calls are highly regulated, requiring agents to adhere to strict guidelines, such as not claiming a test is free, notifying consumers that the call is recorded, asking essential questions about power of attorney and medical history, and confirming the consumer has not already taken the test. Violations can result in severe penalties, including jail time in extreme cases.

Our findings show that LLMs can be highly effective in ensuring compliance in premium health screening services marketing calls. We improved the initial model performance from 80% to 95–100% on compliance tasks through prompt design, RAG, and fine-tuning. This approach is as effective, if not more so, than traditional rule-based call analytics compliance and monitoring solutions. The graph below illustrates the improvement in compliance question results, showing the transition from initial prompt usage (green) to final performance (yellow).

Overall we found the following about using the different techniques:

  • Prompt Design and Engineering: Well-designed prompt language led to a performance improvement ranging from 9 to 68 percentage points, marking the most significant impact among all techniques tested. Prompts that required language models to infer or deduce meaning often resulted in poor performance. For example, changing the prompt from “Does the call agent suggest the test is a necessity?” to a more precise “Does the call agent tell the consumer the test is required?” improved accuracy from 69% to 99%. Additionally, focusing on a single question per prompt added an extra 1 to 15 percentage points in performance. Finally, adopting LangChain provided a tool to manage the increased complexity of multiple prompts, multiple questions, and variations.
  • Pre-processing Transcript Text: Speaker separation / diarization and labeling yielded mixed results depending on the complexity of the prompts. Simple prompts saw an accuracy gain of up to 9 percentage points, while complex prompts sometimes experienced a decline. Speaker labeling appears to shift the “cognitive load” from dialogue comprehension to speaker identity, resulting in more cautious classification by the language model. The addition of filtering out irrelevant speakers tended to have either a small positive impact or no impact at all.
  • Retrieval-Augmented Generation (RAG): For the most part, RAG had minor additional effects on most prompts’ performance, likely due to diminishing returns after significant improvements had already been made. However, it added over 50 percentage points in accuracy for few-shot prompts where examples of answers were given. The most notable impact of RAG was a cost reduction of up to 80%, as the number of calls to paid language model APIs was significantly reduced. The size of document chunks was identified as the RAG parameter with the largest effect on accuracy.
  • Fine-Tuning: Fine-tuning the model did not result in improved performance. This outcome was likely influenced by the unbalanced training dataset of 250 labeled input/output pairs, with only approximately 3% containing instances where the answer to the metric question was ‘true.’ With more time, creating synthetic data to ensure the model can adequately learn about the minority ‘true’ class could be beneficial.

Technique #1: Optimizing the number of prompts and use of context window

Action: prompt with 24 questions in a single user instruction

The initial stage of the project consisted of a series of experimental trials to set a baseline as to the efficacy of various prompting strategies on model performance for compliance-related questions. This initial exploratory approach closely mimicked the Naïve BatchPrompt strategy pioneered at Microsoft[1].

We combined the transcripts of the marketing calls and the compliance questions into a user message, which was fed to the LLM together with a set of instructions in the system message (see diagram below). In this case we were using OpenAI’s GPT-3.5.

The transcript and the questions are combined into a user message. Then this user message along with the set of instructions (system message) are provided to the LLM which then outputs a series of answers.

A range of language and prompting techniques was used, including questions that required inference and questions that did not, as well as few-shot prompts — utilizing minimal transcript excerpts due to the constraints imposed by context length — and in-line exemplars of target phrases. The answers were then parsed and recorded for comparison against the ground-truth data to measure the accuracy of the LLM.

We started with 24 compliance questions presented in one user instruction.
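To make this concrete, the sketch below shows what such a batched prompt might look like with the OpenAI Python client. The system message wording and the three questions shown are an illustrative subset rather than our production prompts, and parsing of the returned answers is omitted.

```python
# Minimal sketch of the naive batch-prompt approach (illustrative wording only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_MESSAGE = (
    "You are a compliance analyst. Answer each numbered question about the call "
    "transcript with 'true' or 'false', one answer per line."
)

# Illustrative subset; the actual prompt contained all 24 compliance questions.
QUESTIONS = [
    "Did the call agent say that the test was free?",
    "Did the agent state or insinuate that the test was required or mandatory?",
    "Did the agent claim that the test was a benefit of the consumer's insurance?",
]

def score_transcript(transcript: str) -> str:
    user_message = (
        f"Transcript:\n{transcript}\n\nQuestions:\n"
        + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(QUESTIONS))
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content  # one true/false per line, to be parsed
```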

Findings: Don’t bother with putting that many questions in a single prompt. Results are just too poor.

Action: reduce context length by cutting the initial set of 24 questions in one prompt down to eight questions

We tried variations of the following eight questions in one prompt:

  1. Did the call agent say that the test was free?
  2. Did the agent ask the consumer if they make their own medical decisions?
  3. Did the consumer state that they do not make their own medical decisions or answer a question from the call agent that would insinuate that they do not?
  4. Did the agent ask the consumer if they have taken a premium health screening services test before?
  5. Did the consumer state that they have taken this specific test before?
  6. Did the agent state or insinuate that the test was required or mandatory?
  7. Did the agent state or imply affiliation with the screening service provider, testing lab, the doctors, or insurance/medicare?
  8. Did the agent claim that the test was a benefit of the consumer’s insurance?

Findings: A prompt with a smaller question set, such as an 8-question prompt, produced more reliable outputs but exhibited a wide range of accuracy, varying from 35% to 90% per question.

This initial set of experiments highlighted the inherent complexity of translating complex compliance concerns into a series of questions that both accurately capture the essence of the concerns and result in accurate outputs from the LLM, as seen in the results below:

Action: try single-question prompts

Observing the increase in accuracy when reducing the size of the question set, and aiming to establish more experimental control, we shifted our focus to testing a single question per prompt. This approach is also supported by literature, which indicates that LLM performance varies significantly based on the position or order of the data — in this case, the questions. Additionally, LLMs generally perform poorly when required to process a large input context.

Action: Adopt LangChain Library to manage the complexity of prompting

To manage the volume of prompts resulting from this shift, we adopted the LangChain library. LangChain provides a set of abstractions that allow researchers and developers to build “chains” — sequences of operations through which a piece of text is processed when interacting with a language model. These chains can involve various steps, from formatting the input to processing the output (response parsing), and can be linked together to handle more sophisticated tasks, as illustrated in the diagram below.
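As a rough illustration, a single-question chain built in this way might look like the following. It is written in LangChain’s expression-language style and assumes the langchain-openai package; the prompt wording is illustrative rather than the exact prompts we used.

```python
# Sketch of a single-question compliance chain using LangChain (LCEL style).
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a compliance analyst. Answer only 'true' or 'false'."),
    ("human", "Transcript:\n{transcript}\n\nQuestion: {question}"),
])

# Chain: format the input -> call the model -> parse the raw text response.
chain = prompt | llm | StrOutputParser()

answer = chain.invoke({
    "transcript": "SPEAKER A: Hello, this call is being recorded...\nSPEAKER B: Okay.",
    "question": "Did the call agent say that the test was free?",
})
```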

The utilization of LangChain coupled with the addition of custom experiment tracking tools facilitated the ability to iteratively analyze LLM outputs for different prompts, identify their strengths and weaknesses, and test new prompts. Analysis was done on outputs that were:

  • False Negatives — to identify language the LLM was failing to classify
  • False Positives — to identify language the LLM was incorrectly classifying
  • Correct Answers — to identify the language the LLM was correctly classifying

Findings: Limiting the prompt to a single question resulted in the best performance, with accuracy increases ranging from 1 to 15 percentage points. However, this increased costs due to the additional processing required.

The findings confirmed both the literature and the general logic behind the experiment. Reducing the prompt to focus on a single question resulted in consistent performance improvements as seen in the diagram below with multi-question prompts in green and single-question prompts in yellow:

Key Insight: Use low, but non-zero temperature for accurate but not overly deterministic results (~0.2)

The model’s hyper-parameter temperature, which controls creativity (higher values yield more adventurous responses, lower values more predictable ones), has proven to be crucial in balancing creativity with precision. Our initial findings suggest that a moderate temperature of around 0.2 creates an ideal environment for the model to produce both reasoned and structured outputs without becoming overly deterministic. To control as many variables as possible, this temperature was used for the remainder of the experiments.

Key Insight: Context length matters and to achieve the best results, use a single question per prompt

The length and complexity of the context fed into an LLM significantly influenced its interpretive performance. Initially, a set of 24 questions was used, but this format led to issues such as hallucinations, omissions, and format inconsistencies. The focus was then narrowed to eight questions. By simplifying the context and reducing the “cognitive load”, we saw an improvement in compliance with the desired output structure and an increase in both the completion rate and the thoroughness of the model’s responses. Finally, adopting LangChain and focusing on a single question per prompt resulted in further performance improvements.

Technique #2: Optimizing prompt design

Building on splitting prompts into single questions, we looked to optimize question prompt design.

Overall Findings: analysis and prompt design resulted in accuracy increasing by 9 to 51 percentage points across all questions

The graph below shows initial results in green with single-question prompting and how they improved with prompt design in yellow:

Key Insight: Prompts that require LLMs to infer or deduce meaning often leave too much room for interpretation, resulting in poor performance

When considering all the prompts tested in this phase, a couple of trends emerged. In many instances, performance was positively correlated with more explicit and specific prompting. For example, consider the two prompts below:

Upon analyzing the corpus of false positives generated by the LLM for prompt 1, it was revealed that the model interpreted the word “suggest” too liberally. It flagged any vaguely suggestive dialogue, such as “Taking this test would be beneficial for your family members, as it would notify them that they are more at risk of carrying this gene.” To address this, prompt 2 was designed to be more explicit, reducing the room for interpretation and subsequently leading to a significant increase in performance.

Key Insight: LLMs may over-index on specific words rather than grasping dialogue meaning.

However, this rule does not always apply as there are some types of dialogue that require more room for interpretation to successfully identify. In other words, the LLM might disproportionately focus on sentences containing a given word, even if those sentences are not relevant to the overall meaning of the dialogue. This can lead to inaccuracies because the model is not fully interpreting the broader context or intent behind the words. Consider the pair of prompts below:

Following the same methodology to improve the prompt, false positives for Prompt 1 were analyzed. It was revealed that some confusion might have arisen in processing the response to the question. Consider the scenario where a call agent asks, “Do you make your own medical decisions?” An affirmative response from the consumer should indicate autonomy, yet the LLM sometimes erroneously flagged this as the consumer stating they do not make their own decisions. This suggests the model may have been overly focused on the presence of decision-making language in the consumer’s response rather than understanding the context of the dialogue. Thus, Prompt 2 was designed to relax the decision-making verbiage and allow the LLM more room for interpretation.

Key Insight: There is no one-size-fits-all approach to prompt design. Developing an experimental framework that enables tracking model configurations and their outputs is imperative.

This inconsistency reveals that while explicitness often aids in reducing ambiguity, there are scenarios where a moderate level of inference enables the model to capture the intent and meaning of dialogue more accurately, rather than fixating on keywords. There is no one-size-fits-all approach to designing prompts for LLMs. The optimal prompt structure appears to be highly context-dependent, requiring a nuanced understanding of the interplay between the specificity of language and the necessity for interpretive leeway.
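In practice this meant recording each model configuration alongside its results. A minimal sketch of such a record is shown below; the field names are illustrative rather than our actual schema (our implementation sat on top of TruLens, described in the methodology section).

```python
# Illustrative record for tracking prompt experiments and their outcomes.
from dataclasses import dataclass

@dataclass
class ModelConfiguration:
    compliance_concern: str          # e.g. "agent claims the test is required"
    prompt_text: str                 # exact prompt wording under test
    temperature: float = 0.2
    speaker_labeling: bool = False   # pre-processing flags (Technique #3)
    speaker_filtering: bool = False
    rag_enabled: bool = False        # retrieval settings (Technique #4)

@dataclass
class ExperimentResult:
    config: ModelConfiguration
    correct: int = 0
    false_positives: int = 0
    false_negatives: int = 0

    @property
    def accuracy(self) -> float:
        total = self.correct + self.false_positives + self.false_negatives
        return self.correct / total if total else 0.0
```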

Technique #3: Pre-process call transcripts to separate speakers and identify their role

The anonymized call transcripts we received often lacked speaker separation. Even when speaker separation and diarization were present, the labels used were impersonal identifiers like SPEAKER A or SPEAKER B. This anonymity in speaker identification proved to be a stumbling block for language models, which occasionally misattributed statements to the wrong speakers. Since the core of our questions depended on accurately identifying the speakers, it was crucial to move beyond generic labeling.

Action: Developed a heuristic algorithm to replace generic speaker labels with specific roles, such as call agent and consumer

In response, we developed a heuristic-based algorithm after analyzing the structural patterns in the call transcripts. Additionally, we created a method to selectively filter out specific speakers from the transcripts. This allowed the LLMs to focus on the dialogue of key participants relevant to the compliance concern.

Consider, for example, the results of the heuristic algorithm, with the identified role in parentheses next to the generic label used in previous experiments:

SPEAKER A (call agent): Okay. Am I speaking to you on your cell phone or your landline?

SPEAKER B (consumer): I’m on a landline.

SPEAKER A (call agent): Okay. Can you verify that number for me? All right. Wonderful. Approximately how old were you when you were first diagnosed with anxiety disorder? Were you in your twenty s? Thirty s? Forty s. Fifty s in your 30s? Okay. Who in your family had dementia?

SPEAKER B (consumer): Mother.

SPEAKER A (call agent): Mother. Okay, got it. Are you taking any medications that are prescribed to you by a doctor? If so, how many medications are you taking?

SPEAKER B (consumer): One.

SPEAKER A (call agent): One. What is that medication for?

SPEAKER B (consumer): I’m diabetic and I take insulin.

The main rationale behind this development was the belief that tailoring the transcripts to include speaker labels and only the dialogue attributed to essential speakers would provide two key benefits: a leaner, more manageable transcript for the models to process (context reduction) and a richer, more relevant dataset free of superfluous distractions (context improvement).
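Our actual heuristics were tailored to the structural patterns we observed in these transcripts, but a simplified sketch of the idea is shown below. It assumes, purely for illustration, that the speaker who asks the most questions is the call agent; the real algorithm used additional signals.

```python
# Simplified, illustrative speaker-role labeler. The rule used here (the speaker
# who asks the most questions is the call agent) is an assumption for the sketch;
# the production heuristics relied on more structural signals than this.
from collections import Counter

def label_speaker_roles(lines: list[str]) -> list[str]:
    """lines look like: 'SPEAKER A: Am I speaking to you on your cell phone?'"""
    question_marks: Counter[str] = Counter()
    for line in lines:
        speaker, _, text = line.partition(":")
        question_marks[speaker.strip()] += text.count("?")

    agent = question_marks.most_common(1)[0][0] if question_marks else None
    labeled = []
    for line in lines:
        speaker, _, text = line.partition(":")
        role = "call agent" if speaker.strip() == agent else "consumer"
        labeled.append(f"{speaker.strip()} ({role}):{text}")
    return labeled
```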

Findings: Speaker labelling tended to have a positive impact on simple and explicit prompts but a negative impact on prompts that were more complex and/or required inference

Surprisingly, the benefit of specific speaker role labels was not universal, as evident in the graph below, which shows the accuracy of the best-performing prompt for each compliance metric.

Key Insight for Simple / Explicit Prompting: The data illustrates that when the query is straightforward, the LLM’s accuracy tends either to remain stable or to improve.

Consider the prompts below with the results for unlabeled and labeled speakers in the transcript. These relatively simple prompts improved performance by 1 to 9 percentage points with speaker labels (see below):

Key Insight with Complex / Implicit Prompting: complex prompts that require the LLM to infer intent or meaning from the dialogue tended to either maintain or experience a dip in performance upon the addition of speaker labels.

Consider the examples of the more implicit prompting below where accuracy drops.

While these two general trends emerge when analyzing the data holistically, it is important to highlight that there were exceptions. Consider for example the prompt below which does require a moderate level of inference, but experienced an increase in accuracy with the addition of speaker labelling:

Key Insight: Speaker labelling seems to shift “cognitive load” from dialogue comprehension to speaker identity, leading to more cautious classification by the LLM

Furthermore, every prompt with a decrease in accuracy saw a decrease in false positives which was offset by an even larger increase in false negatives. A potential explanation could be that the addition of speaker labels and the change in system messaging caused the LLM to over-emphasize speaker identity, shifting some allocation of its cognitive load from dialogue comprehension to attentiveness to speaker labels. This would enable it to avoid falsely attributing dialogue to the wrong speaker (reduction in false positives), but at the cost of shifting focus away from the semantic content of the dialogue and the verbal cues that meet the criteria of the prompt (increase in false negatives).

Key Insight into Speaker Filtering: Filtering out irrelevant speakers tended to have either a small positive impact or none at all

Analysis revealed that for questions that experienced a decline in performance (see diagram below), the answers were dependent on dialogue involving the consumer answering a question. The leading hypothesis for why this type of dialogue suffered is the imperfection of speaker diarization. Shorter utterances, such as single-word answers, are more difficult to attribute to the correct speaker, which leads to incorrect filtering and, ultimately, analytical errors.

Key Insight: Speaker filtering benefitted more straightforward prompts, but those that required comprehension of larger chunks of dialogue saw decreases in accuracy.

The introduction of speaker labelling and filtering has further demonstrated that LLMs may not always behave or perform as expected. The unexpected performance patterns observed in some metrics underscore the complexity of language understanding tasks and the need for a nuanced application of labelling, filtering, or any other pre-processing techniques tailored to the specific demands of each prompt.

Take for example the prompts below. The first one requires a higher comprehension level of a larger amount of dialogue, and the addition of labelling and filtering resulted in a decrease in accuracy. Conversely, the second prompt is more explicit and straightforward, requiring a lower level of dialogue comprehension and remains unaffected by the addition of speaker labeling or filtering.

Technique #4: Implement Retrieval-Augmented Generation (RAG) for large documents

Introduction

RAG, at its core, is a method that integrates the power of information retrieval with the generative prowess of language models. This technique involves first retrieving a set of documents relevant to a given query and then using those documents to inform the generation of a response.

Diverging from the traditional application of RAG models, which typically retrieve information from extensive external databases, the approach used was introspective, leveraging the content of individual call transcripts. By dividing a single transcript into various-sized chunks and embedding these segments into a vector store — essentially a temporary, call-specific database recreated for each analysis — the aim was to investigate how these models discern and utilize the most relevant information to answer True/False questions.

Text-Splitting

The first step in implementing RAG is breaking up the document(s). In this set of experiments the labelled and filtered transcript was split into smaller chunks, and only the most relevant parts for a given prompt question were provided to the LLM. Recognizing the unique nature of conversational dialogue, a custom text splitter was developed. This splitter operates on the principle of ‘speaker lines’, segmenting the text such that each chunk is a coherent set of utterances by a particular speaker.

This ensures the preservation of speaker labels and maintains the integrity of who is speaking, a vital context in our datasets. Furthermore, it ensures that compliance concerns that pivot on back-and-forth dialogue can capture just that, instead of relying on length or certain characters like some standard text splitters. The custom splitter can be tuned with parameters to control the number of speaker lines per chunk and the degree of overlap between these chunks, ensuring a seamless and contextually rich information flow.
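A simplified sketch of this kind of splitter is shown below; the parameter names are illustrative, and the production version handled more edge cases than this.

```python
# Illustrative 'speaker line' splitter: chunks are whole speaker turns, with a
# configurable number of lines per chunk and overlap between consecutive chunks.
def split_by_speaker_lines(transcript: str,
                           lines_per_chunk: int = 4,
                           overlap: int = 1) -> list[str]:
    lines = [line for line in transcript.splitlines() if line.strip()]
    step = max(lines_per_chunk - overlap, 1)
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + lines_per_chunk]
        if window:
            chunks.append("\n".join(window))
        if start + lines_per_chunk >= len(lines):
            break
    return chunks
```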

Embeddings

At a high level, the process of RAG can be likened to creating a map of ideas, where each concept or piece of information is a point in a high-dimensional space. These points, or embeddings, represent the essence and meaning of the text in this conceptual space, with proximity indicating similarity. For a simpler visualization of this process, we can imagine a 2-dimensional space, to which we map each chunk of our document.

A simplified visualization of how embedding works. An embedding model is a way to map a chunk of text onto a high-dimensional space. This is what it might look like if that space were just 2-D, where each chunk of text would be mapped onto a point in 2-D space. Each chunk is color-coded, and as you can see, the two chunks where Jim, Amanda, and the marketer are mentioned are embedded at similar points in space since they are more similar to each other than to the third, yellow chunk.
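For a concrete, if simplified, version of this idea, the snippet below embeds a few transcript chunks and compares them with cosine similarity. We assume the OpenAIEmbeddings wrapper here, but any sentence-embedding model would illustrate the same point; the example chunks are invented.

```python
# Embedding transcript chunks and comparing them by cosine similarity (illustrative).
import numpy as np
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings()

chunks = [
    "SPEAKER A (call agent): Do you make your own medical decisions?",
    "SPEAKER B (consumer): Yes, I make all of my own decisions.",
    "SPEAKER A (call agent): Who in your family had dementia?",
]
vectors = np.array(embedder.embed_documents(chunks))

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Chunks about the same topic land closer together in the embedding space.
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```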

Vector Stores

The placement of these points in the vector space is not arbitrary. Picture a vast, orderly library — not of books, but of information points. In the graphic we used for our presentation, this is depicted by transforming chunks of our original transcripts into points on a two-dimensional graph via an ‘embedding model’. Each transcript chunk, distinctively color-coded, is processed by the embedding model and emerges as a point on this graph. This graph is our vector store, a systematic and organized space where each point represents a unique piece of our data.

Retrievers

A retriever functions much like an expert librarian within this vast library of data points, there to help navigate and retrieve relevant information. Imagine you have a specific topic or question in mind — in a RAG system, this is your input query. In a traditional library, you would approach a librarian for assistance in finding books relevant to your query. Similarly, in our RAG system, the retriever takes on this role. When presented with a query, the retriever navigates through the organized space of the vector store, efficiently sifting through countless data points to find the most pertinent and useful information.

Continuing with the library analogy, let’s explore two key techniques used by a vectorstore retriever (a simple type of retriever used in these experiments): Maximal Marginal Relevance (MMR) and Similarity Search. These techniques are akin to the methods our expert librarian uses to select the best books (data points) in response to a query.

  • Similarity Search: This is exactly what it sounds like. In our analogy, this is like the librarian searching for books that are closest in content and context to your question. It ensures that the books (data points) retrieved are precisely aligned with the query, providing the most relevant and accurate information.
  • Maximal Marginal Relevance (MMR) Search: Imagine instead that you want to focus on a particular topic, but also want a diverse range of perspectives. MMR is a method used to strike a balance between relevance (how closely each book matches the query) and diversity (ensuring a variety of viewpoints). Think of MMR as the librarian’s skill in picking out books that cover different aspects of your query, preventing you from receiving redundant or overly similar information.

To highlight some of the findings from these experiments, we focus on experimenting with transcript chunks of both 2 and 4 ‘speaker lines’, using both MMR and similarity search to retrieve either 5 or 10 transcript chunks.

Continuing with our simplified visualization of an embedding model, we can view the vector store as a snapshot of the graph, mapping each point to the respective chunk of text. We can then visualize the concept of retrievers. The retriever essentially uses both the embedding model and the vector store to first embed the query and then find the ‘closest’ chunks of a document. In our example, the query contains the name Mary, so it might be close to both the blue and red chunks in this space, but it also includes ‘medical decisions’, so it is likely to be closer to the blue chunk.
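Putting the pieces together, a per-call retriever along these lines might be built as follows. We assume FAISS as the vector store purely for illustration; the parameters mirror the ones explored above (speaker lines per chunk, similarity vs. MMR search, and the number of chunks retrieved). The retrieved chunks, rather than the full transcript, are then passed to the single-question prompt.

```python
# Illustrative per-call vector store and retriever (FAISS assumed as the store).
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

def build_retriever(transcript: str,
                    lines_per_chunk: int = 4,   # 2 or 4 speaker lines per chunk
                    search_type: str = "mmr",   # "mmr" or "similarity"
                    k: int = 5):                # retrieve 5 or 10 chunks
    lines = [line for line in transcript.splitlines() if line.strip()]
    chunks = ["\n".join(lines[i:i + lines_per_chunk])
              for i in range(0, len(lines), lines_per_chunk)]
    store = FAISS.from_texts(chunks, OpenAIEmbeddings())  # rebuilt for every call
    return store.as_retriever(search_type=search_type, search_kwargs={"k": k})

retriever = build_retriever(open("labeled_transcript.txt").read())
relevant_chunks = retriever.invoke(
    "Did the consumer state that they do not make their own medical decisions?"
)
```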

Overall Findings: Our use of Retrieval-Augmented Generation (RAG) had for the most part a small effect on prompt performance, likely due to diminishing returns after significant model improvements prior to its use. However, we did reduce LLM compute costs by up to 80%.

As you can see in the diagram below, performance mostly shifted by 0 to 2 percentage points, both negatively and positively. One outlier was a 27% performance improvement, discussed below.

Key Insights: Prompts requiring inference and deduction, along with those containing embedded examples, significantly benefited from the RAG approach.

The largest performance gain came when few-shot examples were embedded in the prompt. In the graph above we can see a significant improvement on the question of whether the call agent asks if the consumer has taken the test before.

Additionally consider the below prompt which saw decreases in accuracy with the addition of pre-processing, followed by a small jump when utilizing RAG:

Key Insights: simpler and explicit prompts tended to experience a small decrease in performance when subjected to RAG

In contrast to previous experimental phases, simpler and explicit prompts tended to experience a small decrease in performance when subjected to RAG, indicating that the process may not universally benefit all types of queries.

Consider the prompts that decreased in performance:

  • Does the call agent ask if the consumer makes their own medical decision?
  • Did the consumer state that they do not make their own medical decisions?
  • Does the consumer indicate they’ve undergone the exact premium health screening services test in question?

The two things they all share are their simplicity and their decrease in accuracy following the implementation of RAG.

Key Insight: Embedded examples of dialogue within a prompt anchor and improve the retrieval process.

The more complex prompts, often including embedded examples, generally experienced a decline in accuracy with the addition of speaker labelling and/or filtering but subsequently enjoyed the largest benefit from the addition of RAG. The inclusion of example dialogue within prompts appears to anchor the retrieval process, aligning the query closely with relevant transcript excerpts in the vector space and enhancing accuracy.

Yet, simplicity has its own narrative; the simpler and more straightforward prompts, which had tended to either benefit or perform consistently through prior experimentation, experienced a reduction in performance when faced with RAG. This signals a potential disconnect between the bare-bones query and the richer context that RAG was designed to exploit. This dichotomy between complexity and simplicity underscores the necessity of a strategic balance in prompt design. It also lights the path for further experimentation, combining the stability and performance of simple prompts with the embedding and retrieval power of more complex prompts.

Key Insight: The RAG parameter with the largest effect on accuracy was the size of the document chunks.

The limited set of experiments revealed that across all prompts, both the number of documents (transcript chunks) retrieved and provided to the LLM and the search type used had minimal effect on the model’s accuracy:

However, the number of ‘speaker lines’ per document chunk did appear to have a more prominent effect on accuracy, suggesting that the granularity of information in each chunk could be an important factor in achieving higher accuracy:

This improvement likely stems from the richer context and detailed information larger chunks provide, which are crucial for the model to make informed decisions. Given that the vector store is re-initialized for each call, containing only segments from the current transcript, the retrieval process becomes highly specialized. It effectively identifies the most pertinent segments from a constrained, yet highly relevant dataset.

This methodology not only underscores the versatility and adaptability of RAG models but also illuminates a path forward for optimizing their application in contexts where precision and relevance of information are critical. By leveraging larger transcript chunks, we enable the model to access a broader and more nuanced understanding of the conversation, leading to a more accurate retrieval of information and, consequently, more precise answers to specific True/False questions.

Technique #5: Conduct fine-tuning on curated labelled datasets

At its core, fine-tuning involves taking a pre-trained model — a model that has been previously trained on a large dataset — and further training it on a smaller, specific dataset relevant to a particular task or domain. This process is aimed at adapting the general capabilities of the model to perform better on tasks that require more specialized knowledge or understanding.

Findings: Fine-tuning on a small, unbalanced dataset of approximately 250 input/output pairs leads to an over-fitted and deterministic model

Despite the careful preparation and hopes for improved performance, the fine-tuning experiment yielded disappointing results. The model’s responses were consistently incorrect, even when presented with inputs that were directly drawn from its training set.

A potential explanation for the poor performance was the distribution of ‘true’ answers within these transcripts. Only a small subset, approximately 3%, contained instances where the answer to our metric question was ‘true.’ This imbalance presented a skewed representation in which ‘true’ scenarios were vastly outnumbered by ‘false’ instances.

While a dataset that mirrors the true proportion of outcomes in real-world scenarios (as this dataset did) ensures that the model is trained under conditions that closely represent its expected operational environment, the model may also struggle to learn enough about the conditions under which the minority class (‘true’) occurs and become biased. To test this bias, the fine-tuned model was prompted in numerous ways to simply output ‘true’, yet it would always output ‘false’, seeming to confirm this hypothesis.

A potential future avenue of exploration would be to synthetically generate a larger and more balanced dataset. By ensuring equal or close to equal representation of all classes, this should make the model better equipped to learn the distinguishing features of both classes, mitigating the risks of becoming biased towards the majority class. However, this may also lead the model to develop a mismatch between its expectations and the actual frequencies of events in the real-world resulting in a model that performs poorly when deployed.
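A sketch of what assembling a more balanced training file could look like, in the chat-format JSONL that OpenAI fine-tuning expects, is shown below. The oversampling of the minority class stands in for whatever synthetic-generation process would actually be used; this is a direction we did not pursue in this project.

```python
# Illustrative assembly of a more balanced fine-tuning dataset (OpenAI chat JSONL).
# Oversampling the minority 'true' class stands in for a real synthetic-data step.
import json
import random

def to_record(transcript: str, question: str, answer: str) -> dict:
    return {"messages": [
        {"role": "system", "content": "Answer the compliance question with 'true' or 'false'."},
        {"role": "user", "content": f"Transcript:\n{transcript}\n\nQuestion: {question}"},
        {"role": "assistant", "content": answer},
    ]}

def write_balanced_jsonl(true_examples, false_examples, path="train_balanced.jsonl"):
    # true_examples / false_examples are lists of (transcript, question) pairs.
    needed = max(len(false_examples) - len(true_examples), 0)
    oversampled = list(true_examples) + random.choices(true_examples, k=needed)
    records = [to_record(t, q, "true") for t, q in oversampled] + \
              [to_record(t, q, "false") for t, q in false_examples]
    random.shuffle(records)
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# The resulting file would then be uploaded and used to start a fine-tuning job,
# e.g. client.files.create(file=open(path, "rb"), purpose="fine-tune") followed by
# client.fine_tuning.jobs.create(training_file=<file id>, model="gpt-3.5-turbo").
```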

Key Insight: For unbalanced datasets, synthetic data may be necessary to ensure that the model can learn enough about the minority class

The results of this fine-tuning phase underscore the challenges of enhancing LLM performance for highly nuanced tasks. It suggests that while fine-tuning can be a powerful tool, it is not infallible, especially when dealing with complex compliance concerns. The experience serves as a testament to the necessity of ongoing experimentation and the exploration of alternative methods to fully harness the potential of LLMs in specialized domains.

Conclusion: LLMs can be used effectively for regulatory compliance of marketing calls

We believe that large language models (LLMs) can be highly effective in ensuring premium health screening services marketing agent calls comply with regulations. By using prompt engineering, retrieval-augmented generation (RAG), and fine-tuning, we improved the initial LLM model performance from 80% to 95–100% on compliance tasks. This approach is as effective, if not more so, than traditional rule-based call analytics compliance solutions.

The techniques we utilized to optimize LLM model performance collectively make a significant difference. While RAG and fine-tuning might have had limited impact in our specific project, this is more a function of the project’s context and should not be seen as a negative for other projects. We also anticipate that new LLM models from companies like OpenAI, Anthropic, Google, and Mistral will significantly enhance the performance of compliance tasks.

From this project, we have learned that best practices include:

  • Collaborate with Domain Experts: Work closely with domain expert(s) to translate complex compliance issues into specific questions.
  • Develop Frameworks: Develop an experimental framework to record and analyze model outputs for each prompt.
  • Engineer Prompts: Design and engineer prompts carefully based on complexity (simple prompts vs. those allowing inference).
  • Utilize Pre-processing Techniques: Utilize pre-processing techniques such as speaker labeling and filtering with caution; while they may not always provide performance benefits, they do offer cleaner historical data.
  • Test RAG Strategies: Carefully test and select a Retrieval-Augmented Generation (RAG) strategy to balance cost savings and performance.
  • Decouple Prompts and RAG: Explore decoupling the prompt question from the RAG query to combine the stability and accuracy of simple prompts with the retrieval capabilities of complex prompts containing embedded examples.
  • Create Balanced Data for Fine-Tuning: For fine-tuning, create balanced synthetic data or explore alternative methods to ensure effectiveness.
  • Practice Continuous Experimentation: Continue targeted experiments to fully leverage Large Language Models (LLMs) for nuanced compliance tasks.

Authors and our Company

Alec Coyle-Nicolas, AI Engineer, Salus AI

Simon Greenman, Head of AI & CTO, Salus AI and Partner at Best Practice AI.

Salus AI is pioneering the use of AI to revolutionize regulatory compliance and oversight for industries like healthcare, finance, and consumer services.

Methodology and Background

We used OpenAI’s GPT-3.5 Turbo model for this project. We have no doubt that later models from multiple foundation model providers would improve performance, though the techniques and tactics would likely stay similar.

To establish a ground-truth dataset against which we could compare model outputs for each question on a given transcript, we utilized past transcript ‘scores’. Traditionally, a rules-based approach was used to score transcripts, which were then reviewed and corrected by a human domain expert. These corrected True/False scores were then pulled and formatted for comparison.

An open-source tool, TruLens, offers a straightforward way to track model configurations and outputs via a database and view them in a Streamlit application. It also offered a simple way to compare outputs to ground-truth data via feedback functions. Three binary feedback functions were created to compare the model output to the ground-truth data and track whether it was a correct answer, a false negative, or a false positive.
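Stripped of the TruLens wrapping, the three binary feedback checks amounted to something like the following; the function and variable names here are illustrative.

```python
# Illustrative versions of the three binary feedback checks used to compare the
# model's true/false output against the ground-truth label for a transcript.
def _says_true(model_output: str) -> bool:
    return "true" in model_output.strip().lower()

def is_correct(model_output: str, ground_truth: bool) -> bool:
    return _says_true(model_output) == ground_truth

def is_false_positive(model_output: str, ground_truth: bool) -> bool:
    return _says_true(model_output) and not ground_truth

def is_false_negative(model_output: str, ground_truth: bool) -> bool:
    return not _says_true(model_output) and ground_truth
```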

For this discussion, let us collectively refer to the prompt, pre-processing methods (i.e. speaker labeling and/or filtering), RAG parameters, etc. as a model configuration. Beginning with the TruLens library, the database schema and library were customized in the following ways to enable faster analysis and experimental iteration:

  1. Altered the schema to track the compliance concern that each model configuration aimed to address.
  2. Altered the primary leaderboard page to only show the top-performing model for each compliance concern.
  3. Added a secondary leaderboard that displayed all model configurations and their performance metrics for a specific compliance concern.
  4. Implemented GUI changes to extract and view relevant model configuration details (i.e. RAG parameters, pre-processing techniques used, etc.).

This greatly sped up the experimental process as it enabled automated experiment tracking and complete visibility into the parameters being tested. Each model configuration was tested using a set of 100 transcripts.

Commitment to Privacy and Informed Consent

In the preparation of this report, meticulous care was taken to ensure that all interactions analyzed were conducted under strict adherence to privacy norms and consent principles. Each call transcript included in our analysis has been thoroughly anonymized. This process involved the removal or modification of all personally identifiable information, ensuring that individual privacy is maintained. Names mentioned within these transcripts do not correspond to actual individuals; they have been replaced with fictional identifiers to further safeguard privacy and prevent any possible identification of the participants involved.

Furthermore, it is important to clarify that the collection of these call transcripts was predicated on informed consent. The entities responsible for the initial collection of these calls have affirmed that all participants were adequately informed about the recording process. They were made aware that their conversations could be recorded for analysis and were given assurances regarding the confidentiality and use of their data. Consent was obtained in compliance with legal standards, ensuring that participants were fully aware and in agreement with the terms under which their data would be used.

By adhering to these principles of anonymity and informed consent, we underscore our commitment to ethical research practices. We believe in the importance of protecting individual rights while enabling the insightful analysis of marketing interactions. This balance allows us to deliver valuable findings from our research, while maintaining a steadfast commitment to ethical standards and privacy protection.

References

[1] J. Lin, M. Diesendruck, L. Du, and R. Abraham, “BatchPrompt: Accomplish more with less.” arXiv, Sep. 05, 2023. doi: 10.48550/arXiv.2309.00384.

[2] R. Y. Pang et al., “QuALITY: Question Answering with Long Input Texts, Yes!,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz, Eds., Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 5336–5358. doi: 10.18653/v1/2022.naacl-main.391.

