Taming the Wild — Enhancing LLM Reliability

Ozgur Guler
Aug 20, 2023

What is LLM Reliability?

A “reliable” Large Language Model (LLM) is characterized by its ability to produce outputs that are both informative and factually accurate. The reliability of an LLM is contingent upon addressing the following five challenges, which are inherently rooted in the stochastic nature of LLMs. These challenges constitute predominantly open problems in research.

1. Misinformation — LLMs may assimilate incorrect information from flawed datasets, leading to a deficiency in learning about less prominent entities and relations. This tendency can further result in the amplification of errors and biases present within their training data. To mitigate these issues, the quality of training data needs to be improved. Strategies such as constructing more faithful datasets and implementing rigorous data cleaning processes can significantly reduce instances of misinformation in the model's responses. Currently, there are substantial initiatives underway to create clean and reliable datasets for training LLMs. One notable example is the RedPajama project, which assembled a comprehensive and fully open dataset comprising 1.2 trillion tokens, created in adherence to the methodology delineated in the original LLaMA paper.

2. Hallucinations — In the context of LLMs, "hallucinations" refer to incorrect outputs generated even when the training data is factually correct. This phenomenon is similar to a cognitive disorder known as "confabulation," where false memories are produced without any intention to deceive. (This may be more familiar to science fiction fans as a concept explored in "Total Recall".) The underlying cause of hallucinations in LLMs remains elusive, though "data drift" (a mismatch between the training data and the data the model encounters at inference time) is one proposed cause. It is essential to recognize, however, that the current understanding of LLM hallucinations is largely based on empirical observations, and the cause is not fully understood.

The distinction between hallucinations and misinformation within LLMs merits further consideration. While misinformation often arises from the recitation of incorrect training data by the LLM, hallucinations are a more perplexing issue, manifesting even when the training data is accurate.

3. Inconsistency — Inconsistent LLM answers create confusion among users and reduce user trust. The exact cause of inconsistency is unclear; confusing and conflicting information in the training data can certainly contribute, but inconsistency is largely attributed to the randomness inherent in LLM sampling.

4. Miscalibration — LLMs have been found to exhibit over-confidence on topics where objective answers are lacking, as well as in areas where their inherent limitations (e.g. being less accurate than domain experts) should caution against certainty. An emerging mitigation is abstention, where the LLM declines to answer questions in areas where it lacks sufficient information.

5. Sycophancy — An LLM may tend to flatter users by reconfirming their misconceptions and stated beliefs. This is particularly evident when users challenge the model's outputs or repeatedly push the model to comply. It is possible that the RLHF stage promotes and reinforces this agreement with human users: during alignment, LLMs are fed "friendly" examples that can be interpreted as being sycophantic towards human users.

How to increase the reliability of LLMs?


The underlying mechanisms governing the behavior of LLMs remain largely opaque, with observed performance primarily grounded in empirical results rather than a comprehensive theoretical understanding. This empirical nature of LLMs bears similarity to certain physical phenomena, such as the phase transition of water to ice below a specific temperature; similarly, LLMs demonstrate advanced language generation capabilities when operating above a certain scale or complexity.

Addressing the reliability challenges associated with LLMs necessitates an empirical approach, reflective of the intricate and not fully understood dynamics of these models. With the rise in popularity of LLMs, a specialized subfield referred to as ‘prompt engineering’ has emerged, garnering substantial interest within the scientific community. This field has witnessed a burgeoning volume of research, with a growing number of papers being published weekly.

Efforts to enhance LLM reliability have led to the discovery of several strategies and methods, which I will cover below…

1. Better aligned models are more reliable

The paper titled ‘Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment,’ published by ByteDance Research on August 10th, offers an insightful examination of the challenges related to the trustworthiness of Large Language Models (LLMs). In this work, the authors advocate for the development of more ‘aligned’ models, contending that alignment with human preferences enhances trustworthiness. Within the context of this study, ‘alignment’ is specifically defined as the degree to which the generations of an LLM conform to the preferences and expectations of human users. This paper is instrumental in understanding the complex landscape of trustworthiness in the burgeoning field of LLMs, shedding light on a critical aspect of model evaluation and development.

GPT-3 (the earlier OpenAI davinci-002 models) was a step change in LLM performance. Still, these models suffered from hallucinations, vulnerability to adversarial attacks, and amplified biases, which limited their use and at the same time opened them up to misuse. They were not reliable. GPT-3 models (now discontinued) became significantly more aligned after Supervised Fine-Tuning (SFT) and RLHF. The resulting "aligned" GPT-3 lineage leads to the current ChatGPT model, the well-known "gpt-35-turbo".

InstructGPT models are significantly more aligned than GPT-3 following SFT & RLHF (source: https://openai.com/research/instruction-following)

Therefore, enhancing alignment serves as a general best practice for increasing the trustworthiness of an LLM.

2. Give LLM’s more time to generate

Decomposing the "greedy", single-pass resolution of a prompt into multiple subtasks, and guiding the LLM through each while keeping it on task, increases the reliability of LLM generations. Splitting complex tasks into simpler ones with techniques like least-to-most prompting, asking the model to self-reflect and self-refine, to explain its generations, and even asking whether it might have missed anything in its response all keep the model on task and increase the reliability of generations. Below are the most effective prompt engineering methods to come out of recent research.

  • Chain of Thought Prompting [2] — Chain-of-thought prompting is a few-shot learning approach to improve the reasoning ability of large language models in arithmetic, commonsense, and symbolic reasoning tasks. The main idea is to include a chain of thought, a series of intermediate natural language reasoning steps, in the few-shot prompt.

The zero-shot variant of CoT [16,18] simply adds a "Let's think step by step" instruction at the end of the prompt. When the model spells out its thinking steps, this brings "explainability" benefits too.

Theory-of-Mind (ToM) tasks, focused on understanding human beliefs and goals, are vital for the common-sense reasoning of LLMs. In sufficiently large models (read: >100B parameters), ToM can be observed as an emergent capability (Kosinski, 2023). The larger the model, the more effective CoT generally is.

  • Least-to-most prompting [3] — The key idea in this strategy is to prompt the LLM to break down a complex problem into a series of simpler subproblems and then solve them in sequence, where the solution to each subproblem depends on the answer to the previous one [3]. With CoT prompting we give the model few-shot examples of the requested thinking pattern without asking the model to plan and strategize itself. With least-to-most prompting, by contrast, it is the LLM that does the task decomposition and planning, which makes it similar to agency methods such as ReAct. With ReAct there is a whole set of service APIs the LLM can plan with and execute towards completion; with least-to-most prompting we are mostly after task decomposition without any external tooling (a minimal sketch follows this list).
Least-to-most prompting
  • Self-refine [4] — The main idea with self-refine is to generate an initial output using an LLM; then the same LLM provides feedback on its output and uses it to refine itself, iteratively.
  • RCI (Recursively Criticize and Improve) [12] — RCI works by first having the LLM generate an output based on zero-shot prompting. Then, RCI prompts the LLM to identify problems with the given output. After the LLM has identified problems with the output, RCI prompts the LLM to generate an updated output.
  • Self-consistency [5] — Self-consistency leverages the intuition that complex reasoning tasks typically admit multiple reasoning paths that reach a correct answer.

The self-consistency method contains three steps: (1) prompt a language model using chain-of-thought (CoT) prompting; (2) replace the “greedy decode” in CoT prompting by sampling from the language model’s decoder to generate a diverse set of reasoning paths; and (3) marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.

  • Tree of Thought (ToT) prompting [6] — The ToT technique is inspired by the human mind’s approach for solving complex reasoning tasks through trial and error. In this process, the human mind explores the solution space through a tree-like thought process, allowing for backtracking when necessary. ToT allows LLMs to interactively backtrack and explore alternate reasoning chains, avoiding fixation on a single line of flawed reasoning.
  • Self-reflection [7] — Employ a Large Language Model (LLM) to assess if another generative model’s output is progressing in the correct direction during its generation process. As stated in the paper, the objective of the reflection loop is to assist the agent in rectifying frequent instances of hallucination and inefficiency through a method of trial and error.
  • Self-Ask [8] is a method to repeatedly prompt the model to ask follow-up questions to construct the thought process iteratively. What makes self-ask prompting intriguing is the way it visibly illustrates the LLM’s reasoning process, breaking down the question into more manageable follow-up inquiries. The LLM possesses the awareness to recognize when the ultimate response has been attained, allowing it to transition from intermediate answers to a conclusive resolution.
Source [8]
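To make the task-decomposition idea concrete, below is a minimal least-to-most prompting sketch. It assumes the pre-1.0 openai Python SDK and a gpt-3.5-turbo chat model (for AzureOpenAI you would set the api_type and pass your deployment name); the sample question, the chat helper and the prompt wording are illustrative choices, not a prescribed recipe.

import openai

def chat(prompt: str, temperature: float = 0.0) -> str:
    # single-turn helper around the (pre-1.0) ChatCompletion API
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp["choices"][0]["message"]["content"]

question = "A store sells pens at 3 for $4. How much do 12 pens cost after a 10% discount?"

# Stage 1: ask the model to decompose the problem without solving it.
subproblems = chat(
    "Break the following problem into a numbered list of simpler subproblems, "
    f"without solving them:\n{question}"
)

# Stage 2: solve the subproblems in order, feeding earlier answers forward.
solved = ""
for step in [s for s in subproblems.splitlines() if s.strip()]:
    answer = chat(
        f"Problem: {question}\n"
        f"Previously solved steps:\n{solved or '(none yet)'}\n"
        f"Now solve only this subproblem: {step}"
    )
    solved += f"{step}\nAnswer: {answer}\n"

print(chat(f"Problem: {question}\nSolved steps:\n{solved}\nGive the final answer only."))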

3. Micromanage LLMs

What is implied here by "micromanaging" is being prescriptive about the task the LLM is asked to carry out, usually as a step-by-step plan, rather than requiring a "greedy" resolution of a long and sophisticated prompt where the LLM can get lost in the back alleys of its own thinking. Here are some best practices for constructing prompts that yield more reliable generations (a minimal sketch follows the list)…

  • Give clear instructions
  • Guide the LLM’s through a plan defined with small, clearly defined steps. This is also called the “theory of mind” (ToM) reasoning, which involves tracking the mental state of agents, such as their goals, and what they know (Kosinski, 2023; Langley et al., 2022)
  • Structure the instruction to keep the model on task
  • Constrain what the model can say (e.g. by using sentence labels instead of free-form sentences) or use Azure OpenAI function calling.
  • Provide a default answer in the system prompt for the model to fall back on when it cannot return an answer.
  • Let the model know when to stop. Use LLM parameters like max tokens and stop sequences to guide the model on when to stop generating further text.
  • Use Directional Stimulus Prompting — e.g. “Summarize the document based on the hint. Hint …” where possible.
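As a minimal sketch of what such a "micromanaged" prompt can look like, the example below prescribes a small step plan, constrains the output to a set of labels, supplies a default answer, and bounds the generation with max_tokens and a stop sequence. It assumes the pre-1.0 openai Python SDK and a gpt-3.5-turbo model; the classification task, labels and <END> token are made up for illustration (with Azure OpenAI you would pass your deployment name instead).

import openai

system_prompt = (
    "You classify customer emails. Follow these steps exactly:\n"
    "1. Identify the customer's main request.\n"
    "2. Pick ONE label from: BILLING, TECHNICAL, CANCELLATION, OTHER.\n"
    "3. Reply with the label only, followed by the token <END>.\n"
    "If you cannot decide, reply OTHER <END>."
)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "My invoice for July is double what I expected."},
    ],
    temperature=0,   # reduce randomness for more consistent answers
    max_tokens=10,   # a label should never need more than this
    stop=["<END>"],  # tell the model when to stop generating
)
print(resp["choices"][0]["message"]["content"].strip())  # e.g. "BILLING"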

4. Ensemble "prompting" with LLMs

  • Ask for justifications of many possible answers, and then synthesize
  • Generate many outputs, and then use the model to pick the best one, or take the one with the majority vote [10] (see the sketch after this list).
  • Tree of Thought [11] prompting, enabled with prompts like "Imagine three different experts are answering this question.", can help the LLM get unstuck from the "back alleys" of its own thinking…
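A minimal sketch of the majority-vote (self-consistency) idea is shown below. It assumes the pre-1.0 openai Python SDK; the temperature, sample count, question and the "ANSWER:" convention are illustrative choices, not part of the method's definition.

from collections import Counter

import openai

def sample_answers(question: str, n: int = 5) -> list:
    # sample several reasoning paths at a non-zero temperature
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"{question}\nThink step by step, then end with 'ANSWER: <answer>'.",
        }],
        temperature=0.8,  # diversity is the point: we want different reasoning paths
        n=n,              # request several completions in one call
    )
    answers = []
    for choice in resp["choices"]:
        text = choice["message"]["content"]
        if "ANSWER:" in text:
            answers.append(text.rsplit("ANSWER:", 1)[1].strip())
    return answers

answers = sample_answers("If a train travels 60 km in 45 minutes, what is its speed in km/h?")
print(Counter(answers).most_common(1)[0][0] if answers else "no answer extracted")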

5. Give LLM’s more context or tools

  • In-context learning with the RAG pattern — use vector indexes to search amongst embeddings (the semantic core) of your data. The RAG pattern is easily deployable with Azure ML PromptFlow, where you can point to your private data and the RAG steps are automated by PromptFlow itself; a minimal sketch follows this list. Refer to the prior blog post on the subject [link].
  • If you add the right information to the input, you can ask the model to include citations in its answers drawn from the documents you provide. You can then check these citations automatically by looking for matching text in those documents. [link]
OpenAI — GPT best practices
  • Use 3rd-party APIs (e.g. the SERP API) to validate LLM answers.
  • You can leverage the ReACT [9] (Reason and Act) pattern to enable the LLM to use the tools it requires for fact checking, or to lean on a "service infrastructure" framework to compensate for inherent shortcomings (e.g. lack of arithmetic skills). For a simple implementation of the ReACT pattern with AzureOpenAI & Langchain, refer to my earlier post "ReACT — Reason & ACT implementations for LLM Agency" here.
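To make the first bullet above concrete, here is a minimal RAG sketch: embed a small corpus, retrieve the closest passages with a brute-force cosine search, and ask the model to answer with numbered citations. It assumes the pre-1.0 openai Python SDK and the text-embedding-ada-002 model; in practice a vector index (e.g. the one PromptFlow builds for you, or Azure Cognitive Search) replaces the brute-force search, and the two documents here are invented for illustration.

import numpy as np
import openai

documents = [
    "Contoso's refund window is 30 days from delivery.",
    "Contoso ships to the EU and the UK only.",
]

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

doc_vectors = [embed(d) for d in documents]

def answer(question: str) -> str:
    q = embed(question)
    # brute-force cosine similarity; a vector index would do this at scale
    scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vectors]
    top = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:2]
    context = "\n".join(f"[{i + 1}] {documents[i]}" for i in top)
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Answer using only the numbered sources below and cite them like [1].\n"
                       f"{context}\n\nQuestion: {question}",
        }],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

print(answer("Can I return an item after three weeks?"))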

6. Auto-prompting — RCI, ReACT

Recursive Criticism and Improvement (RCI) leverages a pre-trained Large Language Model (LLM) to perform tasks via natural language instructions. The method employs a prompting system that first generates an output, identifies its shortcomings, and then refines it for an improved result.
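A minimal sketch of the RCI loop might look like the following; it assumes the pre-1.0 openai Python SDK, and the number of rounds and the prompt wording are illustrative rather than taken from the paper.

import openai

def chat(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

def rci(task: str, rounds: int = 2) -> str:
    # initial zero-shot answer
    output = chat(f"Task: {task}\nAnswer:")
    for _ in range(rounds):
        # criticize: ask the model to find problems with its own output
        critique = chat(
            f"Task: {task}\nCandidate answer:\n{output}\n"
            "List any problems with this answer (factual errors, missing steps, unclear reasoning)."
        )
        # improve: ask the model to rewrite the answer addressing the critique
        output = chat(
            f"Task: {task}\nCandidate answer:\n{output}\nCritique:\n{critique}\n"
            "Rewrite the answer, fixing every problem raised in the critique."
        )
    return output

print(rci("Explain why the sky is blue in two sentences."))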

ReAct is an innovative paradigm that melds the faculties of reasoning and action within Large Language Models (LLMs). Drawing inspiration from human capabilities to both ‘act’ and ‘reason,’ ReAct enhances LLMs by guiding them to generate verbal traces of reasoning as well as actionable steps for tasks. This hybrid approach enables LLMs to dynamically adjust plans and interact with external data sources like Wikipedia for more accurate and informed reasoning. The framework aims to resolve limitations such as fact inaccuracies and cascading errors, which have been observed in existing methods like Chain-of-thought (CoT) prompting.
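For comparison, here is a minimal ReAct-style loop; it reuses the chat helper from the RCI sketch above and a single hypothetical wiki_lookup stub in place of a real search tool (frameworks such as LangChain provide production-grade versions of this loop).

def wiki_lookup(query: str) -> str:
    # placeholder tool; a real agent would call a search or Wikipedia API here
    return f"(stub) top search result for '{query}'"

def react(question: str, max_steps: int = 4) -> str:
    transcript = (
        "Answer the question. Respond with lines of the form\n"
        "Thought: ... then Action: lookup[<query>] and wait for an Observation,\n"
        "or finish with 'Final Answer: ...'.\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        step = chat(transcript)  # chat() as defined in the RCI sketch above
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action: lookup[" in step:
            query = step.split("Action: lookup[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {wiki_lookup(query)}\n"
    return "(no final answer within the step budget)"

print(react("Which year was the university attended by the author of 'Walden' founded?"))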

7. Fine-tune custom models for specific tasks

In "Textbooks Are All You Need", which came out in June, the MS Research team argues that "data quality can dramatically change the shape of the scaling laws, potentially allowing to match the performance of large-scale models with much leaner training/models". High-quality data tends to make the whole LLM system a lot healthier, and baking that training into the model weights is an effective way to mitigate the inherent reliability shortcomings of LLMs.

[13] from Google DeepMind suggests that while scaling and instruction tuning increase sycophancy, fine-tuning with simple synthetic data reduces sycophancy in LLMs…
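As a rough illustration of what such synthetic data can look like, the sketch below writes a few records where the correct answer is deliberately independent of the user's stated belief; the claims, the prompt/completion JSONL format, and the file name are all assumptions for illustration, and the exact record format depends on the fine-tuning stack you use.

import json
import random

# (claim, correct answer, plausible wrong answer) triples; illustrative only
facts = [
    ("2 + 2", "4", "5"),
    ("the capital of France", "Paris", "Lyon"),
]

with open("anti_sycophancy.jsonl", "w") as f:
    for claim, truth, wrong in facts:
        opinion = random.choice([truth, wrong])  # the user's stated belief may be wrong
        prompt = (
            f"Human: I believe {claim} is {opinion}. What do you think {claim} is?\n"
            "Assistant:"
        )
        # the target always states the truth, regardless of the user's belief
        f.write(json.dumps({"prompt": prompt, "completion": f" {truth}"}) + "\n")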

Conclusion

We have covered the major empirical methods that may help with LLM reliability, methods that have emerged from developer feedback and research. This is an increasingly active research area, and I am hopeful we will see more effective techniques for making LLMs more reliable soon.

Ozgur Guler

I am a Solutions Architect at MS where I work with Startups & Digital Natives focusing on app development with AzureOpenAI.

Subscribe to my AzureOpenAI Builders Newsletter, where we cover the latest on building with #AzureOpenAI, on LinkedIn here.

https://www.linkedin.com/build-relation/newsletter-follow?entityUrn=7057325620778680320

References

  • [1] Liu, Yang, et al. "Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment." ByteDance Research, 9 Aug. 2023. [link]
  • [2] Wei, Jason, et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Google Research, Brain Team. [link]
  • [3] Zhou, Denny, et al. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." Google Research, Brain Team. [link]
  • [4] Madaan, Aman, et al. "Self-Refine: Iterative Refinement with Self-Feedback." Language Technologies Institute, Carnegie Mellon University. [link]
  • [5] Wang, Xuezhi, et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." Google Research, Brain Team. [link]
  • [6] Long, Jieyi. "Large Language Model Guided Tree-of-Thought." Theta Labs, Inc., San Jose, CA. [link]
  • [7] Shinn, Noah, et al. "Reflexion: Language Agents with Verbal Reinforcement Learning." Northeastern University; Princeton University; Massachusetts Institute of Technology. [link]
  • [8] Press, Ofir, et al. "Measuring and Narrowing the Compositionality Gap in Language Models." Paul G. Allen School of Computer Science & Engineering, University of Washington; MosaicML; Meta AI Research; Allen Institute for AI. [link]
  • [9] Yao, Shunyu, et al. "ReAct: Synergizing Reasoning and Acting in Language Models." Department of Computer Science, Princeton University; Google Research, Brain Team. [link]
  • [10] Wang, Xuezhi, et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." Google Research, Brain Team. [link]
  • [11] Yao, Shunyu, et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." Princeton University; Google DeepMind. [link]
  • [12] Kim, G., Baldi, P., and McAleer, S. "Language Models can Solve Computer Tasks." University of California, Irvine; Carnegie Mellon University. [link]
  • [13] Wei, Jerry, et al. "Simple Synthetic Data Reduces Sycophancy in Large Language Models." Google DeepMind. [link]
  • [14] OpenAI — Aligning Language Models to Follow Instructions. [link]
  • [15] OpenAI — GPT Prompting Best Practices. [link]
  • [16] Kojima, Takeshi, et al. "Large Language Models are Zero-Shot Reasoners." The University of Tokyo; Google Research, Brain Team. [link]
  • [17] MS Learn — Prompt Engineering. [link]
  • [18] Rahimi Moghaddam, Shima, and Christopher J. Honey. "Boosting Theory-of-Mind Performance in Large Language Models via Prompting." Johns Hopkins University. [link]
