A simple (not easy) technique for debugging LLMs using LLMs

Replication, problem identification, iteration, double checking and benchmarking

Aman Dalmia
Inveterate Learner
5 min read · Feb 27, 2024


The overall flow for debugging LLMs. Each of the parts is described in detail below.

I’ve been working with LLMs in production for the last 10 months. Naturally, reliability and accuracy are crucial, and so is debugging. I haven’t seen enough discussion around the processes people have adopted for debugging LLMs, but that is natural given that there hasn’t been enough time for best practices to emerge yet. I want to contribute my bit to the conversation by sharing the technique I’ve arrived at, which has helped me dramatically improve the accuracy and reliability of the systems I build, quickly and efficiently:

  • Replication: Copy the system prompt and user prompt for which the LLM is failing into OpenAI’s Playground (after formatting, i.e. “\n” should become a new line), set all the parameters like temperature, max_tokens, top_p, etc. to the exact same values as in your API call and hit “Submit” (see the first sketch after this list). If you’re lucky, you will be able to replicate the issue on the first go. Depending on your settings, however, you might have to repeat this a few times until you can reproduce the bug, because the LLM’s output can vary on each API call (the chances of this reduce if you set temperature to 0, but even that doesn’t guarantee deterministic outputs, as described in more detail below).
  • Problem identification: Once you’re able to replicate the issue, simply ask the LLM to “explain” the part of the output that is incorrect. It is very important not to ask “why did you output X”, as the LLM will simply resort to apologising and giving the correct answer. The goal of this step is to figure out the part(s) of your prompt that are confusing the LLM, or any instructions that are missing from your prompt for the specific edge case where it is failing, probably because it is trying to provide an answer with missing information.
  • Iteration: Once you’ve identified what’s incorrect or missing, you need to update your prompt to handle that edge case. The way you word your new instruction matters a lot. Writing grammatically correct sentences with consistent, appropriate spacing and punctuation has an unexpectedly large impact on the consistency and reliability of the outputs. You might have to experiment with adding the instruction at the start of the prompt or at the end, or inside the description for a field if you’re extracting JSON and have provided a JSON template for the LLM to follow. I tend to keep a section named “Important Instructions” in my system prompt, to which I keep adding new instructions as the need surfaces during debugging. Now, you might be tempted to pile on more and more instructions, but as the saying (often attributed to Mark Twain) goes, “I didn’t have time to write a short letter, so I wrote a long one instead”. It takes more effort to identify the clunky bits in your prompt and make it concise. If the list of instructions is becoming confusing for you to go through, it is likely confusing for the LLM as well. Keep going back and try to find redundant instructions or instructions that contradict each other. You might have written an instruction as an overly complicated sentence; sometimes, making a sentence simpler solves the problem. You are also likely to see the problem persist even after adding a new instruction that should have handled the edge case. In such cases, you might want to rephrase your instruction or make it “shout” by adding phrases like: “if so and so happens, make sure to ALWAYS do so and so. It is very important that you pay a lot of attention to this and never make a mistake”. There is nothing magical about this exact phrase; it is just an example meant to convey the point. The main takeaway is that prompt engineering requires more perseverance the harder your task is and the more nuanced your edge cases become. You cannot give up. You have to keep iterating. But don’t iterate blindly. Pay attention to the pattern of the model’s mistakes and find smarter experiments to try rather than simply adding or removing words and sentences at random.
  • Double checking: Alas, the moment has arrived. Your hours of hard work have paid off and the LLM is finally able to give the correct output. Unfortunately, it’s not yet time to celebrate. Delete the assistant response and hit “Submit” again. You can’t believe your eyes. It is back to generating the incorrect output. “How can that be?”, you might wonder. “The temperature is 0, shouldn’t the output be deterministic?”. Unfortunately not. Parameters like temperature, top_p, etc. are just some of the sources of randomness in the inference step of an LLM. There are others that you cannot control. Because of this, even with the temperature set to 0, the outputs of the LLM can vary. Sometimes, these variations are harmless because they eventually land on the same answer, even if they use a different combination of words in the output. But sometimes, they can break your heart and cost you sleep. Typically, variations that oscillate between the correct and the incorrect output are a result of ambiguity, confusion, contradiction or missing instructions in your prompt. Thus, you’ll have to go back to the “Iteration” step and update your prompt until multiple LLM calls for the same input give the correct output consistently every time (see the consistency-check sketch after this list).
  • Benchmarking: Are you kidding me? We aren’t done yet? Unfortunately not. But we are engineers who care about building a solid, reliable system using a technology that has immense potential but doesn’t have a good reputation for reliability. So, you found what was missing or incorrect in your prompt, and after hundreds of iterations, you’ve finally landed on something that consistently produces the correct output for the edge case where it was previously failing. That’s awesome. But how do you know that the new prompt doesn’t create new failure cases? You don’t. So, you’ll need to rerun inference on your entire benchmark dataset, as in the last sketch after this list (if you don’t have a benchmark dataset and you’re building an LLM application for production, you really need one, like, yesterday, trust me). If none of the other examples are affected, bravo. The debugging is complete. At least, for the moment. However, if the performance worsens on other samples, you’ll have to go back to step 1 for those new erroneous samples.
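
Here is a minimal sketch of the replication and problem-identification steps done via the API instead of the Playground, assuming the OpenAI Python SDK (openai >= 1.0). The model name, prompts and parameter values below are placeholders, not the ones from my systems; the point is that the debugging call must mirror your production call’s parameters exactly, and that the follow-up turn asks the model to explain the incorrect part rather than asking “why”.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholders: substitute the exact prompts and parameters from your failing production call.
SYSTEM_PROMPT = """You are an assistant that extracts order details as JSON.

Important Instructions:
- If the delivery date is missing, set "delivery_date" to null instead of guessing.
"""
USER_PROMPT = "Please ship 3 units of SKU-42 to Berlin."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

# Replication: every parameter matches the failing production call.
response = client.chat.completions.create(
    model="gpt-4o-mini",   # whichever model your production system uses
    messages=messages,
    temperature=0,
    top_p=1,
    max_tokens=512,
)
output = response.choices[0].message.content
print(output)

# Problem identification: ask the model to explain the incorrect part,
# not "why did you output X" (which tends to trigger an apology plus a correction).
messages += [
    {"role": "assistant", "content": output},
    {"role": "user", "content": "Explain the 'delivery_date' field in your output."},
]
explanation = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    temperature=0,
)
print(explanation.choices[0].message.content)
```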
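
The double-checking step boils down to re-running the exact same request several times and verifying that the output stays correct. Here is a minimal sketch, reusing the hypothetical client and messages from the previous example; the pass/fail check is a naive substring match, and in practice you would parse the output and compare whatever fields matter for your task.

```python
def is_consistent(client, messages, passes_check, n_runs=5, **params):
    """Call the model n_runs times with identical inputs and report whether
    every output passes the correctness check."""
    outputs = []
    for _ in range(n_runs):
        response = client.chat.completions.create(messages=messages, **params)
        outputs.append(response.choices[0].message.content)
    failures = [o for o in outputs if not passes_check(o)]
    return len(failures) == 0, outputs


# Example usage with the placeholders from the previous sketch:
ok, outputs = is_consistent(
    client,
    messages[:2],  # just the system and user prompts
    passes_check=lambda out: '"delivery_date": null' in out,
    n_runs=5,
    model="gpt-4o-mini",
    temperature=0,
)
# If ok is False, go back to the "Iteration" step and tighten the prompt.
```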
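
Finally, the benchmarking step: rerun the updated prompt over the full benchmark set to make sure the fix didn’t break anything else. A minimal sketch, assuming a hypothetical benchmark.jsonl file where each line holds an "input" and an "expected" field; swap in your own dataset format and scoring logic.

```python
import json


def run_benchmark(client, system_prompt, path="benchmark.jsonl",
                  model="gpt-4o-mini", temperature=0):
    """Rerun inference over every benchmark example and collect the failures."""
    failures = []
    total = 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            total += 1
            response = client.chat.completions.create(
                model=model,
                temperature=temperature,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": example["input"]},
                ],
            )
            output = response.choices[0].message.content
            if example["expected"] not in output:  # replace with your own scoring
                failures.append({"input": example["input"], "output": output})
    print(f"{total - len(failures)}/{total} examples passed")
    return failures


# Any example that regresses goes back to the Replication step above.
```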

If this sounds like a lot of work, it is. But that is also what makes it fun (although it doesn’t look like anything remotely resembling fun when you’re knee-deep in the debugging process). Not everyone will be able to persevere and find the underlying patterns that apply correctly across all the samples while gracefully handling all the edge cases. But you did. And that’s what makes you awesome.

I hope you learnt something useful from this post. I’d love to hear your experiences (and frustrations) with prompt engineering (or LLMs, RAG, etc.) as well.

Disclaimer: this debugging tactic assumes that you’ve already tried (to no avail) some of the basic prompt engineering hacks, like assigning a role to the LLM, enclosing distinct parts of your input within delimiters, and giving the model time to think using techniques like chain-of-thought reasoning. Many of these basic tactics are documented in OpenAI’s prompt engineering guide, as well as in other excellent prompt engineering guides available online.
