Evaluating the Behavior of Call Chaining LLM Agents

Oscar Health · Published in Oscar Tech · Nov 13, 2023

By Benjamin Baker, Evan Poe, Nazem Aldroubi

Background

We’re developing a GPT-powered agent designed to answer queries about claims processing. However, providing GPT-4 with sufficient context to respond to questions about internal systems presents a significant challenge due to the API request’s limited payload size. Moreover, LLMs sometimes struggle with multi-step thinking, making it difficult to obtain correct answers in a single invocation. To overcome these issues, we’ve implemented call chaining, a method that breaks down problems into multiple steps and provides just enough context for each step in each API request. The output from each call is forwarded to subsequent steps, allowing for a contextually aware problem-solving process that produces more accurate results.
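
To make the shape of the chain concrete, here is a minimal sketch of how such a pipeline might be wired up. The `call_gpt4` wrapper (assuming the OpenAI Python SDK), the step prompts, and the `answer_claims_question` helper are illustrative placeholders, not our production implementation.

```python
# Minimal sketch of a call chain: each step receives only the context it needs,
# and its output is threaded into the next step's prompt.
from openai import OpenAI

client = OpenAI()

def call_gpt4(prompt: str) -> str:
    """Thin wrapper around a single GPT-4 chat completion."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def answer_claims_question(question: str) -> str:
    # Step 1: decide which internal data is relevant to the question.
    relevant_sources = call_gpt4(
        f"Which claims systems or records are needed to answer this question?\n{question}"
    )

    # Step 2: summarize the relevant data, keeping each request small enough
    # to fit within the payload limit.
    summaries = call_gpt4(
        f"Summarize the relevant records from these sources:\n{relevant_sources}"
    )

    # Step 3: connect the dots using the summaries produced upstream.
    return call_gpt4(
        f"Question: {question}\n"
        f"Relevant summaries:\n{summaries}\n"
        "Explain how this claim was processed."
    )
```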

While call chaining has proven to be an effective solution for enhancing GPT-4’s problem-solving capabilities, it does introduce a new set of challenges. One of the primary issues is the difficulty in pinpointing the exact step in the sequence where an unexpected output occurs. This, combined with the non-deterministic nature of these results, necessitates extensive testing to ascertain the true failure rate. An additional layer of complexity is GPT-4’s tendency to attempt a “best try” solution to problems it doesn’t fully understand, rather than failing outright. We can leverage the language model itself to address these issues.

Illustration of call chaining. There are many potential failure points due to GPT-4’s non-deterministic nature.

Ground Truth Evaluations

As a result of GPT-4’s non-deterministic outputs, small changes that solve one problem in the chain can create another problem elsewhere. Without a way to quickly measure the improvement brought about by each small tweak, it can be easy to lose progress. Getting evaluation right is fundamental to making progress. Luckily, with the right implementation, tuning, and data, LLMs can evaluate their own performance.

For our initial build, we took 500 previously handled claims questions and used them as test cases, with an extra twist. After the agent responds to each question, it invokes GPT-4 to grade the agent-generated answer against the human-provided answer (the test case). We divide responses into pass/fail, treating anything below 9.5/10 as a failure.

Prompt used for determining score given a ground truth. The examples (aka “few shot learning”) provide a rubric that improves the meaningfulness of the score.
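
The full prompt with its few-shot rubric is shown above. As a rough, hypothetical sketch of the grading call itself (reusing the `call_gpt4` wrapper from the earlier snippet), it might look like this:

```python
# Hypothetical sketch of the ground-truth grading step. The real prompt includes
# few-shot examples that serve as a rubric; only the overall shape is shown here.

PASS_THRESHOLD = 9.5  # anything below 9.5/10 is treated as a failure

def grade_against_ground_truth(question: str, agent_answer: str, human_answer: str) -> float:
    prompt = (
        "Score how well the candidate answer matches the reference answer, "
        "from 0 to 10. Respond with only the number.\n"
        f"Question: {question}\n"
        f"Reference answer: {human_answer}\n"
        f"Candidate answer: {agent_answer}"
    )
    return float(call_gpt4(prompt))

def passes(question: str, agent_answer: str, human_answer: str) -> bool:
    return grade_against_ground_truth(question, agent_answer, human_answer) >= PASS_THRESHOLD
```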

Ground truth evaluations give us a quick and convenient way to monitor broad performance trends. By analyzing average evaluation scores for large batches of tests, we can assess whether our refinements to the system yield a general trend of improvement or signal areas needing further adjustment. The iterative process of testing and adjusting is critical here; as we parse through the subproblem traces, we gain insights into the particulars of both successful and erroneous responses. The aggregate scoring then offers a high-level view of our system’s ongoing development. Once the LLM is consistently hitting performance targets, we can trigger a formal human review.

Red: Some change in the prompt chain triggers a notable degradation in performance. Green: Iterative improvements are made until we consistently hit our performance targets and trigger a human review
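
At the batch level the bookkeeping is simple. A sketch, reusing `answer_claims_question` and `grade_against_ground_truth` from the earlier snippets (the `TARGET_AVERAGE` value and test-case fields are illustrative stand-ins):

```python
# Sketch of batch-level tracking: grade every test case, average the scores,
# and flag the batch for human review once the average meets the target.

TARGET_AVERAGE = 9.5

def evaluate_batch(test_cases: list[dict]) -> tuple[float, bool]:
    scores = [
        grade_against_ground_truth(
            tc["question"], answer_claims_question(tc["question"]), tc["human_answer"]
        )
        for tc in test_cases
    ]
    average = sum(scores) / len(scores)
    return average, average >= TARGET_AVERAGE  # ready for human review?
```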

Self Confidence Evaluations

But what if there are no ground truths to measure performance against? Is there a way for the agent to evaluate its own performance even when we have nothing to compare it to? In this instance, we use two separate techniques to evaluate performance. Internally, we call these: “same-context confidence evals” and “new-context confidence evals.”

Same-Context Confidence Evals

Recall that “Claims Assistant,” the GPT-4-powered agent, answers questions by looking up relevant information, compiling a list of summaries of that data, and then making a final call to GPT-4 to connect the dots and produce an explanation. During the final phase, Explanation Generation, we include an extra instruction in the prompt:

Asking for a “confidence score” during response generation gives the agent the opportunity to evaluate its thought process and reasoning.
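
The exact instruction is shown above. A hypothetical sketch of the final explanation-generation call with the confidence ask folded in (the prompt wording and the "Confidence: <number>" parsing convention are assumptions for illustration, and `call_gpt4` is the same placeholder wrapper as before):

```python
# Sketch of a same-context confidence eval: the final explanation-generation
# prompt also asks the model to score its own reasoning, with full context.

def generate_explanation(question: str, summaries: str) -> tuple[str, float]:
    prompt = (
        f"Question: {question}\n"
        f"Relevant summaries:\n{summaries}\n"
        "Explain how this claim was processed. On the final line, write "
        "'Confidence: <0-10>' reflecting how confident you are in your reasoning."
    )
    response = call_gpt4(prompt)
    explanation, _, confidence = response.rpartition("Confidence:")
    return explanation.strip(), float(confidence.strip())  # assumes a bare number follows
```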

In some situations, same-context confidence evals are accurate enough to rely on, but not in all. They run during the very last call of the chain, and sometimes one of the previous steps returns misleading information that misguides the generation step. In those cases, explanation generation may have full confidence in its reasoning but still arrive at a wrong answer because of faulty assumptions upstream.

However, this technique is still valuable for quickly identifying when the agent struggles to reason through a particular type of problem. By giving the agent the opportunity to voice uncertainty, we find that it often expresses low confidence when generating forced or hallucinatory responses.

This observation is backed up by a correlation between our ground truth evaluations and these self confidence evaluations. In testing with an implementation of same-context confidence evals, we found that for questions that scored:

  • 7 or below against the ground truth, the same-context confidence eval came in at 7.1.
  • Above 7 against the ground truth, the same-context confidence eval came in at about 8.5.

There is a correlation, and we are confident we can strengthen it with more work on the prompt. The stronger that correlation becomes, the more trust we can place in responses for which the same-context confidence eval is high.

New-Context Confidence Evals

New-context confidence evals provide a confidence score that a response is correct, with little additional context about how the response was generated. This is akin to asking for a fresh pair of eyes on a problem. The fresh eyes likely won’t have the same detailed understanding of the problem, but a new perspective may catch something the original reasoning missed.

In practice, with our current implementation, new-context confidence evals are useful for recognizing glaring issues with a response, but not for detecting issues with the reasoning used to reach it. Used in conjunction with same-context confidence evals, they let us cover multiple bases and get a picture of the agent’s performance even when ground truths are not available.
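
A sketch of how such a fresh-eyes check might look (the prompt wording is illustrative, and `call_gpt4` is the same placeholder wrapper as before):

```python
# Sketch of a new-context confidence eval: a separate GPT-4 call scores the
# final answer without seeing any of the intermediate steps that produced it.

def new_context_confidence(question: str, final_answer: str) -> float:
    prompt = (
        "You are reviewing an answer to a claims-processing question. "
        "With no other context, rate from 0 to 10 how likely it is that the "
        "answer is correct and complete. Respond with only the number.\n"
        f"Question: {question}\n"
        f"Answer: {final_answer}"
    )
    return float(call_gpt4(prompt))
```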

What’s next?

Evaluations have proven to be valuable components in the development of the Claims Assistant use case, and there will be other useful ways to leverage them. For example, we can use evals to detect outages or drift. We could also surface a certainty score to the user alongside the response, or even withhold the response altogether when certainty falls below a defined threshold.

There are also more avenues for calculating self-evaluations. We could ask GPT-4 to devise its own evaluation criteria based on the specific problem it is solving, which would make evaluation both more tailored to specific questions and more flexible for future ones. We could also measure consistency between multiple invocations and runs. Unfortunately, running the agent twice on each problem to measure consistency is currently impractical due to the added latency and rate limits. However, as the technology improves, those constraints will loosen, opening the door to additional calls that measure consistency.
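
If latency and rate limits were not a constraint, a consistency check could be as simple as the following sketch (reusing the placeholder helpers from the earlier snippets):

```python
# Sketch of a consistency check: run the agent twice on the same question and
# ask GPT-4 whether the two answers agree. Impractical at scale today because
# of latency and rate limits.

def answers_consistent(question: str) -> bool:
    first = answer_claims_question(question)
    second = answer_claims_question(question)
    verdict = call_gpt4(
        "Do these two answers reach the same conclusion? Reply YES or NO.\n"
        f"Answer 1: {first}\n"
        f"Answer 2: {second}"
    )
    return verdict.strip().upper().startswith("YES")
```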

The more transparent and predictable the agent’s behavior, the more we can depend on it to handle important tasks. We will continue to invest in evaluations as a core part of our agent’s development.
