The Challenge of Integrating LLMs into Deterministic Systems

Travis Barton
6 min read · Aug 24, 2023

When `Could not parse LLM output` makes you want to tear out your hair.

Navigating the turbulent seas of technological innovation is never for the faint-hearted, especially when attempting to harness the raw power of Large Language Models (LLMs) and seamlessly integrate them into traditional computer systems. This challenge is akin to merging the unpredictable, swirling current of a river (our LLMs) into the steady and ordered flow of a canal (the deterministic systems). Like any intrepid explorer, we’ll inevitably stumble across obstacles and impasses along our journey: a ValueError here, a parsing error there.

But fear not! It’s through overcoming these trials and tribulations that we’ll unearth new insights, and in this article, I’ll attempt to shed light on one of the most common issues that arise when integrating LLMs into your software. Strap in, and let’s dive into the complex world of transitioning from stochastic to deterministic systems.

Note: this article assumes you know some of the basics of LLMs and Langchain (the most popular LLM agent framework).

This is a story about how to use language models in traditional computer systems. With the rise of ChatGPT, Llama 2, Bard, etc., more and more developers are going to be integrating LLM systems into their software, and regardless of the application, everyone will come across the same error:

ValueError: Could not parse LLM output: 

It is the bane of every AI Engineer, but why does this happen?

It comes from the fact that (for back-end systems) every LLM output eventually has to become JSON (or something equivalent). When using a vanilla Langchain agent with the ReAct framework, you follow a prompt like this:

Answer the following questions as best you can, but speaking as a pirate might speak. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin! Remember to speak as a pirate when giving your final answer. Use lots of "Arg"s

Question: {input}
{agent_scratchpad}

The key is the line `Observation: the result of the action`. That observation is the return value of a Tool in Langchain, which means calling a function that needs an argument. Some tools, like Bing Search, can take any string as an argument. Others, like Zapier’s NLA tool, require specific arguments to work correctly. Either way, the framework has to figure out what to pass into the tool, and that’s where the trouble begins, because it has to parse that out of the string of text that is the agent’s history.
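To make the gap concrete, here is a small illustrative example (the tool name and query are made up): the model hands back one opaque string, and the framework needs structured data out of it.

# What the model hands back is a single opaque string...
llm_output = """Thought: I should look this up
Action: Search
Action Input: weather in San Francisco today"""

# ...but the tool call the back end actually needs is structured data,
# something equivalent to:
tool_call = {"tool": "Search", "tool_input": "weather in San Francisco today"}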

So what? Why can't we parse a model’s output?

Well, because the agent is stochastic.

definition: Stochastic refers to a random or probabilistic process, one involving chance or probability. A stochastic process is one whose behavior is non-deterministic in nature and evolves over time due to random fluctuations. It is characterized by its probability distribution at a given time and how that distribution changes over time.

Language models are stochastic models, which means their behavior is non-deterministic in nature and involves chance or probability. They generate outputs based on probability distributions, not deterministic rules. This makes parsing their output difficult, since the output can vary each time.

Note: we can set the temperature to 0, but we still have to run the model once to see what the output is, and that first output can still be a surprise.
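For example, here is a minimal sketch (assuming 2023-era Langchain and a configured OpenAI key) of a temperature-0 call; decoding is greedy and repeatable, but we still only learn what the text is after the call returns.

from langchain.chat_models import ChatOpenAI

# temperature=0 makes decoding greedy, so repeated calls agree with each other,
# but the first output is still unknown until the model actually responds.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
print(llm.predict("Reply with two lines: an Action line, then an Action Input line."))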

So how do we dynamically parse something like that? Well, according to base Langchain, we just hope! No, seriously, this is their stock parsing module:

import re
from typing import Union

from langchain.agents import AgentOutputParser
from langchain.schema import AgentAction, AgentFinish, OutputParserException


class CustomOutputParser(AgentOutputParser):

    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:  # this is the MAJOR problem!
            raise OutputParserException(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)

If that regex, `r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"`, doesn't find a match (which is a pretty strict requirement), then the whole thing topples over and we get that annoying parsing error:

ValueError: Could not parse LLM output:
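You can see how brittle that requirement is by feeding the regex a well-behaved completion and a chatty one (both strings below are invented for illustration):

import re

regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"

good = "Thought: I should search\nAction: Search\nAction Input: weather in SF"
bad = "I couldn't pick a tool, so here is my best guess: 42."

print(re.search(regex, good, re.DOTALL))  # <re.Match object ...> -> parsed happily
print(re.search(regex, bad, re.DOTALL))   # None -> OutputParserException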

Now, Langchain offers some solutions to this, including a whole page in its documentation dedicated to “Handle parsing errors”, and OpenAI tries to solve it with OpenAI Functions (also available via Langchain, and the solution that we use at Attention), but these will only get you part of the way there. Even these special functions sometimes fail to produce proper JSON.
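A minimal sketch of the first of those options, assuming 2023-era Langchain and an OpenAI key (the tool list here is just a placeholder): when a parse fails, the executor feeds the error back to the model as an observation and retries rather than raising immediately. This softens the problem; it does not remove it.

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)
tools = load_tools(["llm-math"], llm=llm)  # any tool list will do for this sketch

# handle_parsing_errors=True: instead of raising on an unparseable completion,
# the executor sends the parse error back to the model and lets it try again.
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    handle_parsing_errors=True,
)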

So what's there to do? How can I tell if this will be a problem?

Well, the first thing is to understand just how bad the “𝛼” is.

definition: For this article, 𝛼 is the probability that a process fails to transition from a “soft” to a “hard” system.

definition: For this article, a hard system is a system with strict typing requirements, like JSON, API calls, or code, while a soft system is a stochastic system, like an LLM, that can accept non-exact input.

Let's assume we have a fairly robust system with a failure rate of 0.01 (determined from testing, but that's for another article). In a relatively simple system with 4 agents, where each agent has at least 3 transitions from “soft systems” to “hard systems” (This is probably an underestimation as the transition happens every iteration and the default max_iteration value for Langchain is 15)

The probability of at least one of those transitions failing (and thus causing a complete failure) can be modeled by:

P(at least one failure) = 1 − (1 − 𝛼)^n

With 𝛼 = 0.01 and n = 12, we end up with a probability of failure of around 11%.
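The arithmetic is easy to check:

alpha = 0.01  # per-transition failure rate
n = 4 * 3     # 4 agents x 3 soft-to-hard transitions each
p_any_failure = 1 - (1 - alpha) ** n
print(round(p_any_failure, 4))  # 0.1136 -> roughly an 11% chance the whole run dies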

Ouch.

Even with only a 1% chance of failure per transition, approximately 11% of executions could fail end-to-end. That will not fly in an enterprise scenario.

So what do we do?

Well, that's a take for another article, but here’s a preview:

  1. Get to 6-sigma level accuracy on transitions (turning 11% → 0.01%).
    This can be achieved with expensive methods like retries (see the sketch after this list) combined with limiting methods like OpenAI functions, but these reduce your flexibility and crank up your time-to-result. They also often require you to know what values you want to extract prior to the transition (not always the case in Auto-GPT-style systems). Info on 6-sigma.
  2. Make systems resilient to failure.
    If we are okay with the rare agent failure and are willing to keep chugging along regardless, then we can ignore this problem almost entirely. For example, if an agent doesn't require previous input to act and it fails, we can either append it to the end of the stack to try again, or just accept its failure and continue onwards.
  3. Stay inside soft systems as long as possible.
    The failure happens in the transition. If we keep outputs locked inside cycles of soft systems as much as possible, we reduce our chances of ever reaching a failure point. This doesn't mean taking a pure-LLM approach; it just means not structuring output until we have to. One fact we can take for granted is that LLMs (almost) always return strings of some kind. If we just act as routers, moving outputs from place to place, injecting them where they're needed and dropping them where they're not, then we can greatly reduce the chance of a transition failure. This will not eliminate the problem, but it will help. This is also why AutoGPT is so impressive: it almost never leaves the soft system, and when it does, it allows for very high retry counts, attempting a solution by brute force.
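As a flavor of options 1 and 2, here is a hypothetical retry wrapper (the helper name and flags are mine, not Langchain's): retry the flaky soft-to-hard transition a few times, and if it still fails, either surface the failure or swallow it and move on, depending on how resilient the surrounding system is allowed to be.

def run_with_retries(agent_executor, task, max_attempts=3, swallow_failures=False):
    # Hypothetical helper: `agent_executor` is any Langchain agent with a .run() method.
    for attempt in range(1, max_attempts + 1):
        try:
            return agent_executor.run(task)
        except ValueError:  # "Could not parse LLM output" surfaces as a ValueError
            if attempt == max_attempts:
                if swallow_failures:
                    return None  # accept the rare failure and keep the pipeline moving
                raise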

Hope this was helpful. These transitions will continue to plague LLM developers, and until we find a more solid solution they will act as a barrier keeping LLMs out of enterprise systems.
