ChatChaos: The Good, the Bad, and the Ugly

Andrew Nguonly
5 min read · Dec 17, 2023


ChatGPT 4 prompt: “Generate an image of a chaotic LLM: ChatChaos. Sometimes the LLM returns invalid JSON. Sometimes the LLM responds slowly. Sometimes the LLM hallucinates. The image should be futuristic with a techno-punk style.”

In my previous article, I explored approaches for failing over and load balancing LLMs in the event of an outage with an LLM provider. This article is a continuation of that exploration.

How do developers protect and build resilience against chaotic LLM behavior?

Trouble in Paradise ⛈️

The last couple of months at OpenAI don’t need to be rehashed, but the aftermath of the situation calls for more stability in LLM apps: not just in products that depend on GPT, but in any system built on large language models. Since Sam Altman’s return, a variety of complaints have sprouted up on X/Twitter about GPT’s performance.

At first glance, JSON mode seemed like a feature that would solve all of our biggest headaches. But not quite…

And now, GPT is somehow getting lazier?

Something’s gotta give…

Chaos Engineering 🔥

The behavior of LLMs can be unpredictable and chaotic. Various features (e.g. JSON mode) and techniques (e.g. Chain-of-Thought) have been developed to manage uncertainty and improve results, but chaos engineering has yet to be explored. The Principles of Chaos Engineering teach us that experimenting with “chaos” will lead to building more resilient systems.

Netflix built and open-sourced Chaos Monkey, a tool for randomly terminating VM instances and containers. With Chaos Monkey running in our production environments, we were forced to develop applications that maintained uptime even when an EC2 instance was terminated without warning. We simply had no choice.

ChatChaos is inspired by Chaos Monkey. The implementation is a chat model abstraction built with LangChain. ChatChaos returns invalid JSON, tries to hallucinate, and simulates long response times. The “chaotic” behavior built into the abstraction is also based on my experience building and operating admin AI. It’s a simple tool with good, bad, and ugly implications.

Before the Good

Like Chaos Monkey, ChatChaos misbehaves on a schedule. The cadence is configured with a cron expression, and a ratio parameter further controls the frequency of chaotic events.

# imports (ChatChaos itself is imported from the project's own package; see References)
from croniter import croniter
from langchain.chat_models import ChatOpenAI

# initialize chat model
gpt_3_5_model = ChatOpenAI(model="gpt-3.5-turbo")

# configure ChatChaos
chat_chaos_model = ChatChaos(
    model=gpt_3_5_model,
    enabled=True,
    cron=croniter("0 13 * * 1"),
    duration_mins=15,
    ratio=0.1,
    enable_malformed_json=True,
)

Every Monday at 13:00 (cron), for 15 minutes (duration_mins), 10% (ratio) of chat model requests will return malformed JSON.
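Conceptually, the schedule and the ratio combine into a two-stage gate: a request is eligible for chaos only inside the cron-defined window, and each eligible request fires with probability equal to the ratio. A simplified, stdlib-only sketch of that gating logic (the real implementation derives the window start from croniter; the function names here are illustrative, not ChatChaos’s actual internals):

```python
import random
from datetime import datetime, timedelta


def in_chaos_window(now: datetime, window_start: datetime, duration_mins: int) -> bool:
    """True if `now` falls inside the chaos window.

    `window_start` stands in for the most recent fire time of the cron
    schedule (ChatChaos would compute this with croniter).
    """
    return window_start <= now < window_start + timedelta(minutes=duration_mins)


def should_inject_chaos(now, window_start, duration_mins, ratio, rng=random):
    """Gate one request: inside the window, chaos fires with probability `ratio`."""
    return in_chaos_window(now, window_start, duration_mins) and rng.random() < ratio
```

Passing the random source in explicitly makes the gate easy to unit test with a stubbed generator.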

ChatChaos accepts a BaseChatModel as a parameter and the instance can be used in a chain just like any other chat model.

from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

chat_prompt_1 = ChatPromptTemplate.from_messages([
    ("system", "You are a JSON object generator"),
    ("human", "Return a JSON object with 2 keys: name and age"),
])

# construct chain
chain = chat_prompt_1 | chat_chaos_model | StrOutputParser()

# prompt models
print(chain.invoke({}))

In practice, the enabled parameter should be set based on the application’s environment. Using an environment variable allows for quick toggling without significant code changes.

# configure ChatChaos
chat_chaos_model = ChatChaos(
    ...
    enabled=os.environ["ENV"] in ["prod", "test"],
    ...
)

The Good 😇

Producing valid JSON is often a requirement when interfacing with LLMs. The earlier code example demonstrates how to configure ChatChaos to randomly return different forms of invalid JSON. The current implementation selects between injecting triple backticks (```) into the completion value, removing the closing curly bracket (}), or replacing all double quotes (") with single quotes (').

# Returning JSON (or any code) with triple backticks is a common behavior
# of LLMs
```
{"name": "Andrew", "age": 35}
```

# A JSON string may be truncated due to token limits
{"name": "Andrew", "age": 35

# Every once in a while, a Python dictionary is returned instead
{'name': 'Andrew', 'age': 35}
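A minimal sketch of how these three corruptions could be injected (an illustrative helper, not ChatChaos’s actual code):

```python
import random


def corrupt_json(payload: str, rng=random) -> str:
    """Randomly apply one of the three malformed-JSON corruptions described above."""
    corruption = rng.choice(["backticks", "truncate", "single_quotes"])
    if corruption == "backticks":
        # wrap the completion in a markdown code fence
        return f"```\n{payload}\n```"
    if corruption == "truncate":
        # drop the closing curly bracket, as if truncated at a token limit
        return payload.rstrip().rstrip("}")
    # swap double quotes for single quotes (Python-dict style)
    return payload.replace('"', "'")
```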

There’s no perfect prompt that will guarantee a JSON object will be returned with the keys that an app requires. Instead of retrying on error, an effective approach is to “fix-forward” the response. LangChain describes a method for auto-fixing outputs using the OutputFixingParser.

Depending on the complexity of the required format, more custom parsing logic may be necessary. For example, sometimes keys are missing. On other occasions, the capitalization of the keys is incorrect, which causes errors when attempting to retrieve the key’s value.

# Missing 'age' key
{"name": "Andrew"}

# Capital letter in key
{"Name": "Andrew", "Age": 35}

Correcting malformed JSON is a delicate but worthwhile operation. After implementing exhaustive “fix-forward” logic, an app should rarely have problems processing JSON from LLMs.
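For the failure modes above, much of the fix-forward work can be done deterministically before falling back to an LLM-based repair like the OutputFixingParser. A minimal sketch, assuming top-level JSON objects only:

```python
import json


def fix_forward(raw: str) -> dict:
    """Best-effort deterministic repair of common malformed-JSON completions."""
    text = raw.strip()
    # strip markdown code fences (with or without a "json" language tag)
    if text.startswith("```"):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):].strip()
    # restore missing closing curly brackets (truncated output)
    if text.count("{") > text.count("}"):
        text += "}" * (text.count("{") - text.count("}"))
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Python-dict style output: swap single quotes for double quotes
        return json.loads(text.replace("'", '"'))


def normalize_keys(obj: dict) -> dict:
    """Lowercase top-level keys so 'Name' and 'name' resolve identically."""
    return {k.lower(): v for k, v in obj.items()}
```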

The Bad 👿

In addition to keeping users waiting, latency can cause disruptions in complex, distributed environments. To simulate long response times, ChatChaos can be configured to add a random delay before invoking the LLM.

# initialize chat model
gpt_3_5_model = ChatOpenAI(model="gpt-3.5-turbo")

# configure ChatChaos
chat_chaos_model = ChatChaos(
    model=gpt_3_5_model,
    enabled=True,
    cron=croniter("0 13 * * 1"),
    duration_mins=15,
    ratio=0.1,
    enable_latency=True,
    latency_min_sec=45,
    latency_max_sec=75,
)

Client timeout settings should be verified to ensure that cascading timeout errors are not propagated throughout the system. At the very least, end users should get a responsive UX with feedback (e.g. a loading animation). In use cases where response times are especially long, an asynchronous workflow should be considered instead.
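One way to enforce a client-side bound is a hard timeout around the call, sketched here with asyncio (the fallback message is illustrative):

```python
import asyncio


async def call_llm_with_timeout(llm_call, timeout_sec: float):
    """Wrap an awaitable LLM call (e.g. a chain's ainvoke) with a hard timeout.

    On timeout, the caller gets a clear fallback instead of a hung request
    that cascades through upstream services.
    """
    try:
        return await asyncio.wait_for(llm_call, timeout=timeout_sec)
    except asyncio.TimeoutError:
        # degrade gracefully rather than propagating the stall
        return "The model is taking too long to respond. Please try again."
```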

In the best case, increased latency is simply a blip on the radar. In the worst case… well, let’s not go there.

The Ugly 👹

Hallucinations occur when an LLM responds to a prompt with factually incorrect or irrelevant information. It’s not entirely clear what causes this phenomenon, and it’s not easy to force the behavior either. ChatChaos can be configured to hallucinate (kind of).

# initialize chat model
gpt_3_5_model = ChatOpenAI(model="gpt-3.5-turbo")

# configure ChatChaos
chat_chaos_model = ChatChaos(
    model=gpt_3_5_model,
    enabled=True,
    cron=croniter("0 13 * * 1"),
    duration_mins=15,
    ratio=0.1,
    enable_hallucination=True,
    hallucination_prompt="Write a poem about the Python programming language.",
)

To provide control to the developer, a “hallucination prompt” can be specified, which is appended to the end of the final user message in the prompt. The supplemental text will ensure that an out-of-the-blue response is included in the completion.
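A simplified sketch of that append step, using (role, content) tuples to stand in for LangChain message objects (not the library’s actual internals):

```python
def append_hallucination_prompt(messages, hallucination_prompt):
    """Append the hallucination prompt to the final human message.

    `messages` is a list of (role, content) tuples; the last "human"
    entry gets the supplemental text appended.
    """
    patched = list(messages)
    for i in range(len(patched) - 1, -1, -1):
        role, content = patched[i]
        if role == "human":
            patched[i] = (role, f"{content}\n\n{hallucination_prompt}")
            break
    return patched
```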

It’s not obvious that this feature is useful or that mitigating hallucinations is even possible. So why simulate hallucinations at all? Is it worth the risk? Hallucinations may provoke users to engage with feedback mechanisms built into the application. For example, if a user is given the option to thumbs up (👍) or thumbs down (👎) a response, simulating chaotic outputs may be a method for stress testing the user feedback pipeline and the operational workflows supporting it. In the event of a real hallucination, how fast could you respond?

In the long term, hallucination detection techniques will be developed. Such techniques should be tested as well.

After the Ugly

Of course, ChatChaos can also be configured for total mayhem, and the implementation is easily extensible to other types of chaos.

# initialize chat model
gpt_3_5_model = ChatOpenAI(model="gpt-3.5-turbo")

# configure ChatChaos
chat_chaos_model = ChatChaos(
    model=gpt_3_5_model,
    enabled=True,
    cron=croniter("0 13 * * 1"),
    duration_mins=15,
    ratio=0.1,
    enable_malformed_json=True,
    enable_hallucination=True,
    enable_latency=True,
)

At face value, the abstraction is a silly proof-of-concept. Who in their right mind would intentionally sabotage their users?

Chaos engineering is a novel approach to building resilient systems. Developers should adopt the ideas that work and discard the approaches that don’t. Not all of the ideas or features in ChatChaos are valid. Nonetheless, we should continue to push the boundaries of testing and development for LLM apps because chaos is the only constant.

References

  1. ChatChaos (GitHub)
