Dynamic Failover and Load Balancing LLMs With LangChain

Andrew Nguonly
7 min read · Nov 10, 2023


Man Down! 🪖

Earlier this week, OpenAI suffered a major outage across ChatGPT and its APIs. The services experienced periodic downtime caused by unusual traffic patterns resembling a DDoS attack. The multi-day incident began on November 7, 2023, and ended two days later on November 9.

https://status.openai.com/

It’s safe to assume that many apps and services depending on OpenAI also experienced intermittent disruptions. As the incident was unfolding, I came across a post (tweet) from LangChain’s official X (Twitter) account that piqued my interest.

Fallback or Failover

LangChain provides a feature called fallbacks, which enables developers to define “fallback” LLMs that would be invoked in a chain in the event of an error with the primary LLM. The behavior of the feature amounts to an if-statement. If an error occurs when invoking the primary LLM, try invoking the “fallback” LLM. The mechanism is simple.
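
A minimal sketch of the feature, using the same two OpenAI models that appear later in this article (any other chat model supported by LangChain could stand in for the fallback):

from langchain.chat_models import ChatOpenAI

primary_llm = ChatOpenAI(model="gpt-4")
fallback_llm = ChatOpenAI(model="gpt-3.5-turbo")

# if invoking gpt-4 raises an error, the same input is retried against gpt-3.5-turbo
llm_with_fallback = primary_llm.with_fallbacks([fallback_llm])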

However, I began to consider whether or not this mitigation step was sufficient for preventing downtime in mission-critical services. My work at Netflix introduced me to the importance of “failover.” Netflix’s AWS region failover architecture is well documented. My team’s work implementing failover for Kafka clusters allowed our deployments to be resilient against traffic spikes and other unexpected outages. The principle of failover is key to Netflix’s operational excellence.

“Fallback” and “failover” are just terms. This article describes experimental approaches to mitigating errors in LLM providers with LangChain. It’s motivated by my own experiences operating Kafka at Netflix and my growing interest in LLM operations.

Reinventing the BaseChatModel

Dynamic failover and load balancing are two techniques I explored, each with separate goals. The core implementation required subclassing LangChain’s BaseChatModel to create an abstraction that encompasses multiple chat models. It’s not clear whether or not this pattern is aligned with the goals of LangChain, but the implementation of fallbacks is designed similarly.

Dynamic Failover 🚧

Fallbacks are a static approach to error handling in LLMs with LangChain. To modify the fallbacks, the code needs to be updated, tested, packaged, and redeployed. At Netflix, failovers are initiated on-demand with no code changes.

The ChatDynamic class demonstrates the ability to fail over to a different LLM at runtime. It relies on an environment variable as the configuration for selecting which LLM to invoke.

# imports assume the langchain package layout available in late 2023 (langchain 0.0.3xx)
import logging
import os
from typing import Any, Dict, List, Optional

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.chat_models.base import BaseChatModel
from langchain.pydantic_v1 import root_validator
from langchain.schema import BaseMessage, ChatResult

logger = logging.getLogger(__name__)


class ChatDynamic(BaseChatModel):
    """Chat model abstraction that dynamically selects a model at runtime."""
    models: Dict[str, BaseChatModel]
    default_model: str

    @root_validator(pre=True)
    def validate_attrs(cls, values: Dict[str, Any]) -> Dict[str, Any]:
        """Validate class attributes."""
        models = values.get("models", {})
        default_model = values.get("default_model", None)

        if not models or len(models) == 0:
            raise ValueError(
                "The 'models' attribute must have a size greater than 0."
            )

        if default_model not in models:
            raise ValueError(
                f"The 'default_model' attribute '{default_model}' must exist "
                "in the 'models' dict."
            )

        current_model_id = os.environ.get("DYNAMIC_CHAT_MODEL_ID")
        if current_model_id is None:
            logger.warning(
                "WARNING! Environment variable DYNAMIC_CHAT_MODEL_ID is not "
                f"set. Model ID '{default_model}' will be used."
            )
        elif current_model_id not in models:
            raise ValueError(
                f"DYNAMIC_CHAT_MODEL_ID '{current_model_id}' must exist in "
                "the 'models' dict."
            )

        return values

    @property
    def _llm_type(self) -> str:
        """Return type of chat model."""
        return "dynamic-chat"

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        """Select the chat model from the environment variable configuration."""
        current_model_id = os.environ.get(
            "DYNAMIC_CHAT_MODEL_ID",
            self.default_model,
        )
        current_model = self.models[current_model_id]

        return current_model._generate(
            messages=messages,
            stop=stop,
            run_manager=run_manager,
            **kwargs,
        )

The ChatDynamic class accepts a map of BaseChatModels, and the failover logic lives in the _generate() method. When _generate() is called (during chain invocation), the environment variable DYNAMIC_CHAT_MODEL_ID is read and the corresponding LLM is invoked. Initializing a ChatDynamic instance requires specifying a default chat model. Chain construction works the same as with any other chat model.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# initialize chat models
gpt_4_model = ChatOpenAI(model="gpt-4")
gpt_3_5_model = ChatOpenAI(model="gpt-3.5-turbo")

# specify all models that can be selected in the ChatDynamic instance
chat_dynamic_model = ChatDynamic(
    models={
        "gpt-4": gpt_4_model,
        "gpt-3_5": gpt_3_5_model,
    },
    default_model="gpt-4",
)

# create chat prompt
chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert Python programmer"),
    ("human", "I'm learning how to program in Python"),
    ("ai", "Sure! I can help you write a Python app"),
    ("human", "Write a 'hello world' app in Python"),
])

# construct chain
chain = chat_prompt | chat_dynamic_model | StrOutputParser()

# prompt models
print(chain.invoke({}))

To initiate a failover, simply update the DYNAMIC_CHAT_MODEL_ID environment variable.
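
Because _generate() reads the variable on every invocation, no redeploy or restart is required. A minimal in-process illustration (in practice, the variable would be changed through deployment or runtime configuration rather than in code):

import os

# reroute subsequent invocations to the "gpt-3_5" entry in the models dict
os.environ["DYNAMIC_CHAT_MODEL_ID"] = "gpt-3_5"
print(chain.invoke({}))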

Why Failover? 🤔

Dynamic failover provides finer control and more flexibility than a static fallback approach. LLM degradation due to increased API latency or hallucinations would not be mitigated by the fallback configuration. In these cases, it’s preferable to initiate a failover to another LLM and maintain this state until the primary LLM has been restored. Even in the event of intermittent errors, it may be preferable to failover to reduce the risk of overloading the LLM provider (i.e. a thundering herd).

Note: This failover method does not work for stateful APIs such as OpenAI’s Assistants API. Stateful systems such as Kafka require much more instrumentation to complete a failover as described.

Load Balancing ⚖️

LangChain’s documentation for fallbacks suggests that they can be used to handle rate-limiting errors. The examples show mocking of OpenAI’s RateLimitError. Again, fallbacks are simple, but managing rate limits is complex and may warrant a slightly more sophisticated approach.

An alternative technique is to load balance requests across multiple LLM providers. The ChatLoadBalance class demonstrates the ability to distribute requests across multiple LLM providers using different load balancing algorithms.

import random  # in addition to the imports shown for ChatDynamic

LOAD_BALANCER_TYPES = [0, 1, 2]


class ChatLoadBalance(BaseChatModel):
    """Chat model abstraction that load balances model selection at runtime.

    Load balancer types:
    0 - random
    1 - round robin
    2 - least rate limited
    """
    models: List[BaseChatModel]
    load_balance_type: int

    # round robin state
    last_used_model: int = 0

    # least rate limited state
    rate_limit_state: Dict[str, Any] = {}

    @root_validator(pre=True)
    def validate_attrs(cls, values: Dict[str, Any]) -> Dict[str, Any]:
        """Validate class attributes."""
        models = values.get("models", [])
        load_balance_type = values.get("load_balance_type", 0)

        if not models or len(models) == 0:
            raise ValueError(
                "The 'models' attribute must have a size greater than 0."
            )

        if load_balance_type not in LOAD_BALANCER_TYPES:
            raise ValueError(
                "The 'load_balance_type' attribute must be in "
                f"{LOAD_BALANCER_TYPES}"
            )

        return values

    @property
    def _llm_type(self) -> str:
        """Return type of chat model."""
        return "load-balance-chat"

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        """Select a chat model based on the configured load balancer type."""
        # default to first model
        current_model_idx = 0

        # select model based on load balancer type
        if self.load_balance_type == 0:
            # random
            current_model_idx = random.randint(0, len(self.models) - 1)

        elif self.load_balance_type == 1:
            # round robin
            if len(self.models) > 1:
                current_model_idx = (self.last_used_model + 1) % len(self.models)
                self.last_used_model = current_model_idx

        elif self.load_balance_type == 2:
            # TODO: least rate limited
            # See https://github.com/langchain-ai/langchain/issues/9601
            raise NotImplementedError(
                "Least rate limited load balancer is not implemented."
            )

        current_model = self.models[current_model_idx]
        logger.info(f"Selected chat model '{current_model}'")

        return current_model._generate(
            messages=messages,
            stop=stop,
            run_manager=run_manager,
            **kwargs,
        )

The ChatLoadBalance class accepts a list of BaseChatModels and an integer to set the desired load balancing algorithm. 0 denotes random selection (for testing), 1 denotes round-robin selection, and 2 denotes least-rate-limited selection (explained below). Similar to the ChatDynamic class, the load balancing implementation lives in the _generate() method.

# initialize chat models
gpt_4_model = ChatOpenAI(model="gpt-4")
gpt_3_5_model = ChatOpenAI(model="gpt-3.5-turbo")

# specify all models that can be selected in the ChatLoadBalance instance
chat_load_balance_model = ChatLoadBalance(
    models=[gpt_4_model, gpt_3_5_model],
    load_balance_type=1,
)

# create chat prompt
chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert Python programmer"),
    ("human", "I'm learning how to program in Python"),
    ("ai", "Sure! I can help you write a Python app"),
    ("human", "Write a 'hello world' app in Python. Attempt {index}"),
])

# construct chain
chain = chat_prompt | chat_load_balance_model | StrOutputParser()

# prompt models
for index in range(0, 6):
    print(chain.invoke({"index": index}))

Least Rate Limited 📉

The least-rate-limited algorithm is meant to select a model (an LLM provider) whose rate limit is least consumed. In other words, select the model with the most remaining rate limit capacity. The implementation of this algorithm is incomplete (see TODO). At the moment, LangChain’s abstractions do not provide rate limit metadata from OpenAI API response headers (Issue #9601). Until this issue is resolved, the algorithm remains unimplemented.

Regardless, it’s easy to imagine how the rate limit state might be modeled. Modification of the state requires further investigation.
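
As a thought experiment, here is a rough sketch of how that state might look, loosely mirroring the rate limit metadata OpenAI returns in response headers (remaining requests and tokens). The RateLimitState class and select_least_rate_limited() helper are hypothetical and not part of LangChain:

from dataclasses import dataclass
from typing import Dict


@dataclass
class RateLimitState:
    """Hypothetical per-model rate limit state."""
    remaining_requests: int = 0
    remaining_tokens: int = 0


def select_least_rate_limited(states: Dict[str, RateLimitState]) -> str:
    """Pick the model ID with the most remaining request capacity (a naive heuristic)."""
    return max(states, key=lambda model_id: states[model_id].remaining_requests)


# example usage with made-up numbers
states = {
    "gpt-4": RateLimitState(remaining_requests=12, remaining_tokens=4000),
    "gpt-3_5": RateLimitState(remaining_requests=150, remaining_tokens=90000),
}
print(select_least_rate_limited(states))  # -> "gpt-3_5"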

Sidebars 🍫

Disclaimers

The code examples above were only tested with OpenAI’s GPT-4 and GPT-3.5-turbo. These are the only LLMs I have access to at the moment 😅. Assuming the interfaces for the other BaseChatModels are consistent with ChatOpenAI, substituting any non-OpenAI chat model should work as expected.

After diving into the implementation of LangChain, I’m not 100% confident that the design of the ChatDynamic and ChatLoadBalance classes is aligned with how the library was designed or how it will evolve in the future. Take it for what it is.

Do Not Fallback For Long Inputs

I don’t necessarily agree that fallbacks should be implemented for long inputs (or for better models). In a previous article, I proposed dynamically computing the value of max_tokens to avoid token limits. The implementation is more complex. However, my opinion is that more optimizations should occur before invoking the LLM. This scenario should be addressed with a pre-processing feature and not with an error handling workflow. To use a crude (and maybe irrelevant) example, prompt injection should be protected against before invoking the LLM, not after an error occurs.
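
For illustration, a rough sketch of that pre-processing idea, assuming tiktoken is available and an 8,192-token context window for the target model (both are assumptions, not values taken from LangChain):

import tiktoken

CONTEXT_WINDOW = 8192  # assumed context window for the target model


def compute_max_tokens(prompt: str, model: str = "gpt-4", buffer: int = 64) -> int:
    """Estimate how many completion tokens remain after accounting for the prompt."""
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = len(encoding.encode(prompt))
    return max(CONTEXT_WINDOW - prompt_tokens - buffer, 0)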

Man Up! 🧑‍🚀

https://status.openai.com/

And we’re back!

LangChain provides an extraordinary library for interfacing with LLMs. My intention is not to dismiss fallbacks, but to push our collective thinking on how we build resilient systems for LLM apps. In the future, LLMs will be embedded in products and features that power our daily lives. As developers, it’s our responsibility to mitigate errors and reduce downtime. In the early stages of building LLM apps, operational excellence will not only be crucial, it will be a differentiator as well.

Source Code

  1. ChatDynamic
