Optimizing Javelin Guardrails

sharathr · Published in Javelin Blog · 3 min read · Apr 20, 2024

As a real-time platform sitting in the critical path of applications, we continually look for opportunities to improve the throughput and latency of the guardrails that process LLM traffic.

Taking a deeper look

One obvious disadvantage of stacking guardrails so that they execute sequentially, one after another, is that their latencies are cumulative. If you want to run several checks on every request, this sequential execution flow quickly becomes unusable for real-time applications.

To overcome this, we introduced parallel execution, which shows a dramatic improvement because multiple guardrails run concurrently (as defined through configuration). While it may sound simple, the underlying engineering is far more complex: it involves recursively constructing the chain, spawning lightweight threads for each batch of guardrail processors, managing multi-threaded state while coordinating responses across multiple processors, and sequencing follow-on actions that feed the output of one batched execution into the next.
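To make the mechanics concrete, here is a minimal sketch in Go of this kind of fan-out execution. It is illustrative only, not Javelin's actual implementation: the Guardrail interface and the runBatch and runChain functions are hypothetical names. Each guardrail in a batch runs on its own goroutine, and the batch completes once every verdict is collected, so a batch costs roughly as much as its slowest member:

```go
package guardrails

import (
	"context"
	"fmt"
	"sync"
)

// Guardrail is a hypothetical processor interface: it inspects a
// prompt and returns an error if the request should be rejected.
type Guardrail interface {
	Name() string
	Check(ctx context.Context, prompt string) error
}

// runBatch executes one batch of guardrails in parallel and returns
// the first failure, if any.
func runBatch(ctx context.Context, batch []Guardrail, prompt string) error {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		first error
	)
	for _, g := range batch {
		wg.Add(1)
		go func(g Guardrail) {
			defer wg.Done()
			if err := g.Check(ctx, prompt); err != nil {
				mu.Lock()
				if first == nil {
					first = fmt.Errorf("%s: %w", g.Name(), err)
				}
				mu.Unlock()
			}
		}(g)
	}
	wg.Wait()
	return first
}

// runChain sequences batches: the next batch starts only after the
// previous one completes, which preserves ordering dependencies and
// lets one batch's output feed the next.
func runChain(ctx context.Context, chain [][]Guardrail, prompt string) error {
	for _, batch := range chain {
		if err := runBatch(ctx, batch, prompt); err != nil {
			return err
		}
	}
	return nil
}
```

Anything that must consume another guardrail's output simply goes in a later batch; everything else can share a batch and run concurrently.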

We then went a step further and introduced the notion of blocking vs. non-blocking guardrails to optimize latency. While this significantly complicated the underlying execution handling, the benefit was obvious and well worth it.
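Continuing the sketch above (the field and JSON names are hypothetical, not Javelin's actual configuration schema), the per-processor configuration might carry a batch number and a blocking flag:

```go
// ProcessorConfig is an illustrative configuration shape: processors
// sharing a batch number run in parallel, and batches run in order.
type ProcessorConfig struct {
	Name      string `json:"name"`       // guardrail identifier
	Batch     int    `json:"batch"`      // same batch number: run in parallel
	Blocking  bool   `json:"blocking"`   // false: fire-and-forget, off the critical path
	TimeoutMS int    `json:"timeout_ms"` // per-processor latency budget
}

// ChainConfig lists the processors that make up one guardrail chain.
type ChainConfig struct {
	Processors []ProcessorConfig `json:"processors"`
}
```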

We are delighted with the final output, which looks like this:

Figure: guardrail processors, chained execution

Measuring the results

Previously, when stacked back to back, the system guardrails took about 18.6ms. The reconfigured chain now completes in 6.3ms at the 99th percentile, a roughly 66% reduction in latency.

This improvement is even more dramatic when stacking guardrails that make external requests taking tens of milliseconds. For example, we had guardrails that looked up HTTP service endpoints or called microservices, routinely taking 100ms+ each. Executed in parallel, the chain's overall latency most often does not exceed that of its slowest processor.
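To see that bound with the earlier sketch, here is a hypothetical timing demo (same package, plus the "time" import; the stub names and delays are made up): three stub guardrails that each simulate a remote call of 100ms, 120ms, and 150ms finish as one parallel batch in roughly 150ms, not the 370ms a sequential stack would take.

```go
// slowGuardrail is a stub used only to illustrate timing.
type slowGuardrail struct {
	name  string
	delay time.Duration
}

func (s slowGuardrail) Name() string { return s.name }

func (s slowGuardrail) Check(ctx context.Context, prompt string) error {
	select {
	case <-time.After(s.delay): // simulate a remote call
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func demoBatchLatency() {
	batch := []Guardrail{
		slowGuardrail{"dlp", 100 * time.Millisecond},
		slowGuardrail{"prompt-injection", 120 * time.Millisecond},
		slowGuardrail{"endpoint-lookup", 150 * time.Millisecond},
	}
	start := time.Now()
	_ = runBatch(context.Background(), batch, "hello")
	fmt.Println(time.Since(start)) // ~150ms: the slowest processor wins
}
```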

Next, we took this one step further with the blocking and non-blocking processor semantics described above. We found that many guardrails only require fire-and-forget semantics (writing custom logging, triggering external systems, offline processing, etc.). Marked as non-blocking, they no longer contribute to the latency of the guardrail chain's critical path.
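In the earlier sketch, the non-blocking path can be as simple as detaching the processor onto its own goroutine and returning immediately (same package, plus the "log" import; dispatch and the five-second side-effect timeout are illustrative assumptions, not Javelin's actual behavior):

```go
// dispatch runs one guardrail according to its blocking flag.
// Blocking guardrails gate the request; non-blocking ones detach onto
// their own goroutine so slow side effects (logging, alerting,
// webhooks) never extend the critical path.
func dispatch(ctx context.Context, g Guardrail, blocking bool, prompt string) error {
	if blocking {
		return g.Check(ctx, prompt) // caller waits for the verdict
	}
	go func() {
		// Detach from the request context: the side effect may
		// legitimately outlive the request that triggered it.
		bg, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := g.Check(bg, prompt); err != nil {
			log.Printf("non-blocking guardrail %q failed: %v", g.Name(), err)
		}
	}()
	return nil // fire-and-forget: the request proceeds immediately
}
```

Because the detached goroutine carries its own context rather than the request's, a webhook or alert can still complete after the LLM response has already been returned to the caller.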

The chain in the figure above routinely took about 450ms; with the improvements in place, we see latency drop to under 180ms. This 60% improvement creates a path to introduce critical security, business, and organizational checks into the path of LLM requests and responses in a way that minimizes any impact on application traffic. It also paves the way for non-blocking triggers like webhooks, security event alerting, and many more essential services that support enterprise adoption of LLMs. We are very excited about the opportunities this opens up.

Guardrails Hub

One of Javelin's most requested features is our Guardrails Hub, which makes it easy for enterprises to browse the available guardrails and drag and drop new ones into their egress flows. We are releasing the hub with several out-of-the-box guardrails, and with these performance improvements rolled out, we can now confidently enable third-party integrations that process messages in real time with low latency and high throughput.

We continue to work toward building a robust ecosystem of integrations for enterprises. To this end, we have integrated with several security systems and tools, including Lakera (for prompt injection detection) and GCP, Nightfall, and Protecto DLP (for sensitive data detection), and we have a robust integration roadmap ahead.

Looking for more?

Have feedback? We’d love to hear from you.
