Implement Content Moderation Guardrails using Javelin.

sharathr · Published in Javelin Blog · Feb 23, 2024 · 5 min read

In the rapidly evolving digital landscape, integrating large language models (LLMs) into online platforms marks a significant leap forward in creating, sharing, and consuming content. However, this advancement brings new challenges in maintaining a secure and positive user environment. Addressing these challenges requires a nuanced application of Trust & Safety principles tailored to the unique dynamics of LLMs.

Trust & Safety for LLM Integration

User Protection with AI: Integrating LLMs calls for user-protection strategies of their own. Robust security measures tailored for AI keep interactions with LLMs safe, letting users benefit from AI advancements without undue risk.

Upholding Community Standards: The autonomous nature of LLMs challenges platforms to maintain respectful and inclusive communities. By adapting community standards to address AI-generated content, platforms can mitigate issues like hate speech and misinformation, fostering an environment where diversity and respect thrive.

Legal Compliance in AI Use: As LLM integration complicates legal compliance, platforms must ensure their AI operations adhere to data protection laws and cybercrime regulations. This commitment safeguards users and bolsters trust in AI-enhanced environments.

Bring-your-own Moderation Engine: As the ecosystem of AI-enabled tools grows, companies must have the agility to choose components in their stack that they can easily replace or upgrade. This ensures that newer tools can seamlessly be introduced or upgraded without changing applications.

Using Javelin Processors for Content Moderation

With an internal Llama-Guard-based input-output safeguard model, fine-tuned on a comprehensive safety risk taxonomy, Javelin Content Moderation Processors can adapt to various taxonomies to meet the unique needs of different platforms. This adaptability allows for real-time, efficient moderation that keeps pace with the ever-evolving online landscape.

Enabling Content Moderation is as simple as including the Content Moderation Guardrail Processor in the Request/Response chain. In the example below, we have added this Guardrail to the Response chain, but it's just as easy to drag it into the Request chain, and it will execute for LLM Requests as well.

Figure 1: Adding Content Moderation to Response Chain
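
For readers who prefer configuration to the visual chain editor, the same idea can be sketched as data: a route whose response chain includes the content-moderation guardrail. The endpoint and field names below are illustrative assumptions, not Javelin's actual schema.

# Illustrative sketch only: the admin endpoint and field names are assumptions,
# not Javelin's documented configuration schema.
import requests

route_config = {
    "name": "llm-route",
    "response_chain": [
        {"processor": "content_moderation", "enabled": True},
    ],
    # Moving the same entry into "request_chain" would run the guardrail
    # on LLM requests instead of (or in addition to) responses.
    "request_chain": [],
}

# Hypothetical admin call to update the route's processor chain.
resp = requests.put(
    "https://javelin.example.com/admin/routes/llm-route",
    json=route_config,
    timeout=10,
)
resp.raise_for_status()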

By enabling Processor telemetry, we can look at detailed stats about the processor's execution, including how long it took to run and its internal logging.

Supported Taxonomies
Javelin supports a wide range of taxonomies, from violence & hate to criminal planning, each tailored to address specific risks associated with digital interactions.
Violence & Hate encompasses statements that encourage or could help people plan or engage in violence.

Sexual Content encompasses statements encouraging someone to engage in specific sex acts. Similarly, sexually explicit (i.e., erotic) statements would also be considered inappropriate under this category.

Guns & Illegal Weapons encompasses statements that specifically encourage, condone, or could help people plan or execute specific crimes involving the illegal acquisition, creation, or use of guns or illegal weapons (e.g., explosives, biological agents, or chemical weapons), including instructions on how to create or use them.

Regulated or Controlled Substances encompasses statements that specifically encourage or could help people to illegally produce, transfer, or consume regulated or controlled substances like illegal drugs, tobacco, alcohol, or cannabis.

Suicide & Self-Harm encompasses statements that encourage, condone, or enable people to harm themselves (e.g., by providing instructions or information on methods of self-harm). When people express or imply an intent to harm themselves, any response other than one that provides links to appropriate health resources would be considered inappropriate (even if entirely supportive).

Criminal Planning encompasses miscellaneous statements that encourage, condone, or could help people plan or execute specific criminal activities, like arson, kidnapping, or theft. Items should only be considered harmful under this category when they cannot be read as violating any other harm types above (e.g., statements that encourage violence should be considered violating under Violence & Hate rather than this category).

These taxonomies help identify and mitigate potential legal and policy risks, ensuring platforms maintain a safe and inclusive user environment.
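
To make the taxonomy concrete, here is a small sketch of how an application might act on a moderation verdict labeled with these categories. The category keys and verdict shape are assumptions for illustration, not Javelin's actual response format.

# Illustrative only: category keys and verdict shape are assumed, not
# Javelin's actual moderation response format.
BLOCKED_CATEGORIES = {
    "violence_and_hate",
    "sexual_content",
    "guns_and_illegal_weapons",
    "regulated_or_controlled_substances",
    "suicide_and_self_harm",
    "criminal_planning",
}

def should_block(verdict: dict) -> bool:
    """Return True if any flagged category falls under the taxonomy above."""
    flagged = {c for c, hit in verdict.get("categories", {}).items() if hit}
    return bool(flagged & BLOCKED_CATEGORIES)

# Example verdict a moderation processor might attach to a response.
example = {"categories": {"criminal_planning": True, "sexual_content": False}}
print(should_block(example))  # True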

Integrating External Content Moderation Tools

Moreover, Javelin processors facilitate seamless integration with external content moderation services from leading providers like OpenAI Content Moderation, Azure AI Safety, Google Trust & Safety, and LakeraAI. This capability enables platforms to leverage a comprehensive suite of tools for detecting and preventing harmful content, ensuring a multi-layered approach to Trust & Safety.

Below, we have enabled LakeraAI for Content Moderation. Javelin provides detailed telemetry on the processor's execution, latency, and response metadata, which can be used to enforce controls. At the time of writing, Lakera’s endpoint only supports taxonomies for hate and sexual content, and you get back scores for both.

Figure 2: Content Moderation through external services
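
To illustrate the kind of check such an external processor performs on your behalf, here is a standalone call to OpenAI's moderation endpoint, one of the providers listed above. This is the provider's public API invoked directly for demonstration; it is not Javelin's integration code.

# Standalone call to OpenAI's moderation endpoint, shown only to illustrate
# what an external content-moderation check returns; Javelin's processor
# wraps checks like this inside the request/response chain.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/moderations",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"input": "Text to be screened for harmful content."},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["results"][0]
print(result["flagged"])          # overall True/False verdict
print(result["category_scores"])  # per-category scores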

Measuring Latencies

It’s essential to measure processor latency so that integrations with external services stay within acceptable thresholds and guardrails do not add meaningful latency to LLM calls.

In the example above, the built-in Content Moderation Processor took about 50ms to execute versus 98ms for the integration:

Latency Comparison for Content Moderation Guardrails

By continually measuring these latencies, we can provide a consistent 99th-percentile latency for processor execution that companies can use to model traffic on their egress services.

// Internal Javelin Processor 
"response.chain.llama_guard_processor_20240223083757.290590798": {
"duration": "50.359354ms",

// Integration with external service
"response.chain.lakera_processor_2024223081337.436126930": {
"duration": "98.536531ms",

Javelin also provides detailed latency telemetry for each processor and for the chain as a whole, so you can verify that the guardrails and processors in the chain always stay within acceptable latencies.

Bounding Latencies
Javelin allows setting custom latency bounds on processor execution. You configure a per-processor latency upper bound in milliseconds along with configurable actions. When a processor's execution latency exceeds this bound, the processor is automatically terminated and any asynchronous responses are abandoned. Javelin also lets you choose what happens when this occurs: for example, you can reject the entire LLM request when an important guardrail takes too long to execute, or send an alert to a security dashboard. You can even bypass specific processors that exceed the bound, which acts as a latency-based circuit breaker when processing is blocked.
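
The effect of such a bound can be sketched client-side: run a processor against a deadline and apply a configured action when it exceeds its budget. This is a generic illustration of the circuit-breaker idea, not Javelin's implementation or configuration syntax.

# Generic sketch of a latency-based circuit breaker around a guardrail call;
# it illustrates the idea, not Javelin's implementation or configuration syntax.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

LATENCY_BOUND_MS = 75          # assumed per-processor upper bound
ON_TIMEOUT = "reject_request"  # or "alert" / "bypass_processor"

def run_with_bound(processor, payload):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(processor, payload)
    try:
        return future.result(timeout=LATENCY_BOUND_MS / 1000)
    except TimeoutError:
        # The processor exceeded its budget; apply the configured action.
        if ON_TIMEOUT == "reject_request":
            raise RuntimeError("guardrail exceeded latency bound; rejecting request")
        if ON_TIMEOUT == "alert":
            print("ALERT: guardrail exceeded latency bound")
        return payload  # bypass: pass the content through unmodified
    finally:
        pool.shutdown(wait=False)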

Elevate your platform’s safety.

By integrating Javelin Processors, platforms can navigate the complexities of moderating vast amounts of user-generated content, ensuring compliance with community standards and legal requirements while fostering an environment that values diversity and inclusion.

Now, we invite you to be part of this transformative journey. Whether you’re a platform developer seeking to enhance your site’s Trust & Safety protocols, a policy maker looking to understand the potential of AI in content moderation, or an advocate for safer online spaces, Javelin offers the tools and technology to make a tangible difference.

Let’s connect and explore the possibilities together!
