Secure your RAG

sharathr · Javelin Blog · Nov 2, 2023

With the increasing shift towards digitization, businesses and organizations constantly explore new ways to improve their operations and customer experience. A leading approach in this space involves the development of advanced customer support bots, internal knowledge graphs, and Q&A systems.

Retrieval Augmented Generation (RAG) applications are becoming more prevalent to make these more efficient. RAG’s ability to blend pre-trained models with proprietary data is a game-changer. However, ensuring that these applications are used safely without compromising data integrity is crucial. In this blog, we discuss the critical elements of RAG and introduce a way to manage RAG workflows with security and governance.

RAG: An Overview

Retrieval Augmented Generation, commonly known as RAG, is a cutting-edge architecture in artificial intelligence that blends the capabilities of large-scale neural language models with information retrieval systems. Instead of generating responses based solely on pre-trained knowledge, RAG retrieves relevant documents or data fragments from vast datasets. Then, it utilizes these fragments as context for the subsequent text generation. This allows RAG to tap into specific, up-to-date, or domain-specific knowledge, making it particularly effective for applications such as chatbots and Q&A systems that require real-time information access and customization based on external data.

Security and Controls

Javelin LLM Gateway is an Enterprise-grade LLM (Large Language Model) Gateway that enables enterprises to apply policy controls, adhere to governance measures, and enforce comprehensive security guardrails, including data leak prevention, to ensure safe and compliant model use. Javelin is tailored to empower Operational teams, enabling them to manage LLM-enabled applications and oversee their model access effectively. This blog shows how we can effectively harden RAG workflows with security, governance, and privacy controls.

Setting up a Secure RAG Workflow

Setting up a Retrieval Augmented Generation (RAG) system involves combining a retrieval mechanism with a generation model. Below is a high-level overview of the process:

1. Data Collection and Processing:

  • Document Dataset: Gather a comprehensive dataset, typically consisting of documents, articles, or any information the model might need to refer to. This is often internal corporate documents, knowledge bases, policies, contract documents, etc.
  • Pre-process Data: Clean and format the data to ensure it’s consistent for indexing and retrieval. This may involve deciding what data you want to index and what data you want to leave out (e.g., sensitive or confidential data).

Security Tip: This is the phase where it is generally easy for sensitive or secure documents to make their way into your RAG Document Dataset inadvertently. Two key issues need to be managed:

  • Sensitive Sources — Before ingesting large drives, directories, or mounts, ensure you are not mixing sensitive data into RAG systems.
  • Data Provenance — Make sure you can track where every document comes from; uncertainty around sources often introduces vulnerabilities or inaccuracies. A minimal filtering and provenance sketch follows below.
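As a minimal illustration of both points, the sketch below walks a document directory, skips files matching a simple sensitivity pattern, and records the provenance of everything admitted into the RAG dataset. The pattern, paths, and metadata fields are hypothetical placeholders; real pipelines typically lean on dedicated classification and DLP tooling.

import hashlib
import re
from pathlib import Path

# Hypothetical markers for documents that should never enter the RAG dataset.
SENSITIVE = re.compile(r"CONFIDENTIAL|INTERNAL ONLY|\d{3}-\d{2}-\d{4}", re.IGNORECASE)

def collect_documents(source_dir: str):
    """Yield (text, provenance) pairs for documents that pass a basic sensitivity screen."""
    for path in Path(source_dir).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        if SENSITIVE.search(text):
            # Sensitive Sources: keep flagged documents out of the dataset entirely.
            continue
        # Data Provenance: record where each admitted document came from.
        yield text, {
            "source_path": str(path),
            "sha256": hashlib.sha256(text.encode()).hexdigest(),
        }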

2. Create an Embedding Space:

  • Encode Documents: Use an encoder (often a transformer-based model) to convert each document in the dataset into dense vector representations. While setting up a simple workflow for a handful of documents is easy, the task becomes more complex as the data volume grows.
  • There are a variety of embedding models, each tailored for specific data types. For instance, long-form documents with long sentences, like literature, require different embedding strategies than short sentences like chat, and software code requires different considerations again. Chunking the document data into smaller pieces follows similar content-type considerations (a minimal chunking sketch follows below).
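As an illustration, here is a minimal chunker that applies different chunk sizes for prose, chat, and code. The sizes and the simple character-based splitting are hypothetical; production pipelines usually use purpose-built text splitters that respect sentence and token boundaries.

def chunk_text(text: str, content_type: str = "prose") -> list[str]:
    """Split text into overlapping character chunks sized for its content type (illustrative sizes)."""
    # Long-form prose tolerates larger chunks; chat and code usually need smaller ones.
    sizes = {"prose": 1000, "chat": 200, "code": 400}
    size = sizes.get(content_type, 500)
    overlap = size // 10  # small overlap so context is not lost at chunk boundaries
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks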

Secure Your Model Access Using Javelin

Javelin is an LLM Gateway well suited to providing a single model access point for your RAG workflows. Routes to various models can be securely provisioned on the gateway, so your RAG application does not have to understand, maintain, or keep track of individual models. This central model management also makes it easy to adopt newer models quickly as they become available or to switch embedding models across applications.

It’s easy to install the Javelin Python SDK:

$ pip install javelin_sdk

Now, you can easily create one or more routes. Each route represents an endpoint that can be centrally managed with security and governance guardrails:

from javelin_sdk import (
    JavelinClient,
    Route,
)
# create the Javelin client, replace with your Javelin URL
client = JavelinClient(base_url="http://localhost:8000")
# define the route
route_data = {
    "name": "doc_embeddings",
    "model": {
        "name": "text-embedding-ada-002",
        "provider": "openai",
    },
    "config": {
        "retries": 3,
        "rate_limit": 5,
    },
}
route = Route.parse_obj(route_data)
# create the route,
# for async use `await client.acreate_route(route)`
client.create_route(route)

Once you have provisioned one or more routes, replace the provider model URLs in your apps and have them point at Javelin instead.

Love Langchain? We love it, too! Here is an easy way (~1 line of code) to do this from your Langchain-enabled apps:

from langchain.embeddings import JavelinAIGatewayEmbeddings

embeddings = JavelinAIGatewayEmbeddings(
    gateway_uri="http://localhost:8000",  # replace with your Javelin endpoint URL
    route="doc_embeddings",
)

# now you can use this in your various apps;
# switching to a new model is as simple as selecting a different 'route'
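Once the embeddings object points at a Javelin route, it exposes the standard LangChain embeddings interface, so the rest of your code stays the same. The documents and query below are just placeholders:

# embed a batch of documents and a single query through the Javelin route
doc_vectors = embeddings.embed_documents([
    "Expense policy: travel must be pre-approved by a manager.",
    "Security policy: rotate API keys every 90 days.",
])
query_vector = embeddings.embed_query("How often should API keys be rotated?")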

3. Store Embeddings:

Store these vector representations in a vector database, allowing for efficient nearest-neighbor lookups.
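As a minimal sketch of this step, the snippet below indexes embedding vectors with FAISS and runs a nearest-neighbor lookup. The random vectors are stand-ins for real document embeddings, and many teams will use a managed vector database instead.

import faiss
import numpy as np

dim = 1536  # dimensionality of text-embedding-ada-002 vectors
doc_vectors = np.random.rand(100, dim).astype("float32")  # stand-ins for real embeddings

# Normalizing and using inner product gives cosine-similarity search.
faiss.normalize_L2(doc_vectors)
index = faiss.IndexFlatIP(dim)
index.add(doc_vectors)

# Retrieve the 5 nearest documents for a query vector.
query_vector = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vector)
scores, doc_ids = index.search(query_vector, 5)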

Security Tip: Three key security issues need to be carefully managed in this phase:

  • Sensitive Information Leaks — Original data can sometimes be reverse-engineered or inferred from embeddings. So always start with non-sensitive data and ensure embeddings don’t leak.
  • Access Control — ensure that only authorized personnel can access and generate embeddings, especially when dealing with sensitive or proprietary data.
  • Membership Inference Attacks — given a specific embedding, attackers may be able to infer that a specific data point was part of the document set.

Secure Model Access With Javelin — with Javelin, your model access is always secure. Since all LLM access goes through Javelin, you have complete visibility of the applications using various models, how much they use them, and what data is being passed to models.

Budget Guardrails: You can even set up cost guardrails to control budgets for large document RAG jobs!

# get the route
route1 = client.get_route("doc_embeddings")

# set up a cost guardrail (in dollars) for embeddings
route1.config.budget.enabled = True
route1.config.budget.monthly = 8000     # set an $8000 monthly budget
route1.config.budget.action = "notify"  # set to "reject" to restrict model access

# update the route, for async use `await client.aupdate_route(route1)`
client.update_route(route1)

Fine-tune the Generator Model (Optional):

  • While not always necessary, if you have specific domain knowledge or need the model to adopt a particular style, you can fine-tune a pre-trained model on a custom dataset.

Security Tip: While security for fine-tuning is a complete article in itself, the key things to keep in mind at this point are:

  • Model Confidentiality — if the base model is proprietary (e.g., OpenAI models), you need to ensure that the fine-tuned model doesn’t expose details about its training.
  • Operational Security & Data Infrastructure — the infrastructure used for fine-tuning should be secure, with considerations for secure data transfer, storage, and supply chain visibility.
  • Transparency and Reproducibility — for security audits and model verification, it is essential to maintain precise records of fine-tuning processes, including data sources and training logs. A minimal manifest sketch follows below.
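For the transparency point above, even a simple manifest recording the data sources and training parameters goes a long way in audits. Here is a minimal sketch using only the standard library; the field names and output path are illustrative only.

import hashlib
import json
from datetime import datetime, timezone

def write_finetune_manifest(dataset_path: str, base_model: str, hyperparams: dict) -> dict:
    """Record what went into a fine-tuning run so it can be audited and reproduced later."""
    with open(dataset_path, "rb") as f:
        dataset_sha256 = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "base_model": base_model,
        "dataset_path": dataset_path,
        "dataset_sha256": dataset_sha256,
        "hyperparameters": hyperparams,
    }
    with open("finetune_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest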

Using Javelin In Model Fine-Tuning Workflows

You can use Javelin in your model fine-tuning workflows. Just enable archiving on your Javelin routes, and all model interactions are automatically captured:

route1.config.archive = True  # this will archive all model interactions

# update the route, for async use `await client.aupdate_route(route1)`
client.update_route(route1)

# Security Tip: remember to set up a TTL for archives; the default TTL is 30 days

By collecting all of this data in a central location, you drastically reduce your security footprint and reduce data sprawl. This is highly important from a governance and responsible model usage standpoint. You can also use Javelin Archives for compliance and audits!

RAG In Action:

  • Question Encoder: When a question or query is input into the RAG system, it is first passed through a question encoder to get its vector representation.
  • Document Retrieval: The encoded question vector is then used to retrieve relevant document embeddings from the embedding space using a similarity measure (like cosine similarity).
  • Combine and Generate: The retrieved documents (or their embeddings) are combined with the original query and fed into the generator model, producing the final response. A minimal end-to-end sketch follows below.
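Putting these three steps together, here is a minimal, self-contained sketch of the retrieve-then-generate loop. The toy word-count embedding and the commented-out generation call are placeholders; in a real deployment, both the encoder and the generator would sit behind Javelin routes.

import math
from collections import Counter

documents = [
    "Sales data may only be shared with prospects under a signed NDA.",
    "API keys must be rotated every 90 days.",
]

def embed(text: str) -> Counter:
    """Toy word-count embedding; a real system would call an embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Question Encoder: encode the incoming query.
query = "What is the policy for sharing sales data with prospects?"
query_vec = embed(query)

# 2. Document Retrieval: pick the most similar document.
doc_vecs = [embed(d) for d in documents]
best_doc = documents[max(range(len(documents)), key=lambda i: cosine(query_vec, doc_vecs[i]))]

# 3. Combine and Generate: feed the retrieved context plus the query to the generator model.
prompt = f"Context: {best_doc}\n\nQuestion: {query}\nAnswer:"
# answer = generate(prompt)  # hypothetical call to a generator model behind a Javelin route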

Prevent Sensitive Data Leaks — it is easy to add security guardrails that inspect, redact, mask, or anonymize sensitive data before it ends up in vector space. It is also easy to separate applications that generate document embeddings from applications that encode questions: create two routes, one for the application that encodes questions and another for the application that generates embeddings. By passing requests through Javelin, you can set policies and automatic guardrails that prevent PII and PHI passthrough:

# get the routes
route1 = client.get_route("doc_embeddings")
route2 = client.get_route("question_embeddings")

# anonymize your documents before they are embedded
route1.config.dlp.enabled = True
route1.config.dlp.strategy = "anonymize"

# redact PII from questions before embedding them
route2.config.dlp.enabled = True
route2.config.dlp.strategy = "redact"

# update the routes, for async use `await client.aupdate_route(...)`
client.update_route(route1)
client.update_route(route2)

Securing Document Retrievals — RAG workflows touch the most sensitive and proprietary enterprise and user data. Often, this data is internal to enterprises and contains intellectual property or other corporate secrets. Hardening your workflows with purpose-built security like Javelin is critical to protecting this access.

4. Get Production Ready!

Once you are satisfied with the performance of your RAG workflow and the accuracy of its answers, you are ready to move your app to production use (either for internal users or external customers).

For RAG applications, Gateway Routes can be efficiently created in Javelin for various models. Setting up models like Llama 2 70B Chat, PaLM 2, or Mistral is a breeze.

# make the necessary route changes,
# e.g., set to Mistral-7B-Instruct-v0.1
route.model.name = "meta-textgeneration-llama-2-70b-f"

Here is a Langchain example:

from langchain.chat_models import ChatJavelinAIGateway
from langchain.schema import HumanMessage, SystemMessage

messages = [
    SystemMessage(
        content="You are a helpful assistant that responds to queries"
    ),
    HumanMessage(
        content="What is the policy for sharing my sales data with prospects"
    ),
]

chat = ChatJavelinAIGateway(
    gateway_uri="http://localhost:8000",  # replace with your Javelin endpoint URL
    route="internaluserchat_route",
    model_name="meta-textgeneration-llama-2-70b-f",
    params={
        "temperature": 0.9,
    },
)

# send the messages through the gateway route
response = chat(messages)

Deployment and Scaling:

  • Deploy the RAG system to the desired platform or environment.
  • Ensure infrastructure can handle real-time retrieval and generation, especially if deploying in high-demand scenarios.

Javelin gives you a central point for production controls around model use:

# make the necessary route changes
route.config.retries = 5     # retry the model 5 times if it is overloaded
route.config.rate_limit = 3  # limit model access to 3 requests/sec

Monitoring and Maintenance:

  • Regularly monitor the performance of the RAG system in real-world scenarios.
  • Update the document dataset or retrain components to ensure the system remains current and effective.

With Javelin, you have full transparency over model access, usage, and real-time monitoring.

Iterative Improvements:

  • As with any AI system, gather feedback and make continuous improvements. This might involve expanding the document dataset, adjusting retrieval mechanisms, or fine-tuning the generator with additional data.

Following these steps, one can set up a RAG system tailored to specific requirements, ensuring efficient information retrieval and high-quality response generation.

Centralizing Security & Governance with Javelin

The power of Javelin lies in its capacity to streamline governance, enforce security, and set policies for accessing various model APIs. This centralized approach empowers organizations, allowing them to distribute Routes to Models, which, in turn, democratizes access to LLMs while maintaining the integrity and security of the system.

The fusion of RAG applications with the Javelin AI platform makes it easier to take generative AI applications to production with security, confidence, and efficiency. The potential is immense, and customer support and Q&A systems can tap into it responsibly.

Learn More About Javelin

Javelin is an enterprise-grade LLM gateway. It is built to be lightning-fast 🚀 and highly secure for dealing with your most sensitive data 🔒. It is built on a zero-trust security architecture 🛡️ for use in even the most regulated industries.

We have built petabyte-scale Internet systems, so we understand Data and Security. Do you have a question about designing your RAG or just want to talk security? We would love to chat!
