Hosting my HR Assistant RAG app in Choreo

Chintana Wilamuna
Published in Choreo Tech Blog
Aug 8, 2024 · 5 min read

In a previous post, I explored building a RAG application. That app is a self-contained executable that takes the user’s question as a command line argument, talks to a vector database and an LLM (both hosted locally), and writes the response to standard output. In this post, I will explore how to host this application in Choreo.

Right away, one problem with my solution is that I’m using a set of internal HR policy docs that cannot be hosted in a 3rd-party SaaS database without getting legal involved. So I had to either find a different set of publicly accessible docs or make the local setup work with Choreo, which is a SaaS.

Luckily, Choreo provides all the right constructs to talk to private data stores through encrypted channels. The final architecture looks like this.

High level diagram after hosting with Choreo

Convert the standalone app to an API

First, the standalone program has to be converted to an API. This is easily done with FastAPI since I’m using Python. The user’s query, previously passed as a command line argument, will now be part of the URL. Here, query_text becomes a query parameter passed in the URL, like https://domain.com/?query_text="my query". With FastAPI, function parameters automatically become query parameters.

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def gen_response(query_text: str = ""):
    ...
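
For example, calling this endpoint from Python could look like the following sketch (the host and port assume the local uvicorn settings shown later, and the question itself is just a made-up example). The requests library takes care of URL-encoding the query string.

import requests

# Hypothetical local call; requests URL-encodes the query_text value
resp = requests.get(
    "http://localhost:8005/",
    params={"query_text": "How many annual leave days do I get?"},
)
print(resp.text)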

Externalizing endpoints

Next, the endpoints for the Milvus DB and the LLM API calls have to change. Right now, both of them point to localhost. We can inject these as environment variables.

The Python dotenv package makes it really easy to pass these environment variables using a .env file for local testing. The contents of the .env file are as follows,

OLLAMA_URL="http://localhost:11434/"
MILVUS_HOST="localhost"
MILVUS_PORT=19530

When the .env file is not available, we want these values to be picked up from environment variables.

import os
from dotenv import load_dotenv

load_dotenv()

OLLAMA_URL = os.getenv('OLLAMA_URL')
MILVUS_HOST = os.getenv('MILVUS_HOST')
MILVUS_PORT = os.getenv('MILVUS_PORT')
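
If you also want sensible defaults when neither the .env file nor the environment variables are present, os.getenv accepts a default value as its second argument. A minimal sketch, where the defaults simply mirror my local setup:

import os

# Fall back to the local endpoints when nothing is configured
OLLAMA_URL = os.getenv('OLLAMA_URL', 'http://localhost:11434/')
MILVUS_HOST = os.getenv('MILVUS_HOST', 'localhost')
MILVUS_PORT = os.getenv('MILVUS_PORT', '19530')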

Streaming the response back to the client

The standalone program prints the response in chunks to the standard output. This output should now be streamed back to the client.

from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/")
def gen_response(query_text: str = ""):
    ...

    return StreamingResponse(rag_chain.stream(query_text))
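
On the client side, the chunks can then be consumed as they arrive instead of waiting for the full answer. Here is a rough sketch using the requests library (the host, port, and query are again just local-testing assumptions):

import requests

# Stream the response and print each chunk as soon as it arrives
with requests.get(
    "http://localhost:8005/",
    params={"query_text": "What is the leave policy?"},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)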

The full function is as follows,

from langchain_milvus import Milvus
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms.ollama import Ollama
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
import os
from dotenv import load_dotenv
import argparse
from typing import Union
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import uvicorn

load_dotenv()

app = FastAPI()

OLLAMA_URL = os.getenv('OLLAMA_URL')
MILVUS_HOST = os.getenv('MILVUS_HOST')
MILVUS_PORT = os.getenv('MILVUS_PORT')

PROMPT_TEMPLATE = """
...
"""

@app.get("/")
def gen_response(query_text: str = ""):
    # Embedding function backed by the local Ollama instance
    embedding_fn = OllamaEmbeddings(
        model="llama3.1",
        base_url=OLLAMA_URL
    )
    # Connect to the Milvus collection holding the policy docs
    db = Milvus(
        embedding_fn,
        connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
        collection_name="policy_docs",
    )
    retriever = db.as_retriever()
    prompt = PromptTemplate(
        template=PROMPT_TEMPLATE,
        input_variables=["context", "question"]
    )
    llm = Ollama(
        model="llama3.1",
        base_url=OLLAMA_URL
    )
    # Retrieve context, fill the prompt, call the LLM, and parse the output
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    return StreamingResponse(rag_chain.stream(query_text))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8005)
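
Before deploying, the endpoint can also be exercised without starting a server by using FastAPI’s TestClient. A quick sanity-check sketch, assuming the code above lives in main.py and the query is just an example:

from fastapi.testclient import TestClient

from main import app  # assumes the code above is saved as main.py

client = TestClient(app)

response = client.get("/", params={"query_text": "How do I apply for leave?"})
print(response.status_code)
print(response.text)  # the streamed answer, fully collected by the test client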

In this post, I won’t be addressing the data loading step, as it’s already completed and that part will not be hosted in Choreo.

In my setup, there are 2 components that are running locally,

  1. The vector database (Milvus)
  2. LLM (Ollama)

Using Tailscale VPN

Choreo supports creating VPN tunnels through Tailscale. With this setup, Choreo’s Cloud Data Plane can directly communicate with services hosted inside a private network, such as a desktop machine without a public IP. That’s incredibly cool!

Once you have Tailscale installed, the dashboard will show the connected nodes.

Tailscale mesh VPN

The first machine is my desktop, and the third is the Tailscale proxy deployed in Choreo. Ignore the second one.

All requests to the Tailscale proxy should be forwarded to the Milvus DB and Ollama hosted on my local machine. I will use the same port numbers for the Tailscale proxy.

IP/port mapping from Tailscale proxy in Choreo

On the local machine, Milvus is listening on TCP port 19530 and Ollama is on HTTP port 11434. When deploying the Tailscale proxy component in Choreo, the config.yaml file will include the IP-to-port mapping. Once the deployment is complete, we will be able to see two endpoints from this component as shown below:

Tailscale proxy exposing 2 endpoints

Deploying the Python service

Now we can deploy the Python API. After creating the component, we have to create environment variables for MILVUS_HOST, MILVUS_PORT and OLLAMA_URL,

Mounting 2 environment variables
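
For illustration, the mounted values point at the Tailscale proxy endpoints rather than localhost. They would look something along these lines, where the placeholder hostname stands in for whatever endpoint Choreo shows for the Tailscale proxy component:

OLLAMA_URL="http://<tailscale-proxy-endpoint>:11434/"
MILVUS_HOST="<tailscale-proxy-endpoint>"
MILVUS_PORT=19530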

We can now use the public endpoint of this API to pass our query. It will route all the way back to my local GPU, generate a response, and send it back!

The total round-trip time was acceptable for an experiment like this, even though there are 2 external calls routed back to a machine on a domestic internet connection.

Summary

With the VPN feature that’s built into Choreo, it’s possible to expose publicly consumable, secure, and rate-limited APIs while keeping the vector DB and LLM deployments private. This makes sure sensitive data never leaves your private infrastructure.
