Find Your Code! Scaling a LlamaIndex and Qdrant Application with Google Kubernetes Engine

Jul 1, 2024 · 13 min read

Have you ever struggled to locate that perfect piece of code you wrote months ago? In this article, I will show you how to build an LLM application with LlamaIndex and Qdrant that lets you query your GitHub repositories, making it easier to find forgotten code snippets. We’ll deploy the application on Google Kubernetes Engine (GKE) with Docker and FastAPI and provide an intuitive Streamlit UI for sending queries.

The complete code can be found in this repository.

Let’s find the code!

Prerequisites

Before we start, ensure you have the following:

A GCP account with a project, a service account key, and the GKE API enabled
A Qdrant cluster and the corresponding API key and URL
An OpenAI API key
The gcloud CLI installed (configuration guide)
kubectl, the Kubernetes command-line tool, installed (configuration guide)
A GitHub Access Token

Step 1: Setting Up Environment Variables

Create a .env file in your project directory with the following content:

OPENAI_API_KEY=your_openai_api_key
QDRANT_API_KEY=your_qdrant_api_key
QDRANT_URL=your_qdrant_url
COLLECTION_NAME=your_collection_name
ACCESS_TOKEN=your_github_access_token
GITHUB_USERNAME=your_github_username
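
Every later script depends on these values, so it can help to fail fast when one is missing. The sketch below is one way to do that with only the standard library. REQUIRED_KEYS and load_settings are names introduced here for illustration, and the sketch assumes the .env values have already been loaded into the process environment (for example with python-dotenv).

```python
import os

# Keys taken from the .env file above.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "QDRANT_API_KEY",
    "QDRANT_URL",
    "COLLECTION_NAME",
    "ACCESS_TOKEN",
    "GITHUB_USERNAME",
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}
```

Calling load_settings() at the top of a script surfaces a forgotten key immediately instead of partway through indexing.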

Step 2: Creating the Qdrant Collection

File name: create_qdrant_collection.py

Qdrant is an open-source vector database that excels in semantic similarity search tasks. Storing our data on Qdrant will allow us to find our code based on semantic search queries.
The first task will be to fetch all repositories and the corresponding files. The code is structured to retrieve only Python files with code inside, avoiding empty content fields that could affect the semantic search capabilities.

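# Assumes PyGithub and llama-index-readers-github are installed
from github import Github
from llama_index.readers.github import GithubClient, GithubRepositoryReader


def get_code_file_list(github_token, github_username):
    """Load every non-empty .py file from the user's own repositories."""
    # Defined before the try block so the final return always has a value
    all_documents = []

    try:
        # Initialize Github client
        g = Github(github_token)

        # Fetch all repositories for the user
        repos = g.get_user(github_username).get_repos()

        github_client = GithubClient(github_token=github_token, verbose=True)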

        for repo in repos:
            repo_name = repo.full_name
            print(f"Loading files from {repo_name}")

            # Check if the repository belongs to the user
            if repo.owner.login != github_username:
                print(f"Skipping repository {repo_name} as it does not belong to the user.")
                continue

            try:
                # Determine the default branch
                default_branch = repo.default_branch

                # Load documents from the repository
                documents = GithubRepositoryReader(
                    github_client=github_client,
                    owner=github_username,
                    repo=repo.name,
                    use_parser=False,
                    verbose=False,
                    filter_file_extensions=(
                        [".py"],
                        GithubRepositoryReader.FilterType.INCLUDE,
                    ),
                ).load_data(branch=default_branch)

                # Ensure each document has text content
                for doc in documents:
                    if doc.text and doc.text.strip():
                        all_documents.append(doc)
                    else:
                        print(f"Skipping empty document: {doc.metadata['file_path']}")

            except Exception as e:
                print(f"Failed to load {repo_name}: {e}")

    except Exception as e:
        print(f"Error fetching repositories: {e}")

    return all_documents

Once we have all our files stored in a list, it is time to split them into nodes with LlamaIndex and save the content and metadata in a Qdrant collection. This is where our documents will be stored and where our API will search for the code you are looking for. The chunked_nodes function follows a standard pattern for a Qdrant collection and can be adapted and reused depending on the metadata structure of the documents. Another example can be found in a previous article I wrote about a similar application deployed on AWS.

def split_documents_into_nodes(all_documents):

    try:
        splitter = SentenceSplitter(
            chunk_size=1500,
            chunk_overlap=200
        )

        nodes = splitter.get_nodes_from_documents(all_documents)

        return nodes

    except Exception as e:
        print(f"Error splitting documents into nodes: {e}")
        return []

def create_collection_if_not_exists(client, collection_name):

    try:
        collections = client.get_collections()
        if collection_name not in [col.name for col in collections.collections]:
            client.create_collection(
                collection_name=collection_name,
                vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
            )

            print(f"Collection '{collection_name}' created.")
        else:
            print(f"Collection '{collection_name}' already exists.")
    except ResponseHandlingException as e:
        print(f"Error checking or creating collection: {e}")


def chunked_nodes(data, client, collection_name):
    """Embed each node and upsert it into the Qdrant collection."""
    # Named "points" so the list no longer shadows the function name
    points = []

    for item in data:
        qdrant_id = str(uuid4())
        document_id = item.id_
        code_text = item.text
        source = item.metadata["url"]
        file_name = item.metadata["file_name"]

        # embed_model is the module-level embedding model
        content_vector = embed_model.get_text_embedding(code_text)

        payload = {
            "text": code_text,
            "document_id": document_id,
            "metadata": {
                "qdrant_id": qdrant_id,
                "source": source,
                "file_name": file_name,
            },
        }

        points.append(PointStruct(id=qdrant_id, vector=content_vector, payload=payload))

    if points:
        client.upsert(
            collection_name=collection_name,
            wait=True,
            points=points,
        )

    print(f"{len(points)} Chunked metadata upserted.")

Once you run the code, the collection will be on your Qdrant Cluster.
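
The collection is created with Distance.COSINE, so Qdrant ranks the stored chunks by the cosine similarity between their vectors and the query vector. As a rough illustration of that metric (not of Qdrant's actual indexed search), here is a plain-Python sketch. cosine_similarity and rank_by_cosine are hypothetical helpers, and the three-dimensional vectors are toy stand-ins for the 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vector, points, limit=3):
    """Return the `limit` (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in points.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


if __name__ == "__main__":
    # Toy data: file names mapped to made-up vectors
    toy_points = {
        "utils.py": [1.0, 0.0, 0.0],
        "train.py": [0.7, 0.7, 0.0],
        "plot.py": [0.0, 0.0, 1.0],
    }
    for name, score in rank_by_cosine([1.0, 0.1, 0.0], toy_points):
        print(f"{name}: {score:.3f}")
```

Because cosine similarity depends only on direction, a long file and a short file with similar content can still score close to each other.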

Step 3: Defining the FastAPI Application

File name: app.py

Now it is time to define our application that will receive a query, search the Qdrant collection for the best matches, and return the relevant code. The main components of the RAG application are:

Qdrant collection as retriever.
Prompt with instructions.
SentenceTransformerRerank to reorder the retrieved nodes and keep only the most relevant ones.

# Initialize Qdrant Vector Store
vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME, embed_model=embed_model)

# Initialize Vector Store Index
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=embed_model)

# Define the prompt template for querying
qa_prompt_tmpl_str = """\
Context information is below.
---------------------
{context_str}
---------------------

Given the context information and not prior knowledge, \
answer the query. Please be concise, and complete. \
If the context does not contain an answer to the query \
respond with I don't know!

Query: {query_str}
Answer: \
"""
qa_prompt = PromptTemplate(qa_prompt_tmpl_str)

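# Initialize Retriever
# Fetch more candidates than the reranker keeps (top_n=3); the default similarity_top_k is 2
retriever = VectorIndexRetriever(index=index, similarity_top_k=10)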

# Initialize Response Synthesizer
response_synthesizer = get_response_synthesizer(
    text_qa_template=qa_prompt,
)

# Initialize Sentence Reranker for query response
rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)

# Initialize RetrieverQueryEngine for query processing
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[rerank]
)

Reranking the retrieved nodes with a cross-encoder and keeping only the top three generally gives the LLM a more relevant and more compact context, which tends to improve the answer. The application will also be deployed on CPU-only instances, so a small reranking model and a short prompt help keep response times down.
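
The retrieve-then-rerank pattern can be sketched without any model: a cheap first pass collects candidates, a more careful scorer reorders them, and only the best few go on to the LLM. In this sketch the word-overlap scorer is a toy stand-in for the cross-encoder, and rerank and overlap_score are illustrative names, not LlamaIndex APIs.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, top_n=3, scorer=overlap_score):
    """Reorder candidate chunks by scorer(query, chunk) and keep the best top_n."""
    ranked = sorted(candidates, key=lambda chunk: scorer(query, chunk), reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    # Made-up candidate chunks, as a first-stage retriever might return them
    candidates = [
        "def plot_loss(history): ...",
        "def load csv file into dataframe",
        "def save dataframe to csv file",
        "class Trainer: ...",
    ]
    for chunk in rerank("load csv file", candidates, top_n=2):
        print(chunk)
```

The first stage has to return more candidates than top_n, otherwise the reranker has nothing to discard.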

@app.post("/query/")
async def query_vector_store(request: QueryRequest):

    query = request.query
    response = query_engine.query(query)
    if not response:
        raise HTTPException(status_code=404, detail="No response found")

    # Remove newline characters from the response
    cleaned_response = response.response.replace("\n", "")

    return cleaned_response
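
Because the answers often contain code, removing every newline also removes the line structure that Python depends on. If that turns out to be a problem, one gentler option is to trim the text and collapse only runs of blank lines. tidy_response below is a hypothetical helper, not part of the original app.

```python
import re


def tidy_response(text):
    """Strip outer whitespace and collapse 3+ consecutive newlines into one blank line."""
    return re.sub(r"\n{3,}", "\n\n", text.strip())
```

Indentation inside the snippet is left untouched, so a returned function body stays readable in the UI.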

Step 4: Understanding Kubernetes

Now it is time to configure our Kubernetes deployment, where we need to consider several factors like resources/memory allocation and scalability. Let’s first describe how Kubernetes works.

Source: https://kubernetes.io/docs/concepts/architecture

Kubernetes is a powerful, open-source platform designed to automate the deployment, scaling, and management of containerized applications. Its architecture is built around the following key components:

API Server: It is the central management entity.
Controller Manager: Manages controllers that handle routine tasks such as replication and scaling and makes sure the deployment is running as per specifications.
Worker Nodes: These are instances (either virtual machines or physical servers) that run the applications in the form of containers.
Scheduler: Assigns workloads to nodes based on resource availability and other constraints.
Pods: The smallest deployable units in Kubernetes, representing a single instance of a running process in a cluster. Pods can contain one or more containers that share storage, network, and a specification for how to run the containers.

The nodes have three main components:

Kubelet: An agent that ensures containers are running in a Pod.
Kube-proxy: Handles network traffic within the Kubernetes cluster.
Container Runtime: Software that runs the containers (containerd on current GKE nodes; our image is built with Docker).

This architecture scales well and continuously works to bring the app back to the state we specified. Some of the key features that make Kubernetes a very popular choice include:

Self-healing: If a container crashes or a pod stops responding, Kubernetes restarts or replaces it automatically.
Rolling Updates: Automated rollouts and rollbacks allow for updating applications without downtime.
Load Balancing: Distributes load across the pods, optimizing resource utilization.
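
Self-healing comes from controllers repeatedly comparing the desired state with what is actually running and correcting the difference. The following toy loop illustrates that idea only. The pod names and the reconcile function are made up for this sketch and bear no relation to the real controller code.

```python
import itertools

# Counter used to give replacement pods unique, made-up names
_replacement_ids = itertools.count(1)


def reconcile(desired_replicas, running_pods):
    """Return the pod list after one reconciliation pass.

    Pods marked unhealthy are dropped, then pods are added or removed
    until the number running matches desired_replicas.
    """
    healthy = [pod for pod in running_pods if pod["healthy"]]
    while len(healthy) < desired_replicas:
        healthy.append({"name": f"replacement-{next(_replacement_ids)}", "healthy": True})
    return healthy[:desired_replicas]
```

Running reconcile again on its own output changes nothing, which is the property that makes this kind of loop safe to repeat indefinitely.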

Let’s now dig deep into the configuration of our Kubernetes Cluster!

Step 5: GKE Configuration File

File name: deploy_gke.yaml

The first step here is to create our Kubernetes Cluster in GKE. Be sure to authenticate in GCP and select the corresponding Project ID:

gcloud auth login

gcloud config set project PROJECT_ID

Once authenticated, you can create the container cluster where several parameters can be configured like the number of nodes and autoscaling capabilities, which by default are not enabled.

This is not a computationally heavy LLM application, but I wanted to test its scalability, so I based the configuration on an n1-standard-4 instance (4 vCPUs, 15 GB memory) instead of the smallest n1-standard-1 (1 vCPU, 3.75 GB memory).

gcloud container clusters create llama-gke-cluster \
        --zone=europe-west6-a \
        --num-nodes=5 \
        --enable-autoscaling \
        --min-nodes=2 \
        --max-nodes=10 \
        --machine-type=n1-standard-4 \
        --enable-vertical-pod-autoscaling

The --enable-autoscaling flag automatically adjusts the size of the cluster based on resource demand. It adds or removes nodes within the specified minimum (--min-nodes) and maximum (--max-nodes) limits. This is cluster autoscaling, because it changes the number of nodes.

The --enable-vertical-pod-autoscaling flag enables the Vertical Pod Autoscaler on the cluster. Once a VerticalPodAutoscaler object targets a workload, it adjusts the CPU and memory of the pods based on their actual usage: if a pod needs more resources, its resource requests and limits are updated. This is vertical autoscaling, because it increases or decreases the CPU and memory assigned to each pod.

The parts of the YAML file related to vertical autoscaling, which can be further customized, are the resources block, which sets the requests (minimum) and the limits (maximum, based on instance capacity), and the VerticalPodAutoscaler with its update policy.

resources:
  requests:                       # Minimum resources required.
    memory: "2Gi"
    cpu: "1"
  limits:                         # Maximum resources allowed
    memory: "12Gi"                # Maximum memory of the instance (80-90%)
    cpu: "4"                      # Maximum vCPUs of the instance


# Vertical scaling
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llama-gke-deploy-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:       llama-gke-deploy
  updatePolicy:                   # Policy for updating the resource requests and limits
    updateMode: "Auto"            # Automatically update the resource requests and limits
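
The 80-90% comment in the resources block is easy to sanity-check. This sketch parses the binary-suffix quantities used above and compares the memory limit with the node's memory. parse_memory is a hypothetical helper that handles only the Ki, Mi and Gi suffixes, and treating the machine type's advertised 15 GB as 15Gi is an assumption.

```python
# Binary suffixes used by Kubernetes memory quantities (subset)
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity):
    """Convert a quantity such as '2Gi' or '512Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def limit_fraction(limit, node_memory):
    """Share of the node's memory that a container limit would claim."""
    return parse_memory(limit) / parse_memory(node_memory)


if __name__ == "__main__":
    print(f"{limit_fraction('12Gi', '15Gi'):.0%}")
```

Under that assumption the 12Gi limit comes out at 80% of the node, leaving headroom for system components.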

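In addition to these scaling strategies, I included horizontal autoscaling, which increases or decreases the number of pods in response to the workload’s CPU or memory consumption. It is applied when the app is deployed. Note that the Vertical Pod Autoscaler documentation advises against running it in Auto mode alongside a Horizontal Pod Autoscaler that scales on CPU or memory, so treat this combination as an experiment.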

# Horizontal scaling
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-gke-deploy-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-gke-deploy
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource                # Type of metric
    resource:                     # Resource-based metric
      name: cpu                   # Metric name
      target:
        type: Utilization         # Type of target value
        averageUtilization: 70    # Average CPU utilization percentage to maintain.
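
The Horizontal Pod Autoscaler's core calculation is documented as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies it with the 70% target and the 2 to 10 replica bounds from the manifest. It deliberately ignores the tolerance band, stabilization windows and pod-readiness handling that the real controller adds, and desired_replicas is an illustrative name.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=70,
                     min_replicas=2, max_replicas=10):
    """Simplified HPA rule: scale proportionally to utilization, then clamp."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))
```

For example, two pods averaging 140% of their requested CPU would be scaled to four, while two pods at 35% stay at two because of minReplicas.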

Finally, it is good practice to include health checks in the configuration. The readinessProbe checks whether the pod is ready to serve traffic, and the livenessProbe checks whether the pod is still alive. Both send an HTTP GET to the root path on the port where the app receives requests, so the app must answer that path with a 2xx or 3xx status for the probes to pass.

readinessProbe:             # Check if the pod is ready to serve traffic.
  httpGet:
    scheme: HTTP
    path: /
    port: 8000              # Port for readiness probe (should match containerPort)
  initialDelaySeconds: 240  # Delay before first probe is executed
  periodSeconds: 60         # Interval between probes

livenessProbe:              # Check if the pod is alive
  httpGet:
    scheme: HTTP
    path: /
    port: 8000              # Port for liveness probe (should match containerPort)
  initialDelaySeconds: 240  # Delay before first probe is executed
  periodSeconds: 60         # Interval between probes

This configuration ensures that the application can scale efficiently and remains robust, handling both vertical and horizontal scaling needs while maintaining health checks.
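
With these probe settings a pod is not checked at all for the first four minutes, and a hung container is only restarted after several failed liveness checks in a row. Here is a back-of-the-envelope sketch, assuming Kubernetes' default failureThreshold of 3 (the manifest does not set it) and ignoring probe timeouts and scheduling jitter. Both function names are illustrative.

```python
def probe_schedule(initial_delay_seconds, period_seconds, count):
    """Approximate times (seconds after container start) of the first `count` probes."""
    return [initial_delay_seconds + i * period_seconds for i in range(count)]


def worst_case_detection_seconds(period_seconds, failure_threshold=3):
    """Rough upper bound on the time between a container hanging and the kubelet
    acting on it: the hang can start just after a successful probe, and
    failure_threshold consecutive probes must then fail."""
    return period_seconds * failure_threshold
```

With periodSeconds set to 60 that bound is about three minutes, so shorter periods are worth considering if faster recovery matters.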

Step 6: Kustomization File

File name: kustomization.yaml

This file helps manage Kubernetes resources by layering customizations on top of the original files, such as our previous YAML file, without modifying them. It is not strictly necessary for this app, because I did not include any custom values (I added them in GitHub Actions, see the next step). It can, however, be used to add elements such as ConfigMap key-value pairs for non-confidential data, or Secrets for passwords, API keys, and other sensitive information.

Here is an example of the kustomization.yaml file with possible additions:

apiVersion: kustomize.config.k8s.io/v1beta1  
kind: Kustomization

resources:
  - deploy_gke.yaml

# Possible additions
configMapGenerator:
  - name: app-config
    literals:
      - ENV=production

secretGenerator:
  - name: app-secret
    literals:
      - DATABASE_USER=admin
      - DATABASE_PASSWORD=secretpassword
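
Values produced by secretGenerator end up base64-encoded in the data field of a Secret. That is an encoding rather than encryption, which the following sketch makes visible. encode_secret_data and decode_secret_data are hypothetical helpers, and the credentials are the placeholder literals from the example above.

```python
import base64


def encode_secret_data(literals):
    """Base64-encode each value the way a Kubernetes Secret stores it under `data`."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in literals.items()
    }


def decode_secret_data(data):
    """Reverse the encoding; no key is needed, so treat manifests as sensitive."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


if __name__ == "__main__":
    encoded = encode_secret_data({"DATABASE_USER": "admin"})
    print(encoded)
    print(decode_secret_data(encoded))
```

Anyone who can read the rendered manifest can recover the values, which is one reason to keep real credentials out of kustomization.yaml and inject them from the CI system instead.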

Step 7: GitHub Actions

Before we deploy our app, we need to configure GitHub Actions by adding some secrets. The workflow in the next step reads GKE_PROJECT, GKE_SA_KEY, OPENAI_API_KEY, QDRANT_API_KEY, QDRANT_URL, and COLLECTION_NAME. GKE_PROJECT is the PROJECT_ID in Google Cloud and GKE_SA_KEY is the service account key.

Step 8: App Deployment

File name: build_deploy.yaml

Next, it is time to deploy our app and test it in a production environment. For this we also need the Dockerfile, which is used together with GitHub Actions:

FROM python:3.10

WORKDIR /app

# Copy application code
COPY . .

# Clear pip cache
RUN pip cache purge

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Expose port
EXPOSE 8000

# Command to run the application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

With the GitHub Actions workflow configured, the app deployment process becomes automated. If we push changes to the main branch, the workflow will:

Checkout: Checks out the repository code.
Setup gcloud CLI: Authenticates with Google Cloud using the provided service account key.
Configure Docker: Configures Docker to use Google Cloud as a credential helper.
Get GKE Credentials: Fetches the GKE cluster credentials.
Build Docker Image: Builds the Docker image for the application.
Publish Docker Image: Pushes the Docker image to Google Artifact Registry
Set up Kustomize: Downloads and sets up Kustomize.
Create or Update Secrets: Creates or updates Kubernetes secrets in the GKE cluster.
Deploy: Uses Kustomize to deploy the Docker image to the GKE cluster and checks the deployment status.

name: Build and Deploy to GKE

on:
  push:
    branches:
      - main

env:
  PROJECT_ID: ${{ secrets.GKE_PROJECT }} 
  GKE_CLUSTER: llama-gke-cluster          # Cluster Name
  GKE_ZONE: europe-west6-a                # Cluster zone 
  DEPLOYMENT_NAME: llama-gke-deploy       # Deployment name
  IMAGE: llama-app-gke-image              # Image Name

jobs:
  setup-build-publish-deploy:
    name: Setup, Build, Publish, and Deploy
    runs-on: ubuntu-latest
    environment: production

    steps:
    - name: Checkout
      uses: actions/checkout@v4

    # Setup gcloud CLI
    - id: 'auth'
      uses: 'google-github-actions/auth@v2'
      with:
        credentials_json: '${{ secrets.GKE_SA_KEY }}'

    # Configure Docker to use the gcloud command-line tool as a credential
    # helper for authentication
    - run: |-
        gcloud --quiet auth configure-docker

    # Get the GKE credentials so we can deploy to the cluster
    - uses: google-github-actions/get-gke-credentials@db150f2cc60d1716e61922b832eae71d2a45938f
      with:
        cluster_name: ${{ env.GKE_CLUSTER }}
        location: ${{ env.GKE_ZONE }}
        credentials: ${{ secrets.GKE_SA_KEY }}

    # Build the Docker image
    - name: Build
      run: |-
        docker build \
          --tag "gcr.io/$PROJECT_ID/$IMAGE:$GITHUB_SHA" \
          --build-arg GITHUB_SHA="$GITHUB_SHA" \
          --build-arg GITHUB_REF="$GITHUB_REF" \
          .

    # Push the Docker image to Google Artifact Registry
    - name: Publish
      run: |-
        docker push "gcr.io/$PROJECT_ID/$IMAGE:$GITHUB_SHA"

    # Set up kustomize
    - name: Set up Kustomize
      run: |-
        curl -sfLo kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.1.0/kustomize_3.1.0_linux_amd64
        chmod u+x ./kustomize 
        
    # Create or update secrets in the GKE cluster
    - name: Create or Update Secrets
      run: |-
        kubectl delete secret openai-secret || true
        kubectl delete secret qdrant-secret || true
        kubectl create secret generic openai-secret --from-literal=OPENAI_API_KEY=${{ secrets.OPENAI_API_KEY }}
        kubectl create secret generic qdrant-secret \
          --from-literal=QDRANT_API_KEY=${{ secrets.QDRANT_API_KEY }} \
          --from-literal=COLLECTION_NAME=${{ secrets.COLLECTION_NAME }} \
          --from-literal=QDRANT_URL=${{ secrets.QDRANT_URL }}

    # Deploy the Docker image to the GKE cluster
    - name: Deploy
      run: |-
        ./kustomize edit set image gcr.io/PROJECT_ID/IMAGE:TAG=gcr.io/$PROJECT_ID/$IMAGE:$GITHUB_SHA
        ./kustomize build . | kubectl apply -f -
        kubectl rollout status deployment/$DEPLOYMENT_NAME
        kubectl get services -o wide

This setup ensures an automated, scalable, and secure deployment process for the application on GKE.

Note that to index additional GitHub repositories, you could also automate the Qdrant indexing process with GitHub Actions, for example by using a Helm chart to run Qdrant in your Kubernetes cluster and a Kubernetes Job manifest to run create_qdrant_collection.py.

Step 9: Kubernetes Check Up

The deployment might take 10–20 minutes to finish, but once the deploy step of build_deploy.yaml is triggered we can monitor its status. Some useful commands are:

# Get the Pods
kubectl get po

# Get the Nodes
kubectl get nodes

# Get the Services
kubectl get svc 

# Get the logs of a pod
kubectl logs llama-gke-deploy-668b58b455-fjwvq 

# Describe a pod
kubectl describe pod llama-gke-deploy-668b58b455-fjwvq

# Check CPU usage
kubectl top pod llama-gke-deploy-668b58b455-fjwvq

In the screenshot below, we can see some of these commands. I let the cluster run for a few hours to check that everything works properly. For example, at one point a pod was not ready and took a few minutes to be created. Most likely it crashed and a replacement was being started, which shows the cluster recovering when something goes wrong.

Additionally, the memory in use (around 500Mi) is well below both the 2Gi request and the 12Gi limit, and there is enough CPU. Originally, we deployed 5 nodes and allowed 2 to 10 pods for horizontal autoscaling. The system is running with 2 nodes and 2 pods, so autoscaling works as well: the cluster scaled down from 5 nodes to 2.
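
The same checks can be scripted. The sketch below summarises the JSON printed by kubectl get pods -o json. summarize_pods is a hypothetical helper, and it reads only the name, phase, readiness and restart-count fields.

```python
import json
import sys


def summarize_pods(pods_json):
    """Return one dict per pod with its phase, readiness and total restarts."""
    summary = []
    for item in json.loads(pods_json).get("items", []):
        status = item.get("status", {})
        containers = status.get("containerStatuses", [])
        summary.append({
            "name": item["metadata"]["name"],
            "phase": status.get("phase", "Unknown"),
            # A pod with no reported containers is treated as not ready
            "ready": bool(containers) and all(c.get("ready", False) for c in containers),
            "restarts": sum(c.get("restartCount", 0) for c in containers),
        })
    return summary


if __name__ == "__main__":
    # Intended use: kubectl get pods -o json | python summarize_pods.py
    for pod in summarize_pods(sys.stdin.read()):
        print(pod)
```

A rising restart count on one pod is usually the quickest hint that it is the one to inspect with kubectl describe and kubectl logs.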

Step 10: Streamlit App

File name: streamlit_app.py

Finally, let’s build a Streamlit app to interact with our GitHub code! To find the endpoint, take the EXTERNAL-IP of the service, as shown in the previous screenshot, and add it to the Streamlit app file.

import streamlit as st
import requests

# Set the FastAPI endpoint
FASTAPI_ENDPOINT = "http://34.65.157.134:8000/query/"

# Streamlit app title
st.title("Find Your Code")

# Input field for the query
query = st.text_input("Query:")

# Button to submit the query
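if st.button("Get Response"):
    if query:
        try:
            # A timeout keeps the UI from hanging if the API is unreachable
            response = requests.post(FASTAPI_ENDPOINT, json={"query": query}, timeout=120)
        except requests.RequestException as exc:
            st.write("Request failed:", exc)
        else:
            if response.status_code == 200:
                # FastAPI returns the answer as a JSON-encoded string
                st.write(response.json())
            else:
                st.write("Error:", response.status_code)
    else:
        st.write("Please enter a query.")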

To run the Streamlit app, use the following command in the terminal:

streamlit run streamlit_app.py

Now you can interact with the app and find your code!
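
For a quick smoke test without the UI, the same POST request can be sent with the standard library alone. This is a sketch: build_query_request and ask are illustrative names, the endpoint is the example address used above, and it assumes the API returns a JSON body, as FastAPI does by default.

```python
import json
import urllib.request

FASTAPI_ENDPOINT = "http://34.65.157.134:8000/query/"  # replace with your EXTERNAL-IP


def build_query_request(query, endpoint=FASTAPI_ENDPOINT):
    """Build the POST request carrying {"query": ...} as JSON."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query, endpoint=FASTAPI_ENDPOINT, timeout=60):
    """Send the query and return the decoded JSON answer."""
    request = build_query_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling ask("Where do I load CSV files?") from a Python shell is then enough to confirm that the service is reachable before starting Streamlit.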

Step 11: Kubernetes Clean Up

Once you no longer need the app, you can delete the cluster with the following command, then remove the Docker image stored under the Artifact Registry in Google Cloud.

gcloud container clusters delete llama-gke-cluster --zone=europe-west6-a

Conclusion

In this post, we walked through building, deploying, and scaling an LLM application using Kubernetes, LlamaIndex, and Qdrant to interact with your code. We created Python and YAML scripts to extract the relevant files, upload them into a Qdrant Collection, and deploy the app with various memory, resource, and scaling specifications. Additionally, we built a Streamlit app for user interaction.

If you enjoyed reading this content you can support it by:

Clapping and following me on Medium! 👏 👏 👏
Following me on GitHub 🎶 🎷 🎶
Starring the repo ⭐⭐⭐
Sharing my content on LinkedIn! 💯💯💯
Buying me a coffee or supporting me on GitHub Sponsors 🚀🚀🚀

Thank you for following along, and happy coding!