Multi-Tenant Products: RAG + Bedrock + Amazon Elastic Kubernetes Service (EKS) — a CodeLLaMa-7b Walkthrough

Madhur Prashant
7 min read · Oct 20, 2023


Purpose

The purpose of this blog is to give a quick walkthrough of some important generative AI concepts, and then show what we are going to build with them. We will take a top-down approach: ideate first, then create a product-based solution. We will also touch on what I think is the future of generative AI, which is using it to augment user actions and workflows. Before we get started, a couple of clarifying points: artificial intelligence, machine learning, and consequently generative AI are going through a wave of customer adoption. This is due to three main reasons: the increase in available compute capacity, the growth of data, and the fact that these models have become more sophisticated (in architecture and in the use cases they serve) and more reliable.

With the rise of these models and concepts, we have seen growth in code generation, image generation, image classification, music generation, video generation, and so on. Let's take a quick example of how easy it has become to generate code using the Code LLaMa-7b model. Feel free to scroll down to the Amazon EKS section if you want to skip the example:

NOTE: I work at AWS, but the thoughts, ideas, and implementations in these blogs and the accompanying GitHub repositories are my own.

RAG BASED SOLUTION: CUSTOMIZING CODE-LLAMA-7B FOR YOUR USE CASE

Source Code: https://github.com/madhurprash/amazon-sagemaker-generativeai/blob/main/deploy_and_customize_codeLLaMa/customizing_codeLLaMa.ipynb

import sys
!{sys.executable} -m pip install langchain
!{sys.executable} -m pip install chromadb
!{sys.executable} -m pip install --upgrade boto3

Import the remaining libraries, including the document loaders and the recursive character text splitter that we will use to chunk reference code before creating embeddings (a brief sketch of how these pieces fit together follows the imports):

import argparse
import os
from langchain.document_loaders import DirectoryLoader
import chromadb
import json
import boto3
import time
import glob
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)
import ast
import sys
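
As a quick illustration of how these imports fit together, here is a minimal, hedged sketch (the directory path, glob pattern, and chunk sizes are assumptions, not values from the notebook) that loads a directory of Python files and splits them into code-aware chunks before embedding:

# Minimal sketch: load .py files and split them into Python-aware chunks.
# The path, glob pattern, and chunk sizes below are illustrative assumptions.
from langchain.document_loaders import TextLoader

loader = DirectoryLoader("./sample_repo", glob="**/*.py", loader_cls=TextLoader)
documents = loader.load()

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,   # split on Python-aware boundaries such as defs and classes
    chunk_size=1000,
    chunk_overlap=100,
)
code_chunks = python_splitter.split_documents(documents)
print(f"Split {len(documents)} files into {len(code_chunks)} chunks")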

Deploy the Code LLaMa-7b model using SageMaker JumpStart:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgeneration-llama-codellama-7b"
model = JumpStartModel(model_id=model_id)
predictor = model.deploy()
# Keep the endpoint name so we can invoke the endpoint directly with boto3 below
endpoint_name = predictor.endpoint_name
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
-------------------!
def query_endpoint(payload):
    # Invoke the deployed Code LLaMa endpoint and return the parsed JSON response
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload).encode('utf-8'),
        CustomAttributes="accept_eula=true",
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response

OUTPUT WITHOUT CUSTOMIZATION:
Below is the output without customizing the Code LLaMa-7b model:

def print_completion(prompt: str, response: dict) -> None:
    # Pretty-print the prompt and the model's generated text
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}{bold}\n> Output{unbold}\n{response['generated_text']}\n")
%%time
prompt = """\
import sagemaker
# Create an HTML page about Amazon SageMaker
html_content = f'''
<!DOCTYPE html>
<html>
<head>
<title>Amazon SageMaker</title>
</head>
<body>
<h1>Welcome to Amazon SageMaker</h1>
<p>Amazon SageMaker is a fully managed service for building, training, and deploying machine learning models.</p>
<h2>Key Features</h2>
<ul>
<li>Easy to use</li>
<li>Scalable</li>
<li>End-to-end machine learning workflow</li>
</ul>
<p>Get started with SageMaker today and unlock the power of machine learning!</p>
</body>
</html>
'''
html_content
"""
payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9}}
response = query_endpoint(payload)
print_completion(prompt, response)
> Input
import sagemaker
# Create an HTML page about Amazon SageMaker
html_content = f'''
<!DOCTYPE html>
<html>
<head>
<title>Amazon SageMaker</title>
</head>
<body>
<h1>Welcome to Amazon SageMaker</h1>
<p>Amazon SageMaker is a fully managed service for building, training, and deploying machine learning models.</p>
<h2>Key Features</h2>
<ul>
<li>Easy to use</li>
<li>Scalable</li>
<li>End-to-end machine learning workflow</li>
</ul>
<p>Get started with SageMaker today and unlock the power of machine learning!</p>
</body>
</html>
'''
html_content
> Output
# Create a SageMaker client
sagemaker_client = sagemaker.SageMakerClient()
# Create a SageMaker role
role = sagemaker.get_execution_role()
# Create a SageMaker notebook instance
sagemaker_notebook_instance = sagemaker.notebook_instance.NotebookInstance(
role=role,
instance_type='ml.t2.medium',
instance_count=1,
volume_size_in_gb=5,
volume_kms_key=None,
direct_internet_access='Disabled',
accelerator_types=None,
default_code_repository=None,
additional_code_repositories=None,
root_access=False,
lifecycle_config_name=None,
tags=None,
kms_key_id=None,
subnet_id=None,
security_group_ids=None,
role_arn=None,
lifecycle_config_arn=None,
display_name
CPU times: user 18.3 ms, sys: 1.86 ms, total: 20.2 ms
Wall time: 8.17 s

OUTPUT WITH CUSTOMIZATION USING RAG:
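
The cell below uses a few helpers defined in the notebook: get_context, format_instructions, and print_instructions. Their exact implementations are not reproduced here; as a hedged sketch (the collection variable and the formatting details are assumptions), get_context embeds the question and retrieves the most similar code chunks from the Chroma collection, and format_instructions wraps the chat turns in an instruction format:

# Hedged sketch of the notebook's helpers; names and details are assumptions.
def get_context(question, n_results=3):
    # Embed the question and pull the most similar code chunks from the Chroma collection
    print("Creating embedding for question")
    results = collection.query(query_texts=[question], n_results=n_results)
    docs = results["documents"][0]
    print(f"Found {len(docs)} context docs")
    return "\n\n".join(docs)

def format_instructions(instructions):
    # Wrap user turns in the [INST] ... [/INST] format used by instruction-tuned LLaMa models
    prompt = ""
    for turn in instructions:
        if turn["role"] == "user":
            prompt += f"<s>[INST] {turn['content']} [/INST]"
    return prompt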

prompt = """Write a python code that displays how a llama model can be trained"""
context = get_context(prompt)
prompt_data = f"""Use the following pieces of related code to respond to the request.
{context}
Request: {prompt}
"""
instructions = [
    {
        "role": "user",
        "content": prompt_data,
    }
]
prompt = format_instructions(instructions)
payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 1000, "temperature": 0.2, "top_p": 0.9}
}
response = query_endpoint(payload)
print_instructions(prompt, response)
Creating embedding for question
Found 3 context docs
> Output
        "\"\"\"\n",
"\n",
" return tokenize(full_prompt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5f7K7Q7p1CSK"
},
"source": [
"### 5. Dataset\n",
"Now that I have a tokenize function, I can create a dataset that I can use to train the model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "61-605r_-1CSK"
},
"outputs": [],
"source": [
"dataset = Dataset.from_pandas(train_df)\n",
"dataset = dataset.map(generate_and_tokenize_prompt, batched=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2q284RQf1CSK"
},
"source": [
"### 6. Model\n",
"Now that I have a dataset, I can create a model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8338f-_-1CSK"
},
"outputs": [],
"source": [
"model = AutoModelForSeq2SeqLM.from_pretrained(\n",
" \"EleutherAI/gpt-neo-125M\",\n",
" return_dict=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q5JQW-_-1CSK"
},
"source": [
"### 7. Training\n",
"Now that I have a model, I can train it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "A5JQW-_-1CSK"
},
"outputs": [],
"source": [
"trainer = Seq2SeqTrainer(\n",
" model,\n",
" args,\n",
" train_dataset=dataset,\n",
" eval_dataset=dataset,\n",
" tokenizer=tokenizer,\n",
" data_collator=default_data_collator,\n",
" compute_metrics=compute_metrics,\n",
" callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fYJQW-_-1CSK"
},
"source": [
"### 8. Training\n",
"Now that I have a trainer, I can train it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GYJQW-_-1CSK"
},
"outputs": [],
"source": [
"trainer.train()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HYJQW-_-1CSK"
},
"source": [
"### 9. Evaluation\n",
"Now that I have trained the model, I can evaluate it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id

This is a clear example of how generative AI can be customized for a specific use case and go above and beyond. We can imagine further extensions, such as putting agents on top of these RAG-customized models so they act as your own mini software engineer or frontend developer. Now, without further ado, let's talk about Amazon EKS in this context:

Amazon EKS: Use In Generative AI Business Solutions

Customers have a variety of use cases. Once a customer has chosen containers, Kubernetes is a natural next step for orchestrating them. EKS is AWS's managed Kubernetes offering, and its managed properties fit well for generative AI workloads with high compute requirements, heavy usage, and growing data. It provides automated scaling, increased resiliency, and resource isolation, and it can be very cost effective.

Take an example of a customer use case: say a customer needs to train a massive model for a generative AI solution, which comes with significant provisioning needs. Training and inference both consume a great deal of compute, cost, and infrastructure, and EKS helps with each of these challenges by using built-in, native Kubernetes capabilities to automate scaling and increase resiliency. For automated scaling, EKS provides an out-of-the-box path where you can scale based on configurable parameters for your workloads, and Auto Scaling groups (or managed node groups) can be leveraged.
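
As a hedged sketch of what that provisioning can look like (the cluster name, subnets, node role ARN, instance type, and sizes below are placeholders, not values from a real environment), a managed node group with a scaling configuration and Spot capacity can be created with boto3:

# Illustrative only: placeholder names and ARNs, not a real cluster.
import boto3

eks = boto3.client("eks")
eks.create_nodegroup(
    clusterName="genai-cluster",                      # placeholder cluster name
    nodegroupName="training-spot-nodes",
    scalingConfig={"minSize": 1, "maxSize": 10, "desiredSize": 2},
    subnets=["subnet-aaaa1111", "subnet-bbbb2222"],   # placeholder subnets
    instanceTypes=["g5.2xlarge"],
    capacityType="SPOT",                              # Spot instances for the most cost benefit
    nodeRole="arn:aws:iam::123456789012:role/eksNodeRole",  # placeholder role ARN
)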

In the case of resiliency, this is native functionality: a desired state is configured for the cluster, and Kubernetes controllers monitor the infrastructure and keep it aligned with that state. One thing to highlight is that customers sometimes want to use the same cluster across multiple platforms or organizational units, and Kubernetes provides native workload isolation (namespaces, quotas, and network policies) that makes sharing a cluster possible. Another benefit mentioned above is cost effectiveness: you can use a wide range of EC2 instances (Spot for the most cost benefit), depending on your throughput and latency requirements.
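
To make the tenant-isolation point concrete, here is a minimal, hedged sketch using the official Kubernetes Python client; the tenant name and quota values are assumptions for illustration:

# Minimal sketch: one namespace per tenant, with a resource quota to cap usage.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

tenant = "tenant-a"                # hypothetical tenant identifier
v1.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=tenant)))

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name=f"{tenant}-quota"),
    spec=client.V1ResourceQuotaSpec(hard={"requests.cpu": "8", "requests.memory": "32Gi"}),
)
v1.create_namespaced_resource_quota(namespace=tenant, body=quota)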

To read more on RAG, take a look at my blog here:
https://medium.com/@madhur.prashant7/some-dynamic-rag-implementation-non-hallucinating-fine-tuned-models-ed13b46f6a6d

RAG (Retrieval Augmented Generation): Retrieval Augmented Generation is the process of selecting the most relevant documents or pieces of information from a large database of pre-existing text and using them to augment the prompt, so the most relevant information is returned to the user. Essentially, what happens is this: we provide our data (CSV, PDF, and so on), chunk it using LangChain, create embeddings for those chunks, and store them in a vector DB for indexing. A user prompt is then also turned into an embedding, the most relevant documents are retrieved from the database based on that query, and they are passed to the model along with the prompt to successfully perform RAG.
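
As a compact, hedged sketch of that flow (the file name, collection name, and chunk sizes are assumptions, and Chroma's default embedding function is used for simplicity):

# Illustrative RAG flow: chunk -> embed -> store -> retrieve.
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter

raw_text = open("product_docs.txt").read()            # hypothetical source document
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(raw_text)

chroma = chromadb.Client()
collection = chroma.create_collection("tenant_docs")  # embeddings use Chroma's default function
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# Retrieve the most relevant chunks for a user question and build an augmented prompt
question = "How do I train the model on my data?"
hits = collection.query(query_texts=[question], n_results=3)
augmented_prompt = "\n\n".join(hits["documents"][0]) + f"\n\nRequest: {question}"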

Multi-Tenant Generative AI-Powered Chatbots in Existence

The reason we start by talking about chatbots is that the large majority of customers (roughly 90%, in my experience) use generative AI to build common chatbot use cases: customer support, product Q&A, billing and financial inquiries, payments and refunds, medical assistance, and so on.

Use of Multi-Tenancy → Transitioning to Agent LLMs

Multi-tenancy, in the context of generative AI solutions, is the smoothest way to transition to the near future → augmenting user actions through agent LLMs. We can use multi-tenancy with chatbots across various domains to specialize them on tenant-specific contextual data. Tenant-specific contextual data, combined with different generative AI models such as Anthropic's Claude V2, Amazon Titan, or Code LLaMa-7b, provides something extremely important: customer personalization that leads to increased product engagement, enhanced customer experience, and better retention.
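
As a hedged sketch of what a tenant-aware call to Claude V2 on Amazon Bedrock could look like (the tenant question, the reuse of the get_context helper, and the prompt wording are assumptions for illustration):

# Illustrative only: retrieve tenant-specific context, then invoke Claude V2 on Bedrock.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

tenant_context = get_context("How do refunds work for tenant A?")   # hypothetical per-tenant retrieval
body = json.dumps({
    "prompt": f"\n\nHuman: Use this tenant context:\n{tenant_context}\n\nAnswer the customer's refund question.\n\nAssistant:",
    "max_tokens_to_sample": 500,
    "temperature": 0.2,
})
response = bedrock.invoke_model(modelId="anthropic.claude-v2", body=body)
answer = json.loads(response["body"].read())["completion"]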

Let's take an example of how we can make a model such as Claude specialize in a specific domain (code walkthrough):

Here, the product we are aiming to create is a mini product partner for generative AI product managers: it takes in data about generative AI products (dummy data is used) and displays graphs and code so that PMs can plan and track execution and growth without any operational overhead.

To view the code, take a look at the product partner blog on my profile.
