Deploy Llama2-7B on AWS (Follow Along)
This blog follows the easiest flow to set up and maintain any Llama 2 model on the cloud. This post features the 7B model, but you can follow the same steps for 13B or 70B. It is divided into two sections:
Section 1: Deploy the model on AWS SageMaker
Section 2: Run it as an API in your application
Llama 2 is a collection of pre-trained and fine-tuned generative text models developed by Meta. These models range in scale from 7 billion to 70 billion parameters and are designed for various text-generation tasks. The models in the Llama 2 family, particularly the Llama-2-Chat variations, are optimized for dialogue use cases, outperforming open-source chat models in most benchmarks and being on par with some popular closed-source models like ChatGPT and PaLM in terms of helpfulness and safety.
Key Details:
- Training Data: Pretraining data includes a mix of publicly available online data while fine-tuning data includes instruction datasets and new human-annotated examples.
- Training Period: Trained between January 2023 and July 2023.
- Data Freshness: Pretraining data is up to September 2022, and some fine-tuning data is more recent, up to July 2023.
Evaluation Results:
Llama 2 models show improved performance compared to Llama 1 models on various evaluation benchmarks, including commonsense reasoning, world knowledge, reading comprehension, math, and other linguistic tasks.
1. Deploying on AWS Sagemaker
You need an AWS account with administrator privileges to run and deploy the Llama-2-7B model. First, log in and head to the Amazon SageMaker console (try to be in the us-east-1, N. Virginia region).
Request Quota:
The resources in Amazon SageMaker are not always granted by default, so it is worth a quick check.
Search for these service quotas under SageMaker:
- Total domains
- Maximum number of Studio user profiles allowed per account
- ml.g5.2xlarge for endpoint usage
- Maximum number of running Studio apps allowed per account
If the applied quota value is 0 for any of these services, you need to request a quota increase. You can track the requests in the quota request history; approval can take up to two days.
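If you prefer to script this check instead of clicking through the console, here is a minimal sketch using the Service Quotas API via boto3. The quota names are the ones listed above; `applied_values` and `check_sagemaker_quotas` are helper names introduced here for illustration.

```python
# Names of the SageMaker quotas we want to verify
WANTED = {
    "Total domains",
    "Maximum number of Studio user profiles allowed per account",
    "ml.g5.2xlarge for endpoint usage",
    "Maximum number of running Studio apps allowed per account",
}

def applied_values(quotas, wanted=WANTED):
    """Map quota name -> applied value, keeping only the quotas we care about."""
    return {q["QuotaName"]: q["Value"] for q in quotas if q["QuotaName"] in wanted}

def check_sagemaker_quotas(region="us-east-1"):
    """Fetch all SageMaker quotas and return the applied values for WANTED."""
    import boto3  # requires AWS credentials with servicequotas:ListServiceQuotas
    client = boto3.client("service-quotas", region_name=region)
    quotas = []
    # The quota list is paginated, so walk every page
    for page in client.get_paginator("list_service_quotas").paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return applied_values(quotas)
```

Any quota that comes back as 0 is one you need to request an increase for.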
Create Domain:
The very first task is to create a domain if you don't have one (you won't, if this is your first time in SageMaker):
- Select Quick Setup
- Choose a domain name
- You can keep the user profile name as default or change it if you want
- You will need to create a role if you don’t have any.
- Choose "Any S3 bucket" and hit Create.
This is how it should look; hit Submit to create the domain.
If there was an error during the creation of your domain, it probably stems from issues with user permissions or VPC configuration.
Launch Studio and Deploy Model
After you successfully create your domain and user profile, launch SageMaker Studio.
Go to JumpStart and search for Llama-2-7b-chat.
You can leave all configs at their defaults. ml.g5.2xlarge is the smallest instance type that can run Llama-2-7B; it costs $1.515/hr, or $36.36/day if you leave it running :-)
Click Deploy to deploy the model as an endpoint. You will need to accept the license agreement, and the deployment will take a few minutes.
At this point your model is deployed and you can run inference (queries) against it by opening the notebook from the Llama-2-7b-chat model page and testing the model.
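If you want to query the endpoint directly rather than through the JumpStart notebook, here is a minimal sketch with boto3. The endpoint name is whatever JumpStart assigned to your deployment; `build_chat_payload` and `query_endpoint` are illustrative helper names, not part of the SageMaker SDK.

```python
import json

def build_chat_payload(query, system_prompt=None, max_new_tokens=256,
                       temperature=0.01, top_p=0.9):
    """Llama-2 chat endpoints expect a list of dialogs, each a list of role/content turns."""
    dialog = []
    if system_prompt:
        dialog.append({"role": "system", "content": system_prompt})
    dialog.append({"role": "user", "content": query})
    return {
        "inputs": [dialog],
        "parameters": {"max_new_tokens": max_new_tokens,
                       "temperature": temperature, "top_p": top_p},
    }

def query_endpoint(endpoint_name, query, region="us-east-1"):
    """Send one chat turn to the deployed endpoint and return the generated text."""
    import boto3  # requires AWS credentials with sagemaker:InvokeEndpoint
    runtime = boto3.client("sagemaker-runtime", region_name=region)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(build_chat_payload(query)),
        CustomAttributes="accept_eula=true",  # the license you accepted at deploy time
    )
    result = json.loads(response["Body"].read())
    return result[0]["generation"]["content"]
```

The same payload shape and `accept_eula` attribute come back in the Lambda function of section 2.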
2. Run as an API
Create IAM role for AWS Lambda
Go to IAM > Roles > create role
Select AWS Service and lambda service and click Next.
Search for these two policies, and click Next
- CloudWatchFullAccess
- AmazonSageMakerFullAccess
These are probably overkill for the task at hand, but they keep things simple.
Add your role name and description (optional), and verify the policies you selected are added as permissions to the role.
Click Create role to create.
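The same role can also be created programmatically. A sketch with boto3, assuming your credentials can manage IAM (the role name is yours to choose; `lambda_trust_policy` and `create_lambda_role` are illustrative helpers):

```python
import json

# The two managed policies from the console steps above
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/CloudWatchFullAccess",
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
]

def lambda_trust_policy():
    """Trust policy that lets the AWS Lambda service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

def create_lambda_role(role_name):
    """Create the execution role and attach both managed policies."""
    import boto3  # requires IAM create/attach permissions
    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(lambda_trust_policy()),
        Description="Execution role for the Llama2 inference Lambda",
    )
    for arn in POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return role["Role"]["Arn"]
```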
Create Lambda function
Go to Lambda > Create function
- Author from scratch
- Give it a name
- Select runtime as Python 3.11
- Change default execution role > Use an existing role, and select the role you just created.
Click Create function (leave the advanced settings at their defaults).
import os
import boto3
import json

# Grab environment variables
ENDPOINT_NAME = os.environ['ENDPOINT_NAME']
runtime = boto3.client('runtime.sagemaker')


def get_payload(query: str, prompt: str | None = None, max_new_tokens: int = 4000,
                top_p: float = 0.9, temperature: float = 0.01) -> dict:
    """Build the Llama 2 chat payload, with an optional system prompt."""
    if prompt:
        inputs = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": query},
        ]
    else:
        inputs = [{"role": "user", "content": query}]
    payload = {
        "inputs": [inputs],
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }
    return payload


def lambda_handler(event, context):
    query = event["query"]
    if "prompt" in event:
        payload = get_payload(query, event["prompt"])
    else:
        payload = get_payload(query)
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=json.dumps(payload),
        CustomAttributes="accept_eula=true",
    )
    result = json.loads(response['Body'].read().decode())[0]
    output = result['generation']['content']
    print(result)
    return {
        "statusCode": 200,
        "body": output,
    }
Copy this code into your Lambda function, and go to Configuration.
Under General configuration, click Edit and change the timeout from 3 sec to 1 min 3 sec (the max is 15 min, but we don't need that much).
Edit the environment variables and add your ENDPOINT_NAME (it was shown on the deployment page).
You can find it again under SageMaker > Inference > Endpoints (or on the Studio deployment page, if you still have it running).
After that, you can deploy your Lambda and run a quick test.
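For the quick test, a minimal test event might look like this. The field names match what `lambda_handler` reads; "prompt" is optional, and the query text here is just a placeholder.

```python
import json

# Sample event for the Lambda console's Test tab; "prompt" may be omitted
test_event = {
    "query": "What is the capital of France?",
    "prompt": "You answer in one short sentence.",
}

# Paste the JSON form into the Test tab's event editor
print(json.dumps(test_event, indent=2))
```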
If everything went well, this should be your output response:
Rest API with API Gateway
Go to API Gateway. From APIs > REST API > Build > New API > Create API.
Go to Actions > Create Method > Post
Click Save
Finally go to Actions > API Actions > Deploy API
Save the changes, scroll up to copy the Invoke URL (you can also find it on your Lambda function in the Triggers section), and there you have it.
import requests


def llama_chain(query):
    api_url = 'https://n0f3c5se9l.execute-api.us-east-1.amazonaws.com/prod/'  # Replace this with your apigw URL
    prompt = "You are an expert mathematician given a user query do a step by step reasoning, and then generate an answer"
    body = {"query": query, "prompt": prompt}
    r = requests.post(api_url, json=body)
    answer = r.json()["body"].strip()
    return answer


llama_chain("what is 2 + 2")
You can run this function to call your API Gateway (the prompt field is optional in the JSON). Delete the endpoint if you are no longer using it, either from the SageMaker Studio deployment page or from SageMaker > Inference > Endpoints/Models/Endpoint configurations.
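Cleanup can also be scripted so the endpoint, its config, and the model all go away in one shot. A minimal sketch, assuming you pass in `boto3.client("sagemaker")` and your endpoint name (`delete_inference_resources` is an illustrative helper name):

```python
def delete_inference_resources(sm_client, endpoint_name):
    """Tear down the endpoint, then its config and model, to stop all charges."""
    # Look up the endpoint config and the models it references before deleting
    desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = desc["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for variant in config["ProductionVariants"]:
        sm_client.delete_model(ModelName=variant["ModelName"])
```

Usage would look like `delete_inference_resources(boto3.client("sagemaker"), "your-endpoint-name")`; the delete itself is asynchronous, so the endpoint may show as Deleting for a minute.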
Comment below if you face any issues. I plan to create an app on top of this API for RAG (chat with your data) using LangChain and Pinecone/Chroma.
Also, my team and I have compiled some of the most advanced features in modern-day AI and RAG to cater to niche domains and complex applications.
Need to further optimize your LLM app performance with the support of top LLM experts or start from scratch?
Contact me on LinkedIn
Or set up a one-to-one call