Deploy Llama2-7B on AWS (Follow Along)
This blog follows the easiest flow to set up and maintain any Llama 2 model on the cloud. This post features the 7B model, but you can follow the same steps for 13B or 70B. It is divided into two sections:
Section 1: Deploy the model on AWS SageMaker
Section 2: Run it as an API in your application
Llama 2 is a collection of pre-trained and fine-tuned generative text models developed by Meta. These models range in scale from 7 billion to 70 billion parameters and are designed for various text-generation tasks. The models in the Llama 2 family, particularly the Llama-2-Chat variations, are optimized for dialogue use cases, outperforming open-source chat models in most benchmarks and being on par with some popular closed-source models like ChatGPT and PaLM in terms of helpfulness and safety.
Key Details:
- Training Data: Pretraining data includes a mix of publicly available online data while fine-tuning data includes instruction datasets and new human-annotated examples.
- Training Period: Trained between January 2023 and July 2023.
- Data Freshness: Pretraining data is up to September 2022, and some fine-tuning data is more recent, up to July 2023.
Evaluation Results:
Llama 2 models show improved performance compared to Llama 1 models on various evaluation benchmarks, including commonsense reasoning, world knowledge, reading comprehension, math, and other linguistic tasks.
1. Deploying on AWS Sagemaker
You need an AWS account with administrator privileges to run and deploy the Llama-2-7B model. First, log in and head to the Amazon SageMaker console (try to be in the us-east-1, N. Virginia region).
Request Quota:
The resources in Amazon SageMaker are not always granted by default, so it is worth a quick check.
Search for these service quotas under SageMaker:
- Total domains
- Maximum number of Studio user profiles allowed per account
- ml.g5.2xlarge for endpoint usage
- Maximum number of running Studio apps allowed per account
If the applied quota value is 0 for any of these services, you need to request a quota increase. You can track the requests in the quota request history; approval can take up to two days.
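If you prefer to script this check instead of clicking through the console, here is a minimal sketch using the Service Quotas API via boto3. The quota names are the ones listed above; `applied_values` and `check_sagemaker_quotas` are helper names introduced here for illustration.

```python
# Names of the SageMaker quotas we want to verify
WANTED = {
    "Total domains",
    "Maximum number of Studio user profiles allowed per account",
    "ml.g5.2xlarge for endpoint usage",
    "Maximum number of running Studio apps allowed per account",
}

def applied_values(quotas, wanted=WANTED):
    """Map quota name -> applied value, keeping only the quotas we care about."""
    return {q["QuotaName"]: q["Value"] for q in quotas if q["QuotaName"] in wanted}

def check_sagemaker_quotas(region="us-east-1"):
    """Fetch all SageMaker quotas and return the applied values for WANTED."""
    import boto3  # requires AWS credentials with servicequotas:ListServiceQuotas
    client = boto3.client("service-quotas", region_name=region)
    quotas = []
    # The quota list is paginated, so walk every page
    for page in client.get_paginator("list_service_quotas").paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return applied_values(quotas)
```

Any quota that comes back as 0 is one you need to request an increase for.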
Create Domain:
The very first task is to create a domain if you don't have one (you won't, if this is your first time in SageMaker):
- Select Quick Setup
- Choose a domain name
- You can keep the user profile name as default or change it if you want
- You will need to create a role if you don’t have any.
- Choose "Any S3 bucket" and hit Create.
This is how it should look; hit Submit to create the domain.
If there was an error during the creation of your domain, it probably stems from issues with user permissions or VPC configuration.
Launch Studio and Deploy Model
After you successfully create your domain and user profile, launch SageMaker Studio.
Go to JumpStart and search for Llama-2-7b-chat.
You can leave all configs at their defaults. ml.g5.2xlarge is the smallest instance type that can run Llama-2-7B; it costs $1.515/hr, or $36.36/day if you leave it running :-)
Click Deploy to deploy the model as an endpoint. You will need to accept the license agreement, and the deployment will take a few minutes.
At this point your model is deployed and you can run inference (queries) against it by opening the notebook from the Llama-2-7b-chat model page and testing the model.
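If you want to query the endpoint directly rather than through the JumpStart notebook, here is a minimal sketch with boto3. The endpoint name is whatever JumpStart assigned to your deployment; `build_chat_payload` and `query_endpoint` are illustrative helper names, not part of the SageMaker SDK.

```python
import json

def build_chat_payload(query, system_prompt=None, max_new_tokens=256,
                       temperature=0.01, top_p=0.9):
    """Llama-2 chat endpoints expect a list of dialogs, each a list of role/content turns."""
    dialog = []
    if system_prompt:
        dialog.append({"role": "system", "content": system_prompt})
    dialog.append({"role": "user", "content": query})
    return {
        "inputs": [dialog],
        "parameters": {"max_new_tokens": max_new_tokens,
                       "temperature": temperature, "top_p": top_p},
    }

def query_endpoint(endpoint_name, query, region="us-east-1"):
    """Send one chat turn to the deployed endpoint and return the generated text."""
    import boto3  # requires AWS credentials with sagemaker:InvokeEndpoint
    runtime = boto3.client("sagemaker-runtime", region_name=region)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(build_chat_payload(query)),
        CustomAttributes="accept_eula=true",  # the license you accepted at deploy time
    )
    result = json.loads(response["Body"].read())
    return result[0]["generation"]["content"]
```

The same payload shape and `accept_eula` attribute come back in the Lambda function of section 2.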
2. Run as an API
Create IAM role for AWS Lambda
Go to IAM > Roles > create role
Select AWS Service and lambda service and click Next.
Search for these two policies, and click Next
- CloudWatchFullAccess
- AmazonSageMakerFullAccess
These are probably overkill for the task at hand, but they keep things simple.
Add your role name and description (optional), and verify the policies you selected are added as permissions to the role.
Click Create role to create.
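The same role can also be created programmatically. A sketch with boto3, assuming your credentials can manage IAM (the role name is yours to choose; `lambda_trust_policy` and `create_lambda_role` are illustrative helpers):

```python
import json

# The two managed policies from the console steps above
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/CloudWatchFullAccess",
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
]

def lambda_trust_policy():
    """Trust policy that lets the AWS Lambda service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

def create_lambda_role(role_name):
    """Create the execution role and attach both managed policies."""
    import boto3  # requires IAM create/attach permissions
    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(lambda_trust_policy()),
        Description="Execution role for the Llama2 inference Lambda",
    )
    for arn in POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return role["Role"]["Arn"]
```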
Create Lambda function
Go to Lambda > Create function
- Author from scratch
- Give it a name
- Select runtime as Python 3.11
- Change default execution role > Use an existing role, and select the role you just created.
Click Create function (leave the advanced settings at their defaults).
import os
import boto3
import json

# Grab environment variables
ENDPOINT_NAME = os.environ['ENDPOINT_NAME']
runtime = boto3.client('runtime.sagemaker')


def get_payload(query: str, prompt: str | None = None, max_new_tokens: int = 4000,
                top_p: float = 0.9, temperature: float = 0.01) -> dict:
    """Build the Llama 2 chat payload, with an optional system prompt."""
    if prompt:
        inputs = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": query},
        ]
    else:
        inputs = [{"role": "user", "content": query}]
    payload = {
        "inputs": [inputs],
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }
    return payload


def lambda_handler(event, context):
    query = event["query"]
    if "prompt" in event:
        payload = get_payload(query, event["prompt"])
    else:
        payload = get_payload(query)
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=json.dumps(payload),
        CustomAttributes="accept_eula=true",
    )
    result = json.loads(response['Body'].read().decode())[0]
    output = result['generation']['content']
    print(result)
    return {
        "statusCode": 200,
        "body": output,
    }
Copy this code into your Lambda function, and go to Configuration.
Under General configuration, click Edit and change the timeout from 3 sec to 1 min 3 sec (the max is 15 min, but we don't need that much).
Edit the environment variables and add your ENDPOINT_NAME (it was shown on the deployment page).
You can find it again under SageMaker > Inference > Endpoints (or on the Studio deployment page, if you still have it running).
After that, you can deploy your Lambda and run a quick test.
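For the quick test, a minimal test event might look like this. The field names match what `lambda_handler` reads; "prompt" is optional, and the query text here is just a placeholder.

```python
import json

# Sample event for the Lambda console's Test tab; "prompt" may be omitted
test_event = {
    "query": "What is the capital of France?",
    "prompt": "You answer in one short sentence.",
}

# Paste the JSON form into the Test tab's event editor
print(json.dumps(test_event, indent=2))
```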
If everything went well, this should be your output response:
Rest API with API Gateway
Go to API Gateway. From APIs > REST API > Build > New API > Create API.
Go to Actions > Create Method > Post
Click Save
Finally go to Actions > API Actions > Deploy API
Save the changes, scroll up to copy the Invoke URL (you can also find it on your Lambda function in the Triggers section), and there you have it.
import requests


def llama_chain(query):
    api_url = 'https://n0f3c5se9l.execute-api.us-east-1.amazonaws.com/prod/'  # Replace this with your apigw URL
    prompt = "You are an expert mathematician given a user query do a step by step reasoning, and then generate an answer"
    body = {"query": query, "prompt": prompt}
    r = requests.post(api_url, json=body)
    answer = r.json()["body"].strip()
    return answer


llama_chain("what is 2 + 2")
You can run this function to call your API Gateway (the prompt field is optional in the JSON). Delete the endpoint if you are no longer using it, either from the SageMaker Studio deployment page or from SageMaker > Inference > Endpoints/Models/Endpoint configurations.
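Cleanup can also be scripted so the endpoint, its config, and the model all go away in one shot. A minimal sketch, assuming you pass in `boto3.client("sagemaker")` and your endpoint name (`delete_inference_resources` is an illustrative helper name):

```python
def delete_inference_resources(sm_client, endpoint_name):
    """Tear down the endpoint, then its config and model, to stop all charges."""
    # Look up the endpoint config and the models it references before deleting
    desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = desc["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for variant in config["ProductionVariants"]:
        sm_client.delete_model(ModelName=variant["ModelName"])
```

Usage would look like `delete_inference_resources(boto3.client("sagemaker"), "your-endpoint-name")`; the delete itself is asynchronous, so the endpoint may show as Deleting for a minute.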
Comment below if you face any issues. I plan to create an app on top of this API for RAG (chat with your data) using LangChain and Pinecone/Chroma.
Also, my team and I have compiled some of the most advanced features in modern-day AI and RAG to cater to niche domains and complex applications.
Need to further optimize your LLM app performance with the support of top LLM experts or start from scratch?
Contact me on LinkedIn
Or set up a one-to-one call