How to run Llama Model 🦙 with Chat-UI 💬 on Amazon EC2
In this article, we'll explore how to deploy Chat-UI and a Llama model on Amazon EC2 for your own customized HuggingChat experience using open-source tools. We'll cover setting up the environment, installing the necessary software, and running local inference with your choice of Llama model. By the end of this guide, you'll have a fully functional HuggingChat that you can customize to your preferences.
- Special credit and thanks to Trelis Research for publishing excellent content: "Deploy Llama 2 for your Entire Organisation" 👈
Overview
- Launch EC2 instance using AWS CloudFormation template
- Run Llama model in TGI container using Docker and Quantization
- Install and run Chat-UI
Step 1 — Create an Amazon EC2 Instance
1-A. Create Amazon EC2 g4dn.2xlarge instance using AWS CloudFormation
- Region: us-east-1
- AMI: “ami-0649417d1ede3c91a”
- Instance: g4dn.2xlarge
- EBS volume: 512 GB
👉 g4dn.2xlarge (16 GB GPU memory): $0.752/hr On-Demand price
AWS CloudFormation Template — chat-ui.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 Instance
Parameters:
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access to the instance
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: must be the name of an existing EC2 KeyPair.
Mappings:
  RegionToAmiId:
    us-east-1:
      AMI: ami-0649417d1ede3c91a
Resources:
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub ${AWS::StackName}-sg
      GroupDescription: Security group for EC2 instance
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0
  EC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: g4dn.2xlarge
      ImageId: !FindInMap [RegionToAmiId, !Ref AWS::Region, AMI]
      KeyName: !Ref KeyName
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 512
            VolumeType: gp2
      Tags:
        - Key: Name
          Value: chatui-instance
      SecurityGroups:
        - !Ref SecurityGroup
Outputs:
  PublicDNS:
    Description: Public DNSName of the newly created EC2 instance
    Value: !GetAtt [EC2Instance, PublicDnsName]
  PublicIP:
    Description: Public IP address of the newly created EC2 instance
    Value: !GetAtt [EC2Instance, PublicIp]
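Note that the security group above only opens port 22 (SSH). Chat-UI's port 5173 is never exposed publicly; we reach it through SSH local port forwarding in Step 1-C.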
Go to AWS CloudFormation > Create stack:
- Step 1 (Create stack): upload your template file and choose Next.
- Step 2 (Specify stack details): enter a Stack name and the KeyName, then choose Next.
- Step 3 (Configure stack options): keep the default settings and choose Next.
- Step 4 (Review): review your settings and Submit.
1-B. SSH to Amazon EC2 instance
# Terminal 1: SSH to Amazon EC2 instance
ssh -i "us-east-1-key.pem" ubuntu@ec2-###-##-##-###.compute-1.amazonaws.com
1-C. SSH port forwarding to access Chat-UI
# Terminal 2: SSH local port forwarding to Chat-UI
ssh -i "us-east-1-key.pem" -N -L 5173:localhost:5173 ubuntu@ec2-###-##-##-###.compute-1.amazonaws.com
Step 2 — Run Llama model in TGI container using Docker and Quantization
Text Generation Inference (TGI) — The easiest way of getting started is using the official Docker container.
Source: https://huggingface.co/docs/text-generation-inference/quicktour
We will use Docker to run the TGI container with bitsandbytes quantization.
Option 1 — Running Llama 2 7B/13B Model (Gated Model)
# model=meta-llama/Llama-2-7b-chat-hf
model=meta-llama/Llama-2-13b-chat-hf
token="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
volume=$PWD/data
docker run --gpus all \
--shm-size 1g \
-e HUGGING_FACE_HUB_TOKEN=$token \
-p 8080:80 \
-v $volume:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id $model \
--quantize bitsandbytes-nf4 \
--max-input-length 2048 \
--max-total-tokens 4096
👉 Hugging Face User Access Token: https://huggingface.co/docs/hub/en/security-tokens
Option 2 — Running Code Llama 7B/13B Model
model=codellama/CodeLlama-7b-Instruct-hf
# model=codellama/CodeLlama-13b-Instruct-hf
volume=$PWD/data
docker run --gpus all \
--shm-size 1g \
-p 8080:80 \
-v $volume:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id $model \
--quantize bitsandbytes-nf4 \
--max-input-length 2048 \
--max-total-tokens 4096
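In either case, once TGI has finished downloading the weights and is listening, you can sanity-check the endpoint from the instance with a quick request (the prompt below is just an example):

# Test the TGI /generate endpoint with a sample prompt
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":50}}' \
  -H 'Content-Type: application/json'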
👉 Quantization with bitsandbytes
bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn’t require a calibration dataset or any post-processing — weights are automatically quantized on load.
# 8 bit quantization
--quantize bitsandbytes
# 4 bit quantization
--quantize bitsandbytes-nf4
--quantize bitsandbytes-fp4
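As a rough back-of-the-envelope check of why this matters here: a 13B-parameter model in fp16 needs about 13B × 2 bytes ≈ 26 GB for the weights alone, which cannot fit in the g4dn.2xlarge's 16 GB of GPU memory, whereas 4-bit NF4 needs roughly 13B × 0.5 bytes ≈ 6.5 GB, leaving headroom for the KV cache and activations.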
Step 3 — Install and run Chat-UI
Chat UI — Open source codebase powering the HuggingChat app
Github repo: https://github.com/huggingface/chat-ui
3-A. Chat-UI Installation
# Clone the repo
git clone https://github.com/huggingface/chat-ui
# Start a Mongo Database
docker run -d -p 27017:27017 --name mongo-chatui mongo:latest
# install nvm & npm
wget https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh
bash install.sh
# Close and reopen your terminal to start using nvm or run the following to use it now:
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion" # This loads nvm bash_completion
# nvm install node
nvm install node
npm --version
10.1.0
# npm install
cd chat-ui
npm install
👉 You can use the MongoDB Atlas free tier as well.
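If you go with Atlas instead of the local container, point MONGODB_URL in .env.local at your cluster's connection string. The values below are placeholders you would copy from the Atlas console:

# .env.local: MongoDB Atlas connection string (placeholder values)
MONGODB_URL=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/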
3-B. Customize Chat-UI setting
Create a .env.local file with the example settings below. 👇
# .env.local: custom setting
MONGODB_URL=mongodb://localhost:27017/
PUBLIC_APP_NAME="🦙 Llama Chat UI 💬"
PUBLIC_APP_ASSETS=chatui
PUBLIC_APP_COLOR=green
MODELS=`[
  {
    "name": "meta-llama/Llama-2-7b-chat-hf",
    "datasetName": "meta-llama/Llama-2-7b-chat-hf",
    "description": "Llama 7B Chat",
    "websiteUrl": "https://ai.meta.com/llama/",
    "userMessageToken": "[INST]",
    "assistantMessageToken": "[/INST]",
    "messageEndToken": "</s>",
    "preprompt": "[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>\n\n[/INST]",
    "parameters": {
      "temperature": 0.6,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "top_k": 50,
      "truncate": 1000,
      "max_new_tokens": 1024
    },
    "promptExamples": [
      {
        "title": "Write an email from bullet list",
        "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
      },
      {
        "title": "Code a snake game",
        "prompt": "Code a basic snake game in python, give explanations for each step."
      },
      {
        "title": "Assist in a task",
        "prompt": "How do I make a delicious lemon cheesecake?"
      }
    ],
    "endpoints": [
      {
        "url": "http://127.0.0.1:8080"
      }
    ]
  }
]`
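The "url" under "endpoints" points at the TGI container from Step 2, so it must match the host port you published with docker run (-p 8080:80). The "name" field is what Chat-UI displays in the model picker; the actual responses come from whichever model your TGI endpoint is serving.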
3-C. Run Chat-UI
# run Chat-UI
npm run dev
# Open local URL in your web browser
http://localhost:5173/
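Because the dev server binds to localhost on the EC2 instance, this URL works on your local machine through the SSH tunnel you set up in Step 1-C.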
Chat-UI WebSearch 2.0
🆕 Websearch 2.0, now with RAG & sources 👍
You can enable the web search by adding either SERPER_API_KEY (serper.dev) or SERPAPI_KEY (serpapi.com) to your .env.local.
SERPER_API_KEY="YOUR_API_KEY_HERE"
or
SERPAPI_KEY="YOUR_API_KEY_HERE"
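If Chat-UI was already running, you may need to restart npm run dev after editing .env.local so the new key is picked up.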