How to run Llama Model 🦙 with Chat-UI 💬 on Amazon EC2
In this article, we'll explore how to deploy Chat-UI and a Llama model on Amazon EC2 for your own customized HuggingChat experience using open-source tools. We'll cover setting up the environment, installing the necessary software, and running local inference with your choice of Llama model. By the end of this guide, you'll have a fully functional HuggingChat that you can customize to your preferences.
- Special credit and thanks to Trelis Research for publishing excellent content: "Deploy Llama 2 for your Entire Organisation" 👈
Overview
- Launch EC2 instance using AWS CloudFormation template
- Run Llama model in TGI container using Docker and Quantization
- Install and run Chat-UI
Step 1 — Create an Amazon EC2 Instance
1-A. Create Amazon EC2 g4dn.2xlarge instance using AWS CloudFormation
- Region: us-east-1
- AMI: “ami-0649417d1ede3c91a”
- Instance: g4dn.2xlarge
- EBS volume: 512 GB
👉 g4dn.2xlarge (16 GB GPU memory): $0.752/hr On-Demand price
AWS CloudFormation Template — chat-ui.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 Instance
Parameters:
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access to the instance
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: must be the name of an existing EC2 KeyPair.
Mappings:
  RegionToAmiId:
    us-east-1:
      AMI: ami-0649417d1ede3c91a
Resources:
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub ${AWS::StackName}-sg
      GroupDescription: Security group for EC2 instance
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0
  EC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: g4dn.2xlarge
      ImageId: !FindInMap [RegionToAmiId, !Ref AWS::Region, AMI]
      KeyName: !Ref KeyName
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 512
            VolumeType: gp2
      Tags:
        - Key: Name
          Value: chatui-instance
      SecurityGroups:
        - !Ref SecurityGroup
Outputs:
  PublicDNS:
    Description: Public DNSName of the newly created EC2 instance
    Value: !GetAtt [EC2Instance, PublicDnsName]
  PublicIP:
    Description: Public IP address of the newly created EC2 instance
    Value: !GetAtt [EC2Instance, PublicIp]
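Note that the security group above only opens port 22 (SSH). Chat-UI's port 5173 is never exposed publicly; we reach it through SSH local port forwarding in Step 1-C.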
Go to AWS CloudFormation > Create stack:
- Step 1 (Create stack): upload your template file and choose Next.
- Step 2 (Specify stack details): enter a Stack name and the KeyName, then choose Next.
- Step 3 (Configure stack options): keep the default settings and choose Next.
- Step 4 (Review): review your settings and Submit.
1-B. SSH to Amazon EC2 instance
# Terminal 1: SSH to Amazon EC2 instance
ssh -i "us-east-1-key.pem" ubuntu@ec2-###-##-##-###.compute-1.amazonaws.com
1-C. SSH port forwarding to access Chat-UI
# Terminal 2: SSH local port forwarding to Chat-UI
ssh -i "us-east-1-key.pem" -N -L 5173:localhost:5173 ubuntu@ec2-###-##-##-###.compute-1.amazonaws.com
Step 2 — Run Llama model in TGI container using Docker and Quantization
Text Generation Inference (TGI) — The easiest way of getting started is using the official Docker container.
Source: https://huggingface.co/docs/text-generation-inference/quicktour
We will use Docker to run the TGI container with bitsandbytes quantization.
Option 1 — Running Llama 2 7B/13B Model (Gated Model)
# model=meta-llama/Llama-2-7b-chat-hf
model=meta-llama/Llama-2-13b-chat-hf
token="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
volume=$PWD/data
docker run --gpus all \
--shm-size 1g \
-e HUGGING_FACE_HUB_TOKEN=$token \
-p 8080:80 \
-v $volume:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id $model \
--quantize bitsandbytes-nf4 \
--max-input-length 2048 \
--max-total-tokens 4096
👉 Hugging Face User Access Token: https://huggingface.co/docs/hub/en/security-tokens
Option 2 — Running Code Llama 7B/13B Model
model=codellama/CodeLlama-7b-Instruct-hf
# model=codellama/CodeLlama-13b-Instruct-hf
volume=$PWD/data
docker run --gpus all \
--shm-size 1g \
-p 8080:80 \
-v $volume:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id $model \
--quantize bitsandbytes-nf4 \
--max-input-length 2048 \
--max-total-tokens 4096
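In either case, once TGI has finished downloading the weights and is listening, you can sanity-check the endpoint from the instance with a quick request (the prompt below is just an example):

# Test the TGI /generate endpoint with a sample prompt
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":50}}' \
  -H 'Content-Type: application/json'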
👉 Quantization with bitsandbytes
bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn’t require a calibration dataset or any post-processing — weights are automatically quantized on load.
# 8 bit quantization
--quantize bitsandbytes
# 4 bit quantization
--quantize bitsandbytes-nf4
--quantize bitsandbytes-fp4
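As a rough back-of-the-envelope check of why this matters here: a 13B-parameter model in fp16 needs about 13B × 2 bytes ≈ 26 GB for the weights alone, which cannot fit in the g4dn.2xlarge's 16 GB of GPU memory, whereas 4-bit NF4 needs roughly 13B × 0.5 bytes ≈ 6.5 GB, leaving headroom for the KV cache and activations.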
Step 3 — Install and run Chat-UI
Chat UI — Open source codebase powering the HuggingChat app
Github repo: https://github.com/huggingface/chat-ui
3-A. Chat-UI Installation
# Clone the repo
git clone https://github.com/huggingface/chat-ui
# Start a Mongo Database
docker run -d -p 27017:27017 --name mongo-chatui mongo:latest
# install nvm & npm
wget https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh
bash install.sh
# Close and reopen your terminal to start using nvm or run the following to use it now:
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion" # This loads nvm bash_completion
# nvm install node
nvm install node
npm --version
10.1.0
# npm install
cd chat-ui
npm install
👉 You can use the MongoDB Atlas free tier as well.
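If you go with Atlas instead of the local container, point MONGODB_URL in .env.local at your cluster's connection string. The values below are placeholders you would copy from the Atlas console:

# .env.local: MongoDB Atlas connection string (placeholder values)
MONGODB_URL=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/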
3-B. Customize Chat-UI setting
Create a .env.local file with the example settings below. 👇
# .env.local: custom setting
MONGODB_URL=mongodb://localhost:27017/
PUBLIC_APP_NAME="🦙 Llama Chat UI 💬"
PUBLIC_APP_ASSETS=chatui
PUBLIC_APP_COLOR=green
MODELS=`[
  {
    "name": "meta-llama/Llama-2-7b-chat-hf",
    "datasetName": "meta-llama/Llama-2-7b-chat-hf",
    "description": "Llama 7B Chat",
    "websiteUrl": "https://ai.meta.com/llama/",
    "userMessageToken": "[INST]",
    "assistantMessageToken": "[/INST]",
    "messageEndToken": "</s>",
    "preprompt": "[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>\n\n[/INST]",
    "parameters": {
      "temperature": 0.6,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "top_k": 50,
      "truncate": 1000,
      "max_new_tokens": 1024
    },
    "promptExamples": [
      {
        "title": "Write an email from bullet list",
        "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
      },
      {
        "title": "Code a snake game",
        "prompt": "Code a basic snake game in python, give explanations for each step."
      },
      {
        "title": "Assist in a task",
        "prompt": "How do I make a delicious lemon cheesecake?"
      }
    ],
    "endpoints": [
      {
        "url": "http://127.0.0.1:8080"
      }
    ]
  }
]`
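The "url" under "endpoints" points at the TGI container from Step 2, so it must match the host port you published with docker run (-p 8080:80). The "name" field is what Chat-UI displays in the model picker; the actual responses come from whichever model your TGI endpoint is serving.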
3-C. Run Chat-UI
# run Chat-UI
npm run dev
# Open local URL in your web browser
http://localhost:5173/
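Because the dev server binds to localhost on the EC2 instance, this URL works on your local machine through the SSH tunnel you set up in Step 1-C.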
Chat-UI WebSearch 2.0
🆕 Websearch 2.0, now with RAG & sources 👍
You can enable the web search by adding either SERPER_API_KEY (serper.dev) or SERPAPI_KEY (serpapi.com) to your .env.local.
SERPER_API_KEY="YOUR_API_KEY_HERE"
or
SERPAPI_KEY="YOUR_API_KEY_HERE"
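If Chat-UI was already running, you may need to restart npm run dev after editing .env.local so the new key is picked up.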