Self-host an LLM with EC2, vLLM, LangChain, FastAPI, an LLM cache, and a Hugging Face model

Chinmay Deshpande
11 min read · Nov 22, 2023


Are you excited about recent advances in AI, large language models, and related technologies, and do you want hands-on experience building LLM APIs, playing with LLMs, getting an LLM to production, serving real-world traffic, and doing it all efficiently? Then you are at the right place!

This tutorial will walk you through hosting an LLM on an AWS EC2 instance with vLLM and LangChain, serving LLM inference with FastAPI, using an LLM caching mechanism to cache requests for faster serving, and using a base or fine-tuned model from Hugging Face. Let's get started…

There are many third-party providers that offer LLM and AI functionality through their APIs and return LLM responses to you. Models such as OpenAI's GPT-4, Anthropic's Claude 2.1, and AWS Bedrock's Titan are all amazing models that deliver excellent text-generation results, but third-party providers offer little flexibility for accuracy and latency optimizations, and you pay for every API call. Self-hosting your own LLM and serving an inference API comes with its own set of challenges, but it provides clear advantages:

  1. Self-hosting an LLM is more cost effective than calling a provider's API
  2. You can apply latency optimizations using open-source frameworks and tools; provider APIs give you no way to apply such optimizations
  3. You can improve model performance and accuracy by fine-tuning for your specific use case and repeating the process until the LLM behaves as you need

Here is a quick look at how pricing works for third-party LLM providers:

Pricing is based on the number of tokens, so depending on how many requests per second your customers make, the bill for LLM responses can easily climb into the millions of dollars.
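To get a feel for the math, here is a rough back-of-the-envelope estimate. The per-token prices and traffic numbers below are made up for illustration only, so plug in your provider's real pricing:

# Back-of-the-envelope API cost estimate with hypothetical numbers.
# All prices and traffic figures below are placeholders, not real rates.
price_per_1k_input_tokens = 0.01   # USD, hypothetical
price_per_1k_output_tokens = 0.03  # USD, hypothetical

requests_per_second = 50
input_tokens_per_request = 500
output_tokens_per_request = 200
seconds_per_month = 60 * 60 * 24 * 30

monthly_requests = requests_per_second * seconds_per_month
monthly_cost = monthly_requests * (
    input_tokens_per_request / 1000 * price_per_1k_input_tokens
    + output_tokens_per_request / 1000 * price_per_1k_output_tokens
)
print(f"Estimated monthly cost: ${monthly_cost:,.0f}")

Even with these hypothetical numbers, sustained traffic in the tens of requests per second works out to well over a million dollars per month.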

Let's start the technical discussion on how to host an LLM on your own :)

EC2

First, you need an EC2 instance: the hardware that will host the LLM. LLMs need GPUs to run efficiently (though with some tricks they can run on CPUs as well) and powerful hardware to serve responses to customers quickly. I recommend looking at the AWS G5 and P4 instance families, which provide good performance for ML and LLM workloads specifically.

We will use a g5 instance for this tutorial since they are readily available. Here are your options for g5 instances and the price per hour:

I recommend going with a multi-GPU instance; they are proven to give better performance than single-GPU instances. Let's select the g5.12xlarge instance.

Go to your AWS account, select a region based on your needs (us-east-1, us-east-2, etc.) from the dropdown, and go to Launch instance.

For application and OS images, you can go with Amazon Linux or Ubuntu images, since they come with some pre-built software and are command-line friendly.

For the Amazon Machine Image, select either a PyTorch or TensorFlow image, which comes with these ML frameworks already installed. These images include CUDA, Python, PyTorch, and TensorFlow, all of which are required for LLM hosting and inference.

I am going to select Deep Learning AMI GPU TensorFlow 2.13 from the dropdown. Once you select it, you can see that it can only run on specific instances:

Description
Supported EC2 instances: G3, P3, P3dn, P4d, G5, G4dn.

Select the instance type g5.12xlarge.

Create a new key pair and store the new .pem key safely; you will need it to SSH into your instance.

For network settings, check `Allow HTTP traffic from the internet` to expose the LLM API via the instance's public IP address so it can be called from any client.

For storage, select at least 100-200 GB for the root volume. LLM models are large and need plenty of space on the instance.

No need to change anything in the advanced settings. Now click on Launch instance and your instance should be ready in a few minutes.

SSH into EC2 instance

Once the EC2 instance is up and running and has passed its health checks, SSH into it.

From your local terminal, execute the commands below.

Locate your private key file. The key used to launch this instance is XYZ.pem (example).

Run this command, if necessary, to ensure your key is not publicly viewable.
chmod 400 XYZ.pem

ssh -i "XYZ.pem" ubuntu@ec2-107-23-122-111.compute-1.amazonaws.com

It should look like this:

Install dependencies on EC2

Make sure the CUDA version is 11.8 as shown in the image above; some of the dependencies we are about to install only work with CUDA 11.8 for now (as of 11/22/2023).
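You can confirm the installed CUDA toolkit version on the instance before installing anything (the Deep Learning AMI also prints it in the SSH login banner):

nvcc --version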

Currently, the vLLM framework provides excellent latency and throughput for LLMs thanks to PagedAttention: https://blog.vllm.ai/2023/06/20/vllm.html. In my evaluation, it is the best LLM serving framework available right now in terms of latency.

vLLM can be installed directly from its Git codebase or through LangChain. I suggest installing vLLM through LangChain and using LangChain as the main orchestrator for LLM hosting, context-aware applications, and caching, since even the caching libraries are exposed through LangChain. (Note that vLLM only works with NVIDIA hardware.)

Run the commands below to install the pip libraries:

pip install openai==v0.28.1
pip install langchain
pip install vllm
pip install gptcache

The latest openai release has a backward-compatibility bug, so we install a specific version of the openai library that works with LangChain.

It should look something like this after successful installation of vLLM:

Collecting mpmath>=0.19 (from sympy->torch>=2.1.0->vllm)
Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
Downloading vllm-0.2.2-cp310-cp310-manylinux1_x86_64.whl (29.0 MB)
Downloading ray-2.8.0-cp310-cp310-manylinux2014_x86_64.whl (62.5 MB)
Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Downloading transformers-4.35.2-py3-none-any.whl (7.9 MB)
Downloading xformers-0.0.22.post7-cp310-cp310-manylinux2014_x86_64.whl (211.8 MB)
Downloading torch-2.1.0-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
...
Installing collected packages: sentencepiece, ninja, mpmath, websockets, uvloop, triton, sympy, safetensors, regex, python-dotenv, pyarrow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, msgpack, httptools, h11, einops, watchfiles, uvicorn, nvidia-cusparse-cu12, nvidia-cudnn-cu12, huggingface-hub, tokenizers, nvidia-cusolver-cu12, transformers, torch, ray, xformers, vllm
Successfully installed einops-0.7.0 h11-0.14.0 httptools-0.6.1 huggingface-hub-0.19.4 mpmath-1.3.0 msgpack-1.0.7 networkx-3.2.1 ninja-1.11.1.1 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.101 nvidia-nvtx-cu12-12.1.105 pyarrow-14.0.1 python-dotenv-1.0.0 ray-2.8.0 regex-2023.10.3 safetensors-0.4.0 sentencepiece-0.1.99 sympy-1.12 tokenizers-0.15.0 torch-2.1.0 transformers-4.35.2 triton-2.1.0 uvicorn-0.24.0.post1 uvloop-0.19.0 vllm-0.2.2 watchfiles-0.21.0 websockets-12.0 xformers-0.0.22.post7
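
Before moving on, a quick sanity check is worth running. This is a minimal sketch that simply confirms vLLM imports cleanly and that PyTorch can see the GPUs (on a g5.12xlarge you should see 4):

import torch
import vllm

# Confirm the freshly installed packages load and the GPUs are visible.
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())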

Now that all dependencies have been installed, let's download the model.

Offline LLM serving

Hugging Face is the leading platform for hosting open-source models, and it provides APIs to deploy and serve LLMs, inference APIs, integrations with AWS, and more.

For this tutorial, I am going to show how to deploy the Falcon 7B Instruct model to EC2: https://huggingface.co/tiiuae/falcon-7b-instruct

This is one of the instruction-tuned Falcon models, which is good at following instructions given in the prompt.

Let's start with offline LLM serving and deploy the model. The model is downloaded from Hugging Face only once; any subsequent requests reuse the same loaded model for responding.

from langchain.llms import VLLM
import time

llm = VLLM(
    model="tiiuae/falcon-7b-instruct",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=50,
    temperature=0.6,
)

start_time = time.time()
output = llm("Who is president of US?")
end_time = time.time()
latency = end_time - start_time
print(f"Latency: {latency} seconds")
print("Generated text:", output)

Inside the VLLM constructor, you can play around with LLM parameters like max_new_tokens, temperature, and tensor_parallel_size. I will not go into the details of each parameter, but you can find information about them online.
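For example, the g5.12xlarge has 4 GPUs, so you can shard the model across all of them. Below is a sketch that assumes LangChain's VLLM wrapper forwards tensor_parallel_size to the underlying vLLM engine; adjust the value to match your instance:

from langchain.llms import VLLM

# Shard the model across all 4 GPUs of a g5.12xlarge.
llm = VLLM(
    model="tiiuae/falcon-7b-instruct",
    trust_remote_code=True,   # mandatory for hf models
    max_new_tokens=50,        # cap on tokens generated per request
    temperature=0.6,          # lower values give more deterministic answers
    tensor_parallel_size=4,   # number of GPUs to shard the model across
)

print(llm("Who is president of US?"))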

The code above will download the Falcon 7B Instruct model onto your EC2 instance and answer the question Who is president of US? Downloading the model weights might take 10-15 minutes.

It should look like this:

INFO 11-22 23:06:06 llm_engine.py:207] # GPU blocks: 44055, # CPU blocks: 32768
Processed prompts: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 1.65it/s]
Latency: 0.610914945602417 seconds
Generated text:
As of 2021, the current president of the United States is Joe Biden.

As you can see, we got the LLM response As of 2021, the current president of the United States is Joe Biden. and the latency on the g5.12xlarge instance is 0.61 seconds for this question, which is not bad at all! Woohoo, we got our first LLM response; great work if you have reached this stage!

LLM inference using FastAPI

Now that we have deployed the model and tried offline serving, let's start a FastAPI-powered API that serves requests and responds using the deployed model.

The code below starts a FastAPI Python application that hosts the LLM behind a `/v1/generateText` endpoint on port 5001. It will not re-download the LLM model if you already did so during offline serving in the previous step. Under the hood, vLLM uses FastAPI and the openai library to offer a request-response interface in the same style as the OpenAI API.

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, Response
from langchain.llms import VLLM
import uvicorn

app = FastAPI()

llm = VLLM(
    model="tiiuae/falcon-7b-instruct",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=50,
    temperature=0.6,
)


@app.get("/")
def read_root():
    return {"Hello": "World"}


@app.post("/v1/generateText")
async def generateText(request: Request) -> Response:
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
    print(prompt)
    output = llm(prompt)
    print("Generated text:", output)
    ret = {"text": output}
    return JSONResponse(ret)


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5001)

Run the Python application above and it should start the LLM server like this:

ubuntu@ip-172-31-31-137:~$ python fastAPIWithoutCache.py

WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

WARNING 11-22 23:08:53 config.py:433] The model's config.json does not contain any of the following keys to determine the original maximum length of the model: ['max_position_embeddings', 'n_positions', 'max_seq_len', 'seq_length', 'max_sequence_length', 'max_seq_length', 'seq_len']. Assuming the model's maximum length is 2048.
INFO 11-22 23:08:53 llm_engine.py:72] Initializing an LLM engine with config: model='tiiuae/falcon-7b-instruct', tokenizer='tiiuae/falcon-7b-instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 11-22 23:09:05 llm_engine.py:207] # GPU blocks: 44055, # CPU blocks: 32768
INFO: Started server process [26925]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5001 (Press CTRL+C to quit)

If you want to send a request to this API hosted on port 5001, you can try sending one from the same EC2 instance using the code below:

import requests
import json
import time

# Define the API endpoint
url = "http://localhost:5001/v1/generateText"

headers = {"Content-Type": "application/json"}
data = {"prompt": "Who is president of US?"}

start_time = time.time()
# Make the POST request
response = requests.post(url, headers=headers, data=json.dumps(data))
end_time = time.time()
latency = end_time - start_time
print(f"Latency: {latency} seconds")

print("LLM response: " + response.text)

You should see a response to the above request like this:

ubuntu@ip-172-31-31-137:~$ python callAPI.py
Latency: 0.6146674156188965 seconds
LLM response: {"text":"\nAs of 2021, the current president of the United States is Joe Biden."}

Using LLM caching

Now let's measure the latency when the same request is sent a second time, using either exact-text caching or semantic caching. This is server-side LLM caching, so for a cache hit the latency consists only of the network API call. The following LLM caching integrations are available with LangChain:

  • In-memory cache
  • SQLite cache
  • Redis cache
  • GPTCache

Redis and GPTCache provide two options:

  • Exact text-based cache
  • Semantic similarity cache (interesting to evaluate)

All cache integrations provide similar performance benefits for the LLM inference API, reducing cache-hit latency to around 0.1-0.2 seconds and significantly lowering the API's tp50 latency. For this tutorial, I am going to use GPTCache.
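For comparison, the simplest integrations take only a line or two to enable. This is a sketch assuming the langchain.cache module layout of the 2023-era releases used in this tutorial:

import langchain
from langchain.cache import InMemoryCache, SQLiteCache

# Option 1: in-memory cache (fastest, but lost when the process restarts)
langchain.llm_cache = InMemoryCache()

# Option 2: SQLite-backed cache (persists across restarts)
# langchain.llm_cache = SQLiteCache(database_path=".langchain.db")

Once langchain.llm_cache is set, every llm(prompt) call made through LangChain checks the cache before hitting vLLM. The full FastAPI server with GPTCache looks like this: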

import langchain
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, Response
from langchain.llms import VLLM
from langchain.cache import GPTCache
from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt
import hashlib
import uvicorn

app = FastAPI()


def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    cache_obj.init(
        pre_embedding_func=get_prompt,
        data_manager=manager_factory(manager="map", data_dir=f"map_cache_{hashed_llm}"),
    )


langchain.llm_cache = GPTCache(init_gptcache)

llm = VLLM(
    model="tiiuae/falcon-7b-instruct",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=50,
    temperature=0.6,
)


@app.get("/")
def read_root():
    return {"Hello": "World"}


@app.post("/v1/generateText")
async def generateText(request: Request) -> Response:
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
    print(prompt)
    output = llm(prompt)
    print("Generated text:", output)
    ret = {"text": output}
    return JSONResponse(ret)


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5001)

Now let's try the same request with LLM caching enabled. Start the LLM API using the code above with GPTCache, and then send the request to this API.

ubuntu@ip-172-31-31-137:~$ python callAPI.py
Latency: 0.0044176578521728516 seconds
LLM response: {"text":"\nAs of 2021, the current president of the United States is Joe Biden."}
ubuntu@ip-172-31-31-137:~$ python callAPI.py
Latency: 0.0030639171600341797 seconds
LLM response: {"text":"\nAs of 2021, the current president of the United States is Joe Biden."}

That is a huge improvement in latency now that caching is in place: the same requests are being served in 0.003-0.004 seconds, reducing the tp50 latency of the LLM API by a big margin.
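If you want to measure the tp50 improvement yourself rather than eyeballing individual calls, a small script like this (a sketch reusing the same endpoint and payload as callAPI.py) sends the request repeatedly and reports the median latency:

import json
import statistics
import time

import requests

url = "http://localhost:5001/v1/generateText"
headers = {"Content-Type": "application/json"}
data = {"prompt": "Who is president of US?"}

# Send the same request repeatedly and report the median (tp50) latency.
latencies = []
for _ in range(20):
    start = time.time()
    requests.post(url, headers=headers, data=json.dumps(data))
    latencies.append(time.time() - start)

print(f"tp50 latency over 20 calls: {statistics.median(latencies):.4f} seconds")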

If you want to expose this API to the outside world, you can use a reverse proxy mechanism to forward traffic from port 80 to port 5001, where LLM inference is running, and get LLM responses back in less than a second.
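One lightweight option (an example, not the only way to do it) is to redirect incoming port 80 traffic to port 5001 with an iptables rule on the instance; a full reverse proxy such as nginx works too:

sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 5001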

Happy coding and AI!
