Notes about running a chat completion API endpoint with TensorRT-LLM and Meta-Llama-3–8B-Instruct

Víctor Navarro Aránguiz · Published in CodeGPT · 12 min read · Apr 26, 2024


This article covers the essential steps required to set up and run a chat completion API endpoint using TensorRT-LLM, optimized for NVIDIA GPUs. More than just a guide, these notes document my own journey trying to get this toolbox up and running, including the snags and solutions I encountered along the way.

Disclaimer: Don’t follow this article blindly without reading it first, as it also showcases some of my initial missteps in making it work!
(Whenever you see the 🚨 emoji, pay attention: a problem appears at that point.)

Computer Info

  • O.S: Windows 11 with WSL2 kernel version 5.15.133.1-microsoft-standard-WSL2 (Ubuntu 22.04.4 LTS)
  • RAM: 128 GB DDR5 @3600 MHz (98.2 GB shared with WSL2)
  • Processor: AMD Ryzen 7 7800X3D @4200 MHz, 8 cores / 16 threads
  • GPU: NVIDIA GeForce RTX 3090, 24 GB VRAM

Objectives

Set up and run a chat completion API endpoint with TensorRT-LLM and Triton Inference Server, and expose it through an OpenAI-compatible API.

Important conclusions

  • It is better to compile and serve using the provided Docker containers.
  • The only version of tensorrt_llm (together with its backend) that worked for me was v0.8.0.
  • When serving the model, ensure you are using the correct chat template.

Step 1: Install Docker

The installation is straightforward; just follow the Docker installation guide. Note that for WSL you really need Docker Desktop for Windows. You can verify that Docker is installed by running:

$> docker --version
Docker version 24.0.5, build 24.0.5-0ubuntu1~22.04.1

Step 2: Install CUDA

The installation is also simple, just follow the CUDA installation guide. If everything is okay, you can run nvidia-smi to check your GPU:

$> nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.01 Driver Version: 546.01 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 33W / 350W | 1271MiB / 24576MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

From now on, things get tricky 😶‍🌫️.

Step 3: Download the model you want to use

You need to download the weights of the model you want to use before running TensorRT-LLM. In my case, I wanted to try Meta-Llama-3-8B-Instruct. To download it, you will need to accept Meta’s Terms and Conditions and use your HuggingFace token, since it is a gated model. After that, install Git Large File Storage and clone the HF repo:

git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

A new folder Meta-Llama-3-8B-Instruct will appear:

Meta-Llama-3-8B-Instruct
├── LICENSE
├── README.md
├── USE_POLICY.md
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── original
│ ├── consolidated.00.pth
│ ├── params.json
│ └── tokenizer.model
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json

🚨 Note: some users reported that the eos_token in tokenizer_config.json, which is what allows the response of the LLM to be terminated correctly, was incorrectly defined. It should be <|eot_id|> instead of <|end_of_text|>, so make sure to change it.
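If you prefer to script that edit instead of changing the file by hand, here is a minimal sketch that rewrites the field in place (it assumes the model folder sits in your current working directory):

import json

# Path to the tokenizer config inside the downloaded model folder (assumed location).
path = "Meta-Llama-3-8B-Instruct/tokenizer_config.json"

with open(path) as f:
    config = json.load(f)

# Replace the end-of-sequence token so responses stop at <|eot_id|>.
config["eos_token"] = "<|eot_id|>"  # was "<|end_of_text|>"

with open(path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)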

Step 4: Running TensorRT-LLM to compile the model

TensorRT-LLM is a toolbox that compiles an LLM to accelerate and optimize inference performance on NVIDIA GPUs. It supports many models as well as different quantization techniques such as AWQ and GPTQ (see the TensorRT-LLM support matrix).

To install TensorRT-LLM, you can start with the official repository NVIDIA TensorRT-LLM. I noticed that there are two ways to make TensorRT work:

  1. Install tensorrt_llm directly on your machine.
  2. Use a Docker container to run it.

1 Direct Installation (it doesn't go well)

Regarding point 1, you will first need to install the NVIDIA CUDA toolkit and add it to your path (for example in your ~/.bashrc file):

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"

You can check if it's correctly installed by running nvcc -V:

$> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Now you can install the necessary dependencies along with the tensorrt_llm package. Note that the Python version must be 3.10:

# Install dependencies
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

# Install latest version
pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

If you are using conda to manage your environments, then this process will likely not work 🚨:

  • tensorrt_llm depends on mpi4py, whose installation via pip doesn’t work. You can fix this by installing it directly with conda: conda install mpi4py.
  • Even after that, the installation of tensorrt_llm still failed. You can check the installation by running python -c "import tensorrt_llm", which in my case threw the error ModuleNotFoundError: No module named 'tensorrt_llm.bindings'. There is a closed issue with the same error, and one of the maintainers recommended using Docker instead.

At this point, I stepped back and tried the Docker installation. You can still attempt a direct installation; I found a comment from a user who said they are not using Docker.

2 Using a Docker container

The documentation from TensorRT-LLM recommends using the nvidia/cuda:12.1.0-devel-ubuntu22.04 image. But before running it, you need to consider two things:

  1. The container needs access to the downloaded model.
  2. You will need the NVIDIA TensorRT-LLM repository to easily compile the model.

Given that, the best approach is to clone the repo, move the downloaded model into the repo folder, and mount the repo as a volume inside the container. Indeed, that is what NVIDIA does in the guide I also followed 😄: Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server. In that guide, the repo is cloned from the v0.8.0 branch, which is one of the first discrepancies with the official documentation 🚨, which says to install the latest version by simply cloning the main branch. If you check the repository tags, you will notice that the latest version (as of this date, 2024-04-26) is v0.9.0, so I chose that one since it is the latest stable release.

To clone the repository and move the model folder just use:

git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
mv Meta-Llama-3-8B-Instruct TensorRT-LLM/

Now, you can start the docker container:

docker run --rm --runtime=nvidia --gpus all --volume ${PWD}/TensorRT-LLM:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

Note that TensorRT-LLM is mounted in a folder with the same name, which is also used as the workdir of the container. In my case, a problem appeared here 🚨 with the usage of --runtime=nvidia, despite having the NVIDIA Container Toolkit installed:

docker: Error response from daemon: unknown or invalid runtime name: nvidia.

To my surprise, removing that flag was enough to make it work. Afterwards, I found a comment from an NVIDIA forum moderator saying that for x86 architectures the --gpus flag is enough. In any case, this is definitely an error related to using Docker inside WSL2.

Once inside the container, you need to install the dependencies along with the tensorrt_llm specific package version that you want to use:

# Install dependencies
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

# Install v0.9.0
pip3 install tensorrt_llm==0.9.0 -U --extra-index-url https://pypi.nvidia.com

This will take a while. Once finished, check that everything is working:

$> python3 -c "import tensorrt_llm"
[TensorRT-LLM] TensorRT-LLM version: 0.9.0

Nice! 😄 Now it's time to finally compile the model. The TensorRT-LLM repo contains a folder full of scripts for doing this with different model architectures. In particular, I used the one for Llama models:

#  Build the Llama 8B model using a single GPU and BF16.

# 1 Convert the checkpoint
python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
    --output_dir ./tllm_checkpoint_1gpu_bf16 \
    --dtype bfloat16

# 2 Compile model
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
    --output_dir ./tmp/llama/8B/trt_engines/bf16/1-gpu \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16

The first step took 40 seconds, while the second one took 24 seconds and reported a peak memory usage during engine building and serialization of 34.8 GB of RAM, which is a huge amount.

Now, I can try the compiled model:

$> python3 examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/bf16/1-gpu \
    --max_output_len 100 --tokenizer_dir ./Meta-Llama-3-8B-Instruct \
    --input_text "How do I count to nine in French?"

# Model Output
" Counting to nine in French is easy and fun. Here's how you can do it:
One: Un
Two: Deux
Three: Trois
Four: Quatre
Five: Cinq
Six: Six
Seven: Sept
Eight: Huit
Nine: Neuf
That's it! You can now count to nine in French. Just remember that the numbers one to five are similar to their English counterparts, but the numbers six to nine have different pronunciations"

Step 5: Deploy with Triton Inference Server

Triton Inference Server is NVIDIA's production-ready inference server, and it provides a dedicated backend for TensorRT-LLM. To configure it, I used the same NVIDIA guide mentioned before, but with version v0.9.0 (since this was the version used in the previous step):

# Clone v0.9.0
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
# Move to the folder
cd tensorrtllm_backend
# Copy the compiled model
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/bf16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/

Note that I want to use in-flight batching to improve the serving throughput. Following the guide, it is now time to update the configuration files. Here, I noted that decoupled_mode should be set to True if I want streaming:

# The location of the tokenizer, VERY IMPORTANT
HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
# The location of the compiled model
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt \
    triton_max_batch_size:64

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

Then I can finally serve the model:

# Change to base working directory
cd ..
# Set the workspace as the same folder we are in
docker run -it --rm --gpus all --network host --shm-size=1g \
    -v $(pwd):/workspace \
    --workdir /workspace \
    nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3

# Install python dependencies
pip install sentencepiece protobuf

# Launch Server
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
    --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm \
    --world_size 1

Here, world_size must be the same as the number of GPUs used to compile the model, which in my case was just one.

Here appears another problem 🚨:

...
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[12977,1],0]
Exit code: 1
--------------------------------------------------------------------------

Checking the log, I noticed this particular error:

[TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.

That was weird. Could it be that I was not using the correct tritonserver image? Looking at the container images website, I noticed that 24.03 was in fact the latest version 🤔. Then I found an issue from Dec 6, 2023, where a user had a similar error. The suggested solution was to build the Docker container with the TensorRT version set to 9.3.0.1. However, that was taking too much time (>30 min) and I really wanted to deploy the model! The faster solution? Go back to version v0.8.0 ⚠️.

So I repeated Step 4 using the v0.8.0 tag and installing tensorrt_llm==0.8.0, and checked out the same tag in Step 5. Then I launched the server again:

I0426 05:17:29.577423 114 server.cc:677]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+

I0426 05:17:29.595616 114 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0426 05:17:29.597070 114 metrics.cc:770] Collecting CPU metrics
I0426 05:17:29.597206 114 tritonserver.cc:2538]
+----------------------------------+--------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.44.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedul |
| | e_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_d |
| | ata parameters statistics trace logging |
| model_repository_path[0] | tensorrtllm_backend/all_models/inflight_batcher_llm |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------------------------------+

I0426 05:17:29.604773 114 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:8001
I0426 05:17:29.604929 114 http_server.cc:4636] Started HTTPService at 0.0.0.0:8000
I0426 05:17:29.682582 114 http_server.cc:320] Started Metrics Service at 0.0.0.0:8002
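With the HTTP, gRPC, and metrics services listening on ports 8000, 8001, and 8002, you can optionally verify readiness from Python before sending any request. A minimal sketch using the requests package and Triton's standard health route:

import requests

# Triton exposes standard health endpoints on its HTTP port.
resp = requests.get("http://localhost:8000/v2/health/ready", timeout=5)
print("Triton ready:", resp.status_code == 200)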

And now it’s working 😊. Let’s make a request:

$> curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "Tell me a short joke about llamas",
  "parameters": {
    "max_tokens": 128,
    "stop_words": ["<|eot_id|>"]
  }
}'

# text_output
".
Here's one: Why did the llama go to the party? Because it was a hair-raising good time! (get it? hair-raising? haha) Reply Delete
2. Ahahaha, that's a great one! I love puns, and that one is particularly clever. Thanks for sharing! Reply Delete
3. I'm glad you enjoyed it! I have a few more llama puns up my sleeve if you're interested. Reply Delete
4. Oh, absolutely! I'd love to hear more llama puns. Go ahead and share them! Reply Delete
"

To enable streaming, the endpoint must be localhost:8000/v2/models/ensemble/generate_stream, and you have to set "stream": true.
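Here is a rough sketch of how such a streaming request could look from Python, mirroring the payload above. It is only an illustration: the exact placement of the "stream" field and the shape of each chunk may differ depending on your backend version.

import json
import requests

payload = {
    "text_input": "Tell me a short joke about llamas",
    "stream": True,
    "parameters": {"max_tokens": 128, "stop_words": ["<|eot_id|>"]},
}

# generate_stream answers with server-sent events, one "data: {...}" line per chunk.
with requests.post(
    "http://localhost:8000/v2/models/ensemble/generate_stream",
    json=payload,
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: "):
            chunk = json.loads(line[len(b"data: "):])
            print(chunk.get("text_output", ""), end="", flush=True)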

Going back to the output of our first request, you can see that the model is not stopping correctly; it even produces a "Reply Delete" part that makes no sense 🤨. The reason is that I am not applying the chat template correctly. If you check tokenizer_config.json from the HuggingFace model, you will notice a chat_template, which looks something like this:

<|start_header_id|>ROLE<|end_header_id|>
CONTENT<|eot_id|>

Here ROLE and CONTENT are the same as in the typical message structure that OpenAI made popular: [{"role": ROLE, "content": CONTENT}]. So, how can we apply this template? We need to format the request before passing it to the endpoint 🤔.
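An easy way to produce (and inspect) a correctly formatted prompt is to let the Hugging Face tokenizer apply the template for you. A small sketch, assuming transformers is installed and the model folder is in the current directory; the resulting string is what would go into text_input:

from transformers import AutoTokenizer

# Loads the chat_template defined in tokenizer_config.json.
tokenizer = AutoTokenizer.from_pretrained("Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a short joke about llamas"},
]

# add_generation_prompt appends the assistant header so the model starts answering.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)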

Supposing we overcome this obstacle, I then need to integrate this into my application, which already works very well with the OpenAI API. Here are two possible solutions:

  1. Define ways to handle the stream/no stream requests to our endpoint and update our code such that it can work with that and also with the OpenAI API.
  2. Make a clone of the OpenAI API that points to our endpoint.

I consider option 2 more interesting because it makes the integration easier, since a lot of tooling is already built on top of the OpenAI API. Also, if you check TensorRT-LLM competitors such as Text-Generation-Inference and vLLM, both provide an OpenAI-compatible API server.

So that was my choice: an OpenAI-compatible API for TensorRT-LLM, which is exactly what I found in this repository: https://github.com/npuichigo/openai_trtllm.

Step 6: Setting up an OpenAI-compatible API

To use this repo you have to do the following:

# Clone the repo
git clone https://github.com/npuichigo/openai_trtllm
cd openai_trtllm

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build the source code
cargo build --release

(The authors also include a Docker Compose file, but I couldn’t make it work 😟)

After building the source code of the project, a new target folder will appear containing the compiled binary. Now I just need to run ./target/release/openai_trtllm:

./target/release/openai_trtllm --help
Usage: openai_trtllm [OPTIONS]

Options:
  -H, --host <HOST>
          Host to bind to [default: 0.0.0.0]
  -p, --port <PORT>
          Port to bind to [default: 3000]
  -t, --triton-endpoint <TRITON_ENDPOINT>
          Triton gRPC endpoint [default: http://localhost:8001]
  -o, --otlp-endpoint <OTLP_ENDPOINT>
          Endpoint of OpenTelemetry collector
      --history-template <HISTORY_TEMPLATE>
          Template for converting OpenAI message history to prompt
      --history-template-file <HISTORY_TEMPLATE_FILE>
          File containing the history template string
  -h, --help
          Print help

As stated in the project's README, openai_trtllm communicates with Triton over gRPC, so --triton-endpoint should point to the gRPC port (that is why the endpoint uses port 8001 and not 8000). Also note the history-template options: this is exactly the chat-history formatting I was talking about before. The repository already contains a template for Llama 3 at templates/history_template_llama3.liquid, so I just need to run ./target/release/openai_trtllm --history-template-file templates/history_template_llama3.liquid:

$> ./target/release/openai_trtllm \
--history-template-file templates/history_template_llama3.liquid
{"timestamp":"2024-04-26T06:15:12.490748Z","level":"INFO","message":"Connecting to triton endpoint: http://localhost:8001","target":"openai_trtllm::startup"}
{"timestamp":"2024-04-26T06:15:12.497127Z","level":"INFO","message":"Starting server at 0.0.0.0:3000","target":"openai_trtllm::startup"}

Let's make a demo with Streamlit!

import streamlit as st
from openai import OpenAI

st.title("TensorRT-LLM Demo")

client = OpenAI(base_url="http://localhost:3000/v1", api_key="None")

if "messages" not in st.session_state:
    st.session_state["messages"] = []

prompt = st.chat_input("Say something")
if prompt:
    st.session_state["messages"].append({"role": "user", "content": prompt})
    for message in st.session_state["messages"]:
        st.chat_message(message["role"]).write(message["content"])
    container = st.empty()
    chat_completion = client.chat.completions.create(
        stream=True,
        messages=st.session_state["messages"],
        model="ensemble",  # Must be ensemble
        max_tokens=256,
    )
    response = ""
    for event in chat_completion:
        content = event.choices[0].delta.content
        if content:
            response += content
            container.chat_message("assistant").write(response)
    st.session_state["messages"].append({"role": "assistant", "content": response})

Works incredibly well! 😄😄

Let’s try it with the CodeGPT VSCode extension!

The only things that you have to do are:

  1. Install the extension
  2. Select Custom as the Provider and then Set Connection
  3. Use the generated chat/completion endpoint http://localhost:3000/v1/chat/completions as your custom link, and write anything as ApiKey
  4. Write ensemble as the model to use
  5. Done!

That’s all 😊 Hope you liked it!
