Deploying a Large Language Model (LLM) with TensorRT-LLM on Triton Inference Server: A Step-by-Step Guide

Murat Tezgider
Trendyol Tech
Mar 29, 2024
The cover image was generated using Microsoft’s Image Creator service, in compliance with the terms outlined in the Microsoft Services Agreement and the Image Creator Terms of Use.

Hello! In this article, I will discuss how to perform inference with Large Language Models (LLMs) and how to deploy Trendyol LLM v1.0, a model that we, the Trendyol NLP team, have contributed to the open-source community, to the Triton Inference Server using the TensorRT-LLM framework.

Various frameworks such as vLLM, TGI, and TensorRT-LLM have been developed for LLM inference. Among these tools, TensorRT-LLM stands out as an important framework for serving models efficiently in production environments.

The Triton Inference Server is an inference serving tool developed by NVIDIA that supports a wide range of machine learning frameworks. It is designed particularly for applications that require large-scale model deployment and processing. By supporting deep learning frameworks such as TensorFlow, PyTorch, and ONNX Runtime, Triton can run these models quickly and efficiently in production environments. It also enables parallel computation on GPUs, facilitating the fast and efficient processing of large-scale models. As a result, developers can easily manage and deploy complex model architectures and large datasets on the Triton Inference Server.

The Triton Inference Server also supports the TensorRT-LLM framework through the tensorrtllm_backend backend. Thanks to tensorrtllm_backend, we can deploy models running on the TensorRT-LLM framework on the Triton Inference Server. As the diagram below shows, the tensorrtllm_backend backend is built on top of the TensorRT-LLM framework and the TensorRT runtime.

Tensorrtllm_backend backend’s dependencies

In TensorRT-LLM, models are not used in their raw form. To use a model, it must first be compiled specifically for the target GPU, which produces a model file called an engine. While building an engine from the raw model, various optimization and quantization techniques supported by the GPU can be applied. If the model needs to be parallelized, this is also specified at compilation time, together with the number of GPUs across which the model will be deployed. Because compilation is specific to a GPU model, the engine must be built on the same GPU model that will later be used for inference. This GPU-specific compilation is where the speed of TensorRT-LLM comes from. The steps for compiling a model are shown in the flow below: first, the model is converted with the TensorRT-LLM framework in a GPU environment to create a TensorRT-LLM checkpoint; then, TensorRT engine files are built from this checkpoint.

The generated TensorRT engine files are stored and used for inference either with the TensorRT-LLM ModelRunner or, as I will discuss in this article, with the tensorrtllm_backend backend on the Triton Inference Server.
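To make the ModelRunner path more concrete, below is a minimal Python sketch of loading a compiled engine and generating text from it once an engine has been built (we will build one later in this article). It assumes the TensorRT-LLM v0.8.0 Python API; the paths and the prompt are illustrative, and the examples/run.py script in the TensorRT-LLM repository does the same thing with many more options.

# A rough sketch of offline inference with a compiled engine, assuming the
# TensorRT-LLM v0.8.0 Python API. Paths and the prompt are illustrative.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "/tensorrt/tensorrt-models/Trendyol-LLM-7b-chat-v1.0/v0.8.0/trt-engines/fp16/1-gpu/"
tokenizer = AutoTokenizer.from_pretrained("/tensorrt/models/Trendyol-LLM-7b-chat-v1.0")

# Load the serialized engine onto the GPU.
runner = ModelRunner.from_dir(engine_dir=engine_dir)

# generate() expects a list of 1-D int32 input-id tensors, one per prompt.
input_ids = tokenizer("Merhaba, nasılsın?", return_tensors="pt").input_ids[0].to(torch.int32)

with torch.no_grad():
    output_ids = runner.generate(
        batch_input_ids=[input_ids],
        max_new_tokens=100,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
        temperature=0.3,
        top_k=50,
        top_p=0.9,
    )

# output_ids has shape (batch, beams, sequence length) and contains the prompt
# tokens followed by the generated tokens.
print(tokenizer.decode(output_ids[0][0][len(input_ids):], skip_special_tokens=True))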

Now, let’s move on to the steps for performing inference on our Trendyol LLM model using the tensorrtllm_backend backend on the Triton Inference Server, based on this information.

Creating the TensorRT Engine

Step 1 The GPU model on which the Triton Inference Server will be installed for inference is determined. We will use the A100 GPU.

Step 2 The version of the Triton Inference Server to be installed is determined. We will use the “nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3” image, hence version 24.02.

Step 3 The support matrix for the “24.02” version of the Triton Inference Server is reviewed. The compatible versions of tensorrtllm_backend and TensorRT-LLM are determined. Since version 24.02 is compatible with TensorRT-LLM v0.8.0, we will use version v0.8.0 of tensorrtllm_backend.

Step 4 We have a server equipped with an A100 GPU, and it needs to have the necessary driver and compatible CUDA version installed for the A100. If you’re going to proceed with Docker, you’ll need to set up the necessary installations for GPU usage within the Docker environment. If you’re unsure how to do this, you can refer to this article.

nvidia-smi
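If you are working with Docker, a quick way to confirm that containers can see the GPU is to run nvidia-smi from a CUDA base image; the image tag below is only an example, and any recent CUDA image will do.

docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi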

Step 5 A container is created from the “nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3” image in either a Docker or Kubernetes environment. We used this image and created a Dockerfile that installs JupyterLab on top of it. Since our GPU resources are on Kubernetes, we built an image from this Dockerfile and exposed JupyterLab through the relevant service by defining the necessary service configurations. If you can connect to the server via SSH instead, you can build an image with the following Dockerfile and work through either JupyterLab or the terminal, as you prefer.

Dockerfile

FROM nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
WORKDIR /
COPY . .
RUN pip install protobuf SentencePiece torch
RUN pip install jupyterlab
EXPOSE 8888
EXPOSE 8000
EXPOSE 8001
EXPOSE 8002
CMD ["/bin/sh", "-c", "jupyter lab --LabApp.token='password' --LabApp.ip='0.0.0.0' --LabApp.allow_root=True"]
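For reference, building and running this image directly with Docker might look like the following; the image name and port mappings are illustrative and will differ if you deploy on Kubernetes as we did.

docker build -t trtllm-triton-jupyter .
docker run --gpus all -d \
    -p 8888:8888 -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    --name trtllm-triton trtllm-triton-jupyter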

Step 6 After creating a container from the Dockerfile and opening a terminal via JupyterLab, or connecting to the container shell with the “docker exec -it container-name sh” command, we install Git, since we will pull both our model and tensorrtllm_backend with it. For this purpose, a folder named /tensorrt is created in the root directory, and all the necessary files will be gathered under it.

mkdir /tensorrt
cd /tensorrt

After creating the /tensorrt directory and navigating into it, the following files are created:

“1.install_git_and_lfs.sh”: installs Git and Git LFS if they have not already been installed.

#!/bin/bash
# Author: Murat Tezgider
# Date: 2024-03-18
# Description: This script automates the check for Git and Git LFS installations, installing them if not already installed.
set -e

# Function to check if Git is installed
is_git_installed() {
  if command -v git &>/dev/null; then
    return 0
  else
    return 1
  fi
}

# Function to check if Git LFS is installed
is_git_lfs_installed() {
  if command -v git-lfs &>/dev/null; then
    return 0
  else
    return 1
  fi
}

# Check if Git is already installed
if is_git_installed; then
  echo "Git is already installed."
  git --version
  # Check if Git LFS is installed
  if is_git_lfs_installed; then
    echo "Git LFS is already installed."
    git-lfs --version
  else
    echo "Installing Git LFS..."
    apt-get update && apt-get install git-lfs -y || { echo "Failed to install Git LFS"; exit 1; }
    echo "Git LFS has been installed successfully."
  fi
else
  # Update package lists and install Git
  apt-get update && apt-get install git -y || { echo "Failed to install Git"; exit 1; }
  # Install Git LFS
  apt-get install git-lfs -y || { echo "Failed to install Git LFS"; exit 1; }
  echo "Git and Git LFS have been installed successfully."
fi

“2.install_tensorrt_llm.sh”: pulls the tensorrtllm_backend project along with its submodules (such as tensorrt_llm) and installs the TensorRT-LLM library, ensuring compatibility with Triton. Since the tensorrtllm_backend version compatible with our Triton version is v0.8.0, we set TENSORRT_BACKEND_LLM_VERSION=v0.8.0 in the file.

#!/bin/bash
# Author: Murat Tezgider
# Date: 2024-03-18
# Description: This script automates the installation process for TensorRT-LLM. Prior to running this script, ensure that Git and Git LFS ('apt-get install git-lfs') are installed.
# Step 1: Defining folder path and version
echo "Step 1: Defining folder path and version"
TENSORRT_BACKEND_LLM_VERSION=v0.8.0
TENSORRT_DIR="/tensorrt/$TENSORRT_BACKEND_LLM_VERSION"
# Step 2: Enter the installation folder and clone
echo "Step 2: Enter the installation folder and clone"
[ ! -d "$TENSORRT_DIR" ] && mkdir -p "$TENSORRT_DIR"
cd "$TENSORRT_DIR" || { echo "Failed to change directory to $TENSORRT_DIR"; exit 1; }
git clone -b "$TENSORRT_BACKEND_LLM_VERSION" https://github.com/triton-inference-server/tensorrtllm_backend.git --progress --verbose || { echo "Failed to clone repository"; exit 1; }
cd "$TENSORRT_DIR"/tensorrtllm_backend || { echo "Failed to change directory to $TENSORRT_DIR/tensorrtllm_backend"; exit 1; }
git submodule update --init --recursive || { echo "Failed to update submodules"; exit 1; }
git lfs install || { echo "Failed to install Git LFS"; exit 1; }
git lfs pull || { echo "Failed to pull Git LFS files"; exit 1; }
# Step 3: Enter the backend folder and Install backend related dependencies
echo "Step 3: Enter the backend folder and Install backend related dependencies"
cd "$TENSORRT_DIR"/tensorrtllm_backend || { echo "Failed to change directory to $TENSORRT_DIR/tensorrtllm_backend"; exit 1; }
apt-get update && apt-get install -y --no-install-recommends rapidjson-dev python-is-python3 || { echo "Failed to install dependencies"; exit 1; }
pip3 install -r requirements.txt --extra-index-url https://pypi.ngc.nvidia.com || { echo "Failed to install Python dependencies"; exit 1; }
# Step 4: Install tensorrt-llm library
echo "Step 4: Install tensorrt-llm library"
pip install tensorrt_llm=="$TENSORRT_BACKEND_LLM_VERSION" -U --pre --extra-index-url https://pypi.nvidia.com || { echo "Failed to install tensorrt-llm library"; exit 1; }
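After the script finishes, a quick sanity check that the library was installed with the expected version can be done, for example, like this:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Expected to print 0.8.0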

“3.trendyol_llm_tensorrt_engine_build_and_test.sh”: downloads the Trendyol/Trendyol-LLM-7b-chat-v1.0 model from Hugging Face, converts it into a TensorRT-LLM checkpoint, builds a TensorRT-LLM engine from that checkpoint, and finally tests the engine by running it.

#!/bin/bash
# Author: Murat Tezgider
# Date: 2024-03-18
# Description: This script automates the installation and inference process for a Hugging Face model using TensorRT-LLM. Ensure that Git and Git LFS ('apt-get install git-lfs') are installed before running this script. Before running this script, run the following scripts sequentially: 1. install_git_and_lfs.sh 2. install_tensorrt_llm.sh
HF_MODEL_NAME="Trendyol-LLM-7b-chat-v1.0"
HF_MODEL_PATH="Trendyol/Trendyol-LLM-7b-chat-v1.0"
# Clone the Hugging Face model repository
mkdir -p /tensorrt/models && cd /tensorrt/models && git clone https://huggingface.co/$HF_MODEL_PATH
# Convert the model checkpoint to TensorRT format
python /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /tensorrt/models/$HF_MODEL_NAME \
    --output_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --dtype float16
# Build TensorRT engine
trtllm-build --checkpoint_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --output_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --max_input_len 32768 \
    --strongly_typed
# Run inference with the TensorRT engine
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/$HF_MODEL_NAME \
    --engine_dir=/tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --max_attention_window_size=4096 \
    --temperature=0.3 \
    --top_k=50 \
    --top_p=0.9 \
    --repetition_penalty=1.2 \
    --input_text="[INST] Sen yardımsever bir asistansın ve sana verilen talimatlar doğrultusunda en iyi cevabı üretmeye çalışacaksın.\n\nTürkiye'nin doğusunda ne var? [/INST]"

Step 7 After creating the files, we grant execution permission to them as follows.

chmod +x 1.install_git_and_lfs.sh 2.install_tensorrt_llm.sh 3.trendyol_llm_tensorrt_engine_build_and_test.sh

Step 8 Now, we can run the bash files named “1.install_git_and_lfs.sh”, “2.install_tensorrt_llm.sh”, and “3.trendyol_llm_tensorrt_engine_build_and_test.sh” respectively.

./1.install_git_and_lfs.sh && ./2.install_tensorrt_llm.sh && ./3.trendyol_llm_tensorrt_engine_build_and_test.sh

After executing all the steps in the files, you should see an LLM result output like the following at the end of the terminal screen.

When all scripts have been successfully executed, you should see a folder structure under the /tensorrt directory as shown below.

Now that we have successfully compiled the model and created and run the TensorRT-LLM engine files, we can proceed to the step of deploying these TensorRT-LLM engine files onto the Triton Inference Server.

Steps for Model Deployment with Triton Inference Server

The Triton Inference Server reads models from a directory called a model repository. Under this directory, each model has its own folder, named after the model, containing a configuration file named ‘config.pbtxt’. Example model definitions are provided under ‘tensorrtllm_backend/all_models/inflight_batcher_llm’. In these example model definitions, we need to update some variables within the ‘config.pbtxt’ files according to our needs. To update these parameters easily, the ‘tensorrtllm_backend/tools/fill_template.py’ Python tool is used; we used this tool to update our ‘config.pbtxt’ files.
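For context, the template ‘config.pbtxt’ files contain placeholders of the form ${...} that fill_template.py replaces with the values passed on the command line. The excerpt below only illustrates that structure (the exact fields can vary between versions):

parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "${tokenizer_dir}"
  }
}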

Step 1 We create a directory to store our model definitions, and then copy the example model definitions located under ‘/tensorrt/v0.8.0/tensorrtllm_backend/all_models/inflight_batcher_llm’ to this newly created directory ‘/tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/’.

mkdir -p /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/
cp -r /tensorrt/v0.8.0/tensorrtllm_backend/all_models/inflight_batcher_llm/* /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/

The resulting directory structure looks like this:

triton-repos
└── trtibf-Trendyol-LLM-7b-chat-v1.0
    ├── ensemble
    │   ├── 1
    │   └── config.pbtxt
    ├── postprocessing
    │   ├── 1
    │   │   └── model.py
    │   └── config.pbtxt
    ├── preprocessing
    │   ├── 1
    │   │   └── model.py
    │   └── config.pbtxt
    ├── tensorrt_llm
    │   ├── 1
    │   └── config.pbtxt
    └── tensorrt_llm_bls
        ├── 1
        │   └── model.py
        └── config.pbtxt

Step 2 Next, in the ‘config.pbtxt’ files, we update the parameters our model needs to function correctly, such as the tokenizer directory and the engine directory, with the following values.

python3 /tensorrt/v0.8.0/tensorrtllm_backend/tools/fill_template.py -i /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/preprocessing/config.pbtxt tokenizer_dir:/tensorrt/models/Trendyol-LLM-7b-chat-v1.0,tokenizer_type:llama,triton_max_batch_size:64,preprocessing_instance_count:1
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tools/fill_template.py -i /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/postprocessing/config.pbtxt tokenizer_dir:/tensorrt/models/Trendyol-LLM-7b-chat-v1.0,tokenizer_type:llama,triton_max_batch_size:64,postprocessing_instance_count:1
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tools/fill_template.py -i /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tools/fill_template.py -i /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/ensemble/config.pbtxt triton_max_batch_size:64
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tools/fill_template.py -i /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/tensorrt/tensorrt-models/Trendyol-LLM-7b-chat-v1.0/v0.8.0/trt-engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600

Step 3 Now that our Triton repository is ready, we start the Triton server so that it reads the models from this repository.

tritonserver --model-repository=/tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0 --model-control-mode=explicit --load-model=preprocessing --load-model=postprocessing --load-model=tensorrt_llm --load-model=tensorrt_llm_bls --load-model=ensemble  --log-verbose=2 --log-info=1 --log-warning=1 --log-error=1
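For a single-GPU engine like ours, running tritonserver directly is enough. The tensorrtllm_backend repository also ships a helper script, scripts/launch_triton_server.py, that wraps this command with mpirun for engines built across multiple GPUs; a typical invocation (flags may vary between versions) looks like this:

python3 /tensorrt/v0.8.0/tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size=1 \
    --model_repo=/tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0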

For the Triton server to start up without any issues, all models need to be in the ‘ready’ state. When you see the output as shown below, the Triton server has started up smoothly.

I0320 11:44:44.456547 15296 model_lifecycle.cc:273] ModelStates()
I0320 11:44:44.456580 15296 server.cc:677]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I0320 11:44:44.474142 15296 metrics.cc:877] Collecting metrics for GPU 0: GRID A100D-80C
I0320 11:44:44.475350 15296 metrics.cc:770] Collecting CPU metrics
I0320 11:44:44.475507 15296 tritonserver.cc:2508]
+----------------------------------+----------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.43.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configurati |
| | on system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrt/triton-repos/trtibf-Trendyol-LLM-7b-chat-v1.0 |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | ensemble |
| startup_models_1 | postprocessing |
| startup_models_2 | preprocessing |
| startup_models_3 | tensorrt_llm |
| startup_models_4 | tensorrt_llm_bls |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------+
I0320 11:44:44.476247 15296 grpc_server.cc:2426]
+----------------------------------------------+---------+
| GRPC KeepAlive Option | Value |
+----------------------------------------------+---------+
| keepalive_time_ms | 7200000 |
| keepalive_timeout_ms | 20000 |
| keepalive_permit_without_calls | 0 |
| http2_max_pings_without_data | 2 |
| http2_min_recv_ping_interval_without_data_ms | 300000 |
| http2_max_ping_strikes | 2 |
+----------------------------------------------+---------+
I0320 11:44:44.476757 15296 grpc_server.cc:102] Ready for RPC 'Check', 0
I0320 11:44:44.476795 15296 grpc_server.cc:102] Ready for RPC 'ServerLive', 0
I0320 11:44:44.476810 15296 grpc_server.cc:102] Ready for RPC 'ServerReady', 0
I0320 11:44:44.476824 15296 grpc_server.cc:102] Ready for RPC 'ModelReady', 0
I0320 11:44:44.476838 15296 grpc_server.cc:102] Ready for RPC 'ServerMetadata', 0
I0320 11:44:44.476851 15296 grpc_server.cc:102] Ready for RPC 'ModelMetadata', 0
I0320 11:44:44.476865 15296 grpc_server.cc:102] Ready for RPC 'ModelConfig', 0
I0320 11:44:44.476881 15296 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryStatus', 0
I0320 11:44:44.476895 15296 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryRegister', 0
I0320 11:44:44.476910 15296 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0320 11:44:44.476923 15296 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryStatus', 0
I0320 11:44:44.476936 15296 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryRegister', 0
I0320 11:44:44.476949 15296 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0320 11:44:44.476963 15296 grpc_server.cc:102] Ready for RPC 'RepositoryIndex', 0
I0320 11:44:44.476977 15296 grpc_server.cc:102] Ready for RPC 'RepositoryModelLoad', 0
I0320 11:44:44.476989 15296 grpc_server.cc:102] Ready for RPC 'RepositoryModelUnload', 0
I0320 11:44:44.477003 15296 grpc_server.cc:102] Ready for RPC 'ModelStatistics', 0
I0320 11:44:44.477018 15296 grpc_server.cc:102] Ready for RPC 'Trace', 0
I0320 11:44:44.477031 15296 grpc_server.cc:102] Ready for RPC 'Logging', 0
I0320 11:44:44.477060 15296 grpc_server.cc:359] Thread started for CommonHandler
I0320 11:44:44.477204 15296 infer_handler.h:1188] StateNew, 0 Step START
I0320 11:44:44.477245 15296 infer_handler.cc:680] New request handler for ModelInferHandler, 0
I0320 11:44:44.477276 15296 infer_handler.h:1312] Thread started for ModelInferHandler
I0320 11:44:44.477385 15296 infer_handler.h:1188] StateNew, 0 Step START
I0320 11:44:44.477418 15296 infer_handler.cc:680] New request handler for ModelInferHandler, 0
I0320 11:44:44.477443 15296 infer_handler.h:1312] Thread started for ModelInferHandler
I0320 11:44:44.477550 15296 infer_handler.h:1188] StateNew, 0 Step START
I0320 11:44:44.477582 15296 stream_infer_handler.cc:128] New request handler for ModelStreamInferHandler, 0
I0320 11:44:44.477608 15296 infer_handler.h:1312] Thread started for ModelStreamInferHandler
I0320 11:44:44.477622 15296 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0320 11:44:44.477845 15296 http_server.cc:4637] Started HTTPService at 0.0.0.0:8000
I0320 11:44:44.518903 15296 http_server.cc:320] Started Metrics Service at 0.0.0.0:8002
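Before sending inference requests, you can also confirm readiness through Triton’s standard HTTP health endpoints:

# Returns HTTP 200 when the server and the loaded models are ready
curl -v localhost:8000/v2/health/ready
# Check a specific model
curl -v localhost:8000/v2/models/ensemble/ready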

Step 4 Now, you can open a new terminal and send requests to our models using curl.

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Türkiye nin doğusunda ne var?", "max_tokens": 200, "bad_words": "", "stop_words": ""}'

Result :

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nYani, Türkiye'nin doğusunda İran, Irak ve Suriye var."}

If you’ve made it this far without any issues, you have successfully deployed an LLM model with Triton step by step. What remains is to move the engine, tokenizer, and Triton repository model definition files to a storage area so that Triton can read them from there. To keep the article from becoming too long, I’ll end it here. I hope you found it helpful!


I plan to address in a future article how to transfer these files to a storage area and provide details on how they will be used. Stay healthy and happy!

You can access the GitHub repository that lists the step-by-step procedures performed in this article here.

Conclusion

In this article, we’ve outlined the streamlined deployment process of an LLM model using the Triton Inference Server. By preparing the model, creating TensorRT-LLM engine files, and deploying them onto Triton, we’ve demonstrated an efficient approach to serving LLM models in production. Future work may focus on transferring these files to storage and detailing their usage. Overall, Triton offers a robust platform for deploying LLM models effectively.

Join Us

Want to be a part of our growing company? We’re hiring! Check out our open positions from the links below.
