Deploying Deep Learning Models at Scale — Triton Inference Server 0 to 100

Alex Razvant
Published in Decoding ML · Feb 7, 2024 · 9 min read


NVIDIA’s Triton Inference Server is open-source software that provides a high-performance serving system for machine learning models. It is designed to optimize and serve models for inference in production environments, ensuring efficient utilization of GPU and CPU resources.
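To make “serving” concrete, here is a minimal sketch of querying a running Triton server with the official tritonclient Python package over HTTP. The model name (resnet50) and tensor names (input__0, output__0) are placeholders; in practice they must match the names defined in your model’s configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server exposing its HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor: name, shape, and datatype must match the
# model's config.pbtxt (the names here are placeholders).
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Run inference and read the output tensor back as a NumPy array.
response = client.infer(model_name="resnet50", inputs=[infer_input])
predictions = response.as_numpy("output__0")
print(predictions.shape)
```

The same request can also be issued over gRPC via tritonclient.grpc, which exposes an almost identical client API.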

It has also been the go-to framework I use professionally to deploy and manage computer vision models at scale, handling multiple concurrent inference requests across multiple servers.

NVIDIA Triton Server. Image by Author.

Let’s outline the topics we’ll go through:

  • What is NVIDIA Triton Inference Server (T.I.S.)?
  • How Triton works and when to use it
  • Real-world results
  • Installation & Sample Project (quick 10-minute test)
  • Recap & Next

What is NVIDIA Triton Inference Server (T.I.S.)?

It started as part of the NVIDIA Deep Learning SDK, helping developers package their models within NVIDIA’s software stack. It then branched out as the TensorRT Inference Server, which focused on serving models optimized as TensorRT engines, and eventually became the NVIDIA Triton Inference Server, a powerful tool designed for deploying models in production environments.
