A Minimalistic Guide to Setting Up Your Own NVIDIA Triton Inference Server

Published in Salesforce Engineering, Apr 21, 2020

Full post by Nitish Shirish Keskar on einstein.ai

We investigate NVIDIA’s Triton Inference Server (formerly the TensorRT Inference Server) as a way of hosting Transformer Language Models. The post is roughly divided into two parts: (i) instructions for setting up your own inference server, and (ii) benchmarking experiments.

The Serving Problem

Let us begin by discussing what the serving problem is and establishing a common vocabulary. For the purpose of this blog post, we will primarily focus on Transformer Language Models for text classification. Serving such a model essentially equates to providing an endpoint that is efficient and secure. The lifecycle of model serving typically involves multiple personas: scientists train the model on available data, and engineers assist in integrating the solution into the model repository while emphasizing maintainability, efficiency, security, and reusability.

Stereotypically, but not always, there is a disconnect between how these two groups function. Scientists often use bleeding-edge technology in their attempts to squeeze out performance, which may jeopardize aspects important to engineers, such as maintainability. On the other hand, engineering requirements can often seem cumbersome to scientists who are trying to solve novel problems with uncertain exploration landscapes. Consider, for instance, a serving solution that relies on TensorFlow, e.g., TFServing. This choice either forces model exploration to fit within the TensorFlow paradigm, or requires an (often expensive) code translation from another framework after an acceptable model has been found. Finding a solution that satisfies both scientists and engineers is the heart of the serving problem we wish to investigate.

In summary, an ideal solution is one that allows for efficient and maintainable serving without severely restricting scientists.

Click through to the Einstein blog to read the full setup instructions, benchmarking details, and experiment results.


Written by @SalesforceEng