How to deploy (almost) any Hugging Face model 🤗 on NVIDIA’s Triton Inference Server with an application to Zero-Shot-Learning for Text Classification

Sachin Sharma
Oct 11, 2020

SUMMARY

In this blog post, we examine NVIDIA’s Triton Inference Server (formerly known as TensorRT Inference Server), which simplifies the deployment of AI models at scale in production. We focus mainly on hosting Transformer language models such as BERT, GPT-2, BART, and RoBERTa. To solve the problem of zero-shot text classification, we will deploy Hugging Face’s RoBERTa model (a multilingual natural language inference model) on the Triton server; once it is deployed, we can send inference requests and get back predictions. Setting up the Triton Inference Server requires clearing two hurdles: 1) setting up our own inference server, and 2) writing a Python client-side script that communicates with the inference server to send requests (in our case, text) and get back predictions or text feature embeddings.
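
To make the second hurdle concrete, here is a minimal client-side sketch, assuming the model has already been exported to TorchScript and is being served under the hypothetical name "zero_shot_roberta" with Triton’s TorchScript-style tensor names (input__0, input__1, output__0). The checkpoint (roberta-large-mnli), the tensor names, and the sequence length are illustrative assumptions and must match your deployment’s config.pbtxt:

```python
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

# Tokenizer matching the deployed checkpoint (assumption: an English MNLI model).
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
client = httpclient.InferenceServerClient(url="localhost:8000")

# Zero-shot classification via NLI: the text is the premise, the candidate
# label is rephrased as a hypothesis, and the entailment score ranks the label.
premise = "Angela Merkel spoke to reporters in Berlin today."
hypothesis = "This text is about politics."
enc = tokenizer(premise, hypothesis, return_tensors="np",
                padding="max_length", max_length=256)

# Tensor names follow Triton's TorchScript naming convention (input__0, input__1, ...);
# they are assumptions here and must match the deployed model's config.pbtxt.
inputs = [
    httpclient.InferInput("input__0", list(enc["input_ids"].shape), "INT64"),
    httpclient.InferInput("input__1", list(enc["attention_mask"].shape), "INT64"),
]
inputs[0].set_data_from_numpy(enc["input_ids"].astype(np.int64))
inputs[1].set_data_from_numpy(enc["attention_mask"].astype(np.int64))

outputs = [httpclient.InferRequestedOutput("output__0")]
result = client.infer(model_name="zero_shot_roberta", inputs=inputs, outputs=outputs)

logits = result.as_numpy("output__0")  # shape (1, 3): contradiction / neutral / entailment
probs = np.exp(logits[0]) / np.exp(logits[0]).sum()
print("entailment probability for the label:", probs[2])
```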

REQUIREMENTS

  1. Nvidia CUDA-enabled GPU: for this blog post I am using a GeForce RTX 2080 Nvidia GPU with around 12 GB of memory.
  2. Nvidia Docker
  3. Triton Client libraries for communication with Triton inference server
  4. PyTorch
  5. Hugging Face Transformers library

Basic Introduction (Why do we need Nvidia’s Triton Inference Server?)

Image depicting the capability of the Triton server to host multiple heterogeneous deep learning frameworks (src: https://developer.nvidia.com/nvidia-triton-inference-server)

The one thing that attracted all of us (the AI team of Define Media) the most is the Triton Inference Server’s ability to host/deploy trained models from any framework (whether TensorFlow, TensorRT, PyTorch, Caffe, ONNX Runtime, or some custom framework), loaded from local storage, Google Cloud Platform, or AWS S3, on any GPU- or CPU-based infrastructure (cloud, data center, or edge). In Nvidia’s Triton framework, model checkpoints are optimized/compressed (quantization and pruning in the case of PyTorch models) before serving, which…
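
Before Triton can serve anything, the checkpoint has to end up in the model-repository layout that the server scans at start-up. As a rough, hedged sketch (the exact export steps used later in this post may differ), a Hugging Face NLI checkpoint can be traced to TorchScript for Triton’s LibTorch backend and dropped into that layout; the repository path, model name, and sequence length below are illustrative assumptions:

```python
import os
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "roberta-large-mnli"                     # example NLI checkpoint (assumption)
REPO_DIR = "model_repository/zero_shot_roberta/1"   # Triton layout: <repo>/<model_name>/<version>/
os.makedirs(REPO_DIR, exist_ok=True)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# torchscript=True makes the model return traceable tuple outputs instead of dicts.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, torchscript=True)
model.eval()

# A dummy premise/hypothesis pair, used only to trace the computation graph.
enc = tokenizer("a premise", "a hypothesis", return_tensors="pt",
                padding="max_length", max_length=256)
with torch.no_grad():
    traced = torch.jit.trace(model, (enc["input_ids"], enc["attention_mask"]))

# Triton's LibTorch backend expects the file to be named model.pt; a config.pbtxt
# describing the input/output tensors sits next to the version folder. The server
# is then launched (e.g. via the NGC Docker image) with --model-repository
# pointing at "model_repository/".
traced.save(os.path.join(REPO_DIR, "model.pt"))
```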

Sachin Sharma
Graph Machine Learning Research Engineer @ArangoDB Gmbh | Former AI/Machine Learning Scientist & Engineer @DefineMedia Gmbh | Former Research Intern @DFKI KL