How to deploy (almost) any Hugging Face model 🤗 on NVIDIA’s Triton Inference Server with an application to Zero-Shot-Learning for Text Classification

Sachin Sharma
Oct 11, 2020

SUMMARY

In this blog post, we examine NVIDIA’s Triton Inference Server (formerly known as TensorRT Inference Server), which simplifies the deployment of AI models at scale in production. We focus mainly on hosting Transformer language models such as BERT, GPT-2, BART, and RoBERTa. To solve the problem of zero-shot text classification, we will deploy Hugging Face’s RoBERTa model (a multilingual natural language inference model) on the Triton server; once it is deployed, we can send inference requests and get back predictions. Setting up the Triton Inference Server requires clearing two hurdles: 1) setting up our own inference server, and 2) writing a Python client-side script that communicates with the inference server to send requests (in our case, text) and get back predictions or text feature embeddings.
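
To make the second hurdle concrete, here is a minimal client-side sketch, assuming the model has already been exported to TorchScript and is being served under the hypothetical name "zero_shot_roberta" with Triton’s TorchScript-style tensor names (input__0, input__1, output__0). The checkpoint (roberta-large-mnli), the tensor names, and the sequence length are illustrative assumptions and must match your deployment’s config.pbtxt:

```python
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

# Tokenizer matching the deployed checkpoint (assumption: an English MNLI model).
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
client = httpclient.InferenceServerClient(url="localhost:8000")

# Zero-shot classification via NLI: the text is the premise, the candidate
# label is rephrased as a hypothesis, and the entailment score ranks the label.
premise = "Angela Merkel spoke to reporters in Berlin today."
hypothesis = "This text is about politics."
enc = tokenizer(premise, hypothesis, return_tensors="np",
                padding="max_length", max_length=256)

# Tensor names follow Triton's TorchScript naming convention (input__0, input__1, ...);
# they are assumptions here and must match the deployed model's config.pbtxt.
inputs = [
    httpclient.InferInput("input__0", list(enc["input_ids"].shape), "INT64"),
    httpclient.InferInput("input__1", list(enc["attention_mask"].shape), "INT64"),
]
inputs[0].set_data_from_numpy(enc["input_ids"].astype(np.int64))
inputs[1].set_data_from_numpy(enc["attention_mask"].astype(np.int64))

outputs = [httpclient.InferRequestedOutput("output__0")]
result = client.infer(model_name="zero_shot_roberta", inputs=inputs, outputs=outputs)

logits = result.as_numpy("output__0")  # shape (1, 3): contradiction / neutral / entailment
probs = np.exp(logits[0]) / np.exp(logits[0]).sum()
print("entailment probability for the label:", probs[2])
```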

REQUIREMENTS

  1. Nvidia CUDA-enabled GPU: for this blog post I am using a GeForce RTX 2080 Nvidia GPU with around 12 GB of memory.
  2. Nvidia Docker
  3. Triton Client libraries for communication with Triton inference server
  4. PyTorch
  5. Hugging Face Transformers library

Basic Introduction (Why do we need Nvidia’s Triton Inference Server?)

Image depicting the capability of the Triton server to host multiple heterogeneous deep learning frameworks (src: https://developer.nvidia.com/nvidia-triton-inference-server)

The one thing that attracted all of us (the AI team of Define Media) the most is the Triton Inference Server’s ability to host/deploy trained models from any framework (whether TensorFlow, TensorRT, PyTorch, Caffe, ONNX Runtime, or some custom framework), loaded from local storage, Google Cloud Platform, or AWS S3, on any GPU- or CPU-based infrastructure (cloud, data center, or edge). In Nvidia’s Triton framework, model checkpoints are optimized/compressed (quantization and pruning in the case of PyTorch models) before serving, which…
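
Before Triton can serve anything, the checkpoint has to end up in the model-repository layout that the server scans at start-up. As a rough, hedged sketch (the exact export steps used later in this post may differ), a Hugging Face NLI checkpoint can be traced to TorchScript for Triton’s LibTorch backend and dropped into that layout; the repository path, model name, and sequence length below are illustrative assumptions:

```python
import os
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "roberta-large-mnli"                     # example NLI checkpoint (assumption)
REPO_DIR = "model_repository/zero_shot_roberta/1"   # Triton layout: <repo>/<model_name>/<version>/
os.makedirs(REPO_DIR, exist_ok=True)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# torchscript=True makes the model return traceable tuple outputs instead of dicts.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, torchscript=True)
model.eval()

# A dummy premise/hypothesis pair, used only to trace the computation graph.
enc = tokenizer("a premise", "a hypothesis", return_tensors="pt",
                padding="max_length", max_length=256)
with torch.no_grad():
    traced = torch.jit.trace(model, (enc["input_ids"], enc["attention_mask"]))

# Triton's LibTorch backend expects the file to be named model.pt; a config.pbtxt
# describing the input/output tensors sits next to the version folder. The server
# is then launched (e.g. via the NGC Docker image) with --model-repository
# pointing at "model_repository/".
traced.save(os.path.join(REPO_DIR, "model.pt"))
```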

Sachin Sharma
Graph Machine Learning Research Engineer @ArangoDB Gmbh | Former AI/Machine Learning Scientist & Engineer @DefineMedia Gmbh | Former Research Intern @DFKI KL