How to deploy (almost) any Hugging Face model 🤗 on NVIDIA’s Triton Inference Server, with an application to zero-shot learning for text classification
SUMMARY
In this blog post, we examine NVIDIA’s Triton Inference Server (formerly known as TensorRT Inference Server), which simplifies the deployment of AI models at scale in production. We focus mainly on hosting Transformer language models such as BERT, GPT-2, BART, and RoBERTa. To then solve the problem of zero-shot text classification, we deploy a Hugging Face RoBERTa model (a multilingual natural language inference model) on the Triton server; once it is deployed, we can send inference requests and get back predictions. Setting up the Triton Inference Server generally involves clearing two hurdles: 1) setting up our own inference server, and 2) writing a Python client-side script that communicates with the inference server to send requests (in our case, text) and get back predictions or text feature embeddings.
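To make the zero-shot idea concrete before any server is involved, here is a minimal sketch of how a natural language inference model can be used as a classifier: the input text is the premise, each candidate label is turned into a hypothesis, and the entailment scores are normalized across labels. The helper names, the hypothesis template, and the example logits are all assumptions made for illustration; in the real setup the logits would come back from the deployed model rather than being hard-coded.

```python
import math

def build_nli_pairs(text, labels, template="This example is {}."):
    """Pair the input text (premise) with one hypothesis per candidate label.

    The template is an assumed wording, not something fixed by the model.
    """
    return [(text, template.format(label)) for label in labels]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, entailment_logits):
    """Turn one entailment logit per label into (label, probability) pairs,
    sorted from most to least likely."""
    probs = softmax(entailment_logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

labels = ["sports", "politics", "technology"]
pairs = build_nli_pairs("The new GPU doubles inference throughput.", labels)

# Placeholder values standing in for the entailment logits that the
# inference server would return for each (premise, hypothesis) pair.
placeholder_logits = [-1.2, -0.7, 2.3]

ranking = rank_labels(labels, placeholder_logits)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

The client-side script's job is then mostly plumbing: tokenize each pair, send it to the server, and feed the returned entailment logits into a ranking step like the one above.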
REQUIREMENTS
- NVIDIA CUDA-enabled GPU: for this blog post, I am using an NVIDIA GeForce RTX 2080 GPU with around 12 GB of memory.
- NVIDIA Docker
- Triton Client libraries for…