Deploy Hugging Face Text Generation Inference on Azure Container Instance

Shahzeb Naveed
Published in The Deep Hub
5 min read · Mar 19, 2024
“HuggingFace with Azure on Cloud” (Source: Pixlr)

The Text Generation Inference (TGI) by Hugging Face is a gRPC-based inference engine written in Rust and Python for fast text generation. It is the backend serving engine for various production use cases at Hugging Face, such as Hugging Face Chat, the Inference API, and Inference Endpoints. It facilitates efficient deployment and serving of Large Language Models (LLMs) and supports popular open-source LLMs like Llama, Falcon, StarCoder, BLOOM, and GPT-NeoX.

TGI offers numerous features, including a simple launcher for serving LLMs, distributed tracing and Prometheus metrics, tensor parallelism for faster inference on multiple GPUs, and token streaming using Server-Sent Events (SSE). It also supports fine-tuned models, quantization, stop sequences and speculative decoding, and is compatible with various hardware platforms such as Nvidia, AMD and Intel.

In this article, we’ll dive into the process of deploying Hugging Face TGI on Azure Container Instance (ACI).

For an end-to-end deployment guide on Azure Kubernetes Service, including the frontend and backend, check out my other article.

What’s Azure Container Instance?

Azure Container Instance (ACI) is a serverless container service that enables quick and easy deployment of containerized applications without the need for managing underlying infrastructure.
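To make this concrete, here’s a minimal, illustrative Azure CLI sketch that launches Microsoft’s public aci-helloworld sample image on ACI; the resource group name is a placeholder, and this is just a sketch of the workflow, not part of the TGI deployment itself.

# Illustrative only: run a public sample image on ACI with a public IP
az container create \
  --resource-group YOUR_RESOURCE_GROUP_NAME \
  --name hello-aci \
  --image mcr.microsoft.com/azuredocs/aci-helloworld \
  --ports 80 \
  --ip-address Public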

Step-by-Step Deployment Guide

Step 1: Accessing Azure Cloud Shell

  1. Log in to the Azure Portal.
  2. Navigate to the Cloud Shell icon in the top right corner and start a Bash session. If this is your first time accessing Cloud Shell, it may prompt you to set up a Storage Account.
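Before moving on, it can help to confirm which subscription Cloud Shell is using and to create a resource group to hold the deployment. The group name and region below are placeholders; adjust them to your setup.

# Check the active subscription
az account show --output table

# Create a resource group for the container group (placeholder name and region)
az group create --name YOUR_RESOURCE_GROUP_NAME --location eastus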

Step 2: Creating YAML Configuration File

  1. Using your preferred text editor (such as vi), create a YAML file named container_config.yml.
  2. Populate the YAML file with the necessary configuration settings, including container properties, environment variables, and volume mounts.
  3. Here’s a sample template to get you started. Note that I am using the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model for my deployment. You can use any model available on the Hugging Face Hub. If you decide to use a gated model, such as Meta’s Llama models, you’ll have to follow the access request procedure outlined in the model card on Hugging Face. Once you’re granted access, you need to add another environment variable named HUGGING_FACE_HUB_TOKEN to the YAML file to verify your access (a quick way to sanity-check the token is shown right after the template). You can get your personal access token from your Hugging Face account settings.
name: container-group-name
apiVersion: '2021-10-01'
location: eastus
tags: {}

properties:
  containers:
  - name: container-name
    properties:
      image: ghcr.io/huggingface/text-generation-inference:1.4
      ports:
      - protocol: TCP
        port: 80
      environmentVariables:
      - name: MODEL_ID
        value: 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
      resources:
        requests:
          memoryInGB: 16
          cpu: 4
        limits:
          memoryInGB: 16
          cpu: 4
      volumeMounts:
      - name: data-volume
        mountPath: /data
        readOnly: false
  imageRegistryCredentials: []
  restartPolicy: Always
  ipAddress:
    ports:
    - protocol: TCP
      port: 80
    type: Public
  osType: Linux
  volumes:
  - name: data-volume
    azureFile:
      shareName: modeldata
      readOnly: false
      storageAccountName: YOUR_STORAGE_ACCOUNT_NAME
      storageAccountKey: YOUR_STORAGE_ACCOUNT_KEY
  sku: Standard
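If you’re deploying a gated model, a quick way to sanity-check your token before adding it to the YAML is to call the Hugging Face Hub’s whoami endpoint with it (the token value below is a placeholder). In the YAML itself, the token then goes in as an extra entry under environmentVariables; ACI also supports a secureValue field if you’d rather keep it out of plain-text output.

# Verify the access token is valid before putting it in the YAML (placeholder token)
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxx
curl -s -H "Authorization: Bearer $HUGGING_FACE_HUB_TOKEN" https://huggingface.co/api/whoami-v2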

Step 3: Setting Up Azure File Share

To prevent the TGI service from downloading model weights from Hugging Face on every run, we mount a volume to our container to persist the model files. In Azure, one way to achieve this is to create an Azure File Share and mount it to the container instance as follows:

  1. In the Azure Portal, navigate to the Storage Accounts service from the search bar.
  2. Create a new storage account or select an existing one.
  3. Within the storage account settings, navigate to the File Shares section.
  4. Create a new file share (e.g., ‘modeldata’).
  5. Head back to your Storage Account page and, under the Access keys section, copy one of the keys and paste it into the YAML template (a CLI alternative is sketched below).
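If you prefer staying in Cloud Shell, the same setup can be done from the CLI. The sketch below assumes the storage account already exists and reuses the placeholder names from the YAML template.

# Create the file share on the existing storage account
az storage share-rm create \
  --resource-group YOUR_RESOURCE_GROUP_NAME \
  --storage-account YOUR_STORAGE_ACCOUNT_NAME \
  --name modeldata

# List the account keys so one can be pasted into the YAML template
az storage account keys list \
  --resource-group YOUR_RESOURCE_GROUP_NAME \
  --account-name YOUR_STORAGE_ACCOUNT_NAME \
  --output table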

Step 4: Deploying Hugging Face TGI on ACI

In Cloud Shell, issue the following Azure CLI command, specifying the resource group and the file path to the YAML configuration file.

az container create --resource-group YOUR_RESOURCE_GROUP_NAME --file container_config.yml
Expected output if your YAML is valid

You’ll then see a JSON output with “provisioningState” set to “Succeeded”, which confirms that you did not exceed your quota and that your compute requirements can indeed be met in the Azure region you specified in the YAML.
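You can also query just the provisioning state from the CLI instead of scanning the full JSON; container-group-name below refers to the name set at the top of the YAML.

# Query only the provisioning state of the container group
az container show \
  --resource-group YOUR_RESOURCE_GROUP_NAME \
  --name container-group-name \
  --query provisioningState \
  --output tsv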

Depending on your model selection and compute specification, things can still go wrong. Search for “Container instances” and open the container group that you just created. On the Overview page, you’ll see your CPU, memory and network usage ramping up as your container downloads model weights from Hugging Face. If your container crashes halfway through the process, it most likely means you need more RAM, and you’ll have to delete your current container group, increase the memory in the YAML specification and redo the above process.

Azure Container Instance downloads the model weights and warms up for inference
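If you do hit the out-of-memory case, a rough sequence looks like this: delete the container group, raise memoryInGB in container_config.yml, and deploy again.

# Delete the under-provisioned container group
az container delete \
  --resource-group YOUR_RESOURCE_GROUP_NAME \
  --name container-group-name \
  --yes

# After increasing memoryInGB in container_config.yml, redeploy
az container create \
  --resource-group YOUR_RESOURCE_GROUP_NAME \
  --file container_config.yml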

With Azure Container Instances, there are different ways of retrieving logs that you can use to review the deployment status, diagnostic events and the stdout of a running container.

az container logs --resource-group YOUR_RESOURCE_GROUP_NAME --name container-group-name

# OR stream stdout/stderr from the running container
az container attach --resource-group YOUR_RESOURCE_GROUP_NAME --name container-group-name

# OR view the container group's state and diagnostic events
az container show --resource-group YOUR_RESOURCE_GROUP_NAME --name container-group-name

Once your logs show the message “Connected”, your server is ready to serve requests.
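As an extra readiness check, TGI also exposes lightweight /health and /info endpoints that you can query with the container’s public IP (copied from the Overview tab, as in the next step); a quick sketch:

# TGI liveness and model-info endpoints (replace YOUR_IP_ADDRESS with the container's public IP)
curl -s YOUR_IP_ADDRESS/health
curl -s YOUR_IP_ADDRESS/info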

Copy the IP address of the container from the Overview tab and issue a cURL to query the model:

curl YOUR_IP_ADDRESS/generate -X POST -d '{"inputs":"What is a Large Language Model?","parameters":{"max_new_tokens":100}}' -H 'Content-Type: application/json'
HuggingFace TGI Server Output on cURL request
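Since TGI streams tokens over Server-Sent Events, you can also try the streaming endpoint with the same payload; this is a sketch against the same server and model.

curl YOUR_IP_ADDRESS/generate_stream -X POST -d '{"inputs":"What is a Large Language Model?","parameters":{"max_new_tokens":100}}' -H 'Content-Type: application/json'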

Here’s how your mounted Azure File Share will look once the model weights are downloaded:

ACI downloads model weights to the mounted Azure File Share.
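You can also list the cached files from the CLI; this sketch reuses the placeholder storage account name and key from the YAML template.

# List the model files cached in the Azure File Share
az storage file list \
  --share-name modeldata \
  --account-name YOUR_STORAGE_ACCOUNT_NAME \
  --account-key YOUR_STORAGE_ACCOUNT_KEY \
  --output table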

The TinyLlama-1.1B model used in this tutorial is a pretty lightweight model and works just okay in a CPU-only environment for testing purposes. For most production use cases, you should use GPUs to get the most out of the LLM’s capabilities and TGI’s offerings.

You now have two options: either use ACI with GPUs (a preview feature at the time of writing, available only on pay-as-you-go subscriptions), or use Azure Kubernetes Service (AKS) with GPU-powered clusters.

To add GPUs in ACI, specify the GPU requirements in the requests and limits sections of the YAML as follows:

gpu:
  count: 2
  sku: V100

Conclusion

In this article, we went over the process of deploying Hugging Face’s Text Generation Inference engine on Azure Container Instance. ACI offers a seamless and efficient solution for deploying containerized applications in the cloud. With the simplicity of ACI and the power of Hugging Face TGI, developers can effortlessly deploy and scale their Generative AI applications.

Stay tuned for a follow-up article where I go over the process of deploying an inference engine for LLMs using Azure Kubernetes Service.

Thanks for reading!
