Deploying custom fine-tuned LLMs on Vertex AI

Ashika Umagiliya
6 min read · Oct 7, 2023


With the democratization of Open Source large language models, enterprises are increasingly utilizing these LLMs for their internal applications and use cases. They prefer deploying these models within their own infrastructure due to security reasons and the need to fine-tune them for specific use cases.

Google Vertex AI offers an end-to-end solution for training, fine-tuning, deploying, and serving these Open Source LLMs. In this article, I will explain how to deploy a fine-tuned model on Vertex AI.

Vertex AI: Model Garden, Model Registry, Model deployment and Endpoints

Model Garden provides a curated collection of foundation models including enterprise-ready foundation model APIs, open source models, and task-specific models. These can be easily deployed and used for inference.

The Model Registry contains custom registered models as well as model instances from the Model Garden. These are containerized applications that expose models via endpoints (discussed later).

Models in the Registry are deployed on GCP infrastructure; one or more GPUs are generally used for faster inference.

Endpoints expose models for inference. A single endpoint can expose multiple models, and a traffic split can be configured between them. Endpoints are secured by GCP IAM.
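For example, once a second model is deployed to the same endpoint, an 80/20 traffic split can be requested at deploy time. The following is only a rough sketch; ENDPOINT_ID, MODEL_ID and DEPLOYED_MODEL_ID are placeholders for real IDs:

# "0" refers to the model being deployed by this command
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=asia-northeast1 \
  --model=MODEL_ID \
  --display-name=my-model-v2 \
  --traffic-split=0=80,DEPLOYED_MODEL_ID=20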

Custom-container for fine-tuned model

(Figure: custom-container application)

A custom-container contains an application that exposes two endpoints: one for inference (/predict) and another for health checks (/health). Usually this is implemented using a framework like Flask or FastAPI.

The “predict” endpoint should follow the specification documented here, and the “health check” endpoint should likewise follow the documented requirements.
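In essence, Vertex AI forwards prediction requests to the container as a JSON body containing an instances array (and an optional parameters object), and expects a JSON body containing a predictions array in return; the health route only needs to answer HTTP 200 once the model is ready. Roughly:

# request body POSTed to the predict route
{"instances": [{"prompt": "..."}, {"prompt": "..."}], "parameters": {}}

# expected response body
{"predictions": ["...", "..."]}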

Model loading and inference are done using Hugging Face Transformers. For low-latency, high-throughput inference, an optimized serving framework such as vLLM or MII is usually used. This article provides a comprehensive list of optimization and serving frameworks.

Fine-tuned models are typically saved in formats such as PyTorch, TensorFlow, or Safetensors. The model can either be packaged in the container or retrieved from an object store (e.g. GCS or S3) at start-up, as sketched below.
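As a rough sketch, the model files could be pulled from GCS when the container starts, using the google-cloud-storage client. The bucket and prefix names below are placeholders:

import os
from google.cloud import storage


def download_model_from_gcs(bucket_name: str, prefix: str, local_dir: str) -> None:
    # copy every object under the prefix into a local folder the app can load from
    os.makedirs(local_dir, exist_ok=True)
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip "directory" placeholder objects
        blob.download_to_filename(os.path.join(local_dir, os.path.basename(blob.name)))


# e.g. download_model_from_gcs("my-model-bucket", "flan5/", "./flan5")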

During deployment, the VM instance type as well as the accelerator (GPU) type and count can be configured. At runtime the model is loaded onto one or more GPUs, depending on the deployment options.

Fine-tuned model, Serving application and Dockerfile

For demonstration purposes, a FLAN-T5 model fine-tuned for text summarization is used. You can find the notebook used for fine-tuning here.

The following code demonstrates a sample serving application with the “predict” and “health” endpoints:

import os
from os.path import dirname
from typing import List

from flask import Flask, request, jsonify

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# download the fine-tuned model files and copy them into the 'flan5' folder
model_path = f'{dirname(__file__)}/flan5/'

# load the model
loaded_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto").to(device)
# get the tokenizer
loaded_tokenizer = AutoTokenizer.from_pretrained(model_path)

app = Flask(__name__)


# Define the /health GET endpoint
@app.route('/health', methods=['GET'])
def is_alive():
    return '', 200  # HTTP 200 OK response


def batch_inference(prompts: List[str]) -> List[str]:
    # tokenize the whole batch at once and run a single generate() call
    res: List[str] = []
    input_ids = loaded_tokenizer(prompts, padding=True, return_tensors="pt").input_ids.to(device)
    loaded_model_outputs = loaded_model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(max_new_tokens=200))
    for model_output in loaded_model_outputs:
        loaded_model_text_output = loaded_tokenizer.decode(model_output, skip_special_tokens=True)
        res.append(loaded_model_text_output)
    return res


# Define the /predict POST endpoint
@app.route('/predict', methods=['POST'])
def predict():
    try:
        data = request.get_json()
        instances = data['instances']
        if not instances:
            raise Exception("No instances found")

        prompts = [value['prompt'] for value in instances]
        completions = batch_inference(prompts)
        prediction = {"predictions": completions}

        return jsonify(prediction), 200
    except Exception as e:
        error_message = {'error': 'Invalid request format or missing fields'}
        print(str(e))
        return jsonify(error_message), 400  # HTTP 400 Bad Request


if __name__ == '__main__':
    port = int(os.environ.get('PORT', 8080))
    app.run(debug=True, host='0.0.0.0', port=port, use_reloader=False)

Dockerfile:

FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get install -y nginx python3 python3-dev python3-pip && \
    rm -rf /var/lib/apt/lists/*

RUN mkdir /model
COPY ./flan5 ./flan5
COPY ./main.py .
# torch is required by main.py; pick a build that matches the CUDA base image
RUN pip3 install torch transformers==4.33.2 accelerate==0.23.0 flask==2.3.3

EXPOSE 8080
ENTRYPOINT [ "python3" ]
CMD ["main.py"]

Registering model into the Model Registry

Build and push the Docker image to the GCR:

docker build .
docker tag b50497c28d5e gcr.io/my-gcp-project-id/ll-custom
docker push gcr.io/my-gcp-project-id/ll-custom
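If the push is rejected with an authentication error, Docker may first need to be configured to use gcloud credentials:

gcloud auth configure-docker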

The container should be visible in the GCR console. Next, the model is registered in the Model Registry as follows:

gcloud ai models upload \
--container-ports=8080 \
--container-predict-route="/predict" \
--container-health-route="/health" \
--region=asia-northeast1 \
--display-name=ash-llm-custom-batch-inf \
--container-image-uri=gcr.io/my-gcp-project-id/ll-custom

The model should be visible in the Model Registry. The model is assigned a long ID, such as 6667306569438330880.
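If the ID is not noted down at upload time, it can be looked up later by display name:

gcloud ai models list \
  --region=asia-northeast1 \
  --filter=display_name=ash-llm-custom-batch-inf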

Creating the endpoint and deploying the model

First, an endpoint needs to be created. (Documentation)

gcloud ai endpoints create \
--project=my-gcp-project-id \
--region=asia-northeast1 \
--display-name=ash-llm-endpoint

When the endpoint is created, an endpoint ID is returned. This ID is a long number, such as 3871485994515562496.
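As with the model, the endpoint ID can also be retrieved later by listing the endpoints in the region:

gcloud ai endpoints list \
  --project=my-gcp-project-id \
  --region=asia-northeast1 \
  --filter=display_name=ash-llm-endpoint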

Next, the registered model is deployed and attached to the created endpoint. This will take 5–10 minutes depending on the configured resources.

gcloud ai endpoints deploy-model 3871485994515562496 \
--project=my-gcp-project-id \
--region=asia-northeast1 \
--model=6667306569438330880 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--machine-type="n1-highmem-2" \
--display-name=fine-tuned-flan5

Details of all the options can be found in the documentation. The deployed model and the endpoint can be viewed in the console.

Connecting to the endpoint and performing inference

The following code demonstrates how to connect to the deployed endpoint and perform inference. In the example, we send multiple prompts for batch inference.

from google.cloud import aiplatform

project = 'my-gcp-project-id'
location = 'asia-northeast1'
endpoint_id = '3871485994515562496'

aiplatform.init(project=project, location=location)
endpoint = aiplatform.Endpoint(
    "projects/" + project + "/locations/" + location + "/endpoints/" + endpoint_id)

prompt1 = """
Summarize the following conversation.

#Person1#: You're finally here! What took so long?
#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.
#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.
#Person2#: I don't think it can be avoided, to be honest.
#Person1#: perhaps it would be better if you started taking public transport system to work.
#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.
#Person1#: It would be better for the environment, too.
#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.
#Person1#: Taking the subway would be a lot less stressful than driving as well.
#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.
#Person1#: Well, when it's nicer outside, you can start biking to work. That will give you just as much freedom as your car usually provides.
#Person2#: That's true. I could certainly use the exercise!
#Person1#: So, are you going to quit driving to work then?
#Person2#: Yes, it's not good for me or for the environment.
"""

prompt2 = """
Summarize the following conversation.

#Person1" : iPhone15 is bad, you should not buy one.
#Person2" : I agree.
"""

# prepare payload

instances = [
    {
        "prompt": prompt1
    },
    {
        "prompt": prompt2
    }
]

completions = endpoint.predict(instances=instances)
print(completions)
for p in completions.predictions:
    print("completion: " + p)

Enhancing inference speed using vLLM

The serving application demonstrated above only uses Hugging Face model and tokenizer classes. This is because vLLM currently doesn't support encoder-decoder models such as FLAN-T5. The following code snippet shows how to load a fine-tuned Llama 2 model using vLLM for high-throughput, improved batch inference.

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

num_of_gpus = 1

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="./llama-ft/merged_model/", tensor_parallel_size=num_of_gpus)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
