Deploying Llama 2 in Vertex AI from Model Garden: A Step-by-Step Guide

Zabir Al Nazi Nabil
4 min read · Oct 31, 2023


Introduction: Google’s Vertex AI includes Model Garden, a curated catalog of machine learning models that can be deployed with a few clicks. In this article, we will walk through deploying the Llama 2 model in Vertex AI from Model Garden. Llama 2 is a versatile family of large language models that can be used for a variety of natural language processing tasks.

Step 1: Accessing Vertex AI Model Garden
The first step is to navigate to the Vertex AI Model Garden in the Google Cloud console. Here, you can find a wide range of pre-trained models, including Llama 2, ready for deployment.

Llama 2 in Model Garden

Step 2: Selecting Llama 2
Once you are in the Model Garden, locate and select the Llama 2 model card. Clicking on it will take you to the model’s detail page. From there, click “Deploy” to open the deployment configuration page.

Llama 2 Deployment

Step 3: Configuring Deployment
On the deployment configuration page, you can specify the settings for deploying Llama 2 (a programmatic equivalent is sketched after the screenshots below). Here are the steps to follow:

  1. Select the model variant you need; for example, I chose Llama 2 13B chat. Then select the region, making sure you have enough GPU quota there.
  2. Select the machine configuration: choose a machine type that suits your needs. For example, you can opt for a machine with 64 vCPUs, 57.6 GB of RAM, and 4 Tesla T4 GPUs. This selection depends on your specific use case and resource requirements.
  3. Define deployment targets: specify where the model will be hosted, including the Google Cloud region.
  4. Scaling options: Vertex AI can automatically scale the number of replicas based on request load. Configure this based on your expected workload.
  5. Advanced settings: explore advanced settings such as network configuration and access control to align with your project’s requirements.
Llama 2 13B Chat Deployment
Llama 2 Deployment Settings
Llama 2 Endpoint
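
If you prefer to script this step rather than click through the console, the Vertex AI Python SDK exposes the same settings. The following is a minimal sketch, not the exact mechanism Model Garden uses under the hood: it assumes the Llama 2 model has already been uploaded to your project’s Vertex AI Model Registry under the hypothetical resource ID "my-llama2-13b-chat", and it mirrors the machine, accelerator, and scaling choices above.

from google.cloud import aiplatform

# PROJECT_ID and the model resource ID below are placeholders.
aiplatform.init(project="PROJECT_ID", location="us-east1")

model = aiplatform.Model("my-llama2-13b-chat")  # hypothetical registered model
endpoint = model.deploy(
    machine_type="n1-highcpu-64",        # 64 vCPUs, 57.6 GB RAM
    accelerator_type="NVIDIA_TESLA_T4",  # 4 x Tesla T4, as selected above
    accelerator_count=4,
    min_replica_count=1,                 # autoscaling lower bound
    max_replica_count=2,                 # autoscaling upper bound
)
print("Endpoint:", endpoint.resource_name)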

Step 4: Verifying the Deployment
The deployment will take a few minutes. Once it completes, you can find the model endpoint under “Online prediction” in the Vertex AI console.
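
You can also confirm from a notebook that the endpoint exists. A short sketch using the Python SDK; the project and region are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-east1")

# List every endpoint in the region with its display name and numeric ID.
for endpoint in aiplatform.Endpoint.list():
    print(endpoint.display_name, endpoint.name)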

Now, it’s time to write a notebook to test our deployed Llama 2 model.

# Install the Vertex AI SDK and supporting libraries.
! pip3 install --upgrade google-cloud-aiplatform
! pip3 install ipython pandas[output_formatting] google-cloud-language==2.10.0

# Restart the kernel so the newly installed packages are picked up.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

# Authenticate the Colab runtime with your Google account.
from google.colab import auth as google_auth
google_auth.authenticate_user()
from typing import Dict, List, Union

from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value


def predict_custom_trained_model_sample(
    project: str,
    endpoint_id: str,
    instances: Union[Dict, List[Dict]],
    location: str = "us-central1",
):
    """
    `instances` can be either a single instance of type dict or a list
    of instances.
    """
    # The AI Platform services require regional API endpoints.
    api_endpoint = f"{location}-aiplatform.googleapis.com"
    client_options = {"api_endpoint": api_endpoint}
    # Initialize the client that will be used to create and send requests.
    # This client only needs to be created once and can be reused for multiple requests.
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
    # The format of each instance should conform to the deployed model's prediction input schema.
    instances = instances if isinstance(instances, list) else [instances]
    instances = [
        json_format.ParseDict(instance_dict, Value()) for instance_dict in instances
    ]
    parameters_dict = {}
    parameters = json_format.ParseDict(parameters_dict, Value())
    endpoint = client.endpoint_path(
        project=project, location=location, endpoint=endpoint_id
    )
    response = client.predict(
        endpoint=endpoint, instances=instances, parameters=parameters
    )
    print("Response")
    print("Deployed Model ID:", response.deployed_model_id)
    # The predictions are a google.protobuf.Value representation of the model's predictions.
    predictions = response.predictions
    for prediction in predictions:
        print("prediction:", prediction)
query = """
My name is John. My father's name is Peter, Peter's younger son is Rob. Peter has 2 sons and one daughter. What's my brother's name?
"""
predict_custom_trained_model_sample(
project="1099296904724",
endpoint_id="7858948475528937472",
location="us-east1",
instances = [
{
"prompt": query, "temperature": 0.0
},
]
)
Response
Deployed Model ID: 938921357469548544
prediction: Prompt:
My name is John. My father's name is Peter, Peter's younger son is Rob. Peter has 2 sons and one daughter. What's my brother's name?
Output:
Answer: Rob.
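
The model reasons correctly: Peter has exactly two sons, and since John is one of them, the other son, Rob, must be John’s brother. When you are done experimenting, undeploy the model and delete the endpoint so the GPUs stop incurring charges. A minimal cleanup sketch, reusing the project and endpoint IDs from the call above:

from google.cloud import aiplatform

aiplatform.init(project="1099296904724", location="us-east1")

# Undeploy all models from the endpoint, then delete the endpoint itself.
endpoint = aiplatform.Endpoint("7858948475528937472")
endpoint.undeploy_all()
endpoint.delete()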
