Using Llama 3.1 API with Vertex AI

saxenashikha · Google Cloud - Community · 5 min read · Aug 22, 2024

On July 24th, 2024, Google Cloud announced the addition of the Llama 3.1 family of models, including the new 405B model (Meta's most powerful and versatile model to date), to Vertex AI Model Garden.

Llama is a collection of open models developed by Meta that you can fine-tune and deploy on Vertex AI. Llama offers pre-trained and instruction-tuned generative text models for assistant-like chat. You can deploy Llama 3.1, Llama 3, and Llama 2 models on Vertex AI.

The Llama 3.1 collection of multilingual large language models (LLMs) comprises pre-trained and instruction-tuned generative models in 8B, 70B, and 405B sizes (text in/text out). The Llama 3.1 instruction-tuned, text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open-source and closed chat models on common industry benchmarks.

Vertex AI provides a curated collection of first-party, open-source, and third-party models, many of which (including the new Llama models) can be delivered as fully managed Model-as-a-Service (MaaS) offerings. With MaaS, you can choose the foundation model that fits your requirements, access it simply via an API, tailor it with robust development tools, and deploy it on Google's fully managed infrastructure, all with the simplicity of a single bill and no infrastructure to maintain.

Llama models on Vertex AI are offered as fully managed, serverless APIs. To use a Llama model on Vertex AI, you send a request directly to the Vertex AI API endpoint. Because Llama models use a managed API, there's no need to provision or manage infrastructure.

You can stream responses to reduce the end user's perception of latency. A streamed response uses server-sent events (SSE) to deliver the output incrementally.

Llama 3.1 405B is Meta’s most powerful and versatile model to date. It’s the largest openly available foundation model, providing capabilities from synthetic data generation to model distillation, steerability, math, tool use, multilingual translation, and more. For more information, see Meta’s Llama 3.1 site.

Llama 3.1 405B is optimized for the following use cases:

  • Enterprise-level applications
  • Research and development
  • Synthetic data generation and model distillation

Let's Start

In this guide, I'll walk you through the process of setting up and using the Llama 3.1 LLM on GCP through its APIs.

Pre-implementation steps

  1. Enable Vertex AI APIs

The Vertex AI API (aiplatform.googleapis.com) must be enabled in your project before you can use Vertex AI. If you already have an existing project with the Vertex AI API enabled, you can use that project instead of creating a new one.
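If the API is not yet enabled, you can enable it from the console or with a single gcloud command, for example (replace YOUR_PROJECT_ID with your project ID):

gcloud services enable aiplatform.googleapis.com --project=YOUR_PROJECT_ID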

2. Provide permissions

The following roles and permissions are required to use partner models:

  • You must have the Consumer Procurement Entitlement Manager Identity and Access Management (IAM) role. Anyone who’s been granted this role can enable partner models in Model Garden.
  • You must have the aiplatform.endpoints.predict permission. This permission is included in the Vertex AI User IAM role. For more information, see Vertex AI User and Access control.
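For example, a project administrator can grant both roles to a user with gcloud. This is a sketch, assuming the role IDs below match the console names above, with you@example.com as a placeholder:

# Grant the Consumer Procurement Entitlement Manager role (role ID assumed)
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="user:you@example.com" \
  --role="roles/consumerprocurement.entitlementManager"
# Grant the Vertex AI User role, which includes aiplatform.endpoints.predict
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="user:you@example.com" \
  --role="roles/aiplatform.user"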

3. Set the organization policy

To enable partner models, your organization policy must allow the following APIs:

  • Cloud Commerce Consumer Procurement API — cloudcommerceconsumerprocurement.googleapis.com
  • Commerce Agreement API — commerceagreement.googleapis.com

If your organization sets an organization policy to restrict service usage, an organization administrator must verify that cloudcommerceconsumerprocurement.googleapis.com and commerceagreement.googleapis.com are allowed by that policy.
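For example, if your organization uses the gcp.restrictServiceUsage list constraint, an administrator could allow the two services with something like the following sketch (adjust to how your organization actually manages its policies):

# Allow the two procurement APIs under the restrictServiceUsage constraint
gcloud resource-manager org-policies allow constraints/gcp.restrictServiceUsage \
  cloudcommerceconsumerprocurement.googleapis.com \
  commerceagreement.googleapis.com \
  --organization=YOUR_ORG_ID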

4. Billing

Make sure that billing is enabled for your Google Cloud Project.
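You can verify this from the CLI; the command below should report billingEnabled: true for the project:

gcloud billing projects describe YOUR_PROJECT_ID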

Implementation Steps

  1. Go to the project where you would like to use the Llama model.
  2. Navigate to Vertex AI -> Model Garden -> Llama 3.1.
  3. Enable the Llama 3.1 API service.
  4. Once enabled, you will see the API marked as enabled.
  5. You can then try out Llama in the same screen. You will get a response in the console, and you can set the generation variables as required.

Using Llama 3.1 API Service

The Llama 3.1 API service can be called directly from your application. You can also try the API from the command line.

To use Llama 3.1 API service with the command line interface (CLI), do the following:

  1. Open Cloud Shell or a local terminal window with the gcloud CLI installed.
  2. Configure environment variables by entering the following. Replace YOUR_PROJECT_ID with the ID of your Google Cloud project:
ENDPOINT=us-central1-aiplatform.googleapis.com
REGION=us-central1
PROJECT_ID="YOUR_PROJECT_ID"

3. Send a prompt request by entering the following curl command:

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${ENDPOINT}/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/openapi/chat/completions" \
  -d '{"model": "meta/llama3-405b-instruct-maas", "stream": true, "messages": [{"role": "user", "content": "Summer travel plan to Paris"}]}'

How to use Llama 3.1 in your application

  1. Install Vertex AI SDK for Python and other required packages
!pip3 install --upgrade --quiet google-cloud-aiplatform[langchain] openai
!pip3 install --upgrade --quiet langchain-openai

2. Use the following code. The Llama 3.1 MaaS endpoint is OpenAI-compatible (it is the same chat/completions endpoint used in the curl example above), so the minimal sketch below calls it with the openai client installed in step 1. You can change max_tokens, temperature, and top_p as per your requirements.

import openai
from google.auth import default
from google.auth.transport.requests import Request

# The MaaS endpoint is served from regional API endpoints.
PROJECT_ID = "YOUR_PROJECT_ID"
REGION = "us-central1"

# Mint an access token from your Application Default Credentials.
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# The endpoint is OpenAI-compatible, so the standard openai client works.
client = openai.OpenAI(
    base_url=f"https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi",
    api_key=credentials.token,
)

response = client.chat.completions.create(
    model="meta/llama3-405b-instruct-maas",
    messages=[
        {
            "role": "user",
            "content": "You are a physician advisor. What advice can I give a "
            "patient to avoid getting ringworm? Please provide detailed "
            "treatment options.",
        }
    ],
    max_tokens=500,
    temperature=0.2,
    top_p=0.8,
)

print(response.choices[0].message.content)


A sample (truncated) response:

**Prevention Advice:**

To avoid getting ringworm, I advise patients to take the following precautions:

  • Keep yourself and your surroundings clean. Regularly wash your hands, especially after coming into contact with someone who has ringworm.
  • Avoid sharing personal items such as towels, clothing, or hair accessories, as the fungus can be transmitted through contaminated items.
  • Wear shoes in public areas, particularly in locker rooms, showers, or around swimming pools, to reduce the risk of infection.
  • Change your socks and underwear at least once a day to prevent moisture buildup that can facilitate fungal growth.

**Symptoms… (output truncated)

Congratulations! You've successfully set up the Llama 3.1 API on Google Cloud Platform and are ready to use its powerful language capabilities for your projects.
