The Chronicles of Llama: The new Llama 405b on Vertex AI!
Co-authored with Nikita Namjoshi, Pavel Jbanov and Harish Muppalla
The Llama 3.1 family of models, including the new 405B model (Meta's most powerful and versatile model to date), is now available in Vertex AI Model Garden. The 405B model is the largest openly available foundation model to date, unlocking an array of new possibilities, from generating synthetic data to powering complex reasoning tasks in more than 5 languages.
In this article, you'll see a conversational application built using the Llama 3.1 405B model with Firebase Genkit. Then you'll learn how to use Vertex AI as an end-to-end platform to help you experiment, prototype, evaluate, and deploy GenAI applications with Llama 3.1 models. By the end of this article, you'll have a better understanding of how to access Llama 3.1 models on Vertex AI and start using them to build powerful GenAI applications.
“Chat with Llama 3.1 405B” application using Genkit
Imagine that you want to build a conversational GenAI app leveraging the Llama 3.1 405B model on Vertex AI.
This Chat with Llama 3.1 application allows users to ask the Llama 3.1 405B model various coding questions, such as writing functions and generating test cases. And if you need to explain this code to a colleague or student in Italian, the model can seamlessly switch between languages, providing a clear explanation of the English-based code in Italian.
One way to build this application is using Firebase Genkit. Genkit is an open-source framework that helps you build, deploy, and monitor AI-powered apps. Genkit is built for developers, making it easier to integrate LLMs such as Llama 3.1 405B hosted on Vertex AI into your apps. Genkit provides plugins that expose LLMs through a simple, flexible interface, making it easy to integrate any model API. To switch models, all you need to change is a single line of configuration, as shown below.
import { generate } from '@genkit-ai/ai';
import { configureGenkit } from '@genkit-ai/core';
import {
  vertexAI,
  llama3,
  VertexAIEvaluationMetricType,
} from '@genkit-ai/vertexai';

configureGenkit({
  plugins: [
    // Configure the Vertex AI plugin with a Model Garden model
    vertexAI({
      location: 'us-central1',
      modelGarden: {
        models: [llama3],
      },
      // Metrics used by the Vertex AI Rapid Evaluation API
      evaluation: {
        metrics: [
          VertexAIEvaluationMetricType.SAFETY,
          VertexAIEvaluationMetricType.FLUENCY,
        ],
      },
    }),
  ],
  logLevel: 'debug',
  enableTracingAndMetrics: true,
});

// Generate content from Llama 3.1
const llmResponse = await generate({
  model: llama3,
  prompt: 'What is Vertex AI?',
});
Additionally, Genkit is particularly powerful during the prototyping phase. The Genkit Developer UI lets you test, evaluate, and debug your end-to-end flows, including any custom code. For example, in this case, the conversational application integrates the Vertex AI Rapid Evaluation API, which lets you evaluate your large language models (LLMs) across several metrics. You can find the source code of the chatbot app that we've built here.
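If you prefer to call the Rapid Evaluation API directly, the Vertex AI SDK for Python exposes it as well. Below is a minimal sketch, assuming a hypothetical project ID and an illustrative single-row dataset; the metric names follow the Rapid Evaluation documentation.

import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

vertexai.init(project='your-project-id', location='us-central1')  # hypothetical project

# An illustrative prompt/response pair to score
eval_dataset = pd.DataFrame({
    'prompt': ['What is Vertex AI?'],
    'response': ["Vertex AI is Google Cloud's machine learning platform."],
})

# Score the responses on the same metrics used in the Genkit config above
eval_task = EvalTask(dataset=eval_dataset, metrics=['safety', 'fluency'])
result = eval_task.evaluate()
print(result.summary_metrics)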
Note that Genkit is just one approach to developing GenAI applications with Llama 3.1 models. You can also use Llama 3.1 models on Vertex AI by leveraging the OpenAI libraries for Python and REST, as well as the Vertex AI SDK.
Prototype with Llama 3.1 models on Vertex AI
On Vertex AI, you can access Llama 3.1 models through Vertex AI Model Garden in just a few clicks using Model-as-a-Service (MaaS), with no setup or infrastructure hassles, since Google manages the serving infrastructure for you. Alternatively, you can self-deploy Llama 3.1 models from Vertex AI Model Garden, giving you the flexibility to choose and manage your preferred infrastructure. Below you can find an overview of the self-hosted and Model-as-a-Service deployment options.
Assuming that you already have a conversational application, you can use the OpenAI libraries for Python to chat with Llama 3.1 405B through Model-as-a-Service. This means you can switch between calling different models to compare output, cost, and scalability, without changing your existing code.
Below you can see the Chat Completions API code for Llama 3.1 405B on MaaS.
# Import libraries
import openai
from google.auth import default
from google.auth.transport import requests as transport_requests

# Project settings (placeholders)
PROJECT_ID = 'your-project-id'
LOCATION = 'us-central1'

# Set some sampling parameters
temperature = 1.0
max_tokens = 500
top_p = 1.0

# Get an access token from the default credentials
credentials, _ = default()
auth_request = transport_requests.Request()
credentials.refresh(auth_request)

# Initialize the OpenAI client, pointing at the Vertex AI MaaS endpoint
# (the client appends /chat/completions to this base URL)
client = openai.OpenAI(
    base_url=f'https://{LOCATION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi',
    api_key=credentials.token,
)

# Submit a model request
response = client.chat.completions.create(
    model='meta/llama3-405b-instruct-maas',
    messages=[
        {"role": "user", "content": "What is Vertex AI?"},
        {"role": "assistant", "content": "Sure, Vertex AI is:"},
    ],
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=top_p,
)

# Print the response
print(response.choices[0].message.content)
To use the OpenAI SDK with the Llama 3.1 Chat Completions API on Vertex AI, you need to request an access token and initialize the client so that it points to the Llama 3.1 Model-as-a-Service (MaaS) endpoint. You can request the access token from the default credentials for the current environment, as shown above.
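For the REST route mentioned earlier, you can call the same MaaS endpoint directly. Below is a minimal sketch using the requests library, reusing the PROJECT_ID, LOCATION, and credentials variables from above; the URL follows the same Chat Completions endpoint pattern as the client setup.

import requests

url = (
    f'https://{LOCATION}-aiplatform.googleapis.com/v1beta1/projects/'
    f'{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi/chat/completions'
)
payload = {
    'model': 'meta/llama3-405b-instruct-maas',
    'messages': [{'role': 'user', 'content': 'What is Vertex AI?'}],
}
# Authenticate with the same short-lived access token
headers = {'Authorization': f'Bearer {credentials.token}'}

resp = requests.post(url, json=payload, headers=headers)
print(resp.json()['choices'][0]['message']['content'])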
Evaluate Llama 3.1 models on Vertex AI
In this scenario, the newly launched Model-as-a-Service deployment offers seamless access to the advanced Llama 3.1 405B model. However, prior to integrating Llama 405B into your application, you may want to evaluate it against a smaller Llama model. Vertex AI offers AutoSxS, a model-based evaluation tool that uses an LLM, known as an autorater, to compare responses from two other LLMs and determine which one provides the better response to a prompt. This tool allows you to make informed decisions regarding model selection and deployment.
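AutoSxS runs as a Vertex AI pipeline. Below is a minimal sketch of launching it with the Vertex AI SDK for Python, assuming a hypothetical evaluation dataset in Cloud Storage with prompt, response_a, and response_b columns; the template path and parameter names follow the AutoSxS documentation, so double-check them against the current docs.

from google.cloud import aiplatform

aiplatform.init(project='your-project-id', location='us-central1')  # hypothetical project

# Launch the AutoSxS evaluation pipeline
pipeline_job = aiplatform.PipelineJob(
    display_name='autosxs-llama-eval',
    template_path=(
        'https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/'
        'autosxs-template/default'
    ),
    parameter_values={
        # JSONL with one prompt and the two models' responses per row (hypothetical path)
        'evaluation_dataset': 'gs://your-bucket/eval_dataset.jsonl',
        'id_columns': ['prompt'],
        'task': 'question_answering',
        'autorater_prompt_parameters': {
            'inference_instruction': {'column': 'prompt'},
        },
        'response_column_a': 'response_a',  # e.g. Llama 3 70B responses
        'response_column_b': 'response_b',  # e.g. Llama 3.1 405B responses
    },
)
pipeline_job.run()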
Below you can see the autorater's judgments, which evaluate the responses from two different Llama models and provide an explanation for the preferred choice.
AutoSxS also provides win-rate metrics. In this particular case, the autorater model preferred the answers provided by Llama 3.1 405B (model_b) over Llama 3 70B (model_a), helping increase confidence that 405B is the right model for this question-answering task.
Below you can see the evaluation pipeline with the win-rate metrics in the Vertex AI Pipeline’s UI.
Build with Llama 3.1 models on Vertex AI
After you verify that the Llama 3.1 model is the right model for your GenAI application, the OpenAI SDK facilitates the integration of Llama models on Vertex AI into existing GenAI applications and the GenAI ecosystem, including LlamaIndex and LangChain. For example, you can use the SDK in combination with LlamaIndex on Vertex AI to deploy a RAG application.
LlamaIndex on Vertex AI helps you with the end-to-end process of building and deploying context-augmented large language model (LLM) applications, including retrieval-augmented generation (RAG): from ingesting data from various sources, to transforming it for indexing, to creating numerical representations (embeddings) for semantic understanding. Then, when a user provides a query, LlamaIndex on Vertex AI retrieves relevant information and uses it as context to generate accurate and relevant responses.
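The query code further below assumes a RAG corpus has already been created and populated. Here is a minimal sketch of that setup step, with a hypothetical display name and Cloud Storage path:

import vertexai
from vertexai.preview import rag

vertexai.init(project='your-project-id', location='us-central1')  # hypothetical project

# Create a RAG corpus and import documents into it
rag_corpus = rag.create_corpus(display_name='llama-facts-corpus')
rag.import_files(
    rag_corpus.name,
    paths=['gs://your-bucket/llama-facts.pdf'],  # illustrative source document
)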
Below you can see how to use LlamaIndex on Vertex AI to retrieve relevant information from the RAG corpus and pass it to the OpenAI Chat Completions API on Vertex AI to generate a better answer about llama spitting!
# Retrieve the most relevant context from the RAG corpus
question = "What about llama spitting?"
retrieval_response = rag.retrieval_query(
    rag_resources=[
        rag.RagResource(
            rag_corpus=rag_corpus.name,
        )
    ],
    text=question,
    similarity_top_k=1,
    vector_distance_threshold=0.5,
)
context = " ".join([ctx.text for ctx in retrieval_response.contexts.contexts])

# Pass the retrieved context to the Chat Completions API
response = client.chat.completions.create(
    model=MODEL_ID,  # e.g. 'meta/llama3-405b-instruct-maas'
    messages=[
        {'role': 'system', 'content': '''You are an AI assistant. Your goal is to answer questions using the pieces of context. If you don't know the answer, say that you don't know.'''},
        {'role': 'user', 'content': question},
        {'role': 'assistant', 'content': context},
    ],
)
print(response.choices[0].message.content)
Of course, RAG is just one example of what you can build with Llama 3.1 405B. By leveraging the OpenAI Chat Completions API on Vertex AI, you can build LangChain chains and even agents, and deploy them on Vertex AI Reasoning Engine!
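As a quick illustration, LangChain's OpenAI-compatible chat client can be pointed at the same Llama 3.1 MaaS endpoint. This is a minimal sketch, reusing the PROJECT_ID, LOCATION, and credentials variables from the earlier setup:

from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI-compatible client at the Llama 3.1 MaaS endpoint
llm = ChatOpenAI(
    model='meta/llama3-405b-instruct-maas',
    base_url=f'https://{LOCATION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi',
    api_key=credentials.token,  # short-lived token from google.auth
)
print(llm.invoke('What is Vertex AI?').content)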
Conclusions
The Llama 3.1 family of models, including the new 405B model, is now available in Vertex AI Model Garden.
This article showed how to use Vertex AI as an end-to-end platform to help you experiment, prototype, evaluate, and deploy GenAI applications with Llama 3.1 models.
If you want to learn more, you'll find sample code and notebooks in Vertex AI Model Garden to get started building GenAI applications today.
What’s Next
Do you want to know more about Llama 3.1 and how to use it? Check out the following resources!
Documentation
GitHub samples
- Get started with Llama 3.1 notebook
- Evaluate Llama 3 models with Vertex AI AutoSxS notebook
- Get started with RAG using Llama 3.1 on LlamaIndex on Vertex AI notebook
- Synthetic Data Generation using Llama 3.1 notebook
- Firebase Genkit Chat with Llama 3.1 app
YouTube videos
- Llama 3.1 405B on Vertex AI Model Garden jumpstart
Thanks for reading
I hope you enjoyed the article. If so, follow me, clap 👏 for this article, or leave a comment. Also, let's connect on LinkedIn or X to share feedback and questions 🤗 about Vertex AI.
Huge thanks to Alok Pattani, Lavi Nigam and Eric Doug for support and feedback!