Getting Started with Mistral 7B model on GCP Vertex AI

Thomas Le Moullec
Oct 11, 2023

On the 27th of September, Mistral AI released their first open-source model: Mistral-7B v0.1, a lightweight 7-billion-parameter model.

Mistral-7B is released under the Apache 2.0 license and comes with weights and source code, permitting full customisation through fine-tuning.

The raw model weights are downloadable from the documentation and on Hugging Face.

Today, we have released a Vertex AI notebook that provides an end-to-end workflow to experiment with and deploy Mistral-7B on Vertex AI.

First, you will experiment locally on the notebook; then you will deploy the Mistral model on a Vertex AI endpoint and send prompt requests.

Running the notebook experimentation takes approximately 25 minutes; you will deploy the following architecture:

Note: Mistral-7B is a pre-trained base model and therefore does not have any moderation mechanisms. An API such as Perspective API might help you in that regard.
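
As an illustration (not part of the notebook), a minimal sketch of screening generated text with the Perspective API could look like this; it assumes you have enabled the API and created an API key, and the 0.8 threshold is an arbitrary example value.

import requests

PERSPECTIVE_API_KEY = "your_api_key"  # placeholder: create your own key in the GCP console
PERSPECTIVE_URL = (
    "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
    f"?key={PERSPECTIVE_API_KEY}"
)

def is_toxic(text: str, threshold: float = 0.8) -> bool:
    """Returns True if the Perspective API toxicity score exceeds the threshold."""
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    response = requests.post(PERSPECTIVE_URL, json=body)
    response.raise_for_status()
    score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    return score >= threshold

# Example: flag or drop a generation before returning it to the user.
# if is_toxic(generated_text):
#     generated_text = "[filtered]"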

Summary

  • Prerequisites & About this Notebook
  • Getting started with the Notebook in your environment (Opening, Dependencies, Variables)
  • Running Mistral model locally for testing and experimenting
  • Deploy Pre-built Mistral Model with vLLM on Vertex AI registry and endpoint
  • Cleaning up the resources
  • Conclusion

Prerequisites & About this Notebook

Requirements for the user-managed notebook instance (running Mistral locally):

Because you will first be experimenting with Mistral-7B locally on the notebook (not on the Vertex AI endpoint), you need to fulfil the requirement of 16GB of vRAM for the GPU attached to the notebook.

The minimum configurations to run Mistral-7B locally (GPU attached to an instance or notebook) are the following:

AI Accelerator requirements to run Mistral-7B with no inference stack (not with vLLM image)

Requirements for the Vertex AI endpoint (which you will configure in the notebook Python code):

In this notebook you will deploy a vLLM image on an endpoint (further explanations below). To run the image, you need a VM with at least 24GB of vRAM for good throughput with float16 weights. The space taken by the vLLM KV cache explains the difference between the 16GB and 24GB of vRAM.
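
As a rough back-of-the-envelope check of where that 24GB figure comes from (approximate numbers, ignoring activations and CUDA overhead):

n_params = 7.24e9        # Mistral-7B has roughly 7.24 billion parameters
bytes_per_param = 2      # float16 / bfloat16 weights
weights_gb = n_params * bytes_per_param / 1e9
print(f"~{weights_gb:.1f} GB just for the weights")  # ~14.5 GB
# The rest of the vRAM is used by the vLLM KV cache (and activations),
# which is why ~24GB is recommended for good throughput even though the
# weights alone would fit in 16GB.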

On GCP you can leverage the following minimum configurations for GPUs to fit a vLLM image serving Mistral-7B:

The cast to float16 and the GPU selection are demonstrated below in the blog post and in the notebook.

Discovering the notebook

To get started, open the Vertex AI notebook in the Model Garden repository on GitHub; you will be prompted to run it in Colab or to open it in Vertex AI Workbench.

Google Colab is a free cloud-based Jupyter Notebook environment that allows developers to write and run Python code using their browser.

Vertex AI Workbench is a JupyterLab-based development environment for the entire data science workflow, from data preparation and exploration to model training and deployment. One of its key benefits is the integration with other services and data sources in your GCP environment.

Getting started — Setting up the environment

In this blog post we will be leveraging Vertex AI Workbench to run the notebook. When clicking on “Open in Vertex AI Workbench”, you will be redirected to the Google Cloud Vertex AI console.

You then need to select “Create a new notebook” and configure the machine and environment as needed.

A Python 3 environment is recommended to run the Mistral notebook.

If you decide to run the Mistral model locally on the managed notebook instance, you will need to select a GPU and install the NVIDIA drivers (see Requirements for the user-managed notebook instance above).

Find below a working configuration to run inference locally on the notebook:

  • OS: Debian 11
  • Environment: Python3 + CUDA latest versions
  • Machine Type: N1-highmem-8 (define a machine type that allows you to attach a GPU; it will depend on the experiment you want to run)
  • GPU: Nvidia V100
  • Number of GPUs: 1

After selecting your machine type and environment, you can click Create, and the managed Jupyter notebook will be online after approximately 2–3 minutes.

Once the notebook is up and running, you will be able to click Open and confirm the deployment to the notebook server.

You will have the “model_garden_pytorch_mistral.ipynb” file already opened in your Jupyterlab environment.

Installing the Dependencies

First, in order to avoid errors such as “KeyError: ‘mistral’”, make sure you install a recent version of the transformers library (4.34.0 at the time of writing):

! pip3 install transformers==4.34.0
! pip3 install accelerate==0.23.0

Define your environment variables

# Cloud project id.
PROJECT_ID = "your_project_id" # @param {type:"string"}

# The region you want to launch jobs in.
# Select region based on the accelerators and regions supported by Vertex AI Prediction
# https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.
REGION = "europe-west4" # @param {type:"string"}

After setting your “Project ID”, you will need to set up the “Region” in which you will deploy your Vertex AI resources and the staging bucket URI.

# The Cloud Storage bucket for storing experiments output.
# Start with gs:// prefix, e.g. gs://foo_bucket.
BUCKET_URI = "gs://experiment-mistral" # @param {type:"string"}

! gcloud config set project $PROJECT_ID

import os

STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")

This bucket will be used to stage artifacts when making API calls (its URI starts with the gs:// prefix).

If you have not created the bucket for storing artifacts, go to the GCP Cloud storage console and create a new bucket with a unique name.

Cloud Storage creation page: selecting the same region as the notebook might help limit costs
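
Alternatively, if you prefer to create the bucket directly from the notebook, a single gsutil command is enough (reusing the REGION and BUCKET_URI variables defined above; bucket names must be globally unique):

! gsutil mb -l $REGION $BUCKET_URI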

Finally, you can enter the service account that this notebook should use to deploy the Mistral model on Vertex AI. If you need to create the service account, make sure to create it with `Vertex AI User` and `Storage Object Admin` roles.

# The service account looks like:
# '@.iam.gserviceaccount.com'
# Please go to https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console
# and create service account with `Vertex AI User` and `Storage Object Admin` roles.
# The service account for deploying fine tuned model.
SERVICE_ACCOUNT = "" # @param {type:"string"}
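
If you would rather create that service account from the notebook than from the console, a sketch of the gcloud commands could look like the following (the account name "mistral-vertex-sa" is just an example):

! gcloud iam service-accounts create mistral-vertex-sa --display-name="Mistral on Vertex AI"

# Grant the `Vertex AI User` and `Storage Object Admin` roles.
! gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:mistral-vertex-sa@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/aiplatform.user"
! gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:mistral-vertex-sa@$PROJECT_ID.iam.gserviceaccount.com" --role="roles/storage.objectAdmin"

SERVICE_ACCOUNT = f"mistral-vertex-sa@{PROJECT_ID}.iam.gserviceaccount.com"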

Define Constants

In this notebook, we are leveraging a vLLM PyTorch image that can be used as an inference stack to run the Mistral model.

vLLM is an open-source Python library for fast and easy LLM inference and serving, with state-of-the-art serving throughput.

# The pre-built serving docker image with vLLM
VLLM_DOCKER_URI = (
"us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve"
)

By leveraging the model garden vLLM image, users do not have to manage dependencies or build a docker image to deploy an inference endpoint.

# Assumes the Vertex AI SDK has been initialised earlier in the notebook
# (aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)).
from google.cloud import aiplatform


def deploy_model_vllm(
    model_name,
    model_id,
    service_account,
    machine_type="g2-standard-8",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
):
    """Deploys trained models with vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")

    # T4 and V100 GPUs do not support bfloat16, so cast the weights to float16.
    dtype = "bfloat16"
    if accelerator_type in ["NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100"]:
        dtype = "float16"

    vllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--dtype={dtype}",
        "--gpu-memory-utilization=0.9",
        "--disable-log-stats",
    ]
    # Upload the vLLM serving container and model reference to the Model Registry.
    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
        serving_container_args=vllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
    )

    # Deploy the uploaded model to the endpoint on the requested machine and GPUs.
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
    )
    return model, endpoint

When using vLLM you can pass arguments to configure your inference stack. For example, here are two important arguments used in the code above.

  1. --tensor-parallel-size: It defines the model-parallelism setup, i.e. how many GPUs the model is sharded across. Mistral-7B is light but might not fit in a single GPU, so you may need to distribute it across GPUs for inference. Currently, vLLM supports Megatron-LM’s tensor parallel algorithm; the distributed runtime is managed with Ray.
  2. --dtype: As listed in the Mistral model weights repository on Hugging Face, the PyTorch model is configured to use bfloat16. Bfloat16 requires NVIDIA GPUs with a compute capability of 8 or higher. As listed in the NVIDIA data center products list, T4 and V100 GPUs provide a compute capability below 8. This is why the code casts bfloat16 to float16, so that you can run the Mistral model on 2 V100 GPUs or 2 T4 GPUs.

Running Mistral-7B locally for testing and experimenting

The first few lines of code import the necessary libraries and set the device to load the model onto (cuda).

Note: In this example we are leveraging the “Mistral-7B-v0.1” model; for chat use cases, Mistral provides “Mistral-7B-Instruct-v0.1”. This model is based on the foundational Mistral-7B-v0.1 model and has been fine-tuned for conversation and question answering.

%%time
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto
model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", return_dict=True, torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "My favourite condiment is"

sequences = pipeline(
    prompt,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

If you want to build some chat experiments, you could explore the FastChat library to manage the conversation; it would provide better outputs than the code above.
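
As a lighter-weight alternative, here is a minimal sketch of prompting the instruction-tuned variant directly, wrapping the instruction in the [INST] ... [/INST] tags it was fine-tuned on. On a single 16GB GPU you would load the Instruct model instead of (not in addition to) the base model; the prompt is just an example.

chat_model_name = "mistralai/Mistral-7B-Instruct-v0.1"
chat_model = AutoModelForCausalLM.from_pretrained(
    chat_model_name, device_map="auto", torch_dtype=torch.float16
)
chat_tokenizer = AutoTokenizer.from_pretrained(chat_model_name)
chat_pipeline = transformers.pipeline(
    "text-generation", model=chat_model, tokenizer=chat_tokenizer
)

# Mistral-7B-Instruct expects instructions wrapped in [INST] ... [/INST] tags.
chat_prompt = "[INST] What are the main ingredients of mayonnaise? [/INST]"

outputs = chat_pipeline(
    chat_prompt,
    max_new_tokens=200,
    do_sample=True,
    top_k=10,
    eos_token_id=chat_tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"])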

The AutoModelForCausalLM class from Transformers is used to load the Mistral-7B-v0.1 model.

It downloads the model from the Hugging Face Hub, which is why you will see progress bars displayed in your notebook.

The AutoTokenizer class is used to load the tokenizer associated with the model (a tokenizer is used to split words into tokens).
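
As a quick illustration of what the tokenizer does, you can encode a string and look at the resulting tokens (the exact ids and splits depend on the model's vocabulary):

encoded = tokenizer("My favourite condiment is")
print(encoded["input_ids"])                                   # token ids (a BOS token is typically prepended)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding sub-word tokens
print(tokenizer.decode(encoded["input_ids"]))                 # back to (roughly) the original text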

Make sure to specify torch_dtype=torch.float16 if you are using a GPU with a compute capability below 8 (T4, V100).
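
If you are unsure which dtype the attached GPU supports, a small check along these lines can guide the choice (compute capability 8.0 and above, e.g. A100 or L4, supports bfloat16; T4 is 7.5 and V100 is 7.0):

import torch

major, minor = torch.cuda.get_device_capability()
torch_dtype = torch.bfloat16 if major >= 8 else torch.float16
print(f"Compute capability {major}.{minor} -> loading weights in {torch_dtype}")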

Deploy Pre-built Mistral-7B with vLLM on Vertex AI endpoint

Just before deploying the Mistral-7B model, you will define the variables that configure the endpoint: machine_type (which machine), accelerator_type (which GPUs) and accelerator_count (how many GPUs).

Please refer to Requirements for the Vertex AI endpoint at the beginning of the blog post.

In this example, we are deploying the vLLM image on an n1-highmem-8 with 2 V100 GPUs.

# Find Vertex AI prediction supported accelerators and regions in
# https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.

# Sets V100 (16GB) to deploy Mistral 7B - Need 2 GPUs.
machine_type = "n1-highmem-8"
accelerator_type = "NVIDIA_TESLA_V100"
accelerator_count = 2

# Sets L4 to deploy Mistral 7B.
# machine_type = "g2-standard-8"
# accelerator_type = "NVIDIA_L4"
# accelerator_count = 1

# Sets T4 to deploy Mistral 7B.
# machine_type = "n1-standard-16"
# accelerator_type = "NVIDIA_TESLA_T4"
# accelerator_count = 2

# Sets A100 (40G) to deploy Mistral 7B.
# machine_type = "a2-highgpu-1g"
# accelerator_type = "NVIDIA_TESLA_A100"
# accelerator_count = 1

model, endpoint = deploy_model_vllm(
    model_name=get_job_name_with_datetime(prefix="mistral-serve-vllm"),
    model_id=prebuilt_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
)

Note: Make sure to select an “n1-highmem” configuration instead of an “n1-standard” one if you select 2 V100 GPUs.

Running the notebook cell with the deploy_model_vllm function uploads the model to the Vertex AI Model Registry and deploys a Vertex AI endpoint for inference.

You will need to wait approximately 15 minutes for the endpoint to be up and running, ready to receive inference requests. To check the availability of your endpoint, you can go to the Online Prediction page in Vertex AI and check whether your endpoint has a count of 1 in the “Models” field.
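
You can also check it from the notebook itself; a small sketch using the endpoint object returned by deploy_model_vllm (the gcloud command in the comment is an equivalent check from a terminal):

# An empty list means the model is not (yet) deployed on the endpoint.
print(endpoint.list_models())

# Equivalent check from the command line:
# gcloud ai endpoints list --region=$REGION --project=$PROJECT_ID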

Note: If you receive “InternalServerError: 500 System error” during the deployment, the operation most likely failed due to unavailability of resources. Either retry in another region or try a different accelerator type.

instance = {
    "prompt": "My favourite condiment is",
    "n": 1,
    "max_tokens": 200,
}
response = endpoint.predict(instances=[instance])
print(response.predictions[0])

The cell above sends a prompt to the inference endpoint deployed previously. It sends a single prompt; the next cell in the notebook offers an option for streaming prediction.

Vertex AI Model Registry is a central repository where you can manage the lifecycle of your ML models. It gives you an overview of your models so you can better organise, track, and train new versions (leveraging aliases).

Vertex AI Model Registry — Main page: see your models and their deployment status

Once your model is deployed in the registry, you can click on your model in the Model Registry main page and navigate to Evaluate, Deploy & Test, Version details.

In the Deploy & Test page, you can check the deployment status and the version, and you can send a JSON request.

Vertex AI Model Registry — Deploy and Test page

In the Version details page you can see the details regarding your hosted model:

Vertex AI Model Registry — Version Details page: You will be able to check the container image deployed and its arguments

While the model registry is useful for hosting the container and managing the model lifecycle, you need an inference endpoint to receive requests. This is the purpose of the Vertex AI endpoint.

To see your deployment, you can click on Online Prediction in Vertex AI and click on your specific endpoint deployment.

By doing so you will be able to see the details of your inference endpoint, its performance (predictions per second, error percentage, latency) and its resource usage (GPU memory usage, CPU usage, network bytes in):

For further analysis, you can view the logs and check for errors or outputs from your Python code.

Cleaning up the resources

If you have created a new project for the purpose of testing this notebook, you can delete it. Otherwise you can delete the resources individually.

Before being able to delete the endpoint, you will need to undeploy the model(s) from it. Make sure to set “delete_endpoint = True” if you want to do so.

You can delete the Cloud storage bucket used for storing Vertex AI artifacts. Make sure to set “delete_bucket = True” if you want to delete the bucket.
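
For reference, the clean-up logic boils down to something like the following sketch (it assumes the model, endpoint, and BUCKET_URI variables from the cells above; double-check the flags before running it):

delete_endpoint = True   # set to True to undeploy and delete the endpoint and model
delete_bucket = False    # set to True to also delete the staging bucket

if delete_endpoint:
    endpoint.undeploy_all()   # models must be undeployed before the endpoint can be deleted
    endpoint.delete()
    model.delete()

if delete_bucket:
    ! gsutil -m rm -r $BUCKET_URI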

To avoid additional cost, you can stop or delete the notebook after you have finished your experimentation with Mistral-7B.

Conclusion

As stated by Philip Moyer, GCP Global AI VP: “Organizations need open AI ecosystems in which data interconnectivity and open infrastructure are possible”. With Vertex AI, users are able to run Mistral AI open-source models on GCP for inference and fine-tuning workloads.

Google’s commitment to an open AI ecosystem is reflected in the GCP portfolio with services such as Ray on GKE, Ray on Vertex AI, GPUs on GKE, TPUs on GKE, and Vertex AI Pipelines with Kubeflow.

To learn more about Mistral AI’s performance and features, check their blog post and their reference implementation.
