GKE Orchestration : Deploy your Gemma LLM

Alexandre Uy
5 min read · Apr 18, 2024


In February 2024, Google announced Gemma, a family of lightweight open models built with the same research and technology used to develop Gemini. These models are available worldwide and are designed for developers and researchers. Google provides them in several sizes, along with developer tools and integrations with Google Cloud services.

Why use GKE to deploy your Gemma model?

Deploying your Gemma model on Google Kubernetes Engine (GKE) unlocks all the advantages we already know from GKE:

  • Scalability
  • Reliability
  • Resource efficiency

Additionally, deploying your model on GKE gives you platform independence. It lets you leverage GenAI even where Vertex AI is not available, for example on S3NS or on Google Distributed Cloud Hosted (not available before the end of 2025).

What are the use cases of Gemma, Google’s new open LLM?

Google’s new open LLM helps you leverage its generative capabilities for tasks like:

  • Text generation: creating different creative text formats, like poems, code, scripts, emails, etc.
  • Question answering: answering your questions in an informative way, which can be useful in chatbots or virtual assistants.

This article covers, step by step, how to deploy your Gemma model on GKE Autopilot.

Prerequisites:

  • Create a GCP project
  • Set your billing project
  • Activate the GKE API
  • Verify your IAM roles (example gcloud commands are sketched right after this list)
  • Create an account on HuggingFace
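
If you prefer doing the last two checks from the command line, here is a minimal sketch, assuming gcloud is already authenticated and pointed at your project:

# Enable the GKE (Kubernetes Engine) API on the current project
gcloud services enable container.googleapis.com

# List the IAM roles granted to your account on the project
gcloud projects get-iam-policy $(gcloud config get project) \
  --flatten="bindings[].members" \
  --filter="bindings.members:$(gcloud config get account)" \
  --format="table(bindings.role)"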

Let’s start by getting your model before deploying it on GKE

To ensure responsible use of Gemma models, signing a consent contract is the first step. This agreement protects the rights of both the creators and users (like yourself) by promoting:

  • Transparency: Clear understanding of how the model can be used and any limitations.
  • Responsible Use: Ensuring the model is employed for ethical purposes and doesn’t cause harm.
To sign the agreement:

  1. Go to Kaggle.com and request access to the Gemma model
  2. Sign the consent agreement using the Hugging Face account you created earlier
  3. Accept the model terms

To tap into the capabilities of Gemma models, you’ll need a Hugging Face access token (since Hugging Face hosts the model). Here’s how to generate one in a few simple steps:

  1. On Hugging Face, click on your profile icon (top right) > Settings > Access Tokens
  2. In the Access Tokens panel, click on “New token”
  3. Give it a name and select at least the “read” role
  4. Then click on “Generate a token”

Now that you have generated your token, copy it to your clipboard; you will need it later.
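
If you want to make sure the token is valid before using it, you can query the Hugging Face whoami endpoint (the same endpoint the huggingface_hub tooling uses to identify you); this is just an optional sanity check:

# Replace <YOUR_HF_TOKEN> with the token you just generated
curl -s -H "Authorization: Bearer <YOUR_HF_TOKEN>" https://huggingface.co/api/whoami-v2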

Let’s configure your GCP environment

Activate Cloud Shell, then set up your environment variables:

gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=REGION
export CLUSTER_NAME=vllm
export HF_TOKEN=HF_TOKEN

Replace PROJECT_ID, REGION and HF_TOKEN with your own values; HF_TOKEN is the token you generated on Hugging Face just before.

Let’s configure your GKE cluster and node pools. If you want a fully managed Kubernetes experience, use GKE Autopilot; that is what we will use here.

In your Cloud Shell, paste the following command to create your GKE Autopilot cluster:

gcloud container clusters create-auto ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --release-channel=rapid \
  --cluster-version=1.28
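
Cluster creation takes a few minutes. As an optional check, you can confirm the cluster reached the RUNNING state before moving on:

# Should print RUNNING once the Autopilot cluster is ready
gcloud container clusters describe ${CLUSTER_NAME} \
  --region=${REGION} \
  --format="value(status)"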

Let’s create a Kubernetes secret for your Hugging Face credentials

  1. Configure your kubectl to communicate with your cluster
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}

2. Create your Kubernetes secret with your Hugging Face token

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=$HF_TOKEN \
  --dry-run=client -o yaml | kubectl apply -f -
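
You can check that the secret exists and holds the expected key without printing the token itself:

# Lists the keys and their sizes; the value stays hidden
kubectl describe secret hf-secret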

Let’s deploy your vLLM model now

  1. Create a manifest file vllm-2b-it.yaml with the following content
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [START gke_ai_ml_llm_serving_gemma_vllm_2b_it_deployment]
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        resources:
          requests:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
          limits:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: google/gemma-2b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
# [END gke_ai_ml_llm_serving_gemma_vllm_2b_it_deployment]

This deployment leverages a specific Docker image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01. This image, built with PyTorch, is hosted and maintained by Google on Artifact Registry.

2. Apply the manifest

kubectl apply -f vllm-2b-it.yaml

3. Wait for your deployment to become available

kubectl wait --for=condition=Available --timeout=700s deployment/vllm-gemma-deployment
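
This step can take a while: Autopilot first has to provision an L4 GPU node, and the container then downloads the model weights from Hugging Face. If you want to follow the progress, you can watch the pod and the inference server logs from another Cloud Shell tab, for example:

# Watch the pod created by the deployment until it is Running and Ready
kubectl get pods -l app=gemma-server -w

# Stream the inference server logs (model download, then server startup)
kubectl logs -f -l app=gemma-server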

Let’s expose your vLLM model

Execute the following command line

kubectl port-forward service/llm-service 8000:8000
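
Note that port-forward keeps running in the foreground, so open a second Cloud Shell tab for the next step, or send the command to the background in the same shell:

# Alternative: run the port-forward in the background
kubectl port-forward service/llm-service 8000:8000 &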

Let’s interact with your vLLM model

We will use curl to interact with the model just deployed

USER_PROMPT="Gemma vLLM is a"
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"prompt": "${USER_PROMPT}",
"temperature": 0.90,
"top_p": 1.0,
"max_tokens": 128
}
EOF

You have now deployed Gemma on GKE Autopilot!

{"predictions":["Prompt:\nGemma vLLM is a\nOutput:\n large language model, 
trained by Google, that can perform a wide range of language tasks,
including text generation, language translation, and question answering.\n\n
**Key Features**\n\n- **Unconditional language processing (UCLP):**
Gemma can process and understand text regardless of its format, source, or
structure.\n- **Multi-modal processing:** It can process and generate text
alongside a wide range of other modalities, such as images, videos, and audio.
\n- **High-quality text generation:** Gemma produces natural and coherent
text that is often indistinguishable from human-written text.\n-
