Securely Exposing Vertex AI APIs using Cloud Run and Cloud Endpoints

Ankur Gautam
Google Cloud - Community
6 min read · Nov 3, 2023

This blog post describes how to expose a secure API connection to Vertex AI APIs by deploying a Cloud Endpoint on a Cloud Run instance.

We will interact with the Vertex AI API for PaLM 2 for text (text-bison) models. At the end, we will have a mechanism to expose an endpoint that is secured by an API key. For more information about the different PaLM APIs, please see here.

There are two ways to integrate with a Google Cloud API: first, by using Google-provided client libraries; and second, by making REST calls to HTTP endpoints published by Google. For both approaches, the client application requires some form of authentication to connect with these services. Let's explore both approaches.

Using the REST Service Endpoint

To authenticate while using the REST service endpoint, pass an authorization bearer token in the header of the JSON request:

curl -X POST -H "Authorization: Bearer <Token>"

Cloud Run instances expose a metadata server that can be used to generate access tokens for the runtime service account. A service account with permission to call the PaLM API can be attached to the Cloud Run service.

The metadata server can be accessed using simple HTTP requests to the http://metadata.google.internal/ endpoint with the Metadata-Flavor: Google header. To read more about it, visit this link.
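For example, from inside the Cloud Run container, a token can be fetched with a plain HTTP call. A minimal sketch; the response is a JSON object containing access_token, expires_in, and token_type fields:

curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"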

Create an Application to Fetch Token

A small piece of code is required to access the metadata server endpoint. The following is an example of a simple Flask application that retrieves the token from the metadata server.

import requests
from flask import Flask

app = Flask(__name__)

# Metadata server URL that returns an access token for the service
# account attached to this Cloud Run instance.
METADATA_URL = ('http://metadata.google.internal/computeMetadata/v1/'
                'instance/service-accounts/default/token')


# Path matches the /getAccesstoken path declared in the OpenAPI document below.
@app.get('/getAccesstoken')
def getAccessToken():
    # The Metadata-Flavor header is required by the metadata server.
    headers = {"Metadata-Flavor": "Google"}
    response = requests.get(METADATA_URL, headers=headers)
    return response.json()


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=False)

This application can be containerized and deployed on a Cloud Run instance that is only accessible internally; it is represented as the metadata client in the architecture diagram above. The complete code can be downloaded from GitHub. (Please note that this code is provided for understanding and experimentation purposes only and is not recommended for use in production.)

The instructions to build a container image and store it in Artifact Registry are here.

The instructions to deploy a container image to Cloud Run are here.
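Putting the two steps together, here is a minimal sketch of the build-and-deploy commands; the image path, repository, service name, service account, and region are placeholder assumptions:

# Build the container image and push it to Artifact Registry.
gcloud builds submit --tag us-central1-docker.pkg.dev/<PROJECT_ID>/<REPO>/metadata-app .

# Deploy to Cloud Run with the service account attached and public access
# disabled, so the token endpoint is not exposed to unauthenticated callers.
gcloud run deploy metadata-run-sample \
  --image=us-central1-docker.pkg.dev/<PROJECT_ID>/<REPO>/metadata-app \
  --service-account=<SERVICE_ACCOUNT_EMAIL> \
  --no-allow-unauthenticated \
  --region=us-central1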

Create a Cloud Run Instance for Cloud Endpoint

Next, create a Cloud Run instance to host the ESPv2 proxy, which will serve the Cloud Endpoint. This instance must be publicly accessible.

Before creating a Cloud Endpoint, reserve a URL by creating a Cloud Run instance as a placeholder. The instructions to deploy a sample application on Cloud Run can be found here.
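As a sketch, the placeholder can be deployed from Google's public hello sample image; the service name metarun-endpoint and the region are assumptions:

gcloud run deploy metarun-endpoint \
  --image="us-docker.pkg.dev/cloudrun/container/hello" \
  --allow-unauthenticated \
  --region=us-central1 \
  --project=<PROJECT_ID>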

Create a Cloud Endpoint

Next, create an OpenAPI document with the host set to the public Cloud Run instance and the backend set to the internal Cloud Run instance. This document will be used to create the Cloud Endpoint:

swagger: '2.0'
info:
  title: Cloud Endpoints for Metadata Server Cloud Run
  description: Cloud Endpoints with a Cloud Run backend
  version: 1.0.0
host: metarun-endpoint-sample-el.a.run.app # This is a dummy URL. Replace it with the public Cloud Run endpoint that will host ESPv2.
schemes:
  - https
produces:
  - application/json
x-google-backend:
  address: https://metadata-run-sample-uc.a.run.app # This is a dummy URL. Replace it with the DNS of the internal metadata server Cloud Run instance.
  path_translation: APPEND_PATH_TO_ADDRESS
paths:
  /getAccesstoken:
    get:
      summary: Get Access Token
      operationId: accesstoken
      responses:
        '200':
          description: A successful response
          schema:
            type: string
      security:
        - api_key: []
      parameters:
        - in: query
          name: name
          required: false
          type: string
securityDefinitions:
  # This section configures basic authentication with an API key.
  api_key:
    type: "apiKey"
    name: "key"
    in: "query"

Next, deploy this OpenAPI definition to Cloud Endpoints:

gcloud endpoints services deploy openapi-run.yaml \
--project <PROJECT_ID>

Then enable the API:

gcloud services enable <ENDPOINTS_SERVICE_NAME>
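The Endpoints service name is the host value from the OpenAPI document, and the config ID is printed in the output of the deploy command. Both can also be looked up later:

# List deployed Endpoints services to find ENDPOINTS_SERVICE_NAME.
gcloud endpoints services list --project=<PROJECT_ID>

# List the configs of a service to find CONFIG_ID.
gcloud endpoints configs list --service=<ENDPOINTS_SERVICE_NAME>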

Next, deploy an ESPv2 container image on the public Cloud Run instance and map the endpoint to it.

We can create an image with the ESPv2 endpoint by following these steps:

Download the script to build the ESPv2 image.
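The script is published in the ESPv2 GitHub repository and can be fetched with curl:

curl -O https://raw.githubusercontent.com/GoogleCloudPlatform/esp-v2/master/docker/serverless/gcloud_build_image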

chmod +x gcloud_build_image

./gcloud_build_image -s CLOUD_RUN_HOSTNAME \
-c CONFIG_ID -p <PROJECT_ID>

Deploy a new revision to the public Cloud Endpoint Cloud Run instance, which already has the sample application running:

gcloud run deploy metaserver-endpoint \
--image="gcr.io/<PROJECT_ID>/endpoints-runtime-serverless:ESP_VERSION-CLOUD_RUN_HOSTNAME-CONFIG_ID" \
--allow-unauthenticated \
--platform managed \
--project=<PROJECT_ID>

ESP_VERSION-CLOUD_RUN_HOSTNAME-CONFIG_ID: This can be copied from the output of the gcloud_build_image command. It looks similar to this: gcr.io/your-project-id/endpoints-runtime-serverless:2.14.0-gateway-12345-uc.a.run.app-2019-02-01r0

Create a restricted API Key

gcloud api-keys create --key-id=KEY_ID \
    --display-name="DISPLAY_NAME" \
    --api-target=service=<ENDPOINTS_SERVICE_NAME>
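The key string to pass in requests can then be retrieved; KEY_NAME is the full resource name returned by the create command:

gcloud api-keys get-key-string KEY_NAME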

Once the endpoint is deployed, a token can be generated by calling the public endpoint Cloud Run instance from an external client:

curl --request GET \
--header "content-type:application/json" \
"https://${ENDPOINTS_HOST}/getAccesstoken?key=<API_KEY>"

Once you have the token, use it to call the predict endpoint of PaLM 2 for text (text-bison). Here API_ENDPOINT is the regional Vertex AI endpoint, for example us-central1-aiplatform.googleapis.com:

curl \
  -X POST \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Content-Type: application/json" \
  "https://${API_ENDPOINT}/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/${MODEL_ID}:predict" -d \
  $'{
    "instances": [
      {
        "content": "How far is the moon from earth?"
      }
    ],
    "parameters": {
      "candidateCount": 1,
      "maxOutputTokens": 1024,
      "temperature": 0.2,
      "topP": 0.8,
      "topK": 40
    }
  }'

Using Client Libraries

To authenticate while using client libraries, Application Default Credentials (ADC) should be configured. ADC is a method used by Google libraries to automatically find credentials based on the application environment. While there are many ways to implement ADC, the following discussion covers the use of an attached service account as the ADC source.

Deploy an application built using client libraries

There are many sample applications available at this link for this purpose, along with instructions to deploy them.

As a service account is already attached to the container runtime of the Cloud Run instance, client libraries will automatically detect credentials using ADC.

The instructions assume that the Cloud Run instance exposes an API called /getPredictResponse, which calls the Vertex AI API using client libraries. This API will serve as the backend/target for the Cloud Endpoint.

Create a Cloud Run Instance for Cloud Endpoint

Once an application is deployed on an internal Cloud Run instance, create another Cloud Run instance that is publicly accessible and can host a Cloud Endpoint restricted with an API key.

Follow previously shared instructions to reserve a URL by deploying a sample application on Cloud Run.

Create a Cloud Endpoint

Create an OpenAPI document with the host set to the public Cloud Run instance and the backend set to the internal Cloud Run instance. This document will be used to create the Cloud Endpoint:

swagger: '2.0'
info:
  title: Cloud Endpoints for Vertex AI Client Cloud Run
  description: Cloud Endpoints with a Cloud Run backend
  version: 1.0.0
host: metarun-endpoint-sample.a.run.app # This is a dummy URL. Replace it with the endpoint of the external (public) Cloud Run instance.
schemes:
  - https
produces:
  - application/json
x-google-backend:
  address: https://metadata-run-sample-uc.a.run.app # This is a dummy URL. Replace it with the DNS of the Vertex AI API client running on the internal Cloud Run instance.
  path_translation: APPEND_PATH_TO_ADDRESS
paths:
  /getPredictResponse:
    post:
      summary: Get Predicted Response
      operationId: predictresponse
      responses:
        '200':
          description: A successful response
          schema:
            type: string
      security:
        - api_key: []
      parameters:
        - in: query
          name: name
          required: false
          type: string
securityDefinitions:
  # This section configures basic authentication with an API key.
  api_key:
    type: "apiKey"
    name: "key"
    in: "query"

To deploy ESPv2, follow the same steps shared previously. Once done, create an API key.

Test the service using the public Cloud Run URL with an API key:

curl \
  -X POST \
  -H "Content-Type: application/json" \
  "https://${CLOUDRUN_API_ENDPOINT}/getPredictResponse?key=<API_KEY>" -d \
  $'{
    "instances": [
      {
        "content": "How far is the moon from earth?"
      }
    ],
    "parameters": {
      "candidateCount": 1,
      "maxOutputTokens": 1024,
      "temperature": 0.2,
      "topP": 0.8,
      "topK": 40
    }
  }'

This blog has used the Vertex AI API as an example, but the same architecture can be used for other Google Cloud APIs and custom applications. Compute resources, including Compute Engine instances, store their metadata in the Metadata Server in key-value format. For more information on how to use the Metadata Server, please see the documentation.
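For instance, any metadata key can be read the same way the token was fetched earlier; here the project ID is queried:

curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/project/project-id"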

ESPv2 provides the capability to host an OpenAPI configuration on Cloud Run and GKE, which can be integrated with built-in API security in Google Cloud. For more information on ESPv2, please see the documentation.
