In-Place LLM Insights: BigQuery & Gemini for Structured & Unstructured Data Analytics

Abirami Sukumaran
Google Cloud - Community
14 min read · Mar 21, 2024
Gemini 1.0 Pro Vision’s response when asked to describe the image with a JSON value: {"DESCRIPTION": "This is a cartoon image of a person doing a yoga pose called Easy Pose."}. How cool! On a side note, Easy Pose is called SUKHASANA in Sanskrit.

Introduction

Massive datasets hold valuable insights, but extracting meaningful understanding can be a daunting task. Traditional analytics tools often fall short in providing nuanced, human-like explanations that help us truly grasp the story within our data. Data warehouses are holding more and more unstructured data, but that data hasn’t always been accessible for gaining insights. Unstructured text, images, and videos hold valuable information for evolving your business. LLMs offer a powerful new way to understand your data, providing explanations, sentiment analysis, object detection, and more. With the emergence of multimodal models (and the ability to store object data such as images and video in BigQuery via object tables), integrating powerful LLMs like Gemini 1.0 Pro (for text-based analysis) and Gemini 1.0 Pro Vision (for text and image/video analysis) with BigQuery can transform the way you analyze data. In this blog post, we’ll look at two ways you can draw LLM insights from your enterprise data and scale those insights to thousands of rows within BigQuery.

What we’ll build

We’ll demonstrate an exciting technique of integrating two LLM models, Gemini 1.0 Pro (text only) and Gemini 1.0 Pro Vision (multimodal), directly within your BigQuery workload for a low-code generative insight experience. For models that are supported as remote model endpoints in BigQuery, like Gemini 1.0 Pro, we can directly use the ML.GENERATE_TEXT construct to invoke the model within database queries. For models that are not available as remote models out of the box, or where you need more customization of the generative AI call (or, for that matter, any API you want to access remotely from within your database), we can use the REMOTE FUNCTIONS approach. To cover both scenarios, we will divide the blog post into two sections:

#1 Remote Model Invocation:

  1. This section will walk through calling Gemini 1.0 Pro from within BigQuery using ML.GENERATE_TEXT in the SELECT query.
  2. Use this approach when the model is already available as a remote model in BigQuery and you would like to use it out-of-the-box. You can check the status of the model you want to use in this documentation.
  3. Use case for walk-through:

Build a location summarizer for the internet archive books dataset (that is publicly available in BigQuery) by invoking the remote model for Gemini 1.0 Pro in the ML.GENERATE_TEXT construct in BigQuery.

Use case 1 Flow Diagram (Gemini 1.0 Pro in BigQuery)

#2 Remote Function Implementation:

  1. This section will walk through calling a Cloud Function that implements the Gemini 1.0 Pro Vision call, exposed as a remote function in BigQuery.
  2. Use this approach when the model you want to use is either not available out-of-the-box as a remote model or you want more flexibility and customization in your use case.
  3. Use case for walk-through:

Build an image validator for validating test images against a baseline image. We will create a dataset containing test image screenshots in an external table and ask Gemini 1.0 Pro Vision to validate them against the baseline image. To do this, we will create a Java Cloud Function that implements the Gemini 1.0 Pro Vision call and invoke it as a remote function in BigQuery.

Use case 2 Flow Diagram (Gemini 1.0 Pro Vision in BigQuery)

Please note that at the time you are reading this blog, there might be more Vertex AI Generative AI models available as remote models in BigQuery. This means you can access these models with ML.GENERATE_TEXT construct from BigQuery out-of-the-box. However, even in the case of remote model availability, you always have the option to build advanced customizations for your use case with Cloud Functions and then access these models as remote functions from BigQuery.

BigQuery

BigQuery is a serverless, multi-cloud data warehouse that can scale from bytes to petabytes with minimal operational overhead. This makes it a great choice for storing ML training data. The built-in BigQuery Machine Learning (BQML) and analytics capabilities allow you to create no-code predictions using just SQL queries. Additionally, you can access data from external sources with federated queries, eliminating the need for complicated ETL pipelines. You can read more about everything BigQuery has to offer on the BigQuery page. We will use BigQuery ML’s ML.GENERATE_TEXT construct to invoke the remote model for Gemini Pro for the text summarization use case mentioned above.

We have traditionally known BigQuery as this fully managed cloud data warehouse that helps users analyze structured and semi-structured data. BigQuery has expanded to perform all analytics and ML on unstructured data as well. We will use an object table to store image data that is required by the remote function use case to validate images using the Gemini Pro Vision model.

Demo

Through the rest of the blog we will demonstrate both use cases with practical examples as detailed in the use cases section above. Before getting into the use case specific implementation, let’s complete the prerequisites setup and common steps required for both use cases.

Setup

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
  3. You will use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. From the Cloud Console, click Activate Cloud Shell on the top right corner.
  4. Just for support with building and delivering the app, let’s enable Duet AI. Navigate to Duet AI Marketplace to enable the API. You can also run the following command in the Cloud Shell terminal:
gcloud services enable cloudaicompanion.googleapis.com --project PROJECT_ID

5. Enable the necessary APIs for this implementation if you haven’t already:

BigQuery, BigQuery Connection, Vertex AI, and Cloud Storage APIs

An alternative to the gcloud command is enabling them through the console using this link.
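If you prefer the terminal, the APIs can be enabled in one shot from Cloud Shell. This is a sketch; the service identifiers below are the usual ones for these products, but verify them against your project’s needs and substitute your own PROJECT_ID:

```shell
# Enable the services used in this walkthrough (BigQuery, BigQuery Connection,
# Vertex AI, Cloud Storage, and Cloud Functions for use case #2).
gcloud services enable \
  bigquery.googleapis.com \
  bigqueryconnection.googleapis.com \
  aiplatform.googleapis.com \
  storage.googleapis.com \
  cloudfunctions.googleapis.com \
  --project PROJECT_ID
```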

Create BigQuery Dataset and External Connection

A BigQuery dataset is a container for all the tables and objects for your application, and a BigQuery connection is used to interact with your Cloud Function. In order to create a remote function, you must create a BigQuery connection. Let’s begin by creating the dataset and the connection.

  1. From the Google Cloud Console, go to the BigQuery page and click the three vertical dots icon next to your project ID. From the list of options, select “Create data set”.
  2. In the Create data set pop-up, enter the data set ID “gemini_bq_fn” as shown below, with the region set to the default value “US (multiple regions…)”.
Create data set page in BigQuery

3. A BigLake connection allows us to connect to an external data source while retaining fine-grained BigQuery access control and security; in our case, that source is the Vertex AI Gemini Pro API. We will use this connection to access the model from BigQuery via the Cloud Function. Follow the steps below to create the BigLake connection:

a. Click ADD on the Explorer pane of the BigQuery page:

BigQuery Console with +ADD button highlighted to add external connection

b. Click Connections to external data sources in the sources page.

c. Enter external data source details as below in the pop up that shows up and click CREATE CONNECTION:

External data source create connection screenshot

d. Once the connection is created, go to the connection configuration page and copy the Service account ID for access provisioning:

Connection info screenshot

e. Open the IAM & Admin page, click GRANT ACCESS, enter the service account ID under New principals, assign the roles as shown below, and click SAVE.

Grant access to Service Account screenshot
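The dataset, connection, and access grant above can also be scripted from Cloud Shell. This is a sketch under the names used in this post; PROJECT_ID and CONNECTION_SERVICE_ACCOUNT are placeholders for your project ID and the service account ID copied from the connection info page:

```shell
# Create the dataset in the US multi-region.
bq mk --dataset --location=US PROJECT_ID:gemini_bq_fn

# Create a Cloud Resource connection named gemini-bq-conn.
bq mk --connection --location=US --project_id=PROJECT_ID \
  --connection_type=CLOUD_RESOURCE gemini-bq-conn

# Print the connection details; note the serviceAccountId it exposes.
bq show --connection PROJECT_ID.US.gemini-bq-conn

# Grant that service account access to Vertex AI (for the remote model)
# and permission to invoke Cloud Functions (for the remote function).
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT" \
  --role="roles/aiplatform.user"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT" \
  --role="roles/cloudfunctions.invoker"
```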

Use case #1 Remote Model Invocation

Here we will create a model in BigQuery based on the Vertex AI Gemini Pro foundation model. We have already set up the dataset and connection, so now it is time to demonstrate remote invocation of the Gemini Pro model. In three steps, you will have your LLM application up and running using only SQL queries!

Create Table and Model

Let’s take the example of the Internet Archive Books dataset sourced for public use by BigQuery.

Create a BigQuery Table

Let’s create a table in the dataset we created that holds 50 records from the table “bigquery-public-data.gdelt_internetarchivebooks.1905”:

Execute the following DDL (Data Definition Language) statement from BigQuery SQL editor pane:

CREATE OR REPLACE TABLE `gemini_bq_fn.books` AS (
  SELECT *
  FROM `bigquery-public-data.gdelt_internetarchivebooks.1905`
  LIMIT 50
);

This query creates a new table named “books” in the dataset we created previously.

Create a BigQuery Model

To create a model, run the following DDL from the BigQuery SQL Editor pane:

CREATE MODEL `gemini_bq_fn.gemini_remote_model`
REMOTE WITH CONNECTION `us.gemini-bq-conn`
OPTIONS(ENDPOINT = 'gemini-pro');

You should see that the model is created with an option to view the model that was just created.
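The same DDL can also be run non-interactively with the bq CLI that comes preloaded in Cloud Shell. This is a sketch, assuming the dataset and connection names from the earlier steps:

```shell
# Run the CREATE MODEL DDL from the terminal instead of the SQL editor.
bq query --use_legacy_sql=false \
  "CREATE MODEL \`gemini_bq_fn.gemini_remote_model\`
   REMOTE WITH CONNECTION \`us.gemini-bq-conn\`
   OPTIONS (ENDPOINT = 'gemini-pro')"
```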

Test your new Generative AI application!

That’s it! Let’s test it using the ML.GENERATE_TEXT statement in a select query to use the new Generative AI model that we just created.

SELECT ml_generate_text_llm_result AS Gemini_Response, prompt AS Prompt
FROM ML.GENERATE_TEXT(
  MODEL `gemini_bq_fn.gemini_remote_model`,
  (SELECT 'You are a text summarizer and standardizer. From the following text that contains address locations, standardize and print one standardized, consolidated address. You cannot return empty because you know how to pick up sensible data from the text in this field: ' ||
     SUBSTRING(locations, 0, 200) AS prompt
   FROM `gemini_bq_fn.books`),
  STRUCT(TRUE AS flatten_json_output));

You should see the following result:

ML.GENERATE_TEXT query result in BigQuery

Woohoo! That’s how simple it is to use a remote model in your database with BigQuery ML.

Now let’s try another Vertex AI model using BigQuery remote functions, for when you want more customization and flexibility in how you use a model remotely from BigQuery. The currently supported remote models are listed in this documentation.

Use case #2 Remote Function Implementation

Here we will create a function in BigQuery backed by a Java Cloud Function that implements the Gemini 1.0 Pro Vision foundation model. First we’ll create and deploy the Java Cloud Function that compares images using the Gemini 1.0 Pro Vision model, and then we’ll create the remote function in BigQuery that invokes the deployed Cloud Function. Remember, the same procedure can be followed for any remote function execution in BigQuery.

Create the Java Cloud Function

We will build a Gen 2 Cloud Function in Java that validates test images (stored in a Cloud Storage bucket and exposed through a BigQuery external table) against a baseline image using the Gemini Pro Vision model (Java SDK), and deploy it to a REST endpoint.

Java Cloud Function

  1. Open Cloud Shell Terminal and navigate to the root directory or your default workspace path.
  2. Click the Cloud Code Sign In icon from the bottom left corner of the status bar and select the active Google Cloud Project that you want to create the Cloud Functions in.
  3. Click the same icon again and this time select the option to create a new application.
  4. In the Create New Application pop-up, select Cloud Functions application:
Create New Application pop up in Cloud Shell Editor

5. Select the Java: Hello World option from the next pop-up:

Create New Application pop up page 2

6. Provide a name for the project in the project path. In this case, “Gemini-BQ-Function”.

7. You should see the project structure opened up in a new Cloud Shell Editor view:

New Java Cloud Function application project structure

8. Now go ahead and add the necessary dependencies within the <dependencies>… </dependencies> tag in the pom.xml file.

<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-vertexai</artifactId>
  <version>0.1.0</version>
</dependency>

<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
  <version>2.10</version>
</dependency>

9. Change the name of your file from “HelloWorld.java” to something more meaningful, say “GeminiBigQueryFunction.java”. You will have to rename the class accordingly.

10. Copy the code below and replace the placeholder code in the file GeminiBigQueryFunction.java (refer to the source in the GitHub repo):

package cloudcode.helloworld;

import java.io.BufferedWriter;
import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;
import com.google.cloud.vertexai.VertexAI;
import com.google.cloud.vertexai.api.Content;
import com.google.cloud.vertexai.api.GenerateContentResponse;
import com.google.cloud.vertexai.api.GenerationConfig;
import com.google.cloud.vertexai.generativeai.preview.ContentMaker;
import com.google.cloud.vertexai.generativeai.preview.GenerativeModel;
import com.google.cloud.vertexai.generativeai.preview.PartMaker;
import com.google.cloud.vertexai.generativeai.preview.ResponseHandler;
import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GeminiBigQueryFunction implements HttpFunction {
  private static final Gson gson = new Gson();

  @Override
  public void service(final HttpRequest request, final HttpResponse response) throws Exception {
    final BufferedWriter writer = response.getWriter();

    // Parse the BigQuery remote function request body: {"calls": [[baseline, test, prompt]]}.
    JsonObject requestJson = new Gson().fromJson(request.getReader(), JsonObject.class);
    JsonArray calls_array = requestJson.getAsJsonArray("calls");
    JsonArray calls = (JsonArray) calls_array.get(0);
    String baseline_url = calls.get(0).toString().replace("\"", "");
    String test_url = calls.get(1).toString().replace("\"", "");
    String prompt_string = calls.get(2).toString().replace("\"", "");

    String raw_result = validate(baseline_url, test_url, prompt_string);
    raw_result = raw_result.replace("\n", "");
    String trimmed = raw_result.trim();

    // BigQuery expects remote function responses in the shape {"replies": [...]}.
    List<String> result_list = Arrays.asList(trimmed);
    Map<String, List<String>> stringMap = new LinkedHashMap<>();
    stringMap.put("replies", result_list);
    String return_value = gson.toJson(stringMap);
    writer.write(return_value);
  }

  // Sends the prompt plus the two images to Gemini Pro Vision and returns the text response.
  public String validate(String baseline_url, String test_url, String prompt_string) throws IOException {
    String res = "";
    try (VertexAI vertexAi = new VertexAI("YOUR_PROJECT", "us-central1")) {
      GenerationConfig generationConfig =
          GenerationConfig.newBuilder()
              .setMaxOutputTokens(2048)
              .setTemperature(0.4F)
              .setTopK(32)
              .setTopP(1)
              .build();
      GenerativeModel model = new GenerativeModel("gemini-pro-vision", generationConfig, vertexAi);
      Content content = ContentMaker.fromMultiModalData(
          prompt_string,
          PartMaker.fromMimeTypeAndData("image/png", readImageFile(baseline_url)),
          PartMaker.fromMimeTypeAndData("image/png", readImageFile(test_url)));
      GenerateContentResponse response = model.generateContent(content);
      res = ResponseHandler.getText(response);
    } catch (Exception e) {
      System.out.println(e);
    }
    return res;
  }

  // Reads the image data from the given URL.
  public static byte[] readImageFile(String url) throws IOException {
    URL urlObj = new URL(url);
    HttpURLConnection connection = (HttpURLConnection) urlObj.openConnection();
    connection.setRequestMethod("GET");
    int responseCode = connection.getResponseCode();
    if (responseCode == HttpURLConnection.HTTP_OK) {
      InputStream inputStream = connection.getInputStream();
      ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
      byte[] buffer = new byte[1024];
      int bytesRead;
      while ((bytesRead = inputStream.read(buffer)) != -1) {
        outputStream.write(buffer, 0, bytesRead);
      }
      return outputStream.toByteArray();
    } else {
      throw new RuntimeException("Error fetching file: " + responseCode);
    }
  }
}

11. Now go to the Cloud Shell terminal and execute the statement below to build and deploy the Cloud Function:

gcloud functions deploy gemini-bq-fn --runtime java17 --trigger-http --entry-point cloudcode.helloworld.GeminiBigQueryFunction --allow-unauthenticated

The result of this command is a REST URL in the following format:

https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-bq-fn

12. Test this Cloud Function by running the following command from the terminal:

gcloud functions call gemini-bq-fn --region=us-central1 --gen2 --data '{"calls":[["https://storage.googleapis.com/img_public_test/image_validator/baseline/1.JPG", "https://storage.googleapis.com/img_public_test/image_validator/test/2.JPG", "PROMPT_ABOUT_THE_IMAGES_TO_GEMINI"]]}'

Response for a random sample prompt:

JSON Response string from the Cloud Function
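The same endpoint can also be exercised with plain curl, which makes the request/response contract explicit. This is a sketch; the function URL and the image URLs are placeholders to substitute with your own values:

```shell
# POST the BigQuery remote-function payload shape directly to the Cloud Function.
curl -X POST "https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-bq-fn" \
  -H "Content-Type: application/json" \
  -d '{"calls": [["BASELINE_IMAGE_URL", "TEST_IMAGE_URL", "Compare these two images and describe the differences."]]}'
# The function responds with a JSON object of the form {"replies": ["..."]}
```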

Now that the generic Cloud Function for Gemini Pro Vision model implementation is ready, let’s use this endpoint directly on BigQuery data from within a BigQuery remote function.

Create BigQuery Object Table and Remote Function

For this demo application, let’s create a Cloud Storage Bucket:

  1. Go to the Cloud Storage console and click the CREATE button to create a bucket.
  2. Provide a name for the bucket, say “demo-bq-gemini-public”, and remember to uncheck the “Enforce public access prevention on this bucket” option (to keep it public). Although we are opening this bucket for public access in this demo, the recommendation is to prevent public access and grant permissions to specific service accounts as required.
  3. You can view and change the permissions from the PERMISSIONS tab of the Cloud Storage bucket you just created. To add principals, click GRANT ACCESS under the VIEW BY PRINCIPALS tab, enter the service account ID (for a specific account) or “allUsers” (for public access), set the role to “Storage Object Viewer”, and click SAVE.
  4. Now that you have the bucket created, upload images by navigating to the OBJECTS tab of the bucket and clicking UPLOAD FILES.
  5. Upload baseline and test images to compare.

For the purpose of this demo, I have created 3 objects and made them publicly available: baseline, test1, and test2.
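The bucket steps above can be sketched on the command line with gsutil. The bucket name and file names below are the demo values; adapt them as needed:

```shell
# Create the bucket.
gsutil mb -l us-central1 gs://demo-bq-gemini-public

# Demo only: turn off public access prevention and make objects publicly
# readable. In production, prefer granting a specific service account instead.
gsutil pap set unenforced gs://demo-bq-gemini-public
gsutil iam ch allUsers:objectViewer gs://demo-bq-gemini-public

# Upload the baseline and test screenshots.
gsutil cp Baseline.JPG TEST1.JPG TEST2.JPG gs://demo-bq-gemini-public/
```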

Create a BigQuery Object Table

Create an external object table from BigQuery to access the unstructured data in the bucket using the connection and the dataset we created. Execute the following DDL (Data Definition Language) statement from BigQuery query editor pane:

CREATE OR REPLACE EXTERNAL TABLE `gemini_bq_fn.image_validation`
WITH CONNECTION `us.gemini-bq-conn`
OPTIONS(object_metadata="SIMPLE", uris=["gs://demo-bq-gemini-public/*.JPG"]);

This query should create a new object table named “image_validation” in the dataset we created previously.

Create a BigQuery Remote Function

Let’s create a remote function in BigQuery to invoke the Java Cloud Function that implements the Gemini Pro Vision model. We will create this in the same dataset. Run the following DDL from the SQL editor pane of the BigQuery console:

CREATE OR REPLACE FUNCTION `gemini_bq_fn.FN_IMAGE_VALIDATE` (baseline STRING, test STRING, prompt STRING) RETURNS STRING
REMOTE WITH CONNECTION `us.gemini-bq-conn`
OPTIONS (
endpoint = 'https://us-central1-********.cloudfunctions.net/gemini-bq-fn',
max_batching_rows = 1
);

This will create the remote function in BigQuery. There are three parameters in the above DDL. The first two are URLs of the images stored in the object table created in the previous step, and the last is the prompt for the model (Gemini Pro Vision). Feel free to refer to the Java Cloud Function code, which parses this signature:

JsonObject requestJson = new Gson().fromJson(request.getReader(), JsonObject.class);
JsonArray calls_array = requestJson.getAsJsonArray("calls");
JsonArray calls = (JsonArray) calls_array.get(0);
String baseline_url = calls.get(0).toString().replace("\"", "");
String test_url = calls.get(1).toString().replace("\"", "");
String prompt_string = calls.get(2).toString().replace("\"", "");

Invoke Gemini from BigQuery!

Now that the remote function is created, let’s use it in a SELECT query to test our remote function for validating the images against the prompt:

Query to validate the test image against a reference:

SELECT gemini_bq_fn.FN_IMAGE_VALIDATE(
  'https://storage.googleapis.com/demo-bq-gemini-public/Baseline.JPG',
  REPLACE(uri, 'gs://', 'https://storage.googleapis.com/'),
  'You are an expert image validator who can respond with JSON result. Find 2 images here. The first one is the reference image. The second image is the test image. Compare the second image to see if the second image is similar to the first in terms of the text present in the image. Respond ONLY in JSON Format with 3 attributes namely SIMILARITY which is a \"YES\" or \"NO\", SIMILARITY_SCORE which is a percentage and DIFFERENCE_COMMENT which is a string.') AS IMAGE_VALIDATION_RESULT
FROM `gemini_bq_fn.image_validation`
WHERE uri LIKE '%TEST1%';

Try the query above with TEST1.JPG and TEST2.JPG. You should see a result similar to the following:

BigQuery Remote Function Result screenshot

Here are the images for your reference:

Baseline Image:

Baseline Reference Image for the Test Case

Test Image:

Test Image 1 for the test case

As you can see, both images are similar in that they show the Duet AI Cloud Console view, but the text in the two differs, as reported by the model in JSON format.

Benefits and Use Cases

  1. Bring GenAI to your data: No more moving data around, duplication, and added complexity. Analyze data and generate insights within the same BigQuery environment.
  2. Enhanced Analytics: Gemini’s natural language explanations add a new layer of understanding to your data and you can achieve this just using SQL queries.
  3. Scalability: This solution handles large datasets and complex analysis with ease and low-code effort.

Real-World Examples: Consider scenarios in finance (market trend analysis), retail (customer sentiment), healthcare (medical report summaries), and more, where your analytics and business teams could implement these with relatively low effort and resources, using a familiar language and tool of choice.

Conclusion

Congratulations! With the Gemini models integrated into BigQuery, you’re no longer just analyzing your data: you’re a data storyteller. You can surface the hidden narratives within your datasets and transform the way you understand your insights. Start experimenting! Apply this technique to your own datasets and discover the stories within your data. With BigQuery supporting unstructured data in object tables (external tables), try using Gemini Pro Vision to create generative insights on image data. Refer to the documentation for Vertex AI, BigQuery Remote Functions, and Cloud Functions for more in-depth guidance. Here is the GitHub repo for this project. Let me know what you build with this learning!



Developer Advocate Google. With 18 years in data and software dev leadership, I’m passionate about addressing real world opportunities with technology.