Vertex AI and BigQuery for Natural Language Exploration of GCP Billing Data

Steven Aldinger
TeamSnap Engineering
7 min read · Jan 24, 2024

Introduction

One of the best parts of the generative AI boom is that historically technical things can be given a natural, intuitive interface. Tasks that require in-depth knowledge and skills to carry out can now be wrapped with Large Language Models (LLMs) that respond to questions asked in plain language, and thanks to frameworks like Streamlit and Gradio, you can bolt on a graphical interface in mere minutes.

This article is your guide to unleashing Google’s new Gemini Pro model on your Google Cloud billing data that has been exported to BigQuery. Combining these tools can help break down technical barriers and make exploring datasets as approachable as having a conversation.

Everything shown in the article can be run on your computer or as a Cloud Run application by following the instructions in the stevenaldinger/vertex-ai-google-cloud-billing repository.

Why spend the time to do this?

Unlocking the immense power of BigQuery becomes a thrilling journey when paired with generative AI’s natural language capabilities. Imagine asking intricate questions about your data in everyday language, with no need for a query language like GoogleSQL or the skills to drive BigQuery itself. For people already familiar with the tools, this is a convenient time saver; for everyone else, it ushers in a new era of data exploration and discovery.

In this new era, data access isn’t reserved for tech-savvy engineers. Generative AI can help democratize data, empowering a more diverse audience with natural-language-driven analysis. As businesses embrace this, the result will be more than operational efficiency: efficiency will come paired with a true competitive advantage, fueled by fresh perspectives and a deeper understanding of their own data.

Leveraging Generative AI with BigQuery

This section gives an overview of the strategy I used to integrate Gemini Pro with BigQuery, and most of the links point to relevant lines of code in the open source repo. I’m working with Google Cloud products, but you can swap in other models like OpenAI’s GPT just by changing a few lines in this single file.

The following image is a preview of what the application looks like. If you get bored of the walkthrough and want to jump straight into running the app or spelunking in the code base, you can choose your own adventure at stevenaldinger/vertex-ai-google-cloud-billing.

In this strategy, the first thing to do is load the BigQuery dataset’s table schemas and make them available for use in an LLM chain.

I used LangChain’s BigQuery Loader, but mainly to integrate more easily with other LangChain tools. The real secret sauce is a simple BigQuery query that fetches the table names and their DDLs, so we can add the table definitions to an LLM prompt later on.

SELECT table_name, ddl
FROM `{bigquery_dataset}.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'BASE TABLE'
ORDER BY table_name;

The result of executing that snippet in BigQuery is a list of all the table names in a dataset, as well as their definitions. The output for the billing export table we’re targeting looks like this:

CREATE TABLE `steven-aldinger.gdax.gcp_billing_export_v1_0161FF_E1335F_xxxxx`
(
billing_account_id STRING OPTIONS(description=""),
service STRUCT<id STRING OPTIONS(description=""), description STRING OPTIONS(description="")> OPTIONS(description=""),
sku STRUCT<id STRING OPTIONS(description=""), description STRING OPTIONS(description="")> OPTIONS(description=""),
usage_start_time TIMESTAMP OPTIONS(description=""),
usage_end_time TIMESTAMP OPTIONS(description=""),
project STRUCT<id STRING OPTIONS(description=""), number STRING OPTIONS(description=""), name STRING OPTIONS(description=""), labels ARRAY<STRUCT<key STRING OPTIONS(description=""), value STRING OPTIONS(description="")>> OPTIONS(description=""), ancestry_numbers STRING OPTIONS(description=""), ancestors ARRAY<STRUCT<resource_name STRING OPTIONS(description=""), display_name STRING OPTIONS(description="")>> OPTIONS(description="")> OPTIONS(description=""),
labels ARRAY<STRUCT<key STRING OPTIONS(description=""), value STRING OPTIONS(description="")>> OPTIONS(description=""),
system_labels ARRAY<STRUCT<key STRING OPTIONS(description=""), value STRING OPTIONS(description="")>> OPTIONS(description=""),
location STRUCT<location STRING OPTIONS(description=""), country STRING OPTIONS(description=""), region STRING OPTIONS(description=""), zone STRING OPTIONS(description="")> OPTIONS(description=""),
export_time TIMESTAMP OPTIONS(description=""),
cost FLOAT64 OPTIONS(description=""),
currency STRING OPTIONS(description=""),
currency_conversion_rate FLOAT64 OPTIONS(description=""),
usage STRUCT<amount FLOAT64 OPTIONS(description=""), unit STRING OPTIONS(description=""), amount_in_pricing_units FLOAT64 OPTIONS(description=""), pricing_unit STRING OPTIONS(description="")> OPTIONS(description=""),
credits ARRAY<STRUCT<name STRING OPTIONS(description=""), amount FLOAT64 OPTIONS(description=""), full_name STRING OPTIONS(description=""), id STRING OPTIONS(description=""), type STRING OPTIONS(description="")>> OPTIONS(description=""),
invoice STRUCT<month STRING OPTIONS(description="")> OPTIONS(description=""),
cost_type STRING OPTIONS(description=""),
adjustment_info STRUCT<id STRING OPTIONS(description=""), description STRING OPTIONS(description=""), mode STRING OPTIONS(description=""), type STRING OPTIONS(description="")> OPTIONS(description=""),
tags ARRAY<STRUCT<key STRING OPTIONS(description=""), value STRING OPTIONS(description=""), inherited BOOL OPTIONS(description=""), namespace STRING OPTIONS(description="")>> OPTIONS(description=""),
cost_at_list FLOAT64,
transaction_type STRING,
seller_name STRING
)
PARTITION BY DATE(_PARTITIONTIME);
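
For reference, loading those DDLs with LangChain’s BigQuery Loader can be sketched in a few lines. The project and dataset names below are placeholders, and the exact wiring in the repo differs slightly:

from langchain_community.document_loaders import BigQueryLoader

# fetch every base table's name and DDL from the dataset's metadata
ddl_query = """
SELECT table_name, ddl
FROM `your-project.your_dataset.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'BASE TABLE'
ORDER BY table_name;
"""

# each row becomes a LangChain Document whose page_content holds
# the table name and DDL, ready to drop into a prompt later
loader = BigQueryLoader(ddl_query, project="your-project")
table_ddl = "\n\n".join(doc.page_content for doc in loader.load())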

This next step is where the fun starts. The table definitions can be passed into a prompt alongside a natural language question about billing, but instead of asking for a direct answer, we ask the model to generate GoogleSQL code. Note that it’s also important to supplement the prompt with the current date, so that when we say “last 3 months” in a question, the model doesn’t go by the (very incorrect) date of the data it was trained on.

Write a BigQuery Google Standard SQL query that answers the following question.
Today's date is {todays_date}.

Question: {human_language_question}

{table_ddl}
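
Sending that prompt to Gemini Pro can be done directly with the Vertex AI Python SDK, as in this minimal sketch (the repo routes the call through LangChain instead, and the project and location values here are placeholders):

import datetime

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project", location="us-central1")
model = GenerativeModel("gemini-pro")

# table_ddl holds the schema definitions gathered earlier
human_language_question = "What are my top 5 most used services by cost?"

prompt = f"""Write a BigQuery Google Standard SQL query that answers the following question.
Today's date is {datetime.date.today()}.

Question: {human_language_question}

{table_ddl}"""

# the response text is the generated GoogleSQL
generated_sql = model.generate_content(prompt).text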

Given the question, “What are my top 5 most used services by cost?”, this is the code it produced, which can be passed directly into the standard BigQuery Python client. The cherry on top is that the client has a built-in way to return the results as a Pandas DataFrame, which makes it really easy to integrate with other code.

WITH ServiceCost AS (
  SELECT
    service.description AS service_name,
    SUM(cost) AS total_cost
  FROM
    `steven-aldinger.gdax.gcp_billing_export_v1_0161FF_E1335F_xxxxx`
  WHERE
    _PARTITIONTIME BETWEEN TIMESTAMP("2023-01-21 00:00:00") AND TIMESTAMP("2024-01-21 00:00:00")
  GROUP BY
    service.description
)
SELECT
  service_name,
  total_cost
FROM
  ServiceCost
ORDER BY
  total_cost DESC
LIMIT 5;
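
Running the generated SQL and converting the result is only a few lines with the official client. A rough sketch, assuming generated_sql holds the model’s output from the previous step:

from google.cloud import bigquery

client = bigquery.Client(project="your-project")

# query() starts the job; to_dataframe() waits for it to finish
# and converts the result rows into a pandas DataFrame
df = client.query(generated_sql).to_dataframe()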

What if the generated code fails to execute and BigQuery responds with an error message instead? An engineer would read the error message, review their failed query, and double check the DDL to see where they made a mistake. Why not try to reproduce that process automatically with another generative AI prompt?

The LLM can be called into play in a normal error handling pattern, and this time its goal will be to try to resolve the bug. By writing a new prompt that contains the same context an engineer would use to fix the issue, the LLM has a chance of recovering the fumble and still successfully answering the question.


Fix the following BigQuery Google Standard SQL code to resolve the error.
Only respond with code, no explanation.

BigQuery Google Standard SQL code:
```sql
{failed_query}
```

Error message:
```
{error_message}
```

Schema context:
{content}
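
Wired into the app, that recovery step might look like the sketch below, where fix_prompt_template is a hypothetical stand-in for the prompt above and the client, model, and table_ddl variables carry over from the earlier sketches:

from google.api_core.exceptions import BadRequest

try:
    df = client.query(generated_sql).to_dataframe()
except BadRequest as error:
    # give the model the same context an engineer would use:
    # the failed SQL, the error message, and the schema
    fix_prompt = fix_prompt_template.format(
        failed_query=generated_sql,
        error_message=str(error),
        content=table_ddl,
    )
    generated_sql = model.generate_content(fix_prompt).text
    df = client.query(generated_sql).to_dataframe()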

If the query executes successfully, the rows returned from the BigQuery client will be in an easy-to-work-with DataFrame that’ll look something like this.

#  service_name    total_cost
0  Compute Engine     325.275
1  Vertex AI          128.687
2  Cloud Storage       32.423
3  Maps API             4.739
4  BigQuery             1.59191

That’s pretty neat on its own, but I think most people would agree that visualizing data in charts can make it much more meaningful than viewing rows in a table. Generative AI can be leveraged even further by prompting it to write visualization code for the DataFrame.

A visualization prompt might look something like the following snippet. Giving the user’s question as context in the prompt alongside the DataFrame helps the model understand how to display the data clearly. The prompt that’s actually used in the code is a one-shot prompt that includes a single example of what we consider to be good output (sketched after the snippet below). This reduces the chances of generating incomplete or broken code and gives us confidence the model will follow nuanced instructions like assigning the Matplotlib figure to a variable named fig.

Write python code to visualize the following pandas dataframe using matplotlib. Only respond with code, no explanation.
Be sure to define the dataframe as `df` before plotting.
Assign the figure to a variable named `fig`.
The question asked was "{query}"
The corresponding dataframe is:
{dataframe}
Visualization code:
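
Schematically, the one-shot version just stitches one worked example ahead of the real request. The example_* variables below are hypothetical placeholders for a hand-picked question, DataFrame, and known-good visualization code:

# viz_prompt_template is the template shown above; ending the real
# request with "Visualization code:" cues the model to complete it
one_shot_prompt = (
    viz_prompt_template.format(query=example_question, dataframe=example_df)
    + "\n" + example_visualization_code + "\n\n"
    + viz_prompt_template.format(query=user_question, dataframe=df)
)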

Here’s an example of generated visualization code from Gemini Pro.

import pandas as pd
import matplotlib.pyplot as plt

# Define the dataframe
df = pd.DataFrame({
'ServiceDescription': ['Compute Engine', 'Vertex AI', 'Cloud Storage', 'Maps API', 'BigQuery'],
'TotalCost': [325.275, 128.687, 32.423, 4.739, 1.59191]
})

# Plot the bar chart
df.sort_values(by='TotalCost', ascending=False).head(5).plot.barh(x='ServiceDescription', y='TotalCost')

# Customize the plot
plt.title('Top 5 Most Used Services By Cost')
plt.xlabel('Total Cost')
plt.ylabel('Service Description')
plt.grid(True)
plt.show()

fig = plt.gcf()

Take whatever precautions you deem necessary for your environment before following this approach, but if you’re confident enough in your security and reliability, the visualization code returned from the LLM can be passed into exec() and integrated with your business logic. This is demonstrated here in a way that won’t break your code base, but with a complete disregard for security concerns: executing LLM-generated code is effectively an intentional remote code execution vulnerability.

Since the LLM was told to assign the figure to a variable named fig, we can take control of it after the exec() and convert it into a NumPy array using Pillow, or do whatever else we might need. By saving the image data to a BytesIO buffer and converting it to a NumPy array, we can easily display the chart for the user in the UI as a Streamlit image.

import io

import numpy as np
from PIL import Image

# execute the code generated by the LLM
# this runs `fig = plt.gcf()` after configuring the plot
exec(visualization_code)

# save the figure to a BytesIO object
buf = io.BytesIO()
fig.savefig(buf, format='png')
buf.seek(0)

# load the image data from the BytesIO object into a numpy array
image = Image.open(buf)
image_array = np.array(image)
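
From there, displaying the chart in the Streamlit UI is a one-liner, since st.image accepts a NumPy array directly:

import streamlit as st

# render the chart for the user in the app
st.image(image_array)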

Conclusion

The ease of integration between generative AI and BigQuery represents a paradigm shift in how organizations harness the power of their data, and a significant step forward in the evolution of data analytics. The fusion of human-like language understanding with the analytical capabilities of BigQuery transforms the way users interact with their data, making it more accessible and insightful.

Furthermore, the synergy between generative AI and BigQuery facilitates the discovery of patterns and trends that might have remained hidden in traditional analysis. By posing nuanced questions in natural language, users can uncover correlations, anomalies, and cost optimization opportunities that might have eluded them with conventional methods. Even for technical people, asking back-to-back questions in natural language is far less fatiguing than writing queries by hand. This ease of interaction can also enhance the agility of decision-making and enable organizations to adapt more swiftly to changing circumstances.

Regardless of how great the potential seems to be in theory, how much of an impact it really makes will depend on the creativity of the people adopting it. I’m optimistic about where things are headed, but only time will tell.

Thanks for reading, and be sure to check out the open source companion repo to deploy the example app to Cloud Run for a starting point if you’d like to try working with your own data.
