Build Your Own SQL Analyst Bot

Ask Questions of Your Structured Data; Get Grounded Answers

John Grinalds
Google Cloud - Community
9 min read · Aug 28, 2023


Introduction

One of the challenges of using LLMs (Large Language Models) in a business context is getting the model to answer factually and accurately about your company’s data. One possible solution is Retrieval Augmented Generation (RAG), using a vector database to populate the prompt context (see my post: Q&A With Your Docs: A Gentle Introduction to Matching Engine + PaLM). This works well for semi-structured data like text files and PDFs. But what if you wanted to retrieve data from a structured data source? What if we had our LLM use the results of an analytic database query? That’s what we are going to explore in this post.

Using Google’s new Codey APIs, announced at Google I/O earlier this year, we’ll build a system that:

  1. Converts the user’s natural language question to a SQL statement
  2. Runs that SQL statement against an analytic database
  3. Uses the query result to answer the user’s original question

We’ll also discuss prompt tuning as well as some of the shortcomings and limitations of a system like this.

In this how-to, I will be querying the NYC Citibike public dataset.

While I will explain each code module in order, the full code is provided at the end of this tutorial.

Step 1: Enable Needed Cloud APIs

Run gcloud init to authenticate with your GCP user and project.

Enable the necessary APIs for your project:

gcloud services enable aiplatform.googleapis.com --async

Step 2: Install Python Packages

If needed, install the necessary Python packages into your Python environment using pip:

pip install google-cloud-bigquery pandas google-cloud-aiplatform db-dtypes

Step 3: Get Example Data and Schemas

For the Codey LLM to know what the tables look like, we need to give it the table schemas and some example data.

The following code checks whether this information has already been retrieved (and saved locally), and runs the needed queries if not.

Note that because we are using Pandas dataframes to represent the query response, the default view truncations can sometimes be too restrictive. Hence the Pandas options adjustments at the beginning of this code.

The following code will create data.txt and schemas.txt files locally, if they don't already exist.

pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

def get_data(tables):
    # Return cached example rows if present; otherwise query BigQuery for them.
    try:
        with open("./data.txt", "r") as d:
            data = d.read()
        return data
    except FileNotFoundError:
        data = ""
        for table in tables:
            if table == 'bigquery-public-data.new_york.citibike_trips':
                querystring = f"""
                    SELECT *
                    FROM `{table}`
                    WHERE gender != ""
                    LIMIT 5
                """
            else:  # 'bigquery-public-data.new_york.citibike_stations'
                querystring = f"""
                    SELECT *
                    FROM `{table}`
                    WHERE station_id IS NOT NULL
                    LIMIT 5
                """
            data += f"\n\nData for table: {table}:\n\n"
            data += str(client.query(querystring).result().to_dataframe())

        with open("./data.txt", "w") as d:
            d.write(data)

        return data

def get_schemas(tables):
    # Return cached schemas if present; otherwise query INFORMATION_SCHEMA.
    try:
        with open("./schemas.txt", "r") as s:
            schemas = s.read()
        return schemas
    except FileNotFoundError:
        schemas = ""
        for table in tables:
            querystring = f"""
                SELECT
                    column_name,
                    data_type
                FROM `bigquery-public-data.new_york`.INFORMATION_SCHEMA.COLUMNS
                WHERE table_name = "{table.split(".")[-1]}";
            """
            schemas += f"\n\nSchema for table: {table}:\n\n"
            schemas += str(client.query(querystring).result().to_dataframe())

        with open("./schemas.txt", "w") as s:
            s.write(schemas)

        return schemas
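Because both functions cache their results to local files, the context can go stale (for example, if you add tables or want fresh sample rows). A small helper (hypothetical, not part of the original script) can clear the cache so the next run re-queries BigQuery:

```python
import os

def refresh_cache(paths=("./data.txt", "./schemas.txt")):
    """Delete the cached context files so the next run rebuilds them."""
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            # Already absent; nothing to do.
            pass
```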

Step 4: Generate SQL With Codey

Now we will generate the SQL query in response to the user’s question.

A couple of things to observe with this code:

  1. The actual prompt that Codey receives is a combination of the schemas and data of the available tables and the user’s question, all wrapped with additional context and rules for Codey.
  2. The rules are a result of troubleshooting and coercing the system to provide helpful, accurate answers.
  3. In the return statement, we are filtering out ```sql and ```; that is because Codey responds with a Markdown-formatted query. We just want the raw query text.
def get_proposed_query(question, schemas, data):
    # Ask Codey for a BigQuery SQL statement that answers the user's question.
    parameters = {
        "temperature": 0.2,
        "max_output_tokens": 1024
    }

    response = codey.predict(
        prefix = f"""

{schemas}

{data}

As a senior analyst, given the above schemas and data of bicycle trips in New York City, write a BigQuery SQL query to answer the following question:

{question}

When constructing SQL statements, follow these rules:
- There is no `MONTH` function; if you want the month, instead use `EXTRACT(month FROM starttime) AS month`

""",
        **parameters
    )

    # Codey wraps its answer in Markdown fences; strip them to get raw SQL.
    return response.text.replace("```sql", "").replace("```", "")
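The two replace() calls work for Codey's usual output, but they would also strip backticks that happen to appear inside the query itself. A slightly more defensive variant (a sketch, not from the original post) removes only a leading and a trailing fence:

```python
import re

# Matches a ```sql (or bare ```) fence at the very start or very end of the text.
_FENCE = re.compile(r"^\s*```(?:sql)?\s*|\s*```\s*$")

def strip_markdown_fences(text):
    """Remove surrounding Markdown code fences, leaving the raw SQL."""
    return _FENCE.sub("", text).strip()
```

get_proposed_query() could then return strip_markdown_fences(response.text) instead of the chained replace() calls.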

Step 5: Orchestrate the Function Calls in a main() Function

Finally, we can combine these function calls into a single main() function. Some things to note about this code:

  1. It includes argument parsing for command line use; see below for usage examples.
  2. The SQL query result is not returned directly to the user. Instead the text generation model (the PaLM text-bison@001 model) interprets this result as a part of the final system response.
  3. The temperature of the text generation model is set to 0. The temperature can be thought of as the “creativity” of the model’s response. Because we only want the model to respond using the data available in the SQL query response, we are setting it to 0 (the lowest creativity value). Feel free to experiment with different values.
def main():
    parser = argparse.ArgumentParser(description='A program to respond to analytic questions with SQL context')
    parser.add_argument('-q', '--question', help="The user's question")
    parser.add_argument('-v', '--verbose', action='store_true', help='Whether to print more detail')
    args = parser.parse_args()

    if not args.question:
        print("You must enter a question")
        return

    user_question = args.question

    tables = ['bigquery-public-data.new_york.citibike_stations', 'bigquery-public-data.new_york.citibike_trips']
    schemas = get_schemas(tables)
    data = get_data(tables)

    # Generate the SQL, run it, then have the text model interpret the result.
    proposed_query = get_proposed_query(user_question, schemas, data)
    if args.verbose:
        print("\nProposed Query: \n", proposed_query)
    query_result = client.query(proposed_query).result().to_dataframe()
    if args.verbose:
        print("BQ Query Result: \n\n", query_result, "\n")

    prompt = f"""
Context: You are a senior business intelligence analyst.
Use the following query result to give a detailed answer to any questions you receive: {query_result}
If the column header is something like "f0_", that means you have a number for your answer.
Do not use any other information to answer the question. Only use information from the query result. Do not make up information. If you don't have enough information, say "I don't have enough information."
Question: {user_question}
"""

    print("Answer:", generation_model.predict(prompt, temperature = 0, max_output_tokens = 1024), "\n")

Step 6: Test and Observe

That’s it! Usage is as follows:

python cli_analyst.py -v -q "<USER_QUESTION>"

The -v flag prints additional information during the function call. The -q flag precedes the question being asked.

The complete Python code is as follows:

import argparse
from google.cloud import bigquery
import pandas as pd
from vertexai.language_models import CodeGenerationModel
from vertexai.preview.language_models import TextGenerationModel

generation_model = TextGenerationModel.from_pretrained("text-bison@001")
codey = CodeGenerationModel.from_pretrained("code-bison@001")
client = bigquery.Client()

pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

def get_data(tables):
    # Return cached example rows if present; otherwise query BigQuery for them.
    try:
        with open("./data.txt", "r") as d:
            data = d.read()
        return data
    except FileNotFoundError:
        data = ""
        for table in tables:
            if table == 'bigquery-public-data.new_york.citibike_trips':
                querystring = f"""
                    SELECT *
                    FROM `{table}`
                    WHERE gender != ""
                    LIMIT 5
                """
            else:  # 'bigquery-public-data.new_york.citibike_stations'
                querystring = f"""
                    SELECT *
                    FROM `{table}`
                    WHERE station_id IS NOT NULL
                    LIMIT 5
                """
            data += f"\n\nData for table: {table}:\n\n"
            data += str(client.query(querystring).result().to_dataframe())

        with open("./data.txt", "w") as d:
            d.write(data)

        return data

def get_schemas(tables):
    # Return cached schemas if present; otherwise query INFORMATION_SCHEMA.
    try:
        with open("./schemas.txt", "r") as s:
            schemas = s.read()
        return schemas
    except FileNotFoundError:
        schemas = ""
        for table in tables:
            querystring = f"""
                SELECT
                    column_name,
                    data_type
                FROM `bigquery-public-data.new_york`.INFORMATION_SCHEMA.COLUMNS
                WHERE table_name = "{table.split(".")[-1]}";
            """
            schemas += f"\n\nSchema for table: {table}:\n\n"
            schemas += str(client.query(querystring).result().to_dataframe())

        with open("./schemas.txt", "w") as s:
            s.write(schemas)

        return schemas

def get_proposed_query(question, schemas, data):
    # Ask Codey for a BigQuery SQL statement that answers the user's question.
    parameters = {
        "temperature": 0.2,
        "max_output_tokens": 1024
    }

    response = codey.predict(
        prefix = f"""

{schemas}

{data}

As a senior analyst, given the above schemas and data of bicycle trips in New York City, write a BigQuery SQL query to answer the following question:

{question}

When constructing SQL statements, follow these rules:
- There is no `MONTH` function; if you want the month, instead use `EXTRACT(month FROM starttime) AS month`

""",
        **parameters
    )

    # Codey wraps its answer in Markdown fences; strip them to get raw SQL.
    return response.text.replace("```sql", "").replace("```", "")

def main():
    parser = argparse.ArgumentParser(description='A program to respond to analytic questions with SQL context')
    parser.add_argument('-q', '--question', help="The user's question")
    parser.add_argument('-v', '--verbose', action='store_true', help='Whether to print more detail')
    args = parser.parse_args()

    if not args.question:
        print("You must enter a question")
        return

    user_question = args.question

    tables = ['bigquery-public-data.new_york.citibike_stations', 'bigquery-public-data.new_york.citibike_trips']
    schemas = get_schemas(tables)
    data = get_data(tables)

    # Generate the SQL, run it, then have the text model interpret the result.
    proposed_query = get_proposed_query(user_question, schemas, data)
    if args.verbose:
        print("\nProposed Query: \n", proposed_query)
    query_result = client.query(proposed_query).result().to_dataframe()
    if args.verbose:
        print("BQ Query Result: \n\n", query_result, "\n")

    prompt = f"""
Context: You are a senior business intelligence analyst.
Use the following query result to give a detailed answer to any questions you receive: {query_result}
If the column header is something like "f0_", that means you have a number for your answer.
Do not use any other information to answer the question. Only use information from the query result. Do not make up information. If you don't have enough information, say "I don't have enough information."
Question: {user_question}
"""

    print("Answer:", generation_model.predict(prompt, temperature = 0, max_output_tokens = 1024), "\n")

if __name__ == '__main__':
    main()

Let’s try some example queries. How about:

python cli_analyst.py -v -q "Which 3 months are the most trips taken in?"

Notice how the system uses the Codey API to propose and return a SQL query. Next, it runs that query against the BigQuery table and returns the query response. Finally, notice how the text generation LLM correctly converts the month designations from numeric to lexical (i.e. the ninth month of the year is September). Pretty cool!

Let’s try another one:

python cli_analyst.py -v -q "When was the youngest rider you have on record born?"

Whoops! It’s returning the oldest rider on record (apparently 138 years old…) rather than the youngest. Let’s fix that by adding an additional rule to the SQL generation prompt context. In the get_proposed_query() function, add a rule at the end of the rule list that says:

- The `birth_year` field is the year the rider was born. Older riders have smaller birth years, and younger riders have larger birth years.

Now let’s try again:

Hey, that looks better! Giving the model some additional context and instructions can help it more accurately choose how to formulate the SQL statement.

Feel free to try more queries of your own. As you can see, this is a promising solution to generating factual results to routine queries. However, there are some limitations and shortcomings as we’ll discuss in the next section.

Discussion

While it could be tempting to hand a system like this to a non-technical user, there are a few limitations. First, there’s no guarantee that the Codey API will generate valid SQL. For example, the “There is no MONTH function…” rule in the get_proposed_query() function exists because I found the model trying to use a MONTH() function, which doesn’t exist in BigQuery. Even though we can mitigate this behavior with the appropriate rule, we don’t know the complete possible error space and thus cannot mitigate every possible invalid query. Additionally, even when Codey generates valid SQL, there is no guarantee that it answers the question being asked. LLMs can’t read minds and can only work from the question as given; if the question is vague or poorly worded, the Codey API could interpret it differently than was intended. Ultimately, there is no way to guarantee that the generated SQL is “fit for use” other than having a qualified analyst assess the statement. As a result, a system like this would likely be more valuable to technical, SQL-literate users, as an aid and accelerator to their work, than as a tool deployed directly to non-technical users.
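One partial mitigation for invalid SQL is to feed the database’s error message back to the model for another attempt. The sketch below uses hypothetical callables (generate_sql standing in for get_proposed_query(), run_query for the BigQuery call); it is an illustration, not part of the original post:

```python
def query_with_retry(generate_sql, run_query, question, max_attempts=2):
    """Ask the model for SQL; if the database rejects it, append the error
    message to the prompt and let the model try again."""
    prompt = question
    last_error = None
    for _ in range(max_attempts):
        sql = generate_sql(prompt)
        try:
            return run_query(sql)
        except Exception as exc:
            last_error = exc
            prompt = (f"{question}\n\nYour previous query failed with this "
                      f"error; please fix it:\n{exc}")
    raise RuntimeError(f"No valid query after {max_attempts} attempts") from last_error
```

This does not guarantee correctness either, but it recovers from simple syntax mistakes without user intervention.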

Conclusion

I hope this has been an exciting look at some of Google’s code generation capabilities for analytics. And while a system like this is not yet ready to replace a knowledgeable SQL analyst, it’s exciting to see the ways it could boost the existing abilities of already-technical users.

References and Further Reading

  1. I was inspired by this post from Ken Van Haren: Replacing a SQL analyst with 26 recursive GPT prompts
  2. Google Cloud Codey APIs

All code in this post is Copyright 2023 Google LLC, Licensed under the Apache License, Version 2.0.
