Build Your Own SQL Analyst Bot
Ask Questions of Your Structured Data; Get Grounded Answers
Introduction
One of the challenges of using LLMs (Large Language Models) in a business context is getting the model to answer questions about your company’s data factually and accurately. One possible solution is Retrieval Augmented Generation (RAG), using a vector database to populate the prompt context (see my post: Q&A With Your Docs: A Gentle Introduction to Matching Engine + PaLM). This works well for semi-structured data like text files and PDFs. But what if you wanted to retrieve data from a structured data source? What if we had our LLM use the results of an analytic database query? That’s what we are going to explore in this post.
Using Google’s new Codey APIs, announced at Google I/O earlier this year, we’ll build a system that:
- Converts the user’s natural language question to a SQL statement
- Runs that SQL statement against an analytic database
- Uses the query result to answer the user’s original question
We’ll also discuss prompt tuning as well as some of the shortcomings and limitations of a system like this.
In this how-to, I will be querying the NYC Citibike public dataset.
While I will explain each code module in order, the full code is provided at the end of this tutorial.
Step 1: Enable Needed Cloud APIs
Run `gcloud init` to authenticate with your GCP user and project.
Enable the necessary APIs for your project:
gcloud services enable aiplatform.googleapis.com --async
Step 2: Install Python Packages
If needed, install the necessary Python packages into your Python environment, using Pip:
pip install google-cloud-bigquery pandas google-cloud-aiplatform db-dtypes
Step 3: Get Example Data and Schemas
In order for the Codey LLM to know what the tables look like, we will need to give it the table schemas and example data.
The following code checks whether this information has already been retrieved (and saved locally), and runs the needed queries if not.
Note that because we are using Pandas dataframes to represent the query response, the default view truncations can sometimes be too restrictive. Hence the Pandas options adjustments at the beginning of this code.
The following code will create data.txt and schemas.txt files locally, if they don't already exist.
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

def get_data(tables):
    try:
        with open("./data.txt", "r") as d:
            data = d.read()
            return data
    except FileNotFoundError:
        data = ""
        for table in tables:
            if table == 'bigquery-public-data.new_york.citibike_trips':
                querystring = f"""
                    SELECT
                      *
                    FROM
                      `{table}`
                    WHERE gender != ""
                    LIMIT
                      5
                """
            else:  # 'bigquery-public-data.new_york.citibike_stations'
                querystring = f"""
                    SELECT
                      *
                    FROM
                      `{table}`
                    WHERE station_id IS NOT NULL
                    LIMIT
                      5
                """
            data += f"\n\nData for table: {table}:\n\n"
            data += str(client.query(querystring).result().to_dataframe())
        with open("./data.txt", "w") as d:
            d.write(data)
        return data

def get_schemas(tables):
    try:
        with open("./schemas.txt", "r") as s:
            schemas = s.read()
            return schemas
    except FileNotFoundError:
        schemas = ""
        for table in tables:
            querystring = f"""
                SELECT
                  column_name,
                  data_type
                FROM
                  `bigquery-public-data.new_york`.INFORMATION_SCHEMA.COLUMNS
                WHERE
                  table_name = "{table.split(".")[-1]}";
            """
            schemas += f"\n\nSchema for table: {table}:\n\n"
            schemas += str(client.query(querystring).result().to_dataframe())
        with open("./schemas.txt", "w") as s:
            s.write(schemas)
        return schemas
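As an aside, both get_data() and get_schemas() follow the same read-through cache pattern. That pattern can be factored into a small helper; the sketch below illustrates the idea (the `cached` name is my own, not part of the tutorial's code):

```python
# A sketch of the read-through file cache shared by get_data() and
# get_schemas(): return the saved file if it exists, otherwise compute
# the value, write it to disk, and return it.
def cached(path, compute):
    try:
        with open(path, "r") as f:
            return f.read()
    except FileNotFoundError:
        value = compute()
        with open(path, "w") as f:
            f.write(value)
        return value
```

With a helper like this, get_data(tables) becomes roughly `cached("./data.txt", lambda: build_data(tables))`, where `build_data` (a hypothetical name) is the query loop.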
Step 4: Generate SQL With Codey
Now we will generate the SQL query in response to the user’s question.
A couple of things to observe with this code:
- The actual prompt that Codey receives is a combination of the schemas and data of the available tables and the user’s question, all wrapped with additional context and rules for Codey.
- The rules are a result of troubleshooting and coercing the system to provide helpful, accurate answers.
- In the return statement, we strip out the ```sql and ``` markers; Codey responds with a Markdown-formatted code block, and we want only the raw query text.
def get_proposed_query(question, schemas, data):
    parameters = {
        "temperature": 0.2,
        "max_output_tokens": 1024
    }
    response = codey.predict(
        prefix = f"""
{schemas}
{data}
As a senior analyst, given the above schemas and data of bicycle trips in New York City, write a BigQuery SQL query to answer the following question:
{question}
When constructing SQL statements, follow these rules:
- There is no `MONTH` function; if you want the month, instead use `EXTRACT(month FROM starttime) AS month`
""",
        **parameters
    )
    return response.text.replace("```sql", "").replace("```", "")
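The two replace() calls handle Codey's typical output fine. A slightly more defensive variant (my own sketch, not part of the tutorial's code) strips any Markdown fence marker, whatever language tag it carries, and trims surrounding whitespace:

```python
import re

# Strip Markdown code-fence markers (```, ```sql, etc.) from a model
# response, leaving only the raw SQL text.
def strip_fences(text):
    return re.sub(r"```[a-zA-Z]*", "", text).strip()
```

For example, `strip_fences("```sql\nSELECT 1\n```")` returns `"SELECT 1"`.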
Step 5: Orchestrate the Function Calls in a main() Function
Finally, we can combine these function calls into a single main() function. Some things to note about this code:
- It includes argument parsing for command-line use; see below for usage examples.
- The SQL query result is not returned directly to the user. Instead, the text generation model (the PaLM text-bison@001 model) interprets the result as part of the final system response.
- The temperature of the text generation model is set to 0. The temperature can be thought of as the “creativity” of the model’s response. Because we only want the model to respond using the data available in the SQL query response, we set it to 0 (the lowest value). Feel free to experiment with different values.
def main():
    parser = argparse.ArgumentParser(description='A program to respond to analytic questions with SQL context')
    parser.add_argument('-q', '--question', help="The user's question")
    parser.add_argument('-v', '--verbose', action='store_true', help='Whether to print more detail')
    args = parser.parse_args()
    if not args.question:
        print("You must enter a question")
        return
    user_question = args.question
    tables = ['bigquery-public-data.new_york.citibike_stations', 'bigquery-public-data.new_york.citibike_trips']
    schemas = get_schemas(tables)
    data = get_data(tables)
    proposed_query = get_proposed_query(user_question, schemas, data)
    if args.verbose:
        print("\nProposed Query: \n", proposed_query)
    query_result = client.query(proposed_query).result().to_dataframe()
    if args.verbose:
        print("BQ Query Result: \n\n", query_result, "\n")
    prompt = f"""
Context: You are a senior business intelligence analyst.
Use the following query result to give a detailed answer to any questions you receive: {query_result}
If the column header is something like "f0_", that means you have a number for your answer.
Do not use any other information to answer the question. Only use information from the query result. Do not make up information. If you don't have enough information, say "I don't have enough information."
Question: {user_question}
"""
    print("Answer:", generation_model.predict(prompt, temperature = 0, max_output_tokens = 1024), "\n")
Step 6: Test and Observe
That’s it! Usage is as follows:
python cli_analyst.py -v -q <USER_QUESTION>
The `-v` flag prints additional information during the run. The `-q` flag precedes the question being asked.
The complete Python code is as follows:
import argparse
from google.cloud import bigquery
import pandas as pd
from vertexai.language_models import CodeGenerationModel
from vertexai.preview.language_models import TextGenerationModel

generation_model = TextGenerationModel.from_pretrained("text-bison@001")
codey = CodeGenerationModel.from_pretrained("code-bison@001")
client = bigquery.Client()

pd.options.display.max_columns = 500
pd.options.display.max_rows = 500


def get_data(tables):
    """Return example rows for each table, cached locally in data.txt."""
    try:
        with open("./data.txt", "r") as d:
            data = d.read()
            return data
    except FileNotFoundError:
        data = ""
        for table in tables:
            if table == 'bigquery-public-data.new_york.citibike_trips':
                querystring = f"""
                    SELECT
                      *
                    FROM
                      `{table}`
                    WHERE gender != ""
                    LIMIT
                      5
                """
            else:  # 'bigquery-public-data.new_york.citibike_stations'
                querystring = f"""
                    SELECT
                      *
                    FROM
                      `{table}`
                    WHERE station_id IS NOT NULL
                    LIMIT
                      5
                """
            data += f"\n\nData for table: {table}:\n\n"
            data += str(client.query(querystring).result().to_dataframe())
        with open("./data.txt", "w") as d:
            d.write(data)
        return data


def get_schemas(tables):
    """Return column names and types for each table, cached locally in schemas.txt."""
    try:
        with open("./schemas.txt", "r") as s:
            schemas = s.read()
            return schemas
    except FileNotFoundError:
        schemas = ""
        for table in tables:
            querystring = f"""
                SELECT
                  column_name,
                  data_type
                FROM
                  `bigquery-public-data.new_york`.INFORMATION_SCHEMA.COLUMNS
                WHERE
                  table_name = "{table.split(".")[-1]}";
            """
            schemas += f"\n\nSchema for table: {table}:\n\n"
            schemas += str(client.query(querystring).result().to_dataframe())
        with open("./schemas.txt", "w") as s:
            s.write(schemas)
        return schemas


def get_proposed_query(question, schemas, data):
    """Ask Codey to write a BigQuery SQL query that answers the question."""
    parameters = {
        "temperature": 0.2,
        "max_output_tokens": 1024
    }
    response = codey.predict(
        prefix = f"""
{schemas}
{data}
As a senior analyst, given the above schemas and data of bicycle trips in New York City, write a BigQuery SQL query to answer the following question:
{question}
When constructing SQL statements, follow these rules:
- There is no `MONTH` function; if you want the month, instead use `EXTRACT(month FROM starttime) AS month`
""",
        **parameters
    )
    # Codey returns a Markdown-formatted code block; keep only the raw query.
    return response.text.replace("```sql", "").replace("```", "")


def main():
    parser = argparse.ArgumentParser(description='A program to respond to analytic questions with SQL context')
    parser.add_argument('-q', '--question', help="The user's question")
    parser.add_argument('-v', '--verbose', action='store_true', help='Whether to print more detail')
    args = parser.parse_args()
    if not args.question:
        print("You must enter a question")
        return
    user_question = args.question
    tables = ['bigquery-public-data.new_york.citibike_stations', 'bigquery-public-data.new_york.citibike_trips']
    schemas = get_schemas(tables)
    data = get_data(tables)
    proposed_query = get_proposed_query(user_question, schemas, data)
    if args.verbose:
        print("\nProposed Query: \n", proposed_query)
    query_result = client.query(proposed_query).result().to_dataframe()
    if args.verbose:
        print("BQ Query Result: \n\n", query_result, "\n")
    prompt = f"""
Context: You are a senior business intelligence analyst.
Use the following query result to give a detailed answer to any questions you receive: {query_result}
If the column header is something like "f0_", that means you have a number for your answer.
Do not use any other information to answer the question. Only use information from the query result. Do not make up information. If you don't have enough information, say "I don't have enough information."
Question: {user_question}
"""
    print("Answer:", generation_model.predict(prompt, temperature = 0, max_output_tokens = 1024), "\n")


if __name__ == '__main__':
    main()
Let’s try some example queries. How about:
python cli_analyst.py -v -q "Which 3 months are the most trips taken in?"
Notice how the system uses the Codey API to propose and return a SQL query. Next it runs that query against the BQ table and returns the query response. And finally, notice how the text generation LLM correctly converts the month designations from numeric to lexical (i.e. the ninth month of the year is September). Pretty cool!
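As an aside, if you would rather not trust the LLM with that numeric-to-lexical conversion, Python's standard library handles it deterministically; a small sketch:

```python
import calendar

# Convert BigQuery's EXTRACT(month FROM starttime) output (1-12) to a
# month name deterministically, instead of relying on the LLM to do it.
def month_name(month_number):
    return calendar.month_name[month_number]
```

For example, `month_name(9)` returns `"September"`.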
Let’s try another one:
python cli_analyst.py -v -q "When was the youngest rider you have on record born?"
Whoops! It’s returning the oldest rider on record (apparently 138 years old…) rather than the youngest. Let’s fix that by adding an additional rule to the SQL generation prompt context. In the get_proposed_query() function, add a rule at the end of the rule list that says:
- The `birth_year` field gives the year the rider was born. Older riders have smaller birth years, and younger riders have larger birth years.
Now let’s try again:
Hey, that looks better! Giving the model some additional context and instructions can help it more accurately choose how to formulate the SQL statement.
Feel free to try more queries of your own. As you can see, this is a promising solution to generating factual results to routine queries. However, there are some limitations and shortcomings as we’ll discuss in the next section.
Discussion
While it could be tempting to hand a system like this to a non-technical user, there are a few limitations. First, there’s no guarantee that the Codey API will generate valid SQL. For example, the “There is no `MONTH` function…” rule in the get_proposed_query() function exists because I found the model trying to use a `MONTH()` function, which doesn’t exist in BigQuery. Even though we can mitigate this behavior with the appropriate rule, we don’t know the complete possible error space and thus cannot mitigate every possible invalid query.
Additionally, even when Codey generates valid SQL, there is no guarantee that it answers the question being asked. LLMs can’t read minds and can only work from the question as given; if the question is vague or poorly worded, the Codey API may interpret it differently than intended. Ultimately, there is no way to guarantee that the generated SQL is “fit for use” other than having a qualified analyst assess the statement. As a result, while it might be tempting to deploy a system like this directly to non-technical users, it would likely be more valuable to technical, SQL-literate users as an aid and accelerator for their work.
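One lightweight mitigation is to screen the generated text before sending it to BigQuery at all. The sketch below is my own illustration, not part of the tutorial's code: it accepts only statements that look like read-only SELECT (or WITH … SELECT) queries and rejects anything containing DML/DDL keywords. It is a crude filter, not a parser, but it narrows the blast radius:

```python
import re

# An illustrative guardrail (my own sketch): only hand generated text to
# BigQuery if it starts like a read-only SELECT (or WITH ... SELECT)
# statement and contains no data-modifying keywords anywhere in it.
def looks_like_readonly_select(sql):
    cleaned = sql.strip().rstrip(";")
    # Must begin with SELECT or WITH (case-insensitive).
    if re.match(r"(?is)\s*(SELECT|WITH)\b", cleaned) is None:
        return False
    # Reject statements containing DML/DDL keywords.
    return re.search(
        r"(?i)\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|MERGE|TRUNCATE)\b",
        cleaned,
    ) is None
```

In a real deployment you could additionally validate syntax without executing anything by submitting the query with BigQuery's dry-run mode (`bigquery.QueryJobConfig(dry_run=True)`).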
Conclusion
I hope this has been an exciting look at some of Google’s code generation capabilities for analytics. And while a system like this is not yet ready to replace a knowledgeable SQL analyst, it’s exciting to see the ways it could boost the existing abilities of already-technical users.
References and Further Reading
- I was inspired by this post from Ken Van Haren: Replacing a SQL analyst with 26 recursive GPT prompts
- Google Cloud Codey APIs
All code in this post is Copyright 2023 Google LLC, Licensed under the Apache License, Version 2.0.