You ask the questions, GPT digs the insights
A Gen AI approach to business data analysis
By Andrew Zhu and Jacob H. Marquez
The “Chat” in ChatGPT conveys that ChatGPT can provide information as a sort of dialog — albeit with a knowledge cutoff of September 2021, the temporal limit of ChatGPT’s training data.
But ChatGPT goes beyond being merely a powerful tool for providing useful information. Along with its GPT siblings such as GPT-3.5 or GPT-4 (collectively referenced throughout this article, along with ChatGPT, as GPT), it is essentially a reasoning machine. Through the OpenAI API, we can harness this reasoning power in a variety of ways. In this article, I present an approach for using GPT to fully automate business data analysis.
What this means in practice is that you can ask questions about your business and GPT can deliver:
- A query script to match your question, in SQL, Python, or even Kusto.
- Automatic execution of the script that is generated.
- Automatic data visualization with natural language guidance.
Now, you may be concerned that you can’t or don’t want to share your internal data with either Microsoft or OpenAI. Fortunately, with the approach I describe below, no internal raw data needs to be uploaded through the API to take advantage of GPT.
This article walks through the process of using the OpenAI GPT API to answer questions about your business through the automated querying of a SQL database. The examples can be easily adapted to work with other types of databases, if needed.
1. Generate query script
1.1. A quick sample
As data scientists or business analysts, when we write SQL to query the data in a database, we don’t need the actual data uploaded into our brains to do so (though futurists may be working on something to help with this à la the film The Matrix). We do need to understand the data — the model, tables, table definitions, variables, data types, and so on. Looking at a sample of the data could help as a reference point as well. All of this is geared toward developing context for easier query writing.
The way to use GPT for writing and executing SQL scripts is similar. GPT needs to know information about the data model, tables, and columns as well as any other context that would be necessary for a human being to query the database.
The key is to prepare a proper prompt for the GPT API. Here is a quick sample (from the OpenAI online examples). Suppose we are curious to understand the names of departments that employed more than 10 employees in the last three months.
To use GPT to query for this information, you must build a prompt, such as that provided in the OpenAI documentation as follows, with detailed information about the data and what to generate.
### Postgres SQL tables, with their properties:
#
# Employee(id, name, department_id)
# Department(id, name, address)
# Salary_Payments(id, employee_id, amount, date)
#
### A query to list the names of the departments which employed more than 10 employees in the last 3 months
SELECT
You can try it for yourself! Copy and paste the prompt into the public GPT for it to return the following result. If we have the three tables holding the data in a PostgreSQL database, the execution of the following script should provide the data we request. A pipeline in Python code can automate all these steps.
SELECT d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > NOW() - INTERVAL '3 months'
GROUP BY d.name
HAVING COUNT(*) > 10
But the simple sample here is not good enough. To generate a complex script for more complex questions, we need to design a better prompt.
1.2. Use prompt roles
The sample used above follows a previously available code completion guide, which is now deprecated. However, we can use the newer chat models, which can handle the same tasks and more.
Here is a Python example using the new API (from the OpenAI document; their example Python notebook):
import openai
openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
{"role": "user", "content": "Where was it played?"}
]
)
Both GPT-3.5 and GPT-4 are available to use. Check out the Models documentation for a full list. When your request to access GPT-4 is approved, you can reuse all those code samples and prompt templates without any editing.
In the code sample above, messages is an array of dictionaries or objects that holds the conversation thus far. Each object needs a “role” and “content”; depending on the role, the content is a behavioral directive, a question, or a previous answer.
In the example above, the “system” role defines the personality profile that GPT should adopt. It informs the assistant role (i.e., the responses) about how to behave; for example, the profile of a “helpful assistant” or, in our case, “a Microsoft SQL Server T-SQL expert who can write high-quality scripts.” This role is optional but helpful for ensuring the desired results.
Next, the “user” role represents us, the real human beings asking questions or providing guidance. The value of “content” can include all data-related context, table descriptions, column definitions, and even rules to teach GPT to write code!
The “assistant” role holds the previous replies from GPT. In this way, GPT has access to all context to answer the most recently asked questions.
With the help of the messages array, we can build a chat-style conversation using the GPT API, like the experience we have with ChatGPT but more powerful. The messages array doesn’t need to be created with existing conversation content if there isn’t any; instantiating it with a “system” message and a single “user” message is enough to retrieve a response from GPT.
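For instance, a brand-new conversation can be as small as this (the prompt strings here are illustrative placeholders):

```python
# A minimal messages array for a fresh conversation: one "system" message to
# set the behavior plus one "user" message is enough for a first response.
messages = [
    {"role": "system", "content": "You are a T-SQL expert who writes high-quality scripts."},
    {"role": "user", "content": "List the names of departments with more than 10 employees."},
]
# response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
```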
1.3. Design the prompt function
With the basics of the prompt function established, we can now construct a prompt to help us get GPT to answer more complex questions. This is done by adding several components to the prompt text:
- System role
- Dataset context (data model, table definitions, columns, and data types)
- Language guidance (optional)
- Additional prompt together with GPT prompt techniques
We expect the user to input only the question and request in natural language. Therefore, a function that connects these components is required.
Let’s cover each section one at a time.
1.3.1. System role prompt
The system role definition asks GPT to perform a role, for example a Microsoft SQL Server T-SQL expert who can write high-quality scripts.
system_context = '''
You are a T-SQL expert who can generate high-quality T-SQL scripts.
Double check T-SQL grammar, double check T-SQL syntax.
Don't explain the code, just generate the code block itself.
'''
Besides the role definition, the “double check” prompt enables GPT to fix some of its own grammatical errors. (We still aren’t quite sure how this works internally; it may be a side effect of emergent LLM abilities.)
The last line, “don’t explain the code,” tells GPT to omit the code explanation. Typically, when asked for code examples, GPT provides long descriptions including both code and explanations for what the code is doing. Because we want to automate the execution of the returned scripts, we need to tell GPT to provide only the code. More on how to do this is included below.
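As a defensive measure (my own addition, not part of the article's pipeline), it can still be worth stripping any ``` fences from the response before executing it, since models occasionally wrap code despite the instruction:

```python
# Strip leading/trailing ``` fences that a model may add despite the
# "don't explain the code" instruction, so the script is safe to execute.
def strip_code_fences(text: str) -> str:
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    return "\n".join(lines)

print(strip_code_fences("```sql\nSELECT 1\n```"))  # SELECT 1
```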
1.3.2. Build prompt for dataset context
For GPT to answer our questions with data, it needs to be provided with context: specifically, metadata about the data model such as table definitions, variable data types, and relationships. This forms the second component of the final prompt, which we store as “dataset_context”.
Note: The “--” before each line in the script below marks a T-SQL comment, which helps GPT produce a T-SQL script.
dataset_context = '''
-- T-SQL Script, Microsoft SQL Server.
-- Table Definitions:
-- Employee(id:int, name:varchar, department_id:int)
-- Department(id:int, name:varchar, address:varchar)
-- Salary_Payments(id:int, employee_id:int, amount:decimal, date:datetime)
'''
Table definitions give GPT an idea of the columns and data types. Further context about what each table is used for is also helpful. To provide it, we can add a table description to this “dataset_context” component of the prompt. Write this description in natural language, i.e., in the way you might explain the table to a colleague. GPT can understand the important points even if there are minor grammatical errors. However, do make sure to spell the table names and variables correctly; otherwise, the scripts will be incorrect.
dataset_context = '''
-- T-SQL Script, Microsoft SQL Server.
-- Table Definitions:
-- Employee(id:int, name:varchar, department_id:int)
-- Department(id:int, name:varchar, address:varchar)
-- Salary_Payments(id:int, employee_id:int, amount:decimal, date:datetime)
-- Table Description:
-- Employee table holds all the employee information; each record includes the employee id, employee name, and the employee's department id
-- Department table holds the information about the Employee department structure.
-- Salary_Payments table holds the salary data for all employees across time, at daily grain.
'''
Notice that there is a reference to “id” across the three tables, but its meaning varies among them. As a result, we need to explain to GPT what “id” means for each instance of this column across all tables. It is helpful to do this for all columns, especially if your tables include a lot of internal jargon and acronyms. We can add a “Column Definitions” section to the “dataset_context” prompt text.
dataset_context = '''
-- T-SQL Script, Microsoft SQL Server.
-- Table Definitions:
-- Employee(id:int, name:varchar, department_id:int)
-- Department(id:int, name:varchar, address:varchar)
-- Salary_Payments(id:int, employee_id:int, amount:decimal, date:datetime)
-- Table Description:
-- Employee table holds all the employee information; each record includes the employee id, employee name, and the employee's department id
-- Department table holds the information about the Employee department structure.
-- Salary_Payments table holds the salary data for all employees across time, at daily grain.
-- Column Definitions:
-- The id from the Employee table represents the id of an employee, and is unique
-- The id from the Department table represents the id of a department
-- The id from Salary_Payments represents the payment id, one id for each payment
'''
The more specific the description we add to the dataset_context prompt, the better the quality of the T-SQL code that will be generated.
1.3.3. Build language guidance prompt
GPT is pretty good at generating scripts for commonly used languages like SQL and Python, so you typically don’t need to provide a language guidance section. However, if you are going to ask GPT to produce a script that is not very popular, commonly used, or has newly released syntax, you can use another section of the prompt text to define it. Here is a sample of how to guide GPT to generate a Kusto script:
kusto_lang_guide = '''
// Kusto Language Guide:
// Use the datetime_add() function to get the desired date window
// Use the datetime() function to wrap a datetime string, for example: datetime('2023-01-01')
// Use todatetime(tostring(DIM_DateId)) to convert DIM_DateId from int to datetime
// Use the tostring() function to convert a number to a string
// Use the now() function to get the current datetime
// Use $left.columnname == $right.columnname when joining two tables
// Use double parentheses (()) to wrap a subquery, for example: table1 | where userid in ((table2 | distinct userid))
// Use the getyear() function to get the year from a datetime, for example: getyear(datetime('2023-05-01'))
// Use the getmonth() function to get the month from a datetime, for example: getmonth(datetime('2023-05-01'))
// Use the let keyword for variable assignment
'''
You can build your own language guidance to suit your own purpose.
1.3.4. Additional prompt techniques
There are many effective prompt engineering techniques being discovered and discussed by researchers and engineers. OpenAI provides a document that summarizes techniques to improve GPT reliability. In this section, we use some of these techniques to generate more reliable scripts.
When a user inputs a question such as “List the names of the departments that employed more than 10 employees in the last three months,” we don’t just append the question to the prompt string. We must add some additional words around it to help GPT generate a T-SQL script to answer the question. This is done by adding a section “T-SQL script to … ” like this:
user_input = f"""
-- T-SQL script to {user_input},
-- Use minimal table required.
-- Build the script step by step with comments,
"""
The prompt “use minimal table required” informs GPT to use only the required tables. If you don’t add this, GPT is more likely to join all the tables you provide to answer the question, even when the question can be answered from a single table. It is a little silly, but a byproduct of how each LLM behaves. Note that different models can behave differently.
Next is the most fun part. The “step by step” guidance in the “user_input” text string is like a magic wand that greatly improves the quality of code generation. It does this by telling GPT to make the steps that lead to the result clear in the code. Specifically, GPT adds comments for step 1, step 2, …, through step n. This can help us understand the meaning of each line of code (and also helps GPT itself, I think).
If GPT provides you with nonsense answers to your question, try “step by step” and you may get much better results.
With the components now created, we can construct the final prompt text (stored as “user_prompt”). Provide “system_context” as the system role content, and then join all the other user input prompts into one string to provide as the user role content. Here is an example:
user_prompt = ''.join([table_definition, kusto_lang_guide, user_input])
messages = [
{"role": "system", "content": system_context},
{"role": "user", "content": user_prompt+'{text}'}
]
Notice the {text} mark above. This is where GPT is going to complete the question. In essence, GPT is just a machine that outputs the next most likely text (actually, a token).
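Pulling the components together, the connecting function mentioned in section 1.3 might look like the following sketch (the function name and the commented-out API call are my own; only the question comes from the end user):

```python
# Sketch: assemble the prompt components into the messages array.
# Only `user_question` comes from the end user; the other pieces are the
# fixed prompt components built in the sections above.
def build_messages(system_context, dataset_context, user_question):
    user_input = (
        f"-- T-SQL script to {user_question},\n"
        "-- Use minimal table required.\n"
        "-- Build the script step by step with comments,\n"
    )
    user_prompt = "".join([dataset_context, user_input])
    return [
        {"role": "system", "content": system_context},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages(
    "You are a T-SQL expert.",
    "-- Employee(id:int, name:varchar, department_id:int)\n",
    "count employees per department",
)
# response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=msgs)
```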
1.4. Execute query script
If everything is done right, the result is a high-quality T-SQL script from GPT. Simply copy and paste the code to execute it and test the result.
But our final goal is to execute the script automatically, which is not hard to achieve. For SQL Server, use pyodbc to submit the script and execute it in the same Python session. If you are not familiar with pyodbc, here is the documentation from Microsoft.
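As an illustration of the execute step, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in so the example runs without a database server; for SQL Server you would replace sqlite3.connect(...) with pyodbc.connect(conn_str) and keep the rest the same:

```python
import sqlite3

def run_query(conn, script):
    """Execute a generated SQL script on an open DB-API connection and return the rows."""
    cur = conn.cursor()
    cur.execute(script)
    rows = cur.fetchall()
    cur.close()
    return rows

# Stand-in database; with SQL Server this would be pyodbc.connect(conn_str).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department(id INTEGER, name TEXT, address TEXT);
INSERT INTO Department VALUES (1, 'Engineering', 'Bldg 1'), (2, 'Sales', 'Bldg 2');
""")
rows = run_query(conn, "SELECT name FROM Department ORDER BY id")
print(rows)  # [('Engineering',), ('Sales',)]
```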
For Kusto, I am using the ASDSDR tool to execute Kusto scripts from Python, and with Azure AD authentication, I can execute Kusto scripts in Python without needing to manually input a password.
2. Data visualization
Assume we have the data in the Python context, maybe in a list or as a Pandas Dataframe object. How do we visualize the data? There are many options, including:
- Use a predefined visualization package that can draw line, bar, and pie charts, and ask GPT to produce a function call to visualize the data.
- Ask GPT to provide a fixed Python function that handles all visualization tasks, so that we don’t need to generate the visualization every time.
- Ask GPT to generate a Python script, save it as a Python file, and use os.system(cmd) in Python to execute that script, loading the pre-saved data from a CSV file.
Although these are all reasonable solutions, they either lack flexibility or are too complex to actually implement. They also underestimate the capabilities of GPT.
After some experimentation, I have found an approach that enables users to produce visualizations with any package, for any plot type (either static or interactive), with any style (Matplotlib, Seaborn, Plotly, you name it). You don’t even need to provide the plot type, such as line chart or bar chart; you simply ask GPT to visualize the data for a better understanding of it. Here is how I did it.
2.1. Execute Python code strings inside a running Python runtime
While the Python runtime is running, how do we ask Python to execute newly generated Python code? One solution is to use the eval() function, which has access to context data.
a = 1
b = 2
exe_str = "a+b"
print(eval(exe_str))
# return 3
The example above shows that the eval() function has access to the values of a and b. The drawback is that eval() can handle only simple Python expressions. You can’t evaluate a function definition or a complete visualization code block.
The other solution is the exec() function. What’s good about exec() is that it can execute complex, multi-line Python scripts in string format. However, exec() always returns None; the value of the evaluated expression is not passed back. For example:
a = 1
b = 2
exe_str = "a+b"
print(exec(exe_str))
# return none
Here the result is None, because exec() discards the value of the expression it runs. Nevertheless, the power to execute long, arbitrary Python scripts in string format in one function is all that we need, as we can work around the missing return value.
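One simple workaround is to pass an explicit namespace dictionary to exec(): the generated code reads our variables from the dict and writes any results back into it.

```python
# Work around exec()'s None return value by sharing a namespace dict:
# the executed string reads a and b from `ns` and stores its result there.
ns = {"a": 1, "b": 2}
exec("result = a + b", ns)
print(ns["result"])  # 3
```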
2.2. Make the Pandas Dataframe object into a string
The idea here is to turn the data in the dataframe object into a JSON string. Usually, this works because the aggregated data size is relatively small and not down to the individual customer level. Then provide the data along with the visualization prompt to GPT, and ask GPT to visualize the data with user guidance.
Note that this approach does not work for plotting dots into the millions due to the GPT token limitation. Also, one must be cautious if the data is sensitive or related to individual customers — in those cases, I would suggest using another approach such as asking GPT to provide a function to visualize the data instead of providing the data to GPT.
The best part of this solution is its flexibility. When GPT holds the aggregated data, GPT automatically generates the entire visualization layer code, with no need to tune anything. Simply call exec(vis_code_str) to visualize the result; there is no need to specify the column names, the structure of the dataframe, or the size of the dataframe. This solution saves a lot of plumbing code in Python to implement the automation.
2.3. Visualization prompt design
Similar to the T-SQL script generation prompt, we also break down the prompt into several parts:
- System role prompt
- Fixed visualization rule prompt
- GPT hypnosis prompt
First, make the pandas Dataframe object into a string. We can turn the Dataframe object into a JSON string using one line of code:
data_json = str(df_data.to_json())
The str() call here is just a safeguard: to_json() already returns a string in JSON format.
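To make the step concrete, here is what the conversion produces for a small illustrative dataframe (the column names are made up for the example):

```python
import pandas as pd

# A tiny illustrative dataframe of aggregated (non-sensitive) data.
df_data = pd.DataFrame(
    {"department": ["Engineering", "Sales"], "salary_payment": [120000, 80000]}
)
data_json = str(df_data.to_json())
print(data_json)
# {"department":{"0":"Engineering","1":"Sales"},"salary_payment":{"0":120000,"1":80000}}
```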
2.3.1. Build the system role prompt
For this, we ask GPT to function as a visualization expert:
system_content = '''
You are a Python expert who can generate Python code,
you are good at data visualization and plotting,
and can generate Python code to plot professional and beautiful charts and plots.
Provide example usage as a code comment.
Don't explain the code, don't wrap the code with the block mark ``` ```,
just generate the code block itself
'''
Similar to the script generation role, we add more guidance to ask GPT to behave like a visualization expert.
2.3.2. Build the fixed visualization rule prompt
We only need the user to input a minimal description to generate the desired plot. But GPT requires a more specific description, creating a gap, and so we use the fixed visualization rule prompt to bridge the gap.
fixed_vis_rule = '''
# Default visualization settings:
# Tilt the x axis label for better reading
# The default size is 10 inch height, 16 inch width
# Round the label value to keep two decimals
# Show data label in readable format, as G, M, K. for example, 12456 show as 12k, 1234567 show as 1.2M
# Tilt the data label angle for better reading
'''
The prompt largely explains its own purpose: it asks GPT to tilt the x-axis labels so that long labels are shown at an angle, and it defines the default width, height, and more.
If you put the fixed visualization prompt above the user input prompt, end users can update the settings as needed; for example, they can ask GPT to plot a chart with an eight-inch height and a 12-inch width to override the default setting.
2.3.3. Build the GPT prompt with step-by-step instructions
As always, this is the most fun part, as you “hypnotize” GPT to build the script step by step:
gpt_hypnosis = """
# Python to plot the data,
build the script step by step,
double check Python grammar,
double check Python syntax,
double check logic.
"""
Sometimes, I have found that GPT puts the date and time on the x-axis even when I need the x-axis to be category names instead. That is why I’ve included “double check logic” in the prompt.
2.3.4. Combine the prompts
Now combine all the prompts together:
user_prompt = f'''
# Generate Python code to visualize the following data:
{data_json}
{fixed_vis_rule}
{user_vis_rule}
{gpt_hypnosis}
'''
messages = [
{"role": "system", "content": system_content},
{"role": "user", "content": user_prompt+'{text}'}
]
The string “user_vis_rule” is provided by the user, and can read like this:
user_vis_rule = '''
# user visualization rules:
# Use plotly to visualize the data
# Plot bar chart
# Department name on the x axis, salary payment as the data value
# Chart title is Monthly Department Salary Payment
# The chart size is 8 inch height, 12 inch width
'''
The user input visualization prompt tells GPT to:
- Use the plotly package to visualize the data. We can also change it to Matplotlib or any other library. Note that we need to install the package to be used beforehand.
- Plot a bar chart (which we can change to another reasonable chart style).
- Distinguish between what is to be placed on the x-axis and what data is used to plot. Note that the Dataframe object can have many columns; if we don’t specify the x-axis and the data to plot, GPT may generate the chart based on its own understanding of the data, which might not be what we want. (There is a high chance that you will get a plot that shows the datetime on the x-axis.)
- Provide the title of the chart.
- Override the chart size.
2.4. Render the chart
Call the API and GPT returns the entire Python script. There is no need to tune or update the script, simply call exec() to execute the returned Python code:
response = openai.ChatCompletion.create(
model = model_name
, messages = messages
, temperature = 0
)
script_raw = response.choices[0].message["content"]
print(script_raw)
exec(script_raw)
In the sample above, I also use print(script_raw) first to review the generated Python code. After all, we humans need to make the decision whether to use or abandon the generated code.
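That review step can also be made explicit. Here is a small sketch (the function and its auto_approve flag are my own additions, not part of the article's pipeline) that prints the generated script and asks for confirmation before executing it:

```python
# Keep a human in the loop: show the generated script and require approval
# before handing it to exec(). auto_approve skips the prompt for pipelines.
def review_and_exec(script, auto_approve=False):
    print(script)
    if auto_approve or input("Execute this script? [y/N] ").strip().lower() == "y":
        exec(script, {})
        return True
    return False

review_and_exec("print('hello from generated code')", auto_approve=True)
```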
3. Summary
Traditionally, to dig insights from data, data scientists and analysts have not only needed to be experts in the data, they’ve also needed to think about scripting, spend time discussing requirements, and do performance tuning — not to mention finding code to visualize the data. And after doing all of this, it could still easily take another month to build a dashboard while conquering numerous click-and-drag errors and doing the styling.
But one report or well-built dashboard is rarely used for more than a year. And during this time, people typically are wowed by the result during the first week, and then gradually ask for more features, causing the data scientist to struggle to refresh the data and regularly update the logic. It all might be interesting during the first iteration of building it, but this soon gives way to merely labor-intensive work.
If we look introspectively, however, we see that all we need from the data is to address the key questions that help us improve the business, reduce costs, or enrich the customer experience.
With the help of GPT in business data analysis, we can create a situation in which someone with a business question can type in the question and AI will provide the code, resulting data, and visualizations. This means that what used to take a month to produce in the way of answers can now be done in perhaps an hour, and if everything goes well, maybe five minutes.
These revolutionary changes are just getting started in the data analysis field. The software industry is gradually evolving to “Software 2.0,” and something similar is happening to data analysis.
Not everyone likes it, but the improvements in productivity and reduction in manual effort will soon cause this new style of data analysis to prevail across the industry. Instead of rejecting the change, a better strategy may be to embrace it, and realize the resulting benefits.
So, today, what can we as data scientists and analysts do to stand strong for the future? There are several things that are likely to still be important for data workers:
- Understanding the business and the data.
- Providing clear and detailed documentation of the data and business scope.
- Being good at scripting, as AI-generated results rely on humans to review.
- Bringing good taste in visualization and even art, to help build an effective and professional visualization layer.
- Building high-quality datasets.
On one hand, some may worry about job reduction and AI taking over; on the other hand, numerous new opportunities are likely to be created while manual, repetitive, and boring work is reduced. Is this good? I think so.
Let’s help make the change together step by step and see where the future will go.
4. References and useful links
- https://platform.openai.com/docs/guides/code
- https://platform.openai.com/docs/guides/gpt/chat-completions-api
- https://learn.microsoft.com/en-us/sql/connect/python/pyodbc/step-3-proof-of-concept-connecting-to-sql-using-pyodbc?view=sql-server-ver16
- https://karpathy.medium.com/software-2-0-a64152b37c35
- Emergent Abilities of Large Language Models: https://arxiv.org/abs/2206.07682
- Techniques to improve reliability: https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md