Automating data analytics with ChatGPT

James Nguyen
Data Science at Microsoft
13 min read · Jul 4, 2023

As an AI language model, ChatGPT is becoming well known for its ability to generate text-based responses to given prompts and contextual cues. Numerous useful business applications exist for LLMs (large language models, which include ChatGPT) to work with unstructured text data, such as summarization, information extraction, and augmented data generation. One area where LLMs, including ChatGPT, have not yet shined, however, is structured data, and particularly analytics on structured data.

The main reason for this is that LLMs, including ChatGPT, are trained with a focus on the language and communication aspects of information, and not quantitative reasoning. To date this has meant that they are less reliable for quantitative reasoning tasks, including documented instances in which ChatGPT and other LLMs have made errors in dealing with numbers.

But most critical sources of information in business still involve structured data, so the ability to automate reasoning with structured data is highly desirable. In this article, I present a methodology, coupled with a practical reference implementation, to turn ChatGPT into a powerful business analytics assistant.

Idea

Although ChatGPT has limitations in working with numbers and quantitative data directly, it’s worth noting that humans also make mistakes easily when performing calculations that involve complicated figures and vast amounts of data. That’s why, instead of doing the number crunching by hand, a trained professional benefits from tools that handle the heavy lifting and produce accurate insights.

This, of course, is possible only with the necessary high-level knowledge and skills, such as business domain expertise, data analytics techniques spanning simple calculation to advanced statistics, and knowledge of how to use them.

Fortunately, ChatGPT excels at providing the high-level knowledge component. Professionals can turn to ChatGPT to get advice on solving business problems, doing data analysis, and writing code, for example.

This leads to a compelling question: What if we could teach ChatGPT to leverage such tools and the thought process behind them to analyze problems within specific domains, particularly business analytics? By exploring this possibility, we can potentially expand ChatGPT’s capabilities and transform it into a valuable tool for data analytics professionals.

Goal

The objective is to allow users to ask complex analytical questions of business data, which often exists in structured SQL databases. Ultimately, the goal is for ChatGPT to deliver the answers in the best possible format, complete with rich visualizations that make it easier for users to comprehend the results. By achieving this, users can derive valuable insights from business data without needing to possess advanced technical skills.

Approach

Expanding from the initial idea described above, here is the overall approach to building the desired capability:

  1. Use ChatGPT’s broad knowledge of data and business analytics to plan execution at both high and detailed levels, with help from context that we provide.
  2. Guide ChatGPT to break a complex problem or question into addressable steps. For this we can use a popular technique in LLM prompt engineering known as chain of thought (CoT). Additionally, we can augment this approach with a more advanced CoT technique called ReAct, which enables ChatGPT to re-evaluate the planned approach based on observations of the results of intermediate steps.
  3. Give ChatGPT the necessary tools to perform data retrieval and data analysis. Here we can take advantage of ChatGPT’s ability to write SQL queries and Python data analysis code when designing the tools.
  4. Design the prompt to instruct ChatGPT to perform the specific action at each step.
  5. As ChatGPT is merely the “brain,” supplement the approach with inter-system communication.
  6. Build the end user application.

Solution design

The following diagram illustrates the overall design of the solution.

Process

Similar to a manual analytics process, the automated analytics application is designed around three main stages:

  1. Data acquisition: This stage involves retrieving the data from the source system to answer the business question. The automation of this stage requires knowledge of the data schema of the source system and the requisite business knowledge for selecting the right data.
  2. Analytical computation: This stage encompasses everything from simple computations such as aggregations to statistical analysis and machine learning.
  3. Presentation: This stage involves visualizing and presenting the data to the user.

ChatGPT agents

The ChatGPT agents perform tasks on their own by using the tools, observing the results, and adjusting their actions based on their observations. There are two agents in this design:

  1. Data engineer: This agent is responsible for performing data acquisition from the source system (in this case a SQL database). The data engineer agent receives instructions from the data scientist agent.
  2. Data scientist: The main agent in this solution, responsible for producing the end result as prompted by a human user’s request. The data scientist agent can request that the data engineer agent acquire the necessary data, and then the data scientist agent uses tools to perform data analytics to produce a final answer.

The use of two separate agents follows the design thinking of dividing a potentially complex task into multiple sub-tasks that are easier for ChatGPT to work on.

Agent tools

Like tools used by humans, tools for agents enable them to perform complex tasks via high-level actions without having to worry about the technical details behind them.

  1. Tools for the data engineer agent include Python Console and SQL Query. Python Console is a Python utility function that retrieves data from the source system based on the input of a SQL query. The data engineer agent must create the SQL query based on its knowledge of the source system and the requirements specified by the data scientist agent.
  2. Tools for the data scientist agent include Python Console, Data Analytics & ML Libraries, Plotly for visualization, and the display() utility function. The display() function helps the data scientist agent communicate the results to the end user by implementing output communication specific to the application platform.

Application platform

We need an application platform for ChatGPT’s tools and agents to run on, as well as to serve end users a UI for interacting with the agents. The platform needs to provide the following capabilities:

  • Running Python code, including calls to the ChatGPT API, back-end SQL systems, and data analytics libraries.
  • Data visualization via a web interface.
  • Web application widgets for interacting with end users.
  • Managing stateful memory during an interactive session.

There are many great Python-based platforms that can support web-based data visualization and interaction such as Dash and Bokeh, but I prefer Streamlit because of its capability to support stateful memory and its simplicity in development. If you are new to Streamlit, you can learn about it here.
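As a small illustration of that stateful-memory capability, here is a minimal sketch, under the assumption that named pandas dataframes are what needs to persist between reruns, of how st.session_state can carry data across interactions in a session (the function and key names here are illustrative, not the reference implementation):

```python
import streamlit as st
import pandas as pd

# st.session_state survives script reruns within a user session, so a dataset
# produced in one step (or by one agent) can be picked up by a later step.
if "datasets" not in st.session_state:
    st.session_state["datasets"] = {}  # illustrative store for named dataframes

def persist(name: str, data: pd.DataFrame) -> None:
    """Save a dataframe under a name for later steps in the session."""
    st.session_state["datasets"][name] = data

def load(name: str) -> pd.DataFrame:
    """Retrieve a dataframe saved earlier in the session."""
    return st.session_state["datasets"][name]
```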

In a production deployment, you may choose to deploy the agents as RESTful back-end APIs using a framework such as Flask and deploy the UI layer using Streamlit as a front-end application for scalability and maintainability.
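As a rough sketch of that split, a back-end endpoint wrapping the data scientist agent might look like the following. The route name and the agent entry point are assumptions for illustration, not part of the reference implementation:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_data_scientist_agent(question: str) -> str:
    """Placeholder for the data scientist agent's run() entry point."""
    return f"Answer to: {question}"

@app.route("/analyze", methods=["POST"])  # hypothetical endpoint name
def analyze():
    payload = request.get_json()
    answer = run_data_scientist_agent(payload["question"])
    return jsonify({"answer": answer})

if __name__ == "__main__":
    app.run(port=5000)
```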

Data sources

SQL or structured data is the primary source for data analysis. In this implementation, two options are provided — SQLite and SQL Server — but you can easily expand them to any SQL source with only minor changes.

Implementation

Implementation encompasses agents, tools, and the application platform.

Agents

Agents are implemented following the ReAct framework, as mentioned earlier. With its built-in knowledge of business and data analytics and with an initial prompt, the agent plans how to solve an input question. For non-trivial problems, an agent might require multiple intermediate analysis steps, sometimes leading to unanticipated yet advanced outcomes, and observations gained during these steps may change the original plan.

The figure below illustrates this approach.

Prompts design

An agent prompt follows this structure:

  • Role definition
  • Task instruction
  • Tools and their use
  • Few-shot examples

Note that for this level of complexity, few-shot examples are needed to help ChatGPT understand details that are difficult to convey with instructions alone. Code in the few-shot examples highlights important logic that we want ChatGPT to follow, with unnecessary specifics omitted for brevity and generalizability.
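To make this structure concrete, here is a minimal sketch of how the four parts might be assembled into a single system prompt for the chat completion API. The variable contents below are illustrative placeholders, not the repository’s actual prompt strings:

```python
# Illustrative only: compose the system prompt from the four parts listed above.
role_definition = "You are a data scientist who answers business questions by writing Python code."
task_instruction = "Follow the Thought/Action/Observation format and finish with a Final Answer."
tool_description = "Tools: display(data) renders results to the user; print() lets you observe data yourself."
few_shot_examples = "<<Template>>\nQuestion: ...\nThought: ...\nAction: ...\nFinal Answer: ...\n<<Template>>"

system_prompt = "\n\n".join([role_definition, task_instruction, tool_description, few_shot_examples])

# The composed prompt becomes the system message; the user's question is the first user message.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Question: Which product categories grew fastest last quarter?"},
]
```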

Here is a prompt template for use with the data scientist agent:

You are a data scientist who helps answer business questions by writing Python code to analyze data and draw business insights.
You have the help of a data engineer who can retrieve data from the source system according to your request.
The data engineer makes the data you request available as a pandas dataframe variable that you can use.
You are given the following utility functions to use in your code to help you retrieve data and visualize your result to the end user.
1. display(): This is a utility function that can render different types of data to the end user.
- If you want to show the user a plotly visualization, then use ```display(fig)```
- If you want to show the user data that is a text, a pandas dataframe, or a list, use ```display(data)```
2. print(): use print() if you need to observe data for yourself.
Remember to format Python code as ```python\n PYTHON CODE HERE ``` in your response.
Only use display() to visualize or print results to the user. Only use plotly for visualization.
Please follow the <<Template>> below:
"""
few_shot_examples="""
<<Template>>
Question: User Question
Thought: First, I need to acquire the dataset needed for my analysis
Action:
```request_to_data_engineer
Prepare a dataset with customers, categories and quantity, for example
```
Observation: Name of the dataset and description
Thought: Now I can start my work to analyze data
Action:
```python
import pandas as pd
import numpy as np
#load data provided by data engineer
step1_df = load("name_of_dataset")
# Fill missing data
step1_df['Some_Column'] = step1_df['Some_Column'].replace(np.nan, 0)
#use pandas, statistical analysis or machine learning to analyze data to answer business question
step2_df = step1_df.apply(some_transformation)
print(step2_df.head(10))
```
Observation: step2_df data seems to be good
Thought: Now I can show the result to user
Action:
```python
import plotly.express as px
fig=px.line(step2_df)
#visualize fig object to user.
display(fig)
#you can also directly display tabular or text data to end user.
display(step2_df)
```
... (this Thought/Action/Observation can repeat N times)
Final Answer: Your final answer and comment for the question
<<Template>>

Here is a prompt template for use with the data engineer agent:

You are a data engineer who helps retrieve data by writing Python code to query data from a DBMS based on a request.
You generally follow this process:
1. You first need to identify the list of usable tables
2. From the question, you decide which tables are needed to acquire the data
3. Once you have the list of table names you need, you need to get the tables’ schemas
4. Then you can formulate your SQL query
5. Check your data
6. Return the name of the dataframe variable, attributes and summary statistics
7. Do not write code for more than 1 thought step. Do it one at a time.

You are given the following utility functions to use in your code to help you retrieve data and hand it over to your user.
1. get_table_names(): a Python function that returns the list of usable tables. From this list, you need to determine which tables you are going to use.
2. get_table_schema(table_names: List[str]): returns schemas for a list of tables. Run this function on the tables you decided to use so that you can write a correct SQL query.
3. execute_sql(sql_query: str): a Python function that queries data from the database given the query.
- From the tables you identified and their schema, create a SQL query that is syntactically correct for {sql_engine} to retrieve data from the source system.
- execute_sql returns a pandas dataframe containing the results of the query.
4. print(): use print() if you need to observe data for yourself.
5. save("name", data): persists a dataset for later use.
Here is a specific <<Template>> to follow:
"""

few_shot_examples="""
<<Template>>
Question: User Request to prepare data
Thought: First, I need to know the list of usable table names
Action:
```python
list_of_tables = get_table_names()
print(list_of_tables)
```
Observation: I now have the list of usable tables.
Thought: I now choose some tables from the list of usable tables. I need to get the schemas of these tables to build the data retrieval query
Action:
```python
table_schemas = get_table_schema([SOME_TABLES])
print(table_schemas)
```
Observation: Schema of the tables are observed
Thought: I now have the schema of the tables I need. I am ready to build query to retrieve data
Action:
```python
sql_query = "SOME SQL QUERY"
extracted_data = execute_sql(sql_query)
#observe query result
print("Here is the summary of the final extracted dataset:")
print(extracted_data.describe())
#save the data for later use
save("name_of_dataset", extracted_data)
```
Observation: extracted_data seems to be ready
Final Answer: Hey, data scientist, here is name of dataset, attributes and summary statistics
<<Template>>

As you can see from the templates, each agent is made aware of the presence of the other agent so that they can collaborate in a chain. In this implementation, the data engineer agent assists the data scientist agent, which in turn interacts with the user. Each agent uses a run method to execute the ReAct flow: the run method coordinates the execution of tools and interacts with the LLM (ChatGPT in this example) in a loop, stopping when the agent finds a final answer to the question or request. Please see my GitHub repo for more details.
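The exact run method lives in the repository, but a simplified sketch of this kind of ReAct loop is shown below. It uses the pre-1.0 openai Python SDK call style; the model name, the execute_python helper, and the stopping convention are assumptions for illustration rather than the repository’s exact code:

```python
import re
import openai  # openai.api_key (and Azure OpenAI settings, if used) are assumed to be configured elsewhere

def execute_python(code: str) -> str:
    """Illustrative stand-in for the tool layer that runs generated code and captures its output."""
    return "execution output would appear here"

def run(system_prompt: str, question: str, max_steps: int = 10) -> str:
    """Simplified ReAct loop: ask ChatGPT for a step, execute the action, feed back the observation."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {question}"},
    ]
    for _ in range(max_steps):
        response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
        reply = response["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": reply})

        # Stop when the agent declares a final answer.
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()

        # Otherwise, extract the proposed action (a Python code block) and execute it.
        code_blocks = re.findall(r"```python\n(.*?)```", reply, re.DOTALL)
        if code_blocks:
            observation = execute_python(code_blocks[0])
        else:
            observation = "No action found; please follow the template."
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped after reaching the step limit without a final answer."
```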

Tools

To prevent agents from having to deal with the unnecessary complexity of interacting with external systems and applications, some utility functions and tools are provided such as display, persist, load, and so on. In this way, the agent needs to know only how to use the tools in the prompt.

### Tools for data scientists
# (st is the streamlit module; PlotlyFigure and MatplotFigure refer to the plotly
# and matplotlib figure classes imported elsewhere in the module.)
def display(data):
    if type(data) is PlotlyFigure:
        st.plotly_chart(data)
    elif type(data) is MatplotFigure:
        st.pyplot(data)
    else:
        st.write(data)

def load(name):
    return st.session_state[name]

def persist(name, data):
    st.session_state[name] = data
###### Tools for the data engineer agent
def execute_sql_query(self, query, limit=10000):
    if self.sql_engine == 'sqlserver':
        connecting_string = f"Driver={{ODBC Driver 17 for SQL Server}};Server=tcp:{self.dbserver},1433;Database={self.database};Uid={self.db_user};Pwd={self.db_password}"
        params = parse.quote_plus(connecting_string)
        engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
    else:
        engine = create_engine(f'sqlite:///{self.db_path}')

    result = pd.read_sql_query(query, engine)
    result = result.infer_objects()
    # Convert columns whose names contain 'date' to datetime where possible
    for col in result.columns:
        if 'date' in col.lower():
            result[col] = pd.to_datetime(result[col], errors="ignore")

    if limit is not None:
        result = result.head(limit)  # limit to save memory

    return result
def get_table_schema(self, table_names: List[str]):

    # Create a comma-separated string of table names for the IN operator
    table_names_str = ','.join(f"'{name}'" for name in table_names)

    # Define the SQL query to retrieve table and column information
    if self.sql_engine == 'sqlserver':
        sql_query = f"""
        SELECT C.TABLE_NAME, C.COLUMN_NAME, C.DATA_TYPE, T.TABLE_TYPE, T.TABLE_SCHEMA
        FROM INFORMATION_SCHEMA.COLUMNS C
        JOIN INFORMATION_SCHEMA.TABLES T ON C.TABLE_NAME = T.TABLE_NAME AND C.TABLE_SCHEMA = T.TABLE_SCHEMA
        WHERE T.TABLE_TYPE = 'BASE TABLE' AND C.TABLE_NAME IN ({table_names_str})
        """
    elif self.sql_engine == 'sqlite':
        sql_query = f"""
        SELECT m.name AS TABLE_NAME, p.name AS COLUMN_NAME, p.type AS DATA_TYPE
        FROM sqlite_master AS m
        JOIN pragma_table_info(m.name) AS p
        WHERE m.type = 'table' AND m.name IN ({table_names_str})
        """
    else:
        raise Exception("unsupported SQL engine, please manually update code to retrieve database schema")

    # Execute the SQL query and store the results in a DataFrame
    df = self.execute_sql_query(sql_query, limit=None)
    output = []

    # Initialize variables to store table and column information
    current_table = ''
    columns = []

    # Loop through the query results and output the table and column information
    for index, row in df.iterrows():
        if self.sql_engine == 'sqlserver':
            table_name = f"{row['TABLE_SCHEMA']}.{row['TABLE_NAME']}"
        else:
            table_name = f"{row['TABLE_NAME']}"

        column_name = row['COLUMN_NAME']
        data_type = row['DATA_TYPE']
        # Bracket names that contain spaces so they remain valid SQL identifiers
        if " " in table_name:
            table_name = f"[{table_name}]"
        if " " in column_name:
            column_name = f"[{column_name}]"

        # If the table name has changed, output the previous table's information
        if current_table != table_name and current_table != '':
            output.append(f"table: {current_table}, columns: {', '.join(columns)}")
            columns = []

        # Add the current column information to the list of columns for the current table
        columns.append(f"{column_name} {data_type}")

        # Update the current table name
        current_table = table_name

    # Output the last table's information
    output.append(f"table: {current_table}, columns: {', '.join(columns)}")
    output = "\n ".join(output)
    return output

Application platform

An application platform is necessary to host the agents and tools and to enable interaction with end users. Because this is a data analytics application, I use Streamlit, as described above, as the underlying execution platform: it provides data visualization, user interaction, and a stateful datastore for exchanging data between agents and processes within a session.
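For orientation, a minimal Streamlit entry point of the kind involved here might look like the sketch below; the widget choices and the agent stub are illustrative rather than the repository’s exact UI:

```python
import streamlit as st

st.title("Business analytics assistant")

question = st.text_input("Ask a question about your business data")

if st.button("Analyze") and question:
    # The data scientist agent would be invoked here; a stub keeps the sketch self-contained.
    answer = f"(agent answer for: {question})"
    st.write(answer)
    # Figures produced by the agent are rendered through display(), which calls st.plotly_chart.
```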

Best practices

Automating an analytics application is a complex undertaking. The following challenges require careful consideration:

  • Complexity of data source: A data source for an analytics application may have numerous data objects and tables with complex schema, which you must pass on to ChatGPT as context. If you use the entire schema, your message size may well exceed ChatGPT’s token limit. In that case, a dynamic context building technique should be used so that only the necessary context for a question is loaded as part of the message. In this implementation, this is done in three steps: 1.) Identify the tables needed for the question, 2.) Retrieve the detail schema of the identified tables, and 3.) Build a data retrieval query based on the schema and the user’s request.
  • Custom definition and mappings: Each domain and business scenario may have proprietary names, rules, and concepts that are not part of the public knowledge that ChatGPT was trained on. To incorporate these objects, think of them as additional context that should be passed on to ChatGPT, probably in the same dynamic manner as the data schema, as mentioned above.
  • Complexity of the problem: If your scenario requires complex analytical logic — for example, revenue forecasting or causal analysis — consider building a specialized prompt template just for your scenario. Although limited to a narrow domain, a specialized prompt template can provide specialized instruction for a ChatGPT agent to use in following complex logic.
  • Output format consistency: As a generative LLM, ChatGPT has a certain level of randomness in its output format, even with clear instruction. This should be expected and handled using a validation and retry flow; a minimal sketch of such a flow appears after this list.
  • Complexity of tools and environments: Complex APIs and interaction flows may confuse ChatGPT and, as a result, require multiple few-shot examples to teach it. Try to reduce the complexity by wrapping multiple complex APIs into a single simple API before exposing it to ChatGPT.
  • Reliability: Randomness and hallucination may impact the ability of ChatGPT — or any LLM — to deliver accuracy and reliability. Users must be trained to ask clear and specific questions, while the application must be designed to have multiple validation mechanisms at intermediate processing steps. A good practice is to make intermediate output available for the user to validate, such as displaying the generated SQL query and code.
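As promised above, here is a minimal sketch of a validation-and-retry wrapper. The call_chatgpt stub and the format check are illustrative assumptions; the real implementation validates against its own templates:

```python
import re

def call_chatgpt(messages):
    """Illustrative stand-in for the actual chat completion call."""
    return "Thought: ...\nAction:\n```python\nprint('hello')\n```"

def has_valid_format(reply: str) -> bool:
    """Accept only replies that contain either a Python action block or a final answer."""
    return "Final Answer:" in reply or re.search(r"```python\n.*?```", reply, re.DOTALL) is not None

def ask_with_retry(messages, max_retries: int = 3) -> str:
    """Re-prompt ChatGPT when a reply does not follow the expected output format."""
    for _ in range(max_retries):
        reply = call_chatgpt(messages)
        if has_valid_format(reply):
            return reply
        # Append the bad reply and a corrective instruction, then try again.
        messages = messages + [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": "Your response did not follow the template. Please retry."},
        ]
    raise ValueError("ChatGPT did not produce a validly formatted response after retries.")
```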

In this article, I have presented a methodology and practical reference implementation to turn ChatGPT into a powerful business analytics assistant. I hope this is helpful to you in your work. The code repository for this article is located here.

James Nguyen is on LinkedIn.

References

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: https://arxiv.org/abs/2201.11903
  2. ReAct: Synergizing Reasoning and Acting in Language Models: https://arxiv.org/abs/2210.03629
