Automating Data Analytics with Large Language Models

Felixbastian Hans
Oct 11, 2023 · 8 min read


From Data Exploration to Documentation: ChatGPT as an End-to-End Solution in Data Analytics

Introduction

Large Language Models (LLMs) like ChatGPT have emerged as powerful tools capable of understanding human language and even generating code. As a data analyst, I interact with ChatGPT on a daily basis to structure ideas, write SQL queries, and better understand existing queries. The idea emerged to put all those capabilities together into one solution capable of automating certain aspects of my work. I identified three mission-critical features to implement:

  • Identify: Support the analyst in finding the right data to do the job
  • Implement: Turn natural language into a database query (SQL)
  • Document: Document the results and the approach

Let’s delve into this idea with a hands-on example.

If you want to try out the solution yourself, access the link below:

The Approach

The objective of this prototype is to create an end-to-end solution that allows the user to pose a task or objective and receive the answer back in the form of data.

To achieve this, the LLM, in our case GPT-4, will be given a data model. The data model consists of all table names including their columns, data types and relationships with other tables. Furthermore, three example rows will be provided for each table to give the LLM an understanding of the format.

Now, when a user provides a prompt such as “What are the sales for 2022?”, ChatGPT will receive the data model and the question as context, together with the instruction to formulate an SQL query as the answer. This can be done easily with LangChain, a popular Python library for designing advanced LLM solution architectures.
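
A minimal sketch of that setup could look like this. The prompt wording, the model settings, and the get_schema_description() helper are my own illustrative assumptions, not the prototype's exact code:

# Rough sketch of the prompt assembly with LangChain. Prompt wording, model
# settings and get_schema_description() are illustrative assumptions.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain


def get_schema_description() -> str:
    # Hypothetical helper: serializes table names, columns, data types,
    # relationships and three sample rows per table into plain text.
    return "Orders(OrderID int, CustomerID text, OrderDate date, ...)\n..."


prompt = ChatPromptTemplate.from_template(
    "You are a data analyst. Using only the data model below, answer the "
    "user's question with a single SQL query and nothing else.\n\n"
    "Data model:\n{data_model}\n\n"
    "Question: {question}\nSQL:"
)

llm = ChatOpenAI(model_name="gpt-4", temperature=0)
chain = LLMChain(llm=llm, prompt=prompt)

sql_query = chain.run(
    data_model=get_schema_description(),
    question="What are the sales for 2022?",
)
print(sql_query)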

SQL is a standard database language for merging, filtering and transforming tabular data. It is the most fundamental tool of every data analyst and allows them to turn code into tabular data. With this setup, we can now go one step further and turn a problem statement into data.

The Data

For our exploration, we’ll be utilizing the Northwind database, a freely available resource designed to demonstrate SQL queries and challenges. This database mirrors a generic company’s data landscape, encapsulating elements such as employees, suppliers, customers, and orders.

All tables are connected via keys. For example, each order has a unique OrderID that links the “Orders” table to the “Order Details” table, which in turn is connected to the “Products” table via the ProductID. Relationships can become quite complex: to retrieve the products ordered by a specific customer, all those tables (Products, Order Details, Orders) must be chained together on their respective IDs.
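
To make that chaining concrete, here is a small sketch of such a join run from Python against a local SQLite copy of the database (the northwind.db file name and the customer ID 'ALFKI' are just examples):

# List the products ordered by one customer by chaining
# Orders -> Order Details -> Products on their respective IDs.
# Assumes a local SQLite copy of Northwind in "northwind.db".
import sqlite3

query = """
SELECT DISTINCT p.ProductName
FROM Orders o
JOIN "Order Details" od ON od.OrderID = o.OrderID
JOIN Products p ON p.ProductID = od.ProductID
WHERE o.CustomerID = 'ALFKI';
"""

with sqlite3.connect("northwind.db") as conn:
    for (product_name,) in conn.execute(query):
        print(product_name)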

Northwind database schema

Testing the Basics

A quintessential part of any data analyst's job is the ability to transform questions about data into basic SQL queries. To assess ChatGPT’s capabilities, a repository of 40 questions was used, designed to test SQL fundamentals on the Northwind dataset. These questions span a variety of standard problem types, including table filtering, column concatenation, and table joining, with escalating levels of difficulty.

Source of questions: Northwind Database GitHub Repository

Example 1:

Create a report showing ProductName, UnitPrice, QuantityPerUnit of products that are out of stock.

SELECT ProductName, UnitPrice, QuantityPerUnit FROM northwind.products 
WHERE "UnitsInStock" = 0;

Example 2:

Create a report showing the CompanyName and the total number of orders by customer (renamed as NumberOfOrders) since December 31, 1994.

SELECT c.CompanyName, COUNT(c.CustomerID) AS NumberOfOrders
FROM northwind.customers c
JOIN northwind.orders o ON o.CustomerID = c.CustomerID
WHERE o.OrderDate >= '1994-12-31'
GROUP BY c.CustomerID
HAVING COUNT(c.CustomerID) > 10;
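
Checking 40 generated queries by hand quickly becomes tedious. One simple way to automate the comparison, sketched below under the assumption that every benchmark question comes with a reference query, is to execute both queries against the database and compare their result sets:

# Sketch of an automated correctness check: run the generated query and a
# reference query against the same database and compare the result sets
# irrespective of row order. The queries shown are placeholders.
import sqlite3


def results_match(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    generated = sorted(conn.execute(generated_sql).fetchall(), key=repr)
    reference = sorted(conn.execute(reference_sql).fetchall(), key=repr)
    return generated == reference


with sqlite3.connect("northwind.db") as conn:
    ok = results_match(
        conn,
        generated_sql="SELECT ProductName, UnitPrice, QuantityPerUnit "
                      "FROM Products WHERE UnitsInStock = 0;",   # from the LLM
        reference_sql="SELECT ProductName, UnitPrice, QuantityPerUnit "
                      "FROM Products WHERE UnitsInStock = 0;",   # from the repository
    )
    print("correct" if ok else "incorrect")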

Remarkably, ChatGPT managed to accurately solve 39 out of the 40 questions on the first attempt, showcasing its proficiency in handling basic data querying tasks. Now, let’s step up the game and transition to a real-world scenario.

Presenting the Solution

Technology is only as good as its user interface. To facilitate seamless data analysis and query generation, the following components are an essential part of all common data platforms:

  1. A data panel for viewing individual tables.
  2. A text field for entering SQL queries.
  3. An output section to view the result.

To maximize the LLM’s potential, three additional features were introduced:

  1. A text field to prompt the data in natural language.
  2. A display box with two features:
  • Showing the assumptions made by the LLM in generating the results (Send button at 4).
  • Returning the documentation for an SQL query and its generated table (Send button at 2).
  3. A text field for inputting table-specific details the LLM should consider when coming up with a solution.

For simplicity, the solution does not contain chat features.

Solution architecture
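
The prototype itself is shown only as a screenshot, but a comparable layout can be mocked up in a few lines of Streamlit. Streamlit is chosen here purely for illustration; the article does not state which UI framework the prototype uses:

# Illustrative mock-up of the described interface using Streamlit
# (an assumption; the prototype's actual UI framework is not stated).
# Run with: streamlit run app.py
import streamlit as st

# 1. Data panel for viewing individual tables
table = st.sidebar.selectbox("Data panel", ["Orders", "Order Details", "Products", "Shippers"])

# Natural-language prompt field and table-specific details field
nl_prompt = st.text_area("Prompt the data in natural language")
table_notes = st.text_area("Table-specific details the LLM should consider")

# 2. SQL text field
sql_query = st.text_area("SQL query")

col1, col2 = st.columns(2)
if col1.button("Send prompt"):
    st.info("Generated SQL and the LLM's assumptions would appear here.")
if col2.button("Document query"):
    st.info("Generated documentation for the SQL query would appear here.")

# 3. Output section to view the result table
st.subheader("Output")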

A Practical Example: Analyzing Shipment Companies

Let’s delve into a practical use case. In the dataset, each order is delivered by one of three shipment companies: Speedy Express, United Package, or Federal Shipping. Now imagine it is your first day on the job as a data analyst at the Northwind company. Your supervisor gives you the following task:

“I want to benchmark our shipment companies. Can you tell me if one company delivers faster than the other?”

Exploring the Data

Being new to the job, we must first identify the relevant tables and their interconnections. We begin by articulating the objective in the prompt box.

Data prompting — exploration

Through this process, the model:

  • Recognizes the relevance of the Orders table for analyzing delivery times.
  • Identifies the relationship between the “ShipVia” and “ShipperID” columns needed to link the Orders and Shippers tables.
  • Proposes a method to calculate the average delivery time by comparing “OrderDate” and “ShippedDate”, and documents its assumptions (a plausible form of this query is sketched below).
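
The query behind this exploration step appears only in the screenshot; a plausible reconstruction, using SQLite date functions, looks like this:

# Plausible reconstruction of the exploration query (not ChatGPT's verbatim
# output): average delivery time in days per shipper, using SQLite's
# julianday() to compute date differences.
import sqlite3

query = """
SELECT s.CompanyName,
       AVG(julianday(o.ShippedDate) - julianday(o.OrderDate)) AS AvgDeliveryDays
FROM Orders o
JOIN Shippers s ON s.ShipperID = o.ShipVia
WHERE o.ShippedDate IS NOT NULL
GROUP BY s.CompanyName;
"""

with sqlite3.connect("northwind.db") as conn:
    for company, avg_days in conn.execute(query):
        print(f"{company}: {avg_days:.1f} days")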

Before proceeding, let’s look at the Order table to validate this approach.

Northwind database — Order Table

While the logic of calculating the delivery time seems sound, we notice a potential bias: the average delivery time could be influenced by varying shipment destinations.

Refining the Approach

The first step helped us to explore the necessary tables and columns to answer the question. Now, to gain a comprehensive view of the shipment companies, we need to profile the average delivery time by region.

Since we now know our objective and the data, we can be a little more specific when writing our prompt with a clear picture of the output in mind.

Data prompting — refining

ChatGPT was able to come up with the correct approach. This can sometimes take a little prompt engineering but is often much quicker than coming up with the correct SQL query yourself.
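
The generated query is again shown only as a screenshot; it likely resembles the following reconstruction, where ShipCountry stands in for the region dimension (the actual output may group by a different region column):

# Reconstruction of the refined query (not the verbatim output): average
# delivery time per shipper, broken down by destination. ShipCountry is used
# here as a stand-in for the region dimension.
import sqlite3

query = """
SELECT o.ShipCountry,
       s.CompanyName,
       AVG(julianday(o.ShippedDate) - julianday(o.OrderDate)) AS AvgDeliveryDays,
       COUNT(*) AS NumberOfOrders
FROM Orders o
JOIN Shippers s ON s.ShipperID = o.ShipVia
WHERE o.ShippedDate IS NOT NULL
GROUP BY o.ShipCountry, s.CompanyName
ORDER BY o.ShipCountry, AvgDeliveryDays;
"""

with sqlite3.connect("northwind.db") as conn:
    for country, company, avg_days, n in conn.execute(query):
        print(f"{country:<15} {company:<18} {avg_days:5.1f} days ({n} orders)")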

The analysis reveals that Federal Shipping generally outperforms its competitors, especially in key markets like Western Europe and North America. However, other companies excel in certain regions.

Documenting the Result

In a data-driven company, documenting the approach is often as important as coming up with the correct result. We need to make sure that, in case the approach is picked up again by other people (or digital agents), the user can quickly understand the overall objective and the meaning of each column.

In the next step, ChatGPT will receive back the query it generated in the last section. Additionally, a description of the tables used will be added, including their columns and some sample rows; in our case, this is the Orders table. Then, ChatGPT will be instructed to define the objective of the generated table and to detail how each column is created.
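
Reusing the LangChain pieces from earlier, the documentation step could be sketched like this (the prompt wording is again an assumption, and the two inputs are taken from the previous steps):

# Sketch of the documentation step: feed the generated SQL and a description
# of the source table back to the model. Prompt wording is an assumption;
# orders_description and sql_query come from the previous steps.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain

doc_prompt = ChatPromptTemplate.from_template(
    "You are documenting a data analysis.\n\n"
    "Source table description (columns and sample rows):\n{table_description}\n\n"
    "SQL query that produced the result table:\n{sql_query}\n\n"
    "Describe the objective of the generated table and explain, column by "
    "column, how each column is derived."
)

doc_chain = LLMChain(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    prompt=doc_prompt,
)

documentation = doc_chain.run(
    table_description=orders_description,  # schema + sample rows of the Orders table
    sql_query=sql_query,                   # the query generated in the previous step
)
print(documentation)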

Documentation generation

The result speaks for itself. ChatGPT was able to correctly identify the objective of the analysis and to document each column and its sources. If another user picks up the table later, its scope can be understood quickly.

The Future of Data Analytics

The Pros and Cons

In the grand scheme of technological advancements, the integration of LLMs like ChatGPT into the data analysis domain marks a significant milestone and will shape the future application landscape. As we have seen in this exploration, these models have the potential to revolutionize the way data analysts approach their tasks, offering a way to communicate with technology like never before.

Databases in practice may encompass hundreds of unconnected tables. It is often difficult to have an overview of the entire data landscape and to identify the optimal tables for the task at hand. This is something LLMs have great potential to excel in. Feeding the model table information and available documentation enables users to quickly get an expert opinion on how to approach the problem and what data to use. Furthermore, having an SQL native tool to create and document database queries can vastly improve efficiency and data understanding throughout the organization.

However, there are also challenges, such as the restricted prompt length, which limits the inclusion of extensive data models. It is not possible to add an unlimited number of tables to the context of the LLM, so this approach runs into scalability issues. Additionally, the quality of results can vary, necessitating specific and well-thought-out prompts to avoid lazy or inaccurate responses. We saw that, when calculating the average shipment time, ChatGPT initially took a shortcut and did not consider regions or countries. This shows that you still need to understand the data and prompts yourself and provide clear instructions.

The Challenges of Integration

Overcoming the challenges of the technology requires a substantial engineering effort. However, there are methods to counteract them. For example, the maximum context length could be circumvented by creating individual agents per dataset, and the shortcuts and bias in answers could be reduced by chain-of-thought prompting or back-and-forth interaction with the user. All of this needs to be rigorously tested and tried at scale, and it requires tight integration with the data, interface, and documentation platform.
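
As one example of what such a workaround could look like (my own illustration, not part of the prototype), a first model call could select the relevant tables so that only their schemas enter the SQL-generation prompt:

# Illustration of one context-length workaround (not part of the prototype):
# a first LLM call picks the relevant tables, and only their schemas are
# passed on to the SQL-generation prompt shown earlier.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain

table_selector = LLMChain(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    prompt=ChatPromptTemplate.from_template(
        "Tables and one-line descriptions:\n{table_summaries}\n\n"
        "Question: {question}\n\n"
        "Answer with a comma-separated list of the tables needed."
    ),
)


def relevant_schemas(question: str, summaries: str, schemas_by_table: dict) -> str:
    # Return only the schemas of the tables the model deems relevant.
    answer = table_selector.run(table_summaries=summaries, question=question)
    picked = [name.strip() for name in answer.split(",")]
    return "\n\n".join(schemas_by_table[n] for n in picked if n in schemas_by_table)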

Such an effort goes far beyond the core competencies and focus of most companies. Therefore, these tasks lie in the hands of data analysis solution providers such as Google BigQuery or Snowflake, or entirely new market entrants. Although no major solution provider has yet made its move, I am certain that most of those companies are currently working on solutions. However, the responsibility does not lie solely on the shoulders of the solution providers.

Taking Action

The results of any agent can only be as good as the company’s understanding and documentation of its data. An agent can only come up with the correct answer if the data foundation is well documented and the connections between individual tables, dataset collections and domains are established. In practice, a lot of best practices for using company-internal data are still passed around informally between domain experts and other data analysts. To get a company ready to successfully integrate generative solutions into its application landscape, it must set up processes that result in thoroughly digitized documentation of datasets and domain concepts.

Generative AI has great potential for automating and accelerating many tedious processes throughout the organization. It might be the missing piece to fully bridge the gap between humans and technology. However, to fully tap this potential, companies need to take action and establish an internal understanding of the technology’s capabilities and limitations.
