Extracting insights from rich tabular datasets: quickly, easily, and frugally (Part 5/7)
This article is part of a series on extracting key insights from rich tabular datasets.
The series covers:
- The challenge (Part 1)
- Key insights for rich tabular datasets (Part 2)
- Dataset preparation (Part 3)
- Variable treatment: selection, enrichment, and feature engineering (Part 4)
- Leveraging LLMs at critical workflow points (Part 5)
- The structure of the flowchart, its rules, and decision conditions (Part 6)
- Findings from testing a broad spectrum of datasets (Part 7)
The ultimate goal of this series is to provide a comprehensive flowchart that encompasses every step, from dataset ingestion to extracting key insights.
This article focuses on how a virtual agent (based on Large Language Models) can assist the data specialist at critical stages of the workflow.
While working on an analysis, the data specialist needs to consult domain experts for knowledge on extreme values, which variables are meaningful, which feature crosses make sense, and so on. But what if a virtual agent could fulfill some of these roles?
While GPT agents specialized in data analysis may struggle to generate meaningful suggestions without precise queries, they can offer substantial value when prompted with well-defined tasks.
Let’s test this hypothesis with an example, using ChatGPT manually. I took a public dataset from a weather station in Toulouse, France, to avoid selecting one so popular that GPT would have enormous resources from which to nurture its answers.
Describing the dataset and its variables
We want a virtual agent to provide a description of the dataset and its variables. I copied and pasted the list of variable labels and submitted this prompt to GPT:
GPT returns the answer:
I cut GPT’s answer here to avoid an overly long image; the conclusion follows:
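The exact labels and prompt are shown in the images above, but the mechanics are simple to reproduce. Here is a minimal sketch of assembling such a prompt from a list of variable labels; the labels below are illustrative weather-station fields, not the exact columns of the Toulouse dataset.

```python
# Illustrative variable labels standing in for the real dataset's columns.
variable_labels = [
    "temperature_C", "humidity_pct", "pressure_hPa",
    "wind_speed_ms", "wind_direction", "rain_mm",
]

def build_description_prompt(labels):
    """Assemble a prompt asking the agent to describe each variable."""
    listing = "\n".join(f"- {label}" for label in labels)
    return (
        "Here is the list of variables of a tabular dataset from a "
        "weather station:\n"
        f"{listing}\n"
        "For each variable, describe what it likely measures, its unit, "
        "and its expected range."
    )

prompt = build_description_prompt(variable_labels)
print(prompt)
```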
Identification of target variables
We now want to know which of the variables could serve as potential target variables. I submitted the following prompt to GPT:
GPT then returned the table with an explanation:
Detection of anomalous values
Anomalous values in a variable typically deviate from a realistic range. GPT can suggest plausible limits and provide justifications. I submitted this prompt, including the min and max of each continuous variable in the dataset:
GPT Response:
I cut GPT’s answer here to avoid an overly long image. The summary follows:
GPT correctly spotted that the first wind-direction variable is not on a 0–360 scale but an unusual 1–15 scale, and suggested checking whether its assumption is correct (it is: the variable uses a 1–15 scale reduced from 0–360°).
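The min/max figures fed into that prompt are straightforward to compute. Here is a sketch of gathering them with pandas, on a small illustrative sample (the real dataset's columns and values differ); note how a deliberately implausible humidity value would show up in the range handed to the agent.

```python
import pandas as pd

# Small illustrative sample standing in for the weather dataset.
df = pd.DataFrame({
    "temperature_C": [-2.0, 12.5, 38.1],
    "humidity_pct": [20.0, 55.0, 310.0],   # 310 is an implausible value
    "wind_direction": [1, 8, 15],          # reduced 1-15 scale, not 0-360
})

# One line per continuous variable: name, observed min, observed max.
ranges = df.select_dtypes("number").agg(["min", "max"]).T
lines = [f"- {name}: min={r['min']}, max={r['max']}"
         for name, r in ranges.iterrows()]
prompt = (
    "For each variable below, give a plausible physical range and flag "
    "observed values outside it:\n" + "\n".join(lines)
)
print(prompt)
```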
Attributing names to clusters
In segmentation tasks, naming clusters can be challenging, especially with numerous variables and clusters.
I used the following prompt:
GPT Response:
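What the agent actually needs to name a cluster is a compact profile, not the raw rows. A common way to build one is the per-cluster mean of each variable; here is a sketch on illustrative data (the column names and cluster labels are assumptions).

```python
import pandas as pd

# Illustrative clustered data: two clusters, two variables.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "temperature_C": [2.0, 4.0, 25.0, 27.0],
    "humidity_pct": [90.0, 80.0, 30.0, 40.0],
})

# Per-cluster means form the profile handed to the agent for naming.
profiles = df.groupby("cluster").mean().round(1)
prompt = (
    "Propose a short descriptive name for each cluster, given these "
    "per-cluster variable means:\n" + profiles.to_string()
)
print(prompt)
```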
Interpretation of a PCA
To interpret the Principal Component Analysis (PCA) components, I asked GPT for explanations:
Here is how GPT interprets the first component:
At the end, GPT summarizes the four components:
For datasets with many variables, GPT can definitely perform a task that is difficult for human beings.
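What gets handed to the agent for this interpretation is the table of component weights per variable. Here is a sketch of extracting it with scikit-learn; the data is random and illustrative (one deliberately correlated pair), and the variable names are assumptions, not the real dataset's columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random illustrative data standing in for the weather variables,
# with variables 0 and 1 made strongly correlated on purpose.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)

names = ["temperature_C", "dewpoint_C", "wind_speed_ms", "pressure_hPa"]
pca = PCA(n_components=2).fit(X)

# Rank variables by absolute weight within each component; this compact
# summary is what the agent receives for interpretation.
for i, weights in enumerate(pca.components_):
    ranked = sorted(zip(names, weights), key=lambda t: abs(t[1]), reverse=True)
    line = ", ".join(f"{n}: {w:+.2f}" for n, w in ranked)
    print(f"PC{i + 1} ({pca.explained_variance_ratio_[i]:.0%} variance): {line}")
```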
Selecting variables
We want to know which predictors to select from the dataset to predict temperature.
Here is the prompt to GPT:
Here is GPT’s answer:
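The agent's suggested predictor list can be sanity-checked numerically. Here is a minimal correlation screen against the target, on illustrative data (column names and values are assumptions); more elaborate checks such as mutual information or model-based importances follow the same pattern.

```python
import numpy as np
import pandas as pd

# Illustrative data: dewpoint drives temperature, humidity is independent.
rng = np.random.default_rng(1)
n = 300
dewpoint = rng.normal(10, 5, n)
df = pd.DataFrame({
    "dewpoint_C": dewpoint,
    "humidity_pct": rng.uniform(20, 100, n),
    "temperature_C": dewpoint + rng.normal(0, 2, n),
})

# Absolute correlation of each candidate predictor with the target.
ranking = (
    df.corr()["temperature_C"]
    .drop("temperature_C")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```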
Identification of missing variables and sources
Identifying missing variables is critical in data analysis. I asked GPT whether any variables might be missing based on the context of the dataset:
GPT Response:
Feature crosses identification
Feature crosses can significantly enrich a dataset, particularly when involving operations like subtraction or division between variables. I used GPT to identify suitable variables and operations:
Here is GPT’s answer:
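Once the agent has proposed variable pairs and operations, creating the crosses is a one-liner each. Here is a sketch with two typical meteorological crosses; the column names are illustrative, and a small epsilon guards the division against zero wind speed.

```python
import pandas as pd

# Illustrative columns; the real dataset's names differ.
df = pd.DataFrame({
    "temperature_C": [10.0, 20.0, 30.0],
    "dewpoint_C": [5.0, 15.0, 20.0],
    "wind_speed_ms": [2.0, 0.0, 4.0],
    "wind_gust_ms": [4.0, 3.0, 10.0],
})

# Subtraction cross: how far the air is from saturation.
df["dewpoint_depression"] = df["temperature_C"] - df["dewpoint_C"]
# Division cross: gustiness relative to mean wind (epsilon avoids /0).
df["gust_ratio"] = df["wind_gust_ms"] / (df["wind_speed_ms"] + 1e-9)
print(df[["dewpoint_depression", "gust_ratio"]])
```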
Integration of virtual agent assistance in the workflow
We have demonstrated that virtual agents can assist at various workflow stages, offering valuable insights. These include:
- Integrable results: results that can be directly incorporated into the workflow (e.g., feature crosses, cluster naming, and anomaly thresholds).
- Informative results: results that provide context or additional information to the data specialist (e.g., variable descriptions, identification of missing variables).
Incorporating the first type into the workflow could yield significant value. For instance, an agent might suggest a list of potential feature crosses in JSON format. The user could then select the relevant ones, triggering an application to create the features, train the model, and assess the importance of both existing and proposed feature crosses.
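That integration step can be sketched concretely. Assuming the agent returns its suggestions in a simple JSON format like the hand-written example below (the format and column names are assumptions, not a fixed contract), the application only has to parse the list and build the columns:

```python
import json
import pandas as pd

# Hand-written example of the JSON an agent could be asked to return.
agent_reply = """
[
  {"name": "dewpoint_depression", "op": "sub",
   "a": "temperature_C", "b": "dewpoint_C"},
  {"name": "gust_ratio", "op": "div",
   "a": "wind_gust_ms", "b": "wind_speed_ms"}
]
"""

# Supported operations; epsilon guards division by zero.
OPS = {"sub": lambda a, b: a - b, "div": lambda a, b: a / (b + 1e-9)}

df = pd.DataFrame({
    "temperature_C": [10.0, 30.0],
    "dewpoint_C": [5.0, 20.0],
    "wind_speed_ms": [2.0, 4.0],
    "wind_gust_ms": [4.0, 10.0],
})

# Build each suggested cross as a new column.
for cross in json.loads(agent_reply):
    df[cross["name"]] = OPS[cross["op"]](df[cross["a"]], df[cross["b"]])
print(df.columns.tolist())
```

From here, the new columns can be fed to model training and feature-importance assessment alongside the original variables.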
Libraries for automating LLM calls
In this experiment, I manually interacted with GPT. However, there are several approaches to automate these interactions, including using Python scripts directly or through various frameworks.
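As a minimal sketch of the direct-script route, here is what a call looks like with the OpenAI Python client; the model name and prompt are illustrative, and frameworks such as LangChain wrap the same kind of call. The request itself requires an API key, so it is isolated in a helper that is defined but not executed here.

```python
def build_messages(prompt: str):
    """Wrap a task prompt in the chat-message format the API expects."""
    return [
        {"role": "system", "content": "You are a data-analysis assistant."},
        {"role": "user", "content": prompt},
    ]

def ask_agent(prompt: str) -> str:
    """Send one prompt to the model and return the text reply.

    Requires the openai package and OPENAI_API_KEY in the environment.
    """
    from openai import OpenAI
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=build_messages(prompt),
    )
    return reply.choices[0].message.content

messages = build_messages("Which of these variables could be target variables? ...")
print(messages)
```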
Conclusion
Virtual agents can contribute domain-specific expertise at critical points in the data analysis workflow, such as anomaly detection, variable selection, feature-cross suggestion, and cluster naming. While the prompts used in this experiment were relatively basic, they can be refined by more precisely defining the agent’s role, providing detailed dataset characteristics, and structuring the prompts into sequential steps.
My goal is to share the flowchart and Python scripts developed during this project, ultimately through an application that automates these tasks and provides downloadable Jupyter notebooks at the end of the process. The initial version of the flowchart will be simple but will evolve in sophistication over time.