Extracting insights from rich tabular datasets : fast, easily and frugally (Part 5/7)

Vincent Castaignet
6 min readAug 18, 2024

--

This article is part of a series on extracting key insights from rich tabular datasets.

The series covers :

The ultimate goal of this series is to provide a comprehensive flowchart that encompasses every step, from dataset ingestion to extracting key insights.

This article focuses on how a virtual agent (based on Large Language Models) can assist the data specialist at critical stages of the workflow.

The data specialist, while working on an analysis, needs to consult domain experts to provide knowledge on extreme values, the variables that meaningful, potential feature crosses that make sense,…But what if a virtual agent could fulfill some of these roles ?

While GPT agent specialized on data analysis may struggle with generating meaningful suggestions without precise queries, they can offer substantial value when prompted with well-defined tasks.

Let’s test this hypothesis with an example using manually ChatGPT. I took a public dataset from a meteo station in Toulouse France, to avoid selecting a too popular one for which GPT would have enormous ressources to nuture its answers from.

Open dataset from a meteo station from Toulouse France— image by the author

Describing dataset and its variables

We want to get an virtual agent to provide a description of the dataset and its variable. I copy paste the list of the variable labels and submitted this prompt to GPT:

Prompt submitted to GPT — image by the author

GPT returns the answer :

GPT answer — image by the author

I cut here GPT answer to avoid a too long image, and then the conclusion goes :

GPT answer — image by the author

Identification of target variables

We now want to know amongst the variables which can be a potential target variable. I submitted the following prompt to GPT:

Prompt submitted to GPT — image by the author

Then GPT returned the table with an explanation:

GPT answer — image by the author

Detection of anomalous values

Anomalous values in a variable typically deviate from a realistic range. GPT can suggest plausible limits and provide justifications. I submitted this prompt, including min and max for each continuous variable of the dataset:

Prompt submitted to GPT — image by the author

GPT Response:

GPT answer — image by the author

I cut here GPT answer to avoid a too long image.

And then the summary goes:

GPT correctly spoted that the first wind direction variable is not on a scale 0–360, but an unusual 1–15, and suggests to check whether his assumption is correct (it is correct but corresponds to a reduced scale 1–15 from 0–360°).

Attributing names to clusters

In segmentation tasks, naming clusters can be challenging, especially with numerous variables and clusters.

I used the following prompt:

Prompt submitted to GPT — image by the author

GPT Response.

GPT answer — image by the author

Interpretation of a PCA

To interpret the Principal Component Analysis (PCA) components, I asked GPT for explanations:

Prompt submitted to GPT — image by the author

Here is how GPT interprets the first component :

GPT answer — image by the author

At the end, GPT summarizes the 4 components :

GPT answer — image by the author

For datasets with multiple variables, GPT can definetely perform a task that is difficult for human beings.

Selecting variables

We want to know which predictors to select from the dataset to predict temperature.

Here is the prompt to GPT:

Prompt submitted to GPT — image by the author

Here is GPT answer:

GPT answer — image by the author

Identification of missing variables and sources

Identifying missing variables is critical in data analysis. I asked GPT whether any variables might be missing based on the context of the dataset:

Prompt submitted to GPT — image by the author

GPT Response:

GPT answer — image by the author

Feature crosses identification

Feature crosses can significantly enrich a dataset, particularly when involving operations like subtraction or division between variables. I used GPT to identify suitable variables and operations:

Prompt submitted to GPT — image by the author

Here is GPT answer :

GPT answer — image by the author

Integration of virtual agent assistance in the workflow

We have demonstrated that virtual agents can assist at various workflow stages, offering valuable insights. These include:

- Integratable results: results that can be directly incorporated into the workflow (e.g., feature crosses, cluster naming, and anomaly thresholds).

- Informative results: results that provide context or additional information to the data specialist (e.g., variable descriptions, identification of missing variables).

Incorporating the first type into the workflow could yield significant value. For instance, an agent might suggest a list of potential feature crosses in JSON format. The user could then select the relevant ones, triggering an application to create the features, train the model, and assess the importance of both existing and proposed feature crosses.

Libraries for Automating LLM Calls

In this experiment, I manually interacted with GPT. However, there are several approaches to automate these interactions, including using Python scripts directly or through various frameworks.

Conclusion

Virtual agents can contribute domain-specific expertise at critical points in the data analysis workflow, such as anomaly detection, variable selection, feature cross-suggestion, and cluster naming. While the prompts used in this experiment were relatively basic, they can be refined by more precisely defining the agent’s role, providing detailed dataset characteristics, and structuring the prompts into sequential steps.

My goal is to share the flowchart and Python scripts developed during this project, ultimately through an application that automates these tasks and provides downloadable Jupyter notebooks at the end of the process. The initial version of the flowchart will be simple but will evolve in sophistication over time.

--

--

Vincent Castaignet

I’m Vincent, a data analyst/scientist who wants to share how Python libraries help extract insights from tabular datasets easly, fast, and frugally.