Extracting insights from rich tabular datasets: fast, easily and frugally (Part 1/7)

Vincent Castaignet
6 min read · Jun 11, 2024


The challenge

Rich tabular datasets encompass a variety of variables (quantitative, categorical, date/time, geographical, text, identifiers…). These datasets typically originate from sources such as surveys with numerous questions, exports from web analytics tools, IoT applications, open data sources, or process optimization analyses.

These datasets emerge from various contexts: an analyst studying a specific issue (HR, process optimization), an extraction from an analytics tool (web analytics), a preliminary dataset in the EDA (exploratory data analysis) phase of a long-term modelling project, or an open dataset explored by enthusiasts. The analyst might be well acquainted with the dataset or discovering it for the first time, regardless of their expertise level. However, the tasks to perform remain by and large the same.

My working hypothesis is a strong one: a flowchart driven by specific characteristics of the dataset and its variables (number of observations, dimensionality, skewness, …), with the proper sequence of steps, can generate the key insights in the most effective manner in most cases, and a pragmatic approach can produce that flowchart, including the key metrics behind its rules and decision conditions.

To test this hypothesis, I am evaluating 50 datasets representative of rich tabular data: varying in size (30 to 1M observations) and number of variables (4 to 150), with different types of predictor and target variables, various distribution shapes, and imbalanced classes.

Over the next four months, I will explore the following topics in a series of articles:

In the final article, I will share findings from the dataset corpus on each topic presented in the series, and assess the quality, statistical reliability and stability of the resulting insights.

Because the whole flowchart holds too many nodes to display in a single figure, I will share its explicit decision conditions with data specialists in two forms:

  • as a graph description list (most graph libraries, such as networkx, build a graph from this edge-list format; a minimal example follows this list), and
  • as Jupyter notebooks covering sections of the workflow, shared via GitHub.
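
As a minimal illustration of that format (the node names and conditions below are hypothetical placeholders, not the actual flowchart), an edge list with condition attributes can be loaded directly into networkx:

```python
import networkx as nx

# Hypothetical edge list: (source task, next task, decision condition).
# The real flowchart's nodes and conditions will be shared in later articles.
edges = [
    ("describe_dataset", "characterize_variables", {"condition": "always"}),
    ("characterize_variables", "sample_dataset", {"condition": "n_obs > 10_000"}),
    ("characterize_variables", "fit_baseline_model", {"condition": "n_obs <= 10_000"}),
    ("sample_dataset", "fit_baseline_model", {"condition": "always"}),
]

G = nx.DiGraph()
G.add_edges_from(edges)

# Print each transition with its decision condition.
for source, target, attrs in G.edges(data=True):
    print(f"{source} -> {target}: {attrs['condition']}")
```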
Simplified flowchart — Image by author

Interpreting the Flowchart

  • The flowchart involves a dataset, scripts, rules (part of the scripts), a data specialist, and the assistance of a virtual agent.
  • The flowchart breaks the workflow down into a series of tasks executed sequentially from top to bottom.
  • The scripts handle the tasks they can perform autonomously, according to rules.
  • The data specialist interacts at crucial stages of the workflow: to select between propositions (which is the target variable? what is the objective of the analysis? are the suspicious values genuine outliers?) or to be informed (of the insights).
  • For certain tasks, the scripts rely on rules based on the characteristics of the dataset and its variables to make decisions (see the sketch after this list).
  • The data specialist can get expert knowledge from a virtual agent on crucial issues such as the description of the dataset variables, anomalous values, the selection of variables, and much more.
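
As a hedged sketch of what such rules could look like (the thresholds and function names below are illustrative assumptions, not the flowchart's actual decision conditions):

```python
import pandas as pd

def is_effectively_discrete(series: pd.Series, max_unique: int = 15) -> bool:
    """Illustrative rule: treat a numeric variable as discrete if it takes few distinct values."""
    return pd.api.types.is_numeric_dtype(series) and series.nunique() <= max_unique

def has_suspect_values(series: pd.Series, z: float = 4.0) -> bool:
    """Illustrative rule: flag a variable for review by the data specialist
    if any value lies more than z standard deviations from the mean."""
    s = pd.to_numeric(series, errors="coerce").dropna()
    if len(s) < 2 or s.std() == 0:
        return False
    return bool(((s - s.mean()).abs() > z * s.std()).any())
```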

This exercise serves two purposes: to feed an application I am developing, and to share widely both the results of the tests and the future application.

Who is this article for?

Data specialists (consultants, analysts, scientists, engineers) who are interested in streamlining the EDA phase of their projects.

A complex process

Extracting key insights requires covering the entire data process: data preparation, variable selection, feature engineering, modelling, and explainability. This extends well beyond EDA alone.

Open-source libraries such as Python’s Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn, ALE and SHAP are invaluable resources. However, integrating them takes time: understanding their use cases, selecting the best library for a specific task, choosing appropriate parameters, and keeping them up to date is not straightforward.

To reduce environmental impact, processing time, and cost, it is becoming increasingly important to run a frugal process: for instance, sampling the dataset down to a level that is still sufficient for extracting insights, say from 200K observations to 10K. Are there thresholds of r² or AUC ROC below which the hierarchy of feature importance becomes skewed?
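
A minimal sketch of such a frugal sampling step, assuming a pandas DataFrame and a categorical target column named "target" (both the column name and the 10K threshold are placeholders):

```python
import pandas as pd

def frugal_sample(df: pd.DataFrame, target: str = "target",
                  n_rows: int = 10_000, random_state: int = 0) -> pd.DataFrame:
    """Downsample a large dataset while roughly preserving the target distribution."""
    if len(df) <= n_rows:
        return df
    frac = n_rows / len(df)
    return (df.groupby(target, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=random_state)))
```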

Running a random forest with scikit-learn and default parameters makes it possible, in many contexts, to quickly ascertain feature importance and the marginal impact of predictors on the target variable; likewise, the SHAP library can explain the influence of predictors on the target for specific observations. But what if the performance of the random forest is subpar and the initial sample size has been significantly reduced?
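
A rough sketch of that quick baseline, assuming a binary classification target (the file name and column names are placeholders, and the default settings below are not the flowchart's tuned choices):

```python
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder loading step: replace with your own dataset and target column.
df = pd.read_csv("dataset.csv")
X = pd.get_dummies(df.drop(columns=["target"]))
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Quick performance check: AUC ROC for a binary target (use r² for a regression).
print("AUC ROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Rough hierarchy of feature importance.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# SHAP values explain the influence of each predictor on individual predictions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
```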

Optimizing the workflow necessitates answering several questions:

  • Which insights are needed, given the context of the dataset?
  • Which tasks should be carried out?
  • In what sequence?
  • How to resolve chicken-and-egg dependencies between tasks?
  • Which performance metric to use for the model generating the insights?
  • What levels of sampling and model performance ensure reliable insights?
  • Should a standard process be applied for quick answers or a sophisticated one for optimized results?

Data specialists typically iterate through various scenarios several times, with little visibility into the potential improvements, before getting satisfactory results. The whole process can become chaotic and lengthy.

But by pooling resources (keeping watch on the best libraries, applying them to datasets, and sharing the optimal flowchart and its decision conditions), the process can be significantly streamlined.

A clear potential for organizing the workflow

The number of potential insights sought is fairly small:

  • The features that differentiate groups of observations,
  • The most influential variables and features,
  • The marginal impact of these variables on a target variable (sensitivity, and thresholds or inflection points; a sketch follows this list),
  • The characteristics associated with specific values of the target variable (top 20%, bottom 20%, “positive”…).
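
For the marginal-impact insight, scikit-learn’s partial dependence display gives a quick visual read on whether a predictor's effect is roughly linear or shows thresholds. A hedged sketch, reusing the fitted random forest from the earlier snippet and a hypothetical predictor name "age":

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# "model" and "X_test" come from the random forest sketch above;
# "age" is a hypothetical predictor, replace it with one of your own columns.
PartialDependenceDisplay.from_estimator(model, X_test, features=["age"])
plt.show()  # a constant-slope curve suggests a linear effect;
            # sharp bends point to thresholds or inflection points
```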

Most key insights do not require the highest precision, unlike in a modelling project. For identifying important variables/features, the primary need is to establish a hierarchy among them, even if it is just isolating the group of the most influential ones. When evaluating the marginal impact of a predictor on a target, you may only need to determine whether the relationship is linear and, if not, identify any thresholds or inflection points. When examining the combination of features associated with a specific target value, you may just need to know whether the most correlated quantiles of two variables maintain a monotonic order (the 5th with the 5th) without needing the precise interval bounds.
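
As a small sketch of that "good enough" level of precision, comparing Pearson and Spearman correlations (here on synthetic data) is often sufficient to tell a linear relationship from a merely monotonic one:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic example: a monotonic but strongly non-linear relationship.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1_000)
y = np.exp(x) + rng.normal(scale=1.0, size=x.size)

# A Spearman coefficient close to 1 with a noticeably lower Pearson coefficient
# indicates a monotonic, non-linear effect: worth probing for thresholds.
print(f"Pearson:  {pearsonr(x, y)[0]:.2f}")
print(f"Spearman: {spearmanr(x, y)[0]:.2f}")
```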

To optimize the workflow, there are key tasks to perform as early as possible:

  • describing the dataset precisely by extracting key metrics: number of observations and variables, dimensionality, % of missing values, but also metrics such as the number of effective combinations provided by the dataset (see the sketch after this list).
  • characterizing each variable precisely: distribution metrics, % of extreme values and, for ordinal variables, which ones should be treated as discrete (which benefits modelling and the appropriate treatment of missing and extreme values).
  • defining the objective (segmentation, explanation of a target variable, cleaning/feature engineering, …), and the target variable if the objective requires one.
  • getting the performance of a versatile model (r² or AUC ROC, depending on the type of target variable, from a random forest run with default parameters on a single train/test split) to estimate the level of work required on modelling.
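
A minimal sketch of the first two description steps, assuming a pandas DataFrame (the metric list is illustrative, not the final set used by the flowchart):

```python
import pandas as pd

def describe_dataset(df: pd.DataFrame) -> dict:
    """Key dataset-level metrics."""
    return {
        "n_observations": len(df),
        "n_variables": df.shape[1],
        "pct_missing": round(100 * df.isna().mean().mean(), 2),
        "n_numeric": df.select_dtypes("number").shape[1],
        "n_categorical": df.select_dtypes(["object", "category"]).shape[1],
        "n_duplicated_rows": int(df.duplicated().sum()),
    }

def describe_variable(series: pd.Series) -> dict:
    """Key variable-level metrics: distribution, missing and extreme values."""
    out = {
        "dtype": str(series.dtype),
        "pct_missing": round(100 * series.isna().mean(), 2),
        "n_unique": int(series.nunique()),
    }
    if pd.api.types.is_numeric_dtype(series):
        s = series.dropna()
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        out["skewness"] = round(float(s.skew()), 2)
        out["pct_extreme"] = round(
            100 * float(((s < q1 - 3 * iqr) | (s > q3 + 3 * iqr)).mean()), 2)
    return out
```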

Virtual agents specialized in data analysis, like GPT's data analyst, often keep iterating through multiple options without delivering a final result, because they lack a backbone to drive their workflow. But a virtual agent, even a generalist one, can provide very valuable expert knowledge and suggest options when asked precise questions. The data specialist would very likely benefit from this input, while keeping the final say and making the decisions.

Most data specialists maintain a wide spectrum of notebooks and boilerplate scripts, more or less well organized, so that they can rapidly apply standardized code to specific problems.

The goal here is to provide a standard template dedicated to key-insight extraction, relying on the most relevant models and libraries. Dataset and variable characteristics will help predict whether the insights are statistically reliable.

In its first iteration, the flowchart will be simple but will grow more sophisticated over time.

Who am I, and why this challenge?

I’ve been working as an entrepreneur on data applications and data analysis for three decades, and I believe that making a flowchart's relevant decision conditions explicit and shareable would bring substantial benefits to all.

I invite you to subscribe to my page if you are interested in these topics.



Vincent Castaignet

I’m Vincent, a data analyst/scientist who wants to share how Python libraries help extract insights from tabular datasets easily, fast, and frugally.