Extracting insights from rich tabular datasets: fast, reliably and frugally (Part 2/7)

Vincent Castaignet
Aug 14, 2024


Key insights for rich tabular datasets

This article is part of a series on extracting key insights from rich tabular datasets.

Our core premise is that precisely describing the dataset and its variables (number of observations, dimensionality, variable types, distinct values and distributions,…) and sequencing tasks in the right order significantly streamlines the insight generation process and provides a flowchart for data practitioners.

The series covers:

The ultimate goal of this series is to provide a comprehensive flowchart that encompasses every step, from dataset ingestion to extracting key insights. This article specifically focuses on identifying the key insights to provide.

Types of Datasets and Insights

In datasets brimming with a variety of variables (quantitative, categorical, date/time, geo, text), you typically encounter:

  • Surveys with numerous questions/answers and variables that describe respondents.
  • Exports from web, media, ad campaigns, or IoT analytics, with multiple event records, often timestamped.
  • Datasets from HR surveys with satisfaction, churn, absenteeism/sickness metrics.
  • Datasets from process optimization studies for plants, infrastructure, equipment,...

Let’s set aside the usual BI insights like trends (KPI changes between periods, versus budget, contrasts between dimensions like regions, products, departments,…), which involve mainly continuous variables taken one by one versus time or dimensions. We are interested here in getting a comprehensive picture of rich tabular datasets, especially their multivariate dimensions, and further granular EDA (Exploratory Data Analysis) insights.

The set of potential insights extracted from a tabular dataset is fairly small:

  • Characteristics that differentiate groups of observations,
  • If a target variable exists, the variables that wield significance, the marginal impact of these variables on the target (sensitivity and thresholds or inflection points), and the characteristics associated with specific values of the target variable (e.g. top 20% or bottom 20% if the target is continuous, “positive” if the target is discrete or binary…).

To illustrate those key insights, I have drawn upon a famous dataset from Kaggle, Adult income.

The following figures provide the key insights that can be proposed for a fast analysis:

A global view of the main feature correlations

Dendrogram of main correlations on dataset “Adult income”- Image by author

The dendrogram provides a comprehensive picture of the main correlations between all variables, mixing categorical modalities and continuous variables.

How to interpret the graph (a dendrogram)?

  • Colours signify features that go together.
  • The further left the linkage between features or the closer they are vertically, the stronger the correlation.
  • All correlations are positive.
  • If a categorical variable feature (e.g., “occupation = Exec-managerial”) correlates with a continuous variable (e.g., “hours-per-week”), it indicates that “Exec-managerial” is associated with high “hours-per-week” values.

How it’s constructed:

  • Categorical variables are converted to dummy variables.
  • Continuous variables are standardized.
  • The loadings on the first four components of a PCA are summed for each feature. If there are more than 35 features (the maximum that can be displayed readably), only the top features are kept.
  • Correlations between the selected features are computed.
  • A dendrogram is built from the correlation matrix (see the sketch below).
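
Below is a minimal sketch of this construction in Python with pandas, scikit-learn and SciPy. The file name, the use of absolute loadings, the 1 - correlation distance and the “average” linkage are my assumptions for illustration, not necessarily the exact pipeline behind the figure.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

df = pd.read_csv("adult.csv")  # hypothetical file name for the Adult income dataset

# Dummify categorical variables and standardize all columns
X = pd.get_dummies(df)
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Sum the loadings (here, their absolute values) on the first four PCA components,
# then keep the 35 features with the highest total loading
pca = PCA(n_components=4).fit(X)
scores = np.abs(pca.components_).sum(axis=0)
top = X.columns[np.argsort(scores)[::-1][:35]]

# Correlation matrix of the selected features, turned into distances (1 - correlation)
corr = X[top].corr()
Z = linkage(squareform((1 - corr).to_numpy(), checks=False), method="average")

plt.figure(figsize=(8, 10))
dendrogram(Z, labels=top.tolist(), orientation="left")
plt.tight_layout()
plt.show()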

An Association Matrix

Association matrix on “Adult income” - Image by author

The association matrix provides a decisive advantage: the correlations are computed between all types of variables, continuous or discrete, in a condensed format. If categorical variables were dummified, the matrix would have to cross 104 dummies and 6 continuous variables, thus a 110 x 110 matrix.

How to interpret the matrix?

  • The target variable, income, is crossed with the various types of variables.
  • The correlations are computed between all types of variables, continuous or discrete. For instance, the correlation between education, a categorical variable with 15 distinct values (“Master”, “Doctorate”,…), and education_num (from 1 to 16 years), a quantitative variable, stands out with almost 100%.

How it’s constructed:

An association matrix handles the three variable-type pairs (a sketch follows this list):

  • continuous/continuous: measured with the Pearson correlation (ranging from -1 to +1)
  • discrete/discrete: measured with Cramér’s V (ranging from 0 to 1)
  • continuous/discrete: measured with the eta correlation ratio (ranging from 0 to 1)
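
Here is a minimal sketch of how such a matrix can be computed. The helper functions are illustrative implementations of the three measures, not the code behind the figure:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Cramér's V between two categorical series (0 to 1)
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

def eta_ratio(categories, values):
    # Correlation ratio between a categorical and a continuous variable (0 to 1)
    values = pd.Series(np.asarray(values, dtype=float))
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                     for _, g in values.groupby(np.asarray(categories)))
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0

def association_matrix(df):
    # Pick the right measure for each pair of columns
    cols = df.columns
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            a_num = pd.api.types.is_numeric_dtype(df[a])
            b_num = pd.api.types.is_numeric_dtype(df[b])
            if a_num and b_num:                      # continuous/continuous
                v = df[a].corr(df[b])                # Pearson, -1 to +1
            elif not a_num and not b_num:            # discrete/discrete
                v = cramers_v(df[a], df[b])          # Cramér's V, 0 to 1
            else:                                    # continuous/discrete
                cat, num = (b, a) if a_num else (a, b)
                v = eta_ratio(df[cat], df[num])      # eta correlation ratio, 0 to 1
            out.loc[a, b] = out.loc[b, a] = v
    return out

The resulting square DataFrame can then be displayed as a heatmap (for instance with Seaborn).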

A Condensed Segmentation

Segmenting by Kmodes, 5 clusters, on dataset “Adult income” — Image by author

This segmentation, using the K-modes model, provides the major category of each variable for each cluster. Unlike K-means, the table is condensed (dummy-coded categorical variables would add multiple rows) and continuous variables are more interpretable (intervals rather than means, which are more relevant, for instance, for persona identification in marketing).

How to interpret the table?

  • Clusters/groups are presented in columns (5 by default), the variables in rows.
  • Quantile labels: For variable “age” in C1, “Q1/5Q” means the first quantile of the discretization into 5 quantiles, with (17.0, 26.0] representing the interval limits of the quantile.
  • Features are sorted by decreasing order of variable discrimination power between clusters: the most discriminating variables appear at the top.

How it’s constructed:

  • Continuous variables are discretized (into 5 bins by default, to allow for a scale like: very low, low, medium, high, very high).
  • The observations are clustered using K-modes. A major advantage is its ability to present a condensed overview (one row per variable for both continuous and discrete variables).
  • The column on the right, “distinct count,” shows how many distinct categories each variable takes across the clusters. Features are sorted by decreasing distinct count to display the variables with the highest discrimination power first (see the sketch below).
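
A minimal sketch of this step with the kmodes package is shown below; the equal-frequency discretization with pd.qcut and the exact K-modes settings (init, n_init) are assumptions on my part:

import pandas as pd
from kmodes.kmodes import KModes

df = pd.read_csv("adult.csv")  # hypothetical file name

# Discretize continuous variables into 5 equal-frequency bins
data = df.copy()
for col in data.select_dtypes("number").columns:
    data[col] = pd.qcut(data[col], q=5, duplicates="drop").astype(str)

# Cluster the observations with K-modes (5 clusters by default)
km = KModes(n_clusters=5, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(data)

# Condensed profile: one row per variable, one column per cluster (the mode),
# plus a "distinct_count" column used to sort the most discriminating variables first
profiles = data.assign(cluster=labels).groupby("cluster").agg(lambda s: s.mode().iloc[0]).T
profiles["distinct_count"] = profiles.nunique(axis=1)
print(profiles.sort_values("distinct_count", ascending=False))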

The granularity of clusters can be adjusted by varying the number of clusters and the level of discretization of continuous variables. Increasing the number of clusters from 5 to 10, for instance, yields a more granular picture.

Segmenting by Kmodes, 10 clusters, on dataset “Adult income” — Image by author

Exhaustive Bivariate Correlations between Features

Bivariate correlations between features on dataset “Adult income” — Image by author

How to interpret the table?

  • The strength of the correlation between modality A of variable 1 (e.g., “education = 5th-6th”) and modality B of variable 2 (e.g., “native-country = Mexico”) is measured in lift (how far the correlation is from independence). A lift of 22.51 means that the number of observations is 22.51 times higher than if the two variables were independent (lift = 1).
  • n_ij represents the number of observations at the intersection of the two features.

How it’s constructed:

  • Continuous variables are discretized using an equal-frequency method.
  • All possible feature pairs are computed.
  • The lift is calculated from the contingency table crossing each pair.
  • Irrelevant correlations are filtered out (p-value of the Chi² test > 0.05, lift < 1.2, classes with fewer than 10 counts).
  • Lift values are ranked in descending order (see the sketch below).
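
The core of this computation for a single pair of (discretized) features can be sketched as follows; the function and argument names are mine, and the thresholds come from the list above:

import pandas as pd
from scipy.stats import chi2_contingency

def lift_pairs(df, col_a, col_b, min_count=10, min_lift=1.2, alpha=0.05):
    # Contingency table, Chi² test and lift (observed / expected under independence)
    observed = pd.crosstab(df[col_a], df[col_b])
    chi2, p, dof, expected = chi2_contingency(observed)
    if p > alpha:  # the pair is not statistically significant: drop it
        return pd.DataFrame()
    expected = pd.DataFrame(expected, index=observed.index, columns=observed.columns)
    lift = observed / expected  # a lift of 1 means independence
    rows = []
    for a in observed.index:
        for b in observed.columns:
            n_ij = int(observed.loc[a, b])
            if n_ij >= min_count and lift.loc[a, b] >= min_lift:
                rows.append({"feature_1": f"{col_a} = {a}",
                             "feature_2": f"{col_b} = {b}",
                             "lift": round(float(lift.loc[a, b]), 2),
                             "n_ij": n_ij})
    if not rows:
        return pd.DataFrame()
    return pd.DataFrame(rows).sort_values("lift", ascending=False)

Looping this function over all feature pairs and concatenating the results yields the table above.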

This method nicely complements the two previous graphs: the dendrogram offered a comprehensive hierarchical view of bivariate relationships between features, while the cluster profiles provided a broader view of how variables combine. The lift table, in turn, provides an exhaustive list of bivariate relationships.

Identifying Features That Influence the Target Variable the Most

Feature importance on dataset “Adult income” — Image by author

How to interpret the table:

  • The 3 most influential features on income are capital-gain, age, and hours-per-week, contributing respectively 4%, 1% and 1%.
  • The percentage represents the normalized importance of each feature in contributing to the model’s predictions.
  • As the target variable selected is “income”, a binary variable with two values (>50K or <=50K), the model predicts the probability of one of the two values, in this case >50K.
  • Features are listed in decreasing order of importance.
  • A negative value means that the feature is possibly redundant, adds noise, or is involved in complex interactions with other features.

How it’s constructed:

  • A random forest model is run with default hyperparameters, and feature importances are extracted from it with scikit-learn’s permutation_importance function (see the sketch below).
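
A minimal sketch of this step; the target encoding, train/test split and dummification are my simplifications, and importances are reported here per dummy rather than per original variable:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("adult.csv")  # hypothetical file name
y = (df["income"] == ">50K").astype(int)  # adjust to the exact labels in your copy
X = pd.get_dummies(df.drop(columns="income"))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data, averaged over 10 shuffles per feature
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns)
print(importances.sort_values(ascending=False).head(10))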

Marginal Impact of the Most Influential Features on the Target

The graphics below display the marginal impact of the 3 most influential features (capital-gain, age, hours-per-week) on the target variable (income). The first one relates capital-gain to income:

capital-gain/income on dataset “Adult income” — Image by author

How to interpret the graphic?

  • There is a sharp increase in the predicted probability of income >50K as capital-gain goes from 0 to 5K, followed by a slow progression for values above 5K.
  • Capital-gain accounts for 4% of the model’s predictions (according to the feature importance graphic).

The plots of the marginal effect on the target variable for the next two most influential features are shown below:

ALE for age/income on dataset “Adult income” — Image by author

How to interpret the graphic?

  • It represents the marginal impact of age (horizontal axis, ranging from 18 to 90 years old) on the target variable (income >50K), which spans from -0.12 to +0.10. The impact is flat from 18 to 24 (-0.12), then progresses linearly until 42, stays flat until 58 (+0.09), decreases until 64, and finally remains flat afterward (+0.04).
  • The confidence interval (at 95%) is displayed on the graph, helping avoid overinterpreting the shape of the central line.
  • As seen earlier, “age” has the second-highest contribution to the model’s prediction of whether someone’s income is >50K, and accounts for 1% of the total contribution.

How it’s constructed:

  • The graphic is built with the ALE (Accumulated Local Effects) algorithm (see the sketch below). See here for explanations on ALE.
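
A sketch of the call with PyALE, reusing the model and test set from the feature-importance sketch above; the small wrapper exposing the probability of the >50K class is my own addition:

from PyALE import ale

class ProbaWrapper:
    # Expose predict() as the probability of income > 50K so the ALE is on probabilities
    def __init__(self, clf):
        self.clf = clf
    def predict(self, X):
        return self.clf.predict_proba(X)[:, 1]

# 1D ALE for "age", with a 95% confidence interval
ale_age = ale(X=X_test, model=ProbaWrapper(model), feature=["age"],
              grid_size=50, include_CI=True, C=0.95)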

Hours-per-week vs. income > $50K:

ALE for hours-per-week/income on dataset “Adult income” — Image by author
  • There is a sharp increase in the probability of income >50K from 30 to 60 hours-per-week, after which it levels off.
  • hours-per-week accounts for 1% of the model predictions (according to the feature importance graphic).

To sum up, ALE plots provide the global slope of the marginal impact (stable, increasing, or decreasing) of features on the target variable, and, if the relationship is nonlinear, they allow the identification of thresholds and inflection points. ALEs are extracted from ensemble models like random forests, which are by and large robust to collinearity and capture complex relations between variables, and therefore allow keeping the original units of both the target variable and the features.

Profiles of Groups associated with income > 50K

Profiling income>50K, 5 clusters, on dataset “Adult income” — Image by author

How to interpret the table?

  • Categories at the top of the table are present in all clusters associated with income >50K (distinct_count = 1): workclass = private, race = white, sex = male, capital-gain = 0. Then, features start to differentiate between clusters (hours-per-week, education, age…).

How it’s constructed:

  • We are considering observations associated with income > 50K (what we call generically y = q).
  • The selected observations are clustered with K-modes.
  • The clusters and their profiles are displayed in a table (the mode of each variable for each cluster), with the variables whose categories are shared by the most clusters at the top.

Increasing the number of clusters allows for more categories to appear within clusters.

Profiling income>50K, 10 clusters, on dataset “Adult income” — Image by author

Some of the modes correspond to categories with frequencies substantially lower than 100%, which occurs when a variable has multiple categories (e.g., 14 distinct categories for “occupation”) or poorly differentiates clusters. The following table shows the frequencies for each mode for each cluster from the previous table.

Profiling income>50K, 10 clusters, mode frequencies, on dataset “Adult income” — Image by author

For instance, the cell at the first row and first column, crossing age and cluster G1, reads 53%: it means the age quantile, Q5/5Q (51–90), represents 53% of the observations in that cluster.

When the frequency of a mode is lower than 50%, the mode represents more of an ideal example than the dominant category. There’s a risk that recomputing K-modes might result in the mode not being associated with this cluster or not appearing in any cluster at all.
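
Here is a sketch of how both the profile table and the mode-frequency table can be obtained, reusing the dataframe and discretization step from the segmentation sketch above (the exact label of the >50K class may differ in your copy of the dataset):

import pandas as pd
from kmodes.kmodes import KModes

# Keep only the observations associated with income > 50K, then discretize
high = df[df["income"] == ">50K"].drop(columns="income")
data = high.copy()
for col in data.select_dtypes("number").columns:
    data[col] = pd.qcut(data[col], q=5, duplicates="drop").astype(str)

labels = KModes(n_clusters=10, init="Huang", n_init=5, random_state=0).fit_predict(data)
grouped = data.assign(cluster=labels).groupby("cluster")

# Mode of each variable per cluster, and the share of the cluster it actually covers
modes = grouped.agg(lambda s: s.mode().iloc[0]).T
freqs = grouped.agg(lambda s: round(s.value_counts(normalize=True).iloc[0], 2)).T
print(freqs)  # e.g. 0.53 means the mode covers 53% of that cluster's observations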

In the last article, the stability of clusters will be properly gauged and correlated with other metrics (number of clusters, silhouette index, number of observations, dimensionality…), with the objective of anticipating instability, assessing statistical reliability, and warning early when there is such a risk.

Principles behind Selecting these Insights

Now that we have delved into concrete examples of key insights, let’s explore the guiding principles behind these choices.

When working with rich tabular datasets, several challenges arise:

  • The mix of variable types, which goes well beyond the quantitative/object distinction (continuous, discrete, ordinal, frequency… and special types like dates, geographical data and text), and which requires special treatment in most instances, especially for the target variable.
  • Too many variables to be meaningfully displayed.
  • Dummy-encoded discrete variables add a plethora of features and columns/rows to the results.
  • Complex relationships beyond linear ones, notably interactions, which some models can capture but still require simplification to present.

In pursuit of insights that are easily understandable, we ought to prioritize:

  • Global overviews, by selecting results that can be visually grasped and displayed on a computer screen. One way consists in limiting the number of items to 35 and displaying them vertically (for long-label variables).
  • Complementary approaches: alongside global views, more detailed insights for more granular EDA (e.g., exhaustive bivariate correlations between features).
  • Models that can handle both continuous and discrete variables: K-prototypes for clustering, an association matrix for correlation.
  • Understandability over model precision. For instance, intervals for quantitative variables reduce information but make it more understandable. As long as those insights remain statistically reliable, they are prioritized.

Characteristics Required by the Flowchart

The rule conditions driving the flowchart require computing the following characteristics:

  • For selecting the prediction model: the type of the target variable (continuous/regression, discrete (binary or multi-class)/classification).
  • The distribution of the target variable (if binary, imbalance in excess of 20/80).
  • The performance of the models used to generate the insights (r² or AUC by cross-validation, silhouette index) in the first phase of standard models (see the sketch below).
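
As an illustration, these characteristics could be computed along the following lines; the 20-distinct-values cut-off for treating a numeric target as continuous, the 5-fold cross-validation and the omission of the silhouette index are my simplifications:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

def flowchart_characteristics(df, target):
    y = df[target]
    X = pd.get_dummies(df.drop(columns=target))
    if pd.api.types.is_numeric_dtype(y) and y.nunique() > 20:
        # Continuous target: regression, scored with cross-validated r²
        task, imbalanced = "regression", None
        model, scoring = RandomForestRegressor(random_state=0), "r2"
    else:
        # Discrete target: classification; flag a minority class below 20% (20/80 rule)
        task = "binary classification" if y.nunique() == 2 else "multi-class classification"
        imbalanced = bool(y.value_counts(normalize=True).min() < 0.20)
        y = y.astype("category").cat.codes  # encode labels as integers
        model = RandomForestClassifier(random_state=0)
        scoring = "roc_auc" if task == "binary classification" else "accuracy"
    score = cross_val_score(model, X, y, cv=5, scoring=scoring).mean()
    return {"task": task, "imbalanced": imbalanced, "cv_score": round(float(score), 3)}

print(flowchart_characteristics(pd.read_csv("adult.csv"), "income"))  # hypothetical file name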

Libraries Used to Generate the Insights

Several libraries offer straightforward solutions for generating these insights:

  • Modeling the relationship between target variables and predictors: Scikit-learn, where I use a random forest by default, providing a robust starting point for modeling.
  • Determining predictor influence on the target: PyALE for Accumulated Local Effects (ALE). This library includes confidence intervals.
  • Clustering observations: I use the K-modes library.
  • And of course, the essential four: NumPy, pandas, Matplotlib, and Seaborn.

Beyond this list, there are several libraries that can bring tremendous value: for instance the SHAP library, which computes several insights, in particular the predictors’ impact on the target for specific observations. These insights will be presented in a future article.

Conclusion

I’ve shared what I believe are key insights to extract from rich tabular datasets, starting with insights without a specific target variable, then with a target variable, and finally with a specific value for the target variable.

If you think I have ignored key insights for a fast analysis from your context, I would love to hear your suggestions!

You will find the full flowchart in the 6th article.



Vincent Castaignet

I’m Vincent, a data analyst/scientist who wants to share how Python libraries help extract insights from tabular datasets easily, fast, and frugally.