Aesthetic Data Visualization

Brett Schuessler
3 min read · Apr 28, 2022

--

A selection of works (medium: Python + Seaborn)

MNIST Data (handwritten digits 0–9)

The MNIST data was bottlenecked through an autoencoder; the visualizations are heatmaps of the pairwise distances between data points in the encoding space. Blue is close, black is intermediate, red is far. Note the distinct blue squares along the diagonal: these are the pairwise distances between data points that share the same digit. One can clearly see how many distinct classes there are (digits 0–9).

Note how this embedding places the ‘1’ digits very close together, how ‘6’ is quite far from ‘7’, ‘5’, and ‘3’, and how ‘5’ is far from both ‘4’ and ‘6’.
Note how this particular embedding has placed ‘1’ furthest from ‘7’, and quite far from ‘4’ and ‘9’.
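A minimal sketch of the pairwise-distance heatmap described above, using synthetic cluster data as a stand-in for the autoencoder's encodings (the real pipeline would encode MNIST and sort the points by digit label; the cluster centers, sizes, and the `RdBu_r` colormap here are illustrative assumptions, not the post's exact setup):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt

# Synthetic stand-in for encoded digits: 3 "classes", 20 points each,
# clustered in a 2-D encoding space and ordered by class.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
codes = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Pairwise Euclidean distances between all encoded points.
diff = codes[:, None, :] - codes[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Heatmap of the distance matrix; because points are ordered by class,
# same-class blocks appear as low-distance squares along the diagonal.
plt.imshow(dist, cmap="RdBu_r")  # low = blue, high = red (middle is white here,
plt.colorbar(label="distance in encoding space")  # not black as in the post)
plt.savefig("pairwise_distances.png")
```

With points sorted by class, counting the diagonal blocks recovers the number of classes, which is exactly what the MNIST figures show.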

NBA Stats

The following data comes from NBA stats. The original data is a set of 13-dimensional vectors corresponding to 13 chosen stats per player per game. These are t-SNE projections of that data, with coloring based on k-nearest-neighbors clustering.

This demonstrates the stochastic nature of the t-SNE algorithm when run multiple times with identical parameters.
This is generated from the same data, but with 7 clusters in KNN (experimentally the optimal convergence value). The interpretation is that there is a solid argument for 7 distinct play styles in the NBA.
This also uses 7 clusters, but with a different set of convergence parameters. In both this and the previous figure, the ‘stringiness’ has yet to find an explanation; since it is an emergent property of t-SNE, it may offer a way to reverse-engineer a criterion for distinguishing a ‘good embedding’ from a ‘poor embedding’ for visualization purposes.
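The pipeline above can be sketched roughly as follows, using random vectors in place of the real 13-stat data (the stats, sample size, and the use of k-means as the 7-cluster step are assumptions for illustration; the post describes its clustering only as "KNN"):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the 13-dimensional per-player-per-game stat vectors.
stats = rng.normal(size=(300, 13))

# t-SNE is stochastic: different random_state values give different
# embeddings even with otherwise identical parameters.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(stats)

# 7 clusters, per the "7 distinct play styles" interpretation.
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(embedding)

# Color the 2-D projection by cluster assignment.
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=8)
plt.savefig("tsne_clusters.png")
```

Re-running with a different `random_state` on the t-SNE step reproduces the instability the first figure demonstrates.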

This graph is generated using the average values for each player, i.e. each point is a player.

This is a valuable example because, if you take the clustering seriously, it's clear that t-SNE wasn't able to effectively deform the underlying manifold. Alternatively, the t-SNE could be fine and the clustering bad. I lean towards the former, because clusters have been split into smaller clusters, but that's more intuition than fact. It is an interesting example of the opacity faced in understanding what visualizations mean.

This is a combination heatmap/hierarchical clustering diagram for relative defensive strength between NBA teams for the 2019 season. The heatmap is based on the clustering distance.

It's worth noting this is probably a terrible clustering convergence. You should always stay grounded in reality: there is no qualitative reason Washington should be that different from the other teams. Pretty graphs can have little to no value.
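A combined heatmap/dendrogram like the one described is what Seaborn's `clustermap` produces. A minimal sketch with synthetic per-team defensive stats (the team subset, stat columns, and linkage choices are illustrative assumptions, not the post's actual 2019 inputs):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import seaborn as sns

rng = np.random.default_rng(1)
teams = ["BOS", "LAL", "MIL", "DEN", "WAS", "PHI"]  # illustrative subset
# Synthetic stand-in for per-team defensive stat vectors.
defense = pd.DataFrame(rng.normal(size=(6, 4)),
                       index=teams,
                       columns=["opp_pts", "stl", "blk", "opp_fg_pct"])

# clustermap hierarchically clusters rows and columns, then draws the
# reordered heatmap with dendrograms attached along both axes.
g = sns.clustermap(defense, method="average", metric="euclidean",
                   cmap="vlag", standard_scale=1)
g.savefig("defense_clustermap.png")
```

The row dendrogram's reordering (`g.dendrogram_row.reordered_ind`) is what determines which teams end up visually adjacent, so an unlucky linkage can exaggerate how much an outlier like Washington "stands apart".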

This is a pair plot of the point distributions for selected players from LAC and DEN over the 2019 season. The (x, y) of each dot are points scored by the pair (LAC_i, DEN_j) against each other, oversampled using a multi-dimensional kernel density estimator. For example, a single scatterplot could be the oversampled distribution between Leonard and Murray. The colors are clustering results, with the interpretation that each cluster represents a particular style of play with respect to the player in question. The derivation is complex, but it was ultimately extremely helpful.

Note the spread in the distributions, showing clear relationships between player matchups. Also, this is necessarily a subset of a much larger plot, hence the lack of axis labels.
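The oversampling step can be sketched for a single player pair using SciPy's `gaussian_kde` (the joint-score sample, its mean/covariance, and the 2,000-draw oversample size are all hypothetical stand-ins for box-score data):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
# Hypothetical observed joint points scored per game by one
# (LAC player, DEN player) pair over ~30 meetings.
observed = rng.multivariate_normal([22, 18], [[25, 8], [8, 16]], size=30)

# Fit a 2-D kernel density estimator to the joint distribution,
# then oversample from it to smooth out the small sample.
kde = gaussian_kde(observed.T)  # gaussian_kde expects shape (dims, n)
oversampled = kde.resample(2000, seed=0).T  # shape (2000, 2)
```

Repeating this per player pair and feeding the columns into `seaborn.pairplot` (with cluster labels as `hue`) would yield the grid of colored scatterplots the figure shows.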

Market Data

The following is a graph of Treasury yield curves over time (time is represented by color). This was a study of the pandemic-fueled rapid market change from January 2020 through the end of April 2020. Yellow is April; purple is January.

Note the inversions.
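A time-colored curve plot like this maps each date to a point along a colormap. A minimal sketch with synthetic yields (the maturities, day count, and downward-drift model are assumptions; real inputs would be daily Treasury par yield data). The `viridis` colormap runs purple to yellow, matching "purple is January, yellow is April":

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

maturities = np.array([0.25, 0.5, 1, 2, 5, 10, 30])  # years
n_days = 80  # roughly Jan through Apr 2020

# Synthetic stand-in: an upward-sloping curve that drifts lower each day.
rng = np.random.default_rng(3)
curves = (1.5 + 0.3 * np.log1p(maturities)
          - 0.015 * np.arange(n_days)[:, None]
          + rng.normal(scale=0.03, size=(n_days, len(maturities))))

# One line per day, colored by its position in the date range.
cmap = plt.cm.viridis
for day in range(n_days):
    plt.plot(maturities, curves[day],
             color=cmap(day / (n_days - 1)), alpha=0.6)
plt.xlabel("maturity (years)")
plt.ylabel("yield (%)")
plt.savefig("yield_curves.png")
```

Inversions show up wherever a single colored line dips as maturity increases, which the color gradient lets you date at a glance.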

This was a study examining the discrepancy between open-to-open vs. open-to-close quotes for 5-minute SPX data over a 30-day period, represented as two frequency vs. percent-change distributions overlaid on each other.

Admittedly, the usefulness of this figure (high density around 0% change, a noticeable sideways shift especially at larger changes) could be summarized more simply, but the emergent patterns are truly beautiful.
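The overlay itself is just two shared-bin histograms on one axis. A sketch with a synthetic 5-minute price series standing in for the SPX quotes (the random-walk parameters and 78-bars-per-day assumption are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
# Synthetic 5-minute bars over ~30 trading days (assume 78 bars/day).
n = 30 * 78
opens = 4000 * np.exp(np.cumsum(rng.normal(scale=5e-4, size=n)))
closes = opens * np.exp(rng.normal(loc=1e-4, scale=3e-4, size=n))

# Open-to-open vs. open-to-close percentage changes.
oto = 100 * np.diff(opens) / opens[:-1]
otc = 100 * (closes - opens) / opens

# Shared bins + transparency make the sideways shift visible.
bins = np.linspace(-0.5, 0.5, 81)
plt.hist(oto, bins=bins, alpha=0.5, label="open-to-open")
plt.hist(otc, bins=bins, alpha=0.5, label="open-to-close")
plt.xlabel("pct change")
plt.ylabel("frequency")
plt.legend()
plt.savefig("pct_change_overlay.png")
```

Using identical bins for both series is the important detail; otherwise the "shift" could be a binning artifact rather than a property of the data.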

This is an example of a pairwise difference heatmap of possible buy/sell points at 1-minute intervals for TSLA around November 2021. It was a necessary input for my optimal buy/sell algorithm (finding optimal buy and sell points given a specified number of transactions), and is perhaps the most practically useful example of the bunch.

I like this because it's a sort of footprint of how the stock behaved on a given day, one that can be compared apples-to-apples against other "footprints."
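The matrix behind such a heatmap can be sketched as follows, with a synthetic 1-minute price series standing in for the TSLA data (the series length and random-walk parameters are assumptions; the full buy/sell algorithm is not reproduced here):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(11)
# Synthetic stand-in for 1-minute prices over part of a session.
prices = 1100 + np.cumsum(rng.normal(scale=0.5, size=60))

# Pairwise difference matrix: diff[i, j] = prices[j] - prices[i],
# i.e. the profit of buying at minute i and selling at minute j.
diff = prices[None, :] - prices[:, None]

# Only the upper triangle (j > i) is a feasible buy-then-sell pair;
# its maximum is the best single-transaction profit.
iu = np.triu_indices_from(diff, k=1)
best_single = diff[iu].max()

# The heatmap of this matrix is the day's "footprint".
plt.imshow(diff, cmap="RdBu_r")
plt.colorbar(label="sell price - buy price")
plt.savefig("buy_sell_footprint.png")
```

Because the matrix depends only on price shape, footprints from different days are directly comparable, which is what makes the apples-to-apples reading work.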
