Metrics for Reasoning About the Usability of Visualization Notations

--

Visualizations can be specified using a variety of programmatic notations, such as those powered by D3, ggplot2, Vega-Lite, matplotlib, Seaborn, Plotly and many other frameworks, toolkits, libraries, grammars or domain-specific languages. Evaluating and comparing the usability of such notations is a challenge faced both by their designers and by their prospective users. Current practice involves labour-intensive and ad-hoc close reading of documentation and galleries of examples by experts, for example using the Cognitive Dimensions of Notations framework. In our recent paper, Metrics-Based Evaluation and Comparison of Visualization Notations, we propose a small set of easily-computable metrics which can be used to structure and inform such evaluations, and which we hope will make it easier to reason about and discuss the usability of these notations.

Designers of notations often build galleries of example specification/output pairs to demonstrate expressiveness, i.e. the breadth and variety of visualizations that can be specified. Expressiveness is sometimes compared across notations by demonstrating that a pre-existing gallery for one notation can be replicated using another. As one example, the designers of Mascot.js (formerly Atlas) explicitly compare their example gallery to Vega-Lite’s in their paper. We build on this practice by defining metrics which can be computed on multi-notation galleries — galleries where each visualization is specified in each of N notations — and showing how these within-notation metrics can be usefully compared across notations.

An excerpt of a multi-notation gallery: one example specification/visualization pair expressed in two notations, here matplotlib and ggplot2. Our case study (see below) includes 40 such examples in 9 notations; all 40 examples can be browsed side by side in these two notations in the demo linked below.

The metrics we use are as follows; a rough sketch of how they might be computed appears after the list:

  • The specification length, measured in bytes of UTF-8-encoded source text, captures aspects of notational terseness and complexity.
  • The vocabulary size, measured as the number of unique tokens used in the specs of a notation across the gallery, captures aspects of notational economy and the number of concepts a user must mentally juggle while using the notation.
  • The textual distance between two specs in the same notation captures aspects of notational viscosity: how difficult it is to transform one spec into another. We measure this using a robust and generic compression-based distance function.
  • The sprawl of a given notation over a gallery is the median pairwise distance between all of that notation's specs in the gallery, and is a high-level measure of viscosity.
  • The remoteness of a spec is the median distance between it and every other spec of the same notation in the gallery. Specs with low remoteness relative to sprawl can be considered ‘central’ to the gallery for a given notation, and conversely, ones with high remoteness relative to sprawl are likely outliers.

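To make these definitions concrete, here is a minimal sketch of how the metrics could be computed for one notation's specs, assuming UTF-8 source strings, a naive word tokenizer, and zlib as the compressor for a normalized compression distance; NotaScope's actual tokenization and compression choices may differ.

```python
# A minimal sketch, assuming UTF-8 source strings, a naive \w+ tokenizer and
# zlib as the compressor; not the paper's reference implementation.
import re
import zlib
from itertools import combinations
from statistics import median

def spec_length(spec: str) -> int:
    """Specification length in bytes of UTF-8-encoded source text."""
    return len(spec.encode("utf-8"))

def vocabulary_size(specs: list[str]) -> int:
    """Number of unique tokens used across all of one notation's specs."""
    return len({tok for spec in specs for tok in re.findall(r"\w+", spec)})

def distance(a: str, b: str) -> float:
    """Normalized compression distance between two specs."""
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)

def remoteness(spec: str, others: list[str]) -> float:
    """Median distance between one spec and every other spec in the gallery."""
    return median(distance(spec, other) for other in others)

def sprawl(specs: list[str]) -> float:
    """Median pairwise distance between all of one notation's specs."""
    return median(distance(a, b) for a, b in combinations(specs, 2))
```
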
We hasten to note that we do not propose these quantitative metrics as part of a normative framework, i.e. we make no claim that terser notations, or ones with fewer tokens or less sprawl, are more usable! Rather, we present them as a way to structure explorations and discussions of the complex objects we call notations. We expect that claims about notations couched in terms of comparisons of metrics that can be traced back to concrete galleries of examples can be more precise and more productively discussed than ad-hoc, subjective reports.

Case Study

We demonstrate the use and usefulness of these metrics through a case study focused on statistical charting. We hand-built a gallery of 40 commonly-used statistical graphics of a single dataset using 9 different popular notations (see the appendix of our paper for precise definitions of these notations). We deliberately used the same dataset in the same format for all the examples, and included all the required data-wrangling code in the specifications, to make cross-example and cross-notation comparisons as direct as possible.

Our 40-example gallery covers a variety of conventional chart types (e.g. bar charts, line charts, scatter plots, and heatmaps), techniques (e.g. mapping of continuous and categorical variables to spatial or color axes, and the use of grouping, stacking, binning, aggregation, small multiples, regression, and error bars) and tasks (e.g. showing the distributions of variables and the relationships between them). The full gallery is browsable at https://app.notascope.io/

We developed an extensible open-source tool called NotaScope to build and browse multi-notation galleries, as well as to compute and visualize our metrics. The case study gallery (including specifications and renderings) can be viewed within a demo instance running at https://app.notascope.io/

A video walkthrough of the NotaScope tool. A live demo is available at https://app.notascope.io/

The figures below, taken from our paper, are visualizations of the metrics we computed from our case study gallery, and serve as illustrations of what these metrics can be used for:

  1. comparing multiple notations at a high level
  2. comparing pairs of notations example by example
  3. exploring the design space of a single notation

We gathered feedback on this case study, including the choice and implementation of visualizations and the usefulness and perceived value of our metrics, by interviewing either the original designers of these notations or members of the core development teams for the associated projects. Our paper includes many specific quotes from these six interviews, but overall our experts broadly agreed that these metrics do capture and systematically externalize important aspects of notation usability, and should be useful for communicating about them. Our metrics-based analysis of the case study gallery, embodied in the figures below, generally aligned with their pre-existing mental models of the relationships between these notations, which we take as further confirmation of the approach.

Individual metric values across many notations: the per-notation distributions of specification remoteness display a clear “core and tail of outliers” pattern, and vocabulary size shows a quite wide range of values.
Comparing metrics across multiple notations: to capture some of the uncertainty or bias inherent in our choice of visualizations to include in the gallery, or in our implementation choices, we generated 1000 sample galleries using bootstrapping and used Kernel Density Estimation to visualize 75% probability-mass contours. The overlap between some of the contours suggests that differences in metric values between these notations may be slight. The three metrics appear roughly correlated, with at least one distinct cluster of notations containing matplotlib, pandas.plot and plotly.go, and a less-distinct one containing seaborn, seaborn.obj and plotly.express, standing apart from vega-lite, altair and ggplot2.
Using metrics to compare pairs of notations example by example: Altair has semantics and output identical to Vega-Lite's, yet Altair has lower remoteness, likely because it can infer data types. Vega-Lite, ggplot2, and plotly.express have comparable remoteness for most specs, but diverge for connected scatterplots (A, more remote in ggplot2 due to the use of dplyr), aggregated line charts (B, more remote in plotly.express due to the use of pandas for aggregation), and regression lines (C, more remote in Vega-Lite due to higher verbosity and repetition).
Using the distance metric to visualize how a single notation (here, the new Seaborn Objects notation) structures the design space of addressable visualizations: several clusters (A, B, C) are visible in an MDS embedding and in a dendrogram built via agglomerative clustering; specs within these clusters are more similar to each other than to specs outside the cluster. Specs in A use Pandas for data transformation, while specs in B require a fallback to the non-Objects, classic Seaborn-style notation. C is a clear outlier, as neither Seaborn API supports pie charts, prompting a fallback to pandas.plot notation.
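
For readers who want to produce this kind of figure for their own galleries, the sketch below assumes a precomputed square matrix D of pairwise distances between one notation's specs (e.g. from the distance function sketched earlier) and uses off-the-shelf MDS and average-linkage agglomerative clustering; it illustrates the general technique and is not the paper's plotting code.

```python
# A rough sketch: D is assumed to be a symmetric NxN array of pairwise spec
# distances with a zero diagonal; not NotaScope's own plotting code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def embed_and_cluster(D: np.ndarray, labels: list[str]) -> None:
    # 2-D embedding that approximately preserves the pairwise distances.
    xy = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)

    fig, (ax_mds, ax_dendro) = plt.subplots(1, 2, figsize=(10, 4))
    ax_mds.scatter(xy[:, 0], xy[:, 1])
    for (x, y), name in zip(xy, labels):
        ax_mds.annotate(name, (x, y), fontsize=7)
    ax_mds.set_title("MDS embedding")

    # Average-linkage agglomerative clustering over the same distances.
    Z = linkage(squareform(D, checks=False), method="average")
    dendrogram(Z, labels=labels, ax=ax_dendro)
    ax_dendro.set_title("Dendrogram")

    fig.tight_layout()
    plt.show()
```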

This case study, though it includes hundreds of specifications, is necessarily limited in scope: there is more to visualization than (static) statistical graphics; many popular notations were not included in our study; and the notations we did include are capable of much more than we were able to cover. That said, it serves as a good example of applying our metrics-based approach to the evaluation and comparison of visualization notations.

We envision the primary users of this metrics-based approach to be notation designers looking to systematically evaluate new notations and communicate about them by comparison to existing work. These metrics could also be incorporated into tools for users of notations, for example to help them select a notation for a particular task, or to learn a new notation by comparison to one they already know. Documentation systems could use the distance metric to facilitate navigation between similar examples, for instance.
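
As a purely hypothetical illustration of that last idea, a documentation system could rank its gallery examples by their compression distance to whatever spec the user is currently editing; the gallery and nearest_examples names below are invented for this sketch.

```python
# Hypothetical helper: rank gallery examples by similarity to a user's spec.
import zlib

def _distance(a: str, b: str) -> float:
    """Normalized compression distance, as in the metrics sketch above."""
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)

def nearest_examples(user_spec: str, gallery: dict[str, str], k: int = 3) -> list[str]:
    """Return the names of the k gallery examples most similar to user_spec."""
    return sorted(gallery, key=lambda name: _distance(user_spec, gallery[name]))[:k]
```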

We expect that this approach can also be fruitfully applied to other visualization domains addressed by multiple notations, such as diagramming, animation, interaction, and scientific visualization, as well as to connected domains such as data-wrangling and modelling. If anyone reading this wants assistance in using this approach in general or our software in particular, or would like to contribute new notations (e.g. D3, AntV G2, Observable, HighCharts, base-R, etc.) or examples to our case study gallery, please get in touch!

--