Metrics for Reasoning About the Usability of Visualization Notations
Visualizations can be specified using a variety of programmatic notations, such as those powered by D3, ggplot2, Vega-Lite, matplotlib, Seaborn, Plotly and many other frameworks, toolkits, libraries, grammars or domain-specific languages. Evaluating and comparing the usability of such notations is a challenge faced both by their designers and their prospective users. Current practice involves labour-intensive and ad-hoc close reading of documentation and galleries of examples by experts, for example using the Cognitive Dimensions of Notations framework. In our recent paper, Metrics-Based Evaluation and Comparison of Visualization Notations, we propose a small set of easily computable metrics which can be used to structure and inform such evaluations, and which we hope will make it easier to reason about and discuss the usability of these notations.
Designers of notations often build galleries of example specification/output pairs to demonstrate expressiveness, i.e. the breadth and variety of visualizations that can be specified. Expressiveness is sometimes compared across notations by demonstrating that a pre-existing gallery for one notation can be replicated using another. As one example, the designers of Mascot.js (formerly Atlas) explicitly compare their example gallery to Vega-Lite’s in their paper. We build on this practice by defining metrics which can be computed on multi-notation galleries — galleries where each visualization is specified in each of N notations — and showing how these within-notation metrics can be usefully compared across notations.
The metrics we use are:
- The specification length, measured in bytes of UTF-8-encoded source text, captures aspects of notational terseness and complexity.
- The vocabulary size, measured as the number of unique tokens used in the specs of a notation across a gallery, captures aspects of notational economy and the number of concepts a user must mentally juggle while using the notation.
- The textual distance between two specs of the same notation captures aspects of notational viscosity: how difficult it is to transform one spec into another. We measure this using compression distance, a robust and generic distance function.
- The sprawl of a given notation over a gallery is the median pairwise distance between that notation's specs in the gallery, and is a high-level measure of viscosity.
- The remoteness of a spec is the median distance between it and every other spec of the same notation in the gallery. Specs with low remoteness relative to sprawl can be considered ‘central’ to the gallery for a given notation, and conversely, ones with high remoteness relative to sprawl are likely outliers.
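To make these definitions concrete, here is a minimal Python sketch of how such metrics might be computed. This is an illustrative reimplementation under stated assumptions, not the NotaScope code: the tokenizer is a naive word/symbol stand-in for per-notation lexing, and we assume the normalized compression distance (NCD) formula as the compression-based distance.

```python
import re
import zlib
from statistics import median

def spec_length(spec: str) -> int:
    """Specification length: bytes of UTF-8-encoded source text."""
    return len(spec.encode("utf-8"))

def vocabulary_size(specs: list[str]) -> int:
    """Unique tokens across all specs of one notation in a gallery.

    ASSUMPTION: a naive regex tokenizer (words and single punctuation
    marks) stands in for a real per-notation lexer.
    """
    tokens: set[str] = set()
    for spec in specs:
        tokens.update(re.findall(r"\w+|[^\w\s]", spec))
    return len(tokens)

def distance(a: str, b: str) -> float:
    """Compression distance between two specs.

    ASSUMPTION: the standard normalized compression distance,
    NCD(a, b) = (C(ab) - min(C(a), C(b))) / max(C(a), C(b)),
    with zlib as the compressor.
    """
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)

def remoteness(spec: str, others: list[str]) -> float:
    """Median distance between one spec and every other spec."""
    return median(distance(spec, other) for other in others)

def sprawl(specs: list[str]) -> float:
    """Median pairwise distance across one notation's gallery specs."""
    pairwise = [distance(a, b)
                for i, a in enumerate(specs)
                for b in specs[i + 1:]]
    return median(pairwise)
```

Note that because compression distance only needs the source text of each spec, the same functions apply unchanged to any notation in a gallery, which is what makes cross-notation comparison straightforward.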
We hasten to note that we do not propose these quantitative metrics as part of a normative framework, i.e. we make no claim that terser notations, or ones with fewer tokens or less sprawl, are more usable! Rather, we present them as a way to structure explorations and discussions of the complex objects we call notations. We expect that claims about notations, couched in terms of metric comparisons traceable to concrete galleries of examples, can be more precise and more productively discussed than ad-hoc, subjective reports.
Case Study
We demonstrate the use and usefulness of these metrics through a case study focused on statistical charting. We hand-built a gallery of 40 commonly used statistical graphics of a single dataset using 9 different popular notations (see the appendix of our paper for precise definitions of these notations). We deliberately used the same dataset in the same format for all the examples and included all the required data-wrangling code in the specifications, to make cross-example and cross-notation comparisons as direct as possible.
We developed an extensible open-source tool called NotaScope to build and browse multi-notation galleries as well as compute and visualize our metrics. The case study gallery (including specifications and renderings) can be viewed within a demo instance running at https://app.notascope.io/
The figures below, taken from our paper, are visualizations of the metrics we computed from our case study gallery, and serve as illustrations of what these metrics can be used for:
- comparing multiple notations at a high level
- comparing pairs of notations example by example
- exploring the design space of a single notation
We gathered feedback on this case study, including the choice and implementation of visualizations and the usefulness and perceived value of our metrics, by interviewing either the original designers of these notations or members of the core development teams of the associated projects. Our paper includes many specific quotes from these six interviews; overall, our experts broadly agreed that these metrics do capture and systematically externalize important aspects of notation usability, and that they should be useful for communicating about them. Confirming this, our metrics-based analysis of the case study gallery, embodied in the figures below, generally aligned with their pre-existing mental models of the relationships between these notations.
This case study, though it includes hundreds of specifications, is necessarily limited in scope: there is more to visualization than (static) statistical graphics; many popular notations were not included in our study; and the notations we did include are capable of much more than we were able to show. That said, it serves as a good example of the application of our metrics-based approach to the evaluation and comparison of visualization notations.
We envision the primary users of this metrics-based approach to be notation designers looking to systematically evaluate new notations and communicate about them by comparison to existing work. These metrics could also be incorporated into tools for users of notations, for example to help them select a notation for a particular task, or to learn a new notation by comparison to one they already know. Documentation systems could use the distance metric to facilitate navigation between similar examples, for instance.
We expect that this approach can also be fruitfully applied to other visualization domains addressed by multiple notations such as diagramming, animation, interaction, and scientific visualization, as well as connected domains such as data-wrangling and modelling. If anyone reading this wants assistance in using this approach in general or our software in particular, or would like to contribute new notations (e.g. D3, AntV G2, Observable, HighCharts, base-R etc) or examples to our case study gallery, please get in touch!