Neglected (Yet Foundational) Concepts in the Pedagogy of Data Visualization
The end of another semester is approaching fast, and this year marks my 7th year in a row teaching visualization to computer science and engineering students at NYU. Over time I have experimented a lot with my pedagogical approach, and I am now at a stage where students work through many more (carefully crafted) visualization design problems than they used to.
However, as I move from a course format where lectures are the main ingredient to one where students get a lot more practice, some interesting patterns are emerging. In particular, I have identified a number of concepts that I had long considered somewhat marginal but now think are foundational.
Unfortunately, these same concepts do not seem to be covered properly in existing data visualization books. Research exists for some of them, but it needs to be translated into forms that are easy to teach and practically useful in visualization design.
Here are the concepts …
Data Transformation

We tend to think of data transformation as a technical skill. But data transformation is a design tool as much as deciding which form and shape to give to your visualization. Unfortunately, data transformation is usually taught under the lens of “how” to apply certain transformations, while there is very little guidance on “why” and “when” to actually use them. More concretely: when is it okay to drop data values, data points, or entire attributes? How do we know when a given level of aggregation is too much or too little? When is it a good idea to normalize data using ratios or percentages so that a wide range of values becomes comparable? I don’t think we have a good system for teaching students to think about data transformation as a design tool, and I see this kind of limitation all the time. Let me state it again: data transformation is a design tool as much as deciding which graphical strategies to use. We need to teach this better.
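To make the normalization point concrete, here is a minimal sketch (the group names and counts are invented for illustration) of why converting raw counts to percentages is a design decision, not just a technical step:

```python
# Raw counts from two groups of very different sizes are not directly
# comparable; normalizing within each group makes them so.
# All data below are invented for illustration.
raw = {
    "small_school": {"pass": 40, "fail": 10},
    "large_school": {"pass": 600, "fail": 400},
}

def to_percentages(counts):
    """Normalize a dict of counts to percentages of their total."""
    total = sum(counts.values())
    return {k: 100 * v / total for k, v in counts.items()}

normalized = {group: to_percentages(c) for group, c in raw.items()}
# The small school passes 80% vs. the large school's 60% -- a comparison
# that the raw counts (40 vs. 600) would have obscured.
```

Whether to show the raw counts, the percentages, or both is exactly the kind of “why/when” decision I mean.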
Sorting

Sorting objects in a visualization is not a minor aesthetic adjustment. It has dramatic implications for what can and cannot be detected, as well as for how hard it is to visually parse the whole picture. Sorting problems are everywhere in visualization: the order of bars in a bar chart, the order of rows or columns in anything based on rows and columns (matrices, bubble time series, box plots), the order of nodes in a circular graph. All of these situations share common patterns and (probably) common optimization criteria. On a related note: most visualization tools are not particularly good at sorting. And on another related note: if only we could stop using alphabetical order as the default!
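The simplest instance of the problem fits in a few lines (category names and counts invented for illustration): ordering bar-chart categories by value instead of accepting the alphabetical default.

```python
# Invented category counts for a hypothetical bar chart.
counts = {"Pear": 12, "Apple": 45, "Mango": 3, "Banana": 30}

# The common default hides the ranking:
alphabetical = sorted(counts)

# Sorting by value makes the ranking readable at a glance:
by_value = sorted(counts, key=counts.get, reverse=True)
```

The same one-line decision reappears, in harder form, when ordering matrix rows or nodes in a circular graph.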
Scalability (Too Many Categories)

For many years I thought scalability in visualization was exclusively about how to visualize a million data points in a scatter plot (or, if you are an academic, how to deal with hundreds of attributes). But lately I have discovered that a far more practical, and far more common, problem is how to deal with too many categories. When you analyze and visualize nontrivial data sets, it happens surprisingly often that a categorical attribute has too many values to be visualized effectively. How do you deal with that? With too many categories you don’t have enough distinct colors and you don’t have enough space (just to mention a few common problems that stem from high cardinality). Students need a set of strategies to deal with this problem. And of course the same is true when you have too many data objects. For instance, how do you visualize 100 time series in a line chart? In my experience these problems are more the norm than the exception, and effective solutions are needed even for small numbers in the order of 30–100 items, categories, or dimensions.
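One of the most common strategies for high cardinality can be sketched in a few lines: keep the top k categories and merge the long tail into an “Other” bucket (the category values below are invented for illustration).

```python
from collections import Counter

# An invented categorical attribute with a long tail of rare values.
values = (["chrome"] * 55 + ["safari"] * 25 + ["firefox"] * 10
          + ["opera"] * 4 + ["brave"] * 3 + ["lynx"] * 2 + ["netscape"] * 1)

def top_k_with_other(values, k=3):
    """Keep the k most frequent categories; lump the rest into 'Other'."""
    counts = Counter(values)
    top = counts.most_common(k)
    other = sum(counts.values()) - sum(n for _, n in top)
    return dict(top, Other=other)

reduced = top_k_with_other(values, k=3)
# Four bars (and four colors) instead of seven.
```

The hard design question, of course, is choosing k and deciding whether the tail is actually the interesting part.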
Multiple Views

Data visualization is almost exclusively taught as the art of creating the best possible single chart. But many problems are best solved with a composite set of visualizations that work “synergistically” to solve a data communication problem. Some of this is addressed in more academic contexts (and in some books) when “multiple views” are described in the context of interactive interfaces. However, we do not have a good understanding of the design space of composite visualizations in which multiple charts work together to communicate information without interaction. I am sure there are interesting guidelines and patterns, but I am not aware of any systematic solution to the problem. The only exceptions are Zening Qu and Jessica Hullman’s paper on “Keeping Multiple Views Consistent” and Tamara Munzner’s schematic of different ways to create multiple views in her classic “Visualization Analysis and Design” book.
Comparison

If there is one thing at least as foundational as anything else we currently teach in visualization, it is “comparison”. Comparison is everything in visualization, and design decisions often boil down to deciding which approach to use to compare different segments of a given data set. For instance, should you plot all those time series in one plot or split them into small multiples? How do you decide? What are the options? What are the criteria? Luckily, research from Michael Gleicher et al. on “Visual Comparison for Information Visualization” can help, but it needs to be integrated into books and courses in a way that lets designers think more systematically about the problem.
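Gleicher et al. describe explicit encoding as one comparison strategy alongside superposition and juxtaposition: instead of overlaying two series and asking readers to judge gaps by eye, you compute and plot the relationship directly. A minimal sketch (with invented series):

```python
# Two invented time series we want readers to compare.
series_a = [10, 12, 15, 14, 18]
series_b = [9, 13, 11, 14, 16]

# Explicit encoding: derive the difference and visualize that instead,
# e.g. as bars around a zero baseline, rather than two overlaid lines.
difference = [a - b for a, b in zip(series_a, series_b)]
```

Choosing between overlaying, small-multiplying, or explicitly encoding the difference is precisely the kind of decision the comparison literature can inform.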
Of all the concepts outlined here uncertainty is probably the most readily recognized as a big gap in the pedagogy of data visualization. Yet, there is a paucity of teaching material, books do not cover it or cover it only in passing, and I don’t even mention it in my course. There are plenty of situations where a visualization designer may need to convey how much uncertainty there is around a given statistic or measure. At minimum, data visualization courses should make students aware uncertainty is a key concept and provide a way to think about it more systematically (e.g., by introducing a design space of uncertainty visualization techniques). Jessica Hullman and Matthew Kay are making a lot of progress on the research side in this area. A good starting point is our Data Stories interview with them on “Visualizing Uncertainty”.
What to do?
These are gaps I have noticed in my own teaching, so the list is necessarily personal. My intent is to build new modules in my course that explicitly address these issues. Unfortunately, there is not a lot of material I can draw from, so I will probably have to create some of it myself (yay!). And how about you? Do you struggle with the same things? Do you know of any existing material I could use to cover some of these gaps?