Doing More With Sample Datasets
tl;dr: Visualization designers and researchers use boring standard datasets to show off their designs. We should put that wasted space to better use, to advocate for things we care about.
A not-so-hidden secret of visualization research is that, especially when we’re introducing a new technique or a new system, we tend to use pretty much the same standard datasets. These are ones that are baked into R by default (mtcars or iris or Old Faithful eruptions or whatnot), or are common in ML environments like Kaggle (Titanic passenger survival or wine quality and so on). For graph data you use the Les Mis dataset. Then you’ve got your barley dataset, and, if you want to start dipping your toe into “big data,” the airline delay dataset.
These datasets are good in that they are mostly clean, big enough to test some limits of your system or technique, but small enough that you usually don’t need to do anything fancy on the back end to make use of them. They are also extremely easy to get a hold of: they are Google-able datasets. They are also familiar enough that you don’t need to spend a lot of time explaining the provenance or background of the data, so you can spend more time showing cool demos. And they have some properties that we might want in order to show that our technique works. The iris dataset, in particular, has some overplotting (but not too much) and some clear a priori clusters, but no single dimension that cleanly separates those clusters, so it’s great for testing all sorts of summarization or projection techniques.
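Just how easy to get a hold of? As a quick sketch, assuming you have scikit-learn installed, the iris dataset is literally one import away in Python:

```python
# The iris dataset ships with scikit-learn -- no download, no provenance
# discussion needed, which is exactly why it gets reused so often.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)                # → (150, 4): 150 flowers, 4 measurements
print(", ".join(iris.target_names))   # → setosa, versicolor, virginica
```

The three species labels are the “clear a priori clusters” mentioned above; the four measurement columns are where the overplotting and imperfect separability come from.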
The problem, of course, is that these standard datasets are boooooorrrring. There’s nothing particularly insightful in the data, and what insights there are are so stale that there’s no way I’d be able to bring them into my daily life. I don’t care about barley yields. I’ve already read Les Misérables, so I generally remember which characters hang out together. I know that gender and passenger class partially, but not entirely, predicted who died on the Titanic. It’s all rather mind-numbing. These toy datasets also don’t always serve well as evidence for the efficacy of your technique, since they don’t let you see what happens when you try to scale up to the complexity, size, and noisiness of the data that you do care about.
There’s another thing I don’t like about these standard datasets, however, and that’s that they are missed opportunities. If I’m demonstrating a new graph layout algorithm or a new visual analysis platform, it doesn’t really matter what data I use to illustrate it in the paper figures, so long as it’s sufficiently complex and “realistic” that it looks like my system does what it’s supposed to. That means that I have an opportunity to use that visual real estate to show something that I (and perhaps also you) care about. Why waste that space on something trivial?
To be more concrete, I think we can use demos, examples, and explanations as platforms for advocacy, just by choosing datasets that we actually care about.
Tired of using Old Faithful eruptions to show that your kernel density estimation technique works? Why not try the UK’s self-reported gender pay gap data instead? Want to show that your outlier detection algorithm works well? Instead of mislabeled barley yields or flower species, why not turn to the World Bank’s indicators dataset and examine things like per capita health expenditure and life expectancy? Have a time series smoothing or prediction technique you want to test out? Throw NASA’s carbon dioxide and global temperature anomaly data in people’s faces as many times as you can until it sinks in. Heck, just choose a topic that you care about from data.gov.
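To make the swap concrete, here’s a minimal sketch of a Gaussian kernel density estimate in plain Python. The pay-gap figures below are invented placeholder numbers, not the actual UK filings; the point is that the technique is indifferent to whether the numbers describe geyser eruptions or pay inequity, but the reader is not:

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a kernel density estimate: a smooth density function built
    by centering a Gaussian bump on every observation and averaging."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# Made-up values standing in for self-reported median pay gaps (%).
pay_gaps = [2.1, 5.4, 8.0, 9.3, 11.7, 14.2, 15.0, 18.6, 22.4, 31.0]
kde = gaussian_kde(pay_gaps, bandwidth=3.0)
print(f"estimated density at a 10% pay gap: {kde(10.0):.4f}")
```

Swapping `pay_gaps` for a column of Old Faithful eruption durations changes nothing about the demo except what the audience walks away thinking about.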
I get that, especially for government data stores, sometimes the websites are archaic or the metadata is sparse, but in the end you get to say that your demo of graph layout or sparklines or treemaps or what have you has something to say about something important. Choosing these more politically or ethically motivated datasets ought to have very little impact on how your work is presented in an academic or portfolio context (it’s just about the design and technique, after all), but can actually result in some positive change beyond getting a paper accepted or getting a consulting gig lined up.
Problems With The Solution
A natural reaction to this suggestion is that I seem to be proposing that we “politicize” what was hitherto “neutral ground.” My response is that there’s no such thing as neutral ground here. Visualizations are potentially persuasive ways of presenting information, just like any other medium. By choosing to make your sample visualizations be about nothing (or more or less nothing), you’re already implicitly making a stand. In particular, you’re ceding ground to the louder voices who perhaps do not have the beliefs and scruples you do, and will gladly do whatever it takes to manipulate the data to their own ends. Don’t assume that somebody is advocating for your positions with your data if you aren’t going to. Saying nothing is just as much of a political statement as getting up on the barricades with a megaphone; it’s just a statement in support of the status quo at best, and in support of the loudest voices at worst (rhetorical loudness being a poor predictor of correctness).
Especially in academic visualization work, our datasets are often not chosen for us. Corporations and governments and institutional agencies have an outsized influence over what data we show. US agencies with three-letter acronyms can afford to pay for visualization experts to investigate the best way of analyzing large corpora of surveillance data. The people being surveilled lack the organizational or economic power to write the same kinds of checks or offer the same kinds of grants. Companies are willing to invest in techniques that combine machine learning, visualization, and automated decision-making to handle large stores of data. The people on the other end of these “weapons of math destruction” are often not so well-situated. There is an imbalance of power between those who are capable of gathering, curating, and visualizing data, and those who are impacted by the use or abuse of these data. How we make use of this power differential has an inescapable moral component. By adding our own political causes to visualization demos and presentations, we’re not adding bias to a previously pristine, purely abstract domain; if anything, we’re moving the scales just a little bit back towards a more neutral position.
Of course, there’s a more pragmatic objection to this proposal. Not everybody in your peer or professional group may share your enthusiasm or perspective. A professor emeritus may have enough clout and influence (and indulgence from the audience) to turn a sample of a visualization technique into their personal soapbox for a bit, but not everybody has the same luxury. Our traditional sample datasets are safe in a way that more interesting datasets are not. Especially for people in more precarious positions (junior faculty, freelancers, or lower-level employees in companies), maybe their paper or demo or design gets put in front of the Wrong Person, and that is just enough bias to cost them the job or promotion or paper acceptance.
I’ll be less dogmatic about my response to this objection. If you don’t feel safe about advocating in your current position, then I can’t force you to. My remedies to this concern are twofold. First, that we should lead from the top. Senior people with more institutional power and bigger safety nets should use some of this leeway for good. The second is that we need to organize. A single junior developer can’t safely write a letter protesting Amazon’s connection with ICE, but lots of developers sure can. A single person may not be able to change the sexual harassment policies at Google, but lots of them can. It’s much easier to challenge institutional norms with institutional power. That suggests a need to, formally or informally, connect and agitate with people in your cohort. These kinds of connections can be built one-on-one, and can contribute to long-lasting institutional change once the ball is rolling.
What To Do Next
Collect and curate datasets with real social impact: I can google “iris dataset” and get a .csv within a click or two. It should be just as easy to get my hands on economic inequality datasets, or women’s health datasets, or voter participation datasets. This suggests the need for a centralized but democratic repository of datasets for social good. This will likely require institutional organization and resources to create, but I’m happy to be surprised by the ability and passion of hobbyists.
Reward visualization work that advocates for justice: There are periodically efforts to reward academic work with badges for things like its commitment to open access or use of preregistration to alleviate replicability concerns. Similarly, there are also awards for the aesthetic and intellectual appeal of visualizations. We should find ways to reward visualization work that makes a strong case for social change or raises visibility for otherwise under-examined social ills in the same way. This could be in the form of parallel groups that look over proceedings and issue awards or demerits (as with the Open Access Vis project that collates information about open science practices in the IEEE VIS conference), or even more informal recognition like signal-boosting important work, or solidarity with others working in spaces you care about.
Work with advocacy organizations: We often donate our personal time and energy to causes we care about. There’s no reason that this philanthropy has to stop with our checkbooks or our weekends. Designing good visualizations and working with data is a form of expertise, and can be a critically important component of persuasion and communication. If you’re in an organization that sets aside resources for personal projects, try to make use of those resources to help advocacy organizations do their data science work (even if it doesn’t result in a paper or a CV line). If you’re in an organization that doesn’t offer those resources, see what it would take to get them allocated.
Maybe I’m off base. Maybe people really are passionate about barley yields in Minnesota and flower measurements in Quebec from 80 years ago. But I’m not, and it seems at the very least inefficient to waste all of that communicative space rehashing them again and again. Visualization is supposed to care about “data-ink ratios” and the like, so shouldn’t we try to put just a little bit of signal in with all that noise?
Thanks to Michelle Borkin, Alex Kale, and Alper Sarikaya for comments and suggestions.