VizNet: Towards a Large-Scale Visualization Learning and Benchmarking Repository
This article summarizes a paper authored by Kevin Hu, Snehalkumar ‘Neil’ S. Gaikwad, Madelon Hulsebos, Michiel A. Bakker, Emanuel Zgraggen, César Hidalgo, Tim Kraska, Guoliang Li, Arvind Satyanarayan, and Çağatay Demiralp. The paper will be presented at CHI 2019 on Tuesday, May 7, 2019, at 16:00 in the session Visualization Systems and Repositories.
Takeaway
VizNet is a large-scale corpus of over 31 million datasets compiled from the web, open data repositories, and online visualization platforms. Researchers can use VizNet to conduct experiments with real-world data, assess the ecological validity of synthetic data, and compare design techniques against a common baseline.
The Need for Visualization Repositories
Large-scale databases such as WordNet [1] and ImageNet [2] provide the data needed to train and test machine learning models, as well as a common baseline for evaluation, experimentation, and benchmarking. They have proven instrumental in pushing the state-of-the-art forward in language modeling and computer vision.
Research on graphical perception, however, often relies on ad hoc or synthetically generated datasets that do not display the same characteristics as data found in the wild. To date, insufficient attention has been paid to designing and engineering a centralized, large-scale repository for evaluating the effectiveness of visual designs. This heightens the need for a large-scale corpus with which to learn, evaluate, and benchmark measures of perceptual effectiveness.
Characterizing Real-World Data
We introduce VizNet, a large-scale corpus of over 31 million datasets compiled from the web, open data repositories, and online visualization platforms.
We find that real-world datasets typically consist of 17 rows and 3 columns. 51% of the columns in the corpus contain categorical data, 44% quantitative, and only 5% temporal. About half of the columns are best described by a normal, lognormal, or power law distribution. Summary statistics and distributions are shown in the figure below.
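The categorical/quantitative/temporal breakdown above depends on tagging each column with a data type. As a rough illustration of how such tagging can work, here is a minimal heuristic sketch using pandas type inference; the names (`classify_column`, the `example` table) are hypothetical, and VizNet's actual typing pipeline is more involved than this.

```python
import pandas as pd

def classify_column(series: pd.Series) -> str:
    """Tag a column as quantitative, temporal, or categorical.

    A simplified heuristic in the spirit of VizNet's column typing,
    not the paper's actual pipeline.
    """
    if pd.api.types.is_numeric_dtype(series):
        return "quantitative"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "temporal"
    # Try parsing strings as dates before falling back to categorical.
    parsed = pd.to_datetime(series, errors="coerce")
    if parsed.notna().mean() > 0.9:
        return "temporal"
    return "categorical"

# Tiny illustrative table (invented for this sketch).
example = pd.DataFrame({
    "region": ["north", "south", "east"],
    "sales": [120, 95, 143],
    "date": ["2019-05-07", "2019-05-08", "2019-05-09"],
})
types = {col: classify_column(example[col]) for col in example.columns}
```

Aggregating such per-column tags over millions of tables is what yields corpus-level proportions like those reported above.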
Utility of VizNet as a resource for data scientists and visualization researchers
We demonstrate VizNet’s viability as a platform for conducting online crowdsourced experiments at scale by replicating the Kim and Heer (2018) study assessing the effect of task and data distribution on the effectiveness of visual encodings [3], and extending it with an additional task: outlier detection.
While largely in line with the original findings, our results do exhibit several statistically significant differences as a result of our more diverse backing datasets. These differences inform our discussion on how crowdsourced graphical perception studies must adapt to and account for the variation found in organic datasets.
As the VizNet corpus grows, assessing the effectiveness of these (data, visualization, task) triplets, even using crowdsourcing, will quickly become time- and cost-prohibitive. To contend with this scale, we conclude by formulating effectiveness prediction as a machine learning task over these triplets. Our results suggest that machine learning offers a promising method for efficiently annotating VizNet content.
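Framed as machine learning, effectiveness prediction means featurizing each (data, visualization, task) triplet and training a model to predict a crowdsourced effectiveness label. The sketch below shows the shape of that formulation with a scikit-learn classifier; the feature columns and labels here are synthetic placeholders invented for illustration, not the paper's actual feature set or study data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder features for (data, visualization, task) triplets,
# e.g. [n_rows, categorical cardinality, an entropy-like statistic,
# visual encoding id, task id]. Illustrative stand-ins only.
n = 500
X = np.column_stack([
    rng.integers(5, 100, n),   # number of rows in the dataset
    rng.integers(2, 20, n),    # cardinality of the categorical field
    rng.random(n),             # entropy-like distribution statistic
    rng.integers(0, 12, n),    # visual encoding identifier
    rng.integers(0, 5, n),     # task identifier
])
# Binary label: did crowdworkers answer correctly? (random here)
y = rng.integers(0, 2, n)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
```

With real crowdsourced labels, a model like this can triage which triplets merit human evaluation, keeping annotation costs manageable as the corpus grows.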
Conclusions
- VizNet provides a common baseline for comparing visualization design techniques and for developing benchmark models and algorithms to study graphical perception at scale.
- We demonstrate how machine learning models can offer a promising method for efficiently annotating (data, visualization, task) triplets at scale.
- VizNet research marks an important direction for understanding the opportunities and challenges of replicating prior work in human-computer interaction and visualization research.
Acknowledgments
We thank Alex Johnson for providing access to the Plotly API, Robert Kosara for providing the Many Eyes data, and the authors of [4] for scraping and providing access to open data repositories.
References
[1] George A Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
[3] Younghoon Kim and Jeffrey Heer. 2018. Assessing Effects of Task and Data Distribution on the Effectiveness of Visual Encodings. Computer Graphics Forum (Proc. EuroVis) (2018).
[4] Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. 2016. Automated Quality Assessment of Metadata across Open Data Portals. Journal of Data and Information Quality (2016).