What eye tracking can tell you about visualizations (and other images)


TLDR: New crowdsourcing techniques and computational algorithms are making it possible to measure and predict human attention at scale, and this comes with exciting new opportunities to study how people interact with visualizations.

What could you do with an artificially intelligent oracle that made predictions about where people look? Well, you could feed it your new visualization design and determine which design elements or data points are most eye-catching to naive observers at first glance, and which ones capture attention over a longer period of time. You might even discover which features of the visualization observers find most relevant to a particular task, and which facts they would likely remember later. Such insights might help you refine your visualization, make it a more effective analysis or communication tool, and provide some objective evaluation of your visualization or interface (beyond just asking people if they like it).

“Give me this oracle now,” you say. Truth is, we have nothing quite like this at the moment, but there is good reason to believe this will soon change. This blog post will be about the factors that need to come together to make attention-predicting artificially intelligent systems a reality — particularly for complex, interactive, and multi-modal visualizations and interfaces. Along the way, I’ll tell you a bit about what kind of machinery is already out there for predicting attention on photographs of objects and natural scenes, the cool applications such machinery is enabling, and the efforts that are underway to generalize this machinery to other types of images, including visualizations.

Eye tracking has long captivated psychologists, computer scientists, and advertisers. In 1967, using a mechanical device physically coupled to the eyeballs of his participants, Yarbus showed that observers examined a photograph differently when given different visual tasks. The idea of the eyes as a window into the observer’s mind has permeated many research studies and inspired companies (e.g., EyeQuant, Eyetato, Realeyes, Feng-GUI) to offer “learn about your users” services by capturing users’ eye movements. Jacob and Karn’s 2003 review on the use of eye tracking in HCI and usability research stated: “We see promising research work, but we have not yet seen wide use of these approaches in practice or in the marketplace.” What has changed since then?

We are currently at the confluence of advancements on a few different fronts. There are (1) new algorithms, (2) new hardware, (3) new data capturing approaches, and (4) an increased push for interdisciplinary research.

The DeepFix deep learning-based saliency model predicts where people are likely to look on natural images. More examples of successes and failures of deep learning-based saliency models can be found here.

New algorithms, new hardware. You’ve probably already heard this story. Deep learning and GPU programming have been taking many computer science subfields by storm, demonstrating leaps in performance on a whole host of prediction tasks compared to previous state-of-the-art methods. Attention prediction has been no different. The MIT Saliency Benchmark (which I ran until recently) keeps a running scoreboard of the best-performing computational models for predicting human eye movements on natural images — i.e., saliency models. The benchmark has seen a proliferation of deep learning models since 2014 (starting with eDN by Vig et al.). In fact, of the 93 submissions made to the benchmark since 2012, over 25 are deep learning models (just look for “net” and “deep” as part of the model name), 10 of which occupy the top spots on the scoreboard.

Saliency models have come a long way: from predicting that bright, high-contrast patches against neutral backgrounds pop out and capture people’s attention, to seemingly “understanding” the subtleties of the kinds of objects and actions that attract people’s attention in photographs. These successes have already been picked up by companies: for instance, Twitter uses saliency models to pick better image crops for previews, and Adobe auto-crops video from landscape to vertical format using saliency. While the predictions of deep saliency models already look quite remarkable when compared to real human eye movements, these models do not fully “understand” these images yet, and there is still work to be done.
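To make “compared to real human eye movements” concrete: saliency benchmarks score a predicted map against recorded fixation locations using metrics such as Normalized Scanpath Saliency (NSS), which z-scores the map and averages it at the fixated pixels. Here is a minimal sketch (the function name and the epsilon guard are my own):

```python
import numpy as np

def normalized_scanpath_saliency(saliency_map, fixation_points):
    """Score a predicted saliency map against human fixations (NSS).

    The map is z-scored, then averaged at the fixated (row, col) pixels.
    Higher is better; 0 corresponds to chance-level prediction.
    """
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    rows = [p[0] for p in fixation_points]
    cols = [p[1] for p in fixation_points]
    return float(s[rows, cols].mean())
```

A model that places high values exactly where people looked scores well above zero; a flat (uninformative) map scores zero.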

New data capturing approaches. Deep neural networks are nothing without appropriately sized training datasets, so large attention datasets are to thank for the recent successes in saliency modeling. Unfortunately for attention prediction, capturing the eye movements of human observers requires specialized hardware set up in a dedicated lab, and many man- (or, in this case, woman-) hours of data collection. Fortunately for attention prediction, alternative interaction modalities can serve as effective proxies for eye movements.

An example of the BubbleView methodology that serves as a proxy for eye movement data. Related methodologies for capturing attention at scale can be found here.

In an upcoming CHI paper, we take a look at different user interfaces for crowdsourcing attention. For instance, a cursor-based moving-window approach (e.g., BubbleView) can be used to have participants explore a blurred image (such as the visualization in the figure included here) using their mouse cursor to de-blur small bubble regions of the image, one at a time. This approach was used to capture attention data to train a saliency model for graphic designs and information visualizations. A closely related implementation of a cursor-based moving-window is SALICON, which was used to collect attention data on 10K natural images for training those deep saliency models we were talking about. These new methodologies are democratizing the collection of attention data beyond research labs that are equipped with eye trackers.
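To give a feel for how such cursor-based data turns into training signal, here is a hedged sketch of aggregating BubbleView-style click locations into a smooth attention map. The (row, col) click format and the Gaussian bubble width are assumptions of this sketch, not the papers’ exact procedure:

```python
import numpy as np

def clicks_to_attention_map(clicks, height, width, sigma=25.0):
    """Aggregate de-blurred bubble positions into an attention map.

    Each click contributes a Gaussian blob whose width (`sigma`)
    roughly matches the bubble radius; overlapping clicks accumulate.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    attention = np.zeros((height, width))
    for r, c in clicks:
        attention += np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
    if attention.max() > 0:
        attention /= attention.max()  # normalize to [0, 1] for visualization
    return attention
```

Averaging such maps over many crowdworkers yields the dense attention maps that saliency models are trained on.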

Aside from powering the training of computational models, you can already use these approaches today to run easy-to-scale crowdsourced studies about how people attend to, interact with, and explore visualizations under different task constraints. For instance, this MV blog post describes another use case of the BubbleView technique for examining the differences between liberals and conservatives when studying visualizations of climate change (a.k.a., what attention patterns can tell you about someone’s political leaning). Another example of using crowdsourced attention is the work Anelise Newman presented at the VISxVISION workshop at VSS’19 on how the ZoomMaps interface can be used to capture attention on large-scale visualizations. Harnessing the mobile phone as a restricted window, we have people pinch and zoom to explore large-scale visualizations. By converting their zoom patterns into attention maps, we can study how different users interact with visualizations.

Different users’ attention maps computed from their zoom patterns as they viewed a visualization of multiples on their mobile devices. The ZoomMaps methodology offers another proxy for eye movements.
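A rough sketch of the idea behind converting zoom patterns into attention maps: viewport regions accumulate weight in proportion to how deeply and how long a user zoomed into them. The log schema below is hypothetical, and the actual ZoomMaps pipeline differs in its details:

```python
import numpy as np

def zoom_to_attention_map(viewport_log, height, width):
    """Turn a zoom-interaction log into an attention map.

    Each log entry is (top, left, bottom, right, zoom_level, dwell_seconds)
    in image coordinates — an assumed schema for this sketch. Deeper,
    longer zooms over smaller regions accumulate more weight, mirroring
    the intuition that zooming in signals interest.
    """
    attention = np.zeros((height, width))
    for top, left, bottom, right, zoom, dwell in viewport_log:
        attention[top:bottom, left:right] += zoom * dwell
    if attention.max() > 0:
        attention /= attention.max()
    return attention
```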

An increased push for interdisciplinary research. New algorithms, new hardware, and new data capturing approaches can push research communities into new domains, but those communities need to be receptive to change. Fortunately, the popularity of interdisciplinary topics as workshops at top computer science conferences is on the rise, with a particular focus on human perception & cognition (e.g., SVHRHM at NeurIPS, VISxVISION at VIS, MBCCV at CVPR, Attention Workshop at The Web Conference). A proliferation of powerful models and A.I. algorithms is making researchers hungry for new application domains and high-level modeling insights beyond incremental technical improvements. Psychology, cognitive science, and neuroscience are proving to be a fertile ground of new and old ideas up for grabs. If you’re looking for inspiration for your own research, come join us in sunny Florida in May for the annual Vision Science conference; the VISxVISION team will be there, and we’ve held some popular events in the past. I will also be co-organizing a SIG (Special Interest Group) at CHI’20 on EMICS: Eye Movements as an Interface to Cognitive State.

The VISxVISION team along with our invited speakers at our VIS’19 workshop on “Novel Directions in Vision Science and Visualization Research”.

Looking forward. Given that the most popular workshops at the last VIS were either ML-related or perception-related, I wouldn’t be surprised to start seeing more work at the intersection of the two fields. What might such interdisciplinary work look like? Crowdsourcing interfaces like the ones discussed above can be used to capture data about how people explore and interact with visualizations; how they complete analysis tasks; what they notice, get stuck on, find confusing or engaging; how they search for signal within noise; and which data trends “pop out” to them. Computational models can be trained on this human data to perceive, break down, and make predictions about visualizations and tasks using human-like biases and heuristics. Predictions from such models can then be projected back onto the original visualizations to produce feedback on how to improve or simplify those visualizations, or make the tasks more effective. A first attempt at such a workflow can be found in this UIST 2017 paper: the BubbleView technique was used to capture attention patterns on over a thousand visualizations; a deep learning model was trained on this data to predict attention patterns on new visualizations; and these predictions were in turn used to automatically compute thumbnails for visualizations, to facilitate search through a database.

A demonstration of how a thumbnail can be automatically computed for a visualization. A predicted attention map for the visualization is used to select the most visually important regions to include in the thumbnail. The thumbnail is intended to offer a preview of the visualization to more easily locate it in a big list of thumbnails.
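One simple way to implement such thumbnail selection, sketched here under the assumption that we only want the single crop window capturing the most predicted attention (the actual UIST 2017 pipeline is more involved):

```python
import numpy as np

def best_thumbnail_crop(attention_map, crop_h, crop_w):
    """Pick the crop window that captures the most predicted attention.

    Uses a 2-D summed-area table so each candidate window is scored
    in O(1); returns the (top, left) corner of the best crop.
    """
    h, w = attention_map.shape
    # Summed-area table with a zero border for easy window sums.
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = attention_map.cumsum(0).cumsum(1)
    best, best_pos = -1.0, (0, 0)
    for top in range(h - crop_h + 1):
        for left in range(w - crop_w + 1):
            total = (sat[top + crop_h, left + crop_w] - sat[top, left + crop_w]
                     - sat[top + crop_h, left] + sat[top, left])
            if total > best:
                best, best_pos = total, (top, left)
    return best_pos
```

Feeding in a predicted attention map for a visualization would return the most visually important region to use as its thumbnail.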

Other applications of predicting attention maps on graphic designs include reflowing a design to new aspect ratios or sizes. These ideas can be extended in the future to automatically retargeting visualizations for different form factors (e.g., converting a desktop visualization to a mobile version). Saliency models have also been used to prioritize which visual content to transmit first during progressive image loading, for image and video compression, and for image enhancement applications. Such ideas can be applied to volume rendering and scientific visualizations.

Alternatively, computational models can be used to pre-filter large datasets and choose projections of the data that have more perceptually interesting trends to offer to human analysts. These ideas are not new, but given the confluence of all the factors discussed above, I think we are quickly moving towards a future in which they can become quite practical.

The eyes as a window into the mind. Coming back full circle, the eyes indeed have a lot to tell us about how people interact with visual content. Features of eye movements, including the amount of time spent at a single point (fixation duration), the speed at which the eyes move between locations on the image (saccade velocity), and the number of points explored and how widely they are distributed across the image, can all provide clues about the user’s viewing process. In this book chapter on eye fixation metrics, we survey how different features of eye movements have previously been linked to processes such as engagement and to the complexity of the content being viewed. In our VIS’15 paper we also showed a relationship between where people look on visualizations and how they encode/retrieve the visualization from memory. In our 2015 Vision Research paper, we showed that from eye movements alone we could predict whether an individual would remember an image at a later time point, and in a vision science bioRxiv paper we showed how the pupil dilations and blinks of individuals at recall time were indicative of how difficult it was to retrieve an image from memory.
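The fixation-duration and saccade-velocity features mentioned above can be computed from raw gaze samples with just a few lines of code. This sketch uses a simple velocity-threshold (I-VT-style) segmentation; the threshold value and sample format are illustrative:

```python
import numpy as np

def saccade_velocities(timestamps, xs, ys):
    """Point-to-point gaze velocities (pixels/second) from raw samples."""
    t = np.asarray(timestamps, dtype=float)
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    dist = np.hypot(np.diff(x), np.diff(y))
    return dist / np.diff(t)

def fixation_durations(timestamps, xs, ys, velocity_threshold=100.0):
    """Durations of contiguous below-threshold (fixation) intervals.

    Samples moving slower than `velocity_threshold` are grouped into
    fixations; each fast (saccadic) interval ends the current fixation.
    """
    v = saccade_velocities(timestamps, xs, ys)
    dt = np.diff(np.asarray(timestamps, dtype=float))
    durations, current = [], 0.0
    for vi, dti in zip(v, dt):
        if vi < velocity_threshold:
            current += dti
        elif current > 0:
            durations.append(current)
            current = 0.0
    if current > 0:
        durations.append(current)
    return durations
```

Summary statistics over these durations and velocities are the kinds of features that have been linked to engagement, memorability, and content complexity.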

Eye movement patterns at recognition (during a memory task) on the most and least recognizable visualizations from the MASSVIS dataset. The visualizations on the left were almost immediately recognizable, without the observers having to look around at all; the visualizations on the right had to be explored in much greater detail before the observers could confirm that they had indeed seen them before.

Eye movements can provide clues about the observer’s mental model, expertise, and even cultural differences. Signatures of the eyes can be used to detect fatigue and cognitive load. Eye tracking can be used as a measure of how people analyze code, detect traffic hazards, or read data visualizations. Computational algorithms that aim to solve these same tasks can learn, based on observing human behavior, the features particularly salient for each task.

The visualization literature offers a particularly rich set of visual analytics tasks on which humans can be measured and later modeled. Unlike a natural image, a visualization requires engagement at many different levels, from integrating textual and visual elements to analyzing trends, comparing data points, and making inferences. These higher-level cognitive tasks make it possible to capture observers’ attention over longer time periods than is typical for eye tracking experiments on natural images (usually 3–5 seconds per image). Within longer viewing intervals, richer visual exploration patterns can emerge, motivating ever more sophisticated models of cognitive saliency, until one day we have that artificially intelligent oracle to answer all our burning questions about visualizations.



Zoya Bylinskii
Multiple Views: Visualization Research Explained

Zoya is a Research Scientist at Adobe Inc. Her work lies at the interface of human perception & cognition and computer vision & machine learning.