From Data Visualization to Interactive Data Analysis

[Note: this essay is the written, expanded and refined version of the talk I gave at the Uber Data Visualization meetup organized in NYC on Oct. 26, 2017. You can watch the video here (sorry, very bad quality) and get access to the original slides here.]

TL;DR: Visualization projects with high visibility focus on two main purposes: inspiration and explanation. Visualization can, however, also be used (and is actually used) to increase understanding of complex problems through data analysis. These projects are less visible but by no means less important.

Three main uses of data visualization

I know I am running the risk of falling into gross simplification. However, I still find it useful to identify three main categories of data visualizations in terms of what their main (intended or unintended) purpose is. This will help me clarify a few ideas later on.

  1. Inspirational. The main goal here is to inspire people. To wow them! Not just on a superficial level, but to really engage people in deeper thinking, a sense of beauty, and awe. Visualization has an incredible power to attract people’s attention, but also to draw them into fantastic artificial worlds that turn abstract concepts into more tangible ones. A perfect example of inspiring visualization is the work done by my friend Giorgia Lupi, who has created a whole genre of her own with her fantastic hand-drawn (as well as digital) masterpieces (check this recent one exhibited at MoMA).

Why talk more about data analysis?

This essay, and the talk that preceded it, aims at better defining the role of visualization in data analysis and spurring more conversations about what is happening in this area of visualization, which, unfortunately, is not blessed with the same limelight as the other purposes.

But why focus on analysis? What is so special about it?

My reasoning is that data analysis is a fundamental human-technological activity that has the potential to help people solve important societal and scientific problems. More precisely, my argument is that data analysis is important because it’s the activity that can help people improve their understanding of complex phenomena and, as such, it can help solve important problems. It’s an indirect link, but an important one: if I understand a problem better, there are higher chances I can find a better solution for it.

There is no lack of interesting and important problems in the world we can hope to understand better through data analysis. Here are a couple of examples from my personal experience. I am describing them here not necessarily because they are the most important problems we can address, but mostly because I am familiar with them.

Detecting and understanding medical malpractice. During the last few years my lab has been collaborating with ProPublica, a popular independent newsroom located in New York City. We helped them sift through large sets of medical reviews from Yelp to identify and understand issues people have with medical doctors and their services. How do you make sense of millions of reviews? How do you find suspicious activities? How do you identify interesting comments? It turns out that even something as simple as a well-crafted “faceted search” interface is incredibly useful for this task. We developed a simple little tool called RevEx, which enabled our collaborators to make some progress and publish a couple of interesting articles on their findings.

RevEx. An interactive data exploration tool we developed to help Charles Ornstein from ProPublica sift through millions of reviews from Yelp.
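To make the idea of faceted search a bit more concrete, here is a minimal sketch of what a faceted filter boils down to, written in pandas. The fields and values are invented for illustration and are not RevEx’s actual schema.

```python
import pandas as pd

# Hypothetical review table (not RevEx's actual schema).
reviews = pd.DataFrame({
    "category": ["dentist", "dermatologist", "dentist", "surgeon"],
    "rating":   [1, 5, 2, 1],
    "text":     ["billed me twice", "great visit", "rude staff", "botched procedure"],
})

# A faceted search is essentially a conjunction of filters, one per facet,
# plus counts showing how many reviews fall under each facet value.
facets = {"category": "dentist", "rating": [1, 2]}
subset = reviews[
    (reviews["category"] == facets["category"])
    & (reviews["rating"].isin(facets["rating"]))
]

print(subset[["rating", "text"]])          # the reviews matching the current facets
print(reviews["category"].value_counts())  # the facet counts a UI would display
```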

Understanding scamming and scammers. This is a very recent collaboration we set up with a company called Agari. Their main goal is to go after scammers and disrupt their activities. They collect super interesting data about scammer activities and hope to use this knowledge to better safeguard people and businesses against vicious attacks. This is super important! Talking with them, I now have a much better sense of how bad scamming can be: some people’s lives have literally been ruined by these nefarious activities. It’s not just spam in your inbox.

Side note on “understanding”

Did you notice? When we talk about data analysis problems we often describe the goal as “understanding” something. We can then postulate that the main purpose of data analysis is to better understand something through data.

Relationship between reality, data/statistical models, human mental models.

This is how things look: data and models are a description of some reality we want to study. Humans have a mental model of that reality and use data and models to study it so that they can hopefully understand it better. This whole idea deserves a blog post of its own, which I hope to write in the near future.

How Does Interactive Data Analysis Work?

Interactive data analysis mostly works as a loop. You start with some sort of loosely specified goal, translate the goal into one or more questions, organize and analyze the data to answer these questions, generate new questions, and start over. More precisely, I have identified the following steps:

  • Defining the problem. Every project starts with a problem statement. What problem are you trying to solve? What is your ultimate goal? How is the increased understanding derived from data analysis going to bring you closer to your goal?

Important aspects of data analysis

There are a few important aspects of this process that I would like to highlight:

  1. The process is not sequential and is highly iterative. While I have presented these steps as a sequence, the real process is not sequential at all. People jump from one step to another all the time as more of the problem, its requirements, and its limitations are understood. It’s also highly iterative: you typically come up with an initial question, do the work needed to generate an answer, and in doing so generate new questions and needs, then start over again.

Quo vadis interaction?

You may have noticed I have not mentioned interaction so far. Why? Because interaction is all over the place. Every time you tell your computer what to do and your computer returns some information back to you, you have some form of interaction. Here is a list of actions we typically need to perform in data analysis:

  • Gather and transform the data;

All of these require some form of direct or indirect interaction.

Direct Manipulation vs. Command-Line Interaction

When we talk about interactive data analysis it is important to clarify what we mean by “interactive”. What constitutes an “interactive” user interface? For many, the assumption is that interactive visualization is only about WIMP interfaces, direct manipulation, clicks, mouse-overs, and such. But a command-line interface is also interactive: the user tells the computer what to do and the computer reacts and responds accordingly. What changes is the interaction “modality”, not whether something is interactive or not. It seems to me that what we should really discuss is the advantages and disadvantages of direct manipulation versus command-line interaction in data analysis systems (as well as in systems that seamlessly mix the two). While the advantages and disadvantages of direct manipulation have been discussed at length elsewhere (the Nielsen Norman Group has a good summary), we do not have a good understanding of how they play out in data analysis. Most existing systems rely on command-line interfaces. Why? Is it because they are more effective or because we have not invented better interfaces yet?

Challenges of Interactive Visual Data Analysis

I want to conclude this essay by highlighting a few challenges I think are particularly relevant for interactive data analysis. This is where I believe we need to make more progress in coming years.

Specification (Mind → Data/Model)

When we interact with data through a computer, the first thing we need to do is translate our questions and ideas into specifications the computer can read (SQL is a great example here). This is where language and formalism play a major role. One may think that in order to give instructions to a computer one must necessarily learn some kind of programming language, but in practice many interactive systems employ interactive specification methods that translate user actions into statements the computer understands and that are more natural for human users. A fantastic example of an interactive specification system is the VizQL language employed in Tableau, which translates user selections into formal statements the system can understand and use to generate queries and appropriate visual representations.
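To make the translation step concrete, here is a minimal sketch of the idea using plain SQL run from Python. The table, columns, and question are made up for illustration; the point is only that a question in the analyst’s head (“which medical categories receive the most one-star reviews?”) has to become a formal statement the computer can execute, and that tools like Tableau generate such statements from user selections rather than asking users to type them.

```python
import sqlite3

# Toy, in-memory database; the table and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (category TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("dentist", 1), ("dentist", 5), ("surgeon", 1), ("surgeon", 1)],
)

# The analyst's question, translated into a formal specification (SQL).
query = """
    SELECT category, COUNT(*) AS one_star_reviews
    FROM reviews
    WHERE rating = 1
    GROUP BY category
    ORDER BY one_star_reviews DESC
"""
for row in conn.execute(query):
    print(row)  # ('surgeon', 2), ('dentist', 1)
```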

Should we expect everyone to be a coder?

One relevant question here is: “should we expect everyone to be a coder and learn a specification language in order to perform data analysis?”. I personally believe we have to be as inclusive as possible and realize that there are large segments of the population that might greatly benefit from data analysis tools and yet have no time, resources, or incentives to learn how to use formal languages. Therefore, while I am a huge fan of data science programming tools such as R and the Jupyter/Pandas combo, I am not sure we should expect everyone to reach that level of proficiency in order to do useful things with data. Good examples of how very complex data processing can be made more accessible are Trifacta’s Wrangler and OpenRefine, which enable people to perform lots of data wrangling without writing a single line of code.
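For readers who do write code, here is a rough sketch of the kinds of operations such tools expose as point-and-click steps, written out in pandas. The data and cleaning steps are invented and only meant to show what “data wrangling” typically involves.

```python
import pandas as pd

# Invented, messy input of the kind wrangling tools are built for.
raw = pd.DataFrame({
    "name":  ["  Dr. Smith ", "dr. smith", "Dr. Jones"],
    "visit": ["2017-03-04|NYC", "2017-05-11|NYC", "2017-06-02|Boston"],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()                 # trim + normalize case
clean[["date", "city"]] = clean["visit"].str.split("|", expand=True)  # split one column into two
clean["date"] = pd.to_datetime(clean["date"])                         # fix the type
clean = clean.drop(columns="visit")

print(clean)
```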

Representation (Data/Model → Eyes)

Once results are obtained from querying and modeling, the next step is to find a (visual) representation so that the user can inspect and understand them. This is the realm of data visualization. But while most people, when they hear “data visualization”, think about colorful fancy graphics, it is totally appropriate to expect a simple data table to be an effective means of inspecting the results. I find it somewhat interesting that we use the word “visualization” to mostly mean complex graphical puzzles, when in fact a table is as visual as anything else.

One thing I have been discovering over the years is that when we talk about data visualization we often assume that the choice of which graphical representation to use is the most important one to make. However, deciding what to visualize is often as important as, if not more important than, deciding how to visualize it. Take this simple example: sometimes a graph provides better answers to a question when the information is expressed in terms of percentages rather than absolute values. I think it would be extremely helpful if we could better understand and characterize the role data transformation plays in visualization. My impression is that we tend to overemphasize graphical perception when content is what really makes the difference in many cases.
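Here is a tiny, made-up example of that point. Which specialty “has a problem with negative reviews” gets a different answer depending on whether we chart absolute counts or percentages; the transformation, not the chart type, is what changes the message.

```python
import pandas as pd

# Invented numbers: negative reviews per specialty, out of all reviews received.
df = pd.DataFrame({
    "specialty":        ["dentist", "surgeon", "dermatologist"],
    "negative_reviews": [120, 80, 30],
    "total_reviews":    [400, 150, 60],
})

# Same data, two framings: absolute counts vs. share of all reviews.
df["negative_pct"] = 100 * df["negative_reviews"] / df["total_reviews"]
print(df.sort_values("negative_reviews", ascending=False))  # dentists look worst
print(df.sort_values("negative_pct", ascending=False))      # surgeons and dermatologists do
```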

“How fancy does a visualization need to be in order to be useful for data analysis?”

Another important question here is: “how fancy does a visualization need to be in order to be useful for data analysis?”. I am a huge fan of well-crafted, sleek, attractive, bespoke visualization projects. The beauty of colorful pixels is what made me fall in love with visualization in the first place. Yet, I am not sure how much this counts when the main goal is data analysis. More precisely, while I do think that aesthetics plays a major role in a visualization, I am not sure how much innovation we still need in producing novel metaphors for data visualization.

In my experience (building research-based prototypes for more than 10 years), most visualization problems can be solved with a handful of graphs. It is rare that you really have to come up with a new metaphor. “Graphical workhorses” such as bar charts, line charts, scatter plots, pivot tables, etc., are really hard to beat!

At the same time, this does not mean that visualizing data effectively is easy! What is really hard is to use, tweak, and combine these graphs in clever, effective, and innovative ways. This is much harder than one may be willing to admit. In a way, the innovation and educational effort needed to make progress in visualization should focus more on depth and less on breadth. We need to understand how to use existing methods well more than we need to look for new metaphors and techniques (even though, of course, we need to keep innovating and trying out crazy things).

Interpretation (Eyes → Mind)

This step is crucial and yet often neglected. Once results are represented, people need to interpret them and understand what they mean. This is a very complex cognitive process that links together several pieces of knowledge. Think about it: what does one need to know in order to reason effectively about the results of modeling and visualization? At a minimum, you need to be able to understand the representation and the model, the way they link to the real-world entities they represent, and, finally but crucially, how they relate to the knowledge you already have in your head. Let me focus on visualization and modeling.

“Are people able to interpret and trust their visualizations and models?”

The important question here is: “Are people able to interpret and trust their visualizations and models?”. In order to interpret a visualization effectively, you first need to understand the visual metaphor, and second, the visual metaphor itself needs to convey information in the least ambiguous/uncertain way possible. Unfortunately, not all visual representations are like that. A notable example is multidimensional projections (using algorithms such as t-SNE and MDS), which use a somewhat intuitive metaphor (proximity equals similarity) but are also unbearably ambiguous. Here is an example of a projection showing similarity between words extracted from IMDB reviews:

Example of t-SNE projection

What do you learn when you look at it? And when you do happen to learn something, how sure are you that what you learned represents some real phenomenon and not just some statistical fluke?
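As a point of reference, producing such a projection takes only a few lines; the hard part is knowing what you can and cannot read from it. Below is a minimal sketch with scikit-learn, using random vectors as a stand-in for the word embeddings behind the figure above.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: in the figure above the input would be word embeddings
# learned from IMDB reviews; here we just use random 100-dimensional vectors.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(500, 100))

# Project to 2D; proximity in the result is read as similarity.
projection = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(word_vectors)
print(projection.shape)  # (500, 2)

# The catch: distances between far-apart clusters and the apparent density of
# each cluster are not reliable, which is exactly the ambiguity discussed above.
```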

When we look at the interpretation of models we have an even bigger problem. Machine learning methods use incredibly sophisticated procedures to transform data into more abstract structures, but in the process we completely lose the ability to understand their content, quality, and plausibility. Take “topic modeling”. It’s a nightmare. The method takes as input a document collection and returns a set of “topics”, each captured as a set of words. The problem is that most of the time the sets of words returned make no sense at all. Here is an example from a recent project we have been working on in my lab. These are some of the topics extracted from a collection of articles from Vox:

Example of topics generated by topic modeling (with the LDA method)

What do you think? Do they make sense? Can you extract anything useful out of them? (In all fairness, the method returned many more topics that make more sense, but I chose these to illustrate the problem in a more dramatic fashion.)

How do you deal with this? That’s an important question, one that requires the collaboration of ML experts but also of people who understand perception and cognition, so that these methods can more effectively produce a human-technological system able to really enhance the human mind.
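For readers who have never run topic modeling themselves, here is a minimal sketch of what the method takes in and gives back, using scikit-learn’s LDA implementation on a few invented documents (this is not the pipeline we used in the Vox project).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A handful of invented documents standing in for a real article collection.
docs = [
    "senate passes health care bill after long debate",
    "new study links diet and heart disease risk",
    "tech company releases new phone with better camera",
    "lawmakers debate health insurance reform",
]

# Documents -> word counts -> topics.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

# Each "topic" is just a ranked bag of words; interpreting it is left to the human.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {top}")
```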

Suggestions

I have two sets of suggestions: one for practitioners and one for researchers.

For practitioners.

  1. Focus more on (more relevant) problems. There is no lack of relevant problems to solve in the world, and data analysis can play a major role in making progress on them. For better or worse, data is everywhere, and large portions of the physical world are leaving digital traces that can help us understand some things better. Work for, or collaborate with, people who want to solve important problems. Pick a domain you like and try to produce better understanding.

For researchers.

  1. Develop better specification methods. Translating what is in people’s heads into instructions a machine can understand is still quite challenging. Lots of progress has been made in terms of programming languages, but creating specifications without coding remains hard. Two great examples of interactive specification systems invented in recent years are Tableau’s visual query language and Trifacta’s interactive methods for data transformation. These cover two very important needs, but there is no lack of other situations where interactive specifications are needed. For instance, specifying what one wants to do with text collections is still pretty challenging.

Conclusion

In this little essay I am arguing that visualization practitioners and researchers should adopt a wider perspective on their role in the data science space. Visualization experts can help people solve complex and important societal problems by focusing on supporting people as they analyze their data. This can be done by (1) understanding that visualization is one (important) step in a much larger and more complex process, (2) seeking out collaborations with people who need their help, and (3) developing tools that let them do remarkable things with data. I hope you found this read, though admittedly a bit long, inspiring. We need an army of visualization enthusiasts like you to do important work with an impact on the world!

Associate Professor at NYU Tandon. Research + Teaching Data Visualization and Visual Analytics. Co-Host of Data Stories Podcast.