From Data Visualization to Interactive Data Analysis
[Note: this essay is the written, expanded and refined version of the talk I gave at the Uber Data Visualization meetup organized in NYC on Oct. 26, 2017. You can watch the video here (sorry, very bad quality) and get access to the original slides here.]
TL;DR: Visualization projects with high visibility focus on two main purposes: inspiration and explanation. Visualization can, however, be used (and is actually used) to increase understanding of complex problems through data analysis. These projects are less visible but by no means less important.
Three main uses of data visualization
I know I am running the risk of falling into gross simplification. However, I still find it useful to identify three main categories of data visualizations in terms of what their main (intended or unintended) purpose is. This will help me clarify a few ideas later on.
- Inspirational. The main goal here is to inspire people. To wow them! Not just on a superficial level, but to really engage people in deeper thinking and a sense of beauty and awe. Visualization has an incredible power to attract people’s attention but also to draw them into fantastic artificial worlds that turn abstract concepts into more tangible ones. A perfect example of inspiring visualization is the work done by my friend Giorgia Lupi, who has created a whole genre of her own with her fantastic hand-drawn (as well as digital) masterpieces (check this recent one exhibited at MoMA).
- Explanatory. The main goal here is to use graphics as a way to explain some complex idea, phenomenon or process. This is an area where graphical representation shines: we are visual creatures and a picture is sometimes really worth a thousand words. Data journalism has provided over the years most of the best contributions to the art of explaining complex things through data (see the amazing work done by the New York Times and Washington Post over the years). But this is also the realm of education, especially scientific education, which is often based on numbers and graphs. And this is also the area of that beautiful recent trend called “explorable explanations”, pioneered by Bret Victor and popularized by many other fantastic people like Nicky Case.
- Analytical. The main goal here is to extract information out of data with the purpose of answering questions and advancing understanding of some phenomenon of interest. Sure, explanatory visualization is also about helping people understand something. But the main difference here is that in explanatory visualization the author already knows what to visualize (after having performed some analysis), whereas in analysis the main use of visualization is to understand the data in the first place. There are a million names people have used to define this activity. The most recent and fashionable one is Data Science and more specifically that part of Data Science called “Exploratory Data Analysis”, a term invented by the great John Tukey a few decades ago. To make things simple I’ll just call this by its good old name: Data Analysis, or maybe, Visual Data Analysis, or even, Interactive Visual Data Analysis to emphasize the use of graphical representations one can interact with (some in academia and business also call this Visual Analytics).
Why talk more about data analysis?
This essay, and the talk that preceded it, aims at better defining the role of visualization in data analysis and spurring more conversations about what is happening in this area of visualization which, unfortunately, is not blessed with the same limelight as the other purposes.
But why focus on analysis? What is so special about it?
My reasoning is that data analysis is a fundamental human-technological activity that has the potential to help people solve important societal and scientific problems. More precisely, my argument is that data analysis is important because it’s the activity that can help people improve their understanding of complex phenomena and, as such, it can help solve important problems. It’s an indirect link, but an important one: if I understand a problem better, there are higher chances I can find a better solution for it.
There is no lack of interesting and important problems in the world we can hope to understand better through data analysis. Here are a couple of examples from my personal experience. I am describing them here not necessarily because they are the most important problems we can address, but mostly because I am familiar with them.
Detecting and understanding medical malpractice. During the last few years my lab has been collaborating with ProPublica, a popular independent newsroom located in New York City. We helped them sift through large sets of medical reviews from Yelp to identify and understand issues people have with medical doctors and their services. How do you make sense of millions of reviews? How do you find suspicious activities? How do you identify interesting comments? It turns out that even something as simple as a well-crafted “faceted search” interface is incredibly useful for this task. We developed a simple little tool called RevEx, which enabled our collaborators to make some progress and publish a couple of interesting articles on their findings.
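To make the idea of faceted search concrete, here is a minimal sketch in Python. It is not how RevEx is implemented; the records, field names and the `search` and `facet_counts` helpers are all made up for illustration. The point is only the mechanism: a keyword filter combined with selectable facet values, plus counts that tell the user where the remaining results sit.

```python
from collections import Counter

# Hypothetical review records; fields and values are invented for illustration.
REVIEWS = [
    {"specialty": "dentist", "city": "New York", "stars": 1, "text": "billed twice for one visit"},
    {"specialty": "dentist", "city": "Boston", "stars": 5, "text": "friendly staff"},
    {"specialty": "surgeon", "city": "New York", "stars": 2, "text": "billed for a procedure I never had"},
    {"specialty": "surgeon", "city": "New York", "stars": 4, "text": "great follow-up care"},
]

def search(reviews, keyword="", **facets):
    """Keep reviews that contain the keyword and match every selected facet value."""
    return [
        r for r in reviews
        if keyword.lower() in r["text"].lower()
        and all(r[field] == value for field, value in facets.items())
    ]

def facet_counts(reviews, field):
    """Count how many of the current results fall under each value of a facet."""
    return Counter(r[field] for r in reviews)

# Start broad with a keyword, look at the facet counts, then narrow down.
hits = search(REVIEWS, keyword="billed")
print(facet_counts(hits, "specialty"))
narrowed = search(REVIEWS, keyword="billed", city="New York", stars=1)
print(narrowed)
```

Even in this toy form, the loop of "search, look at the counts, narrow down" is what lets a person work through a collection far too large to read.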
Understanding scamming and scammers. This is a very recent collaboration we set up with a company called Agari. Their main goal is to go after scammers and disrupt their activities. They collect super interesting data about scammer activities and are hoping to use this knowledge to better safeguard people and businesses against vicious attacks. This is super important! Talking with them I now have a better knowledge of how bad scamming can be for some people. Some people’s lives have been literally ruined by scammers’ nefarious activities. It’s not just spam in your inbox.
Side note on “understanding”
Did you notice? When we talk about data analysis problems we often describe the goal as “understanding” something. We can then postulate that the main purpose of data analysis is to better understand something through data.
This is what things look like: data and models are a description of some reality we want to study. Humans have a mental model of the reality and use data and models to study it so that they can hopefully understand it better. This whole idea deserves a whole blog post, which I hope to write in the near future.
How Does Interactive Data Analysis Work?
Interactive data analysis works mostly in a loop fashion. You start with some sort of loosely specified goal, translate the goal into one or more questions, organize and analyze the data to answer these questions, generate new questions and start over. More precisely I have identified the following steps:
- Defining the problem. Every project starts with a problem statement. What problem are you trying to solve? What is your ultimate goal? How is increased understanding derived from data analysis going to bring you closer to your goal?
- Generating questions. A problem specification is typically too high-level and loosely specified to translate directly into some data analysis actions (a problem that is often overlooked and not well understood). Typically, the problem needs to be translated (implicitly or, better, explicitly) into a number of data analysis questions.
- Gathering, transforming and familiarizing with the data. Some projects have data available, whereas some others require some degree of data search or generation. In any case, all projects require the analyst to become familiar with the content and its meaning and to perform multiple transformations, both to get to know the data (e.g., often slicing, dicing and aggregating the data) and to prepare it for the analysis one is planning to perform.
- Creating models out of data. Not all projects require this step, but some do. Using statistical modeling and machine learning methods can be useful when the question asked can be answered more easily by building a model. While most of what modeling people talk about is prediction, models can be extremely powerful tools for exploration and hypothesis generation. Examples of methods that can be used for this step are clustering, dimensionality reduction, simple regressions and various NLP (natural language processing) methods to convert text into meaningful numbers.
- Visualizing data and models. This is where data is observed through your eyes. Now, most people think of fancy charts when thinking about this stage, but simple representations like tables and lists are perfectly reasonable visualizations for many problems. This is where the results obtained from data transformation and querying (or from some model) are turned into something our eyes can digest and hopefully understand. This is the step we all, data visualizers, love and live by.
- Interpreting the results. Once the results have been generated and represented in some visual format, they need to be interpreted by someone. This is a crucial step and an often overlooked one. Behind the screen there is a human who needs to understand what all those colored dots and numbers mean. This is a complex activity which includes steps such as: understanding how to read the graph, understanding what the graph communicates about the phenomenon of interest, linking the results to questions and pre-existing knowledge of the problem. Note that interpretation here is heavily influenced by pre-existing knowledge. This includes at least knowledge about the domain problem, the data transformation process, the modeling, the visual representation. This is another aspect of visualization and analysis that is often overlooked.
- Generating inferences and more questions. All of these steps ultimately lead to creating some new knowledge and, most of the time, generating additional questions or hypotheses. This is an interesting property of data analysis: its outcome is not only answers but also questions; hopefully better and more refined ones. One important aspect of this step is that one may generate incorrect inferences here, so not all processes necessarily lead to positive outcomes and not all analyses are equally effective.
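The steps above can be sketched as a tiny loop in Python. This is only an illustration under made-up assumptions: the delivery records, the `average_by` and `show` helpers and the questions are all hypothetical. The code covers just the mechanical steps (transforming and visualizing); defining the problem, interpreting the table and coming up with the follow-up question remain the analyst's job and appear here only as comments.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (delivery times in minutes); invented for illustration.
ORDERS = [
    {"city": "A", "hour": 12, "minutes": 30},
    {"city": "A", "hour": 19, "minutes": 55},
    {"city": "A", "hour": 19, "minutes": 65},
    {"city": "B", "hour": 12, "minutes": 28},
    {"city": "B", "hour": 19, "minutes": 32},
]

def average_by(rows, key, value):
    """Transformation step: group rows by `key` and average `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: mean(v) for k, v in groups.items()}

def show(table, label):
    """Visualization step: a plain table is often enough."""
    print(label)
    for k, v in sorted(table.items()):
        print(f"  {k}: {v:.1f}")

# Iteration 1. Question: which city has slower deliveries?
by_city = average_by(ORDERS, "city", "minutes")
show(by_city, "average minutes by city")
slowest = max(by_city, key=by_city.get)

# Interpretation happens in the analyst's head and yields a new question:
# within the slowest city, does the delay depend on the hour of the day?
subset = [row for row in ORDERS if row["city"] == slowest]
by_hour = average_by(subset, "hour", "minutes")
show(by_hour, f"average minutes by hour in city {slowest}")
```

Note how the second question could not have been asked before the first table was read: the answer to one iteration is what defines the next.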
Important aspects of data analysis
There are a few important aspects of this process that I would like to highlight:
- The process is not sequential and is highly iterative. While I have presented these steps as a sequence, the real process is not sequential at all. People jump from one step to another all the time as more of the problem, requirements and limitations are understood. It’s also highly iterative. You typically come up with an initial question, do the work to generate an answer and as you go through this process generate new questions and needs and start over again.
- Some activities are exclusively human. Did you notice? Quite a good number of steps in the process are exclusively human (see the red boxes in the figure above): defining problems, generating questions, interpreting the results and generating inferences and new questions. It’s all about human activities, not technological ones. Which leads us to ask: how much do we know about how humans think with data? And how can we expand our knowledge so that we can improve this process?
- Visualization is just a small portion of the process. For data visualizers like us this is a crucial observation. As much as we love the visualization step, we have to recognize that when visualization is used for data analysis, it represents just a small portion of a much more varied set of activities. This is not to say that visualization is not important or challenging, but it’s crucial to realize what the big picture is. The effectiveness of the whole process depends on all of the steps above, not just visual representation.
Quo vadis interaction?
You may have noticed I did not mention interaction so far. Why? Because interaction is all over the place. Every time you tell your computer what to do and your computer returns some information back to you, you have some form of interaction. Here is a list of actions we typically need to perform in data analysis:
- Gather and transform the data;
- Specify a model and/or a query from the data;
- Specify how to represent the results (and the model);
- Browse the results;
- Synthesize and communicate the facts gathered.
All of these require some form of direct or indirect interaction.
Direct Manipulation vs. Command-Line Interaction
When we talk about interactive data analysis it is important to clarify what we mean by “interactive”. What constitutes an “interactive” user interface? For many, the assumption is that interactive visualization is only about WIMP interfaces, direct manipulation, clicks, mouse overs, and such. But a command line interface is also interactive: the user tells the computer what to do and the computer reacts and responds accordingly. What changes is the interaction “modality”, not whether something is interactive or not. It seems to me that what we should discuss is what advantages and disadvantages direct manipulation and command-line style interactions have in data analysis systems (as well as systems that seamlessly mix the two). While advantages and disadvantages of direct manipulation have been discussed at length elsewhere (the Nielsen Norman Group has a good summary), we do not have a good understanding of how this plays out in data analysis. Most existing systems rely on command-line interfaces. Why? Is it because they are more effective or because we have not invented better interfaces yet?
Challenges of Interactive Visual Data Analysis
I want to conclude this essay by highlighting a few challenges I think are particularly relevant for interactive data analysis. This is where I believe we need to make more progress in coming years.
Specification (Mind → Data/Model)
When we interact with data through a computer the first thing we need to do is to translate our questions and ideas into specifications the computer can read (SQL is a great example here). This is where language and formalism play a major role. One may think that in order to give instructions to a computer one must necessarily learn some kind of programming language, but in practice many interactive systems employ interactive specification methods that translate user actions into statements computers understand and that are more natural for human users. A fantastic example of an interactive specification system is the VizQL language employed in Tableau, which translates user selections into formal statements the system can understand and use to generate queries and appropriate visual representations.
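As a rough illustration of interactive specification, the Python sketch below turns a UI-style selection into a SQL statement and runs it on an in-memory SQLite database. It is not VizQL and makes no claim about how Tableau works internally; the `orders` table, the selection format and the `selection_to_sql` function are assumptions made up for this example.

```python
import sqlite3

# Columns and aggregates a user may pick; checked before building SQL text.
ALLOWED_COLUMNS = {"region", "product", "sales"}
ALLOWED_AGGREGATES = {"SUM", "AVG", "COUNT"}

def selection_to_sql(selection):
    """Translate a UI-style selection into a SQL string plus its parameters."""
    group = selection["group_by"]
    measure = selection["measure"]
    agg = selection["aggregate"]
    if group not in ALLOWED_COLUMNS or measure not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    if agg not in ALLOWED_AGGREGATES:
        raise ValueError("unknown aggregate")
    sql = f"SELECT {group}, {agg}({measure}) FROM orders"
    params = []
    if "filter" in selection:
        column, value = selection["filter"]
        if column not in ALLOWED_COLUMNS:
            raise ValueError("unknown column")
        sql += f" WHERE {column} = ?"  # the value is passed as a bound parameter
        params.append(value)
    sql += f" GROUP BY {group} ORDER BY {group}"
    return sql, params

# Hypothetical data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, sales REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("east", "tea", 10.0), ("east", "coffee", 5.0),
    ("west", "tea", 7.0), ("west", "tea", 3.0),
])

# What a user might produce by dragging "region" and "sales" onto a chart
# and ticking a "tea" filter, without typing any SQL.
selection = {"group_by": "region", "measure": "sales", "aggregate": "SUM",
             "filter": ("product", "tea")}
sql, params = selection_to_sql(selection)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The user only ever touches the selection; the formal statement is generated on their behalf, which is the essence of an interactive specification method.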
Should we expect everyone to be a coder?
One relevant question here is: “should we expect everyone to be a coder and learn a specification language in order to perform data analysis?”. I personally believe we have to be as inclusive as possible and realize that there are large segments of the population that might greatly benefit from data analysis tools and yet have no time, resources or incentives to learn how to use formal languages. Therefore, while I am a huge fan of data science programming tools such as R and the Jupyter/Pandas combo, I am not sure we should expect everyone to reach that level of proficiency in order to do useful things with data. Good examples of how very complex data processing can be made more accessible to people are Trifacta’s Wrangler and OpenRefine, which enable people to perform lots of data wrangling without writing a single line of code.
Representation (Data/Model → Eyes)
Once results are obtained from querying and modeling, the next step is to find a (visual) representation so that the user can inspect and understand them. This is the realm of data visualization. But while most people, when they hear “data visualization” think about colorful fancy graphics, it is totally appropriate to expect a simple data table to be an effective means to inspect the results. I find it somewhat interesting that we use the word “visualization” to mostly mean complex graphical puzzles, when in fact a table is as visual as anything else.
One aspect of data visualization I have been discovering over the years is that when we talk about data visualization we often think that the choice of which graphical representation to use is the most important one to make. However, deciding what to visualize is often equally, if not more, important than deciding how to visualize it. Take this simple example. Sometimes a graph provides better answers to a question when the information is expressed in terms of percentages rather than absolute values. I think it would be extremely helpful if we could better understand and characterize the role data transformation plays in visualization. My impression is that we tend to overemphasize graphical perception when content is what really makes a difference in many cases.
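A small worked example in Python shows how the choice between absolute values and percentages can change the answer. The clinics and their numbers are invented for illustration, and `complaint_share` is a hypothetical helper.

```python
# Hypothetical counts of complaints and total reviews per clinic.
CLINICS = {
    "Clinic A": {"complaints": 40, "reviews": 1000},
    "Clinic B": {"complaints": 15, "reviews": 100},
}

def complaint_share(record):
    """Percentage of a clinic's reviews that are complaints."""
    return 100 * record["complaints"] / record["reviews"]

# The same question ("which clinic looks worst?") under two transformations.
by_count = max(CLINICS, key=lambda name: CLINICS[name]["complaints"])
by_share = max(CLINICS, key=lambda name: complaint_share(CLINICS[name]))

for name, record in CLINICS.items():
    print(f"{name}: {record['complaints']} complaints, "
          f"{complaint_share(record):.1f}% of reviews")
print("worst by count:", by_count)
print("worst by share:", by_share)
```

A bar chart of either column would be equally well drawn, yet the two charts would point to different clinics. The decision about what to plot was made before any graphical choice.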
“How fancy does a visualization need to be in order to be useful for data analysis?”
Another important question here is: “how fancy does a visualization need to be in order to be useful for data analysis?”. I am a huge fan of well-crafted, sleek, attractive, bespoke visualization projects. The beauty of colorful pixels is what made me fall in love with visualization in the first place. Yet, I am not sure how much this counts when the main goal is data analysis. More precisely, while I do think that aesthetics plays a major role in a visualization, I am not sure how much innovation we still need in producing novel metaphors for data visualization.
In my experience (building research-based prototypes for more than 10 years) most visualization problems can be solved with a handful of graphs. Situations in which you really have to come up with a new metaphor are rare. “Graphical workhorses” such as bar charts, line charts, scatter plots, pivot tables, etc., are really hard to beat!
At the same time, this does not mean that visualizing data effectively is easy! What is really hard is to use, tweak, and combine these graphs in clever, effective and innovative ways. This is much harder than one may be willing to admit. In a way the innovation and educational effort needed to make progress in visualization should focus more on depth and less on breadth. We need to understand how to use existing methods well more than we need to look for new metaphors and techniques (even though we of course need to keep innovating and trying out crazy things).
Interpretation (Eyes → Mind)
This step is crucial and yet often neglected. Once results are represented, people need to interpret them and understand what they mean. This is a very complex cognitive process that links together several pieces of knowledge. Think about it: what does one need to know in order to reason effectively about the results of modeling and visualization? At minimum you need to be able to understand the representation and the model, the way they link to the real world entities they represent and, finally, but crucially, how they relate to the knowledge you already have in your head. Let me focus on visualization and modeling.
“Are people able to interpret and trust their visualizations and models?”
The important question here is: “Are people able to interpret and trust their visualizations and models?”. In order to interpret a visualization effectively, first you need to understand the visual metaphor and, second, the visual metaphor itself needs to convey information in the least ambiguous/uncertain way possible. Unfortunately, not all visual representations are like that. A notable example is multidimensional projections (using algorithms such as t-SNE and MDS), which use a somewhat intuitive metaphor (proximity equals similarity) but are also unbearably ambiguous. Here is an example of a projection showing similarity between words extracted from IMDB reviews:
What do you learn when you look at it? And when you happen to learn something … How sure are you that the thing you learned represents some real phenomenon and not just some statistical fluke?
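The ambiguity can be made concrete with a toy case in Python. The sketch below is neither t-SNE nor MDS: it uses a deliberately naive projection (dropping one coordinate) on four made-up points, labelled `w1` to `w4`, that are all equally distant from each other in 3D. No flat layout can keep four points mutually equidistant, so any 2D picture of them has to distort something.

```python
from itertools import combinations
from math import dist

# Four points that are all equally far apart in 3D
# (corners of a regular tetrahedron); labels are hypothetical.
POINTS_3D = {
    "w1": (1, 1, 1),
    "w2": (1, -1, -1),
    "w3": (-1, 1, -1),
    "w4": (-1, -1, 1),
}

def pairwise(points):
    """Distance between every pair of labelled points, rounded for display."""
    return {(a, b): round(dist(points[a], points[b]), 3)
            for a, b in combinations(sorted(points), 2)}

# A deliberately naive projection: keep the first two coordinates only.
POINTS_2D = {name: xyz[:2] for name, xyz in POINTS_3D.items()}

original = pairwise(POINTS_3D)
projected = pairwise(POINTS_2D)
for pair in original:
    print(pair, original[pair], "->", projected[pair])
```

In the 2D layout some pairs look closer than others even though every pair is equally similar in the original space. A reader who trusts "proximity equals similarity" would see structure that is not in the data, which is exactly the kind of fluke the question above is about.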
When we look at interpretation of models we have an even bigger problem. Machine learning methods use incredibly sophisticated procedures to transform data into more abstract structures but in the process we completely lose the ability to understand their content, quality and plausibility. Take “topic modeling”. It’s a nightmare. The method takes as an input a document collection and returns a set of “topics”, each captured as a set of words. The problem is that most of the time the set of words returned makes no sense at all. Here is an example from a recent project we have been working on in my lab. These are some of the topics extracted from a collection of articles from Vox:
What do you think? Does it make sense? Can you extract anything useful out of them? (In all fairness the method returned many more topics which make more sense, but I chose these to illustrate the problem in a more dramatic fashion.)
How do you deal with this? That’s an important question which requires the collaboration of ML experts but also people who understand perception and cognition so that these methods can more effectively produce a human-technological system able to really enhance the human mind.
I have two sets of suggestions: one for practitioners and one for researchers.
- Focus more on (more relevant) problems. There is no lack of relevant problems to solve in the world and data analysis can play a major role in making progress. For better or worse, data is everywhere and large portions of the physical world are leaving digital traces that can help us understand some things better. Work for or collaborate with people who want to solve important problems. Pick one domain you like and try to produce better understanding.
- More tools, fewer visualizations. If we want to multiply the power of data analysis and visualization and put it in the hands of those whose job is to solve important problems for us (like doctors, climate scientists, security experts) we need to focus more on tools and less on visualization. Building the next graphic to wow people may be fun, instructive and even useful to some extent, but ultimately I believe we need to build tools for other people to use to harness the full power of data and visualization.
- Make it public. Some of what I am describing is already happening! Probably on a quite large scale even, but it’s not visible. Most of these projects happen behind the closed doors of private organizations, which have little incentive to show what they are doing internally. But this is changing. If you happen to work on a data analysis project, show us how you did it! But don’t just show us the final product, make the whole process visible. Heck! Let us know where you failed and how you coped with it. Show us all the dead-ends you encountered so that maybe we can all learn something out of it. Similarly, if you develop a tool, do your best to put it in the hands of as many people as you can. You never know what someone, somewhere, may be able to do with it. Maybe something remarkable you can’t anticipate?
- Develop better specification methods. Translating what is in people’s heads into instructions a machine can understand is still quite challenging. Lots of progress has been made in terms of programming languages but creating specifications without coding is still pretty challenging. Two great examples of interactive specification systems invented in recent years are Tableau’s visual query language and Trifacta’s interactive methods for data transformation. These cover two very important needs but there is no lack of other situations where interactive specifications are needed. For instance, specifying what one wants to do with text collections is still pretty challenging.
- Develop more interpretable methods. As I mentioned above, interpretation is a big challenge, especially when we focus on ML methods that are meant to interface with a human. We first need to better understand how interpretation works and, possibly, how it relates to pre-existing knowledge and expertise. We also need to develop methods that are more interpretable and more flexible in accepting input and feedback from a human agent (while ensuring consistency/correctness).
- Develop a “science” of data analysis. The data analysis process is made of a series of complex cognitive processes which we don’t understand very well. What makes data analysis successful? What role do computational tools play? How can we avoid traps, biases, omissions, etc.? This is really complex! While some basic research in cognitive science exists, there is no accepted model that can guide designers and engineers in developing and evaluating complex interactive systems for data analysis. Making progress in this direction would enable us to better understand how interactive data analysis works and, hopefully, inform us on how to create better tools for thinking with data.
In this little essay I am arguing that visualization practitioners and researchers should adopt a wider perspective on their role in the data science space. Visualization experts can help people solve complex and important societal problems by focusing on supporting people in analyzing their data. This can be done by (1) understanding that visualization is one (important) step in a much larger and more complex process, (2) seeking out collaborations with people who need their help and (3) developing tools for them to do remarkable things with data. I hope you found this reading, though admittedly a bit long, inspiring. We need an army of visualization enthusiasts like you to do important work with an impact in the world!