From Data Visualization to Interactive Data Analysis

Three main uses of data visualization

I know I am running the risk of falling into gross simplification. However, I still find it useful to identify three main categories of data visualizations in terms of what their main (intended or unintended) purpose is. This will help me clarify a few ideas later on.

  1. Explanatory. The main goal here is to use graphics as a way to explain some complex idea, phenomenon or process. This is an area where graphical representation shines: we are visual creatures and a picture is sometimes really worth a thousand words. Data journalism has provided over the years most of the best contributions to the art of explaining complex things through data (see the amazing work done by the New York Times and Washington Post over the years). But this is also the realm of education, especially scientific education which is often based on numbers and graphs. And this is also the area of that beautiful recent trend called “explorable explanations”, pioneered by Bret Victor and popularized by many other fantastic people like Nicky Case.
  2. Analytical. The main goal here is to extract information out of data with the purpose of answering questions and advancing understanding of some phenomenon of interest. Sure, explanatory visualization is also about helping people understand something. But the main difference here is that in explanatory visualization the author knows already what to visualize (after having performed some analysis), whereas in analysis the main use of visualization is to understand the data in the first place. There are a million names people have used to define this activity. The most recent and fashionable one is Data Science and more specifically that part of Data Science called “Exploratory Data Analysis”, a term invented by the great John Tukey a few decades ago. To make things simple I’ll call this just the good old name: Data Analysis, or maybe, Visual Data Analysis, or even, Interactive Visual Data Analysis to emphasize the use of graphical representations one can interact with (some in academia and business call this Visual Analytics also).

Why talk more about data analysis?

This essay, and the talk that preceded it, aim to better define the role of visualization in data analysis and to spur more conversations about what is happening in this area of visualization which, unfortunately, is not blessed with the same limelight as the other purposes.

RevEx. An interactive data exploration tool we developed to help Charles Ornstein from ProPublica sift through millions of reviews from Yelp.

Side note on “understanding”

Did you notice? When we talk about data analysis problems we often describe the goal as “understanding” something. We can then postulate that the main purpose of data analysis is to better understand something through data.

Relationship between reality, data/statistical models, human mental models.

How Does Interactive Data Analysis Work?

  • Generating questions. A problem specification is typically too high-level and loosely specified to translate directly into some data analysis actions (a problem that is often overlooked and not well understood). Typically, the problem needs to be translated (implicitly or, better, explicitly) into a number of data analysis questions.
  • Gathering, transforming and familiarizing with the data. Some projects have data available, whereas others require some degree of data search or generation. In any case, all projects require the analyst to become familiar with the content and its meaning and to perform multiple transformations, both to get to know the data (e.g., by slicing, dicing and aggregating it) and to prepare it for the analysis one is planning to perform.
  • Creating models out of data. Not all projects require this step, but some do. Using statistical modeling and machine learning methods can be useful when the question asked can be answered more easily by building a model. While most of what modeling people talk about is prediction, models can be extremely powerful tools for exploration and hypothesis generation. Examples of methods that can be used for this step are clustering, dimensionality reduction, simple regressions and various NLP (natural language processing) methods to convert text into meaningful numbers.
  • Visualizing data and models. This is where data is observed through your eyes. Now, most people think fancy charts when thinking about this stage, but simple representations like tables and lists are perfectly reasonable visualizations for many problems. This is where the results obtained from data transformation and querying (or from some model) are turned into something our eyes can digest and hopefully understand. This is the step that we data visualizers all love and live by.
  • Interpreting the results. Once the results have been generated and represented in some visual format, they need to be interpreted by someone. This is a crucial step and an often overlooked one. Behind the screen there is a human who needs to understand what all those colored dots and numbers mean. This is a complex activity which includes steps such as: understanding how to read the graph, understanding what the graph communicates about the phenomenon of interest, linking the results to questions and pre-existing knowledge of the problem. Note that interpretation here is heavily influenced by pre-existing knowledge. This includes at least knowledge about the domain problem, the data transformation process, the modeling, the visual representation. This is another aspect of visualization and analysis that is often overlooked.
  • Generating inferences and more questions. All of these steps ultimately lead to creating some new knowledge and, most of the time, generating additional questions or hypotheses. This is an interesting property of data analysis: its outcome is not only answers but also questions; hopefully better and more refined ones. One important aspect of this step is that one may generate incorrect inferences here, so not all processes necessarily lead to positive outcomes and not all analyses are equally effective.
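
To make the loop above a bit more concrete, here is a minimal sketch of one pass through it in Python, using only the standard library. Everything in it is an assumption made for illustration: the toy `reviews` data, the question ("which city has the highest average rating?") and the helper names `aggregate` and `render_table` are hypothetical and do not come from any real project.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (city, rating) pairs standing in for review records.
reviews = [
    ("Boston", 4), ("Boston", 5), ("Boston", 3),
    ("Austin", 2), ("Austin", 3),
    ("Denver", 5), ("Denver", 4), ("Denver", 5), ("Denver", 4),
]

def aggregate(rows):
    """Transform: group ratings by city and compute (count, mean) per city."""
    groups = defaultdict(list)
    for city, rating in rows:
        groups[city].append(rating)
    return {city: (len(vals), mean(vals)) for city, vals in groups.items()}

def render_table(summary):
    """Visualize: a plain text table, sorted by mean rating (highest first)."""
    lines = [f"{'city':<8}{'n':>3}{'mean':>7}"]
    for city, (n, avg) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
        lines.append(f"{city:<8}{n:>3}{avg:>7.2f}")
    return "\n".join(lines)

summary = aggregate(reviews)
print(render_table(summary))
```

The code only covers the transformation and visualization steps. Deciding that average rating is the right thing to look at, noticing that Austin has only two reviews, and asking what to check next are still left to the person reading the table.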

Important aspects of data analysis

There are a few important aspects of this process that I would like to highlight:

  1. Some activities are exclusively human. Did you notice? Quite a good number of steps in the process are exclusively human (see the red boxes in the figure above): defining problems, generating questions, interpreting the results and generating inferences and new questions. It’s all about human activities, not technological ones. Which leads us to ask: how much do we know about how humans think with data? And how can we expand our knowledge so that we can improve this process?
  2. Visualization is just a small portion of the process. For data visualizers like us this is a crucial observation. As much as we love the visualization step, we have to recognize that when visualization is used for data analysis, it represents just a small portion of a much more varied set of activities. This is not to say that visualization is not important or challenging, but it’s crucial to realize what the big picture is. The effectiveness of the whole process depends on all of the steps above, not just visual representation.

Quo vadis interaction?

You may have noticed I did not mention interaction so far. Why? Because interaction is all over the place. Every time you tell your computer what to do and your computer returns some information back to you, you have some form of interaction. Here is a list of actions we typically need to perform in data analysis:

  • Specify a model and/or a query from the data;
  • Specify how to represent the results (and the model);
  • Browse the results;
  • Synthesize and communicate the facts gathered.

Challenges of Interactive Visual Data Analysis

I want to conclude this essay by highlighting a few challenges I think are particularly relevant for interactive data analysis. This is where I believe we need to make more progress in coming years.

Specification (Mind → Data/Model)

When we interact with data through a computer the first thing we need to do is to translate our questions and ideas into specifications the computer can read (SQL is a great example here). This is where language and formalism play a major role. One may think that in order to give instructions to a computer one must necessarily learn some kind of programming language, but in practice many interactive systems employ interactive specification methods that translate user actions into statements computers understand, in ways that are more natural for human users. A fantastic example of an interactive specification system is the VizQL language employed in Tableau, which translates user selections into formal statements the system can understand and use to generate queries and appropriate visual representations.
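
As a rough illustration of interactive specification, the sketch below turns a user selection (the kind of thing a user interface might produce when someone picks a grouping field, a measure and an aggregate) into a SQL statement and runs it. This is not VizQL and says nothing about how Tableau works internally. The `selection` dictionary, the `orders` table and the function `selection_to_sql` are all hypothetical names invented for this example.

```python
import sqlite3

# Whitelists keep user selections from being pasted into SQL unchecked.
ALLOWED_COLUMNS = {"region", "product", "sales"}
ALLOWED_AGGREGATES = {"SUM", "AVG", "COUNT", "MIN", "MAX"}

def selection_to_sql(selection):
    """Translate a UI-style selection (a dict) into a SQL statement."""
    dim = selection["group_by"]
    measure = selection["measure"]
    agg = selection["aggregate"].upper()
    if dim not in ALLOWED_COLUMNS or measure not in ALLOWED_COLUMNS:
        raise ValueError("unknown column in selection")
    if agg not in ALLOWED_AGGREGATES:
        raise ValueError("unknown aggregate in selection")
    return (f"SELECT {dim}, {agg}({measure}) AS value "
            f"FROM orders GROUP BY {dim} ORDER BY {dim}")

# Hypothetical in-memory table standing in for the analyst's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, sales REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("East", "A", 10.0), ("East", "B", 5.0), ("West", "A", 7.0),
])

# What a user interface might emit after the user picks "region" and "sales".
selection = {"group_by": "region", "measure": "sales", "aggregate": "sum"}
query = selection_to_sql(selection)
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

The user never sees or writes the SQL. They only make selections, and the system takes care of producing a formal statement the database can execute.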

Should we expect everyone to be a coder?

One relevant question here is: “should we expect everyone to be a coder and learn a specification language in order to perform data analysis?”. I personally believe we have to be as inclusive as possible and realize that there are large segments of the population that might greatly benefit from data analysis tools and yet have no time, resources or incentives to learn how to use formal languages. Therefore, while I am a huge fan of data science programming tools such as R and the Jupyter/Pandas combo, I am not sure we should expect everyone to reach that level of proficiency in order to do useful things with data. Good examples of how very complex data processing can be made more accessible are Trifacta’s Wrangler and Open Refine, which enable people to perform lots of data wrangling without writing a single line of code.

Representation (Data/Model → Eyes)

Once results are obtained from querying and modeling, the next step is to find a (visual) representation so that the user can inspect and understand them. This is the realm of data visualization. But while most people, when they hear “data visualization” think about colorful fancy graphics, it is totally appropriate to expect a simple data table to be an effective means to inspect the results. I find it somewhat interesting that we use the word “visualization” to mostly mean complex graphical puzzles, when in fact a table is as visual as anything else.

“How fancy does a visualization need to be in order to be useful for data analysis?”

Another important question here is: “how fancy does a visualization need to be in order to be useful for data analysis?”. I am a huge fan of well-crafted, sleek, attractive, bespoke visualization projects. The beauty of colorful pixels is what made me fall in love with visualization in the first place. Yet, I am not sure how much this counts when the main goal is data analysis. More precisely, while I do think that aesthetics plays a major role in a visualization, I am not sure how much innovation we still need in producing novel metaphors for data visualization.

Interpretation (Eyes → Mind)

This step is crucial and yet often neglected. Once results are represented, people need to interpret them and understand what they mean. This is a very complex cognitive process that links together several pieces of knowledge. Think about it: what does one need to know in order to reason effectively about the results of modeling and visualization? At minimum you need to be able to understand the representation and the model, the way they link to the real world entities they represent and, finally, but crucially, how they relate to the knowledge you already have in your head. Let me focus on visualization and modeling.

“Are people able to interpret and trust their visualizations and models?”

The important question here is: “Are people able to interpret and trust their models?”. In order to interpret a visualization effectively you need to first understand the visual metaphor and second the visual metaphor itself needs to convey information in the least ambiguous/uncertain way possible. Unfortunately, not all visual representations are like that. A notable example is multidimensional projections (using algorithms such as t-SNE and MDS) which use a somewhat intuitive metaphor (proximity equals similarity) but are also unbearably ambiguous. Here is an example of a projection showing similarity between words extracted from IMDB reviews:

Example of t-SNE projection
Example of topics generated by topic modeling (with the LDA method)
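
The ambiguity of the "proximity equals similarity" metaphor in projections can be seen even in a toy case. The sketch below does not use t-SNE or MDS. It swaps in a deliberately naive projection that simply drops the third coordinate, which is enough to show the general issue: when points are mapped to fewer dimensions, not all distances can be preserved, so two points that look close on the map may be far apart in the original space. The points and the helper names `project` and `nearest` are hypothetical.

```python
from math import dist

# Hypothetical 3-D points; "c" differs from "a" mostly along the third axis.
points = {
    "a": (0.0, 0.0, 0.0),
    "b": (1.0, 0.0, 0.0),
    "c": (0.1, 0.0, 5.0),
}

def project(p):
    """A deliberately naive projection: keep only the first two coordinates."""
    return p[:2]

def nearest(name, coords):
    """Return the name of the point closest to `name` in `coords`."""
    others = (n for n in coords if n != name)
    return min(others, key=lambda n: dist(coords[name], coords[n]))

projected = {name: project(p) for name, p in points.items()}

print(nearest("a", points))     # nearest neighbor in the original space
print(nearest("a", projected))  # nearest neighbor on the 2-D "map"
```

In the original space the nearest neighbor of "a" is "b", but on the projected map it is "c". Algorithms like t-SNE and MDS are far more careful than this, yet a reader of the resulting picture still cannot tell, from the picture alone, which proximities are faithful and which are artifacts of the projection.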

Suggestions

I have two sets of suggestions: one for practitioners and one for researchers.

For practitioners:

  1. More tools, fewer visualizations. If we want to multiply the power of data analysis and visualization and put it in the hands of those whose job is to solve important problems for us (like doctors, climate scientists, security experts) we need to focus more on tools and less on visualization. Building the next graphic to wow people may be fun, instructive and even useful to some extent, but ultimately I believe we need to build tools for other people to use to harness the full power of data and visualization.
  2. Make it public. Some of what I am describing is already happening! Probably on a quite large scale even, but it’s not visible. Most of these projects happen behind the closed doors of private organizations, which have little incentive to show what they are doing internally. But this is changing. If you happen to work on a data analysis project, show us how you did it! But don’t just show us the final product, make the whole process visible. Heck! Let us know where you failed and how you coped with it. Show us all the dead-ends you encountered so that maybe we can all learn something out of it. Similarly, if you develop a tool, do your best to put it in the hands of as many people as you can. You never know what someone, somewhere, may be able to do with it. Maybe something remarkable you can’t anticipate?

For researchers:

  1. Develop more interpretable methods. As I mentioned above, interpretation is a big challenge, especially when we focus on ML methods that are meant to interface with a human. We first need to better understand how interpretation works and, possibly, how it relates to pre-existing knowledge and expertise. We also need to develop methods that are more interpretable and more flexible in accepting input and feedback from a human agent (while ensuring consistency/correctness).
  2. Develop a “science” of data analysis. The data analysis process is made of a series of complex cognitive processes which we don’t understand very well. What makes data analysis successful? What role do computational tools play? How can we avoid traps, biases, omissions, etc.? This is really complex! While some basic research in cognitive science exists, there is no accepted model that can guide designers and engineers in developing and evaluating complex interactive systems for data analysis. Making progress in this direction would enable us to better understand how interactive data analysis works and, hopefully, inform us on how to create better tools for thinking with data.

Conclusion

In this little essay I am arguing that visualization practitioners and researchers should adopt a wider perspective on their role in the data science space. Visualization experts can help people solve complex and important societal problems by focusing on supporting people as they analyze their data. This can be done by (1) understanding that visualization is one (important) step in a much larger and more complex process, (2) seeking out collaborations with people who need their help and (3) developing tools for them to do remarkable things with data. I hope you found this reading, though admittedly a bit long, inspiring. We need an army of visualization enthusiasts like you to do important work with an impact in the world!
