The size of Napoleon’s army during his invasion of and subsequent retreat from Russia.

Data and Virtual Reality — Part I

Evan Warfel
Published in kineviz-blog
Nov 2, 2016


What can VR bring you that typical data visualization doesn’t? A tour of high-dimension visualization techniques, Iris data, Tim Cook’s faces, node-and-edge graphs, and VR.

San Francisco’s DataVR meetup group met for the first time a few weeks ago. I was there, and am happy to report that there was a lively group discussion centered around the merits of VR data visualization and analysis tools.

Implicitly, we all understood that we are dealing with a classic Chicken and Egg problem — it’s difficult to build a suite of VR tools that people will use without knowing how said people will use VR data tools.

In this piece, I will explore how VR can help with a) information density and b) intuitively understanding data, by examining how high-dimensional data visualization and graph exploration can benefit from VR.

1. High-Dimensional Data Visualization

“Graphs are essential to good statistical analysis.”
— F.J. Anscombe

Provided your dataset has two dimensions or fewer, the respective data is relatively easy to visualize with graphs or charts:

Anscombe’s famous quartet, taken from Wikipedia. Each data set has the same mean, correlation, variance, and best-fit line. Going counterclockwise from the top left corner, I propose labeling the graphs as follows: Violin, Viola, Cello, and Bass.

For each dataset above, the mean of the X coordinates is 9, the mean of the Y coordinates is 7.50, the variance of the X coordinates is 11, the correlation between the X and Y coordinates is 0.816, and the equation of the best-fit line in each case is Y = 3 + 0.5x.
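The arithmetic is easy to verify. Here is a minimal sketch in pure Python, using the published values of Anscombe’s first dataset:

```python
import math

# Anscombe's first dataset (values from his 1973 paper)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n                     # 9.0
mean_y = sum(y) / n                     # ~7.50

# Sample variance, covariance, and the Pearson correlation
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)    # 11.0
var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
cov_xy = sum((xi - mean_x) * (yi - mean_y)
             for xi, yi in zip(x, y)) / (n - 1)
corr = cov_xy / math.sqrt(var_x * var_y)                 # ~0.816

# Least-squares best-fit line: Y = intercept + slope * X
slope = cov_xy / var_x                  # ~0.50
intercept = mean_y - slope * mean_x     # ~3.00
```

Running the same computation on the other three datasets yields the same summary statistics, which is exactly Anscombe’s point.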

In other words, these four datasets are seemingly statistically identical, even though their true nature is betrayed by visualization.

If you have three dimensions worth of data, you could always use a three-dimensional plot, and VR can potentially help in this regard. But if you have high-dimensional data (plenty of columns, if you coerced your data into an Excel spreadsheet), you are mostly out of luck.

While it is easy enough to think in 2D, the trouble with having a lot of columns in your dataset (10,000, say, but really anything greater than three) is that it is impossible to visualize more than three spatial dimensions directly.

To demonstrate how VR can help with high dimensional data visualization, I’ll first cover some of the existing techniques people can use.

One approach to visualizing higher-dimensional data is known as parallel coordinates, where each record is plotted as a kind of wavy line. The idea is that differences between the lines should indicate the differences between the “records,” or rows, of data.

On each plot below, each Iris from the celebrated four-dimensional Anderson Iris dataset is visualized as a line with two kinks in it. (Anderson measured four attributes — Petal Length, Petal Width, Sepal Length, Sepal Width — for 150 different flowers.)

Both of these plots attempt to visualize Anderson’s Irises by connecting each dimension with a line. Notice the differences in the green lines (the Setosa Iris) between the two plots. The reason is that the ordering of the dimensions is completely arbitrary and has no spatial meaning. (For the technically inclined: the graphs display unnormalized data.)

You will notice that conceiving of a flower as an arbitrary collection of unordered line segments is only marginally helpful; most people look at these graphs and just scratch their heads. The only really meaningful thing you can say from these two charts is that the green bits seem very different from the orange and purple bits, which are similar to each other.
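To make the construction concrete, here is a minimal pure-Python sketch of how a parallel-coordinates plot is assembled. The three rows are illustrative values in the style of Anderson’s measurements, not the real dataset, and the normalization step addresses the unnormalized-axes caveat above:

```python
# Illustrative rows (sepal length, sepal width, petal length, petal width)
records = {
    "setosa":     [5.1, 3.5, 1.4, 0.2],
    "versicolor": [7.0, 3.2, 4.7, 1.4],
    "virginica":  [6.3, 3.3, 6.0, 2.5],
}

def min_max_normalize(recs):
    """Rescale each dimension to [0, 1] so the parallel axes share a scale."""
    cols = list(zip(*recs.values()))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return {
        name: [(v - l) / (h - l) for v, l, h in zip(vals, lo, hi)]
        for name, vals in recs.items()
    }

normalized = min_max_normalize(records)

# Each record becomes a polyline: one vertex per dimension,
# x = axis index, y = normalized value. A plotting library simply
# draws segments between consecutive vertices.
polylines = {
    name: list(enumerate(vals)) for name, vals in normalized.items()
}
```

The “wavy line with two kinks” in the figures is exactly one of these four-vertex polylines.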

For reference, here is what the creatures look like in real life:

From left to right: Virginica, Versicolor, Setosa

If you instead use the four measurements as the coefficients of a finite Fourier series, so that each flower becomes a smooth curve, you get an Andrews plot. It turns out this isn’t much better or worse than the parallel coordinates depiction:

Once again, the main conclusion you can draw is that the green bits are different from everything else, and the orange and purple bits are similar.
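For the curious, Andrews’ construction is simple to state: a record x = (x1, x2, x3, x4) becomes the curve f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t), plotted over t in [−π, π]. A minimal sketch (the sample record is illustrative):

```python
import math

def andrews_curve(x, t):
    """Andrews' function for a 4-dimensional record x at parameter t:
    f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t)."""
    return (x[0] / math.sqrt(2)
            + x[1] * math.sin(t)
            + x[2] * math.cos(t)
            + x[3] * math.sin(2 * t))

# One illustrative Setosa-like record
x = [5.1, 3.5, 1.4, 0.2]

# Sample the curve over t in [-pi, pi]; plotting these samples,
# one curve per flower, yields an Andrews plot.
ts = [-math.pi + i * (2 * math.pi / 100) for i in range(101)]
samples = [andrews_curve(x, t) for t in ts]
```

Records with more dimensions just extend the series with cos(2t), sin(3t), and so on.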

If you used polar coordinates — i.e., if you took the bottom x-axis and bent both ends around to form a triangle or a circle — you’d get a radar plot, a star plot, or something similar.

If you are like me, you are probably thinking that Parallel Coordinates and related techniques aren’t exactly easy to understand or interpret.

But there are other ways of representing dimensions. A triangle, for instance, could be used to represent three dimensions of data, if you mapped each dimension to the length of a side. You could, if you really wanted, use a red-blue spectrum and a light-dark spectrum to color in the middle of the triangles and blamo! You’ve got five continuous dimensions all in one. Compare the triangles, and you might spot anomalies or heretofore-hidden patterns and relationships. That’s the theory, anyway.
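Here is a minimal sketch of that triangle encoding, with a hypothetical mapping; the side-length offsets are chosen so every record yields a drawable triangle:

```python
def triangle_glyph(record):
    """Map one 5-dimensional record (each value normalized to [0, 1]) to a
    glyph: three side lengths plus two color channels. The encoding is
    hypothetical, just to make the idea concrete."""
    d1, d2, d3, d4, d5 = record
    # Keep sides in [1.0, 1.9]: any two sides then sum to at least 2.0,
    # which exceeds the longest possible side, so the triangle inequality
    # always holds and every record yields a valid triangle.
    sides = [1.0 + 0.9 * d for d in (d1, d2, d3)]
    return {
        "sides": sides,
        "red_blue": d4,     # 0 = blue, 1 = red
        "light_dark": d5,   # 0 = dark, 1 = light
    }

glyph = triangle_glyph([0.2, 0.9, 0.5, 0.1, 0.7])
```

The offset matters: a naive mapping of raw values to side lengths can produce triples that violate the triangle inequality and cannot be drawn at all.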

It turns out that a statistician named Herman Chernoff explored a variant of this idea in the early 1970s — instead of the lengths of triangle sides, he mapped dimensions of data to different characteristics of cartoon faces.

I’ll let you judge how well this worked by way of an L.A. Times infographic:

Eugene Turner — Life in Los Angeles (1977), L.A. Times. The four facial dimensions, the geographic distribution of each face and the community-line information mean you are looking at six dimensions of data.

Your gut reaction will be to dismiss this method of data presentation, as it looks silly, vaguely racist, and hard to interpret. But I urge you to give it a second look — can you spot the buffering row of communities in between the poor and affluent parts of town?

One reason Chernoff faces don’t get wider use, I submit, is that they look too cartoonish. (And seeing how science is very Serious Business, it wouldn’t be proper for plots to be cartoon faces…)

While realistic Chernoff faces solve the cartoonishness problem, they highlight another issue: though they seem like they could be intuitive, we all have too much experience with faces and real emotions to evaluate arbitrarily constructed ones.

In the depictions below, parameters of Tim Cook’s face — like the slope of his eyebrows — have been mapped to various Apple financial data-points for the year in question.

From Christo Allegra. Each version of Tim Cook’s face represents Apple’s financial data for the year in question. The width of Tim Cook’s nose represents the amount of debt taken on by Apple; the closed-ness of Cook’s mouth represents the revenue of that year; the size of his eyes represents the earnings per share, and so on. For serious uses of Chernoff faces, check out Danny Dorling’s work.

Clearly, there are some issues with this approach too. One thing that stands out is that not every aspect of a face conveys emotional information on the same scale as, for instance, the smile.

In other words, the perceptual difference between one face and another doesn’t match the actual differences between the data.

This, I submit, is one of the properties that makes plots and graphs so useful, and something that is missing from current approaches to high-dimensional data visualization.

Virtual reality can solve several of the aforementioned issues. Instead of faces, a Chernoff-like technique can be applied to control how neutral objects look, move, interact, and are distributed.

For example, all of the following properties of tables could be used to represent different data dimensions: height, area of the table-top, color, leg length, degree of polish, and the type and location of stains and burns. If you have 15-dimensional data, you could do worse than translate the dimensions into parameters that control how tables look.

Being able to walk around a table that represents 15 dimensions means you would be able to pick up on subtle differences that are hard to perceive in 2D. The advantage of VR is that it lets you perceive the true meaning of a table that is twice as tall as another, or of a table top with a different coefficient of friction.

Some testing could ensure that the differences in dimensions carry the same perceptual weight. Moreover, the methodology for how to go about this has been thoroughly explored in the realm of Psychophysics and color perception — researchers have spent a vast amount of time measuring how people perceive both tiny and large differences in different kinds of sensations.
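As a toy example of what that psychophysics looks like in practice, Stevens’ power law models perceived magnitude as a power of physical intensity, ψ = k·φ^a; inverting it makes equal steps in the data produce equal steps in perception. The exponent below is roughly the classic value reported for brightness, but treat all the constants as illustrative:

```python
# Stevens' power law: perceived magnitude psi = k * phi**a, where phi is
# physical intensity. To give equal data steps equal perceptual weight,
# encode data values through the inverse of the law.

def perceptually_uniform_encoding(value, a=0.33, k=1.0):
    """Physical intensity phi whose perceived magnitude equals `value`."""
    return (value / k) ** (1.0 / a)

# Data values 1, 2, 3 land at equal *perceived* spacing, even though the
# physical intensities driving the display grow very nonlinearly.
intensities = [perceptually_uniform_encoding(v) for v in [1.0, 2.0, 3.0]]
```

The same trick applies to any encoding channel (size, brightness, roughness) once its exponent has been measured.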

In other words, VR and a little psychophysics could make understanding complex data as easy (or as stress-inducing) as walking through IKEA.

2. Graph Exploration

Due to one of the more unfortunate flukes in the history of mathematics, collections of objects that consist of points and connections between them are also called graphs; the points are called nodes (or vertices), and the connections are called edges.

These kinds of graphs generally look like this:

Force Directed Spring Layout of Wiki data, taken from Wikipedia.

Each dot above represents a wiki page; each line represents a connection between pages.

Graphs are useful for seeing, in the abstract, the relationships between objects or data points, especially when the type and number of connections are important.
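In code, a graph of this kind is just an adjacency structure. The page names below are made up, and the “number of connections” is each node’s degree:

```python
# A minimal sketch: a graph stored as an adjacency mapping.
# Each edge is a link between two (hypothetical) wiki pages.
edges = [
    ("Physics", "Mathematics"),
    ("Physics", "Chemistry"),
    ("Chemistry", "Biology"),
    ("Mathematics", "Logic"),
    ("Physics", "Logic"),
]

adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

# Degree = number of connections; high-degree hubs are exactly the nodes
# that crowd the busy middle of a force-directed drawing.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
hub = max(degree, key=degree.get)
```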

For example, the graph below represents every interaction between every gene in the yeast species Saccharomyces cerevisiae.

Left: A Node and Edge Graph representation of a yeast genome. Right: important clusters of genes. From http://science.sciencemag.org/content/353/6306/aaf1420

While interesting, both graphs above are, as you have surely noticed, very busy in the middle. Something similar happens if you explore the Panama Papers/Offshore Leaks dataset — the graph of connections gets busy, quick.

The fact that most graphs become hard to read, due to the number of overlapping connections near the center, undermines the whole point of using a graph in the first place: understanding how entities relate to each other.

As you might imagine, three-dimensional graph visualization allows for much more breathing room.

A three dimensional graph visualization of different connected networks in the brain. From http://ieeexplore.ieee.org/document/6579594/

On top of that, VR makes exploring three-dimensional graphs even easier, because differences in a node’s size and position are best understood at real-life scale, the kind you can achieve with VR. (Stay tuned for a 3D graph exploration demo that we are working on.)
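For a sense of how such a 3D layout can be computed, here is a hypothetical, untuned sketch of a force-directed (spring) layout in three dimensions: edges pull their endpoints together while all node pairs push apart.

```python
import math
import random

def spring_layout_3d(nodes, edges, iterations=200, k=1.0, seed=0):
    """Toy Fruchterman-Reingold-style layout; all constants illustrative."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1) for _ in range(3)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                d = [pa - pb for pa, pb in zip(pos[a], pos[b])]
                dist = max(math.sqrt(sum(c * c for c in d)), 1e-6)
                for axis in range(3):
                    push = (k * k / dist) * (d[axis] / dist)
                    force[a][axis] += push
                    force[b][axis] -= push
        # Attraction along edges
        for a, b in edges:
            d = [pa - pb for pa, pb in zip(pos[a], pos[b])]
            dist = max(math.sqrt(sum(c * c for c in d)), 1e-6)
            for axis in range(3):
                pull = (dist / k) * (d[axis] / dist)
                force[a][axis] -= pull
                force[b][axis] += pull
        # Small step toward equilibrium
        for n in nodes:
            for axis in range(3):
                pos[n][axis] += 0.01 * force[n][axis]
    return pos

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
pos = spring_layout_3d(nodes, edges)
```

The extra spatial dimension is what gives dense graphs the breathing room described above; production tools use the same idea with far better tuning and performance.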

The advantage of VR is that it can make perceiving differences in data easier, especially when visualizing high-dimensional datasets and node-and-edge graphs. As VR technology develops, I predict we’ll see more data analysis done from within a headset.

In a future post, I’ll explore how VR data can be leveraged to make sense of cities. If you are curious, you can check out Kineviz here.


Evan Warfel
Soon to be a UC Davis Psych Grad Student / Writer / Data Scientist / Humanist.