How Data Visualization in VR Can Revolutionize Science

Lessons learned from classifying unknown X-ray sources in the cosmos

Published in Nightingale, Aug 4, 2020

Screenshot from the Virtual Data Cosmos — Visualizing data clusters in the Virtual Data Cosmos

Astronomy has become a big data discipline, and the ever-growing databases in modern astronomy pose many new challenges for analysts. Scientists are more frequently turning to artificial intelligence and machine learning algorithms to analyze multidimensional data sets. However, it is not only a methodological and technical challenge: it is also a visual one! Data visualization is driving discovery in astronomy and is also helping with communicating new findings to the general public. The history of information graphics shows how the transformation of data into knowledge is vital for understanding the data at hand, a subject I have written about previously.

The problem of visualizing complex data and exploring it interactively is by no means new or limited to research. Examples from digital information design in bioinformatics and medicine (e.g. Genome Valence by Ben Fry or Meviatis by Ricarda Schuhmann) show how visualization can support the understanding of structures within data sets and facilitate exploration. The representation of the data’s dimensions (i.e. its parameter values) can result in dynamic and aesthetic data sculptures. Such visualizations are often quite beautiful in themselves but, crucially, their interactive features enable users to quickly make comparisons and interpret the data.

Today’s digital media allows us to go beyond designing interactive on-screen three-dimensional applications. Both augmented reality (AR) and virtual reality (VR) make it possible for users to take a fresh look at their data and explore parameter spaces in 3D. There is so much potential for using these technologies in the field of information design. For VR, the advantages are obvious:

More space! VR offers a larger field of view than 2D images. This allows for multiple views to be arranged in space, making it easier to draw cross-references and connections.
More dimensions! Compared to 2D graphics, VR visualizations offer additional parameters that can represent data (e.g. sound, haptics, lighting, interaction).
More structure! The perception of space and depth is more intuitive, enabling shapes and volumes to be recognized more quickly.
More fun! Being able to scale the space and move from overview to detail makes exploring the data a powerful immersive experience.

Understanding the nature of the unknown

Inspired by the above research examples, the hypothesis I chose to explore for my bachelor thesis in Information Design was:

The presentation of scientific data with new digital media, especially VR, offers great potential for data analysis in science.

I wanted to test this hypothesis on a data set from my previous research which I had been struggling to get an overview of. During my PhD in Astrophysics, I was involved in the EXTraS project, which aimed to automatically classify unknown and newly discovered X-ray sources in the cosmos. The sources were observed by the X-ray satellite XMM-Newton from the European Space Agency (ESA). I set about designing the Virtual Data Cosmos as a way of grouping data with similar properties and visualizing these groups.

As more and more data is collected by X-ray satellites, the data archives of these satellites are growing annually. The records detail millions of sources that emit X-rays, and from which any newly found source could yield new physical discoveries. The classification of unknown sources is therefore hugely important in modern astronomy and, due to the sheer amount of data, intelligent algorithms are increasingly being adopted by astronomers worldwide.

The image below shows the entire sky at optical wavelengths as seen from Earth. This projection can be seen as analogous to a world map in which the galactic plane lies on the equator and the galactic center is in the center of the map. Just as in a normal world map, there are longitudes and latitudes, shown as white grid lines. This is typically referred to as a sky map. Laid over the optical image are white dots; each represents a region observed by the X-ray satellite XMM-Newton. Each white dot includes several unknown X-ray sources. The objective of the project was to classify each of these sources.

Sky map of the universe at optical wavelength as seen from Earth with positions of unknown X-ray sources laid over. — An optical sky map of the universe (Source: ESA) adapted to show the positions of unknown X-ray sources

In order to understand the nature of each X-ray source, astronomers compare its features (specifically the energetic and temporal properties observed) to those of objects with known classification types such as binary star or Seyfert galaxy. Questions like these help:

What are the correlations between the properties of the X-ray source and those of known object classification types?
Where are the differences?
Has the unknown object been detected elsewhere in the electromagnetic spectrum, which could yield further hints about its nature?

In order to describe the similarity between an unknown and a known X-ray source, we astronomers use statistics as well as visualization. In this case, machine learning algorithms (supervised decision tree algorithms to be precise) automatically characterized every source in this large and complex data set by comparing its parameter values (e.g. observed X-ray intensity) with those of known objects. Ultimately, the algorithms calculate the probability of an X-ray source belonging to each classification type and allocate it to the class that is most likely.

For example: The X-ray source with ID 1 has a 45% probability of being a single star, a 30% probability of being a binary star and a 0.01% probability of being a galaxy. The algorithm assigns the class with the highest probability as the final classification of the unknown source. In this case, source ID 1 would be classified as a single star.
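The assignment rule in this example can be written in a few lines of Python. This is only a minimal sketch of the "pick the most likely class" step, not the EXTraS pipeline; the function name and the dictionary layout are assumptions made for illustration. The three probabilities are the ones quoted above and do not sum to 100%, since the remaining probability would be spread over classes not listed here.

```python
def classify(probabilities):
    """Return the (class, probability) pair with the highest probability.

    `probabilities` is a hypothetical mapping {class_name: probability}.
    """
    best_class = max(probabilities, key=probabilities.get)
    return best_class, probabilities[best_class]


# Probabilities for source ID 1 as quoted in the text (0.01% = 0.0001).
source_1 = {"single star": 0.45, "binary star": 0.30, "galaxy": 0.0001}

label, probability = classify(source_1)
print(label, probability)
```

Note that the winning class here has a probability below 50%, which is exactly the kind of result an astronomer would want to inspect more closely.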

Once the algorithm has classified all unknown sources in this way, the task of the astronomer is to carefully screen and check the results. How did the algorithm perform? Did it make mistakes? Since more than one algorithm was tested, one needs to compare the results of each to answer these questions. Did different algorithms classify the same unknown source into different classes? As a scientist, one also wants to know why an algorithm classified an object as it did. The astronomer requires an understanding of the relationship between different parameters and source classification types, and gains it with the help of visualization.
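One simple way to start comparing algorithms is to list every source on which they disagree. The Python sketch below assumes each algorithm's output is available as a mapping from source ID to class label. The algorithm names, source IDs and labels are invented for illustration and are not taken from the EXTraS results.

```python
def find_disagreements(results):
    """Return {source_id: {algorithm: label}} for sources whose labels differ.

    `results` is a hypothetical structure: {algorithm_name: {source_id: label}}.
    Only sources classified by every algorithm are compared.
    """
    shared_ids = set.intersection(*(set(labels) for labels in results.values()))
    disagreements = {}
    for source_id in sorted(shared_ids):
        votes = {algo: labels[source_id] for algo, labels in results.items()}
        if len(set(votes.values())) > 1:
            disagreements[source_id] = votes
    return disagreements


# Illustrative data only: two invented algorithms, three invented sources.
example_results = {
    "tree_a": {1: "single star", 2: "galaxy", 3: "binary star"},
    "tree_b": {1: "single star", 2: "binary star", 3: "binary star"},
}

conflicts = find_disagreements(example_results)
print(conflicts)
```

In this made-up example only source 2 is flagged, because it is the only one the two algorithms label differently.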

The limitations of traditional science viz

A typical method is to create multiple scatterplots in which the X-ray properties of unknown cosmic sources are compared with each other while taking into account the results of a single algorithm. This is done by assigning a unique color and symbol to a specific source classification and depicting X-ray sources with specific class symbols in the plot. We astronomers can then analyze whether the positions of sources depicted with the same symbol form patterns that help to distinguish different classification types.

Typical scatterplots used in astronomy to explore data set dimensions, here showing X-ray properties of cosmic sources. Classification types (e.g. stars, galaxies) are coded by color and symbol. Because the data points overlap, different classes cannot be distinguished.

For example: these scatterplots were created to investigate the relationships between parameter HR1 and parameters HR2, HR3, and HR4. The parameters are abstract properties used to describe specific radiation energies of the cosmic sources and visualizing them in the abstract plane enables us to look for patterns that may characterize the properties of different objects. The data points represent all unknown cosmic sources observed by the satellite.

In this case, green triangles represent the class Seyfert galaxies, while purple squares depict the class of single variable stars that exist within our Milky Way. We see that the sources overlap if we only look at the HR1 parameter, but they occupy very different regions in the HR1-HR2 plane in the first scatterplot. Hence from that plot we can conclude that sources with low HR1 and HR2 values belong to the purple square (variable star) class.

But what about sources with high HR1 and HR2 values? Comparing only these parameters would put them in the galaxy (green) class. But there are many other classes which also occupy this region, e.g. blue triangles, which represent a kind of binary star system, and this confuses the picture. To get a clearer understanding we now need to compare the HR1-HR2 parameter plane with the other scatterplots. If we now look at the second image, which illustrates the HR1-HR3 plane, we see that the sources shown in green and blue symbols are slightly more separated. And by combining the information of the first and second plots, we can identify the specific combinations of HR1, HR2 and HR3 parameters that differentiate variable stars (purple), galaxies (green) and binary star systems (blue).

With each additional scatterplot we gradually form a mental model of a multidimensional parameter space in which each source class is located in a unique location. In principle this is what the algorithms do and is why our parameters are also known as the ‘dimensions’ of a data set. However, the larger the number of parameters and classes, the more difficult it is for humans to keep an overview of all relationships. It is simply not possible for us to imagine more than three dimensions at once.

In our sample, the size of the data set and the fact that there were more than 50 parameters made it impossible to get an overview of all the relationships between parameter values and source classifications. Far too many scatterplots would have been required and, due to the size of the data set, many regions were occupied by multiple source classes. The overlap of their symbols made it very difficult to see the data patterns.

In addition, these plots correspond to the classification by a single algorithm. So as we increase the number of algorithms in use, the number of plots would quickly become unmanageable. I concluded that this traditional 2D visualization did not allow a proper overview of the data, and was frustrated that the decision-making mechanisms of the algorithm remained opaque.
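To get a feeling for how quickly the number of plots grows, here is a short back-of-the-envelope calculation in Python. It assumes that every pair of parameters gets its own scatterplot and that the full set is repeated once per algorithm; the number of algorithms used in the last line is a hypothetical value, and 50 is only the lower bound on the parameter count mentioned above.

```python
import math


def scatterplot_count(n_parameters, n_algorithms=1):
    """Number of 2D scatterplots needed to show every parameter pair
    once per algorithm: C(n, 2) * n_algorithms."""
    return math.comb(n_parameters, 2) * n_algorithms


print(scatterplot_count(4))      # all pairs of HR1..HR4
print(scatterplot_count(50))     # all pairs of 50 parameters, one algorithm
print(scatterplot_count(50, 3))  # hypothetical: the same for 3 algorithms
```

Under these assumptions the four hardness ratios alone give 6 plots, 50 parameters give 1,225, and three algorithms would bring that to 3,675.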

Designing the Virtual Data Cosmos

Visualizing the data directly

To come up with a new way to visualize this big data set, I first did some research on the history and principles of data visualization. I was fascinated by the creativity with which designers and scientists mapped their data.

Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency.

Edward Tufte coined the term ‘graphical excellence’ in data visualization. He postulated various properties that statistical graphics require to be successful. His theory was that data should be displayed directly without the user being distracted by the design itself. Furthermore, statistical graphics should serve a clear purpose (either description, exploration, tabulation or decoration) and should show several levels of detail, from a rough overview to the fine structure of the data.

Similar claims were made by a 2015 study on the visualization of big data in VR and AR. The authors concluded that for a data visualization to serve as an analysis tool, it requires the data concerned to be represented exactly. The implication for my work was that the data mapping had to be done through coding. This meant that the data values themselves would define the visual aesthetic of the virtual environment.

In addition, the interaction and scalability in a VR scene would allow the user to be fully immersed in the data and literally dive into it. One could easily move around and take different perspectives on the data set. Similarly, the user would be able to zoom out and get an overview, effectively holding the data in their hands. The data set could even be turned around and explored as though it were a physical object.

This, for me, was the most important aspect of the VR approach: it combined the advantage of data physicalization with the possibility to shape and manipulate the data environment, which is not possible in the real world.

A sketch illustrating two immersive moments in VR: holding the data in your hands versus diving into the data.

Regardless of how the X-ray source data was organized, my principal idea was to pull the cluster of X-ray parameters and probabilities apart and display them in three-dimensional space. The goal was an interactive data visualization in VR in which the data could be explored directly. By interacting with a concrete virtual environment anyone could explore this abstract data space.

My solution to the problem was the Virtual Data Cosmos. I’ll talk you through the design concept here. A detailed description of the design process will be given in the next article in this series.

Applying the design concept

I wanted to ensure that the visualization would first give the user an overview of the data and only then allow them to go into the detail. By zooming in on a chosen classification type, the user would finally reach the DNA of the X-ray source (i.e. the details of its spectral parameters) and therefore understand why the algorithm assigned the source to a certain class.

The VR experience consists of two spaces; users can choose to zoom in and out to seamlessly move from one space to the other:

The class room represents the entire cosmos and includes all data points, grouped according to their classification by the algorithms.
The parameter space represents the observed parameter values of a user-selected subsample of the X-ray sources, and their classification by a selected algorithm.

The starting point was to create the ‘class room’, within which each classification type has its own three-dimensional volume. The class room visualizes the classification results of the X-ray sources by the various algorithms and allows users to explore the probability distributions within the database. It prompts questions such as:

How did an algorithm classify the unknown X-ray sources?
What is the probability of a source belonging to that source class?
What could be an alternative classification?
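The last two questions can also be asked of the raw probabilities directly. The following Python sketch ranks the classes of a single source and reports the runner-up as a possible alternative classification. The function name and data layout are assumptions for illustration, and the numbers reuse the example probabilities of source ID 1 from earlier in this article.

```python
def rank_classes(probabilities, top=2):
    """Return the `top` most likely (class, probability) pairs, best first.

    `probabilities` is a hypothetical mapping {class_name: probability}.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


source_1 = {"single star": 0.45, "binary star": 0.30, "galaxy": 0.0001}

(best, p_best), (alternative, p_alternative) = rank_classes(source_1)
margin = p_best - p_alternative  # a small margin hints at an uncertain result

print(f"assigned: {best} ({p_best:.0%})")
print(f"alternative: {alternative} ({p_alternative:.0%}), margin {margin:.0%}")
```

A small margin between the first and second class is a natural cue for a visualization to highlight, since those are the sources where an alternative classification is most plausible.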

Sketch of the VR showing the class room and parameter space, and how data points serve as a portal between the scenes. — A sketch of the VR concept showing the class room and parameter space, and how data points serve as a portal between the two spaces.

Visualizing the complete data set in the class room was a very exciting moment! For the first time since the start of the EXTraS project, we were able to clearly visualize more than 500,000 data points without compromise, and compare the results of various algorithms all at once. I felt that I finally got a clear overview of the results and could easily see the distribution of all classified X-ray sources.

Here are some screenshots from the VR class room:

Overview of the classification results in the class room shown as color-coded point clouds of different algorithms. — Overview of the classification results in the class room.

Activated data points in the class room which show additional information and can be selected to be shown in parameter space. — Zooming in on the details of the data set in the class room.

The next step was to understand how an algorithm distinguished between different classes. By zooming in and comparing the features of various selected X-ray sources one enters the parameter space. There is a lot to view here, and again we faced the problem of how to visualize all parameter dimensions at once.

The desire to pull the data points apart eventually led to the final approach: to let each source perform a ‘walk’ through space, with every source starting from the same point. Its parameter values were used to define the direction and length of each step. With this mapping, each source produced a unique path (or trace) in space, and objects with similar properties ended up in similar locations in the virtual cosmos.

For example, the following image shows the possible walks of three sources belonging to different classes. This one image allows us to draw the same conclusions that we drew from comparing the three scatterplots above.

Sketch which shows examples of parameter walks for three different source classes. — Example of parameter walks for three different source classes.

In this sketch, four steps are defined based on the values of the parameters HR1, HR2, HR3, and HR4. Their values mainly define the direction of the step, while the step length is defined by the selected algorithm.

We see that the HR1 and HR2 steps already help us to separate variable stars from galaxies or binary star systems. The additional parameters then help to differentiate between the latter two classes.
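To make the idea of a parameter walk more concrete, here is a minimal Python sketch. The mapping in it is an assumption chosen purely for illustration and is not the mapping used in the Virtual Data Cosmos: each parameter value is assumed to be scaled to the range -1 to 1 and sets the elevation angle of its step, each step gets a fixed azimuth depending on its position in the sequence, and the step length is a single number passed in by the caller. The parameter values of the three example sources are invented.

```python
import math


def parameter_walk(values, step_length=1.0):
    """Return the 3D positions visited by one source, starting at the origin.

    Hypothetical mapping (an assumption, not the original one):
    - the k-th value, clipped to [-1, 1], sets the elevation angle of step k
    - the azimuth of step k is fixed at k * 90 degrees
    - every step has the same length `step_length`
    """
    x = y = z = 0.0
    path = [(x, y, z)]
    for k, value in enumerate(values):
        azimuth = k * math.pi / 2
        elevation = max(-1.0, min(1.0, value)) * math.pi / 2
        x += step_length * math.cos(elevation) * math.cos(azimuth)
        y += step_length * math.cos(elevation) * math.sin(azimuth)
        z += step_length * math.sin(elevation)
        path.append((x, y, z))
    return path


# Invented (HR1, HR2, HR3, HR4) values: two similar sources and one different one.
star_a = parameter_walk([-0.80, -0.60, 0.10, 0.20])
star_b = parameter_walk([-0.75, -0.65, 0.15, 0.20])
galaxy = parameter_walk([0.70, 0.80, -0.30, 0.50])

print(math.dist(star_a[-1], star_b[-1]))  # end points of similar sources
print(math.dist(star_a[-1], galaxy[-1]))  # end points of dissimilar sources
```

Because the mapping is continuous, sources with similar parameter values trace similar paths and end close together, while a source with very different values ends up elsewhere. Any other continuous mapping from values to steps would show the same qualitative behavior.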

We can see how an algorithm classified an object by the color of the object’s path. More detailed information on the data mapping will be given in a subsequent article.

This is a screenshot of the VR parameter space for a large number of sources that were assigned to three different classes (named CV, BL and STAR):

The parameter space which shows the data traces of various selected X-ray sources. Similar sources occupy the same regions. — Exploring the parameter space

In the image above, there are three classes: variable stars (blue), a very active kind of elliptical galaxy (light green) and normal stars (dark green). We can see that sources whose parameters generated a similar path have been assigned to the same class. We can also see situations where the parameter values caused the path to take on a strange shape, causing confusion for the algorithm.

This representation yielded a much better understanding of why a machine-learning algorithm classified a source in a certain way and made clear why it failed to characterize other sources when their paths overlapped.

Summary

Creating the Virtual Data Cosmos convinced me not only of my hypothesis that VR offers great potential for data analysis in science, but also that the pure presentation of big data can create interesting and aesthetic virtual spaces when determined by the specific parameters of the data. This generative approach implies that by exploring the virtual world, users can actually examine an abstract parameter space that is not necessarily visual in nature. By interacting with the virtual elements, the visualization becomes an extremely useful tool.

The scalability in VR is just one advantage over traditional science viz methods. Additionally, the immersive data visualization is fun to work with. It encourages one to focus longer on the data and have a more complete sense of what information might otherwise be hidden.

There is of course plenty more to be explored in this area. Once I was free from using conventional methods to represent the data, designing the parameter space using the radiation properties of the sources raised many new questions for me. How could the parameters be separated more precisely? Are there better representations that would allow the parameter correlations to be analyzed even more clearly? I’ll talk more about how I improved upon the first version by manipulating the parameters in the next article in this series.

The example of the Virtual Data Cosmos illustrates how applying principles of data visualization in VR can support the sciences by enabling the creation of mental models for multidimensional data. This project shows just how thinking outside the box and coming up with new ways to visualize big data opens many exciting possibilities for science.

I hope I was able to inspire you to create your own VR data visualization experience. A walk-through of the VR experience I created is available at http://annok.de/vdc-2/

During my years in astronomy, data visualization has been an elemental part of my research. Toward the end of my PhD, I encountered a challenge quite common in modern astronomy: understanding and visualizing the information in a big data set. Since I was also studying information design at the University of Applied Sciences, I started exploring data visualization and how it could be a tool for processing multidimensional data in science or industry. In this series of articles I will describe my adventure, which eventually led to the development of the Virtual Data Cosmos.