When Scientific American approached OCR with the challenge of visualizing the impact of Einstein’s general theory of relativity on the centenary of its publication, we were pretty excited. We had a few ideas of avenues to explore, but first we needed a dataset.
First attempts and dead-ends
Princeton University recently digitized all the papers and correspondence that Albert Einstein wrote from his youth until 1923, which can be browsed in their digital collection. Einstein spent about a decade developing the general theory of relativity, but it wasn’t until the Field Equations of Gravitation paper, published in November 1915, that he finalized the mathematical models necessary to support his theory. Our initial concept was to use this paper as the starting point of a citation trail mapping the reach of general relativity over the past 100 years.
We looked at a number of open and closed scientific databases, such as Scopus, Google Scholar, and Web of Science, to see how we could go about building a citation trail. However, while some databases were more complete than others, very few contained citation data predating 1970. Citations in scientific papers were not as standardized in the early part of the twentieth century as they are now, and scientific databases tend to prioritize new papers and citation data over historical material.
There has to be a better way…
Since there were too many gaps to chart the impact of general relativity over time through citations, we decided instead to use the plethora of current scientific papers to take a snapshot of research actively engaging with principles derived from the theory of general relativity.
We decided to use Cornell University Library’s open database arXiv.org, which some consider the most current repository of research, since scientists can upload papers before they’ve been published. ArXiv.org also categorizes papers by area of science and, within physics, by subfield. One of its physics subcategories is General Relativity and Quantum Cosmology (gr-qc). For our dataset, we selected papers with gr-qc as their primary category, since we could be sure they related to general relativity, and we limited ourselves to papers uploaded to arXiv.org in 2014, the most recent complete year.
The ins and outs of an API
While arXiv.org is great in that it’s an open database (not behind a paywall) with an API, the API is a bit confusing to use and not fully documented. For instance, we were interested in accessing references and citations for papers, which can be done on the website, but the API documentation offers no call for doing so (we were left to our own devices to get this data…). In addition, we couldn’t query by year, but had to use the somewhat clunky “start” and “max_results” parameters, with the sortBy parameter set to “submittedDate”, to page through to the year we were interested in.
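The pagination described above can be sketched as a small helper that builds the query URLs. The parameter names are the arXiv API’s own; the page size and category are filled in for illustration, and in practice you would fetch each URL and stop once the submission dates fall outside the target year.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"

def gr_qc_query_urls(total, page_size=100):
    """Build paginated arXiv API query URLs for the gr-qc category,
    sorted by submission date (newest first)."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": "cat:gr-qc",
            "start": start,
            "max_results": page_size,
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        urls.append(API_URL + "?" + urlencode(params))
    return urls
```

Because there is no year filter, the date-sorted pages are walked in order until the desired year has been fully covered.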
While we did collect citation data from arXiv.org, there weren’t many citation linkages: all the papers we examined are from 2014 and, being so recent, rarely cite one another. We needed to generate our own metrics to determine the most popular research topics related to general relativity. For each paper in the General Relativity and Quantum Cosmology category we collected the following information:
- List of Authors
- Primary Category
- Subcategories (if any)
- References (if any)
- Citations (if any)
- Published / Not yet published
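The fields above can be captured in a simple record; the field names here are our own shorthand for illustration, not arXiv’s.

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    """Metadata collected for one gr-qc paper."""
    arxiv_id: str
    authors: list                  # list of author names
    primary_category: str          # e.g. "gr-qc"
    subcategories: list = field(default_factory=list)
    references: list = field(default_factory=list)  # cited papers, if any
    citations: list = field(default_factory=list)   # citing papers, if any
    published: bool = False        # whether it has appeared in a journal
```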
If at first we don’t succeed, let artificial intelligence take the lead
Since we couldn’t rely on citation data to create links between papers, we looked to the text of the abstracts to see if we could group papers by common research areas. We turned to the Alchemy API to process the abstracts of all 2,435 papers added to arXiv.org in 2014 with General Relativity and Quantum Cosmology (gr-qc) as their primary category. The Alchemy API, which is now part of IBM’s Watson platform, lets users leverage machine learning capabilities for image and text processing. We were specifically interested in the AlchemyLanguage Concept Tagging API, which we used to analyze the paper abstracts.
When we ran the corpus of abstracts through the Alchemy API, it returned a list of concepts along with a relevance score for each detected concept, on a scale of 0.0 to 1.0, with 1.0 the most relevant. Alchemy returned over 1,500 concepts from the 2,435 abstracts. We totaled the scores for each concept to determine the most popular concepts across all the abstracts. This list was then heavily edited to cull redundant words and topics, and finally reduced to 61 concept terms that the editors at SciAm deemed relevant to physics and general relativity.
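The scoring step amounts to summing relevance per concept. A minimal sketch, assuming the per-abstract tagging output has been collected as lists of (concept, relevance) pairs:

```python
from collections import Counter

def rank_concepts(tagged_abstracts, top_n=10):
    """Sum relevance scores per concept across all abstracts and
    return the highest-scoring concepts.

    tagged_abstracts: one list of (concept, relevance) pairs per
    abstract, with relevance in the range 0.0-1.0."""
    totals = Counter()
    for concepts in tagged_abstracts:
        for concept, relevance in concepts:
            totals[concept] += relevance
    return totals.most_common(top_n)
```

The ranked list produced this way is what would then be culled by hand down to the final 61 terms.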
Laying out the network — making connections and forming relationships
First, we wanted to see how the concepts themselves related to one another. Using a network graph layout, we created links between concept terms that appeared in the same papers. For each pair of concept terms, we computed the percentage of the total papers the two terms shared, as well as their combined Alchemy relevance scores. Using the toxiclibs Processing library, we generated a basic layout model for the concepts: the higher the percentage of papers two terms shared, the stronger their link in the network graph. The terms with more connections were more fixed (and more popular), and provided the central nodes that the other terms organized around.
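The shared-paper weighting can be sketched as follows. This is a simplified stand-in for our layout input, assuming each concept has been mapped to the set of papers it appeared in; the combined-relevance component is omitted here.

```python
from itertools import combinations

def concept_links(papers_by_concept, total_papers):
    """Weight each concept pair by the share of all papers in which
    both concepts appear together.

    papers_by_concept: dict mapping concept -> set of paper IDs."""
    links = {}
    for a, b in combinations(sorted(papers_by_concept), 2):
        shared = papers_by_concept[a] & papers_by_concept[b]
        if shared:
            links[(a, b)] = len(shared) / total_papers
    return links
```

Edge weights like these become spring strengths in the force-directed layout: heavier edges pull their two terms closer together.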
After laying out the 61 concept terms in the network diagram, we needed to arrange the papers around the terms they referenced. In a particle physics simulation, terms acted as gravitational attractors for the papers, while Daniel Shiffman’s Box2D Processing library kept articles from overlapping as they were pulled toward their preferred locations. This pushed upward the articles grouped around more popular terms, creating peaks around terms like ‘Black holes’ and ‘Quantum gravity.’ Because we had narrowed the 1,500-plus concepts Alchemy returned down to 61, a few papers contained none of the chosen terms; for those, we used a combination of shared references, citations, and keyword matching against the abstract text to place them near the papers most relevant to them.
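The attraction idea can be illustrated without a physics engine: each paper takes a small step toward the centroid of the concept terms it references. This is a deliberately simplified sketch; the overlap resolution that Box2D handled in the original is omitted, and the step strength is an arbitrary illustrative value.

```python
def step_toward_terms(paper_pos, term_positions, strength=0.1):
    """Move a paper one step toward the centroid of the concept
    terms it references.

    paper_pos: (x, y); term_positions: list of (x, y) attractors."""
    cx = sum(x for x, _ in term_positions) / len(term_positions)
    cy = sum(y for _, y in term_positions) / len(term_positions)
    x, y = paper_pos
    return (x + strength * (cx - x), y + strength * (cy - y))
```

Iterating steps like this for every paper, combined with collision constraints, is what lets dense clusters pile up into peaks around popular terms.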
While the visualizations for the print version of Scientific American are more legible as top and side views, we created them with a 3D layout built in Processing.
Moving from a 2D to a 3D world
For the interactive component of the graphic, we decided to let users explore the 3D environment, zooming in and out on areas that interested them. To draw in 3D on the web we turned to three.js, the wonderful WebGL library created by mrdoob. We exported the positions of the terms and papers from the print version generated in Processing, and used those values to draw the shapes in three.js.
Some notes on our approach
Since the graphic was published, we’ve seen questions pop up on Twitter about our process for laying out the papers. Someone suggested we might have used the t-SNE method, which visualizes multi-dimensional data in 2D space. We didn’t use a multi-dimensional approach, but relied on the Alchemy API and network layout models to group the papers after analyzing their abstracts.
Our visualization approach was not as straightforward as a layout function in Gephi; it was more an interpretive analysis that allowed us to spatialize the dataset around concepts we curated for relevance. The Alchemy API let us generate numerical relationships between papers where no citation link existed. As we laid the articles out in a sort of generative terrain, we realized we needed to give more “gravity” to unpopular terms, so that the large groupings forming around popular terms would disperse into more discrete clusters.
And some lessons learned
One aspect we found interesting is the set of articles with more than 850 authors, which are highlighted in red in the visualization. Articles with this many authors are all related to gravitational wave detection with the LIGO detector, and authored by the LIGO Scientific Collaboration. It was also surprising to realize that one paper, submitted in March 2014, had already been cited 85 times a year later. That paper focused on advances in tests of the general theory, which are still being done today, and it made us realize how much of science is a constant process of experimentation and re-evaluation, even for ideas that have been given the status of “theory.”
This project involved a fair amount of trial and error. Sometimes there is a promising germ of an idea, but it’s hard to predict what the data will actually yield, and the kind of visualization you can create depends on the data that can be found and understood. As is often the case, datasets are not always complete or readily accessible. Working with a historian of science to map citations in early physics papers would have been great, but we needed to change direction once we realized that wasn’t feasible in our time frame. As designers, there are many ways we can tell a story, and determining the best way forward is an iterative process of research, sketching, and prototyping. Sometimes a visualization presents “neutral” scientific data; other times, as here, it depicts a collaboration between natural language processing algorithms and scientific expertise and curation. In any case, it’s not quite a surprise that Einstein’s ideas are still relevant today, but it is amazing to realize how many aspects of research in the quest to better understand our universe they’ve made possible.