The Past, Present, and Future of Scientific Literature

Hamish
Published in Litmaps
Mar 10, 2021 · 9 min read

Everyone could do with more science in their life. Imagine a world where doctors knew all the latest medical discoveries, entrepreneurs had complete knowledge of economics, and humans knew everything there was to know about health and productivity.

Making scientific research legible is a big deal. Accessibility has always been a hard problem, but the existing tools and processes are falling further behind as the literature grows.

In 1900, the best way to stay on top of science was to hang out in the Bodleian Library in Oxford. When you wanted to understand a topic, a librarian would point you to a physical document or collection. If you wanted to follow up any references from that document, you would then return to the librarian and the circle of life would repeat.

The modern Bodleian Library is the internet, the modern document is digital, and the modern librarian is Google Scholar. But aside from these details, the process is essentially the same.

But this process is not scaling well. The volume of scientific papers runs into the hundreds of millions, and each additional year brings a further one million papers to PubMed alone. This rate is itself growing by 8–9% per year. Finding papers on a topic is not the problem. The problem is efficiently navigating between papers and understanding the scientific literature in aggregate.

Academic Search Engines: The Status Quo

Suppose I want to understand this deep learning thing I keep hearing about. If I type “deep learning” into my preferred academic search engine I get something like this:

What do I learn from these results?

  • There are a few books and papers called “deep learning” which have hundreds to tens of thousands of citations.
  • All the results are between 2015 and 2017.
  • Some topics related to deep learning are “object detection”, “classification”, and “medical images”.

However, I’m missing some important context:

  • Is tens of thousands of citations a lot by the standards of deep learning?
  • Did research on deep learning exist before 2015 or continue beyond 2017?
  • Does deep learning always involve images?

Another problem: all the top results are called exactly “deep learning”. How can I decide which to read first? (In fact, Google Scholar is confused here: the first entry is actually a review of the book “Deep Learning” by Goodfellow et al.)

Let’s say I choose to read the third entry. While reading, I notice that reference 12 seems interesting. To learn more, I scroll down to the reference list and scan for number twelve.

Perhaps I can decide whether the paper is worth reading based on the title, year, and authors. More likely, I will have to at least read the abstract. In that case I have to copy-paste the title into Google Scholar and find a link to the PDF or HTML version (assuming it isn’t behind a paywall).

Once I’ve found some papers I would like to read, I will typically either download the PDFs and organise them in a new directory, or print them out to read the physical copy. As I read through them, I will often want to take notes, either with document annotations or in a separate file.

Often I need to turn my reading into a bibliography, either because I’m writing a literature review or because I want to add sources for some factual writing. If I’m using a reference manager, then creating a bibliography will be fairly straightforward. If not, then I’ll have to go back to Google Scholar for each paper and copy-paste the citation details into a document or BibTeX file.

I’m sure you’ll agree: this process could be improved.

How Litmaps Adds Context and Navigability

Litmaps is motivated by the theory that the status quo approach to reading scientific literature (as described above) has two major problems.

The first problem is that steps which could be automated with software have to be done manually. These include:

  • Finding additional details about cited papers
  • Navigating from a paper to one of its citations
  • Keeping track of papers I’ve visited and converting them into a bibliography

The second problem is a lack of context. The user has little sense of how any specific paper fits into the scientific literature as a whole. With visualisations, the following context could easily be communicated to the user (a rough sketch of how it might be computed follows the list):

  • The distribution of papers over time
  • The distribution of citation counts per paper
  • The citation network between papers
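
To make this concrete, here is a minimal sketch of how such context could be derived from basic paper metadata. The Paper record and its fields are hypothetical, not the Litmaps data model:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Paper:
    id: str
    year: int
    citation_count: int
    references: list[str]  # ids of the papers this paper cites

def literature_context(papers: list[Paper]):
    """Summarise a set of papers along the three axes listed above."""
    papers_per_year = Counter(p.year for p in papers)           # distribution over time
    citation_counts = sorted(p.citation_count for p in papers)  # distribution of citations
    ids = {p.id for p in papers}
    # Citation network restricted to the papers in this set.
    edges = [(p.id, ref) for p in papers for ref in p.references if ref in ids]
    return papers_per_year, citation_counts, edges
```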

We built the Litmaps app to fix these problems. Litmaps uses a database of over a billion citation connections and a sleek, modern web interface to make navigating and understanding scientific literature not just easy but actually enjoyable.

When we run our “deep learning” search on Litmaps, we get the following:

Unlike Google Scholar, we can quickly learn from looking at this page that:

  • In 1999, 2003, and 2009 there were a few early publications on deep learning
  • Deep learning research began growing exponentially around 2011–2012
  • Deep learning research peaked in 2018–2019, and declined slightly in 2020
  • The LeCun paper which turned up in the Google Scholar results can be seen here as the big circle in the 2015 column. Circle size is proportional to the log of the number of citations (a sketch of this scaling follows the list), so we can tell that this paper does in fact have an exceptionally high citation count compared to other deep learning papers.
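
As a rough illustration of that log scaling, a minimal sketch (the constants here are illustrative, not the values Litmaps actually uses):

```python
import math

def circle_radius(citation_count: int, base_radius: float = 4.0, scale: float = 3.0) -> float:
    # Radius grows with the log of the citation count, so a paper with tens of
    # thousands of citations stands out without dwarfing everything else.
    # The +1 avoids log(0) for uncited papers.
    return base_radius + scale * math.log10(citation_count + 1)
```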

We thus have the context we were missing from the Google Scholar results. From prior knowledge, I have some guesses about why the literature looks like this:

  • In 2012, the deep learning model AlexNet won the competitive ImageNet Large Scale Visual Recognition Challenge. This prompted a lot of interest in deep learning, and could explain why research took off around 2011–2012.
  • Deep learning research has definitely continued at pace since 2019. I suspect the decline we’re seeing in 2020 may be because “deep learning” has become too generic a term for most purposes. Instead, I suspect researchers are referring to specific architectures like “transformer”, “generative adversarial network”, and “convolutional network”.

We are thus able to quickly form a mental landscape of the literature. As we learn more about specific papers, we can incorporate this knowledge into a greater schema.

By hovering over each circle in the “literature map” visualisation, we can quickly find additional details:

By hovering over a few circles in this manner, we can quickly build up an intuition for the distribution of citations. LeCun 2015 is an outlier with tens of thousands of citations. Most of the other large-ish circles are in the hundreds of citations, and most of the small circles have ten or fewer citations.

We can click on LeCun 2015 to bring up even more details:

The panel on the left displays the abstract, the list of papers this paper cites, and the list of papers which cite it. On the right we have a visualisation of the papers which cite this paper.

From this panel we can easily navigate to

  • The site hosting the paper (by clicking on the title)
  • A PDF of the paper, if one is available
  • The details of the papers which cite or are cited by this publication

This automates away the manual navigation steps we encountered when using Google Scholar.

How Litmaps Tracks Exploration and Helps You Find New Papers

When we find a paper which is important for our task, we can add it to our project by clicking the “add” button:

This automates the process of keeping track of important papers.

Once we’re ready to do something with our project, we can export a bibliography as text or BibTeX, or export an image of the literature map as PNG or PDF.

We’re still not done making things more efficient. To recap, when you’re trying to get a foothold in a new field, a common procedure is:

  • Find some promising papers
  • Go through these papers’ references, and see if any of those look promising
  • After several iterations of this, it should be apparent what the most important and fundamental papers in a field are, because they get cited all the time

To save you the trouble of going through all the references yourself, we’ve created network analysis tools for Litmaps which do it for you. After you’ve added some papers to your project, you can activate the “suggestions radar”. This scans through all the citations and references (collectively referred to as “citation connections”) of all the papers in your project. A paper which has multiple citation connections to your project is a candidate recommendation, and the more citation connections it has, the higher it is ranked as a candidate. This effectively triages papers which are navigable from your current set of papers, so you can prioritise those which are “deeply” connected to your research.
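
A minimal sketch of this kind of ranking, assuming each paper’s citation connections are available as a set of paper ids (the function and field names are hypothetical, not the Litmaps implementation):

```python
from collections import Counter

def suggestions_radar(project_papers: set[str], connections: dict[str, set[str]]):
    """Rank papers outside the project by how many citation connections
    they share with papers already in the project.

    project_papers: ids of the papers already added to the project.
    connections: maps a paper id to the set of ids it cites or is cited by.
    """
    scores: Counter[str] = Counter()
    for paper_id in project_papers:
        for neighbour in connections.get(paper_id, set()):
            if neighbour not in project_papers:
                scores[neighbour] += 1
    # Papers connected to several project papers rank highest.
    return scores.most_common()
```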

Litmaps also lets you combine keyword search with network analysis. The “relevance search” tool displays search results prioritised by the number of citation connections to the current project.

This can be very useful for finding papers at the intersection of disciplines. For example, if we want to find papers on using Bayesian statistics for forecasting, we can create a project with several papers on forecasting, then do a “relevance search” for the keyword “Bayesian”. This will show us papers with the keyword “Bayesian” which are connected to the forecasting papers.
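
Relevance search can be thought of as an ordinary keyword search whose results are re-ordered by citation connections to the project; a sketch under the same hypothetical data model as above:

```python
def relevance_search(keyword: str,
                     candidates: dict[str, str],
                     project_papers: set[str],
                     connections: dict[str, set[str]]):
    """Return keyword matches sorted by citation connections to the project.

    candidates: maps a paper id to its searchable text (title plus abstract).
    """
    def connection_count(paper_id: str) -> int:
        return len(connections.get(paper_id, set()) & project_papers)

    matches = [pid for pid, text in candidates.items()
               if keyword.lower() in text.lower()]
    return sorted(matches, key=connection_count, reverse=True)
```

In the Bayesian forecasting example, project_papers would hold the forecasting papers and keyword would be “Bayesian”.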

Relevance search is also useful for disambiguation. A keyword may have different meanings in different disciplines. If we populate a project with papers from the target discipline, a relevance search for the keyword will only return results which have some connection to that discipline.

Network analysis of course does have its downsides. The biggest problem is the Matthew Effect, or the propensity for the “rich” to get “richer”. In the case of scientific literature, this means that, other things being equal, the most citations will go to the papers which already have many citations.

Stigler’s Law of Eponymy illustrates how pernicious the Matthew Effect can be in science. Stigler’s Law states that no discovery is named after its original discoverer. Instead, discoveries tend to be named after famous scientists, either because many people first encounter the discovery only after a famous person starts talking about it, or just because the story works better when a discovery was made by a famous person. Examples of Stigler’s Law include Hubble’s law, the Pythagorean theorem, and Stigler’s Law itself.

Clearly, assigning credit according to existing reputation rather than originality is unfair and creates perverse incentives. As we continue to build out Litmaps, we will be thinking carefully about how to combat the Matthew Effect. By making the citation network more visible to researchers, we hope that contributions can be more accurately traced to those who really deserve credit.

Conclusion

Exploring scientific literature version 1.0 was a librarian helping you find physical papers in a library and using reference lists to identify papers worth reading. Version 2.0 was using an academic search engine to find digital papers, but still using reference lists to identify papers worth reading. Version 3.0 will be fluidly moving between digital papers in an interactive citation network, with visual context cues, and algorithms guiding you to papers worth reading. At Litmaps we are building this future.
