How to utilize scientific literature trends to gain intuition about a topic

Aug 3, 2018

What is lacking in the current way we do literature searches

Understanding trends within the scientific literature is critical for any scientific professional. Professors, physicians, CEOs, investors, and analysts alike need this intuition (often quickly) to make the right decisions. The problem is that it is difficult to do: scientific innovation is speeding up, and fields are becoming more interdisciplinary.

As I have transitioned from academic researcher to scientific consultant, the questions I am faced with now often span many fields. I quickly need to know the seminal papers in the field, the key thought leaders, controversial issues, and general trends over time. You can’t get all of this from the PubMed search bar.

I think we can build tools that give a user this type of intuition with a single click. This is the first in a series of posts where I explore this idea, integrating available programming libraries and packages to mine the scientific literature along with other resources (e.g., grants) to give a more complete view of what is going on within a given topic. This has helped me in my scientific consulting work, so maybe it will help you too.

PubMed Trends

I have a not-so-small obsession with Google Trends. However, I had trouble getting clean information out of the tool when searching for niche topics like “mass cytometry,” as opposed to non-niche topics like “The Avengers.” Accordingly, I decided to look at trend information in scientific publications using PubMed. Bar graphs of publication rates appear in the upper right corner of a given PubMed search, and there is also a nice user interface for exploring trends (check out this site), but I wanted a bit more control over the input and output, as you’ll see below.

To this end, I used the RISmed R package on CRAN. It lets you perform automated PubMed searches and extract information from the results. I modified a few functions to deal with some issues and tailor the results, but I won’t get into that here. Check my GitHub code for details when it’s up. Let’s just jump in.
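For readers who do not use R, here is a minimal Python sketch of the same idea: asking PubMed how many papers match a query in each publication year. This is not the RISmed code used for the charts in this post. It calls NCBI's public E-utilities ESearch endpoint directly, and the function names (`build_count_url`, `parse_count`, `yearly_counts`) are my own illustrative choices.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint (public; no API key needed for light use).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(term, year):
    """Build an ESearch URL that requests only the hit count for one year."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # restrict by publication date
        "mindate": str(year),
        "maxdate": str(year),
        "rettype": "count",   # return the count, not the list of IDs
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


def parse_count(response_text):
    """Extract the integer hit count from an ESearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])


def yearly_counts(term, years, pause_seconds=0.4):
    """Query PubMed once per year and return {year: number_of_papers}.

    The pause keeps the request rate under NCBI's limit for
    clients without an API key (three requests per second).
    """
    counts = {}
    for year in years:
        with urlopen(build_count_url(term, year)) as resp:
            counts[year] = parse_count(resp.read().decode("utf-8"))
        time.sleep(pause_seconds)
    return counts
```

Calling `yearly_counts('"mass cytometry"', range(2010, 2019))` would return one count per year, ready to plot as papers per year. The counts depend on when the query is run, since PubMed keeps indexing papers from earlier years.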

Single cell RNA sequencing recently outpaced mass cytometry (though both are growing)

Single cell biology now comes with more cells and more features. This means you can ask a lot more questions and make more interesting predictions. Two key tools to achieve this end are mass cytometry and single cell RNA sequencing. The latter has really started gaining ground more recently. Or has it been popular all along? Let’s check the trend information.

In these charts, the x axis is the year and the y axis is the number of publications in that year for a PubMed search query, effectively giving the publication rate in papers per year. The value for 2018 is a projection: the current number of publications plus the expected number for the rest of the year. If we were halfway through 2018, for example, I’d simply double the current number of publications.
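The projection for a partial year can be sketched in a few lines of Python. This assumes papers appear at a constant rate through the year, which is the same assumption behind the "double it at the halfway point" rule. The function name `project_full_year` is hypothetical and not taken from my scripts.

```python
from datetime import date


def project_full_year(count_so_far, as_of):
    """Extrapolate a year-to-date publication count to a full-year estimate.

    Assumes a constant publication rate, so the estimate is the current
    count divided by the fraction of the year that has elapsed.
    """
    year_start = date(as_of.year, 1, 1)
    next_year_start = date(as_of.year + 1, 1, 1)
    days_elapsed = (as_of - year_start).days
    if days_elapsed <= 0:
        raise ValueError("need at least one full day of data to project")
    fraction_elapsed = days_elapsed / (next_year_start - year_start).days
    return count_so_far / fraction_elapsed
```

Dividing by the elapsed fraction is the same as adding the expected remainder to the current count. On the posting date, `project_full_year(n, date(2018, 8, 3))` scales `n` by 365/214, or about 1.7. Indexing lag in PubMed would make a projection like this an underestimate.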

The most striking result here is that single cell RNA sequencing overtook mass cytometry in 2016 and seems to show no sign of slowing down. The first major uptick appears in 2015, which coincides with the emergence of new methods that allowed for more cells to be sequenced per experiment.

Mass cytometry publication rates started increasing in 2013, with another slight uptick around 2016. Interestingly, Fluidigm bought DVS (the mass cytometry company) on Feb 13, 2014, shortly after the first uptick began.

This type of analysis could be valuable for a mass cytometry analysis company trying to determine whether it should expand into the single cell RNA sequencing market.

However, for high throughput single cell analysis, flow cytometry is still a much larger market. Have a look at the graph below.

Papers that contain the search term “flow cytometry” are about 1.5 orders of magnitude (notice the log10 scale) above mass cytometry and single cell RNA sequencing. If you are developing a new analysis tool for mass cytometry data, it might be a good idea to tailor it to fluorescence flow cytometry as well.
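To make the log-scale reading concrete: a gap of 1.5 on a log10 axis means the larger count is 10^1.5 times the smaller one, or roughly 30-fold. Here is a small Python sketch of the conversion in both directions. The helper names are illustrative only.

```python
import math


def fold_difference(orders_of_magnitude):
    """Convert a gap on a log10 axis into a plain multiplicative factor."""
    return 10 ** orders_of_magnitude


def orders_of_magnitude(larger, smaller):
    """Convert two raw counts into their gap on a log10 axis."""
    return math.log10(larger / smaller)


# A 1.5 order-of-magnitude gap is a bit more than a 30-fold difference.
print(round(fold_difference(1.5), 1))
```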

Crowdedness of the -omics landscape

It’s often important to know how popular and crowded a given field is. If a field is too crowded, you will face lots of competition and paranoia about getting scooped (I’ve been on two projects that were scooped, so I would know). If you’re in a less crowded field, it helps to know whether it is emerging or stagnant.

“Crowdedness of a field” is a complicated topic that goes well beyond this article. Given my systems biology background, I’ll start simple by comparing the publication rates of some of the common -omics technologies.

In this plot, you can see that genomics has the highest publication rate for all years tested, and there is an interesting spike in publication rate starting in 2013 that levels off in 2017. Proteomics and metabolomics are behind, but still visible on the linear-scale graph.

If you’re doing bioinformatics, graphs like this can be used to determine which classes of bioinformatics tools might be the most needed and how that might change in the future.

Here is the same plot on a log10 scale to look at the smaller fields. Interestingly, papers containing the word “epigenomics” have leveled off since 2010. This might change as ATAC-seq gains single-cell capabilities.

Of all the -omics terms shown here, connectomics emerged most recently, and its rate is still increasing. This plot would be good, for example, for convincing investors that one’s connectome-based startup is “the next big thing.”

Conclusions and where I am going with this

These results are for PubMed, and one major confound to be addressed in later posts is the increased use of preprints (e.g., F1000, bioRxiv) in the single cell analysis field. These don’t have available APIs to mine them (as of now), but I have at least seen a web scraper written for bioRxiv mining that I intend to integrate into my analysis so I don’t have to reinvent the wheel.

For an exhaustive analysis of a given subject, one needs to check all possible wordings. Single cell RNA sequencing can also be called single cell RNA seq, scRNA seq, etc. Mass cytometry is also called “CyTOF.” I encourage you to try this tool with different wordings to double-check any of the results above. You won’t be able to change the scaling, do 2018 projections, or run more complex analyses, but that touches upon the value of using code over GUIs.
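In code, alternative wordings can be combined into a single PubMed query with the `OR` operator, so that a paper using any one of them is counted once. The Python sketch below shows one way to build such a query string. The helper name `build_or_query` is hypothetical, and quoting each wording to force an exact phrase match is my own assumption about what you want, since PubMed may otherwise expand terms automatically.

```python
def build_or_query(wordings):
    """Join alternative wordings into one PubMed query using OR.

    Each wording is wrapped in double quotes so it is searched as an
    exact phrase. Blank entries are dropped.
    """
    cleaned = [w.strip() for w in wordings if w.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty wording is required")
    return " OR ".join(f'"{w}"' for w in cleaned)


# Wordings taken from the paragraph above.
scrna_query = build_or_query(
    ["single cell RNA sequencing", "single cell RNA seq", "scRNA seq"]
)
cytof_query = build_or_query(["mass cytometry", "CyTOF"])
```

The resulting strings can be pasted into the PubMed search bar or passed to whatever search function you use. Comparing the combined count with the count for each wording alone shows how much a single search term misses.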

Future posts will explore questions I alluded to in the beginning, like identifying thought leaders for a given field. As my scripts mature, I will release them publicly on my GitHub. If you have any particular requests and would rather I do the analysis for you, contact me. I’ll make announcements on LinkedIn, Twitter, and my website as these come about.