Using and mining pre-prints to stay ahead of your field, with the help of Twitter

Tyler Burns
Mar 27, 2019
Image from Adobe Stock user Worawut

Why do pre-prints matter?

In recent years, pre-prints have gained traction in biology as a way to make scientific findings public quickly while the manuscript goes through peer review. In fast-moving fields like single-cell analysis, this serves a number of purposes, from avoiding getting scooped to simply getting necessary tools and results out to the community as fast as possible.

These days, I do a lot of strategy consulting for organizations within the single-cell analysis field, on both the wet-lab and dry-lab sides. This requires me to be completely up to date with new relevant methods and paradigms, which is no easy task. Knowing how to utilize, parse, and filter the pre-prints gives me a window into where single-cell analysis is moving. This is a complex topic, and here I’m only going to focus on identifying important pre-prints.

One of the best-known pre-print servers is arXiv, which launched in 1991 and today encompasses many fields, including physics, astronomy, mathematics, and computer science. Here, I focus on bioRxiv, which was introduced in 2013, motivated by the success of arXiv over the years.

Motivating example 1: dimension reduction for single-cell data

Interestingly, these days by the time a pre-print gets published, it is often “old news.” For example, take the groundbreaking dimension reduction tool UMAP, developed by Leland McInnes and colleagues (McInnes & Healy, 2018). Single-cell biologist Evan Newell and his group demonstrated the effectiveness and value of the tool for mass cytometry and single-cell sequencing data early on, and quickly got the results onto bioRxiv on April 10, 2018 (Becht, Dutertre, et al., 2018a). From there, it was quickly adopted across relevant research groups and companies. By the time their paper was published in Nature Biotechnology roughly eight months later, on December 3, 2018 (Becht, McInnes, et al., 2018b), UMAP had already been widely adopted by the single-cell community. To disregard the pre-prints in this case would be to completely miss a change in how single-cell biologists analyze data.

The process of a pre-print changing a large piece of data analysis and visualization for the single-cell field. Reading the pre-print potentially gave scientists an eight-month lead over those who didn’t read pre-prints.

Motivating example 2: clustering mass cytometry data

Back in the first half-decade of mass cytometry’s implementation, many scientists were developing clustering tools for this type of data, and there was a lot of uncertainty as to which tools were better suited for which types of data. To address this problem, Lukas Weber and Mark Robinson rigorously evaluated the top clustering tools for flow and mass cytometry data, using the F1 score against expert manual gating across multiple datasets (see graphic below). The authors found that FlowSOM, a tool based on self-organizing maps and meta-clustering, most effectively balanced run-time and accuracy across multiple datasets, while X-shift, a mean-shift based tool, was best suited for rare subset identification.
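The core of this kind of evaluation can be sketched in a few lines: for each manually gated population, compute the F1 score (the harmonic mean of precision and recall) against every cluster, keep the best-matching cluster, and average over populations. The following Python sketch uses toy labels and a simplified matching scheme; it illustrates the idea, not the authors’ actual pipeline.

```python
def f1(true_pop, cluster, true_labels, cluster_labels):
    """F1 score between one manually gated population and one cluster."""
    tp = sum(1 for t, c in zip(true_labels, cluster_labels)
             if t == true_pop and c == cluster)
    n_cluster = sum(1 for c in cluster_labels if c == cluster)
    n_true = sum(1 for t in true_labels if t == true_pop)
    if tp == 0:
        return 0.0
    precision = tp / n_cluster
    recall = tp / n_true
    return 2 * precision * recall / (precision + recall)

def mean_f1(true_labels, cluster_labels):
    """Match each gated population to its best-F1 cluster, then average."""
    populations = set(true_labels)
    clusters = set(cluster_labels)
    return sum(max(f1(p, c, true_labels, cluster_labels) for c in clusters)
               for p in populations) / len(populations)

# Toy example: 6 cells, 2 manually gated populations, 2 clusters
truth = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
clusters = [0, 0, 1, 1, 1, 1]
print(round(mean_f1(truth, clusters), 3))  # prints 0.829
```

A perfect clustering scores 1.0, and merging or splitting populations pulls the score down, which is what makes it a useful single-number summary for comparing tools.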

This paper was initially posted as a pre-print on bioRxiv on April 8, 2016 (Weber & Robinson, 2016a). It wasn’t until December 19, 2016 that it was published in Cytometry Part A (Weber & Robinson, 2016b). To read the pre-print was to have eight months of lead time in knowing which clustering algorithm to use on the dataset that just came off the machine.

F1 score vs. run-time for various clustering tools on a mass cytometry dataset (one of many). I’ve seen this exact plot in many mass cytometry-centric talks, and it has motivated analytics companies like FlowJo and Cytobank to adopt FlowSOM into their software. Picture taken from (Weber & Robinson, 2016a).

How do you mine the pre-prints?

Now that I’ve motivated the use of pre-prints, let’s talk about how exactly I go about reading them, with a focus on bioRxiv. The goal is to see efficiently what has been published and decide what to focus on. One main problem here is that the scientific literature is overwhelming in volume. How do we determine whether a pre-print is notable and relevant? Unfortunately, one cannot (to my knowledge) track citation counts for pre-prints. Furthermore, citation tracking means nothing for pre-prints that are only days to weeks old.

Fortunately, every new bioRxiv submission gets automatically posted to Twitter via bots, where there is a robust community of scientists liking, commenting on, and re-tweeting the papers. A heavily liked pre-print on Twitter doesn’t necessarily mean it’s going to change the world. It only serves as a proxy for what the scientific community (or at least the subset of it on Twitter) is buzzing about at the time, which is indeed valuable to know.

Accordingly, Twitter has an API that allows you to mine at least recent tweets, and there are packages for using it in whichever language is your favorite. Here, I use the twitteR R package, only because R is the language I usually use. I pull tweets from various bioRxiv-relevant handles and turn them into a spreadsheet ordered by number of likes. I exclude retweets, as retweets by popular users will accumulate more likes and therefore bias the results.
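My actual script uses the twitteR package in R, but the core filtering and ranking logic is simple enough to sketch in a few lines. Here is a Python illustration over a handful of mock tweets; the field names (`screen_name`, `favorites`, `is_retweet`) are assumptions made for the example, not the API’s actual schema.

```python
# Toy stand-in for tweets already pulled from the API.
tweets = [
    {"screen_name": "biorxiv_bioinfo", "text": "Paper A", "favorites": 42, "is_retweet": False},
    {"screen_name": "biorxiv_genomic", "text": "Paper B", "favorites": 7,  "is_retweet": False},
    {"screen_name": "biorxiv_bioinfo", "text": "RT Paper A", "favorites": 90, "is_retweet": True},
    {"screen_name": "biorxiv_neursci", "text": "Paper C", "favorites": 18, "is_retweet": False},
]

# Drop retweets (their like counts reflect the retweeter's audience,
# not the paper itself), then rank what remains by likes, descending.
ranked = sorted((t for t in tweets if not t["is_retweet"]),
                key=lambda t: t["favorites"], reverse=True)

for t in ranked:
    print(t["favorites"], t["text"])
```

Writing `ranked` out to a CSV gives the spreadsheet described above, ready to be skimmed top-down.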

Simplified process map of Twitter-mining the pre-prints.

The API allowed me to go back nine days, which was perfect because I had been on vacation for a week. While I care mainly about the single-cell literature in terms of new wet-lab and dry-lab methods, I pulled all results so I could see how the single-cell papers stacked up relative to other domains.

Example of what the data look like after the Twitter-mining script. Columns are, from left to right: the Twitter user (bioRxiv bots), the screen name, when the paper-linking tweet was made, the number of likes, the number of retweets, and the title of the paper.

The results

I pulled bioRxiv tweets between March 15 and March 24, 2019. The Spearman correlation between likes and re-tweets was 0.65. Both likes and re-tweets followed a heavy-tailed distribution, with most papers getting zero or near-zero of each. Of the top 10 papers ranked by number of likes, two were in the single-cell domain. Both revolved around new methods, which is exactly what I need to stay ahead of.
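For anyone wanting to reproduce the correlation step without extra dependencies, Spearman correlation is just the Pearson correlation of the ranks (with ties given their average rank). A small Python sketch with toy like and retweet counts (not my real data):

```python
def ranks(xs):
    """Average ranks, handling ties (1-indexed)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy like/retweet counts, heavy-tailed like the real pull
likes = [0, 0, 1, 2, 3, 10, 50]
retweets = [0, 1, 0, 1, 2, 8, 30]
print(round(spearman(likes, retweets), 2))  # prints 0.88
```

Rank-based correlation is the right choice here precisely because the counts are heavy-tailed: a handful of viral papers would dominate a plain Pearson correlation.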

Left: bioRxiv articles linked from Twitter bots, arranged by number of likes. Right: the relationship between number of likes and number of re-tweets for these articles.

From pre-prints to strategy

The top-liked paper across all domains was one in which single-cell transcriptomes of human brain tissue (from a previous paper) were merged with an interactome reference to produce novel cell-type-specific, disease-specific gene networks (Mohammadi, Davila-Velderrain, & Kellis, 2019). This got me thinking about the trend of building reference maps in single-cell data, with this paper providing an example of how we might leverage reference datasets to find novel biology in new datasets. As more reference maps are built out, like the Human Cell Atlas, we will likely see a trend toward figuring out how to leverage them as we analyze new data. Perhaps this will be more entrenched in the community after this paper gets through peer review.

The second single-cell paper in the list described a method called Sci-Hi-C (Ramani et al., 2019). This is a single-cell version of Hi-C, which maps 3-D genome organization. The authors developed the single-cell version to provide resolution that is not blurred by cell-to-cell heterogeneity in chromatin organization. The paper in its current state provides a proof of concept, showing the potential of the method. This points toward the trend of turning every possible type of next-gen sequencing method into a single-cell method. It reminds me of the old days of mass cytometry, when a lot of effort went into turning every possible fluorescence flow cytometry method (e.g., live-dead staining) into a mass cytometry counterpart.

Conclusions

An interesting pre-print I came across in my analysis compared the quality of pre-prints with that of peer-reviewed articles (Carneiro et al., 2019). Using their criteria, the authors found a statistically significant difference in quality between pre-prints and peer-reviewed papers, but the difference was small.

Even if the pre-prints are slightly lower in quality, they still serve their purpose of helping me stay on top of the latest tools and paradigms in the single-cell field. I don’t want myself, or any of my clients, to miss an opportunity like UMAP because I was looking only at PubMed and not the pre-prints.

To sum up, here is my process for handling the pre-print literature once I’ve found papers of interest:

  1. I look for things that could become widely applicable. If it’s a novel bioinformatic tool, I look for the software. It’s not always there, and it’s not always in the most readable form. This is where you have to make a judgment call.
  2. I look for new concepts, paradigms, and trends that could shape the field in the near future. This is critical to helping my clients with overall direction.

Use the pre-prints. They are a free half-year glimpse into the future, along with tools you can start to use today.

The code, necessary files, and instructions for doing this exact analysis are on my GitHub at https://github.com/tjburns08/twitter-mining-scilit. If you have any questions or comments, either comment below, message me directly through this site, or contact me at info@tylerjburns.com.

References

Becht, E., Dutertre, C.-A., Kwok, I. W. H., Ng, L. G., Ginhoux, F., & Newell, E. W. (2018a). Evaluation of UMAP as an alternative to t-SNE for single-cell data. bioRxiv, 1–10. http://doi.org/10.1101/298430

Becht, E., McInnes, L., Healy, J., Dutertre, C.-A., Kwok, I. W. H., Ng, L. G., et al. (2018b). Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 1–10. http://doi.org/10.1038/nbt.4314

Burns, T. J. (2018). How to utilize scientific literature trends to gain information about a topic. Medium. https://medium.com/@tjburns_72591/how-to-utilize-scientific-literature-trends-to-gain-intuition-about-a-topic-b5c554e3d280.

Carneiro, C. F. D., Queiroz, V. G. S., Moulin, T. C., Carvalho, C. A. M., Haas, C. B., Rayêe, D., et al. (2019). Comparing quality of reporting between preprints and peer-reviewed articles in the biomedical literature. bioRxiv, 1–10. http://doi.org/10.1101/581892

McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv e-prints, 1802.03426.

Mohammadi, S., Davila-Velderrain, J., & Kellis, M. (2019). Single-cell interactomes of the human brain reveal cell-type specific convergence of brain disorders. bioRxiv, 1–23. http://doi.org/10.1101/586859

Ramani, V., Deng, X., Qiu, R., Lee, C., Disteche, C. M., Noble, W. S., et al. (2019). Sci-Hi-C: a single-cell Hi-C method for mapping 3D genome organization in large number of single cells. bioRxiv. http://doi.org/10.1101/579573

Weber, L. M., & Robinson, M. D. (2016a). Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. bioRxiv, 1–28.

Weber, L. M., & Robinson, M. D. (2016b). Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytometry, 89(12), 1084–1096. http://doi.org/10.1002/cyto.a.23030
