Using and mining pre-prints to stay ahead of your field, with the help of Twitter
Why do pre-prints matter?
In recent years, pre-prints have gained traction in biology as a way to quickly make scientific findings public while the manuscript is going through peer review. In fast-moving fields like single-cell analysis, this serves a number of purposes, from not getting scooped to simply getting necessary tools and results out to the community as fast as possible.
These days, I do a lot of strategy consulting for organizations within the single-cell analysis field, on both the wet-lab and dry-lab sides. This requires me to be completely up to date with new relevant methods and paradigms, which is no easy task. Knowing how to use, parse, and filter the pre-prints gives me a window into where single-cell analysis is moving. This is a complex topic, and here I’m only going to focus on identifying important pre-prints.
One of the best-known pre-print servers is arXiv, which launched in 1991 and today encompasses many fields such as physics, astronomy, mathematics, and computer science. Here, I focus on bioRxiv, which launched in 2013, motivated by the success of arXiv over the years.
Motivating example 1: dimension reduction for single-cell data
Interestingly, by the time a pre-print is formally published these days, it is often “old news.” For example, take the groundbreaking dimension reduction tool UMAP, developed by Leland McInnes (McInnes & Healy, 2018). Single-cell biologist Evan Newell and his group demonstrated the effectiveness and value of the tool for mass cytometry and single-cell sequencing data early on, and got the results onto bioRxiv on April 10, 2018 (Becht, Dutertre, et al., 2018a). From there, it was quickly adopted across relevant research groups and companies. By the time their paper was published in Nature Biotechnology roughly 8 months later, on December 3, 2018 (Becht, McInnes, et al., 2018b), it had already been widely adopted by the single-cell community. To disregard the pre-prints in this case would have been to completely miss a change in how single-cell biologists analyze data.
Motivating example 2: clustering mass cytometry data
In the first half-decade of mass cytometry’s use, many scientists were developing clustering tools for this type of data, and there was a lot of uncertainty as to which tools were best suited to which types of data. To address this problem, Lukas Weber and Mark Robinson rigorously evaluated the top clustering tools for flow and mass cytometry data, using the F1 score across multiple datasets with expert manual gating as the reference (see graphic below). The authors found that FlowSOM, a tool based on self-organizing maps and meta-clustering, most effectively balanced run-time and accuracy across multiple datasets, while another mean-shift based tool called X-shift was best suited for rare subset identification.
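For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. Below is a minimal Python sketch for a single cell population, assuming manual gating is treated as ground truth. The cell IDs and the function name `f1_score` are hypothetical, and the way the published benchmark matches clusters to gated populations is not reproduced here.

```python
def f1_score(cluster_cells, gated_cells):
    """F1 score for one cluster against one manually gated population.

    Both arguments are collections of cell IDs (hypothetical here).
    """
    cluster_cells, gated_cells = set(cluster_cells), set(gated_cells)
    true_pos = len(cluster_cells & gated_cells)
    if true_pos == 0:
        return 0.0  # no overlap: precision and recall are both zero
    precision = true_pos / len(cluster_cells)  # share of the cluster that is correct
    recall = true_pos / len(gated_cells)       # share of the gated population recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 8 gated cells, and a cluster that captures 6 of them
# plus 2 cells from elsewhere.
gated = {1, 2, 3, 4, 5, 6, 7, 8}
cluster = {1, 2, 3, 4, 5, 6, 9, 10}
score = f1_score(cluster, gated)
print(round(score, 2))  # prints 0.75
```

A tool that finds a rare population but also sweeps in many unrelated cells is penalized through precision, while a tool that splits the population across clusters is penalized through recall.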
This paper was initially posted as a pre-print on bioRxiv on April 8, 2016 (Weber & Robinson, 2016a). It wasn’t until December 19, 2016 that it was published in Cytometry Part A (Weber & Robinson, 2016b). To read the pre-print was to have 8 months of lead time in understanding which clustering algorithm to use on the dataset that just came out of your machine.
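The lead times in both motivating examples can be checked with simple date arithmetic. Here is a small Python sketch using only the dates quoted in this post; the dictionary labels and the `lead_time_days` helper are illustrative names, and the average month length of 30.44 days is an approximation.

```python
from datetime import date

# (pre-print date, journal publication date), as stated in the text.
papers = {
    "UMAP evaluation (Becht et al.)": (date(2018, 4, 10), date(2018, 12, 3)),
    "Clustering comparison (Weber & Robinson)": (date(2016, 4, 8), date(2016, 12, 19)),
}


def lead_time_days(preprint, published):
    """Number of days between pre-print posting and journal publication."""
    return (published - preprint).days


lead_times = {name: lead_time_days(*dates) for name, dates in papers.items()}

for name, days in lead_times.items():
    # 30.44 is the average length of a month in days (approximation).
    print(f"{name}: {days} days (~{days / 30.44:.1f} months)")
```

This gives 237 and 255 days respectively, which is where the "roughly 8 months" figure comes from.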
How do you mine the pre-prints?
Now that I’ve motivated the use of the pre-prints, let’s talk about how exactly I go about reading them, with my focus on bioRxiv. The goal is to efficiently see what has been posted and decide what to focus on. One main problem here is that the scientific literature is overwhelming in volume. How do we determine whether a pre-print is notable and relevant? Unfortunately, as far as I know, one cannot track citation counts for pre-prints. Furthermore, citation counts mean nothing for pre-prints that are only days to weeks old.
Fortunately, every new bioRxiv submission gets automatically posted to Twitter via bots, where there is a robust community of scientists liking, commenting on, and re-tweeting the papers. A heavily liked pre-print on Twitter isn’t necessarily going to change the world. It only serves as a proxy for what the scientific community (or at least the subset of it on Twitter) is buzzing about at the time, which is indeed valuable to know.
Conveniently, Twitter has an API that allows you to mine at least recent tweets. There are many ways to mine Twitter using this API in whichever language is your favorite. Here, I use the twitteR R package only because R is the language I usually use. I pull tweets from various bioRxiv-relevant handles and turn them into a spreadsheet ordered by the number of times each tweet was liked. I exclude anything that is a retweet, as retweets by popular users will have more likes and therefore bias the results.
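The filtering and ranking step can be sketched independently of any API client. The following Python example is an illustration only and is not the R code from the analysis: the tweet records, handles, and field names (`likes`, `retweets`, `is_retweet`) are hypothetical stand-ins for whatever your client returns.

```python
import csv
import io

# Hypothetical tweet records; a real client would return these from the API.
tweets = [
    {"handle": "example_bot_1", "text": "Preprint A", "likes": 12, "retweets": 5, "is_retweet": False},
    {"handle": "example_bot_2", "text": "RT Preprint B", "likes": 90, "retweets": 40, "is_retweet": True},
    {"handle": "example_bot_1", "text": "Preprint C", "likes": 31, "retweets": 9, "is_retweet": False},
    {"handle": "example_bot_2", "text": "Preprint D", "likes": 0, "retweets": 0, "is_retweet": False},
]


def rank_original_tweets(records):
    """Drop retweets, then order the remaining tweets by likes, most liked first."""
    originals = [t for t in records if not t["is_retweet"]]
    return sorted(originals, key=lambda t: t["likes"], reverse=True)


def to_csv(rows):
    """Serialize the ranked tweets as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["handle", "text", "likes", "retweets"],
        extrasaction="ignore",  # skip fields not listed, such as is_retweet
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


ranked = rank_original_tweets(tweets)
print(to_csv(ranked))
```

In this toy data the retweet has the most likes, which is exactly the bias that excluding retweets is meant to avoid.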
The API allowed me to go back 9 days, which was perfect because I had been on vacation for a week. While I care mainly about the single-cell literature in terms of new wet-lab and dry-lab methods, I pulled all results so I could see where the single-cell papers stacked up in relation to other domains.
The results
I pulled bioRxiv tweets between March 15 and March 24, 2019. The Spearman correlation between likes and re-tweets was 0.65. For both likes and re-tweets, the distribution was heavy-tailed, with most papers getting zero or near zero of each. Focusing on the top 10 ranked by number of likes, two of them were in the single-cell domain. Both of these revolved around new methods, which is exactly the kind of development I need to stay ahead of.
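Spearman correlation is a natural choice for heavy-tailed counts like these because it compares ranks rather than raw values. Here is a minimal pure-Python sketch, assuming tied values share the average of their ranks. The like and re-tweet counts are made-up illustrative numbers, not the data from this analysis, and the helper names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up counts with many zeros and one outlier, mimicking a heavy tail.
likes = [0, 0, 1, 3, 8, 40, 2, 0]
retweets = [0, 1, 0, 2, 5, 22, 1, 0]
rho = spearman(likes, retweets)
print(round(rho, 2))
```

Because only the ordering matters, the single tweet with 40 likes does not dominate the result the way it would in a Pearson correlation on the raw counts.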
From pre-prints to strategy
The top-liked paper across all domains was one where single-cell transcriptomes of human brain tissue (from a previous paper) were merged with an interactome reference to provide novel cell-type-specific, disease-specific gene networks (Mohammadi, Davila-Velderrain, & Kellis, 2019). This got me thinking about the trend of building reference maps in single-cell data, with this paper providing an example of how we could leverage reference datasets to find novel biology in new datasets. As more reference maps are built out, like the Human Cell Atlas, we will likely see a trend toward figuring out how to leverage these as we analyze new data. Perhaps this will be more entrenched in the community after this paper gets through peer review.
The second single-cell paper in the list was a method called Sci-Hi-C (Ramani et al., 2019). This is a single-cell version of Hi-C, which maps 3-D genome organization. The authors developed the single-cell version to provide resolution that is not blurred by cell-to-cell heterogeneity in chromatin organization. The paper in its current state provides a proof of concept, showing the potential of this method. This points toward the trend of turning every possible type of next-gen sequencing method into a single-cell method. It also reminds me of the old days of mass cytometry, where a lot of effort went into turning every possible fluorescence flow cytometry method (e.g., live-dead staining) into a mass cytometry counterpart.
Conclusions
An interesting pre-print I came across in my analysis compared the quality of pre-prints with that of peer-reviewed articles (Carneiro et al., 2019). Using their criteria, the authors found a statistically significant difference in quality between pre-prints and peer-reviewed papers, but this difference was small.
Even if the pre-prints are of slightly lower quality, they still serve their purpose of helping me stay on top of the newest tools and paradigms in the single-cell field. I don’t want to miss, or have any of my clients miss, an opportunity like UMAP because I was looking only at PubMed and not the pre-prints.
To sum up, I have a specific process for handling the pre-print literature once I’ve found papers of interest:
- I look for things that could become widely applicable. If it’s a novel bioinformatic tool, I look for the software. It’s not always there, and it’s not always in the most readable form. This is where you have to make a judgement call.
- I look for new concepts, paradigms, and trends that could shape the field in the near future. This is critical to helping my clients with overall direction.
Use the pre-prints. They are a free half-year glimpse into the future, along with tools you can start to use today.
The code, necessary files, and instructions for doing this exact analysis are on my GitHub at https://github.com/tjburns08/twitter-mining-scilit. If you have any questions or comments, either comment below, message me directly through this site, or contact me at info@tylerjburns.com.
References
Becht, E., Dutertre, C.-A., Kwok, I. W. H., Ng, L. G., Ginhoux, F., & Newell, E. W. (2018a). Evaluation of UMAP as an alternative to t-SNE for single-cell data. bioRxiv, 1–10. http://doi.org/10.1101/298430
Becht, E., McInnes, L., Healy, J., Dutertre, C.-A., Kwok, I. W. H., Ng, L. G., et al. (2018b). Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 1–10. http://doi.org/10.1038/nbt.4314
Burns, T. J. (2018). How to utilize scientific literature trends to gain information about a topic. Medium. https://medium.com/@tjburns_72591/how-to-utilize-scientific-literature-trends-to-gain-intuition-about-a-topic-b5c554e3d280.
Carneiro, C. F. D., Queiroz, V. G. S., Moulin, T. C., Carvalho, C. A. M., Haas, C. B., Rayêe, D., et al. (2019). Comparing quality of reporting between preprints and peer-reviewed articles in the biomedical literature. bioRxiv, 1–10. http://doi.org/10.1101/581892
McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv e-prints, 1802.03426.
Mohammadi, S., Davila-Velderrain, J., & Kellis, M. (2019). Single-cell interactomes of the human brain reveal cell-type specific convergence of brain disorders. bioRxiv, 1–23. http://doi.org/10.1101/586859
Ramani, V., Deng, X., Qiu, R., Lee, C., Disteche, C. M., Noble, W. S., et al. (2019). Sci-Hi-C: a single-cell Hi-C method for mapping 3D genome organization in large number of single cells. bioRxiv, 34(13), i96–25. http://doi.org/10.1101/579573
Weber, L. M., & Robinson, M. D. (2016a). Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. bioRxiv, 1–28.
Weber, L. M., & Robinson, M. D. (2016b). Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytometry Part A, 89(12), 1084–1096. http://doi.org/10.1002/cyto.a.23030