Considering assessments of scientific productivity and ‘ghost authors’

We will start with the final figure in the preprint (Figure 9 of reference 1), which aims to assess lab productivity as a function of lab funding, in dollars. The dollars analysis has the same counting flaw as the GSI scoring, but in even more extreme form. For virtually all P01/U19s, all of the funds for the grant were assigned to the Lead PI: the Lead PI was assigned ~$2 million, and the remaining 3–5 PIs in the program were assigned $0, even though all of the PIs usually receive approximately equal amounts of money. If one group of individuals is counted as having ~$2 million, and another group is counted as having $0, but the two groups actually get approximately equal dollars from the grant, which group is going to look better in a graph of productivity? The PIs assigned $0. In the dataset, there are 2,563 Lead PIs of P01/P42/U19s who are all penalized this way (the entire > 21 annual GSI cohort was 939, for comparison). In contrast, there are 10,982 non-lead PIs of P01s, P42s, and U19s, who are each assigned $0. Almost 11,000 PIs out of the 71,000 in the dataset get “free” money in the counting system used. Of course observed rates of return appear to decrease at somewhat less than $1 million when graphed, because the counting scheme is disconnected from the reality of the grants. Conclusions were drawn from a fundamentally flawed dataset.
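To make the counting artifact concrete, here is a minimal sketch. It assumes a hypothetical program with five PIs who split a ~$2 million grant equally and who all produce the same (made-up) output of 10 units. The class, the function names, and the numbers other than the ~$2 million and $0 assignments are illustrative assumptions, not values from the dataset.

```python
from dataclasses import dataclass

GRANT_TOTAL = 2_000_000  # ~$2 million program grant, as in the text
N_PIS = 5                # one Lead PI plus four other PIs (hypothetical)


@dataclass
class PI:
    name: str
    is_lead: bool
    actual_dollars: float  # assumed equal split of the grant
    output: float          # hypothetical productivity units, identical for all


def counted_dollars(pi: PI) -> float:
    """Counting scheme described above: whole grant to the Lead PI, $0 to the rest."""
    return float(GRANT_TOTAL) if pi.is_lead else 0.0


def output_per_million(output: float, dollars: float) -> float:
    """Output per $1M; reported as infinite when the counted dollars are zero."""
    return output / (dollars / 1e6) if dollars > 0 else float("inf")


pis = [PI(f"PI-{i}", i == 0, GRANT_TOTAL / N_PIS, 10.0) for i in range(N_PIS)]

for pi in pis:
    real = output_per_million(pi.output, pi.actual_dollars)
    counted = output_per_million(pi.output, counted_dollars(pi))
    role = "lead" if pi.is_lead else "non-lead"
    print(f"{pi.name} ({role}): real={real:.1f} per $1M, counted={counted:.1f} per $1M")
```

Under these assumptions every PI has the same real return per dollar, yet the Lead PI appears five times less productive than reality and the non-lead PIs appear to produce output from no money at all.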

On being a scientist

Let’s step back for a moment. Is it a worthwhile endeavor to attempt to assess scientific productivity? Absolutely. Attention to productivity has been one of the things that has made America and Americans so successful over the past 150+ years. Measurement of productivity is straightforward for some jobs, such as assembly line production of cars. But even for a game as simple as basketball it is quite difficult to quantify the value and productivity of individual players. For soccer, the most popular sport in the world, there is no accepted quantitative metric of player productivity. Being a scientist is a complex job. There are many different ways to be a scientist. Some scientists are extremely creative. Some scientists are workhorses. Some scientists are fantastic at being the ‘glue’ that brings other scientists together to create something new in synergy, the Connectors described by Malcolm Gladwell. Some scientists have small labs, like outstanding scientist Bruce Alberts, who has been a proponent of small labs since at least 1985. Other scientists have large labs, like many of the labs that accomplished the Human Genome Project, which was probably the most monumental scientific feat of the past 20 years. Additionally, some science is simply more expensive than other science. Bacterial genetics is fantastic, and fantastically powerful, and wonderfully inexpensive because of the nature of the materials involved. Much of immunology is quite expensive, because it usually necessitates extensive animal or human studies due to the complexities of the in vivo biology. Science is also a risk-based endeavor. Accomplishing true breakthroughs in science can require a lot of failures, with perhaps 90% of the experiments failing, and can require attempting projects that ultimately never succeed. Thus, it is not surprising that there is currently no good quantitative metric for scientific productivity. The few that have been attempted or implemented to date (e.g., number of papers published in high impact journals) have been widely panned as trivializations of scientific productivity. Again, is it a worthwhile endeavor to attempt to assess scientific productivity? Absolutely. But, to have any predictive power at all, it will require a sophisticated set of metrics that have been validated to be measuring causal effects, and which have yet to be developed.

GSI / RCI Points

As noted above, in the preprint (1), one group of scientists has been counted as having ~$2 million, and another group has been counted as having $0, but the two groups actually get approximately equal dollars from the grant. It is a foregone conclusion that by any measurement of productivity the PIs assigned ‘free money’ ($0) will appear very good compared to those penalized with $2 million they don’t have. This same problem is present in the GSI point system, in less extreme form. The GSI analysis is based on counting the number of grants (average per year) using a point system. The stated point system was: R01 PI = 7 GSI points; P01/U19 Lead PI = 7 GSI points; P01/U19 other PI = 6 GSI points. From this it was concluded that 21 GSI points should be the maximum any PI can have at any time, which was claimed to be equivalent to 3 R01s. However, looking at the raw data, that is inaccurate. R01s are counted as 7 points, but being the head of a U19 or P01 with a project was frequently counted as at least 19 GSI points in the dataset, not 7:

7 points as head of the P01/U19

6 points for being a PI of a project

6 points for being the PI of the default Admin Core

Additionally, that point penalty was not assigned consistently. Looking at only a dozen randomly selected examples of such grants (out of thousands), the GSI points assigned to the Lead PI were frequently 19, but ranged from 7 to 24. One individual was actually assigned 48 GSI points for a single grant, and they were not even the Lead PI. In part, this occurs because PIs of core facilities are assigned 6 points per core facility, which was not clear from the published scoring table. But the Lead PI penalty is different. By looking at the source of the data imported into this study, cross-referenced to the data tables, one can see that this systematic inconsistency occurs because the NIH RePORTer grant tracking system files grants in multiple ways. When Lauer et al. imported the data, they took data fields that resulted in single, double, or triple counting of a U19/P01 Lead PI, depending on unrelated grant filing differences in RePORTer (i.e., whole project plus subproject). That isn’t particularly surprising, because RePORTer is an excellent NIH database that was designed for other purposes. Regarding data analysis, it comes as no surprise that when a PI is penalized by double or triple counting of the same grant by GSI, the result is an apparent reduction in the productivity metric for some PIs with higher GSI counts. That pattern is made murkier by systematically inconsistent GSI counting of the same grant type for different PIs. P01s, U19s, and related grants are a major category of grant for highly active labs. Additionally, the Lead PI of the P01/U19 is likely to be the most senior investigator, independent of the resources allocated, and that would skew GSI-to-productivity associations because of the additional ‘empty’ 7 GSI points assigned. Thus, directly analogous to the grant dollars analysis, it is a foregone conclusion that PIs assigned 2–3x excess ‘ghost’ GSI points will incorrectly appear less productive, introducing an artifactual flattening of productivity.
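The arithmetic behind the 19-point figure, and its effect on an output-per-point ratio, can be sketched as follows. The point values come from the text above. The function names and the output value of 14 units are hypothetical, chosen only for illustration.

```python
# Point values as stated in the text.
POINTS = {"R01": 7, "P01_U19_lead": 7, "project_pi": 6, "core_pi": 6}


def lead_pi_points(n_projects: int = 1, n_cores: int = 1) -> int:
    """GSI points for a Lead PI who also holds projects and cores on the same grant."""
    return (
        POINTS["P01_U19_lead"]
        + n_projects * POINTS["project_pi"]
        + n_cores * POINTS["core_pi"]
    )


def output_per_point(output: float, gsi_points: int) -> float:
    """Apparent productivity: output divided by assigned GSI points."""
    return output / gsi_points


nominal = POINTS["P01_U19_lead"]  # what the published scoring table suggests
observed = lead_pi_points()       # head + one project + the Admin Core

hypothetical_output = 14.0        # made-up output, identical in both cases
print(f"nominal points:  {nominal}, output/point = "
      f"{output_per_point(hypothetical_output, nominal):.2f}")
print(f"observed points: {observed}, output/point = "
      f"{output_per_point(hypothetical_output, observed):.2f}")
```

With identical output, moving from 7 to 19 assigned points lowers the apparent output per point by a factor of 19/7, which is the kind of artifactual flattening described above.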

The publications dataset: Missing papers and ‘ghost authors’

The scientific papers dataset used to measure productivity misses papers, lots of papers. This occurs because the dataset only counts papers that are linked to a PI through a grant in certain ways. In one personal example (it was easiest for me to validate and triple check my own papers, given that the dataset has an anonymizing person_id), my most highly cited paper is not included (a 99.9%ile paper with 1,400 citations by Google Scholar; Annual Review of Immunology 2011 (2)). (At the end of that particular paper I cited “grants from the NIH”, because some journals said they did not allow specific grant citations at the time; I cited specific grant numbers in the relevant grant reports to the NIH, but specific grant numbers did not make it into a particular online database.) Examining the dataset in greater detail, 6 out of 34 of my papers were missed (for 2005–2014, the funding period examined). Those are authorship errors that undermine an accurate assessment of productivity. As it happens, each of those 6 missing papers is a > 95%ile ranked paper by citations, which has obvious implications for assessment of productivity (2–7). For this one PI examined, the dataset is only 82% accurate by paper count. It is only 66% accurate for counting RCR points (determined by using the NIH iCite tool). Corrected attribution of those 6 papers shifted the apparent PI productivity from 13%ile to 5%ile among all PIs. All of the productivity metrics used in the analysis depend on the paper and citation associations being accurate for each PI, which they are not.

Much more troubling was that the dataset attributed 20 papers (of 54) to me that are not mine. I am not an author of those papers. I am not acknowledged. I am a ‘ghost author’ of those papers in the dataset. That means 26 of 54 (48%) of all papers associated with me in the database are misattributed, either missing or erroneously added.
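The two kinds of error described here, missing papers and ghost-authored papers, amount to a comparison of two sets. A minimal sketch follows, using a small made-up example rather than my actual paper list. The function name and the paper IDs are hypothetical.

```python
def attribution_errors(true_papers: set, dataset_papers: set) -> dict:
    """Compare a PI's real papers with the papers a dataset assigns to them."""
    missing = true_papers - dataset_papers   # real papers the dataset missed
    ghost = dataset_papers - true_papers     # papers wrongly assigned to the PI
    all_associated = true_papers | dataset_papers
    return {
        "missing": missing,
        "ghost": ghost,
        "paper_count_accuracy": len(true_papers & dataset_papers) / len(true_papers),
        "misattributed_fraction": (len(missing) + len(ghost)) / len(all_associated),
    }


# Hypothetical toy example: 5 real papers, 2 of them missed, 3 ghost papers added.
true_papers = {"p1", "p2", "p3", "p4", "p5"}
dataset_papers = {"p1", "p2", "p3", "g1", "g2", "g3"}

report = attribution_errors(true_papers, dataset_papers)
print(sorted(report["missing"]), sorted(report["ghost"]))
print(f"accuracy by paper count: {report['paper_count_accuracy']:.0%}")
print(f"misattributed fraction:  {report['misattributed_fraction']:.1%}")
```

The same comparison could be run for any PI who is able to match their anonymized person_id to their own publication list.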

Ghost authors are widespread in the dataset. In the papers assigned to me in the dataset, the average number of PIs assigned credit was 10, when it should be ~2. There are papers with ~12 real authors and over 20 ghost authors. Most dramatically, ghost authors are likely to be given falsely high ‘maximum RCR’ scores for ‘their most influential paper’ for papers they were not actually authors on. I have a single-author paper ranked 99.3%ile by citations (8). The GSI database has spuriously assigned 15 extra ghost authors to that paper. To reiterate, my single-author paper has 16 authors in the dataset. I therefore checked how this misattribution of authorship affected the perceived productivity of those PIs in the maximum RCR outcome. For at least 5 of those 15 PIs, my single-author paper is attributed to them as their max RCR paper (I quit counting after 5 examples). While I consider myself generous, giving other people credit for my single-author paper is not valid. Several of those PIs had low GSIs, and that max RCR score, from my paper, then puts those PIs far above average productivity for the low GSI group. To reiterate: a single-author, high impact paper of mine is erroneously assigned to at least 5 other PIs as their best paper. Clearly these types of errors prevent any meaningful return on investment analysis.

How did this happen? It turns out that the introduction of ‘ghost authors’ was on purpose. Extraordinarily, authorship of papers was defined in a novel way. Credit for a paper was assigned based on association to any grant cited in the paper. Thus, every PI on every project and every core facility was given equal credit in the dataset for any paper published by any author who cites a grant in a paper for any degree of support. Such attribution of ‘ghost authors’ is a new concept. Therefore, in the definition of scientific credit used in the analysis, it was considered entirely accurate to give 15 other PIs full credit for my single PI paper, when attempting to assess individual PI productivity. That creation of ‘ghost authors’ is a distinct break from 100+ years of scientific authorship.
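The grant-based definition of credit described above can be sketched in a few lines. This is an illustration of the rule as I understand it from the dataset, not the code used in the preprint. The grant IDs, PI names, and function name are hypothetical.

```python
# Hypothetical grant roster: every project and core PI on a program grant is listed.
grant_pis = {
    "U19-X": ["PI-A", "PI-B", "PI-C", "PI-D"],
    "R01-Y": ["PI-A"],
}

# A single-author paper by PI-A that cites the program grant for support.
paper = {"authors": ["PI-A"], "grants_cited": ["U19-X"]}


def credited_pis(paper: dict, grant_pis: dict) -> list:
    """Credit every PI on every cited grant, whether or not they are an author."""
    credited = []
    for grant in paper["grants_cited"]:
        for pi in grant_pis.get(grant, []):
            if pi not in credited:
                credited.append(pi)
    return credited


credited = credited_pis(paper, grant_pis)
ghost_authors = [pi for pi in credited if pi not in paper["authors"]]

print("credited:", credited)
print("ghost authors:", ghost_authors)
```

In this toy case a paper with one real author ends up credited to four PIs, three of whom never appear in the author list.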

Dilution of credit

Another major decision in the data analysis approach was to divide credit for a paper equally across all PIs assigned authorship. The citation metric used was RCR. RCR is a ratio, developed by another group, led by George Santangelo (9). An average paper gets a score of 1. A top 1% paper gets a score of approximately 14. Highly influential papers (the one out of 1000 most influential papers) score 29 or better. By the way, the iCite website is excellent, and the RCR metric itself is very well done for doing what it was designed to do (9). Kudos to George Santangelo and his team at the NIH for that work.

Let’s consider this scoring system in action, when adding papers. How many average papers does it take to equal a top one out of 1000 paper? By RCR the answer is 29 average papers. So, the Mount Everest analogy is surprisingly accurate. Climbing 29 hills is considered equal to climbing Mount Everest by aggregating RCRs.

But in reality the paper scores become more diluted in the Lauer et al. analysis, because that RCR score is taken and then divided among all authors with a grant cited. Credit is distributed equally among all authors, be they real authors or ghost authors. Thus, for a highly influential, top 1 in 1000 paper (RCR of about 29) that has one corresponding author but also includes a middle author who contributed a reagent and whose grant was cited, the corresponding author gets a score of about 14, and the minor collaborator also gets about 14: the equivalent of 14 average papers in the literature. Each grant cited by a different author divides that score further.

The same math applies for a top 1 in 100 paper, just starting from 14. So, a top 1% paper with three author grants cited is counted as equivalent to fewer than 5 average papers in the literature (14 / 3 ≈ 4.7).

High impact papers tend to have more authors and more collaborators, and so high impact papers suffer severely in this scoring system. Scientists who collaborate a lot, or are in consortia, also suffer badly in this scoring system, as minor contributions by a series of authors rapidly bring a good paper score to below average. Example: Scientist A publishes a high impact, top 1 in 100 paper. They got reagents and mice from 5 collaborators, but they are the main contributor and the sole corresponding author. Instead of Scientist A getting 14 points for that paper, they get 2.3 (14 divided six ways). Scientist C, one of those collaborators, published one other paper, which was an average paper, and so Scientist C actually gets a higher productivity score than Scientist A in the scoring system used.
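The division of credit in this example can be worked through directly. The sketch below assumes that Scientist C is one of the five collaborators credited on Scientist A's paper, and that Scientist C is the only credited PI on their own average paper. The RCR values of 14 and 1 are the approximate values given above, and the function name is hypothetical.

```python
TOP_1_PERCENT_RCR = 14.0  # approximate RCR of a top 1% paper, per the text
AVERAGE_RCR = 1.0         # RCR of an average paper


def diluted_score(rcr: float, n_credited_pis: int) -> float:
    """Equal division of a paper's RCR among all PIs credited with it."""
    return rcr / n_credited_pis


# Scientist A's top 1% paper: A plus 5 collaborators whose grants were cited.
n_credited = 1 + 5
score_a = diluted_score(TOP_1_PERCENT_RCR, n_credited)

# Scientist C (assumed to be one of the collaborators) also has one average
# paper on which C is assumed to be the only credited PI.
score_c = diluted_score(TOP_1_PERCENT_RCR, n_credited) + diluted_score(AVERAGE_RCR, 1)

print(f"Scientist A: {score_a:.2f}")
print(f"Scientist C: {score_c:.2f}")
```

Under these assumptions the main contributor to the high impact paper ends up with a lower total than a collaborator who added one average paper of their own.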

Table 1. Division of credit. A collaborative scenario is on the left. A non-collaborative scenario is on the right. The productivity measurement outcomes are shown on the bottom.

The authorship misattribution in the dataset also amplified the dilution effect of the RCR scoring system used, because of the decision to divide paper citations equally across all authors in the RCR/productivity scoring system. Because of ‘ghost authors’, the average citation dilution factor of papers assigned to me in the dataset was 10. That means that an average paper from me would be counted as 1/10 of what an average paper would be counted for a PI with a single R01 in the scoring system used. It also means that for a 99%ile paper in a top journal, the productivity score assigned to me in their data analysis could be ~1.4 instead of 14 RCR, which is about the same as publishing 1 average paper (RCR = 1). Thus, the double whammy of ghost authorship in the dataset, combined with the citation division penalty for collaborating, makes it a foregone conclusion that PIs with higher GSI scores will appear to be less productive.

Predictive power

For any analysis of productivity, the underlying data have to be accurate. In addition, for any analysis that one would want to use to predict future outcomes (e.g., set national science funding policy), the model would also need to be highly predictive (i.e., have a very strong correlation between input and output). As a hypothetical, if I were to go to the NIH and ask for $1B for an experiment, how good would the data have to be? The answer is that it would have to be both highly accurate data and highly predictive of outcomes. As pointed out by Jeremy Berg (10), there is enormous scatter in the data provided on PI GSI to wRCR relationships — as expected — which limits any predictive power based on the averages.

How to measure scientific productivity?

Again, scientific productivity cannot be measured by citations alone, but they are a good starting point. When considering citations, what if one were to take a different approach for measuring the importance of scientific papers? One scoring system would be a %ile rank ratio. Set a paper that ranks in the top 1% of citations as 100 points. Set highly influential papers that rank as the best one in 1000 papers as 1000 points. An average paper would be 1 point. Secondly, for distribution of credit, the corresponding author(s) is given 90% of the credit, and the remaining 10% is distributed amongst the other authors with cited grants.

Let’s compare the two scoring systems, for two PIs who were each funded for 5 years (the RCR # is divided by years in Lauer et al). Scientist A was the corresponding author on a highly influential Top 1 in 100 paper (ranks in the top 1% of citations), with Scientist B as a middle author. Scientist B also published five average papers as corresponding author.

Table 2. The wRCR division scoring is on the left. A %ile rank ratio based scoring system on the right (paper scores are being divided over 5 years). The measured productivity outcomes are on the bottom row.
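The comparison between the two scoring systems can be reproduced with a short calculation. This is one reading of the scenario, under stated assumptions: the top 1% paper has exactly two credited PIs (A and B), Scientist B is the only credited PI on each of the five average papers, and a corresponding author with no other credited authors keeps all of a paper's points. The function names are hypothetical.

```python
YEARS = 5                     # funding period; scores are divided by years
TOP_1_PERCENT_RCR = 14.0      # approximate RCR of a top 1% paper
AVERAGE_RCR = 1.0
TOP_1_PERCENT_POINTS = 100.0  # %ile rank ratio points for a top 1% paper
AVERAGE_POINTS = 1.0


def equal_split(score: float, n_credited: int) -> float:
    """wRCR-style division: equal credit to every credited PI."""
    return score / n_credited


def rank_ratio_split(points: float, n_other_credited: int,
                     corresponding_share: float = 0.9) -> tuple:
    """Return (corresponding author's points, points for each other credited author)."""
    if n_other_credited == 0:
        return points, 0.0  # assumption: nobody else to share with
    corresponding = points * corresponding_share
    return corresponding, (points - corresponding) / n_other_credited


# wRCR division scoring
a_wrcr = equal_split(TOP_1_PERCENT_RCR, 2)
b_wrcr = equal_split(TOP_1_PERCENT_RCR, 2) + 5 * equal_split(AVERAGE_RCR, 1)

# %ile rank ratio scoring
top_corresponding, top_other = rank_ratio_split(TOP_1_PERCENT_POINTS, 1)
average_corresponding, _ = rank_ratio_split(AVERAGE_POINTS, 0)
a_rank = top_corresponding
b_rank = top_other + 5 * average_corresponding

scores = {
    "wRCR division": (a_wrcr / YEARS, b_wrcr / YEARS),
    "%ile rank ratio": (a_rank / YEARS, b_rank / YEARS),
}
for system, (a, b) in scores.items():
    print(f"{system}: Scientist A = {a:.1f}/yr, Scientist B = {b:.1f}/yr")
```

Changing only the point scale and the split between corresponding and other authors is enough to reverse which of the two scientists appears more productive.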

In the GSI system used in the preprint (‘wRCR division scoring’), Scientist B scores much better than Scientist A. In the %ile rank ratio scoring, the opposite occurs and Scientist A greatly outperforms Scientist B. Is one system right and the other wrong? No, they measure different outcomes. I again note that the George Santangelo RCR metric does a very good job at what it was specifically designed to do: normalize citations across fields for individual papers. It was not designed to be used for assigning/subdividing individual author credit. In the alternative scoring system described above, breakthrough discoveries and more impactful discoveries are emphasized. The example illustrates how dramatically different the perceived outcomes can be based on straightforward changes in how one measures and attributes scientific progress.

On calculations of marginal returns

This debate is better saved for another time, but suffice it to say, just because something can be graphed does not mean a derivative calculation (diminishing marginal returns) can be done (11). That can only be done if you know there is a causal relationship. To establish such a relationship, one must demonstrate examples of individual PIs obtaining a new grant and moving along the curve, or vice versa, publishing a cited paper and moving along the curve. That has not been done. To put it more colloquially, Jonathan Epstein said, “Because [summer] ice cream sales correlate with the number of drownings doesn’t mean we should have a national policy to stop eating ice cream” (12). Lastly, again, for any analysis of productivity, the underlying data have to be accurate. For any analysis that one would want to use to predict future outcomes (e.g., set national science funding policy), the model would also need to be highly predictive (i.e., have a very strong correlation between input and output).

Summary

Attempting to measure scientific productivity is a worthy endeavor. But there are many ways to be an excellent scientist, running either a small lab or a large lab, and there is a long way to go to develop metrics for quantitatively assessing productivity. Being able to use such yet-to-be-defined metrics to predict future productivity has even further to go.

Cited Sources

1. Lauer MS, Roychowdhury D, Patel K, Walsh R, Pearson K. Marginal Returns And Levels Of Research Grant Support Among Scientists Supported By The National Institutes Of Health. bioRxiv. Cold Spring Harbor Labs Journals; 2017 May 26:142554.

2. Crotty S. Follicular helper CD4 T cells (TFH). Annu Rev Immunol. 2011 Apr 23;29:621–63.

3. Johnston RJ, Choi YS, Diamond JA, Yang JA, Crotty S. STAT5 is a potent negative regulator of TFH cell differentiation. J Exp Med. 2012 Feb 13;209(2):243–50.

4. Ejrnaes M, Filippi CM, Martinic MM, Ling EM, Togher LM, Crotty S, et al. Resolution of a chronic viral infection after interleukin-10 receptor blockade. J Exp Med. 2006 Oct 30;203(11):2461–72.

5. Burton DR, Ahmed R, Barouch DH, Butera ST, Crotty S, Godzik A, et al. A Blueprint for HIV Vaccine Discovery. Cell Host Microbe. 2012 Oct 18;12(4):396–407.

6. Eto D, Lao C, DiToro D, Barnett B, Escobar TC, Kageyama R, et al. IL-21 and IL-6 are critical for different aspects of B cell immunity and redundantly induce optimal follicular helper CD4 T cell (Tfh) differentiation. PLoS ONE. 2011;6(3):e17739.

7. Cubas RA, Mudd JC, Savoye A-L, Perreau M, Van grevenynghe J, Metcalf T, et al. Inadequate T follicular cell help impairs B cell immunity during HIV infection. Nature Medicine. 2013 Apr;19(4):494–9.

8. Crotty S. T follicular helper cell differentiation, function, and roles in disease. Immunity. 2014 Oct 16;41(4):529–42.

9. Hutchins BI, Yuan X, Anderson JM, Santangelo GM. Relative Citation Ratio (RCR): A New Metric That Uses Citation Rates to Measure Influence at the Article Level. Vaux DL, editor. PLoS Biol. Public Library of Science; 2016 Sep 6;14(9):e1002541.

10. Berg J. Sciencehound: Research output as a function of grant support: The scatter matters [Internet]. 2017 [cited 2017 Jun 8]. Available from: http://blogs.sciencemag.org/sciencehound/2017/06/06/research-output-as-a-function-of-grant-support-the-scatter-matters/

11. Crotty S. The new NIH “Rule of 21” Threatens to Give Up on American Preeminence in Biomedical Research Based… [Internet]. medium.com. 2017 [cited 2017 Jun 8]. Available from: https://medium.com/@shane_52681/the-new-nih-rule-of-21-threatens-to-give-up-on-american-preeminence-in-biomedical-research-based-c40060bd3022

12. Kaiser J. Critics challenge NIH finding that bigger labs aren’t necessarily better. Science. 2017 Jun 7.
