A citation confession

There has been some recent discussion about incorrect citations and the ethics of citations, and of course I’ve previously discussed these issues. Basically, the problem is that many people cite papers without reading them, which leads to incorrect references. Other people then read the paper with the incorrect reference and reuse that reference in the same context. This cycle continues until a paper is incorrectly cited far more often than it is correctly cited. I’m guilty of starting one such cycle, so I’d like to try to put an end to it here.

When I cite papers I do try to at least skim them, and I also try to limit the number of papers I cite. I do self-cite some of my papers, but only when appropriate. So I think my citation morals are fairly good. But I am aware of one paper which I cited somewhat incorrectly. Let me explain.

When I was in graduate school I couldn’t believe there was nowhere you could get the survival correlations for every gene across the various TCGA cancers. I mean, the data is just sitting there for anyone to download. Why hasn’t someone done this? So I started downloading the data and running Kaplan-Meier analyses.

But I knew that wasn’t the best technique for survival analyses, so I met with a statistician, and he told me the standard technique for survival analyses is Cox regression. I wasn’t that familiar with regression, but I knew large outliers could be a problem, so I asked him how the RNA-SEQ values should be normalized. He suggested I do a rank-based inverse normal transformation (Blom transformation).
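For readers who haven’t run into it, a Blom transformation replaces each value with the standard normal quantile of its rank, which is why a huge outlier stops dominating. Below is a minimal sketch of the idea, not the exact pipeline from my paper. It assumes the usual Blom constant of 3/8 and average ranks for ties, and the name `blom_transform` is just an illustrative choice, not a function from any library.

```python
from statistics import NormalDist


def blom_transform(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform (Blom's constant c = 3/8 by default).

    Each value is replaced by the standard normal quantile of
    (rank - c) / (n - 2c + 1); tied values share their average rank.
    """
    n = len(values)
    # Indices of the values in ascending order.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Sorted positions i..j correspond to 1-based ranks i+1..j+1.
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Made-up expression values for one gene, with one extreme outlier.
expression = [5.0, 1.0, 1000.0]
transformed = blom_transform(expression)
# The middle-ranked value maps to 0, and the smallest and largest values map
# to symmetric quantiles, no matter how extreme the outlier is.
```

Because only the ranks matter, replacing 1000.0 with 6.0 in the example would give exactly the same transformed values.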

I then went through the literature to try and find what other people were doing to normalize their RNA-SEQ counts before Cox regression, but most survival analyses were done with microarray data instead of RNA-SEQ. I did find one paper that specifically investigated different ways of transforming RNA-SEQ data for Cox regressions, and Blom transformation appeared to perform well.

Great, so I had a statistician telling me to use Blom transformation and a paper showing Blom transformation works well, so I decided to go ahead and use that, and I cited the paper.

The problem?

The paper that found support for Blom transformation was using regularization, whereas I don’t include regularization in my models. At the time I didn’t even know what regularization was, but I have since taken a couple of machine learning courses.

As a result, in my paper I claim Blom is good for RNA-SEQ Cox regressions and cite a paper that doesn’t necessarily support that claim. I’m not saying Blom transformation is bad for Cox regressions (I’ve gotten good results as far as I can tell), but it doesn’t appear there is evidence to suggest a Blom transform would be better than a simple log transform.

I will say one thing in support of Blom transforming your data, however. A Blom transform puts all your data on the same scale, and the size of Cox coefficients depends on the scale of the data. So if you want to compare Cox coefficients for different genes within a cancer, as I did for OncoRank, you will have to do something like a Blom transform.
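To spell out the scale dependence, here is a short derivation for a single-covariate Cox model. The symbols are my own notation for this sketch rather than anything from the papers involved: $h_0(t)$ is the baseline hazard, $x$ is a gene’s expression value, $\beta$ is its Cox coefficient, and $a > 0$ is an arbitrary rescaling factor.

```latex
\begin{align*}
h(t \mid x) &= h_0(t)\,\exp(\beta x) \\
x' &= a\,x \quad (a > 0) \\
h(t \mid x) &= h_0(t)\,\exp\!\left(\frac{\beta}{a}\,x'\right)
  \;\Longrightarrow\; \beta' = \frac{\beta}{a}
\end{align*}
```

The fitted hazards are identical either way, but the coefficient shrinks or grows with the units of the covariate. Two genes measured on very different scales can therefore have very different coefficients even if they are equally associated with survival, which is why some common scale is needed before comparing them.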

Why does this matter again? Well, my friend already committed a drive-by citation and cited the paper I cited. Will the cycle end here, or will it continue? I don’t know. I also don’t know if I feel bad about the citation. I think we need people to investigate the best ways to transform RNA-SEQ data, so I am glad I acknowledged this group’s work, but I’m sorry I didn’t cite it in the correct context.