On a Reproduction Study Chock-Full of Problems

Vladimir Filkov
Nov 19 · 10 min read

by Prem Devanbu, Vladimir Filkov

“The first principle is that you must not fool yourself — and you are the easiest person to fool.” — Richard Feynman, 1974

Replication studies are relatively rare. Scientists wish that weren’t so, as independent replication is at the core of what good science should be. Unfortunately, it seems funding for replications is less available than funding for doing new things. So, we were delighted to hear that Dr Jan Vitek and his student were interested in doing a replication of our Communications of the ACM (CACM) 2017 paper A Large-Scale Study of Programming Languages and Code Quality in Github. We happily provided our data from an earlier version of our paper, presented at the FSE 2014 conference, so they could do so. The paper by Berger, Hollenbeck, Maj, Vitek, and Vitek, published in TOPLAS 2019, On the Impact of Programming Languages on Code Quality: A Reproduction Study, is their final product: a reproduction and critique of our CACM 2017 (CACM17 for short) paper, in which they also refer to our paper’s earlier, conference version from FSE 2014 (FSE14 for short).

We have read their TOPLAS 2019 (TOPLAS19 for short) paper carefully, and have written a detailed, technical response, posted on arXiv. In summary, we find that TOPLAS19 largely replicated our results, and agrees with us in its conclusion: that the effects (in a statistical sense) found in the data are small, and should be taken with caution, and that it is possible that an absence of effect is the correct interpretation. One of the authors of TOPLAS19, Jan Vitek, in a public presentation at Curry ON! London, 2019, admitted that an earlier peer review of their paper pointed out that their results and ours are actually in agreement. Of course, there are also a number of small differences in our results, which do not change the nature of our conclusions; we discuss those too in our arXiv response. We reiterate: our CACM17 paper’s conclusions have not been shown wrong: they still hold, even more so now that they’ve been reproduced, and our paper is eminently citable.

And yet, in spite of TOPLAS19 having essentially reproduced our findings on all research questions (RQs) they attempted, the paper is replete with claims to the contrary. Indeed, right in the abstract, they state, about their results: “Moreover, the practical effect size is exceedingly small. These results thus undermine the conclusions of the original study”, suggesting that we incorrectly found strong effect sizes…indeed we did NOT, and we point out already in our abstract that “even these modest effects might be due to …intangible, process factors”. We point out below other claims in TOPLAS19 that are, like the above, not well supported, going through each of the original research questions.

RQ1. Some languages have a greater association with defects than others, although the effect is small.

Although TOPLAS19 concludes otherwise, in the repetition part of their study, they essentially reproduced our RQ1 results from the earlier version, FSE14 (compare columns a and c of the following table, which is Table 2 from TOPLAS19):

The small differences between column b (CACM17) and c (TOPLAS19) result from the additional data processing that we did during our CACM submission and reviewing process (which we detail in our arXiv document), which happened after having shared the data and scripts with Dr. Jan Vitek and his student.

TOPLAS19 also presented a reanalysis study which includes additional data gathering and processing. These results are in Table 6 in TOPLAS19, which oddly omits the CACM17 column, and includes just the FSE14 column, apparently to support the claims of differences in findings. To correct this, we have copied and aligned the respective columns from our Table 6 from CACM17 and from Table 6 (column c) of TOPLAS19 (see the table below). Note the headings of the two columns in TOPLAS19; our discussions are focused on “FDR”. False Discovery Rate (FDR) control is a technique for adjusting p-values when several hypotheses are tested on the same data, to account for the increased risk that some p-values fall below the significance threshold by chance alone. It is an improved technique over the older, more conservative, Bonferroni correction.

As can be seen above, our findings in CACM17 and Table 6c of TOPLAS19 are quite similar! Even after regathering and further processing their data, their FDR column shows surprisingly good agreement with ours from CACM17 in what is found statistically significant (p<0.05). They claim in their paper that their “cleaner” dataset yielded differences in “C, Objective-C, JavaScript, TypeScript, PHP, and Python”, but they appear to mis-observe values in their own table (the grayed out rows) as being larger than 0.05, which affects their count of disagreements with our results (see Note 1, below). Their FDR-corrected results agree with our results in CACM17 for 7 out of the 9 languages we found significant! Even in the two cases where the two disagree, for PHP and Python, our results are at 0.05 statistical significance, and theirs at 0.075, a small difference. As this is after a great deal of additional data gathering, cleaning, and processing (described in detail in TOPLAS19), this is to a remarkable degree a confirmatory reproduction study of our CACM17 RQ1! In spite of that, TOPLAS19 misrepresents their results as disproving our study, perhaps because of their mis-observation of their Table 6c.

While they show FDR results in their Table 6, they also use the very conservative and problematic Bonferroni correction (see Note 2, below). They do so in a number of places; this is puzzling to us, since one of the TOPLAS19 authors, Dr Olga Vitek, a noted bio-statistician, has in her prior published work consistently chosen and recommended the FDR over Bonferroni (see Note 3, below). Not surprisingly, when using the Bonferroni correction instead of FDR in their final analyses in TOPLAS19, they end up with only 5 significant results.
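To make the difference between the two corrections concrete, here is a minimal sketch in Python. It assumes the Benjamini-Hochberg procedure as the FDR method (neither the text above nor this sketch establishes which FDR procedure was actually used in either paper), and the raw p-values are hypothetical, chosen only for illustration; they are not taken from CACM17 or TOPLAS19.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjustment (one common way to control the FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that the adjusted
    # values stay monotone in the original ordering.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values (assumption: illustrative only, not from any paper).
raw = [0.001, 0.004, 0.010, 0.020, 0.030, 0.200]
alpha = 0.05

n_bonf = sum(p < alpha for p in bonferroni(raw))
n_bh = sum(p < alpha for p in benjamini_hochberg(raw))
print("significant after Bonferroni:", n_bonf)
print("significant after Benjamini-Hochberg:", n_bh)
```

On these made-up inputs, Bonferroni keeps fewer results below 0.05 than Benjamini-Hochberg does, which is the general pattern: the Bonferroni-adjusted value is never smaller than the Benjamini-Hochberg one for the same raw p-value.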

To get their final numbers in Table 6, they apply two further statistical treatments, which, as we argue in our arXiv rebuttal, are either unnecessary (zero-sum contrasts) or suffer, again, from the “deleterious” Bonferroni correction and additional assumptions (bootstrap).

RQ2. There is a small but significant relationship between language class and defects

Although TOPLAS19 claims otherwise, they reproduced our RQ2 results in the Repetition study. Shown below is their Table 4; compare columns 4a and 4b.

As part of their repetition of our RQ2, they further changed our classification of programming languages, because they disagreed with it. After the reclassification they got qualitatively the same results as ours: comparing Table 4c in TOPLAS19 (above) and our Table 7 from CACM17 (see below), it is clear they both imply that the functional language categories (first and seventh rows of their language classes) are associated with (slightly) lower bug proneness, and that those findings are statistically significant.

We find it odd that instead of comparing the implications from Table 4 qualitatively, as we did above, TOPLAS19 compared their regression model coefficients with ours and counted the per-variable disagreements. That is unsound, and goes against standard statistical practice for regression models with different predictor variables, especially when some are subcategories of others, as in this case (see Note 4, below).
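A toy calculation shows why per-variable coefficient comparison can mislead when a category is split. The sketch below is not the model from either paper: it assumes, purely for illustration, a one-way model with sum-to-zero coding and no control variables, in which each category's coefficient is its group mean minus the unweighted mean of all group means. The category names and numbers are hypothetical.

```python
from statistics import mean


def zero_sum_effects(group_means):
    """Coefficients of a one-way model under sum-to-zero coding:
    each group's mean minus the unweighted mean of the group means."""
    grand = mean(list(group_means.values()))
    return {group: m - grand for group, m in group_means.items()}


# Hypothetical mean outcome per language class (assumption: made-up numbers).
coarse = {"functional": 1.0, "procedural": 2.0, "scripting": 3.0}

# The same situation, except "functional" is split into two subcategories.
fine = {
    "functional_static": 0.8,
    "functional_dynamic": 1.2,
    "procedural": 2.0,
    "scripting": 3.0,
}

eff_coarse = zero_sum_effects(coarse)
eff_fine = zero_sum_effects(fine)

# The "procedural" data did not change, yet its coefficient does,
# because the reference point (mean of group means) moved.
print("procedural, coarse model:", eff_coarse["procedural"])
print("procedural, fine model:  ", eff_fine["procedural"])

# The qualitative implication is stable: functional categories sit below
# the reference point in both models.
print("functional, coarse:", eff_coarse["functional"])
print("functional, fine:  ", eff_fine["functional_static"],
      eff_fine["functional_dynamic"])
```

In this toy setting the coefficient for an untouched category shifts simply because another category was split, so a coefficient-by-coefficient tally would count a "disagreement" that reflects only the change of variables. The qualitative reading (functional categories lower than the rest) is unchanged, which is the level at which the two models can be compared.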

RQ3. There is no general relationship between application domain and language defect proneness.

They reproduced our RQ3 results, and confirm that our conclusion likely holds (last sentence in TOPLAS19, Sect. 3.2.3). They implemented their own methods for RQ3 and concluded the same as we did from ours in CACM17: no evidence is found for a correlation between domain and defect proneness. Thus, this is a confirmatory reproduction study of ours.

On the TOPLAS19 Conclusion

In the abstracts and conclusions of both FSE14 and CACM17, though we found statistically significant results in some of our RQs, we took care to state that the effects are very small and are to be taken with caution. We were also clear that we were talking of associations and correlations, not causality. Thus, the TOPLAS19 final conclusion, even after reanalysis, is virtually the same as ours: whatever positive effects were found (they found 7 significant associations before the very conservative Bonferroni correction, and 4 after; we found 9) were very small or small, and thus are possibly not real and need to be taken with caution.

On the TOPLAS19 Defect Commit Hand Labeling

We did find the last part of their work, on manually labeling defect commits, interesting. Unfortunately, the precise protocols used are not documented in TOPLAS19. Their reported false positive rate of 36% is very high compared to the usual rates reported in such studies. When we attempted to relabel some of the commits ourselves, using all the available information, we found that 11 out of 12 commits that they had labeled as false-positives, were in fact true positives (the details are in our full arXiv rebuttal). It’s not clear from TOPLAS19 what information was used by their labelers. This is an ongoing, time-consuming project, for which we have requested and obtained more data from the TOPLAS19 authors, and we’ll have more to say on this later.

Beyond TOPLAS19

Finally, in their recorded video presentations (available on YouTube) at the Curry ON! and SPLASH! conferences, some of the TOPLAS19 authors have made several incorrect statements or suggestions pointedly attacking our work. In particular, Dr Jan Vitek makes a number of incorrect statements in those talks, above and beyond those made in the TOPLAS19 paper.

(1) He states that we picked the top 3 projects for each language based on size, and mocks us for supposedly picking a small Perl project. We did not! We picked projects based on popularity (most stars), as explained in CACM17. Also, the Perl project that was comically small to him had, at the time of study, 784 lines of Perl code. And we never used it in the analysis: it was filtered out for having too few commits at the time.

(2) Dr Jan Vitek incorrectly accuses us of using projects of various sizes, without controlling for size. We knew better than to do that: in the software engineering community, Khalid El-Emam showed us about 20 years ago that size matters; ever since, using log-scaled size as a control variable in regression models is de rigueur; we certainly did that in all our models. Dr Jan Vitek knew that we did, since his replication included our size control.

(3) Dr Jan Vitek repeatedly accuses us of confusing correlation with causality, in his talks and TOPLAS19. In fact, neither CACM17 nor FSE14 does that. He resorts to mining fragments of text out of context, even out of our unpublished internal documents, to support his accusation that we confuse the two, and also insinuates that we are to blame when other people confuse the two (a non sequitur).

(4) Finally, Dr Jan Vitek publicly accuses us of burying in the paper the fact that the effects we found were modest. In fact, we state this in the abstract of our paper, in addition to other places in the paper. To misquote one of Senator Bernie Sanders’ colorful epithets, it’s in the Damn Abstract!
Respectfully, what more could we have done in our paper to point out that the effects we found were small and caution the reader?

In conclusion: TOPLAS19 per se has several inaccuracies and misrepresentations. The presentations in Dr Jan Vitek’s talks, alas, stray much further: they incorporate several exaggerated, incorrect claims and ad hominem attacks that, in our view, are simply out of place in rigorous scientific dialogue.

Going forward: on the whole, we appreciate this kind of study, and look forward to further debates and discourse. But we do hope that “irrational exuberance” can be put aside, and we can move towards more sober discussions.

Notes:

[1] Section 4.3, first paragraph of TOPLAS19, states: “Table 6(b)-(e) summarizes the reanalysis results. The impact of the data cleaning, without multiple hypothesis testing, is illustrated by column (b). Grey cells indicate disagreement with the conclusion of the original work. As can be seen, the p-values for C, Objective-C, JavaScript, TypeScript, PHP, and Python all fall outside of the “significant” range of values, even without the multiplicity adjustment. Thus, 6 of the original 11 claims are discarded at this stage. Controlling the FDR increased the p-values slightly, but did not invalidate additional claims. However, FDR comes at the expense of more potential false-positive associations. Using the Bonferroni adjustment does not change the outcome. In both cases, the p-value for one additional language, Ruby, loses its significance.” This is at odds with the numbers in their own Table 6c, shown above; compare FDR (left) and Bonf. (right).

[2] Most practitioners recognize that Bonferroni is too conservative and not to be used in practice, especially as it yields many false negatives and can lead to unsound statistical inference in epidemiological studies; e.g., see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1112991/ and https://academic.oup.com/beheco/article/15/6/1044/206216.

[3] E.g., see O. Vitek’s talk here: https://skyline.ms/wiki/home/software/Skyline/events/2015%20US%20HUPO%20Workshop/download.view?entityId=aa2abc6d-ad47-1032-966e-da202582cf3e&name=1-Intro%20Stat%20%28Vitek%29.pdf ; see also https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3489540/ and her paper here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3489535/

[4] Their model, TOPLAS19, Table 4, has different variables than ours: they derived two of their categories from one of ours. Thus, the models are not directly comparable, only the implications of those models can be compared. Comparing coefficients across models with different variables is not a standard statistical practice, and can result in unsound conclusions, especially when the new categories result from splitting up old categories. E.g., see Ecological fallacy, https://en.wikipedia.org/wiki/Ecological_fallacy, and Simpson’s paradox, https://en.wikipedia.org/wiki/Simpson%27s_paradox.
