A Replication of “DeepBugs: A Learning Approach to Name-based Bug Detection”

Photo by Alina Grubnyak on Unsplash

This is a brief for the research paper “A Replication of ‘DeepBugs: A Learning Approach to Name-based Bug Detection’”, published in the artifact track of ESEC/FSE 2021 [0]. This paper resulted from a project in my course ECE 595: Advanced Software Engineering at Purdue University.

Original paper

In 2018, Pradel & Sen published a paper called DeepBugs that described a software defect detection tool [1].

Pradel & Sen target the software defects that arise when software engineers pass arguments to a function in the wrong order. For example, if there is a function calcCylinderVolume(int radius, int height) that is accidentally invoked as calcCylinderVolume(someHeight, someRadius), the calculated volume will be incorrect.

Type checking cannot help with this problem, since the parameters have the same type. However, you could find such a defect if you could reason about the semantics embedded in variable names — e.g. that a variable named “someHeight” probably stores height information, while a variable named “someRadius” stores radius information.
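To make this concrete, here is a minimal Python sketch of the cylinder example. The snake_case names and the numeric values are made up for illustration. Both arguments are floats, so a static type checker would accept either call, yet only one of them computes the intended volume.

```python
import math


def calc_cylinder_volume(radius: float, height: float) -> float:
    """Volume of a cylinder: pi * radius^2 * height."""
    return math.pi * radius ** 2 * height


# Illustrative values only.
some_radius = 2.0
some_height = 5.0

# Intended call: arguments in the declared order.
correct = calc_cylinder_volume(some_radius, some_height)

# Swapped call: same types, so no type error, but a different (wrong) result.
swapped = calc_cylinder_volume(some_height, some_radius)

print(f"correct: {correct:.2f}, swapped: {swapped:.2f}")
```

The only clue that the second call is wrong is in the names: a variable called some_height is being passed where a radius is expected.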

Pradel & Sen hypothesized that engineers typically use similar variable names for variables with similar purposes (see also “best practice” recommendations [2] and the “Naturalness Hypothesis” [3, 4]). Based on this hypothesis, for these swapped-argument bugs, they proposed the following detection algorithm:

  • Identify function invocations using the program’s AST.
  • Extract the names of the variables used for each function invocation.
  • Numerically model the concept of “similar variable names” by training a Word2Vec model [5], contextualized with the name of the function being invoked (e.g. the meaning of “someHeight” and “someRadius” in the context of the invocation of the function “calcCylinderVolume”). The Word2Vec model learns the variable names that are usually passed as the first parameter, the second parameter, and so on.
  • Use this Word2Vec model to identify usages where some variable name is “unusual” as defined by the vector calculated by the trained Word2Vec model.

DeepBugs converts source code into Abstract Syntax Trees (ASTs), then into semantically encoded vectors via Word2Vec. A neural network then determines whether the meanings of the names match their usage context. For example, suppose a developer calls the function setDims(width, height) as setDims(y, x). DeepBugs learns that x and width are semantically similar, as are y and height, so it predicts that the arguments are swapped.

After calculating the Word2Vec vectors, the DeepBugs algorithm uses a small neural network as a classifier for name-based bugs.
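
The first two steps (finding function invocations in the AST and extracting the argument names) can be sketched as follows. This is only an illustration: DeepBugs works on JavaScript, whereas this sketch uses Python's built-in ast module as a stand-in, and the source snippet and the helper name extract_calls are hypothetical.

```python
import ast
from typing import List, Tuple

# Hypothetical snippet to analyze; the function names are illustrative only.
SOURCE = """
set_dims(width, height)
set_dims(y, x)
calc_cylinder_volume(some_height, some_radius)
"""


def extract_calls(source: str) -> List[Tuple[str, List[str]]]:
    """Return (function name, argument variable names) for each simple call."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        # Only handle plain calls such as f(a, b); method calls are skipped.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Keep only arguments that are bare variable names.
            arg_names = [arg.id for arg in node.args if isinstance(arg, ast.Name)]
            calls.append((node.func.id, arg_names))
    return calls


calls = extract_calls(SOURCE)
for function_name, arg_names in calls:
    print(function_name, arg_names)
```

Each (function name, argument names) pair is the kind of record that the later embedding and classification steps would consume.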

These unusual usages can be interpreted as either (1) examples of poor variable names, or (2) defects. When a call has several arguments whose names are unusual for their positions, and the usage would become normal if the arguments were swapped, then we are probably looking at a swapped-argument defect.
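
The "would become normal if swapped" intuition can be sketched with a simple similarity check. To be clear, this is not the DeepBugs classifier, which is a trained neural network. It is a simplified heuristic that assumes the parameter names are known, and the 2-D vectors are hand-made stand-ins for learned Word2Vec embeddings.

```python
import math
from typing import Dict, List

# Hand-made toy vectors standing in for learned name embeddings (hypothetical values).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "width": [1.0, 0.1],
    "x": [0.9, 0.2],
    "height": [0.1, 1.0],
    "y": [0.2, 0.9],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def fit(params: List[str], args: List[str]) -> float:
    """Sum of similarities between each argument name and the parameter in its position."""
    return sum(
        cosine(TOY_EMBEDDINGS[param], TOY_EMBEDDINGS[arg])
        for param, arg in zip(params, args)
    )


def looks_swapped(params: List[str], args: List[str]) -> bool:
    """True if reversing the arguments matches the parameter names better."""
    return fit(params, list(reversed(args))) > fit(params, args)


params = ["width", "height"]
print(looks_swapped(params, ["y", "x"]))  # flagged: reversing the arguments fits better
print(looks_swapped(params, ["x", "y"]))  # not flagged: the given order already fits
```

A learned classifier replaces the hard-coded comparison here, but the signal it relies on is the same: argument names that sit closer to a different parameter position than the one they occupy.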

I expect that most software engineers have made this error themselves. I certainly have!


The results from Pradel & Sen were compelling. However, part of the scientific method is to externally reproduce findings. My team set out to do so. In particular, we wanted to see if we could use the same data, but an independent implementation, and obtain the same results.

Shared dependencies

We did share some dependencies in our implementation:

Same dataset, some shared components, no shared code.

Pradel & Sen already did all the hard work of determining what neural network architecture to use. Our implementation task was therefore not too complex — getting the pipeline to match was the hardest part.

Clerical error

We misread their paper, and used a Word2Vec window size of 200 tokens instead of 20 tokens. Oops.
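
For readers unfamiliar with the parameter, the sketch below shows what a context window controls: which neighboring tokens are paired with each target token during training. It is a plain-Python illustration, not the code from either implementation; the token sequence is invented, and real Word2Vec implementations may differ in details such as how the effective window is sampled.

```python
from typing import List, Tuple


def context_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Pair each token with every other token at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Invented token sequence, for illustration only.
tokens = ["setDims", "(", "width", ",", "height", ")"]

for window in (2, 20, 200):
    print(window, len(context_pairs(tokens, window)))
```

On this six-token toy sequence, windows of 20 and 200 produce the same pairs, because both exceed the sequence length. That is a property of the toy input only and says nothing about the token sequences in the actual dataset.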

However, stable results!

Their approach was stable even with this order-of-magnitude typo.

On the swapped-argument case from the 150k JavaScript Dataset, our DeepBugs replication achieved performance similar to that of the original authors’ work.

Partial replication

We describe our work as a partial replication because Pradel & Sen evaluated their approach on several types of defects — we only looked at the swapped-argument kind. However, we still use the word replication because we independently obtained the same main result.


Pradel & Sen did a great job documenting their work carefully enough that a third party (my team) could replicate it. The one thing they left out was their RNG seed. It is possible that a different seed caused the mild performance discrepancy, but I suspect that our Word2Vec window size error was the real cause.

Overall, I think this was a great graduate-level course project. The team was exposed to some state-of-the-art techniques in software engineering tools, applied the ML knowledge they had learned in other courses, completed a successful scientific replication, and published a peer-reviewed artifact.

More information

  1. The artifact is available here, including the source code and paper.


[0] ACM definitions:

[1] Pradel & Sen, 2018. DeepBugs: A learning approach to name-based bug detection.

[2] Martin, 2009. Clean code: a handbook of agile software craftsmanship.

[3] Hindle et al., 2016. On the naturalness of software.

[4] Allamanis et al., 2018. A Survey of Machine Learning for Big Code and Naturalness.

[5] Mikolov et al., 2013. Efficient estimation of word representations in vector space.



James Davis

I am a professor in ECE@Purdue. I hold a PhD in computer science from Virginia Tech. I blog about my research findings and share tips for engineering students.