A Replication of “DeepBugs: A Learning Approach to Name-based Bug Detection”
This is a brief for the research paper “A Replication of ‘DeepBugs: A Learning Approach to Name-based Bug Detection’”, published in the artifact track of ESEC/FSE 2021 [0]. The paper resulted from a project in my course, ECE 595: Advanced Software Engineering, at Purdue University.
Original paper
In 2018, Pradel & Sen published a paper called DeepBugs that described a software defect detection tool [1].
Pradel & Sen target the software defects that arise when software engineers pass arguments in the wrong order. For example, if a function calcCylinderVolume(int radius, int height) is accidentally invoked as calcCylinderVolume(someHeight, someRadius), the calculated volume will be incorrect.
Type checking cannot help with this problem, since the parameters have the same type. However, you could find such a defect if you could reason about the semantics embedded in variable names — e.g. that a variable named “someHeight” probably stores height information, while a variable named “someRadius” stores radius information.
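Rendered as a small Python sketch (the function and variable names come from the example above; the body is made up for illustration), the defect looks like this:

```python
import math

def calc_cylinder_volume(radius, height):
    """Volume of a cylinder: pi * radius^2 * height."""
    return math.pi * radius ** 2 * height

some_radius, some_height = 2.0, 10.0

# Bug: the arguments are passed in the wrong order. Nothing mechanical flags this;
# only the variable names hint that something is off.
volume = calc_cylinder_volume(some_height, some_radius)  # should be (some_radius, some_height)
```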
Pradel & Sen hypothesized that engineers typically use similar variable names for variables with similar purposes. (See also “best practice” recommendations [2] and the “Naturalness Hypothesis” [3, 4].) Based on this hypothesis, they proposed the following detection algorithm for swapped-argument bugs:
- Identify function invocations using the program’s AST.
- Extract the names of the variables used for each function invocation.
- Numerically model the concept of “similar variable names” by training a Word2Vec model [5], contextualized with the name of the function being invoked (e.g. the meaning of “someHeight” and “someRadius” in the context of a call to “calcCylinderVolume”). The Word2Vec model learns which variable names are usually passed as the first parameter, the second parameter, and so on.
- Use this Word2Vec model to flag usages where a variable name is “unusual” according to the learned vectors (a sketch of this pipeline follows the list).
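To make the pipeline concrete, here is a minimal sketch of the extraction and embedding steps. This is neither our replication code nor Pradel & Sen’s implementation (which targets JavaScript); it uses Python’s ast module and gensim’s Word2Vec purely for illustration, and every name in it is made up.

```python
import ast
from gensim.models import Word2Vec

def extract_call_sites(source):
    """Return one 'sentence' per call: the callee name followed by its argument names."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            arg_names = [a.id for a in node.args if isinstance(a, ast.Name)]
            # Keep only calls whose arguments are all plain identifiers, with 2+ arguments.
            if len(arg_names) == len(node.args) and len(arg_names) >= 2:
                sites.append([node.func.id] + arg_names)
    return sites

# Hypothetical corpus; in the real pipeline this comes from many thousands of files.
corpus = extract_call_sites("""
v1 = calc_cylinder_volume(radius, height)
v2 = calc_cylinder_volume(some_radius, some_height)
""")

# Train Word2Vec on these sequences so that names used in similar argument positions
# of similar calls end up with similar vectors (window=20 is the value from the paper).
embeddings = Word2Vec(corpus, vector_size=32, window=20, min_count=1, sg=1)
```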
Such unusual usages can be interpreted either as (1) examples of poor variable names or (2) defects. When a call has multiple unusual argument usages, and the usage would become normal if the arguments were swapped, then we are probably looking at a swapped-argument defect.
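Continuing the sketch above, one crude way to operationalize “the usage would become normal if the arguments were swapped” is to compare each argument’s vector against the names typically seen at that position and check whether swapping improves the fit. This is only an illustration, not Pradel & Sen’s actual detector (they train a neural network for this step); the helper names below are all hypothetical.

```python
import numpy as np

def position_centroids(corpus, embeddings, callee):
    """Average embedding of the names observed at each argument position of `callee`."""
    per_position = {}
    for site in corpus:
        if site[0] != callee:
            continue
        for pos, name in enumerate(site[1:]):
            per_position.setdefault(pos, []).append(embeddings.wv[name])
    return {pos: np.mean(vectors, axis=0) for pos, vectors in per_position.items()}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def swap_score(embeddings, centroids, arg_names):
    """Positive when the call would look more 'normal' with its two arguments swapped."""
    as_is   = sum(cosine(embeddings.wv[name], centroids[pos]) for pos, name in enumerate(arg_names))
    swapped = sum(cosine(embeddings.wv[name], centroids[pos]) for pos, name in enumerate(reversed(arg_names)))
    return swapped - as_is

# Usage (hypothetical): a large positive score for a suspect call like
#   calc_cylinder_volume(some_height, some_radius)
# would flag it as a likely swapped-argument defect.
```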
I expect that most software engineers have made this error themselves. I certainly have!
Replication
The results from Pradel & Sen were compelling. However, part of the scientific method is to externally reproduce findings, so my team set out to do just that. In particular, we wanted to see whether we could obtain the same results using the same data but an independent implementation.
Shared dependencies
We did share some dependencies with the original work: we used the same data, and Pradel & Sen had already done the hard work of determining what neural network architecture to use. Our implementation task was therefore not too complex; getting the pipeline to match was the hardest part.
Clerical error
We misread their paper and used a Word2Vec window size of 200 tokens instead of 20 tokens. Oops.
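For readers unfamiliar with the parameter: in gensim’s Word2Vec, window controls how many neighboring tokens count as context for each token. A minimal, purely illustrative comparison of the two settings (the stand-in corpus is made up):

```python
from gensim.models import Word2Vec

sentences = [["calc_cylinder_volume", "radius", "height"]]  # stand-in corpus

intended    = Word2Vec(sentences, window=20, min_count=1)   # what the paper specifies
what_we_ran = Word2Vec(sentences, window=200, min_count=1)  # our misreading
```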
However, stable results!
Their approach turned out to be stable even with this order-of-magnitude slip.
Partial replication
We describe our work as a partial replication because Pradel & Sen evaluated their approach on several types of defects, while we looked only at the swapped-argument kind. However, we still use the word replication because we independently obtained the same main result.
Reflection
Pradel & Sen did a great job documenting their work carefully enough that a third party (my team) could replicate it. The one thing they left out was their RNG seed. It is possible that a different seed caused the mild performance discrepancy we observed, but I suspect that our Word2Vec window-size error was the real cause.
Overall, I think this was a great graduate-level course project. The team was exposed to some state-of-the-art techniques in software engineering tools, applied the ML knowledge they had learned in other courses, completed a successful scientific replication, and published a peer-reviewed artifact.
More information
References
[0] ACM, Artifact Review and Badging (current version): https://www.acm.org/publications/policies/artifact-review-and-badging-current
[1] Pradel & Sen, 2018. DeepBugs: A learning approach to name-based bug detection.
[2] Martin, 2009. Clean code: a handbook of agile software craftsmanship.
[3] Hindle et al., 2016. On the naturalness of software.
[4] Allamanis et al., 2018. A Survey of Machine Learning for Big Code and Naturalness.
[5] Mikolov et al., 2013. Efficient estimation of word representations in vector space.