Chaining Comparisons: Seeking Information from the Audience
At the last standards committee meeting in Albuquerque, the spaceship operator was adopted into the working draft for what will eventually be C++20. I’m already pretty excited about that. But one of the initial “optional” parts of Herb Sutter’s original spaceship proposal (which was dropped early) was to support chaining comparisons. To quote the idea from the paper (section 3.3, pages 25–26):
Without chaining, today we either perform double evaluation or introduce a temporary variable. I’ve many times wanted to write code like
0 <= expr < maxwithout either evaluating
exprtwice or else having to invent a temporary variable (and usually a new scope) to store the evaluated value. A number of times, I’ve actually written the code without thinking, forgetting it wasn’t supported, and of course it either didn’t compile or did the wrong thing. As an example of “did the wrong thing,” this proposal does change the meaning of some code like the following that is legal today, but that is dubious because it probably doesn’t do what the programmer intended:
int x = 1, y = 3, z = 2;
assert(x < y < z); // today, means "if (true < 2)" — succeeds
In this proposal, the meaning of the condition would be
((1 < 3) && (3 < 2))and the assertion will fire. To use Stroustrup’s term, I consider this “code that deserves to be broken;” the change in meaning is probably fixing a bug. (Unless of course we do a code search and find examples that are actually intended.)
I think this is a great idea. But this post isn’t about me trying to convince you of this. It’s about this very last part of the quote. A proposal to support chaining comparisons “properly” would take well-formed code today and change its meaning (even if that well-formed code likely doesn’t do something meaningful) and that’s the kind of change that needs to be undertaken very carefully. I have strong suspicions that the amount of such code is minimal, because:
- It’s well-known and understood that chaining does not actually do the mathematical thing in code, and I’m not sure the expectation exists that it does (in my personal experience, I distinctly remember being very surprised that in Python it did do “The Right Thing.” Your mileage may vary).
- It’s an easy bug to catch in code review, since you don’t even have to know the types of any of the arguments to identify the likely incorrect behavior (notable exceptions being for DSLs, and I think it’s likely clear from context anyway).
- It’s also an easy bug to catch in testing, since if you have tests, you will catch it. The behavior of the chained comparison is just wildly different.
Put those things together, and the amount of such preexisting code should be small. However, these are just based on my assumptions, and my assumptions aren’t worth the paper that they aren’t even printed on. We need real data on this stuff!
To that end, I decided to fork the git mirror of clang-tools-extra and write a clang-tidy check named “bugprone-chained-comparison” for the kinds of expressions that I would like to see changed. Specifically, I’m looking for any comparison expression where either operand is itself an unparenthesized comparison expression, regardless of its type. Unparenthesized to allow potentially weird cases where you want the current behavior of comparison chaining, the parentheses could help show intent (e.g.
a == b == c is likely bad, but
(a == b) == c might intentionally be checking something. Python works the same way,
5 > 4 > 3 is
(5 > 4) > 3 is
False). I’m intentionally casting a wide net and not restricting the type of the each of the comparisons to
bool — there is code out there that has comparisons that yield
false_type and I want to include those (thanks T.C. for pointing this out). Additionally, I want to get a feel for how much DSL code there is.
With help from this blog post, I put together a check. More on the details of the check itself at the bottom, for those of you who are interested. I went ahead and ran this check on clang’s own code base and some code bases that I work on (whose orders of magnitude are ~1 million LOC). Zero matches for this warning so far. I also ran it against Boost’s code base, inclusive of tests. The warning matched 16 lines, three of which pointed to a line in a gil header that is actually a Boost bug (this line is missing a
template keyword) and the other 13 of which are various Boost DSLs (2 hits on
multi_array::range(), one in Process piping
std_in, and the other 10 in Spirit). So far so good — no cases where the expressions are remotely
bool-like. But I want more data.
So I would like to ask readers, if you’re willing to take the time, to use my check (
bugprone-chained-comparison) to run on any code bases they have easy access to and let me know:
- How many LOC?
- Did the warning trigger at all?
- If so, how many intentional uses of chaining were found, vs unintentional bugs, vs instances of various DSLs?
And, of course, if anybody sees issues with the check that I wrote, please tell me! You can email me (firstname.lastname@example.org) or post on reddit. Thank you very much in advance!
You can check out the various LLVM/Clang repositories was as follows (though see also LLVM’s documentation):
My commit with the check can be found here. As a pro-tip, if you’re like me and have limited memory available on your box, set the build type to Release. The debug symbols for LLVM and Clang take over 25GB. I don’t know how much more over 25GB, since that was all I had available at the time…
The key difficulty with this check is that the expression
a < b is a
BinaryOperator node in the AST if this is a builtin comparison, but otherwise is a
CXXOperatorCallExpr. These further have different members, so you can’t just wrap things in an
anyOf() and call it a day. So the approach that I took was to match both kinds of expressions separately:
Now once I have one expression, I need to check if either of its subexpressions (which are
getRHS() for the the former case, and
getArg(1) for the latter) is an unparenthesized comparison. This also was a little tricky, since I need to consider lots of cases:
BinaryOperator— easy, just switch on the operator
CXXOperatorCallExpr— likewise, just different switch
ParenExpr— explicitly want to consider this
- Other. This is the trickier case to handle. There’s lots of stuff that could go here. Temporary materialization, implicit casts, call expressions, etc. My initial approach was if the node has exactly one child, recurse into it, otherwise return
false. This worked great (reject all the nodes with multiple arity and the leaves, keep walking through implicit casts) until I ran into this in a Boost test:
Obviously that’s not a chained comparison, but my check warned about it anyway. As a result, I added another early bailout:
ExplicitCastExpr. And so far, it’s seemed alright. It’s likely there’s other nodes I need to reject on, but I haven’t run into that code yet?