Not a Number: A quiet horror?
Joe Armstrong recently asked the following question on Twitter:
Any pointers to articles/papers on the subject of “quiet NaN propagation” being bad? @joeerl, 2017-11-25
This made me think. I tried posting a quick response, but 140 (or was it 280?) characters really is too short to explain everything that is going on here, so I decided to write this slightly(?) longer article.
What is a ‘NaN’ exactly?
At some point, some smart minds realized that floating-point arithmetic would be a really nice way to quickly calculate the results of applying common models in physics and chemistry to certain situations. The concessions made in the IEEE 754 floating-point standard were very much made with these practical uses in mind:
- To keep calculations fast, floating-point numbers always take up a fixed amount of space:
- a) This means that dedicated hardware (the FPU, or floating-point unit) can work with them.
- b) This unfortunately also means that numbers that are too small or too large cannot be represented. Also, many numbers in between cannot be represented and are rounded to the nearest representable number. For sensor data this is usually not a problem, since the rounding error should be smaller than the measurement error of the sensor. (But it is disastrous if you start working with floats for e.g. monetary values!)
- In these fields, the rules of calculus matter, such as working with limits and the related concept of infinity (mathematically speaking, infinity is not a number), so the standard includes special infinity values.
- This unfortunately also means that many operations are ill-defined. Infinity is not a number, but many of the operations you can plug infinity into cannot give a sensible output either, so a special, non-infinity ‘not a number’ value was required.
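The fixed-size trade-off is easy to observe. Here is a minimal Python sketch, assuming that Python's `float` is an IEEE 754 double (which is the case on mainstream platforms); the variable names are just for illustration:

```python
import math

# Above 2**53, not every integer is representable as a double.
big = 2.0 ** 53
lost_increment = (big + 1.0 == big)   # True: the +1 is rounded away

# Too large: the multiplication overflows to infinity instead of failing.
overflowed = 1e308 * 10.0             # inf

# Too small: halving the smallest positive double underflows to zero.
underflowed = 5e-324 / 2.0            # 0.0

print(lost_increment, overflowed, underflowed)
```

None of these three operations raises an error; the value is silently rounded, saturated to infinity, or flushed to zero.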
So, there you have it: NaN, ‘Not a Number’. A very weird value with some very strange properties. To generate a NaN yourself, you could, for instance, (in an IEEE 754 context) attempt to take the square root of a negative number, or divide infinity by infinity.
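To make this concrete, here is a small Python sketch of a few ways to produce a NaN. Note that Python itself is stricter than raw IEEE 754 in places: `math.sqrt` of a negative number raises `ValueError` rather than returning NaN, so that case is shown separately.

```python
import math

inf = math.inf

nan_from_division = inf / inf       # infinity divided by infinity
nan_from_subtraction = inf - inf    # infinity minus infinity
nan_from_product = 0.0 * inf        # zero times infinity

# Python's math.sqrt does not hand back a NaN for a negative input;
# it raises ValueError instead.
try:
    math.sqrt(-1.0)
    sqrt_raised = False
except ValueError:
    sqrt_raised = True

print(nan_from_division, nan_from_subtraction, nan_from_product, sqrt_raised)
```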
When are they useful?
They are incredibly useful when you are working with large amounts of input from, for instance, sensor arrays. Such an array might contain a couple of bad sensors. However, rather than filtering the bad values out of the vector or matrix of data obtained from these sensors, it is often better to keep them in place, because the location of the missing values is itself significant information.
For some algorithms, working with NaNs does not make sense at all. But for many, they can either be propagated (remaining NaN in the output) or discarded altogether in the arithmetic of the algorithm.
And this is nice, because algorithm implementations that work on a local piece of data (inside the fast cache) and do not need branches to jump away on an error case can be optimized ridiculously well (and can harness, for instance, the power of Single-Instruction-Multiple-Data (SIMD) instructions).
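The two strategies, propagating and discarding, could look roughly like this in Python. This is only a sketch of the semantics: the readings, `calibrate`, and `mean_ignoring_nan` are made up for illustration, and plain Python lists will not give you SIMD speed (for that you would reach for an array library).

```python
import math

NAN = float("nan")

# Hypothetical readings from five sensors; sensors 1 and 3 are broken.
readings = [20.5, NAN, 21.0, NAN, 19.5]

def calibrate(values, gain=1.02, offset=-0.3):
    """Element-wise transform with no error branch: NaN in, NaN out."""
    return [gain * v + offset for v in values]

def mean_ignoring_nan(values):
    """Reduction that discards NaNs instead of propagating them."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else NAN

calibrated = calibrate(readings)

# The positions of the NaNs survive the transform, so we still know
# which sensors were bad.
bad_sensors = [i for i, v in enumerate(calibrated) if math.isnan(v)]
average = mean_ignoring_nan(calibrated)

print(bad_sensors, average)
```

The calibration step never checks for errors, and yet the broken sensors are still identifiable afterwards.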
When are they horrible?
There are two things about NaN that can be seen as very bad:
NaN is not equal to itself, and other comparisons are also surprising
While ‘logical’ in some weird, twisted way (because you cannot sensibly compare two things that are not numbers in a numerical way), this is disastrous when your program tries to sort things. Or when you do type-checking: Idris’ type-checking used to be broken because of this flaw.
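A few lines of Python show how odd this gets in practice (a small sketch; the variable names are mine):

```python
import math

nan = float("nan")

equal_to_itself = (nan == nan)        # False
differs_from_itself = (nan != nan)    # True

# Every ordered comparison involving NaN is False as well.
ordered_comparisons = [nan < 1.0, nan > 1.0, nan <= nan, nan >= nan]

# The built-in max() relies on those comparisons, so its result
# depends on the order of the arguments.
max_nan_first = max(nan, 1.0)         # nan
max_nan_last = max(1.0, nan)          # 1.0

# The reliable way to detect a NaN is math.isnan.
detected = math.isnan(nan)            # True
```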
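It also means that IEEE 754 floats are only partially ordered at best. This has important implications for many algorithms: sorting and searching, for instance, assume that any two values can be compared consistently (a total order, or at least a strict weak ordering), and that assumption no longer holds once a NaN is involved.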
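Here is a sketch of what goes wrong for ordering, together with one possible workaround. The `nan_last_key` helper is hypothetical; it simply decides that NaNs sort after all real numbers.

```python
import math

nan = float("nan")

def incomparable(a, b):
    """True when neither a < b nor b < a holds."""
    return not (a < b) and not (b < a)

# 1.0 and NaN are incomparable, NaN and 2.0 are incomparable,
# yet 1.0 and 2.0 are perfectly comparable. Sorting algorithms
# usually assume this cannot happen.
breaks_sort_assumption = (
    incomparable(1.0, nan)
    and incomparable(nan, 2.0)
    and not incomparable(1.0, 2.0)
)

def nan_last_key(x):
    """Hypothetical sort key: real numbers in numeric order, NaNs last."""
    return (math.isnan(x), 0.0 if math.isnan(x) else x)

data = [3.0, nan, 1.0, nan, 2.0]
ordered = sorted(data, key=nan_last_key)

print(ordered)
```

Sorting `data` without the key does not raise an error, but where the NaNs end up (and whether the other values come out in order) is not something I would rely on.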
NaNs are quietly (implicitly!) propagated
NaNs are very secretive; when performing normal arithmetic on well-formed data, you will not come across them. It’s only when working with malformed data, or with models that only make sense for a certain range of values, that they show up. This means that many programmers do not keep NaNs in mind when developing their programs, and that things often do not break during development/testing but only in production. Using a property-based testing tool such as QuickCheck might help a little (since it might actually throw some NaNs your algorithm implementation’s way during testing), but the only way to prevent things from going wrong is to be very mindful when using floats in your program.
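A small Python sketch of how quietly this happens. The `growth_rates` function and the input series are invented for illustration; the point is that nothing fails until the NaN reaches the output.

```python
import math

def growth_rates(values):
    """Ratio of each value to its predecessor; deliberately no validation."""
    return [b / a for a, b in zip(values, values[1:])]

# Hypothetical input in which some upstream step produced infinities.
series = [1.0, 2.0, math.inf, math.inf, 8.0]

rates = growth_rates(series)     # [2.0, inf, nan, 0.0]
total = sum(rates)               # nan: one bad element poisons the sum
report = f"average growth: {total / len(rates):.2f}"

print(report)                    # prints "average growth: nan"
```

No exception, no warning, just a ‘nan’ in the final report.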
NaNs are very similar to ‘null’ (‘the Billion Dollar Mistake’) in this regard. But maybe it is more proper to compare NaN to the Maybe or Optional type that many functional languages contain, where a special ‘nothing useful here’ value is added onto the existing values of a type. The obvious and important difference here is that Maybe is very explicit in nature, requiring pattern-matching to take things out and put things in.
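For comparison, this is roughly what the explicit style looks like, sketched in Python with `Optional` standing in for Maybe. The functions `safe_ratio` and `describe` are hypothetical, and Python only checks the `Optional` annotation if you run a static type checker over the code.

```python
import math
from typing import Optional

def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return None instead of NaN or infinity when the ratio is not finite."""
    if denominator == 0.0:
        return None
    result = numerator / denominator
    return result if math.isfinite(result) else None

def describe(ratio: Optional[float]) -> str:
    # The caller has to deal with the 'nothing useful here' case explicitly.
    if ratio is None:
        return "no data"
    return f"{ratio:.2f}"

print(describe(safe_ratio(1.0, 4.0)))
print(describe(safe_ratio(math.inf, math.inf)))
```

The missing value can no longer slip through arithmetic unnoticed: adding a number to `None` is an error, whereas adding a number to NaN is just another NaN.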
That said, the propagation of failure is often the only sensible choice when faced with malformed input. Yes, in certain cases it is better to reject malformed input outright, but in other cases what should happen on malformed input depends on external things (like the requirements of the program that ends up using your library), which makes it impossible to make this choice now.
And thus it seems to me that propagation is not inherently/always bad; as long as no ‘NaN’s end up in the user interface, that is ;-P.
Quiet NaNs vs Signaling NaNs
In theory, the IEEE 754 standard defines both Quiet NaNs (that propagate) and Signaling NaNs (that raise some kind of exception or invoke an error handler as soon as they are used in an operation). However, in many contexts, the latter is not implemented or exposed to the user. Also (and possibly because of this; perpetuating the problem), signaling NaNs are a lot less well known.
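One place where you can actually try both flavours is Python's `decimal` module. A caveat: this is decimal arithmetic rather than the binary floats discussed in the rest of this article, so take it as an illustration of the quiet/signaling distinction only.

```python
from decimal import Decimal, InvalidOperation

# A quiet NaN propagates through arithmetic without complaint.
quiet_result = Decimal("NaN") + Decimal("1")

# A signaling NaN raises as soon as it takes part in an operation
# (the default decimal context traps InvalidOperation).
try:
    Decimal("sNaN") + Decimal("1")
    signaled = False
except InvalidOperation:
    signaled = True

print(quiet_result, signaled)
```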
Floating-point arithmetic is a great tool when it is used appropriately. It is appropriate when working with sensor data, especially in large amounts, because working with floats is fast. Quiet NaN-propagation is useful because it is often the only practical way to have certain calculations done on current hardware in reasonable time.
However, floats should not be used when working with money, arbitrarily-large numbers or life-critical calculations because of their rounding and implicit handling of edge cases.
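The money problem fits in two lines of Python. This sketch uses the standard `decimal` module as the alternative; the variable names are mine.

```python
from decimal import Decimal

# Binary floats cannot represent 0.10 or 0.20 exactly.
float_total_matches = (0.10 + 0.20 == 0.30)    # False

# Decimal values built from strings stay exact for this sum.
decimal_total_matches = (
    Decimal("0.10") + Decimal("0.20") == Decimal("0.30")
)                                              # True

print(float_total_matches, decimal_total_matches)
```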
This also means that I find floats to be a very bad choice as the default (or only, the horror!) number type available in a given environment.
There are many other kinds of numbers that are unfortunately not available in most standard libraries:
- Arbitrary-size natural numbers (although BigNums are on the rise!),
- Arbitrary-size floating decimal numbers,
- Fixed-size floating-point numbers that only work with reals and do not work with limits, infinities or NaN (Erlang has these!),
- Complex numbers,
- Integers modulo N,
- Lazy irrational numbers and symbolic math.
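Some environments do better than others here. As a sketch of what having a few of these at hand looks like, Python's standard library happens to cover several of them (though not lazy irrationals or symbolic math):

```python
from decimal import Decimal, localcontext
from fractions import Fraction

big_natural = 2 ** 200                   # integers grow as needed
exact_third = Fraction(1, 3) * 3         # exactly 1, no rounding
complex_product = (1 + 2j) * (1 - 2j)    # (5+0j)
inverse_mod_13 = pow(7, -1, 13)          # modular inverse, Python 3.8+

with localcontext() as ctx:
    ctx.prec = 50                        # 50 significant decimal digits
    precise = Decimal(1) / Decimal(7)

print(big_natural, exact_third, complex_product, inverse_mod_13, precise)
```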
When computer types are no longer similar to the type of the data of the problem we are modeling, we’ll end up with surprising and possibly disastrous results, and this is indeed a very bad thing. I do not think that this means that ‘quiet NaN propagation is bad’, but rather that it is a case of ‘use the correct tool for the correct job’ and ‘be mindful of the edge cases of the constructs you use’.
Still, I’d love to see Signaling NaN become the default, with Quiet NaN-propagation becoming opt-in, because that seems like the far more reasonable default, one that does not suddenly shoot people who are new to floats in the foot.
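In the meantime, a ‘loud by default’ behaviour can be approximated in user code. The `reject_nan` decorator below is a hypothetical helper, not an existing library function, and it only checks the final result of a function rather than every intermediate operation.

```python
import functools
import math

def reject_nan(func):
    """Hypothetical decorator: raise as soon as a float result is NaN."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, float) and math.isnan(result):
            raise FloatingPointError(
                f"{func.__name__} produced NaN for {args!r}"
            )
        return result
    return wrapper

@reject_nan
def ratio(a, b):
    return a / b

def quiet_ratio(a, b):
    # Opting in to quiet propagation is simply the undecorated function.
    return a / b
```

With this in place, `ratio(math.inf, math.inf)` raises `FloatingPointError`, while `quiet_ratio(math.inf, math.inf)` still returns NaN for the code paths that really want propagation.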
These are my current thoughts on the matter. I hope to discuss this more with the bright minds out there, because I’m sure there is more to be said, and also sure that I am probably very wrong in one or multiple important ways :-).
I’d love to hear your thoughts!