Doubles or Floats, Let .Net7 Help

Norm Bryar
8 min read · Nov 24, 2022



Teams needing real numbers typically reach for 64-bit doubles by instinct, but .Net7 has made it easier to work with 32-bit floats. That might get one wondering: what should we ask, or look out for, when straying off the familiar, beaten path?

The C# language has always encouraged doubles: 3.14 is a double; you need the suffix, 3.14f, to make a float. Many of the BCL's Math.Xxx(...) functions lack overloads for float (Math.Sqrt, for instance, takes only a double), so you needed casts (hence conversions) to use them with floats. Even in the 32-bit app era, 64-bit doubles were the orthodox choice. Why risk running out of range, propagating precision errors, or piling up conversion costs?

Of course, prior to .Net7 there were pockets of specialized types preferring float, e.g. Vector3, Quaternion, etc. And now that .Net7's Generic Math interfaces re-implement the Math.Xxx methods (and more) on more numeric types, float could be worth a second look.

float theta          = float.Pi / 4.0f;
var (sinVal, cosVal) = float.SinCos( theta );  // 👈 whoa, look at that!

(Alternatively, there's the MathF class, but you can access several more type-specific things via the static methods now exposed directly on the types, e.g. short.Log2(…).)
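Those per-type statics come from .Net7's Generic Math interfaces, which also means one helper can serve every floating-point width. A quick sketch of mine (PolarToCartesian is a made-up name, not a BCL API):

using System.Numerics;

// Usage: same code path, the caller picks the precision.
var f = PolarDemo.PolarToCartesian(1.0f, float.Pi / 4.0f);   // float in, float out
var d = PolarDemo.PolarToCartesian(1.0,  double.Pi / 4.0);   // double in, double out
Console.WriteLine($"{f} {d}");

static class PolarDemo
{
    // One implementation serves float, double (and even Half);
    // T.SinCos and the operators come from the generic-math interfaces.
    public static (T X, T Y) PolarToCartesian<T>(T radius, T theta)
        where T : IFloatingPointIeee754<T>
    {
        var (sin, cos) = T.SinCos(theta);
        return (radius * cos, radius * sin);
    }
}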

Pros of single-precision float

Here are some reasons to prefer float over double:

  1. They take up half the memory-space (obviously).
    Your look-up table from IP address to lat/lon centroid would suddenly take half as many gigabytes. What's not to love?
  2. Some FPU ops will run faster on smaller types.
    Transcendentals (trig, sqrt, log/exp) and even division may perform better. Others (fmul, fadd, …) are inherently fast and largely agnostic to operand size. (BTW, try to favor such ops in the inner loop anyway, e.g. use x,y,z vectors over trig angles.)
  3. Twice as many values will fit on a cache-line.
    A time-series float[] would get more L1/L2 cache-hits (and if data must be iteratively revisited, fewer self-evictions).
  4. …and twice as many values vectorized by SIMD intrinsics.
    (If you know SIMD, you know the float/double trade-offs already; if you're asking "what's a SIMD?", it's the Vector intrinsics that massively sped up Enumerable.Max, .Sum, .Average, as well as ML.Net, physics engines in gaming, etc. A tiny sketch follows this list.)
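To make point 4 concrete, here's a small sketch of mine using System.Numerics.Vector<T>; the lane counts in the comments assume 256-bit AVX2 registers:

using System.Numerics;

Console.WriteLine(Vector<double>.Count);   // e.g. 4 doubles per SIMD register on AVX2
Console.WriteLine(Vector<float>.Count);    // e.g. 8 floats  per SIMD register on AVX2

Console.WriteLine(SumSimd(new float[] { 1f, 2f, 3f, 4f, 5f }));   // 15

// A trivial hand-rolled vectorized sum (note: it re-associates the adds,
// so rounding can differ slightly from a sequential loop).
static float SumSimd(ReadOnlySpan<float> data)
{
    var acc = Vector<float>.Zero;
    int i = 0;
    for (; i <= data.Length - Vector<float>.Count; i += Vector<float>.Count)
        acc += new Vector<float>(data.Slice(i, Vector<float>.Count));

    float sum = Vector.Sum(acc);           // horizontal add of the lanes
    for (; i < data.Length; i++)           // scalar tail
        sum += data[i];
    return sum;
}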

Cons of using float

  1. Range is limited: 10³⁸ (vs 10³⁰⁸), though usually not the main concern.
  2. Precision is limited: the IEEE-754 standard gives 32-bit floats a 24-bit significand (64-bit doubles have 53 bits of resolution, see float-toy). That's 2⁻²⁴ ≈ 6×10⁻⁸ relative precision (vs ~10⁻¹⁶), i.e. roughly 7 decimal digits of accuracy (vs ~16).
    Often that suffices. With a small exponent, e.g. 5, we could express a range of latitudes between 32° and 64°, approximately Jacksonville, Florida up past Inukjuak, Quebec. Resolution would be ~0.425 meters, better than the GPS User Range Error of ~0.64 m (and likely better by far than your company's privacy/anonymizer truncation policy ;-)
float EarthRadius = 6_378_137.0f;  // meters
float midLat      = 45.0f;
float secLat      = float.BitIncrement(midLat);   // significand += 1
float DegToRad    = MathF.PI / 180.0f;
float res = EarthRadius * (secLat - midLat) * DegToRad;   // 👈 == ~0.4246 m

Note: precision dilates with scale. Had you used float to store elapsed seconds, say (the unit choice doesn't matter much), then after 60 s the resolution is 3.8 μs, but after 1 hr it's 0.24 ms, and after 1 day, 7.8 ms. For large values between 2²⁴ and 2³¹, a 32-bit int gives better precision than a float; a float can store integers exactly only up to 2²⁴, then even integers to 2²⁵ (with rounding jumps, and the ++ operator doing nothing), every fourth integer up to 2²⁶, etc. (a quick demo follows).
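A minimal demo of that dilation, mine, with the expected results in the comments:

float seconds = 86_400.0f;                          // one day's worth of seconds
float step    = float.BitIncrement(seconds) - seconds;
Console.WriteLine(step);                            // 0.0078125, i.e. ~7.8 ms resolution

float big = 16_777_216.0f;                          // 2^24
Console.WriteLine(big + 1.0f == big);               // True: ++ is a no-op up here
Console.WriteLine((long)float.BitIncrement(big));   // 16777218: only even integers now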

(Using fixed-point math to counter this dilation is out of scope for this article.)

Ok, so ensure you’re within constraints of the ‘cons’ and benchmark to check the ‘pros’. Et voila, c’est tout. Right?

When the math on the data is minimal, I suspect oui, c'est tout; but the more involved the computation, and the longer state feeds back into itself, the more likely you are to hit a wrinkle. Try some of the following before reverting your double-to-float PR.

Precision and Noise-Propagation

The following may surprise folks about floating-point equality.

Console.Write(
$"\r\n 0.1 + 0.1 + 0.1 == 0.3 : {(0.1 + 0.1 + 0.1) == 0.3}" +
$"\r\n 123 / 10 * 10 == 123 : {((123 / 10) * 10) == 123}" +
$"\r\n (double)(float)47.136 == 47.136 : " +
$"{((double)(float)47.136) == 47.136}" +
$"\r\n 1.1f == 1.1 : {1.1f == 1.1}");

The problem, of course, is quantization/rounding noise. Those were doubles and still they failed. 0.1 is imperfectly expressible, thus quantized; the sum of two quantized 0.1s is quantized again, as is the sum of that 0.2 and 0.1, so little wonder it doesn't equate to the (also quantized) 0.3 (0.1.ToString("G17") = "0.10000000000000001", 0.3.ToString("G17") = "0.29999999999999999"). (The 123 / 10 line fails for the integer cousin of the same problem: truncating division.)

Floats will be worse. The last line also shows that a float constant won't even equal the same constant written as a double!

<aside> And so, my jaw dropped in a code review on seeing Dictionary<double, …>. One can just imagine the very poor recall rate, and the sky-high memory use from 'updates' rarely actually replacing elements. The dev did explain he'd only store consts that were integers, but I still worried a new hire might learn bad ideas.</aside>
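The usual remedy is a tolerance comparison rather than ==. Here's a sketch of mine (NearlyEqual is a made-up helper; the generic-math constraint lets it cover float, double, and Half alike):

using System.Numerics;

// Relative tolerance scales with the operands' magnitudes.
static bool NearlyEqual<T>(T a, T b, T relTol) where T : IFloatingPointIeee754<T>
    => T.Abs(a - b) <= relTol * T.Max(T.Abs(a), T.Abs(b));

Console.WriteLine(0.1 + 0.1 + 0.1 == 0.3);                     // False, as above
Console.WriteLine(NearlyEqual(0.1 + 0.1 + 0.1, 0.3, 1e-12));   // True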

Numerical Stability

I'm not a numerical-methods guy, so take this with a grain of salt, but… stable algorithms are crafted to attenuate the flow of rounding errors. Good algorithms may even re-arrange the data to perform ops in the least-noisy order (e.g. matrix methods re-map rows and columns as they go, a.k.a. pivoting). Some of the hand-techniques you used in Math 201 (Intro to Linear Algebra) actually propagate and amplify noise, Cramer's Rule being such a beast.

A few stability-minded heuristics:

  • Subtracting very similar numbers amplifies noise.
    x.xxxxxx1 - x.xxxxxx0 has rounded both values already, so the answer 1.1E-6 approximates what in reality could be anywhere from 1.0E-6 to 1.2E-6. Small absolute errors turn into a ~10% relative error here. This phenomenon is called catastrophic cancellation and can get far bigger. The classic mitigation is to favor adds, multiplies, and divides over subtracts, e.g.
// cancellation risk minimized by promoting to double, as the fsub is unavoidable
double b4ac = (double)b * b - 4.0 * a * c;
float b4ac_sqrt = float.Sqrt( (float)b4ac );
float x1 = -1.0f *
    (b + (float.Sign(b) * b4ac_sqrt))
    / (2.0f * a);

-float x2 = -1.0f *
-    (b - (float.Sign(b) * b4ac_sqrt))  // cancellation risk!
-    / (2.0f * a);
+float x2 = c / (a * x1);               // equivalent form avoiding the subtraction
  • Subtracting very dissimilar numbers can destabilize the result.
    xxx.xxxxxx - y.yyyyyyE-6 doesn't remove enough from the larger term. If we want to know what percentage of a wall's length the player is at, better to do temp = wallEnd - player; pct = (wallLen - temp)/wallLen; than start = wallEnd - wallLen; pct = (player - start)/wallLen; assuming wallLen is very small compared to the open-world coordinate wallEnd (I'm told some game data starts coords at 2.0E9 to avoid other rendering artifacts, else this example may seem puzzling).
  • Recursion should aim to diminish noise per step.
    For instance, instead of iterating En = 1 - n*En-1 forward over n = 1, 2, 3…, which multiplies prior errors by ever-larger factors, try going backwards, En-1 = (1 - En)/n for n = …, 3, 2, 1 (see the sketch just after this list).
  • You may have better luck by trying to keep intermediate terms much smaller than the final answer. Converge to the final answer with small corrections.
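To make that recursion bullet concrete, here's a sketch of mine using the textbook instance of that recurrence, In = ∫₀¹ xⁿ eˣ⁻¹ dx, where I0 = 1 - 1/e and In = 1 - n·In-1:

// Target: I20, whose true value is ≈ 0.0455 (In ≈ 1/(n+2) for large n).
float fwd = 1.0f - float.Exp(-1.0f);    // I0 = 1 - 1/e
for (int n = 1; n <= 20; n++)
    fwd = 1.0f - n * fwd;               // forward: each step multiplies the prior error by n

float bwd = 0.0f;                       // crude guess for I30; its error shrinks each step
for (int n = 30; n > 20; n--)
    bwd = (1.0f - bwd) / n;             // backward: each step divides the prior error by n

Console.WriteLine($"forward:  {fwd}");  // garbage: completely wrong, enormous magnitude
Console.WriteLine($"backward: {bwd}");  // ≈ 0.0455, correct to float precision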

Likely good tricks to keep in mind even if you stick with doubles. Alas, I don't think there's a simple master recipe for stability, but awareness helps.

I wish I could say checked { … } blocks could help catch overflow problems, but sadly they don't apply to floating-point in .Net. Peppering float.IsInfinity(res) checks/asserts about may help. (Underflow, e.g. float.Epsilon / 2.0f, results in 0.0f, which is less convenient for inferring a range issue, but happily usually less of a show-stopper.) If infinity (overflow) is encountered, one technique is to pre-scale, which incurs more ops, yes, so benchmark to see that the size/cache-line/etc. benefits don't evaporate. For instance, …

-// if the straight-forward sum-of-squares overflows ...
-float magSquared = x1*x1 + x2*x2 + ... xn*xn;

+// Pseudo-code of the pre-divide
+float scale = max(|x1|, |x2|, ... |xn|);
+float scaleInv2 = 1.0f / (scale * scale);  // one fdiv up front; fmul is much faster than fdiv
+float mag = scale * float.Sqrt( scaleInv2*x1*x1 + ... + scaleInv2*xn*xn );
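And a runnable version of that idea, my own sketch, written over a span instead of named variables:

// Scaled 2-norm: divide by the largest magnitude first so the squares can't overflow.
static float ScaledMagnitude(ReadOnlySpan<float> xs)
{
    float scale = 0.0f;
    foreach (float x in xs)
        scale = float.Max(scale, float.Abs(x));
    if (scale == 0.0f)
        return 0.0f;                    // all zeros (also avoids 0/0 below)

    float scaleInv = 1.0f / scale;      // one division, then cheap multiplies
    float sum = 0.0f;
    foreach (float x in xs)
    {
        float t = x * scaleInv;
        sum += t * t;
    }
    return scale * float.Sqrt(sum);
}

// 3e19 squared would overflow a float to +∞; the scaled form stays finite.
Console.WriteLine(ScaledMagnitude(new float[] { 3e19f, 4e19f }));   // 5E+19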

For cases like computing eigenvalues, matrix-inverse, etc., I’d probably just stick with double (and also use well-established LAPACK-ish libraries, vs. rolling my own “faster” algo, ’cause edge-cases abound and neither I nor my maintaining teammates are specialists). One can still use floats for data-streams, as deltas, etc., and surgically use doubles in critical, deep crunching methods.

.Net7's Generic Math does, however, include a few methods that cut out intermediate roundings. float.FusedMultiplyAdd(x, y, acc) computes x*y + acc while keeping the intermediate product at full precision and rounding only once at the end, handy (and often faster) for multiply-accumulate scenarios such as Horner's Method; for vector projections (v1 · v2 / |v1||v2|) there's float.ReciprocalSqrtEstimate(mag). Searching the literature, you can also find compensating algorithms such as Kahan Summation, if error propagation is suspected.
(Confession: I've never resorted to these, so my guesses at example usage may not showcase them well.)
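For what it's worth, here's a sketch of mine of that Kahan (compensated) summation idea:

// Kahan summation: carry the low-order bits each add would otherwise lose.
static float KahanSum(ReadOnlySpan<float> values)
{
    float sum = 0.0f;
    float c   = 0.0f;                  // running compensation (lost low-order bits)
    foreach (float v in values)
    {
        float y = v - c;               // subtract the error from the previous step
        float t = sum + y;             // big + small: the low bits of y are lost here...
        c = (t - sum) - y;             // ...and recovered here (algebraically zero)
        sum = t;
    }
    return sum;
}

var data = new float[10_000];
Array.Fill(data, 0.1f);

float naive = 0.0f;
foreach (float v in data) naive += v;  // plain float accumulator drifts

Console.WriteLine(naive);              // noticeably off from 1000
Console.WriteLine(KahanSum(data));     // 1000 (to float precision)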

Often, though, no hunt for accuracy-preserving techniques is needed.
Again, awareness can pay dividends, so...

Even Smaller Than Float!

There's also a 16-bit Half type, a scant 1/4 the size of a double, and .Net7 supports a lot of ops on it. At 11 significand bits (10 stored), I doubt you'll want it for your geospatial logic, but machine-learning folk find a lot of use for this size, e.g. in memory-bound deep learning. GPUs support this type and/or its 16-bit cousin, BFLOAT16 (which I'm unaware of .Net's plans around), a format that trades 3 bits of significand for a wider exponent range. So, clearly, very useful things can still be achieved with limited precision (encouraging for float, no?).
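For a quick feel of that precision (another sketch of mine): Half keeps integers exact only up to 2¹¹ = 2048, the same kind of cliff float hits at 2²⁴.

Half a = (Half)2048.0f;
Half b = (Half)2049.0f;                    // 2049 isn't representable; it rounds back to 2048
Console.WriteLine(a == b);                 // True
Console.WriteLine((float)(Half)3.14159f);  // 3.140625: only ~3 decimal digits survive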

Conclusion

For big collections, or quickly repeated transforms on numeric collections, see if picking float instead of defaulting to double might be a win. We've seen a couple of contra-indications around range and precision, but more tasks are amenable than you might think, and .Net7 makes using floats easier than ever. Peeking afresh at the System.Numerics namespace is a good idea regardless.

Hope you enjoyed the read.


Norm Bryar

A long-time, back-end web service developer enamored with .Net and C#, code performance, and techniques taming drudgery or increasing insight.