How are floating-point numbers stored in memory?
Oct 13, 2019
This is a cross-post from my blog.
This article is a simplification of the IEEE 754 standard. Here we will see how floating-point numbers are stored in memory, along with floating-point exceptions, rounding, etc. If you want more authoritative sources, go for:
- What Every Computer Scientist Should Know About Floating-Point Arithmetic
- https://en.wikipedia.org/wiki/IEEE_754-1985
- https://en.wikipedia.org/wiki/Floating_point.
Floating-point numbers are stored by encoding the significand and the exponent (along with a sign bit).
- The line above contains two or three abstract terms, and it will probably not make sense until you read further.
Floating point number memory layout
+-+--------+-----------------------+
| |        |                       |
+-+--------+-----------------------+
 ^    ^                ^
 |    |                |
 |    |                +-- significand (width: 23 bits)
 |    |
 |    +------------------- exponent (width: 8 bits)
 |
 +------------------------ sign bit (width: 1 bit)
A typical single-precision 32-bit floating-point memory layout has the following fields:
- sign
- exponent
- significand (AKA mantissa)
Sign
- The high-order bit indicates the sign: 0 indicates a positive value, 1 indicates a negative one.
Exponent
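- The next 8 bits are used for the exponent, which can be positive or negative. Instead of reserving another sign bit, the exponent is stored with a bias of 127: the stored value is the true exponent plus 127. So 0111 1111 (127) represents 0, 1000 0000 (128) represents 1, 0000 0001 represents -126, and 1111 1110 represents 127. The all-zeros and all-ones patterns are reserved for special values (zero and subnormals, infinity and NaN).
- How does this encoding work? See exponent bias, or watch it applied in the practical example that follows.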
Significand
- The remaining 23 bits are used for the significand (AKA mantissa). Each bit represents a negative power of 2, counting from the left, so:
01101 = 0 * 2^-1 + 1 * 2^-2 + 1 * 2^-3 + 0 * 2^-4 + 1 * 2^-5
= 0.25 + 0.125 + 0.03125
= 0.40625
OK! We are done with basics.
Let’s understand practically
- So, let's take a very familiar float value, 3.14 (an approximation of PI), as an example.
- Sign: zero here, as PI is positive!
Exponent calculation
- 3 is easy: 0011 in binary.
- The rest, 0.14:
0.14 x 2 = 0.28, 0
0.28 x 2 = 0.56, 00
0.56 x 2 = 1.12, 001
0.12 x 2 = 0.24, 0010
0.24 x 2 = 0.48, 00100
0.48 x 2 = 0.96, 001000
0.96 x 2 = 1.92, 0010001
0.92 x 2 = 1.84, 00100011
0.84 x 2 = 1.68, 001000111
And so on . . .
- So, 0.14 = 0.001000111... in binary. If you don't know how to convert a decimal number to binary, refer to float to binary.
- Add 3:
11.001000111... with exp 0 (3.14 * 2^0)
- Now shift it (normalize it) and adjust the exponent accordingly
1.1001000111... with exp +1 (1.57 * 2^1)
- Now you only have to add the bias of 127 to the exponent 1 and store it (i.e. 128 = 1000 0000):
0 1000 0000 1100 1000 111...
- Forget the top 1 of the mantissa (which is always supposed to be 1, except for some special values, so it is not stored), and you get:
0 1000 0000 1001 0001 111...
- So our value of 3.14 would be represented as something like:
0 10000000 10010001111010111000011
^     ^               ^
|     |               |
|     |               +--- significand = 0.57 (1.57 with the implicit leading 1)
|     |
|     +------------------- exponent = 1
|
+------------------------- sign = 0 (positive)
- The number of bits in the exponent determines the range (the minimum and maximum values you can represent).
Summing up significand
- If you add up all the bits in the significand, they don't total exactly 0.57 (which they should, since 3.14 = 1.57 * 2^1). They come out to about 0.5700000525.
- There aren't quite enough bits to store the value exactly; we can only store an approximation.
- The number of bits in the significand determines the precision.
- 23 bits (plus the implicit leading 1) give us roughly 6 to 7 decimal digits of precision. 64-bit floating-point types give roughly 15 to 16 digits of precision.
Strange, but true
- Some values cannot be represented exactly no matter how many bits you use. Just as values like 1/3 cannot be represented in a finite number of decimal digits, values like 1/10 cannot be represented in a finite number of bits.
- Since values are approximate, calculations with them are also approximate, and rounding errors accumulate.
Let’s see things working
#include <stdio.h>
#include <string.h>

/* Print binary stored in plain 32 bit block */
void intToBinary(unsigned int n)
{
    int c;
    unsigned int k;
    for (c = 31; c >= 0; c--)
    {
        k = n >> c;
        if (k & 1) printf("1");
        else printf("0");
    }
    printf("\n");
}

int main(void)
{
    unsigned int m;
    float f = 3.14f;

    /* See hex representation */
    printf("f = %a\n", f);

    /* Copy memory representation of float to plain 32 bit block */
    memcpy(&m, &f, sizeof (m));
    intToBinary(m);

    return 0;
}
- This C code prints the binary representation of a float on the console:
f = 0x3.23d70cp+0
01000000010010001111010111000011
Where is the decimal point stored?
- The decimal point is not explicitly stored anywhere.