How is a floating-point number stored in memory?
Oct 13, 2019
This is a cross-post from my blog.
This article is just a simplification of the IEEE 754 standard. Here, we will see how floating-point numbers are stored in memory, floating-point exceptions/rounding, etc. If you want more authoritative sources, go for:
 What Every Computer Scientist Should Know About Floating-Point Arithmetic
 https://en.wikipedia.org/wiki/IEEE_754-1985
 https://en.wikipedia.org/wiki/Floating_point
Floating-point numbers are stored by encoding a significand and an exponent (along with a sign bit).
 The line above contains two or three abstract terms, and it will probably not make sense until you read further.
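Floating-point number memory layout
+-+--------+-----------------------+
| |        |                       |
+-+--------+-----------------------+
 ^    ^                ^
 |    |                |
 |    |                +-- significand (width: 23 bits)
 |    |
 |    +-- exponent (width: 8 bits)
 |
 +-- sign bit (width: 1 bit)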
A typical single-precision 32-bit floating-point memory layout has the following fields:
 sign
 exponent
 significand (AKA mantissa)
Sign
 The high-order bit indicates the sign. 0 indicates a positive value, 1 indicates a negative value.
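Exponent
 The next 8 bits are used for the exponent, which can be positive or negative. Instead of reserving another sign bit, the exponent is stored with a bias of 127 added to it, so 0111 1111 represents 0, 1000 0000 represents 1, 0000 0001 represents -126 and 1111 1110 represents 127. The patterns 0000 0000 and 1111 1111 are reserved for special values (zero and subnormal numbers, infinity and NaN).
 How does this encoding work? Look up exponent bias, or see it in practice in the worked example further down.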
Significand
 The remaining 23 bits are used for the significand (AKA mantissa). Each bit represents a negative power of 2, counting from the left, so:
01101 = 0 * 2^-1 + 1 * 2^-2 + 1 * 2^-3 + 0 * 2^-4 + 1 * 2^-5
= 0.25 + 0.125 + 0.03125
= 0.40625
OK! We are done with the basics.
Let’s understand it practically
 So, let's consider a very well-known float value, 3.14 (PI), as an example.
 Sign: zero here, as PI is positive!
Exponent calculation
 3 is easy: 0011 in binary. The rest, 0.14, is converted by repeatedly multiplying by 2 and collecting the integer parts:
0.14 x 2 = 0.28, 0
0.28 x 2 = 0.56, 00
0.56 x 2 = 1.12, 001
0.12 x 2 = 0.24, 0010
0.24 x 2 = 0.48, 00100
0.48 x 2 = 0.96, 001000
0.96 x 2 = 1.92, 0010001
0.92 x 2 = 1.84, 00100011
0.84 x 2 = 1.68, 001000111
And so on . . .
 So, 0.14 = 0.001000111... in binary.
 If you don't know how to convert a decimal number to binary, refer to this float to binary.
 Add 3:
11.001000111... with exp 0 (3.14 * 2^0)
 Now shift it (normalize it) and adjust the exponent accordingly:
1.1001000111... with exp +1 (1.57 * 2^1)
 Now you only have to add the bias of 127 to the exponent 1 and store it (i.e. 128 = 1000 0000):
0 1000 0000 1100 1000 111...
 Forget the top 1 of the mantissa (which is always supposed to be 1, except for some special values, so it is not stored), and you get:
0 1000 0000 1001 0001 111...
 So our value of 3.14 would be represented as something like:
0 10000000 10010001111010111000011
^     ^               ^
|     |               |
|     |               +-- significand = 1.57 (implicit leading 1 + stored fraction 0.57)
|     |
|     +-- exponent = 1 (stored as 128)
|
+-- sign = 0 (positive)
 The number of bits in the exponent determines the range (the minimum and maximum values you can represent).
Summing up significand
 If you add up all the stored bits in the significand, they don’t total exactly 0.57 (the fractional part of 1.57). They come out to about 0.57000005.
 There aren’t quite enough bits to store the value exactly. We can only store an approximation.
 The number of bits in the significand determines the precision.
 23 bits (24 with the implicit leading 1) give us roughly 6 to 7 decimal digits of precision. 64-bit floating-point types give roughly 15 to 16 digits of precision.
Strange, but a fact
 Some values cannot be represented exactly no matter how many bits you use. Just as values like 1/3 cannot be represented in a finite number of decimal digits, values like 1/10 cannot be represented in a finite number of bits.
 Since values are approximate, calculations with them are also approximate, and rounding errors accumulate.
Let’s see things working
#include <stdio.h>
#include <string.h>

/* Print binary stored in plain 32 bit block */
void intToBinary(unsigned int n)
{
    int c, k;

    /* Walk from the highest bit (31) down to the lowest bit (0) */
    for (c = 31; c >= 0; c--)
    {
        k = n >> c;
        if (k & 1) printf("1");
        else printf("0");
    }
    printf("\n");
}

int main(void)
{
    unsigned int m;
    float f = 3.14f;

    /* See hex representation */
    printf("f = %a\n", f);

    /* Copy memory representation of float to plain 32 bit block */
    memcpy(&m, &f, sizeof (m));
    intToBinary(m);

    return 0;
}
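 This C code prints the hex and binary representations of the float on the console (the exact form of the %a line can vary between C libraries, for example 0x1.91eb86p+1, which is the same value):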
f = 0x3.23d70cp+0
01000000010010001111010111000011
Where is the decimal point stored?
 The decimal point is not explicitly stored anywhere.