How are floating-point numbers stored in memory?

Oct 13, 2019

This is a cross-post from my blog.

This article is just a simplification of the IEEE 754 standard. Here, we will see how floating-point numbers are stored in memory, along with floating-point exceptions, rounding, and so on. If you want a more authoritative source, go to the standard itself.

Floating-point numbers are stored by encoding a significand and an exponent (along with a sign bit).

The line above contains two or three abstract terms, and it probably won't make sense until you read further.

Floating point number memory layout

+-+--------+-----------------------+
| |        |                       |
+-+--------+-----------------------+
 ^    ^                ^
 |    |                |
 |    |                +-- significand (width: 23 bits)
 |    |
 |    +------------------- exponent (width: 8 bits)
 |
 +------------------------ sign bit (width: 1 bit)

A typical single-precision 32-bit floating-point memory layout has the following fields:

sign
exponent
significand (AKA mantissa)
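
To make the layout concrete, here is a minimal sketch that pulls the three fields out of a float with shifts and masks. The names `float_fields` and `split_float` are made up for this illustration, and the code assumes that `float` is an IEEE 754 32-bit type on your platform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw fields of a 32-bit float. */
struct float_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, exactly as stored */
    uint32_t fraction;  /* 23 bits, exactly as stored */
};

/* Split a float into its raw fields (assumes IEEE 754 single precision). */
struct float_fields split_float(float f)
{
    uint32_t bits;
    struct float_fields out;

    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);       /* copy the raw bit pattern */

    out.sign     = bits >> 31;            /* top bit */
    out.exponent = (bits >> 23) & 0xFFu;  /* next 8 bits */
    out.fraction = bits & 0x7FFFFFu;      /* low 23 bits */
    return out;
}
```

For instance, `split_float(-1.0f)` gives a sign of 1, a raw exponent field of 127, and a fraction of 0.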

Sign

The high-order bit indicates the sign.
0 indicates a positive value, 1 indicates negative.

Exponent

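The next 8 bits are used for the exponent, which can be positive or negative. Instead of reserving another sign bit, the exponent is stored with a bias of 127 added to it, so 0111 1111 represents 0, 0000 0001 represents -126 and 1111 1110 represents 127. The patterns 0000 0000 and 1111 1111 are reserved for special values (zero and subnormals, infinity and NaN).
How does this encoding work? Look up exponent bias, or see it in practice in the worked example further down.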
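
As a rough sketch of the bias at work, the helpers below read the stored exponent field and subtract 127 to get the real power of two. `stored_exponent`, `actual_exponent` and `FLOAT_EXP_BIAS` are names invented for this example, and the code assumes an IEEE 754 32-bit `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLOAT_EXP_BIAS 127  /* bias used by IEEE 754 single precision */

/* Return the raw 8-bit exponent field of a float. */
unsigned stored_exponent(float f)
{
    uint32_t bits;

    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    return (bits >> 23) & 0xFFu;
}

/* Return the real power of two. Only meaningful for normal numbers,
 * not for zero, subnormals, infinity or NaN. */
int actual_exponent(float f)
{
    return (int)stored_exponent(f) - FLOAT_EXP_BIAS;
}
```

With this, `actual_exponent(1.0f)` is 0 and `actual_exponent(0.5f)` is -1, while the stored fields are 127 and 126.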

Significand

The remaining 23 bits are used for the significand (AKA mantissa). Each bit represents a negative power of 2, counting from the left, so:

01101 = 0 * 2^-1 + 1 * 2^-2 + 1 * 2^-3 + 0 * 2^-4 + 1 * 2^-5 
      = 0.25 + 0.125 + 0.03125 
      = 0.40625
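
The same sum can be sketched in a few lines of C. The function name `fraction_from_bits` is an assumption of this example; it takes the bits as a string of '0' and '1' characters purely for readability.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the negative powers of two selected by a string of '0'/'1' characters. */
double fraction_from_bits(const char *bits)
{
    double value = 0.0;
    double weight = 0.5;        /* the first bit is worth 2^-1 */

    assert(bits != NULL);
    for (; *bits == '0' || *bits == '1'; bits++) {
        if (*bits == '1')
            value += weight;
        weight /= 2.0;          /* each following bit is worth half as much */
    }
    return value;
}
```

`fraction_from_bits("01101")` returns 0.40625, matching the hand calculation above.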

OK! We are done with basics.

Let’s understand practically

So, let's take the very famous float value 3.14 (PI) as an example.
Sign: Zero here, as PI is positive!

Exponent calculation

3 is easy: 0011 in binary
The rest, 0.14, is converted by repeated doubling:

0.14 x 2 = 0.28, 0
0.28 x 2 = 0.56, 00
0.56 x 2 = 1.12, 001
0.12 x 2 = 0.24, 0010
0.24 x 2 = 0.48, 00100
0.48 x 2 = 0.96, 001000
0.96 x 2 = 1.92, 0010001
0.92 x 2 = 1.84, 00100011
0.84 x 2 = 1.68, 001000111
And so on . . .

So, 0.14 = 0.001000111... in binary. If you don't know how to convert a decimal number to binary, look up float-to-binary conversion.
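
The doubling procedure is easy to automate. Below is a minimal sketch; `fraction_to_binary` is a name invented here. Note that the `double` passed in is itself only an approximation of 0.14, but its leading binary digits agree with the ones worked out by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Write the first n binary digits of x (0 <= x < 1) into out as '0'/'1'
 * characters. out must have room for n + 1 chars. */
void fraction_to_binary(double x, size_t n, char *out)
{
    size_t i;

    assert(x >= 0.0 && x < 1.0);
    for (i = 0; i < n; i++) {
        x *= 2.0;               /* shift one binary place to the left */
        if (x >= 1.0) {
            out[i] = '1';       /* a 1 moved in front of the point */
            x -= 1.0;           /* keep only the fractional part */
        } else {
            out[i] = '0';
        }
    }
    out[n] = '\0';
}
```

Asking for 9 digits of 0.14 gives `001000111`, the same digits as the hand calculation above.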
Add 3: 11.001000111... with exp 0 (3.14 * 2^0).
Now shift it (normalize it) and adjust the exponent accordingly: 1.1001000111... with exp +1 (1.57 * 2^1).
Now you only have to add the bias of 127 to the exponent 1 and store it (i.e. 128 = 1000 0000): 0 1000 0000 1100 1000 111...
Forget the top 1 of the mantissa (which is always supposed to be 1, except for some special values, so it is not stored), and you get: 0 1000 0000 1001 0001 111...
So our value of 3.14 would be represented as something like:

0 10000000 10010001111010111000011
    ^     ^               ^
    |     |               |
    |     |               +--- significand = 1.57000005 (implicit leading 1 + stored fraction 0.57000005)
    |     |
    |     +------------------- exponent = 1 (stored as 128, bias 127)
    |
    +------------------------- sign = 0 (positive)
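
To check the steps above (normalize, add the bias, drop the leading 1), you can pack the three fields back into a float. This is only a sketch: `pack_float` is a hypothetical helper, and it assumes an IEEE 754 32-bit `float`. The fraction bits 10010001111010111000011 are 0x48F5C3 in hex.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a sign bit, a biased exponent and a 23-bit fraction into a float. */
float pack_float(uint32_t sign, uint32_t biased_exp, uint32_t fraction)
{
    uint32_t bits;
    float f;

    assert(sign <= 1 && biased_exp <= 0xFFu && fraction <= 0x7FFFFFu);
    assert(sizeof f == sizeof bits);

    bits = (sign << 31) | (biased_exp << 23) | fraction;
    memcpy(&f, &bits, sizeof f);    /* reinterpret the bits as a float */
    return f;
}
```

`pack_float(0, 1 + 127, 0x48F5C3)` compares equal to `3.14f`.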

The number of bits in the exponent determines the range (the minimum and maximum values you can represent).

Summing up significand

If you add up all the bits in the significand, they don't total exactly 0.57 (the fractional part of 1.57, which is 3.14 / 2). They come out to about 0.57000005.
There aren't quite enough bits to store the value exactly. We can only store an approximation.
The number of bits in the significand determines the precision.
23 stored bits (24 counting the implicit leading 1) give us roughly 6 to 7 decimal digits of precision. 64-bit floating-point types give roughly 15 to 16 digits of precision.
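
You can measure the approximation directly. The sketch below (the name `float_storage_error` is made up for this example) stores a `double` in a `float` and reports how far the stored value drifted. If you just want the guaranteed digit counts, `<float.h>` exposes them as `FLT_DIG` and `DBL_DIG`.

```c
#include <assert.h>

/* Absolute error introduced by storing a double value in a float. */
double float_storage_error(double x)
{
    float stored;
    double diff;

    assert(x == x);                 /* reject NaN */
    stored = (float)x;              /* rounds to the nearest float */
    diff = (double)stored - x;
    return diff < 0.0 ? -diff : diff;
}
```

For 3.14 the error is a little over 0.0000001, which is why only the first 6 to 7 digits can be trusted. A value such as 0.5, which is a power of two, is stored with no error at all.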

Strange, but true

Some values cannot be represented exactly no matter how many bits you use. Just as values like 1/3 cannot be represented in a finite number of decimal digits, values like 1/10 cannot be represented in a finite number of bits.
Since values are approximate, calculations with them are also approximate, and rounding errors accumulate.
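
A small experiment shows the accumulation. This sketch adds 0.1 repeatedly using `double` arithmetic; `sum_tenths` is a name chosen for this example, and the `volatile` qualifier is only there to discourage the compiler from keeping the sum in a wider register.

```c
#include <assert.h>

/* Add 0.1 to itself n times using double arithmetic. */
double sum_tenths(int n)
{
    volatile double sum = 0.0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        sum += 0.1;             /* each addition rounds the result */
    return sum;
}
```

On a typical IEEE 754 platform, `sum_tenths(10)` comes out slightly below 1.0, so comparing it to 1.0 with `==` fails.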

Let’s see things working

#include <stdio.h>
#include <string.h>

/* Print binary stored in plain 32 bit block */
void intToBinary(unsigned int n)
{
        int c, k;

        for (c = 31; c >= 0; c--)
        {
                k = n >> c;
                if (k & 1)  printf("1");
                else        printf("0");
        }
        printf("\n");
}

int main(void)
{
        unsigned int m;
        float f = 3.14f;

        /* See hex representation */
        printf("f = %a\n", f);

        /* Copy memory representation of float to plain 32 bit block */
        memcpy(&m, &f, sizeof (m));
        intToBinary(m);

        return 0;
}

This C code prints the binary representation of the float on the console.

f = 0x3.23d70cp+0
01000000010010001111010111000011

Where is the decimal point stored?

The decimal point is not explicitly stored anywhere.
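
One way to see this: the position of the point is implied by the exponent. The sketch below rebuilds a value from its raw fields by starting at 1.fraction and then moving the binary point, which is the same as multiplying or dividing by 2. `value_from_fields` is a hypothetical helper and only handles normal numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild the value of a normal single-precision number from its raw fields. */
double value_from_fields(uint32_t sign, uint32_t biased_exp, uint32_t fraction)
{
    double significand;
    int exponent, i;

    assert(sign <= 1 && fraction <= 0x7FFFFFu);
    assert(biased_exp >= 1 && biased_exp <= 254);   /* normal numbers only */

    significand = 1.0 + (double)fraction / 8388608.0;  /* 8388608 = 2^23 */
    exponent = (int)biased_exp - 127;                   /* remove the bias */

    /* Move the binary point right (positive exponent) or left (negative). */
    for (i = 0; i < exponent; i++)
        significand *= 2.0;
    for (i = 0; i > exponent; i--)
        significand /= 2.0;

    return sign ? -significand : significand;
}
```

For the fields of 3.14 (sign 0, stored exponent 128, fraction 0x48F5C3), this returns the same value as `(double)3.14f`.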
