Floating Point Representation

Rukshani Athapathu

Published in

Coder's Corner

May 22, 2018

Numbers with fractional parts that can be written in the form

$$x \times 2^{y}$$

where x and y are integers, can be represented as floating point numbers in computers.
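As a rough illustration, Python can show the exact fraction behind a stored float through `float.as_integer_ratio()`. The sketch below assumes Python floats are IEEE double precision values (true on common platforms); the helper name `is_power_of_two` is made up for this example.

```python
# Every finite float is stored as an integer divided by a power of two.
def is_power_of_two(n: int) -> bool:
    """Return True when n is a positive power of two (1, 2, 4, ...)."""
    return n > 0 and n & (n - 1) == 0

for value in (4.40625, 0.1):
    numerator, denominator = value.as_integer_ratio()
    print(value, "=", numerator, "/", denominator,
          "| power-of-two denominator:", is_power_of_two(denominator))
```

4.40625 comes back as exactly 141/32. For 0.1 the denominator is still a power of two, which means the stored value is only the nearest representable fraction, not exactly one tenth.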

These numbers are called floating point numbers because the binary point is not fixed. Until about the 1980s, different computer manufacturers used different formats for representing floating point numbers. Since the introduction of the IEEE 754 standard, almost all computers follow it, which has greatly improved the portability of floating point data.

IEEE Floating Point Representation

The IEEE standard defines three formats for representing floating point numbers:

Single Precision (32 bits)
Double Precision (64 bits)
Extended Precision (80 bits)

With this standard, floating point numbers are represented in the form

$$(-1)^{s} \times 1.F \times 2^{E}$$

Here s represents the sign of the number: when s = 1 the number is negative, and when s = 0 it is positive. F represents the fraction (also called the mantissa) and E is the exponent.

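The structure of the two most commonly used formats is shown below.

Single Precision (32-bit)

| sign (1 bit) | exponent (8 bits) | fraction (23 bits) |

Double Precision (64-bit)

| sign (1 bit) | exponent (11 bits) | fraction (52 bits) |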
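To make the two layouts concrete, here is a minimal Python sketch that splits a number into its sign, exponent and fraction fields. It assumes the usual field widths of 1/8/23 bits for single precision and 1/11/52 bits for double precision; the function name `split_fields` is invented for this example.

```python
import struct

def split_fields(value: float, double: bool = False):
    """Split a float into (sign, exponent, fraction) bit strings.

    Uses 1/8/23 bits for single precision and 1/11/52 bits for double.
    """
    fmt, exp_bits, frac_bits = (">d", 11, 52) if double else (">f", 8, 23)
    # Pack the value big-endian so the sign bit comes first.
    raw = int.from_bytes(struct.pack(fmt, value), "big")
    bits = format(raw, "0{}b".format(1 + exp_bits + frac_bits))
    return bits[0], bits[1:1 + exp_bits], bits[1 + exp_bits:]

print(split_fields(1.0))               # single precision fields
print(split_fields(1.0, double=True))  # double precision fields
```

Calling it on a few values of your own is a quick way to check a conversion done by hand.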

Now let’s see how to convert a given decimal number to its floating point binary representation. Let’s take -4.40625 as an example.

Step 1:

First, convert the integral part, which is 4, to binary.

$$4_{10} = 100_{2}$$

Step 2:

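Then multiply the fractional part repeatedly by 2 and, at each step, take the bit that appears to the left of the binary point, carrying only the remaining fraction into the next step. Reading those bits from first to last gives the binary representation of the fractional part. For example,

0.40625 × 2 = 0.8125 → 0
0.8125 × 2 = 1.625 → 1
0.625 × 2 = 1.25 → 1
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1

so 0.40625 in decimal is 0.01101 in binary, and 4.40625 is 100.01101.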
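The repeated doubling in Steps 1 and 2 can be sketched in a few lines of Python. The function name `fraction_to_binary` and the 23-bit cutoff are choices made for this illustration, not part of any standard API.

```python
def fraction_to_binary(fraction: float, max_bits: int = 23) -> str:
    """Convert a fraction in [0, 1) to a string of binary digits."""
    bits = []
    while fraction and len(bits) < max_bits:
        fraction *= 2
        bit = int(fraction)      # the digit left of the point: 0 or 1
        bits.append(str(bit))
        fraction -= bit          # keep only the fractional part
    return "".join(bits)

print(format(4, "b"))               # integral part of 4.40625
print(fraction_to_binary(0.40625))  # fractional part of 4.40625
```

The `max_bits` limit matters for fractions such as 0.1, whose binary expansion never terminates; the loop simply stops once the field would be full.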

Step 3:

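Now we need to normalize the number by moving the binary point so that it takes the form

$$1.F \times 2^{E}$$

For our example, moving the binary point two places to the left gives

$$100.01101_{2} = 1.0001101_{2} \times 2^{2}$$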

One very important thing to remember here is that the leading 1 bit does not need to be stored, since it is implied. That is a clever trick used by the standard to gain one extra bit for the fractional part. So for the fraction field (23 or 52 bits) we only need to store 0001101 in this case and fill the remaining bits to its right with 0s.
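Normalization can also be checked in Python. The sketch below leans on `math.frexp`, which returns a mantissa in the range [0.5, 1), so the result is shifted by one position to get the 1.F form. The function name `normalize` is hypothetical, and the code assumes a non-zero value whose fraction fits in 23 bits; it truncates instead of rounding.

```python
import math

def normalize(value: float):
    """Return (E, F) such that abs(value) == 1.F * 2**E, F as 23 bits."""
    mantissa, exponent = math.frexp(abs(value))  # mantissa in [0.5, 1)
    significand = mantissa * 2                   # shift into [1, 2)
    exponent -= 1
    # Drop the implied leading 1 and keep the next 23 bits.
    fraction = int((significand - 1) * 2**23)
    return exponent, format(fraction, "023b")

print(normalize(-4.40625))
```

For -4.40625 this yields an exponent of 2 and a fraction field that starts with 0001101 and is padded with zeros.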

Step 4:

The exponent is stored as an integer in biased form. If the exponent field has k bits, then the bias equals

$$\text{bias} = 2^{k-1} - 1$$

Add this bias to the exponent and place the result in the exponent field. With single precision k = 8, so the bias is 127. The exponent of our normalized number is 2, so the exponent field in this example equals

$$2 + 127 = 129_{10} = 10000001_{2}$$
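The bias arithmetic is small enough to verify directly. In this sketch the function names `bias` and `biased_exponent` are made up for illustration, and k = 11 is included as the exponent width of double precision.

```python
def bias(k: int) -> int:
    """Bias for an exponent field that is k bits wide."""
    return 2 ** (k - 1) - 1

def biased_exponent(exponent: int, k: int) -> str:
    """Add the bias and return the stored exponent field as k bits."""
    return format(exponent + bias(k), "0{}b".format(k))

print(bias(8), bias(11))      # single and double precision bias
print(biased_exponent(2, 8))  # exponent field for -4.40625
```

Storing the exponent with a bias keeps the field unsigned, so negative exponents need no separate sign bit.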

Finally the sign bit is set according to the original sign of the number. That is 1 for negative and 0 for positive.

So the decimal value -4.40625 can be represented in single precision binary form as

1 10000001 00011010000000000000000

(sign, exponent and fraction, in that order).
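As a sanity check, the three fields can be glued together and reinterpreted as a float with Python's `struct` module. This is only a sketch for the single precision case; the variable names are arbitrary.

```python
import struct

sign = "1"                        # negative number
exponent = "10000001"             # 2 + 127 = 129
fraction = "0001101" + "0" * 16   # 23 bits, implied leading 1 dropped
bits = sign + exponent + fraction

# Reinterpret the 32-bit pattern as a big-endian single-precision float.
value = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]
print(bits, hex(int(bits, 2)), value)
```

If the hand conversion is right, the decoded value is -4.40625 again.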

Special Cases — Infinity and NaN

Infinity: When the exponent bits are all ones and the fraction bits are all zeros, the resulting value represents infinity. The sign bit distinguishes positive infinity from negative infinity.

NaN: When the exponent bits are all ones but the fraction is non-zero, the resulting value is NaN, which is short for Not a Number. You get this value when you perform invalid operations such as dividing zero by zero or subtracting infinity from infinity.
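These special bit patterns can be built by hand in Python as well. The sketch below uses single precision patterns; the helper name `from_bits` is invented, and 0x7FC00000 is just one of many possible NaN patterns.

```python
import math
import struct

def from_bits(pattern: int) -> float:
    """Reinterpret a 32-bit integer as a single-precision float."""
    return struct.unpack(">f", pattern.to_bytes(4, "big"))[0]

pos_inf = from_bits(0x7F800000)   # exponent all ones, fraction zero
neg_inf = from_bits(0xFF800000)   # same pattern with the sign bit set
nan = from_bits(0x7FC00000)       # exponent all ones, fraction non-zero

print(pos_inf, neg_inf, nan)
print(pos_inf - pos_inf)          # an invalid operation yields NaN
print(nan == nan)                 # NaN compares unequal to itself
```

One Python-specific caveat: dividing the float 0.0 by 0.0 raises `ZeroDivisionError` rather than returning NaN, which is why the example uses infinity minus infinity.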