IEEE 754

Syed Saniya
3 min read · Jan 10, 2023


The IEEE 754 standard describes floating-point formats: a way to represent real numbers in hardware. It was first published in 1985 by the Institute of Electrical and Electronics Engineers (IEEE).

An IEEE 754 format is a “set of representations of numerical values and symbols”. A format may also include how the set is encoded.

Why is floating-point representation used?

Floating point representation makes numerical computation much easier. You could write all your programs using integers or fixed-point representations, but this is tedious and error-prone.

A floating-point format is specified by a base (also called radix) b, which is either 2 (binary) or 10 (decimal) in IEEE 754.

What are the two types of floating point?

There are two floating point primitive types. Data type float is sometimes called “single-precision floating point”. Data type double has twice as many bits and is sometimes called “double-precision floating point”.
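The difference in precision between the two types is easy to observe. As a sketch in Python (whose built-in `float` is a double), round-tripping a value through the `struct` module's 32-bit `'f'` format shows how much precision a single-precision float keeps:

```python
import struct

# Python's float is a 64-bit IEEE 754 double. Packing through the 'f'
# format rounds the value to the nearest 32-bit float, so a round trip
# reveals the single-precision approximation of the same number.
as_float32 = struct.unpack('f', struct.pack('f', 0.1))[0]

print(f"{0.1:.20f}")         # 0.1 stored as a double
print(f"{as_float32:.20f}")  # 0.1 after a float32 round trip
```

Neither representation is exactly 0.1, but the double keeps roughly twice as many correct digits as the float.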

There are several ways to represent floating-point numbers, but IEEE 754 is the most widely used. An IEEE 754 number has 3 basic components:

  1. The Sign of the Mantissa –
    This is as simple as the name: 0 represents a positive number, while 1 represents a negative number.
  2. The Biased Exponent –
    The exponent field needs to represent both positive and negative exponents, so a bias is added to the actual exponent to obtain the stored exponent.
  3. The Normalised Mantissa –
    The mantissa is the part of a number in scientific notation (or of a floating-point number) that holds its significant digits. In binary we have only 2 digits, 0 and 1, so a normalised mantissa is one with exactly a single 1 to the left of the binary point.

Based on these three components, IEEE 754 numbers come in two common formats: single precision and double precision.

1. Single Precision: Single Precision is a format proposed by IEEE for the representation of floating-point numbers. It occupies 32 bits in computer memory.

(Figure: single-precision bit layout — 1 sign bit, 8 exponent bits, 23 mantissa bits)

2. Double Precision: Double Precision is also a format given by IEEE for the representation of the floating-point number. It occupies 64 bits in computer memory.

(Figure: double-precision bit layout — 1 sign bit, 11 exponent bits, 52 mantissa bits)

Difference between Single and Double Precision:

SINGLE PRECISION -

In single precision, 32 bits are used to represent a floating-point number.

This format, also known as FP32, is suitable for calculations that won’t be adversely affected by some approximation.

It uses 8 bits for the exponent and 23 bits for the mantissa. The exponent bias is 127.

Range of normal numbers in single precision: 2^(-126) up to (2 - 2^(-23)) × 2^(127), i.e. roughly 1.18 × 10^(-38) to 3.4 × 10^(38).
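These range limits follow directly from the 8-bit biased exponent (stored values 1 to 254 map to actual exponents -126 to +127) and can be checked against a real 32-bit encoding, as this sketch does:

```python
import struct

# Smallest positive normal and largest finite binary32 values, derived
# from the 8-bit exponent (bias 127) and the 23-bit mantissa.
smallest_normal = 2.0 ** -126                  # stored exponent field = 1
largest_finite = (2 - 2 ** -23) * 2.0 ** 127   # field = 254, all mantissa bits set

# Both survive a round trip through an actual 32-bit float unchanged,
# confirming they are exactly representable in single precision.
for value in (smallest_normal, largest_finite):
    assert struct.unpack('f', struct.pack('f', value))[0] == value

print(smallest_normal)  # ~1.18e-38
print(largest_finite)   # ~3.40e+38
```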

This is used where precision matters less and a wide range of values suffices, for example in graphics and games.

The standard calls this format binary32. It requires fewer resources than double precision.

It is less expensive.

DOUBLE PRECISION-

In double precision, 64 bits are used to represent a floating-point number.

This format, often known as FP64, is suitable to represent values that need a wider range or more exact computations.

It uses 11 bits for the exponent and 52 bits for the mantissa. The exponent bias is 1023.

Range of normal numbers in double precision: 2^(-1022) up to (2 - 2^(-52)) × 2^(1023), i.e. roughly 2.2 × 10^(-308) to 1.8 × 10^(308).
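Because Python's built-in `float` is itself a binary64 double, the standard library exposes these exact limits, which gives a quick way to confirm the figures above:

```python
import sys

# sys.float_info describes the IEEE 754 binary64 format that
# Python's float uses, so its limits match the double-precision spec.
info = sys.float_info

print(info.min)       # smallest positive normal double: 2**-1022
print(info.max)       # largest finite double: (2 - 2**-52) * 2**1023
print(info.mant_dig)  # 53 significand bits: 52 stored plus the implicit 1

assert info.min == 2.0 ** -1022
assert info.max == (2 - 2 ** -52) * 2.0 ** 1023
```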

This is used where precision matters more and approximation error must be minimised, for example in scientific computing.

This is called binary64. It provides more accurate results but at the cost of greater computational power, memory space, and data transfer.

The cost incurred by this format does not always justify its use for every computation.
