Floating-Point Rounding Problem

Asinshani Taniya
Oct 5, 2021


Computers cannot understand words and numbers the way humans do. When we input words and numbers into a computer, it converts them into binary code according to defined standards.

If you ask the computer to perform some task, it carries out the work in this binary form. Once the task is finished, the computer converts the results back into a form humans can understand.

When we talk about floating-point numbers, the theory is the same.

Floating-Point Numbers

Numbers that contain decimal points are called floating-point numbers. They can be negative or positive.

For example: 3.5, 7.89, -9.44, 345.78 and -99.01.

IEEE 754 Standard

If we enter a floating-point number into a computer, the computer converts it into a binary format according to the standard called IEEE 754.

The IEEE 754 standard represents a floating-point number in three sections: the sign bit, the exponent and the mantissa.

IEEE 754 Floating-point Representation

The size of each section is defined by the precision, which is the total size of the floating-point number: 32 bits for single precision and 64 bits for double precision.

Consider the floating-point number,

2.3

Its IEEE 754 binary representation in single precision is,

01000000000100110011001100110011

If we convert this back to a decimal number using an IEEE 754 binary-to-decimal calculator, we get,

2.2999999523162841796875

You see, we did not get the exact number 2.3 back; there is an error.
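We can see this in Java as well (the language the article comes back to later). A minimal sketch, assuming a single-precision float literal; the class name is just for illustration:

    import java.math.BigDecimal;

    public class ExactValueOfFloat {
        public static void main(String[] args) {
            // Widening 2.3f to double is exact, and the BigDecimal(double)
            // constructor prints the exact value that was actually stored.
            System.out.println(new BigDecimal((double) 2.3f));
            // prints 2.2999999523162841796875
        }
    }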

Consider a negative floating-point number as well,

-9.1

Its IEEE 754 binary representation in single precision is,

11000001000100011001100110011010

If we convert this back to a decimal number using an IEEE 754 binary-to-decimal calculator, we get,

-9.1000003814697265625
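If you want to confirm these bit patterns yourself, here is a small sketch using Java's standard Float.floatToIntBits method (the class name is illustrative):

    public class BitPatterns {
        public static void main(String[] args) {
            // floatToIntBits returns the raw IEEE 754 single-precision bit pattern
            int bits23 = Float.floatToIntBits(2.3f);
            int bitsMinus91 = Float.floatToIntBits(-9.1f);
            // Pad to 32 characters because toBinaryString drops leading zeros
            System.out.println(String.format("%32s",
                    Integer.toBinaryString(bits23)).replace(' ', '0'));
            // 01000000000100110011001100110011
            System.out.println(String.format("%32s",
                    Integer.toBinaryString(bitsMinus91)).replace(' ', '0'));
            // 11000001000100011001100110011010
        }
    }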

Okay, this is the impact of the floating-point rounding problem. Let's see where this issue comes from. For that, we have to look at how a floating-point number is converted to an IEEE 754 standard binary representation.

Convert Floating-Point Number to IEEE 754 Standard Binary Representation

Consider 2.3 in single precision,

IEEE 754 Binary Representation for Single Precision

Write as a binary number:

2.3 → 10.01001100110011001100110011001100110011001100…

Write in scientific notation:

2.3 → 1.001001100110011001100110011001100110011001100… x 2¹

Scientific Notation
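If you want to see where that repeating binary pattern comes from, here is a small sketch that expands the fractional part 0.3 into binary digits by repeated doubling. BigDecimal is used only so the doubling itself stays exact; the class name is illustrative:

    import java.math.BigDecimal;

    public class FractionToBinary {
        public static void main(String[] args) {
            BigDecimal fraction = new BigDecimal("0.3");
            BigDecimal two = new BigDecimal("2");
            StringBuilder bits = new StringBuilder("10.");  // integer part of 2.3 is 10 in binary
            for (int i = 0; i < 40; i++) {
                // Double the fraction; the integer part that appears is the next binary digit
                fraction = fraction.multiply(two);
                if (fraction.compareTo(BigDecimal.ONE) >= 0) {
                    bits.append('1');
                    fraction = fraction.subtract(BigDecimal.ONE);
                } else {
                    bits.append('0');
                }
            }
            System.out.println(bits);
            // prints 10.0100110011001100110011001100110011001100
        }
    }

Because the remainder keeps cycling between the same values, the pattern 0011 repeats forever; 0.3 has no finite binary representation.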

Using the scientific notation above, we are now going to find the sign bit, exponent and mantissa.

Sign Bit

If the number is positive, the sign bit is 0; for a negative number it is 1. So here,

Sign Bit = 0

Exponent

To represent the exponent, we add the exponent bias to the exponent of the scientific notation and convert the final value into a binary number. For single precision, the exponent bias is equal to 127.

Exponent of scientific notation = 1

Exponent in IEEE 754 in decimal = 127 + 1 = 128

Exponent = 10000000
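As a quick check, a couple of lines of Java reproduce this biasing step (class name is illustrative):

    public class ExponentField {
        public static void main(String[] args) {
            int exponentOfScientificNotation = 1;
            int bias = 127;  // single-precision exponent bias
            // Biased value that gets stored in the 8-bit exponent field
            System.out.println(Integer.toBinaryString(exponentOfScientificNotation + bias));
            // prints 10000000
        }
    }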

Mantissa

To represent the mantissa, we consider the fraction (significand) part of the scientific notation. We only take the digits after the binary point, which means,

001001100110011001100110011001100110011001100…

This does not end; it repeats forever. But the standard stores only 23 bits, so we have to round this long fraction to 23 bits.

First, take the first 23 bits of the fraction, then look at the 24th bit. If it is 1, we add 1 to the 23-bit value; if it is 0, we add nothing. The final result is the mantissa.

First-23-bits: 00100110011001100110011 | 0011001100110011001100

24th-bit: 00100110011001100110011 | 0 | 011001100110011001100…

Mantissa = 00100110011001100110011 + 0

Mantissa = 00100110011001100110011
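Here is a small sketch of this rounding step, following the simple rule used above (add 1 when the 24th bit is 1). The standard's default rounding is actually round-to-nearest-even, but for this example both rules give the same mantissa. Class name is illustrative:

    import java.math.BigInteger;

    public class RoundMantissa {
        public static void main(String[] args) {
            // Fraction bits of 1.001001100110011... (2.3), written out past 23 bits
            String fraction = "001001100110011001100110011001100110011";
            String first23 = fraction.substring(0, 23);
            char bit24 = fraction.charAt(23);
            // Add 1 to the 23-bit value when the 24th bit is 1
            BigInteger rounded = new BigInteger(first23, 2)
                    .add(bit24 == '1' ? BigInteger.ONE : BigInteger.ZERO);
            // Pad back to 23 bits
            String mantissa = String.format("%23s", rounded.toString(2)).replace(' ', '0');
            System.out.println(mantissa);  // 00100110011001100110011
        }
    }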

So finally we have the sign bit, exponent and mantissa. By concatenating these three we get the IEEE 754 standard binary representation.

Sign Bit = 0

Exponent = 10000000

Mantissa = 00100110011001100110011

The IEEE 754 standard binary representation for 2.3 in single precision is,

01000000000100110011001100110011
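A minimal sketch to confirm the result: concatenate the three parts and compare them with the bit pattern the JVM stores for 2.3f (class name is illustrative):

    public class AssembleBits {
        public static void main(String[] args) {
            String sign = "0";
            String exponent = "10000000";
            String mantissa = "00100110011001100110011";
            String assembled = sign + exponent + mantissa;
            // Bit pattern the JVM actually stores for 2.3f
            String stored = String.format("%32s",
                    Integer.toBinaryString(Float.floatToIntBits(2.3f))).replace(' ', '0');
            System.out.println(assembled);                // 01000000000100110011001100110011
            System.out.println(assembled.equals(stored)); // true
        }
    }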

Note: The same steps apply for double precision and long double precision as well; only the field sizes and the exponent bias change.

The Rounding Error

If you went through the previous steps carefully, you may have noticed that we had to round the long fraction to just 23 bits, applying a rounding rule in the process.

This is called the floating-point rounding error.

When we convert this back into a floating-point number, we will not get the exact number we started with; we get a nearby number with some error.

When we use double or long double precision, we can shrink the error by a few more decimal places, but we still will not get the exact number back.

You may feel it is just a small error, but it can lead to significant errors when performing some tasks.

As a solution in the Java language, we can use the BigDecimal class instead of double and float for sensitive applications.
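For example, the classic 0.1 + 0.2 case shows the difference; this sketch compares plain double arithmetic with BigDecimal values built from strings (class name is illustrative):

    import java.math.BigDecimal;

    public class MoneyExample {
        public static void main(String[] args) {
            // double accumulates binary rounding error
            System.out.println(0.1 + 0.2);   // 0.30000000000000004
            // BigDecimal built from strings keeps exact decimal values
            BigDecimal a = new BigDecimal("0.1");
            BigDecimal b = new BigDecimal("0.2");
            System.out.println(a.add(b));    // 0.3
        }
    }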

Convert IEEE 754 standard binary representation to Decimal Number

This part is not actually necessary for understanding the rounding error. Anyhow, to round off the topic, I will go through it as well.

Here we are going to learn, step by step, how to convert an IEEE 754 standard binary representation back into a decimal floating-point number.

However, when we perform this by hand, the transformation gives a slightly different answer than an IEEE 754 binary-to-decimal calculator, because of the rounding we have to perform in the middle of the calculations.

But no worries, the answer is quite near to the exact number.

We already have the IEEE 754 standard binary representation for 2.3 in single precision,

01000000000100110011001100110011

Let's start with that.

First, you have to know the precision: count the number of bits. Since this example has 32 bits, it is single precision.

Next, according to the precision, divide the representation into the sign bit, exponent and mantissa.

Sign Bit: 0

Exponent: 10000000

Mantissa: 00100110011001100110011
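Splitting the 32-bit string is straightforward; a minimal sketch (class name is illustrative):

    public class SplitFields {
        public static void main(String[] args) {
            String bits = "01000000000100110011001100110011";   // 32 bits -> single precision
            String sign = bits.substring(0, 1);      // 0
            String exponent = bits.substring(1, 9);  // 10000000
            String mantissa = bits.substring(9);     // 00100110011001100110011
            System.out.println(sign + " | " + exponent + " | " + mantissa);
        }
    }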

Then, we work through them one by one.

Sign Bit

Sign Bit: 0

If the sign bit is 0, the number is positive; if it is 1, the number is negative. So here,

Sign = + (positive)

Exponent

Exponent = 10000000

Converting this into decimal, Exponent = 128

To get the value of the exponent, we subtract the exponent bias from the decimal value of the exponent. Since this is single precision, the exponent bias is equal to 127.

Exponent of scientific notation = 128-127

Exponent = 1

Note: The base is 2, so 2¹.

Mantissa

Mantissa: 00100110011001100110011

The mantissa gives the value of the fraction. To get it,

multiply the nth bit by 2 to the power -n and add up all those values. For example, take the 1st bit and multiply it by 2 to the power -1, then the 2nd bit by 2 to the power -2, and so on up to the 23rd bit. Finally, add all the values to get the sum.

Note: We can skip the 0 bits here, but be careful not to mix up the powers.

You get the answer 0.150000333786011 (the exact digits depend on how much you round along the way).

Fraction = 0.150000333786011
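Here is a sketch of that bit-by-bit summation using BigDecimal, so no rounding happens in the middle. It gives 0.14999997615814208984375, slightly different from the hand-rounded figure above; this is exactly the mid-calculation rounding mentioned earlier. Class name is illustrative:

    import java.math.BigDecimal;

    public class MantissaValue {
        public static void main(String[] args) {
            String mantissa = "00100110011001100110011";
            BigDecimal half = new BigDecimal("0.5");
            BigDecimal sum = BigDecimal.ZERO;
            BigDecimal weight = half;                 // 2^-1 for the 1st bit
            for (char bit : mantissa.toCharArray()) {
                if (bit == '1') {
                    sum = sum.add(weight);            // nth bit contributes 2^-n
                }
                weight = weight.multiply(half);       // next bit weighs half as much
            }
            System.out.println(sum);  // 0.14999997615814208984375
        }
    }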

Now, we have

Sign = + (positive)

Exponent = 1

Fraction = 0.150000333786011

When we converted the floating-point number into its IEEE 754 standard binary representation, we dropped the implicit 1 that sat before the fraction, i.e. before the binary point. Here, we bring it back.

So we have the scientific notation,

+1.150000333786011 x 2¹

The number is: 1.150000333786011 x 2 = 2.300000667572022
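Putting the whole decoding together without any intermediate rounding recovers the calculator's value from the beginning of the article. A sketch, assuming the single-precision layout used throughout (class name is illustrative):

    import java.math.BigDecimal;

    public class DecodeBack {
        public static void main(String[] args) {
            String bits = "01000000000100110011001100110011";
            int sign = bits.charAt(0) == '0' ? 1 : -1;
            // Unbias the 8-bit exponent field (bias 127 for single precision)
            int exponent = Integer.parseInt(bits.substring(1, 9), 2) - 127;
            // Sum the mantissa bits exactly: the nth bit contributes 2^-n
            BigDecimal half = new BigDecimal("0.5");
            BigDecimal fraction = BigDecimal.ZERO;
            BigDecimal weight = half;
            for (char bit : bits.substring(9).toCharArray()) {
                if (bit == '1') fraction = fraction.add(weight);
                weight = weight.multiply(half);
            }
            // (1 + fraction) x 2^exponent, with the sign applied
            // (the exponent is +1 here; a negative exponent would need a division instead)
            BigDecimal value = BigDecimal.ONE.add(fraction)
                    .multiply(new BigDecimal(2).pow(exponent))
                    .multiply(new BigDecimal(sign));
            System.out.println(value.stripTrailingZeros());
            // prints 2.2999999523162841796875
        }
    }

Note that 2.2999999523162841796875 matches the calculator result shown at the start, while the hand calculation above gave 2.300000667572022 because of the rounding performed along the way.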
