Understanding the Mathematics Behind Floating-Point Precision

Prabhu Raghav
DecisionFacts

--

Introduction

Deep learning and Transformer models rely heavily on floating-point numbers for weights, gradients, and activation functions. Floating-point representations and formats such as Single Precision (FP32) and Half Precision (FP16) play a crucial role in Machine Learning (ML) and Large Language Models (LLMs) by balancing computational efficiency and numerical precision.

Double Precision (FP64) is not commonly used!

Below are some of the key areas where floating point precisions are commonly used.

Model Training

During the training of neural networks, floating-point precision is used in the following areas.

  • Gradient Descent: During training, gradient descent algorithms compute gradients of the loss function with respect to the model parameters. These gradients are typically computed using floating-point arithmetic.
  • Backpropagation: Backpropagation algorithms propagate gradients backward through neural network layers to update weights. Floating-point arithmetic is used extensively in this process.
  • Activation Functions: Activation functions like sigmoid, tanh, and ReLU involve floating-point operations in forward and backward passes.

Most deep learning frameworks, such as TensorFlow and PyTorch, default to FP32 precision for training. However, FP16 precision is becoming increasingly popular due to its reduced memory footprint and faster computation, especially for large-scale models and distributed training.
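To make this concrete, here is a minimal mixed-precision training sketch in PyTorch using torch.cuda.amp; the model, data, and hyperparameters below are toy placeholders rather than anything from a specific project, and a CUDA GPU is assumed.

```python
# Minimal mixed-precision sketch (assumes a CUDA GPU; model/data are toy placeholders).
import torch
import torch.nn as nn

model = nn.Linear(1024, 10).cuda()                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                    # rescales the loss to avoid FP16 gradient underflow
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                                 # toy training loop with random data
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                     # runs eligible ops in FP16, keeps sensitive ops in FP32
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                       # backward pass on the scaled loss
    scaler.step(optimizer)                              # unscales gradients, then applies the update
    scaler.update()
```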

Model Inference

Inference, or the process of using a trained model to make predictions on new data, also relies on floating-point precision. Inference can be performed using FP32, FP16, or even lower precision (e.g., INT8) depending on the hardware platform and the specific requirements of the application. Lower precision inference is often preferred for deployment on edge devices or in real-time applications where computational resources are limited.

  • Prediction: During inference, trained models make predictions on new data. This involves performing matrix multiplications, convolutions, and other operations using floating-point arithmetic.
  • Neural Network Layers: Fully connected layers, convolutional layers, and recurrent layers rely on floating point computations for forward and backward passes.

Quantization

  • Inference: After training, models can be quantized to lower precision formats like FP16 or INT8 to improve efficiency and reduce memory usage during inference.
  • Dynamic Quantization: Dynamic quantization techniques adaptively adjust the precision of model weights and activations during inference based on data statistics (see the sketch below).
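As a quick sketch of what dynamic quantization looks like in practice, the snippet below converts the Linear layers of a toy placeholder model to INT8 weights with PyTorch's quantize_dynamic; it is illustrative only, not a recipe for any particular model.

```python
# Post-training dynamic quantization sketch: Linear weights become INT8,
# activations are quantized on the fly at inference time.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))  # placeholder model

quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized_model(x).shape)   # torch.Size([1, 10]); Linear layers now use INT8 weights
```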

Hardware Acceleration

  • GPU and TPU Computing: Graphics processing units (GPUs) and tensor processing units (TPUs) are optimized for floating-point computations and are widely used for accelerating AI and ML workloads.

Memory Optimisation

  • Model Compression: Lowering precision (e.g., from FP32 to FP16) reduces memory requirements, enabling larger models to fit into memory or reducing memory bandwidth constraints.
  • Sparse Representations: Techniques such as quantization-aware training, weight optimization methods, and model-parameter regularization help reduce memory usage and computational load.

Embedded and Edge Devices

  • Mobile Devices: Mobile phones and other embedded devices often have limited computational resources and battery life. Lower precision floating-point formats like FP16 are used to enable efficient inference on these devices.
  • Edge Computing: Edge devices deployed in IoT (Internet of Things) applications use floating-point operations for on-device AI processing, enabling real-time decision-making without relying on cloud services.
Fig: 1 — Model Size Calculations
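As a back-of-the-envelope version of the model-size arithmetic behind Fig. 1, the sketch below estimates how much memory the weights of a hypothetical 7-billion-parameter model would occupy at different precisions (the parameter count is purely illustrative).

```python
# Rough model-size arithmetic: parameters x bytes-per-parameter.
params = 7_000_000_000                      # hypothetical 7B-parameter model
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}

for fmt, nbytes in bytes_per_param.items():
    size_gib = params * nbytes / 1024**3
    print(f"{fmt:>10}: {size_gib:5.1f} GiB")
# FP32 ~26.1 GiB, FP16/BF16 ~13.0 GiB, INT8 ~6.5 GiB
```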

We now understand the significant need for floating-point precision in AI/ML models: training, inference, quantization, hardware acceleration, memory optimization, and deployment across various platforms and devices, including edge computing!

Let’s see how floating-point representation works.

As we know, machines understand only 0s and 1s, i.e., binary. Hence, floating-point precision is a way of representing numbers in binary. The most common floating-point precision formats are Half precision (FP16), Single precision (FP32), and Double precision (FP64).

Floating-point representation follows the IEEE 754 standard, which is widely adopted for floating-point arithmetic in computer systems.

Floating Point Representation

In floating point representations, numbers are stored in a combination of three parts:

  • The sign
  • The exponent
  • The mantissa (or significand)
Fig:2 — Floating Point Representation — The Sign, Exponent & Mantissa / Significand

Consider the floating-point number 5.625 and break down its binary representation.

Decimal Number: 5.625

Step 1: Convert the Integer Part into binary.

The integer part of 5.625 is 5, whose binary value is 101.

Step 2: Convert the Fractional Part into binary.

Multiply the fractional part by 2 repeatedly and note the integer parts of each multiplication until the fractional part becomes 0.

  • 0.625 * 2 = 1.25 (integer part = 1, fractional part = 0.25)
  • 0.25 * 2 = 0.5 (integer part = 0, fractional part = 0.5)
  • 0.5 * 2 = 1.0 (integer part = 1, fractional part = 0)

Reading the integer parts obtained in each step from top to bottom gives the fractional bits: 101

Step 3: Combine Integer and Fractional parts.

Combine the binary representations of the integer and fractional parts.

Hence, 5.625 in binary is 101.101.
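The same two steps can be automated with a small helper that mirrors the manual conversion above; the function name and the 16-bit cap on fractional digits are arbitrary choices for this sketch.

```python
# Integer part via bin(); fractional part by repeated multiplication by 2.
def to_binary(value: float, max_frac_bits: int = 16) -> str:
    integer, fraction = divmod(value, 1.0)
    int_bits = bin(int(integer))[2:]
    frac_bits = ""
    while fraction and len(frac_bits) < max_frac_bits:
        fraction *= 2
        bit, fraction = divmod(fraction, 1.0)   # integer part becomes the next bit
        frac_bits += str(int(bit))
    return int_bits + ("." + frac_bits if frac_bits else "")

print(to_binary(5.625))    # -> 101.101
print(to_binary(0.15625))  # -> 0.00101 (used later in this post)
```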

Normalization

The radix (binary) point in a number can be moved to the left or to the right, and the exponent records how many positions it was moved.

For example, the binary value 101.101 can equally be written as 1.01101 × 2², 10.1101 × 2¹, or 1011.01 × 2⁻¹; all of these represent the same number.

So, without a convention, there are no restrictions/controls on where the radix point sits!

How do we streamline these radix positions?

To streamline the decimal or radix point positions, we use Normalization. It has two types.

  • Explicit Normalization
  • Implicit Normalization

Using either normalization type, we define a consistent rule for where to place the radix point.

Explicit Normalization

This explicit normalization type moves the radix point to the LHS of the most significant ‘1’ in the bit sequence.

Fig:3 — Explicit Normalization

Implicit Normalization (Default)

This implicit normalization type moves the radix point to the RHS of the most significant ‘1’ in the bit sequence.

Fig:4 — Implicit Normalization
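As a quick sanity check, Python's math.frexp exposes an "explicit style" normalized form (mantissa in [0.5, 1)); shifting it by one bit gives the implicit 1.xxx form that IEEE 754 formats store.

```python
# frexp returns (m, e) with value = m * 2**e and 0.5 <= m < 1
# (radix point to the left of the most significant 1).
import math

m, e = math.frexp(5.625)
print(m, e)              # 0.703125 3  ->  0.101101 (binary) * 2**3
print(m * 2, e - 1)      # 1.40625 2   ->  1.01101  (binary) * 2**2  (implicit form)
```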

Normalization above explains how to represent the mantissa so it can be stored in a fixed memory space! Storing the exponent in binary format is equally important.

OK then, how are we going to represent exponent values?

Exponent values are stored using a biasing technique that depends on the floating-point format.

Biasing

In floating-point representation, biasing refers to the process of representing the exponent in a way that allows both positive and negative exponents to be represented without using a sign bit.

In many floating-point formats, including the widely used IEEE 754 standard, biasing involves adding a bias value to the true exponent before encoding it. This bias value is chosen so that the encoded exponent can be represented as an unsigned integer.

For example, in single-precision floating-point format (32 bits), the exponent is typically represented using 8 bits. To allow for both positive and negative exponents without using a sign bit, a bias of 127 is commonly added. This means that an exponent value of 0 is represented as 127, an exponent of -1 is represented as 126, an exponent of 1 is represented as 128, and so on.
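Here is a tiny sketch of that bias arithmetic; the helper names are illustrative only.

```python
# Biased exponent encoding: stored = true exponent + bias (127 for FP32, 15 for FP16).
def encode_exponent(true_exp: int, bias: int) -> int:
    return true_exp + bias

def decode_exponent(stored_exp: int, bias: int) -> int:
    return stored_exp - bias

print(encode_exponent(0, 127), encode_exponent(-1, 127), encode_exponent(1, 127))  # 127 126 128
print(format(encode_exponent(-3, 127), "08b"))   # '01111100' -> used in the FP32 example below
print(format(encode_exponent(-3, 15), "05b"))    # '01100'    -> used in the FP16 example below
```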

Single Precision — FP32

  • Mantissa 23-bits : 0 to 22
  • Exponent 8 bits: 23 to 30
  • Sign 1 bit: 31
Fig:5 — Floating Point Representation Structure — IEEE 754

Hence, exponent values are converted to binary and stored as 8 bits in the exponent field.

For example, consider the floating-point value 0.15625.

  1. First, we determine the sign bit, which is 0 because the number is positive.
  2. Next, we convert the value 0.15625 into binary. The fractional part is converted by repeatedly multiplying by 2, as shown earlier.

The binary representation of 0.15625 is 0.00101.

3. Now, we need to normalize the result 0.00101.

As discussed in the ‘Implicit Normalization’ section above, IEEE 754 uses implicit normalization, so the radix point moves to the RHS of the most significant ‘1’: 0.00101 becomes 1.01 × 2⁻³.

Fig:6 — Implicit Normalization for the value 0.15625

4. Next, determine the exponent. The true exponent is -3, and adding the FP32 bias of 127 gives -3 + 127 = 124, which is 1111100 in binary (7 bits).

Note: Since the exponent field in Single Precision (FP32) is 8 bits, a 0 is added as a prefix, giving 01111100.

5. Finally, determine the mantissa: the bits after the leading ‘1’, padded with zeros to 23 bits, i.e. 01000000000000000000000.

Putting it together, the final floating-point representation of 0.15625 in FP32 is:

  • Sign bit : 0
  • Exponent: 01111100
  • Mantissa : 01000000000000000000000

Fig:7 — Single Precision Binary Representation for the value 0.15625
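We can cross-check this layout with Python's struct module; the small sketch below unpacks the FP32 bit pattern of 0.15625 into its sign, exponent, and mantissa fields.

```python
# Reinterpret the FP32 bytes of 0.15625 as an unsigned 32-bit integer and slice the fields.
import struct

bits = struct.unpack(">I", struct.pack(">f", 0.15625))[0]
print(format(bits, "032b"))                # 00111110001000000000000000000000
print(bits >> 31)                          # 0            (sign)
print(format((bits >> 23) & 0xFF, "08b"))  # 01111100     (biased exponent, 124 = -3 + 127)
print(format(bits & 0x7FFFFF, "023b"))     # 01000000000000000000000 (mantissa)
```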

Half Precision — FP16

Half precision uses 16 bits

  • Mantissa 10-bits : 0 to 9
  • Exponent 5 bits : 10 to 14
  • Sign 1 bit : 15
To represent the same value, 0.15625, in FP16:

  1. Determine the sign bit: The number is positive, so the sign bit is 0.
  2. Normalize the binary number: 0.00101 normalized is 1.01 × 2⁻³.

3. Determine the exponent: Since we normalized to 2⁻³, the biased exponent is -3 plus the bias value for half precision, which is 15, resulting in an exponent of -3 + 15 = 12, which is 01100 in binary.

4. Determine the significand or mantissa: The significand is 01, padded with zeros to 10 bits: 0100000000.

So, in IEEE 754 half-precision FP16 format, 0.15625 would be represented as:

  • Sign bit : 0
  • Exponent: 01100
  • Mantissa : 0100000000
Fig:8 — Half Precision Binary Representation for the value 0.15625
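The same cross-check works for FP16, since struct's "e" format (available since Python 3.6) is IEEE 754 half precision.

```python
# Unpack the FP16 bit pattern of 0.15625 and slice out its 5-bit exponent and 10-bit mantissa.
import struct

bits = struct.unpack(">H", struct.pack(">e", 0.15625))[0]
print(format(bits, "016b"))                # 0011000100000000
print(format((bits >> 10) & 0x1F, "05b"))  # 01100       (biased exponent, 12 = -3 + 15)
print(format(bits & 0x3FF, "010b"))        # 0100000000  (mantissa)
```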

Brain Floating Point Format — BF16

The Google Brain team developed BF16, which is similar to FP16; the only difference is that it uses an 8-bit exponent, the same as FP32, leaving 7 bits for the mantissa.

The main advantage of using BF16 in deep learning models is that converting FP32 to BF16 is fast and simple: the exponent is kept unchanged and the lower bits of the fractional part are simply dropped.

This also makes mixed precision between FP32 and BF16 easier and more practical.

  • Mantissa 7-bits : 0 to 6
  • Exponent 8 bits : 7 to 14
  • Sign 1 bit : 15

So, in the bfloat16 (BF16) format, 0.15625 would be represented as:

  • Sign bit : 0
  • Exponent: 01111100
  • Mantissa : 0100000
Fig: 9 — BF16 — Binary Representation for the value 0.15625
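Because BF16 shares the FP32 exponent, the conversion really is just keeping the upper 16 bits of the FP32 pattern. Here is a sketch using plain truncation; production converters usually round to nearest instead.

```python
# FP32 -> BF16 by truncation: keep sign + 8-bit exponent + top 7 mantissa bits.
import struct

def float_to_bf16_bits(x: float) -> int:
    fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return fp32_bits >> 16                     # drop the lower 16 mantissa bits

bits = float_to_bf16_bits(0.15625)
print(format(bits, "016b"))                    # 0011111000100000
print(format((bits >> 7) & 0xFF, "08b"))       # 01111100 (exponent, same as FP32)
print(format(bits & 0x7F, "07b"))              # 0100000  (7-bit mantissa)
```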

INT8 Quantization

The demand for on-device Deep Learning applications has surged across multiple fields, including smart IoT systems, autonomous driving, robotics, and more.

On-device deep learning presents several notable benefits over cloud-based deep learning, wherein the device connects to the cloud for processing deep learning tasks:

  • Users can avoid sending sensitive data to the cloud.
  • It eliminates the need for network communication.
  • It reduces energy consumption and latency.

However, mobile and IoT devices have limited resources and power constraints compared with cloud servers. Hence, low-footprint models are very important here!

Lower-precision quantization techniques, such as INT8 quantization, are required for models running on mobile or edge devices.

INT8 quantization is a technique used to reduce the memory footprint and computational cost of deep learning and transformer models while maintaining acceptable levels of accuracy. It’s particularly useful for deploying models on resource-constrained devices like mobile phones or edge devices.

INT8 representation quantizes values into 8-bit integers, which require less storage space and are typically faster to compute with than the original FP32 representation.

How does the INT8 conversion happen?

Let us see how to represent the value 0.15625 (the same example value we have used throughout this blog) as an INT8 binary value.

To convert the floating-point number 0.15625 into an INT8 representation, we typically need to:

  1. Scale it to fit within the range of INT8 (usually from -128 to 127).
  2. Round it to the nearest integer, and then clip it to ensure it falls within the valid range of INT8 values.

Scaling

To scale the floating-point number 0.15625, you need to multiply it by a scaling factor that maps the range of the floating-point numbers to the range of INT8 values. In this case, the range of INT8 values is from -128 to 127.

scaled_value = 0.15625 x scaling_factor

To map floating-point numbers in the range [0, 1] onto the positive end of the INT8 range [-128, 127], we use 127 as the scaling factor:

scaled_value = 0.15625 x 127 = 19.9375

Rounding and Clipping

Round the scaled value to the nearest integer.

rounded_value = round(scaled_value) = round(19.9375) = 20

Finally, clip the rounded value to fit within the range of INT8 values [-128, 127]

quantized_value = clip(rounded_value, -128, 127)

= clip(20, -128, 127) = 20.

So, the floating-point number 0.15625 can be represented as the INT8 value 20.

The INT8 representation of the decimal number 20 in binary format is 00010100.
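The scale, round, and clip steps above fit in a few lines. The sketch below hard-codes the 127 scaling factor from the text; real quantizers calibrate the scale (and often a zero point) from the data distribution.

```python
# Scale -> round -> clip, exactly as in the worked example above.
def quantize_int8(x: float, scale: float = 127.0) -> int:
    scaled = x * scale                      # scaling
    rounded = round(scaled)                 # rounding to the nearest integer
    return max(-128, min(127, rounded))     # clipping to the INT8 range

q = quantize_int8(0.15625)
print(q)                    # 20
print(format(q, "08b"))     # 00010100
print(q / 127.0)            # ~0.1575, the dequantized approximation of 0.15625
```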

1-bit LLMs

Microsoft recently released a 1-bit LLM variant named BitNet b1.58, which uses a ternary value {-1, 0, 1} for every single parameter. Surprisingly, it matches the performance of FP16/BF16 precision Transformer models.

Fig: 10 — 1-bit LLMs — Pareto solution. Ref: https://arxiv.org/pdf/2402.17764

Summary

  • Understanding floating-point representation is crucial, since it is heavily used in deep learning and Transformer models: in weights/parameters, gradients, and activation functions.
  • Floating-point representation types include Double Precision (64 bits), Single Precision (32 bits), and Half Precision (16 bits).
  • Floating-point precision is commonly used in model training, model inference, quantization, and on edge devices!
  • Floating-point representation widely follows the IEEE 754 standard, which specifies how values are stored in memory; the format contains the sign, the exponent, and the mantissa/significand.
  • Each floating-point precision type, such as FP32, FP16, and BF16, specifies how many bits are used for the exponent and mantissa.
  • In this blog, we detailed the methodology for converting the exponent and mantissa to binary.
  • Each precision format uses normalization (explicit or implicit) to set the radix point position and a biasing technique to encode exponent values in binary.
  • We explained the importance of reducing model size during training and inference.
  • Lower-precision models, such as FP16, BF16, or 1-bit LLMs, are useful on mobile and edge devices!
