Floating points in deep learning: Understanding the basics

Understanding floating points for training and inference

Krinal Joshi
7 min read · Jan 11, 2024
Image generated with Picsart

With the rapid development of deep learning models, there is an increasing demand for compute resources and for efficient, scalable methods of training and inference on hardware such as GPUs and TPUs, while reducing cost and energy consumption. In this article, we will explore why floating points are important and how they impact model training and inference, focusing in particular on the FP8 floating point format alongside other formats like BF16, FP16 and TF32.

Table of contents

Why floating points are important
Floating point representation
Floating point formats
Double precision
Single precision
Half precision
FP8
BF16
TF32
Conclusion
References

Why floating points are important
Hardware compatibility is one of the main reasons: modern processors are built to handle floating point arithmetic efficiently. Floating point numbers cover a wide range of values, require relatively little memory, make complex computations fast and easy to perform, and come in a variety of formats that support a wide range of applications such as gaming, machine learning and simulations.

Floating point representation
The floating point data type covers a large range of values with a fractional part and is closely related to scientific notation.

Floating point of 12.345 as base-10 (source)

The figure above gives the basic idea of how the pieces fit together, but computers prefer base-2, as we are binary people (⌐■_■).

It consists of:
● Sign: Represents whether the number is positive (0) or negative (1)
● Exponent: Integer power of 2 that scales the significand
● Mantissa / Significand: Represents the significant digits of the number

Let’s look at some examples.

> Check system float information

import sys
sys.float_info

that prints

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

Here float_info returns low-level information about the float type, such as its precision, the minimum and maximum representable values, and internal representation details.

> A NumPy array element with dtype float32 or float64 occupies 4 bytes or 8 bytes respectively in memory

import numpy as np

narr_32 = np.array([1.2345, 2.6543, 4.3205, 0.0023, 8.5362], dtype=np.float32)
narr_32.itemsize # prints 4
narr_64 = np.array([1.2345, 2.6543, 4.3205, 0.0023, 8.5362], dtype=np.float64)
narr_64.itemsize # prints 8

If you want to check a Python object’s memory consumption, use sys.getsizeof(); it reports only the memory directly attributed to the object itself, not the memory consumption of the objects it refers to.
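For instance, this minimal sketch contrasts what sys.getsizeof() reports for a list of Python floats with the raw buffer size of a NumPy float32 array holding the same values:

values = [1.2345, 2.6543, 4.3205]

# getsizeof on a list counts the list object and its element pointers,
# not the float objects the list refers to.
print(sys.getsizeof(values))
print(sum(sys.getsizeof(v) for v in values))  # the per-element float objects

# A float32 NumPy array stores the same values in a compact 4-bytes-per-item buffer.
arr = np.array(values, dtype=np.float32)
print(arr.size * arr.itemsize)  # 3 * 4 = 12 bytes of raw float32 data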

> Get information about the NumPy float types supported by your machine

for f in (np.float32, np.float64):
    finfo = np.finfo(f)
    # dtype: data type
    # bits: number of bits occupied by the type
    # iexp: number of bits in the exponent portion
    # nexp: number of bits in the exponent, including its sign and bias
    # nmant: number of bits in the mantissa
    # precision: approximate number of decimal digits of precision
    # min: smallest (most negative) representable value
    # max: largest representable value
    print(f"d_{finfo.dtype}, b_{finfo.bits}, i_{finfo.iexp}, n_{finfo.nexp}, m_{finfo.nmant}, p_{finfo.precision}, min_{finfo.min}, max_{finfo.max}")

gives

d_float32, b_32, i_8, n_8, m_23, p_6, min_-3.4028234663852886e+38, max_3.4028234663852886e+38
d_float64, b_64, i_11, n_11, m_52, p_15, min_-1.7976931348623157e+308, max_1.7976931348623157e+308
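
To make the sign/exponent/mantissa split concrete, here is a small sketch that unpacks the raw bits of a float32 value with the standard struct module:

import struct

def float32_bits(x):
    # Pack the value as a 4-byte IEEE 754 single and read the bytes back as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

sign, exp, mant = float32_bits(12.345)
print(sign, exp - 127, bin(mant))  # sign 0, unbiased exponent 3; mantissa encodes 12.345 / 2**3 - 1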

Floating point formats
Floating point numbers can be represented in several formats:

Floating point formats illustration by Author

Double precision (fp64)
Double precision, also called FP64, reserves 11 bits for the exponent and 52 bits for the significand, which gives a very wide dynamic range. It is used when high precision is required, though calculations may be slower (depending on hardware).

Single precision (fp32)
Single precision, also called binary32 or FP32, reserves 8 bits for the exponent and 23 bits for the significand. It is the default format for storing weights and activations during model training.

Half precision (fp16)
Half precision, also called binary16 or FP16, reserves 5 bits for the exponent and 10 bits for the significand, making it suitable where storage and bandwidth are limited.
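
A quick way to feel the trade-off is to cast values from FP32 down to FP16 and watch both precision and range shrink (a minimal NumPy sketch):

x32 = np.float32(0.1)
x16 = np.float16(x32)
print(x32, x16)                  # 0.1 as float32 vs the coarser float16 value
print(np.float32(x16) - x32)     # rounding error introduced by the cast

# Values beyond float16's maximum (about 65504) overflow to infinity.
print(np.float16(np.float32(1e5)))   # inf
print(np.finfo(np.float16).max)      # 65504.0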

There are also other formats mainly used for model training and inference, like BF16 and TF32. As mentioned previously, let’s dive deeper into the FP8 format.

FP8

“Neural networks are a bit strange in that they are actually remarkably tolerant to relatively low precision” — Richard Grisenthwaite (executive vice president and chief architect at Arm)

Training a neural network requires a lot of computation power and memory. It involves a huge number of weights and activation values, which vary across layers, to achieve good results.

Machine learning can also work with formats narrower than 32 bits. A key challenge for training and inference is that high precision formats cause memory pressure and performance degradation, so reducing the precision of these numbers is needed to improve latency and efficiency.

We can treat training and inference as two separate functions here, since training is typically done on one class of hardware (such as TPUs or NVIDIA A100 GPUs) while inference runs on another (generally CPUs or smaller GPUs). This causes many problems if you have limited memory capacity. A model trained in 32-bit or 16-bit floating point is usually served in the INT8 format, which requires some conversion or quantisation.
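
As a rough illustration of that conversion step, the sketch below applies simple symmetric post-training quantisation of an FP32 weight tensor to INT8 and back; it is a minimal example, not a production quantisation scheme:

weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric quantisation: map the largest magnitude onto the INT8 limit 127.
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantise for comparison; the difference is the quantisation error.
deq = q_weights.astype(np.float32) * scale
print("max abs error:", np.abs(weights - deq).max())
print("memory: fp32", weights.nbytes, "bytes vs int8", q_weights.nbytes, "bytes")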

Currently, the IEEE SA Working Group P3109 is developing a standard for an 8-bit binary floating point format [5].

FP8 has two variants, E4M3 and E5M2, as described in [4].

FP8 E4M3 and E5M2 format illustration by Author

As the names suggest, E4M3 has 4 exponent bits and 3 mantissa bits, while E5M2 has 5 exponent bits and 2 mantissa bits. E4M3 represents fewer special values (it gives up infinities to extend its range), whereas E5M2 represents infinities, NaNs and zeros. E4M3 is useful for weight and activation tensors, which need more precision during the forward pass, while E5M2 suits gradient tensors, which tolerate lower precision but benefit from the wider dynamic range during the backward pass.
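
The numeric consequences are easy to inspect in recent PyTorch builds (2.1 or newer ship experimental float8 dtypes); treat this as an exploratory sketch rather than a training recipe:

import torch  # requires PyTorch >= 2.1 for the experimental float8 dtypes

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    # E4M3 trades range for precision; E5M2 keeps a wider exponent range.
    print(dtype, "max:", info.max, "smallest normal:", info.tiny, "eps:", info.eps)

# Casting a tensor down and back up shows the coarse rounding of 8-bit floats.
x = torch.tensor([0.1234, 1.5, 300.0])
print(x.to(torch.float8_e4m3fn).to(torch.float32))
print(x.to(torch.float8_e5m2).to(torch.float32))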

In [1], the authors propose a methodology for selecting the scaling of FP8 linear layers for training and inference, using per-tensor scaling for weights and activations, without performance degradation. The study also shows that mixed-precision matrix multiplications in FP8 are possible but run into range issues. The proposed methodology dynamically updates the per-tensor scaling biases and prevents degradation when using FP8 to train LLMs. For inference, some conversion or post-training quantisation is required because the trained weights are stored in higher precision formats. By applying a scaling bias to each tensor and subsequently casting to FP8, inference can be done with post-training quantisation without any degradation.
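
A simplified version of that per-tensor scaling idea looks like the NumPy sketch below, which targets the FP8 E4M3 range of roughly ±448; it is a conceptual illustration, not the paper’s exact procedure:

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def scale_to_fp8_range(tensor):
    # Choose a per-tensor scale so the largest magnitude lands near the format's maximum.
    scale = E4M3_MAX / np.abs(tensor).max()
    scaled = np.clip(tensor * scale, -E4M3_MAX, E4M3_MAX)
    return scaled, scale

activations = np.random.randn(8, 8).astype(np.float32) * 1e-3  # small-magnitude values
scaled, scale = scale_to_fp8_range(activations)
# The scaled tensor would now be cast to FP8; dividing by the scale afterwards
# recovers the original magnitudes.
recovered = scaled / scale
print("scale bias:", scale, "max abs diff:", np.abs(recovered - activations).max())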

Paper [2] introduces FFP8, an 8-bit floating point format with a configurable exponent and fraction bit width, exponent bias value, and optional sign bit. The paper makes three key observations: first, the maximum magnitude and the value distribution of weights and activations differ in most neural networks; second, commonly used activation functions produce only non-negative outputs, so the sign bit is unnecessary for those activations and would otherwise waste one bit of storage; and third, producing 8-bit floating-point models requires its own sophisticated training framework. The FFP8 format consists of a sign (s), exponent (e), and fraction (f) field with bit lengths x, y, and z respectively, plus an exponent bias (b) parameter for specifying the floating point value.

FFP8 format (source)

The aim of the FFP8 format is to enable memory-efficient inference on systems with limited memory capacity.
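
To get a feel for how the configurable exponent width, fraction width and bias trade range against precision, here is a simplified minifloat range calculator; it ignores FFP8’s handling of special values, so it only approximates the actual format:

def minifloat_range(exp_bits, frac_bits, bias, signed=True):
    # Largest normal value: all-ones exponent and all-ones fraction
    # (assumes no encodings are reserved for infinity/NaN, unlike real formats).
    max_exp = (2**exp_bits - 1) - bias
    largest = (2 - 2**-frac_bits) * 2.0**max_exp
    smallest_normal = 2.0**(1 - bias)
    lowest = -largest if signed else 0.0
    return lowest, largest, smallest_normal

# Hypothetical configurations: more exponent bits buy range, more fraction bits
# buy precision, and dropping the sign bit frees one bit for the fraction.
print(minifloat_range(exp_bits=4, frac_bits=3, bias=7))
print(minifloat_range(exp_bits=5, frac_bits=2, bias=15))
print(minifloat_range(exp_bits=4, frac_bits=4, bias=7, signed=False))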

BF16
Google’s bfloat16 is a widely used floating point format. It is a truncated version of IEEE FP32 that retains FP32’s wide dynamic range of numeric values.

BF16 illustration by Author

BF16 uses 8 bits for the exponent and 7 bits for the mantissa. Its dynamic range is the same as FP32 (roughly 1e-38 to 3e38), so it covers a large range of tensor values with half the memory footprint. BF16 is used to reduce storage requirements and increase the speed of machine learning computations.
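
Because BF16 is essentially the top 16 bits of an FP32 value, you can emulate it by zeroing the low 16 bits of the FP32 bit pattern (a minimal NumPy sketch; real hardware rounds rather than truncates):

def truncate_to_bf16(x):
    # Keep only the top 16 bits of each float32: the sign and the 8-bit exponent are
    # untouched, so the dynamic range stays the same as float32.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.1415927, 1e-38, 2.5e38], dtype=np.float32)
print(truncate_to_bf16(x))  # coarser values, but no overflow to inf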

The format is used in Intel AI hardware, including the Nervana NNP-L1000, Xeon processors and Intel FPGAs, as well as Google Cloud TPUs.

TF32
According to NVIDIA, “TensorFloat-32 is the new math mode in NVIDIA A100 GPUs for handling the matrix math also called tensor operations used at the heart of AI and certain HPC applications. TF32 running on Tensor Cores in A100 GPUs can provide up to 10X speedups compared to single-precision floating-point math (FP32) on Volta GPUs.” [6]

TF32 illustration by Author

TF32 uses 8 bits for the exponent and 10 bits for the mantissa. Because it adopts the same 8-bit exponent as FP32, it supports the same numeric range, while its 10-bit mantissa matches the precision of FP16.
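
On Ampere-class GPUs, PyTorch exposes TF32 through a pair of backend flags; the snippet below shows the toggles (they only take effect when a capable GPU is present, and the default values have changed across PyTorch releases):

import torch

# Allow matmuls and cuDNN convolutions to use TF32 Tensor Cores on Ampere and newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

if torch.cuda.is_available():
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    # Inputs and outputs stay FP32; the multiply itself runs at TF32 precision internally.
    c = a @ b
    print(c.dtype)  # torch.float32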

TF32 is applicable to a wide range of fields such as nuclear energy, earth science, healthcare, fluid dynamics, materials science and gas exploration.

Conclusion
In conclusion, floating point numbers represent a wide range of values and are hardware friendly. They play a crucial role in the training and inference of neural networks, as the choice of floating-point format can significantly impact the accuracy, speed, and memory usage of deep learning models. Understanding the different floating-point formats and their trade-offs is essential for optimizing neural network performance.

References
[1] S. P. Perez et al., “Training and inference of large language models using 8-bit floating point,” arXiv (Cornell University), Sep. 2023, doi: 10.48550/arxiv.2309.17224.
[2] J.-D. Huang, C. Huang, and T.-W. Chen, “All-You-Can-Fit 8-Bit flexible Floating-Point format for accurate and Memory-Efficient inference of deep neural networks,” arXiv (Cornell University), Apr. 2021, doi: 10.48550/arxiv.2104.07329.
[3] D. D. Kalamkar et al., “A study of BFLOAT16 for deep learning training,” arXiv (Cornell University), May 2019, [Online]. Available: https://arxiv.org/pdf/1905.12322.pdf
[4] P. Micikevicius et al., “FP8 formats for deep learning,” arXiv (Cornell University), Sep. 2022, doi: 10.48550/arxiv.2209.05433.
[5] P3109 Working Group, “Public/Shared Reports at main · P3109/Public,” GitHub. https://github.com/P3109/Public/tree/main/Shared%20Reports (accessed Jan. 11, 2024).
[6] P. Kharya, “What is the TensorFloat-32 Precision Format? | NVIDIA Blog,” NVIDIA Blog, May 18, 2020. https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/

Thanks for reading ^-^
