Since version 2.1, TensorFlow has supported mixed-precision training, which makes use of the Tensor Cores available in recent NVIDIA GPUs.
One way to describe mixed-precision training in TensorFlow goes like this: mixed-precision training (MPT) lets you train models whose weights are of type float32 or float64, as usual (for reasons of numeric stability), while the data, that is, the tensors pushed between operations, have lower precision, namely 16 bits (float16).
Some of the benefits are faster model training on a compatible GPU and, because the tensors take 16 bits each, room for larger batch sizes. With mixed-precision training, I can run batches of size 256 in cases where, without it, I get an out-of-memory error pretty fast.
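The split described above, float32 variables with float16 computation, can be inspected on a Keras policy object. This is a minimal sketch, assuming TensorFlow 2.4 or later, where the API lives under tf.keras.mixed_precision (in earlier 2.x releases it sat under an experimental namespace). It only builds the policy; it does not install it globally.

```python
import tensorflow as tf

# Build a policy object without installing it as the global policy.
policy = tf.keras.mixed_precision.Policy("mixed_float16")

# Layers created under this policy run their computations in float16 ...
print(policy.compute_dtype)   # float16
# ... but keep their weights (variables) in float32.
print(policy.variable_dtype)  # float32
```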
Fused Multiply-Add (FMA)
To understand mixed-precision training, we also need the concept of fused multiply-add.
Fused Multiply-Add is a type of multiply-accumulate operation. In multiply-accumulate, operands are multiplied and then added to an accumulator keeping track of the running sum.
If “fused”, the whole “multiply-then-add” operation is performed with a single rounding at the end (as opposed to rounding once after the multiplication, and then again after the addition). Usually, this results in higher accuracy.
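The effect of rounding once instead of twice can be seen with ordinary Python floats. The sketch below emulates the fused result by computing exactly with rationals and rounding only at the end; the input values are chosen by hand for illustration so that the two results differ.

```python
from fractions import Fraction

a = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

# Unfused: a * a is rounded to a double first, then the sum is rounded again.
# The exact product is 1 + 2**-26 + 2**-54; the last term is lost in rounding.
unfused = a * a + c

# Fused (emulated): compute a * a + c exactly, then round a single time.
fused = float(Fraction(a) * Fraction(a) + Fraction(c))

print(unfused)  # 0.0
print(fused)    # 5.551115123125783e-17, which is 2**-54
```

With two roundings the small term vanishes completely; with a single rounding it survives.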
On Intel CPUs, FMA was introduced concurrently with AVX2. FMA can be performed on scalars or on “packed” vectors.
Why is this so interesting to data scientists? Well, a lot of operations (dot products, matrix multiplications, convolutions) involve multiplications followed by additions.
Matrix multiplication is where we leave the realm of CPUs and jump to GPUs instead, because what MPT does is make use of the new-ish NVIDIA Tensor Cores, which extend FMA from scalars and vectors to matrices.
Basically, the operation takes place on 4x4 matrices; multiplications happen on 16-bit operands, while the final result can be 16-bit or 32-bit.
From the very nice paper from NVIDIA:
“Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision product that is then accumulated using FP32 addition with the other intermediate products for a 4x4x4 matrix multiply (see Figure 9). In practice, Tensor Cores are used to perform much larger 2D or higher dimensional matrix operations, built up from these smaller elements.”
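Why the accumulator precision matters can be shown with a deliberately contrived dot product. The sketch below emulates float16 and float32 rounding in software using the standard struct module; it illustrates the numerics only and is not how Tensor Cores are implemented. The helper names to_fp16, to_fp32 and dot are made up for this example.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(xs, ys, round_acc):
    """Dot product with float16 inputs and a configurable accumulator."""
    acc = 0.0
    for x, y in zip(xs, ys):
        # The product of two float16 values fits exactly in a double.
        product = to_fp16(x) * to_fp16(y)
        # Round the running sum to the accumulator's precision.
        acc = round_acc(acc + product)
    return acc

ones = [1.0] * 4096
fp16_acc = dot(ones, ones, to_fp16)
fp32_acc = dot(ones, ones, to_fp32)

print(fp16_acc)  # 2048.0
print(fp32_acc)  # 4096.0
```

The float16 accumulator stalls at 2048, because 2049 is not representable in float16 and rounds back down, while the float32 accumulator reaches the correct total.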
Python Implementation with CIFAR-10 Dataset
Why am I using dtype='float32' in the last Activation layer in the above code?
According to the official TensorFlow guide, to use mixed precision properly, the sigmoid activation at the end of the model should be float32. Because we set the policy to mixed_float16, the activation's compute_dtype would otherwise be float16.
Thus, we had to override the policy for the last layer with float32.
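The override can be sketched as follows. This is a minimal illustration, assuming TensorFlow 2.4 or later; the tiny CNN and the softmax activation are hypothetical choices for 10-class, 32x32 RGB inputs such as CIFAR-10, not the architecture discussed above. The same dtype argument applies to a sigmoid activation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Compute in float16, keep variables in float32 (assumes TensorFlow 2.4+).
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Hypothetical small CNN for 32x32 RGB inputs; an assumption for illustration.
inputs = tf.keras.Input(shape=(32, 32, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(10)(x)  # logits, computed in float16 under the global policy

# Override the policy for the final activation only, so it runs in float32.
outputs = layers.Activation("softmax", dtype="float32")(x)
model = tf.keras.Model(inputs, outputs)

print(model.layers[-2].compute_dtype)  # float16
print(model.layers[-1].compute_dtype)  # float32
```

Keeping the Dense layer and the activation separate makes it possible to leave the matrix multiply in float16 while only the numerically sensitive last step runs in float32.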
Why did I divide by 255 when scaling the images in the above code?
When images are passed through a deep neural network as they are, the large numeric values can make the computation more difficult.
To reduce this, we can normalize the values to the range from 0 to 1.
This way, the numbers stay small and the computation becomes easier and faster. Pixel values range from 0 to 255, so dividing all of them by 255 maps them to the range from 0 to 1.
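As a small sketch of that scaling step, assuming NumPy and 8-bit pixel data (the array below is hypothetical stand-in data, not CIFAR-10 itself):

```python
import numpy as np

# Stand-in for image data: 8-bit pixels with values from 0 to 255.
images = np.array([[0, 64, 128, 255]], dtype=np.uint8)

# Cast to float first so the result is stored as float32, then scale to [0, 1].
scaled = images.astype("float32") / 255.0

print(scaled.min(), scaled.max())  # 0.0 1.0
```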
Thanks for reading !!