NHWC vs NCHW: A memory access perspective on GPUs

Deepika
6 min read · Sep 27, 2023


NHWC and NCHW are contrasting data layout formats widely used in deep learning, particularly in Convolutional Neural Networks (CNNs). They determine how multi-dimensional data, such as images, point clouds, or feature maps, is stored in memory.

  1. NHWC (Number of samples, Height, Width, Channels): This format orders the dimensions so that, for each sample, height and width come first and channels come last. NHWC is the default format in TensorFlow.
  2. NCHW (Number of samples, Channels, Height, Width): In this layout, channels precede the height and width dimensions. It is the default format in PyTorch.

The choice between NHWC and NCHW can significantly impact memory access, computational efficiency, and compatibility with deep learning frameworks, influencing both model performance and hardware utilization. The decision depends on your specific use case, framework, and hardware configuration.
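
To make the two layouts concrete, here is a minimal NumPy sketch (the tiny shapes are illustrative only) that holds the same values in both orders and prints the strides, i.e., how many bytes separate neighbouring elements along each dimension:

```python
import numpy as np

# A tiny batch: N=1 sample, H=2, W=2, C=3 channels (values are arbitrary).
nhwc = np.arange(1 * 2 * 2 * 3, dtype=np.float32).reshape(1, 2, 2, 3)

# The same data reordered to NCHW by permuting axes (N, H, W, C) -> (N, C, H, W).
nchw = np.ascontiguousarray(nhwc.transpose(0, 3, 1, 2))

# Strides (in bytes) show which dimension varies fastest in memory:
# NHWC: the channel stride is smallest, so the channels of one pixel sit together.
# NCHW: the width stride is smallest, so one channel is stored together row by row.
print(nhwc.shape, nhwc.strides)   # (1, 2, 2, 3) (48, 24, 12, 4)
print(nchw.shape, nchw.strides)   # (1, 3, 2, 2) (48, 16, 8, 4)
```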

Convolution as GEMM:

Internally, convolution can be implemented either with transform-based methods such as the Fast Fourier Transform (FFT), which converts the convolution into element-wise multiplications in the frequency domain, or with transform-free methods such as matrix multiplication, where the input and kernel are flattened and combined in a matrix operation to compute the output feature map. A few concerns with FFTs are:

  • FFTs are memory-intensive as they require additional memory for storing transformed matrices.
  • FFTs can be computationally costly, especially when transforming data back and forth between the spatial and frequency domains, which adds operational overhead.

On the other hand, the General Matrix Multiplication (GEMM) view of a convolution operation looks like the figure below.

Each receptive field is flattened and stacked column-wise to form the feature map transform matrix. The filters are likewise flattened and stacked row-wise to form the filter transform matrix. The filter transform and feature map transform matrices are then multiplied to produce the flattened output matrix.
Note: here, a "transform" matrix is simply an intermediate matrix of rearranged values; it has no relation to a frequency-domain transformation.

  • N — batch size of the feature map, C — input channels, H — input height, W — input width
  • K — output channels, R — filter height, S — filter width, P — output height, Q — output width

The feature map transform and filter transform matrices are intermediate matrices whose dimensions are larger than the feature map itself. In the toy example of the figure: dimensions of the feature map = C × H × W (3 × 3 × 3); dimensions of the feature map transform = CRS × NPQ (12 × 4).
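
As a sanity check of these dimensions, below is a hedged NumPy sketch of the explicit (im2col-style) GEMM. The filter size R = S = 2 with stride 1 and no padding is an assumption inferred from the 12 × 4 transform above, and K = 2 output channels is an arbitrary choice:

```python
import numpy as np

# Toy sizes from the article: N=1, C=3, H=W=3. R=S=2 with stride 1 and no
# padding (assumed) gives P=Q=2, i.e. the 12x4 (CRS x NPQ) transform above.
N, C, H, W = 1, 3, 3, 3
K, R, S = 2, 2, 2                      # K=2 output channels is an arbitrary choice
P, Q = H - R + 1, W - S + 1            # 2, 2

x = np.random.rand(N, C, H, W).astype(np.float32)
w = np.random.rand(K, C, R, S).astype(np.float32)

# Feature-map transform (im2col): one column per output position, CRS rows.
cols = np.zeros((C * R * S, N * P * Q), dtype=np.float32)
for n in range(N):
    for p in range(P):
        for q in range(Q):
            patch = x[n, :, p:p + R, q:q + S]            # one receptive field
            cols[:, (n * P + p) * Q + q] = patch.ravel()

# Filter transform: each filter flattened into one row (K x CRS).
w_rows = w.reshape(K, C * R * S)

# GEMM: (K x CRS) @ (CRS x NPQ) -> flattened output, then reshaped to N, K, P, Q.
out = (w_rows @ cols).reshape(K, N, P, Q).transpose(1, 0, 2, 3)

# Cross-check one output element against a direct convolution.
assert np.allclose(out[0, 0, 0, 0], (x[0, :, 0:R, 0:S] * w[0]).sum())
print(cols.shape)   # (12, 4) -- the CRS x NPQ feature-map transform
```

Note that wherever receptive fields overlap, the same input values appear in several columns of the transform; this duplication is exactly the extra memory the next section tries to avoid.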

GPU Implementation of GEMM:

To avoid this memory overhead, implicit GEMM is used. Instead of materializing the transform matrices, implicit GEMM computes the required column and row indices on the fly, and each result is written directly to the corresponding index of the output tensor.
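
A deliberately naive Python sketch of that idea is shown below. A real implicit-GEMM kernel runs on the GPU with tiling and vectorized loads, but the indexing structure is the same: no intermediate matrix is built, and every value is read straight from the input tensor.

```python
import numpy as np

def conv2d_implicit_gemm(x, w):
    """Implicit-GEMM style convolution: the im2col matrix is never built;
    the (c, r, s) indices are computed on the fly for each output element."""
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    out = np.zeros((N, K, P, Q), dtype=x.dtype)
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    acc = 0.0
                    # This inner reduction is one dot product of the virtual GEMM:
                    # a row of the filter transform times a column of the
                    # feature-map transform, indexed directly into x and w.
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                acc += x[n, c, p + r, q + s] * w[k, c, r, s]
                    out[n, k, p, q] = acc   # written straight to its output index
    return out
```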

A GPU, which is composed of SMs (streaming multiprocessors), is built primarily for parallel computation. In the implicit GEMM above, the matrix multiplication can be split into smaller matrix multiplications, or tiles. Each tile is then handled by an SM concurrently to speed up the process.

Each tile is handled by separate threads inside the SMs
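
The sketch below illustrates the tiling idea on the CPU with NumPy (the tile size and matrix shapes are arbitrary); on a GPU, each output tile would map to a thread block scheduled on an SM:

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Split C = A @ B into independent output tiles; on a GPU each tile
    would be assigned to a thread block running on one SM."""
    M, K = A.shape
    K2, Nc = B.shape
    assert K == K2
    C = np.zeros((M, Nc), dtype=A.dtype)
    for i in range(0, M, tile):          # each (i, j) pair is one output tile
        for j in range(0, Nc, tile):
            for k in range(0, K, tile):  # accumulate over the shared dimension
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C

A = np.random.rand(8, 12).astype(np.float32)
B = np.random.rand(12, 8).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-5)
```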

With this in mind, let us also look at how tensors are stored on a GPU.

Tensor Storage:

Tensors are typically stored on a GPU in a strided format: a flat buffer of elements plus a set of strides that describe how far to jump in memory to move one step along each dimension. This strided representation provides the flexibility to arrange a tensor in various patterns, such as the NCHW or NHWC formats, to optimize memory access and computational efficiency. For the tensor in the figure, the NCHW and NHWC row-major representations are given in the next subsections.
Row-major storage arranges tensor elements in memory by storing each row sequentially, so the last dimension varies fastest.

NCHW in Row Major:

Here W is the fastest-varying (innermost) dimension. All elements of one channel are stored together, followed by the elements of the next channel.

NHWC in Row Major:

Here C is the fastest-varying (innermost) dimension. All channel values for a given spatial position are stored sequentially, followed by the channel values of the next spatial position, which makes reading every channel at a single spatial location a contiguous access.
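
Both orderings reduce to simple flat-index formulas in row-major storage. The small sketch below (toy shape and element indices chosen purely for illustration) shows that two channel values of the same pixel are adjacent in NHWC but H × W elements apart in NCHW:

```python
def offset_nchw(n, c, h, w, N, C, H, W):
    # Row-major NCHW: w varies fastest, then h, then c, then n.
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, h, w, c, N, C, H, W):
    # Row-major NHWC: c varies fastest, then w, then h, then n.
    return ((n * H + h) * W + w) * C + c

# Two channel values of the same pixel: adjacent in NHWC, far apart in NCHW.
N, C, H, W = 1, 3, 2, 2
print(offset_nhwc(0, 0, 0, 0, N, C, H, W), offset_nhwc(0, 0, 0, 1, N, C, H, W))  # 0 1
print(offset_nchw(0, 0, 0, 0, N, C, H, W), offset_nchw(0, 1, 0, 0, N, C, H, W))  # 0 4
```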

Memory Throughput on GPU:

GPUs are highly parallel processors, and they work best when data access is coalesced, i.e., when data is read in a contiguous, organized fashion. When a thread looks for data in the L2 cache and gets a cache hit (the contents of the requested memory are already in the cache), memory access is fast. On a cache miss (the opposite of a hit), the GPU goes to DRAM to fetch the contents of the requested memory address, which is a time-consuming operation.

GPU Memory hierarchy

When a GPU needs to access data stored in memory, it does so in "transactions". Each transaction accesses 32 or 128 bytes of information, depending on the GPU configuration. The accessed data then remains in the cache. When another GPU thread requests memory, it first checks the cache; if the data is not available there, the request is forwarded to DRAM.

This can be summarized as follows:

  • Coalesced memory transactions occur when the GPU accesses memory in a contiguous block. In other words, if the GPU needs to read 32 bytes of data stored consecutively in memory, it can perform a single coalesced memory transaction to retrieve all 32 bytes at once.
  • Uncoalesced memory transactions occur when the GPU needs to access data that is not stored contiguously in memory. In this case, the GPU has to perform multiple transactions to retrieve all the necessary data.

Now, in the case of GEMM, irrespective of the height and width of the filter, we always read all channel-wise information for a given spatial location. For example, if our input feature map is 128 × 128 × 32 (H × W × C), we read all 32 channels at location (1, 1) whether we use a 1 × 1 or a 3 × 3 kernel.

If we use NCHW, which stores all elements of a single channel together, we have to stride to locations a[0], a[16384], a[32768], … up to a[16384 × 31] (taking a 1 × 1 convolution for simplicity). These locations are not contiguous and are almost certain to cause cache misses, which leads to transactional overhead during memory reads. Most of the data fetched in each transaction also goes unused; these are uncoalesced memory transactions.

When we represent the tensor in NHWC format, the access locations are a[0], a[1], …, a[31], which are contiguous and likely to be cache hits. Accessing a[0] for the first time causes a cache miss and a DRAM transaction that fetches 32/128 bytes of data. Accessing a[1] is then a cache hit, saving a transaction. This continues for a certain number of locations after a[0], depending on the number of bytes read per transaction and the cache size. Even when a cache miss eventually occurs and another DRAM transaction is issued, that transaction carries contiguous data from consecutive memory locations, leading to cache hits for the subsequent accesses. These are coalesced memory transactions.
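
A back-of-the-envelope sketch of this access pattern is given below; the 4-byte elements and 128-byte transactions are assumptions made for illustration:

```python
H, W, C = 128, 128, 32
BYTES_PER_ELEM = 4          # assume float32 elements

# Addresses touched when reading all 32 channels at spatial location (0, 0):
nchw_offsets = [c * H * W for c in range(C)]      # a[0], a[16384], ..., a[16384*31]
nhwc_offsets = [c for c in range(C)]              # a[0], a[1], ..., a[31]

# Rough transaction count, assuming 128-byte transactions and a cold cache:
# NCHW touches 32 different 128-byte segments; NHWC fits in a single one.
TXN = 128
nchw_txns = len({(o * BYTES_PER_ELEM) // TXN for o in nchw_offsets})   # 32
nhwc_txns = len({(o * BYTES_PER_ELEM) // TXN for o in nhwc_offsets})   # 1
print(nchw_txns, nhwc_txns)
```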

Takeaway:

Thus, NHWC reduces memory access bottlenecks on Tensor Core GPUs, leading to better performance, and is generally the preferable option compared to NCHW. Given below is the performance in terms of TFLOPS for both NCHW and NHWC on an NVIDIA A100-SXM4-80GB with CUDA 11.2 and cuDNN 8.1. We see that NHWC performs better in terms of TFLOPS in both settings. For simplicity, we did not get into the NC/xHWx layout, which is a variant of NHWC prepared for NVIDIA Tensor Core operations.

Source: https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#imp-gemm-dim
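
In practice you do not have to change a model's logical NCHW shapes to benefit from this: PyTorch, for example, can keep the logical shape while storing the data physically as NHWC through the channels_last memory format, which lets cuDNN pick NHWC kernels. A minimal sketch, assuming a CUDA GPU is available and using arbitrary shapes:

```python
import torch

# PyTorch keeps the logical NCHW shape but can store data physically as NHWC
# via the channels_last memory format.
x = torch.randn(8, 32, 128, 128, device="cuda").to(memory_format=torch.channels_last)
model = torch.nn.Conv2d(32, 64, kernel_size=3, padding=1).cuda()
model = model.to(memory_format=torch.channels_last)

with torch.autocast("cuda", dtype=torch.float16):   # Tensor Cores want FP16/BF16
    y = model(x)

print(y.is_contiguous(memory_format=torch.channels_last))  # True
```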

Sources:

  1. https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#imp-gemm-dim
  2. https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html
  3. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
  4. https://leimao.github.io/blog/CUDA-Convolution-Tensor-Layouts/
  5. https://www.microway.com/hpc-tech-tips/avoiding-gpu-memory-performance-bottlenecks/
  6. Molchanov, Vladimir; Vishnyakov, Boris; Vizilter, Yu; Vishnyakova, Oxana; Knyaz, Vladimir (2017). Pedestrian detection in video surveillance using fully convolutional YOLO neural network. 103340Q. doi:10.1117/12.2270326.
