A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection

YOLO architecture [1]

When it comes to object detection, there is one name that comes to mind: YOLO (You Only Look Once) [1]. With this revolutionary computer vision and deep learning algorithm climbing its way to the top (it is used even in autonomous driving), many researchers have been looking for efficient ways to deploy it on different platforms. But there are a lot of challenges!

A recently published article titled “A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection” in IEEE Transactions on Very Large Scale Integration (VLSI) Systems [2] takes us one step closer to achieving an efficient deployment of this architecture.

The authors state that when a conventional CNN is deployed on hardware, frequent access to off-chip memory causes slow processing and a large amount of power dissipation. This paper essentially deals with deploying the YOLO CNN on a streaming hardware accelerator that reaches TOPS-level throughput.

Specifically, the proposed streaming architecture relies on two methods used by the authors:

  1. Hardware-centric quantization, which includes optimizations for batch normalization and leaky ReLU (see the sketch after this list)
  2. Flexible low-bit quantization
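
To give a feel for the first method, here is a minimal NumPy sketch of the standard batch-norm folding idea that such hardware-centric quantization builds on: the per-channel scale and bias of batch normalization are merged into a single multiply-add applied to the convolution output, followed by the leaky-ReLU comparison. This is an illustrative floating-point version; the paper's exact fixed-point formulation differs, and all names below are made up for the example.

```python
import numpy as np

def fold_bn_leaky_relu(conv_out, gamma, beta, mean, var, eps=1e-5, slope=0.1):
    """Fold BatchNorm into a per-channel scale/bias, then apply leaky ReLU.

    conv_out: (C, H, W) raw convolution outputs
    gamma, beta, mean, var: per-channel BatchNorm parameters, each of shape (C,)
    """
    scale = gamma / np.sqrt(var + eps)             # per-channel multiplier
    bias = beta - mean * scale                     # per-channel offset
    y = conv_out * scale[:, None, None] + bias[:, None, None]
    return np.where(y > 0, y, slope * y)           # leaky ReLU
```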

The authors mention that they use 3–6 bit quantization depending on the layer, and that the 1-bit weights combined with flexible low-bit quantization reduce the model size by about 30x, so all of the weights can be stored in the on-chip RAM of an FPGA. As a result, there is a tremendous reduction in off-chip memory access.
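
As a rough illustration of the second method (not the paper's exact scheme), the sketch below quantizes a tensor symmetrically to a chosen bit width; going from 32-bit floats to 1-bit weights is what yields the roughly 30x model-size reduction mentioned above.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Uniform symmetric quantization of a tensor to `bits` bits (illustrative only)."""
    if bits == 1:
        return np.sign(x)                          # binary weights keep only the sign
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 levels on each side for 4 bits
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

# 32-bit floats down to 1-bit weights -> roughly 32x smaller parameter storage
weights = np.random.randn(3, 3, 256, 512).astype(np.float32)
binary_weights = quantize_symmetric(weights, bits=1)
activations_4bit = quantize_symmetric(np.random.rand(256, 13, 13), bits=4)
```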

Also, to optimize the data path, the strategy proposed by the authors reuses the weights along a single row (line-based weight reuse) and fully reuses the input feature map, which reduces the output buffer size compared with prevailing methods.
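
The reuse idea can be mimicked in software with a loop ordering like the one below (a hypothetical sketch, not the paper's datapath): the weights fetched for one output row stay resident and are reused for every column of that row while the input feature map streams through once.

```python
import numpy as np

def conv3x3_line_reuse(ifm, kernel):
    """3x3 convolution (stride 1, no padding) written to highlight line-based weight reuse.

    ifm:    (H, W) single-channel input feature map, streamed row by row
    kernel: (3, 3) weights, fetched once per output row and reused across all columns
    """
    H, W = ifm.shape
    ofm = np.zeros((H - 2, W - 2))
    for row in range(H - 2):
        w = kernel                      # weights held in the on-chip buffer for this row
        for col in range(W - 2):        # each weight is reused (W - 2) times per row
            ofm[row, col] = np.sum(ifm[row:row + 3, col:col + 3] * w)
    return ofm
```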

The streaming architecture proposed by the authors is as seen in Fig. 1.

Fig. 1. Streaming architecture for convolution [2]

The weight buffer seen in Fig. 1 specifically implements the line-based weight reuse. The other components of the architecture implement the optimized batch normalization. To increase the speed of computation, the parameters are fetched beforehand: the size of the weight buffer is doubled so that it works as a ping-pong buffer.
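
The ping-pong behaviour can be pictured as a simple double buffer: while one bank feeds the compute units, the other bank is being filled with the next set of parameters, so prefetching never stalls the datapath. The class below is a generic software analogue, not the authors' RTL.

```python
class PingPongBuffer:
    """Two banks: the compute side reads one while the prefetcher fills the other."""

    def __init__(self):
        self.banks = [None, None]
        self.active = 0                            # bank currently read by the compute units

    def prefetch(self, next_params):
        self.banks[1 - self.active] = next_params  # fill the idle bank in the background

    def swap(self):
        self.active = 1 - self.active              # switch banks once the current tile is done

    def read(self):
        return self.banks[self.active]
```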

A similar architecture, shown in Fig. 2, is used for the 2x2 max-pooling layer.

Fig. 2. Streaming Architecture for Max Pooling Layer [2]

Unlike the convolution, the max-pooling layer requires only a single line buffer. Part of the incoming data is latched and then compared with the next input. If the row count is even, the results are stored in the line buffer; otherwise, they are compared again with the value at the corresponding buffer address. Finally, the results are concatenated as Ti.
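
A simplified software analogue of that dataflow (again, not the authors' exact control logic) is shown below: pixels arrive in raster order, even rows write the pairwise column maxima into the one-line buffer, and odd rows compare against that buffer to emit one pooled value per 2x2 window.

```python
import numpy as np

def stream_maxpool_2x2(ifm):
    """2x2 max pooling (stride 2) over a streamed feature map with a single line buffer.

    ifm: (H, W) feature map with even H and W, consumed one pixel pair at a time.
    """
    H, W = ifm.shape
    line_buf = np.empty(W // 2)
    out = np.empty((H // 2, W // 2))
    for row in range(H):
        for col in range(0, W, 2):
            latched = ifm[row, col]                         # first pixel of the pair is latched
            pair_max = max(latched, ifm[row, col + 1])      # ...and compared with the next input
            if row % 2 == 0:
                line_buf[col // 2] = pair_max               # even row: store in the line buffer
            else:
                out[row // 2, col // 2] = max(pair_max, line_buf[col // 2])  # odd row: emit output
    return out
```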

Apart from this, the authors also propose resource-aware parallelism and batch processing to increase the throughput.
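
One simple way to think about resource-aware parallelism (a toy sketch under my own assumptions, not the authors' allocation algorithm) is to give each layer a number of parallel multipliers roughly proportional to its share of the total compute, subject to the DSP budget of the device.

```python
def allocate_parallelism(layer_macs, dsp_budget):
    """Split a DSP budget across layers in proportion to their MAC counts (illustrative).

    layer_macs: list of multiply-accumulate counts, one entry per layer
    dsp_budget: total number of parallel multipliers available on the FPGA
    """
    total = sum(layer_macs)
    return [max(1, round(dsp_budget * macs / total)) for macs in layer_macs]

# Example: a heavier layer receives proportionally more parallel units.
print(allocate_parallelism([90e6, 400e6, 1600e6], dsp_budget=512))
```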

According to the authors, the proposed design achieves a throughput of 1.87 TOPS without increasing the hardware cost and outperforms most previous works. Moreover, while the efficiency of most previous designs drops as the network gets deeper, here the efficiency actually increases with depth.

References:

[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[2] D. T. Nguyen, T. N. Nguyen, H. Kim, and H.-J. Lee, “A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1861–1873, 2019.
