OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks

In this article, the authors propose a domain-specific FPGA overlay processor, named OPU, to accelerate CNN networks. It offers software-like programmability for CNN end users: CNN algorithms are automatically compiled into executable code, which is loaded and executed by OPU without reconfiguring the FPGA when switching or updating networks. The OPU instructions have complicated functions with variable runtimes but a uniform length. The instruction granularity is optimized to provide good performance and sufficient flexibility while reducing the complexity of developing the microarchitecture and compiler. Experiments show that OPU achieves an average runtime multiply-and-accumulate (MAC) unit efficiency (RME) of 91% across nine different networks. Moreover, for VGG and YOLO networks, OPU outperforms automatically compiled network-specific accelerators in the literature. In addition, OPU shows 5.35× better power efficiency compared with a Titan Xp. For a real-time cascaded CNN scenario, OPU is 2.9× faster than the edge-computing GPU Jetson TX2, which has a similar amount of computing resources.

The proposed overlay processor OPU has the following features:

  1. CPU/GPU-Like User-Friendliness: As shown in Fig. 1, the CNN network is compiled into instructions. This is done once per network. The instructions are then executed by OPU, which is implemented on the FPGA and fixed for all networks. The CNN algorithm developer does not need to deal with the FPGA implementation.
  2. Domain-Specific Instruction Set: Our instructions have an optimized granularity, smaller than that in [17] to ensure high flexibility and computational-unit efficiency, yet much larger than CPU/GPU instructions to reduce compiler complexity.
  3. FPGA-Based High-Performance Microarchitecture: The microarchitecture is optimized for computation, data communication, and reorganization, all controlled by parameter registers set directly by instructions.
  4. Compiler With Comprehensive Optimization: Independent of the microarchitecture, operation fusion merges or concatenates closely related operations to reduce computation time and data-communication latency, and data quantization is conducted to save memory and computation resources. Tied to the microarchitecture, the compiler explores multiple degrees of parallelism to maximize throughput by slicing the target CNN and mapping it onto the overlay architecture (a slicing sketch follows the figure below).
OPU workflow.
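
The slicing mentioned in feature 4 can be pictured with a small, purely illustrative tiling routine. The buffer size, data width, and row-wise tiling below are assumptions for illustration only, not the compiler's actual algorithm:

```python
# Illustrative sketch of network slicing: tile a layer's input feature map so that
# each slice fits an assumed on-chip input buffer. All sizes here are hypothetical.

def slice_feature_map(h, w, c, bytes_per_elem, buffer_bytes):
    """Return (row_start, row_end) slices whose rows fit in the on-chip buffer."""
    row_bytes = w * c * bytes_per_elem
    rows_per_slice = max(1, buffer_bytes // row_bytes)
    return [(y, min(y + rows_per_slice, h)) for y in range(0, h, rows_per_slice)]

# e.g. a 224x224x64 layer with 2-byte elements and a 512 KB input buffer
print(slice_feature_map(224, 224, 64, 2, 512 * 1024))
```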

Instruction Set Architecture

The instruction set architecture (ISA) is the key to a processor. Our OPU is specific to CNN inference. We identify all the necessary operations during CNN inference and group them into categories; each category maps to one type of instruction with adjustable parameters for flexibility. Our instructions are 32 bits long and have complicated functions and variable runtimes (up to hundreds of cycles). CNN inference can be executed by OPU without a general-purpose processor such as a CPU.

There are two types of instructions: conditional instructions (C-type) and unconditional instructions (U-type).
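
As a hedged illustration of what a fixed-length 32-bit instruction with adjustable parameters might look like, the sketch below decodes a word into a type flag, an opcode, and a parameter payload. The field widths and layout are invented for illustration and do not reflect the actual OPU encoding:

```python
# Hypothetical decoding of a 32-bit OPU-style instruction word.
# Field widths below are illustrative assumptions, not the paper's encoding.

def decode(word: int):
    """Split a 32-bit word into an assumed (type, opcode, payload) triple."""
    assert 0 <= word < (1 << 32)
    is_conditional = (word >> 31) & 0x1    # assumed 1-bit C-type / U-type flag
    opcode = (word >> 26) & 0x1F           # assumed 5-bit operation category
    payload = word & 0x03FF_FFFF           # remaining bits: parameters or trigger condition
    return ("C-type" if is_conditional else "U-type"), opcode, payload

print(decode(0x8400_0123))  # ('C-type', 1, 291)
```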

Microarchitecture

Another challenge in the OPU design is the overlay microarchitecture. It needs to incur as little control overhead as possible while remaining easily adjustable at runtime and preserving functionality. We design our modules to be parameter-customizable and to switch modes at runtime based on parameter registers that directly accept parameters from instructions. The computation engine exploits multiple levels of parallelism that generalize well across different kernel sizes. Moreover, CNN operations in the same category are reorganized and combined so they can be handled by the same module, reducing overhead.

Overview of the microarchitecture.
(a) Conventional intra-kernel parallelism. (b) OPU input-output channel-based parallelism. (c) FM data-fetch pattern of OPU.
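
The input-output channel-based parallelism of panel (b) can be sketched in software as follows: each "cycle" of the PE array multiplies ICP input-channel values of one feature-map pixel by the weights for OCP output channels, so the parallelism is independent of the kernel size. The parallelism factors and layer shape below are illustrative assumptions, and the loop nest is only a functional model, not the RTL:

```python
import numpy as np

ICP, OCP = 16, 16                        # assumed input / output channel parallelism
H, W, C_in, C_out, K = 8, 8, 32, 64, 3   # assumed layer shape

fm = np.random.rand(H, W, C_in)             # input feature map
wt = np.random.rand(K, K, C_in, C_out)      # weights
out = np.zeros((H - K + 1, W - K + 1, C_out))

for y in range(H - K + 1):
    for x in range(W - K + 1):
        for ky in range(K):
            for kx in range(K):
                for ci in range(0, C_in, ICP):        # input-channel tile
                    for co in range(0, C_out, OCP):   # output-channel tile
                        # one PE-array step: ICP x OCP MACs in parallel
                        out[y, x, co:co + OCP] += (
                            fm[y + ky, x + kx, ci:ci + ICP]
                            @ wt[ky, kx, ci:ci + ICP, co:co + OCP]
                        )
```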

The overlay microarchitecture is decomposed into six main modules, following the instruction set architecture definition.

Compiler

We develop a compiler that performs operation fusion, network slicing, and throughput optimization on the input CNN configuration. The compiler operates in two stages, Translation and Optimization, as shown in the figure below. Translation extracts the necessary information from the model definition files and reorganizes it into a uniform intermediate representation (IR) that we define. During this process, operation fusion (introduced in Section V-A) is conducted to combine closely related operations.

Two-stage flow of the compiler.
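
As one hedged example of the kind of operation fusion performed during Translation, the sketch below folds a batch-normalization layer into the preceding convolution's weights and bias so the pair executes as a single layer. The paper's exact fusion rules are those of Section V-A; this is only a generic illustration of the idea:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel BN parameters into conv weights w (K, K, C_in, C_out)
    and bias b (C_out,), so conv followed by BN becomes a single conv."""
    scale = gamma / np.sqrt(var + eps)      # per-output-channel multiplier
    return w * scale, (b - mean) * scale + beta
```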

Experiment Results

We implement three OPU versions with different MAC counts on Xilinx XC7K325T and XC7Z100 FPGAs. The corresponding resource utilization is given in Table III. For OPU1024, all MACs are implemented with DSPs. For OPU2048 and OPU4096, part of the MACs are implemented with LUTs since there are not enough DSPs. A PC with a Xeon 5600 CPU runs our compiler. The result interface and device are shown in the figure below.

Evaluation board and runtime results for classification network VGG16 and detection network YOLO.

Network Description

To evaluate the performance of OPU, nine CNNs of different architectures are mapped: YOLOv2, tiny-YOLO, VGG16, VGG19, Inception v1/v2/v3, ResNetv1-50, and ResNetv1-101.

Runtime MAC Efficiency

OPU is designed as a domain-specific processor for a variety of CNNs. Therefore, runtime MAC efficiency (RME) across different CNNs is an important metric for both hardware efficiency and performance. RME is the actual throughput achieved at runtime divided by the theoretical roof throughput (TTR) of the design.
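
Written out, and assuming the common convention that one MAC counts as two operations when TTR is expressed in GOPS (an assumption about how TTR is computed, not a definition from the paper):

```latex
\mathrm{RME} \;=\; \frac{T_{\text{actual}}}{\mathrm{TTR}},
\qquad
\mathrm{TTR} \;=\; 2 \times N_{\mathrm{MAC}} \times f_{\mathrm{clk}}
```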

Comparison With Existing FPGA Accelerators

In this section, we compare the performance of OPU with auto-compiler-generated network-specific accelerators. Table VI lists customized accelerators designed for VGG16 or YOLO, implemented on FPGAs of similar scale for a fair comparison. We use throughput and RME as the comparison criteria.

Performance comparison of OPU implemented with a similar number of MACs as the reference designs.

A direct comparison of throughput without taking the number of utilized MACs into consideration is not fair, so we scale our design to match the MAC count of each reference design. As shown in the figure, the blue dots represent the simulated performance of OPU implemented with different numbers of MACs running VGG16.
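
A back-of-the-envelope version of this scaling, under the assumption that RME stays roughly constant as the MAC array grows, is shown below; the clock frequency and RME values are illustrative only, not measured results:

```python
# Hypothetical scaling model: throughput grows with MAC count and clock frequency
# when RME is held constant.

def scaled_throughput_gops(n_mac, f_clk_mhz, rme):
    return 2 * n_mac * f_clk_mhz * 1e-3 * rme   # 2 ops per MAC; result in GOPS

print(scaled_throughput_gops(1024, 200, 0.91))  # e.g. ~373 GOPS with illustrative numbers
```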

Power Comparison

We compare the power efficiency of OPU with other FPGA designs as well as a GPU and a CPU. Table VII lists the comparison results for different hardware platforms running VGG16. We measure the power consumption of OPU using a PN2000 electricity usage monitor. The reported power includes the static and dynamic power consumed by the whole board.

Power efficiency (GOPS/W) comparison of CPU/GPU/OPU using CPU as a baseline.

Case Study of Real-Time Cascaded Networks

To further evaluate the real-time performance of OPU on cascaded networks, we implement a task on OPU1024 to recognize car license plates in street-view pictures. It is composed of three networks: car-YOLO (trained on YOLO for car detection), plate-tiny-YOLO (trained on YOLO for plate detection), and a character recognition network (cr-network). For a single input picture, the car-YOLO network runs first to label all cars. Then, plate-tiny-YOLO and cr-network run to detect the plate and recognize its characters for each car.
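
The cascade can be summarized with the following control-flow sketch; the network calls and crop() are placeholder stubs standing in for the three models and a cropping step, not the paper's API:

```python
# Control-flow sketch of the cascaded license-plate pipeline described above.

def car_yolo(image):        return []   # stub: would return car bounding boxes
def plate_tiny_yolo(image): return []   # stub: would return plate bounding boxes
def cr_network(image):      return ""   # stub: would return the recognized characters
def crop(image, box):       return image

def recognize_plates(picture):
    plates = []
    for car_box in car_yolo(picture):               # stage 1: detect all cars
        car = crop(picture, car_box)
        for plate_box in plate_tiny_yolo(car):      # stage 2: detect plates per car
            plates.append(cr_network(crop(car, plate_box)))  # stage 3: read characters
    return plates
```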

As given in the table, we compare the performance of OPU1024 and the Jetson TX2. The TX2 runs with batch = 5, and its speed is computed as the total time from input to output divided by 5. OPU is faster on all three networks; overall, OPU is 2.9× faster than the Jetson. With similar computation capability, the higher speed of OPU comes from the higher PE utilization rate enabled by our domain-specific architecture and compiler.

References

C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “Cnp: An FPGA-based processor for convolutional networks,” in Proc. Int. Conf. Field Program. Log. Appl., Aug./Sep. 2009, pp. 32–37.

K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accelerating deep convolutional neural networks using specialized hardware,” Microsoft Res., Redmond, WA, USA, White Paper 2, 2015.

S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf, “A programmable parallel accelerator for learning and classification,” in Proc. 19th Int. Conf. Parallel Archit. Compilation Techn. (PACT), Sep. 2010, pp. 273–284.

S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically configurable coprocessor for convolutional neural networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, pp. 247–257, 2010.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 2818–2826.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

Conclusion and Discussion

In this article, we proposed OPU, a domain-specific FPGA-based overlay processor targeting general CNN acceleration. It is software-programmable with a short compilation time. We developed a set of instructions with granularity optimized for PE efficiency. Our compiler performs operation fusion to reduce inference time and network slicing to maximize overall throughput and RME. OPU exhibits high flexibility and generality across a variety of CNNs with an average RME of around 91%. It achieves a higher RME on VGG16 and YOLO than state-of-the-art network-specific, auto-compiler-generated accelerators. In terms of power, OPU at different scales shows 1.2×–5.35× better power efficiency compared with GPUs (batch = 1, 16, and 64) and other FPGA designs. Moreover, for cascaded CNNs detecting car license plates, OPU is 2.9× faster than the edge-computing GPU Jetson TX2 with a similar amount of computing resources. Our future work will develop a better microarchitecture and more compiler optimization mechanisms, extend OPU to both CNNs and recurrent neural networks (RNNs), and apply and optimize OPU for different deep learning applications, particularly 3-D medical images.
