How To Implement A Real-time Human Detection Application at the Edge Using Zynq UltraScale+ MPSoC Device

Adam Taylor
7 min read · Jun 5, 2020


The ability to perform real-time, low-latency and deterministic processing at the edge is increasingly important for a range of applications, from autonomous vehicles to vision guided robotics and intelligent surveillance systems.

Processing at the edge is required for four main reasons: availability, latency, security and determinism. Should wireless communications be used to reach a cloud service where the processing is performed, the connection to that service cannot be guaranteed: you may experience service outages and/or signal blackspots created by buildings or natural vegetation. Processing and decision making for sensitive data in the cloud will also increase the latency and decrease the determinism of the response, making it unsuitable for real-time, safety-critical decisions.

Edge processing addresses the availability, latency and determinism challenges. However, it presents a challenge of its own: the computational power available at the edge is normally much lower than that available in the cloud.

In this article, we will address the low-power and high-performance challenges of an edge processing system by implementing a real-time human detection application using a Zynq UltraScale+ MPSoC device on an Aldec TySOM-3A-ZU19EG embedded development board.

Zynq UltraScale+ MPSoC Heterogeneous System on Chip (SoC) Device

Using a heterogeneous System on Chip (SoC) device, like the Zynq UltraScale+ MPSoC, enables the user to address the challenges of implementing low-latency, deterministic processing at the edge. Unlike traditional processor-based solutions, heterogeneous SoCs are divided into two elements: a processing system, which contains high-performance ARM processor cores, and programmable logic, which provides structures based on the latest Xilinx FPGA fabric.

The flexibility provided by the programmable logic IO structures enables the implementation of any-to-any interfacing. This frees the designer from the IO constraints enforced by application-specific standard product (ASSP) devices. For example, thanks to the programmable logic's IO flexibility, several MIPI interfaces can be implemented, as both TX and RX, supporting a range of data rates, lane counts and data types. The flexible implementation of this logic enables the solution to be upgraded as sensor technology evolves.

Outside of the IO structures, the massively parallel nature of the programmable logic itself enables the implementation of a true image processing pipeline. Because the pipeline is implemented entirely within the programmable logic, there is no need to use DDR memory to store intermediate processing stages. This reduces latency and increases determinism, as there is no competition for access to a shared system resource such as DDR.

When developing solutions for a heterogeneous SoC we can leverage embedded Linux running on the processing system side of the device, while development of the programmable logic can take advantage of vendor-provided IP and, increasingly, high-level synthesis (HLS), which allows C/C++ to be implemented in programmable logic.
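To illustrate the kind of function that suits this approach, the sketch below shows a simple pixel-streaming threshold stage in Python. It is purely illustrative, not vendor code: in practice the equivalent C/C++ function would be synthesized by HLS into a pipelined logic block that handles one pixel per clock with no external frame buffer.

```python
def threshold_stream(pixels, level=128):
    """Binarize a stream of 8-bit grayscale pixels, one at a time.

    Models a streaming pipeline stage: strictly sequential access,
    one pixel in and one pixel out, no random access and no DDR buffer.
    """
    for p in pixels:
        yield 255 if p >= level else 0

# A tiny example "stream" of grayscale pixel values.
frame = [30, 200, 128, 90]
binarized = list(threshold_stream(frame))
```

Stages like this compose naturally: the output generator of one stage feeds the input of the next, mirroring how HLS-generated blocks are chained in the programmable logic fabric.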

One of the more exciting total system solutions is the use of system-optimizing compilers such as SDSoC and Vitis, which enable functions to be accelerated from the processing system into the programmable logic leveraging the OpenCL framework. This enables system architects to further exploit the programmable logic to accelerate algorithms. This is especially true for accelerating deep neural network applications, thanks to the Xilinx Deep Learning Processor Unit (DPU).

Implementation

One example which demonstrates the capabilities of the Zynq UltraScale+ MPSoC device at the edge is implementing a real-time human detection application of the kind used in surveillance cameras, ADAS, smart cameras, etc.

Such applications require the ability to configure and receive images from a camera, process those images and run convolutional neural networks (CNNs) on the processed frames. The processed images can then be output over HDMI and written to solid-state memory. To demonstrate the algorithmic performance on programmable logic quickly and easily, Aldec's TySOM-3A-ZU19EG development board can be used along with FMC-ADAS (for capturing camera images) and FMC-NVMe (for recording the processed / raw data). These boards enable the development team to hit the ground running. The following image shows an overview of the design example for this blog.

Accelerating your design in this manner enables you to demonstrate early progress to internal and external project stakeholders.

The implementation of a real-time edge processing application requires significant resources to achieve the desired performance. The TySOM-3A-ZU19EG, which features the largest programmable logic of the Zynq UltraScale+ MPSoC devices, provides designers with over 500K LUTs and 1M flip-flops, exactly what is needed to implement such a challenging application. In addition to the largest Zynq UltraScale+ MPSoC device, the TySOM-3A-ZU19EG provides a wide range of peripherals such as USB 3.0, USB 2.0, Ethernet, QSFP+, mPCIe, DisplayPort, HDMI in/out, Wi-Fi and Bluetooth, SATA, CAN, Pmod, UART, JTAG and FMC HPC connectors.

Video capture is performed by the FMC-ADAS, which includes five FPD-Link III connectors, frequently used for ADAS and bird's-eye-view applications. A base platform implementing the image processing path can then be designed using Vivado; a crucial part of this path is making the video stream available to the processing system by storing frames in the processing system's DDR memory. Higher-level frameworks such as GStreamer can then be used by the processing system to work with the image stream.
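As a sketch of how the processing system might consume those frames, the snippet below assembles a GStreamer pipeline description string. The element names, device node and caps are illustrative assumptions only; the actual pipeline depends on the capture drivers and plugins in the embedded Linux image.

```python
def build_capture_pipeline(device="/dev/video0", width=1280, height=720, fps=30):
    """Assemble a gst-launch-style pipeline description string.

    Hypothetical example: v4l2src captures from the frame grabber,
    caps fix the resolution and frame rate, videoconvert handles pixel
    format, and appsink hands frames to application code.
    """
    elements = [
        f"v4l2src device={device}",
        f"video/x-raw,width={width},height={height},framerate={fps}/1",
        "videoconvert",
        "appsink name=sink",
    ]
    # GStreamer links pipeline elements with " ! " separators.
    return " ! ".join(elements)

pipeline = build_capture_pipeline()
```

The resulting string could be passed to `gst-launch-1.0` on the target, or to `Gst.parse_launch()` from application code, assuming the named plugins are present.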

To provide the NVMe SSD interface, Aldec's FMC-NVMe is used. It provides four NVMe interfaces, and the cards can be stacked up to eight deep, enabling the use of up to 32 SSDs. Aldec provides a detailed solution page for NVMe data storage using this FMC daughter card.

The PCIe capabilities of the programmable logic are used to perform reads and writes. Drivers for the PCIe access to the NVMe will be provided by the embedded Linux operating system.
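From the application's point of view, once the Linux NVMe driver has enumerated the drive, recording frames is ordinary file I/O. The sketch below is a minimal illustration of that idea, with a hypothetical recording helper; it writes to any binary sink, and on a real file one would also fsync to push data through to the SSD.

```python
import io

def record_frames(frames, sink):
    """Append raw frame bytes to a writable binary sink; return bytes written.

    Illustrative helper, not part of the reference design. On a real file
    opened on the NVMe-backed filesystem, follow the flush with
    os.fsync(sink.fileno()) to force the data out of the page cache.
    """
    written = 0
    for frame in frames:
        written += sink.write(frame)
    sink.flush()
    return written

# Demonstration against an in-memory sink (no hardware required).
demo_sink = io.BytesIO()
n = record_frames([b"frame-one", b"frame-two"], demo_sink)
```

In the real system the sink would be a file on a filesystem mounted over the NVMe block device exposed by the kernel driver.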

To display the output image on an HDMI monitor, the processing system DDR memory is accessed by the Xilinx Video Mixer IP block. This block can mix up to four video layers, ideal for overlaying graphics on the output video, for example frame-rate information and bounding boxes around detected objects.
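To make the overlay idea concrete, here is a small Python illustration of drawing a bounding-box outline into a grayscale framebuffer. This only models the visual result: in the actual design the overlay is a separate video layer composited by the Video Mixer hardware, not a CPU pixel loop.

```python
def draw_box(frame, x, y, w, h, value=255):
    """Draw a w-by-h rectangle outline in-place on a 2D list-of-lists frame.

    (x, y) is the top-left corner; interior pixels are left untouched,
    mimicking a transparent overlay layer with an opaque outline.
    """
    for col in range(x, x + w):
        frame[y][col] = value              # top edge
        frame[y + h - 1][col] = value      # bottom edge
    for row in range(y, y + h):
        frame[row][x] = value              # left edge
        frame[row][x + w - 1] = value      # right edge
    return frame

# Demonstration on a tiny 8x8 frame of zeros.
demo = [[0] * 8 for _ in range(8)]
draw_box(demo, 1, 1, 4, 3)
```

In the real pipeline the detection results from the DPU would supply the (x, y, w, h) coordinates for each detected person.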

This Vivado hardware design can be combined with an embedded PetaLinux solution targeting the TySOM board to create an acceleration platform for use with SDSoC or Vitis. This platform can then be reused across several projects to implement software-accelerated solutions ranging from embedded vision applications to deep learning, as in this example.

Using this acceleration platform in SDSoC or Vitis, it is possible to accelerate neural network inference. This acceleration implements Xilinx DPUs within the programmable logic when the application is compiled. There are several different instantiations of the Xilinx DPU, depending on the required resource utilization and performance goals. For this example, two of the largest and highest-performance DPUs were implemented. Both are capable of 4096 operations per clock and can be clocked at 300 MHz. The TySOM-3A-ZU19EG enables us to implement these high-performance DPU architectures thanks to the high capacity of programmable logic provided by the board's Zynq UltraScale+ MPSoC ZU19EG device.
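The figures above give a simple peak-throughput calculation: two DPU cores, each performing 4096 operations per clock at 300 MHz. Note this is peak, not sustained, compute.

```python
# Peak-throughput arithmetic for the DPU configuration described above.
OPS_PER_CLOCK = 4096          # operations per clock per DPU core
CLOCK_HZ = 300_000_000        # 300 MHz DPU clock
NUM_DPUS = 2                  # two DPU instances in the design

peak_ops_per_second = OPS_PER_CLOCK * CLOCK_HZ * NUM_DPUS
peak_tops = peak_ops_per_second / 1e12   # tera-operations per second
# roughly 2.46 TOPS of peak compute across the two DPUs
```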

The application software then executes, taking advantage of the platform's image processing pipeline to capture and display images, while the neural network that processes the image frames for people detection runs on the Xilinx DPUs in the programmable logic. This enables a 720P image stream to be processed, with people detected within it, at 30 FPS, while simultaneously writing to the NVMe SSD.
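A quick back-of-the-envelope check shows why the NVMe write path comfortably keeps up with this stream. The 2 bytes-per-pixel figure below is an assumption (a packed YUV 4:2:2 style format); the actual pixel format in the reference design may differ.

```python
# Data-rate arithmetic for recording a 720p, 30 FPS stream to the SSD.
WIDTH, HEIGHT, FPS = 1280, 720, 30
BYTES_PER_PIXEL = 2   # assumption: 16-bit packed YUV 4:2:2

pixels_per_second = WIDTH * HEIGHT * FPS
bytes_per_second = pixels_per_second * BYTES_PER_PIXEL
mb_per_second = bytes_per_second / 1e6
# about 55 MB/s, well within the bandwidth of an NVMe SSD over PCIe
```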

When the project was implemented using SDSoC, only about one third of the TySOM-3A's total logic resources were used.

This project demonstrated the capabilities of using a heterogeneous SoC to create edge processing applications which require low latency and determinism.

The use of Aldec’s TySOM-3A-ZU19EG embedded development kit along with the company’s FMC-ADAS and FMC-NVMe boards enabled us to de-risk the implementation of the algorithms and gave us the ability to start development early.

SDSoC enabled us to leverage the acceleration platforms to implement deep learning applications using only a software development environment.

The reference design for this human detection at the edge example was provided by Aldec and is available to TySOM users. For further information, please contact Aldec.


Adam Taylor

Adam Taylor is an expert in the design and development of embedded systems and FPGAs for several end applications. He is the founder of Adiuvo Engineering Ltd.