DinoPlusAI: Latency Optimized AI Processor for the 5G-enabled Network Edge
By Dawn Xie & Jay Hu @ DinoPlusAI
AI applications demand ultra-low latency responsiveness
Many AI applications require close to real-time responsiveness. For these applications, ultra-fast responsiveness is critical: it can make or break the user experience, or it is needed to meet mandatory safety requirements.
For example, gaming and AR or VR media streaming require close to real-time responsiveness for a satisfying user experience. In VR, it is understood that headset users experience nausea when the latency between a change in eye position and the corresponding video update exceeds 20 ms; ideally, VR latency would be no more than 7 ms, if not zero. When latency is in the single digits, VR users can immerse themselves and play high-impact games. It’s like playing soccer or basketball in real time, rather than watching passively from the sideline. When AI capability is integrated into AR or VR media streaming, such as real-time video or audio recognition that triggers command control or ad content insertion, the associated AI processing needs to be ultra-fast and close to real-time as well.
Ultra-low latency also significantly benefits manufacturing, remote operation and medicine. With ultra-low latency, ideally in the single-digit millisecond range, it becomes possible to transmit manufacturing data from multiple sources to a single controller. That controller then makes instant decisions and sends instructions back to the machinery to adjust, all wirelessly.
Ericsson, together with the Fraunhofer Institute for Production Technology, ran an experiment on this in the production of blade-integrated disks (blisks), which are used, for example, in jet engines, where precision and accuracy are vital. The experiment showed that sub-1 ms latency is essential for a very fast control loop in manufacturing. Such a loop allows the generation of a digital twin, a virtual reflection of the component being produced, and makes it possible to detect and correct deviations before they become severe.
Autonomous driving is also all about low latency. Road conditions need to be captured, sensed and processed in real time. Vehicle-to-vehicle communication needs to be instant for vehicles to react in time. All of this must also be ultra-reliable, so that autonomous driving does not break down and drivers can trust and rely on it.
Low latency is the key enabler for these and many more emerging applications. They become viable only when they can offer their users close to real-time responsiveness.
Achieving AI ultra-low latency through 5G-enabled Network Edge
One approach to achieving real-time responsiveness for these latency-sensitive applications is to run the processing locally on the device. Local processing eliminates the latency of moving data across the network, so the applications are more likely to meet their latency and responsiveness requirements.
However, this approach drives up processing complexity and device cost, which can make or break a business model. For example, autonomous cars are being built with massive compute hardware such as NVIDIA’s Drive PX Pegasus, which targets Level 5 automation and provides 320 TOPS of compute at 500 W of power, at an estimated cost of $15000. This assumes that all AI inference is done on the vehicle itself. Such compute capability adds a major cost to the vehicle, not to mention the cost of the cooling systems needed to manage 500 W of heat dissipation. This essentially keeps the design at the proof-of-concept stage and prevents it from moving to production.
5G has introduced ultra-reliable low latency communication (URLLC) and network architecture changes, which enable a second approach to running low-latency AI applications: the 5G-enabled network edge. 5G networks are being built to support three service categories:
· eMBB (Enhanced mobile broadband), which offers high-throughput internet access suitable for web browsing, video streaming, and virtual reality.
· mMTC (Massive machine type communication), which provides massive narrowband Internet connectivity for sensing, metering, and monitoring devices.
· URLLC (Ultra-reliable low latency communication), which provides services for latency-sensitive devices in applications like factory automation, autonomous driving, virtual reality and augmented reality media, and remote surgery. These applications require sub-millisecond latency with packet loss rates lower than 1 in 10⁵ packets.
In particular, latency refers to user plane and control plane latency. User plane latency is the contribution of the radio network to the time from when the source sends a packet to when the destination receives it. The targeted round-trip requirements for user plane latency are:
· 8 ms for eMBB
· 1 ms for URLLC
Control plane latency refers to the transition time from the most “battery efficient” state (e.g. Idle state) to the start of continuous data transfer (e.g. Active state). The targeted requirement for control plane latency is 10 ms.
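The targets above can be written down as a small lookup table and checked against measurements. The following Python sketch is illustrative only: the class, constant and function names are hypothetical, and the numbers are simply the targets quoted in this article (8 ms and 1 ms round trip, 10 ms control plane, fewer than 1 lost packet in 10⁵).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyTarget:
    """Round-trip user plane latency target for one 5G service category."""
    service: str
    round_trip_ms: float


# Targets as quoted in the text; all names here are illustrative.
USER_PLANE_TARGETS = {
    "eMBB": LatencyTarget("eMBB", 8.0),
    "URLLC": LatencyTarget("URLLC", 1.0),
}
CONTROL_PLANE_TARGET_MS = 10.0
URLLC_MAX_LOSS_RATE = 1e-5  # fewer than 1 lost packet in 10^5


def meets_user_plane_target(service: str, measured_round_trip_ms: float) -> bool:
    """True if a measured round trip is within the target for the service."""
    target = USER_PLANE_TARGETS[service]
    return measured_round_trip_ms <= target.round_trip_ms


def meets_urllc_loss_target(packets_lost: int, packets_sent: int) -> bool:
    """True if the observed loss rate is strictly below the URLLC bound."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    return packets_lost / packets_sent < URLLC_MAX_LOSS_RATE


# Example measurements (made-up values, for illustration only).
print(meets_user_plane_target("URLLC", 0.8))
print(meets_user_plane_target("URLLC", 1.2))
print(meets_urllc_loss_target(1, 1_000_000))
```

A measured URLLC round trip of 0.8 ms passes the check while 1.2 ms does not; a real deployment would take these measurements from the radio network, not hard-coded values.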
As 5G comes into the picture with its millisecond latencies and gigabit bandwidth, devices connecting to the edge cloud built on the 5G network have almost instantaneous access to the compute and storage resources there. 5G removes most of the latency introduced by the network, so the network is no longer the bottleneck. 5G will be the turning point where many applications can be shifted or offloaded from the cloud and the device to the edge of the network.
As 5G resolves the network latency bottleneck, reducing the latency introduced by compute and storage access becomes ever more critical to ensuring end-to-end, close to real-time application responsiveness. This calls for a new generation of edge cloud, powered by new hardware and software capabilities that significantly reduce compute, storage, and AI processing latency.
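To make the argument concrete, the sketch below works out how much time is left for compute once the network round trip is subtracted from an application's end-to-end budget. It is a simplified model, not a measurement: it assumes latency adds up as network plus other fixed costs plus compute, and it treats the 5G user plane targets quoted earlier as the entire network latency. The function and constant names are hypothetical.

```python
def compute_budget_ms(end_to_end_budget_ms: float,
                      network_round_trip_ms: float,
                      other_ms: float = 0.0) -> float:
    """Time left for compute/AI processing after network and other costs.

    Returns 0.0 when the network and other costs already use up the budget.
    """
    remaining = end_to_end_budget_ms - network_round_trip_ms - other_ms
    return max(remaining, 0.0)


# Figures quoted in this article.
VR_MAX_LATENCY_MS = 20.0    # above this, headset users experience nausea
VR_IDEAL_LATENCY_MS = 7.0   # ideal upper bound for VR latency
URLLC_ROUND_TRIP_MS = 1.0   # URLLC user plane round-trip target
EMBB_ROUND_TRIP_MS = 8.0    # eMBB user plane round-trip target

print(compute_budget_ms(VR_MAX_LATENCY_MS, URLLC_ROUND_TRIP_MS))    # 19.0
print(compute_budget_ms(VR_IDEAL_LATENCY_MS, URLLC_ROUND_TRIP_MS))  # 6.0
print(compute_budget_ms(VR_IDEAL_LATENCY_MS, EMBB_ROUND_TRIP_MS))   # 0.0
```

Under these assumptions, a URLLC link leaves about 6 ms of the ideal 7 ms VR budget for everything else, so compute latency becomes the dominant term to optimize; an eMBB round trip alone would already exceed that ideal budget.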
DinoPlusAI’s Latency-Optimized AI Processor for 5G-enabled Network Edge
DinoPlusAI is a startup specializing in building latency-optimized AI processors with upper-bounded ultra-low latency. According to Jay Hu, founder and CEO of DinoPlusAI, the Trex processor the company builds is the first in the industry designed from the ground up for latency optimization. The company takes a three-dimensional (3D) view when designing the Trex processor, emphasizing upper-bounded ultra-low latency in addition to throughput and low energy consumption. This makes the processor especially well suited to servers or systems that need to support AI applications with ultra-low latency inference.
The processor has demonstrated impressive performance. Jay Hu shared the following performance data during the Alchemist Accelerator demo day on May 16, 2019 (see video here). When processing the ResNet-50 model at 70K images per second, the processor cuts latency to 1/100 and power consumption to 1/6 of those of Nvidia’s V100 series. It processes 5000 frames per second at INT8 precision with batch 1 inference for the SSD MobileNet V1 model, compared to only 34 frames per second for an Nvidia GeForce GTX Titan X system. It can also process 4000 speech recognition streams simultaneously while consuming only 45 W, which represents a 13x performance improvement and a reduction in energy consumption to 1/6 compared to Nvidia’s Tesla V100.
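The quoted batch 1 throughput figures can be turned into rough per-frame times with simple arithmetic. The Python sketch below does this under a loud assumption: frames are processed strictly one at a time with no pipelining, so time per frame is the inverse of throughput. Real hardware may pipeline, in which case actual latency would differ. The function names are hypothetical, and the inputs are only the demo-day figures quoted above.

```python
def per_item_time_ms(items_per_second: float) -> float:
    """Average time per item in ms, assuming strictly serial processing."""
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    return 1000.0 / items_per_second


def throughput_ratio(new_rate: float, baseline_rate: float) -> float:
    """How many times higher new_rate is than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return new_rate / baseline_rate


# Figures quoted in the article (SSD MobileNet V1, INT8, batch 1).
TREX_FPS = 5000.0
TITAN_X_FPS = 34.0

# Speech recognition figures quoted in the article.
SPEECH_STREAMS = 4000
SPEECH_POWER_W = 45.0

print(f"Trex time per frame:    {per_item_time_ms(TREX_FPS):.2f} ms")
print(f"Titan X time per frame: {per_item_time_ms(TITAN_X_FPS):.2f} ms")
print(f"Throughput ratio:       {throughput_ratio(TREX_FPS, TITAN_X_FPS):.1f}x")
print(f"Power per stream:       {SPEECH_POWER_W / SPEECH_STREAMS * 1000:.2f} mW")
```

Under the serial assumption, 5000 frames per second corresponds to about 0.2 ms per frame, against roughly 29 ms per frame at 34 frames per second.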
DinoPlusAI’s nimble development team has also verified the Trex chipset’s performance against a large set of deep learning models, ranging from popular CNN models to LSTM models for speech recognition. This demonstrates the design flexibility built into the processor hardware, as well as into the compiler and SDK, which are future-proofed to support a large set of models.
Traditional CPUs and GPUs have inherent limitations in deep learning performance for low-batch, low-latency inference. This opens the door for a new generation of AI processors that are specially optimized for low latency and low power consumption at low-batch inference, while maintaining equivalent performance and programmability. Servers or systems built with processors like DinoPlusAI’s Trex chipsets will be able to support AI workloads that are extremely sensitive to latency. Installing these systems at the 5G-enabled network edge, which provides sub-1 ms network latency, or on the devices themselves, such as self-driving vehicles, will truly enable latency-critical applications, making applications that were not feasible in the 4G era a reality in 5G.