cuSignal 0.13 — Entering the Big Leagues and Focused on Screamin’ Streaming Performance

Published in RAPIDS AI on Mar 24, 2020

Authors
Adam Thompson, Matt Nicely, Graham Markall — NVIDIA
John Ferguson, Dan Bryant, Peter Witkowski — Deepwave Digital

What’s New in cuSignal?

Last fall, the cuSignal development team excitedly announced that cuSignal had become the newest member of RAPIDS, the NVIDIA-led GPU-accelerated data science project. If you are late to the party, cuSignal is a GPU-accelerated version of the popular SciPy Signal library that aims to bring GPU-class speed to common signal processing functions.

Since then, we’ve been hard at work bringing the codebase up to community and project standards, engaging with the user community, fixing bugs, adding features, building docs, and generally making stuff faster.

Notably, in cuSignal v0.13, we’ve added:

Anaconda installation support — If you’re a conda user on a Linux system, simply type:

# Stable - Coming soon in RAPIDS version >= 0.13
conda install -c rapidsai -c conda-forge cusignal

# Nightly - Available now
conda install -c rapidsai-nightly -c conda-forge cusignal

Enabling an aarch64 build — While Jetson users will still need to build the library from source, we’re actively working on enabling an aarch64 build. Windows users will need to continue to build from source.
GPU-accelerated acoustics — If you’re interested in source separation or speech signal processing, we’ve added support for both the real and complex cepstrum.
Speed, speed, speed — If you’ve been following us on Twitter, we overhauled our Polyphase Resampler (resample_poly/upfirdn) to leverage raw CuPy CUDA kernels rather than Numba CUDA kernels. This change results in a ~2x speedup and is enabled by default.
Full integration into RAPIDS — cuSignal follows the overall project’s coding, documentation, project planning, and CI/CD standards.
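To make the cepstrum item above concrete, here is a minimal CPU sketch of the textbook definition of the real cepstrum: the inverse transform of the log-magnitude spectrum. It is only an illustration and not cuSignal's GPU implementation; the helper names `naive_dft` and `real_cepstrum_reference` are hypothetical, and the O(N²) DFT is used purely to keep the example dependency-free.

```python
import cmath
import math


def naive_dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(
            x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n)
        )
        # The inverse transform carries the 1/N normalization.
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum_reference(x):
    """Real cepstrum: real part of IDFT(log|DFT(x)|).

    Assumes the spectrum has no exact zeros, since log(0) is undefined.
    """
    log_magnitude = [math.log(abs(v)) for v in naive_dft(x)]
    return [v.real for v in naive_dft(log_magnitude, inverse=True)]


# A unit impulse followed by a half-amplitude echo three samples later.
signal_with_echo = [0.0] * 16
signal_with_echo[0] = 1.0
signal_with_echo[3] = 0.5
cepstrum = real_cepstrum_reference(signal_with_echo)
```

In this toy signal the echo delay shows up as a peak at index 3 of `cepstrum`, which is why the cepstrum is handy for echo detection and source separation.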

Online Signal Processing with cuSignal

If you’ve taken the time to browse the example notebooks, you may have thought, “Geez, these speedups are compelling, but they’re all for really large signals; I don’t have 100 billion samples to process!”

We hear you.

Many signal processing workflows occur online, meaning that samples are streamed from a sensor to the GPU as the application is running. In order to reduce the probability of data loss during streaming, small buffers are used to facilitate data I/O from the sensor to the data processor. cuSignal implements a zero-copy buffer between CPU and GPU to reduce the overhead of this data movement.

Interfacing directly with a sensor has long been viewed as the domain of an FPGA; while FPGAs have significant value propositions with respect to ultra-low latency and deterministic performance, they’re often difficult to program. By enabling signal processing workflows to be built from Python, cuSignal drastically flattens the learning curve for building real-time solutions. Using cuSignal 0.13 and its refactored polyphase resampler, one of the RAPIDS contributors, Deepwave Digital, was able to achieve real-time performance on their AIR-T, a GPU-enabled software defined radio (SDR), at 62.5 MSPS of complex64 samples, for a total throughput of about 3.7 Gibit/s (4 Gbps).
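The throughput figure follows from the sample rate and the sample width. As a quick check, assuming complex64 means two 32-bit floats per sample, the product is 4.0e9 bit/s, which comes to roughly 3.7 when expressed in units of 2^30 bit/s. The variable names below are ours, chosen for illustration.

```python
# Sample rate quoted for the AIR-T demonstration.
SAMPLE_RATE_SPS = 62.5e6

# complex64 stores a 32-bit real part and a 32-bit imaginary part.
BITS_PER_SAMPLE = 2 * 32

bits_per_second = SAMPLE_RATE_SPS * BITS_PER_SAMPLE

# The same rate with decimal (10^9) and binary (2^30) prefixes.
rate_gbps = bits_per_second / 1e9
rate_gibps = bits_per_second / 2**30

print(f"{rate_gbps:.2f} Gbit/s, {rate_gibps:.2f} Gibit/s")
```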

Addressing Streaming Signal Processing Challenges with Deepwave Digital’s AIR-T

Image of Deepwave Digital’s Artificial Intelligence Radio Transceiver (AIR-T)

Deepwave Digital’s Artificial Intelligence Radio Transceiver (AIR-T) leverages the embedded NVIDIA Jetson TX2 GPU and has a tunable radio front end that can receive and transmit signals anywhere in the 300 MHz to 6 GHz range. Because Jetson devices share one memory space between the Arm CPU and the GPU, zero-copy data movement can be used to reduce latency compared to traditional GPU-accelerated SDR platforms, making the AIR-T an ideal platform for streaming signal processing applications.

Comparing workflows of an SDR with Discrete GPUs and Deepwave Digital’s AIR-T

cuSignal on the AIR-T

Deepwave and the RAPIDS cuSignal team have been working closely together to improve the online performance of cuSignal, focusing specifically on the polyphase resampler. In the video below, we demonstrate the real-time performance of resample_poly on the AIR-T, showing an 8x speedup over the embedded Arm CPU, all from Python with an almost identical API.

Video showing performance of resample_poly on Deepwave Digital’s AIR-T
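For readers new to polyphase resampling, the operation behind resample_poly/upfirdn can be described as three steps: insert zeros to upsample, apply an FIR filter, and keep every down-th output sample. The pure-Python sketch below is a slow reference for that definition, not cuSignal's CUDA kernel; the function name `upfirdn_reference` is hypothetical. Skipping the inserted zeros inside the inner loop is the basic idea that polyphase implementations exploit.

```python
def upfirdn_reference(h, x, up=1, down=1):
    """Upsample x by `up`, filter with FIR taps h, downsample by `down`.

    Equivalent to convolving the zero-stuffed signal with h and keeping
    every `down`-th sample of the full convolution.
    """
    upsampled_len = (len(x) - 1) * up + 1
    full_len = upsampled_len + len(h) - 1
    out = []
    for n in range(0, full_len, down):
        acc = 0
        for k, tap in enumerate(h):
            m = n - k  # index into the zero-stuffed signal
            # Only multiples of `up` hold real samples; the rest are zeros.
            if m >= 0 and m % up == 0 and m // up < len(x):
                acc += tap * x[m // up]
        out.append(acc)
    return out


# Upsampling by 2 with a two-tap box filter repeats each sample.
held = upfirdn_reference([1, 1], [1, 2, 3], up=2, down=1)

# Downsampling by 2 with a pass-through filter keeps every other sample.
decimated = upfirdn_reference([1], [1, 2, 3, 4], up=1, down=2)
```

With `up=16` and `down=25`, the same structure converts a 62.5 MSPS stream to 40 MSPS; a production kernel would parallelize the outer loop across GPU threads rather than iterating in Python.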

For more information, including step-by-step instructions for getting cuSignal installed on the AIR-T, read the tutorial here. Below is a quick look at the simple Python code to implement the improved GPU polyphase resampler on streaming data.

import simplesoapy, cupy
import cusignal as signal
from matplotlib import pyplot as plt

buffer_size = 2**19 # Number of complex samples per transfer
fs = 62.5e6 # Sample rate

# Create polyphase filter
fc = 1. / max(16, 25) # cutoff of FIR filter (rel. to Nyquist)
nc = 10 * max(16, 25) # reasonable cutoff for our sinc-like function
win = signal.fir_filter_design.firwin(2*nc+1, fc, window=('kaiser', 0.5))
win = cupy.asarray(win, dtype=cupy.float32)

# Init buffer and polyphase filter
buff = signal.get_shared_mem(buffer_size, dtype=cupy.complex64)
s = signal.resample_poly(buff, 16, 25, window=win, use_numba=False)

# Initialize the AIR-T receiver
sdr = simplesoapy.SoapyDevice(sample_rate=fs, channel=1, auto_gain=True)
sdr.freq = 1350e6 # Set receiver frequency
sdr.start_stream(buffer_size=buffer_size)

# Run test
sdr.read_stream_into_buffer(buff)
s = signal.resample_poly(buff, 16, 25, window=win, use_numba=False)
sdr.stop_stream()

# Plot signals
plt.figure(figsize=(7, 5))
plt.subplot(211)
plt.psd(cupy.asnumpy(buff), Fs=fs, Fc=sdr.freq, NFFT=16384)
plt.ylim((-160, -75))
plt.title('Before Filter')
plt.subplot(212)
plt.psd(cupy.asnumpy(s), Fs=fs*16/25, Fc=sdr.freq, NFFT=16384)
plt.ylim((-160, -75))
plt.title('After Filter')
plt.show()
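The constants in the listing above imply a concrete real-time budget, which the short calculation below works out. It assumes the resampler's output length follows the SciPy convention of ceil(n * up / down); that convention is our assumption about cuSignal, not something stated in this post.

```python
BUFFER_SIZE = 2**19      # complex samples per transfer
FS = 62.5e6              # input sample rate in samples per second
UP, DOWN = 16, 25        # resampling ratio used in the listing

# Time the radio needs to fill one buffer; to keep up in real time,
# processing one buffer has to finish within this window.
buffer_duration_s = BUFFER_SIZE / FS

# Sample rate after resampling by UP/DOWN.
output_rate_sps = FS * UP / DOWN

# Output samples per buffer, assuming ceil(n * up / down).
output_len = -(-BUFFER_SIZE * UP // DOWN)

# FIR length from the listing: 2 * nc + 1 with nc = 10 * max(UP, DOWN).
num_taps = 2 * 10 * max(UP, DOWN) + 1

print(f"{buffer_duration_s * 1e3:.3f} ms per buffer")
print(f"{output_rate_sps / 1e6:.1f} MSPS out, {output_len} samples, {num_taps} taps")
```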

Final Thoughts

We’ve been humbled and amazed by the growth of and interest in cuSignal. As always, open-source projects are driven by each of you, so don’t hesitate to use our GitHub repository to ask questions, file feature requests, submit PRs, and report bugs. Further, Deepwave Digital is hosting an online webinar on March 25th that dives into more detail about the AIR-T and cuSignal integration. You may register for this event here, and it will be available for streaming afterward.

Be sure to join the conversation on Twitter as well! The #cusignal hashtag is a good way to keep up with the latest and greatest GPU-accelerated signal processing conversations.

We know these are troubling and uncertain times. From the cuSignal, RAPIDS, and Deepwave Digital teams, we’re wishing you and yours happiness, safety, and health.