Waveforms are hard! Lots of long-range dependencies. You really need to build in a prior, some notion of what waves are; otherwise consuming the signal byte by byte is horribly hard work for any RNN. CNNs do not automatically help either, but a useful hack is to translate the waveform into a form where the spatial dimensions actually mean something, like, say, a spectrogram. Then a CNN's built-in bias (the assumption that closer things are more related) turns into a bias useful for understanding waves.
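To make the trick concrete, here is a minimal numpy sketch of the waveform-to-spectrogram step: slice the signal into overlapping windowed frames and take the magnitude FFT of each, so one axis is time and the other frequency, and 2-D convolutions have a sensible neighbourhood to exploit. This is a toy illustration, not Spotify's pipeline; a real system would apply a mel filterbank on top (e.g. via librosa) and log-compress the result.

```python
import numpy as np

def spectrogram(wave, n_fft=256, hop=128):
    """Turn a 1-D waveform into a 2-D (frequency, time) magnitude
    spectrogram: overlapping Hann-windowed frames, FFT per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of the real signal.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq, time)

# A pure 440 Hz tone at 8 kHz: its energy should pile up in one
# frequency band, which is exactly the locality a CNN can exploit.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_hz = spec.mean(axis=1).argmax() * sr / 256  # bin width = sr / n_fft
```

With `n_fft=256` the frequency resolution is 31.25 Hz per bin, so `peak_hz` lands within one bin of 440 Hz; a steady note becomes a bright horizontal stripe, and a CNN's local filters can pick up on such structures directly.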
Sander Dieleman, an ex-intern at Spotify (now at Google DeepMind), wrote a blog post explaining his approach, which probably made it into Discover Weekly after he left.
Since then, there have been many improvements in RNNs/CNNs which probably make them better at consuming waveforms directly. Maybe Spotify have implemented some of them, but I suspect not; I think they still rely on CNNs over mel spectrograms.