Are there any thoughts on why a CNN would be better than an LSTM for analyzing audio?
Chris Butler

Waveforms are hard! They have lots of long-range time dependencies. You really need to build in a prior, some notion of what waves are; otherwise consuming the signal byte by byte is horribly hard work for any RNN. CNNs do not automatically help, but a useful hack is to translate the waveform into a form where the spatial dimensions sort of make sense — like, say, a spectrogram. Then a CNN's built-in bias (assuming that closer things are more related) turns into a bias that is actually useful for understanding waves.
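To make the spectrogram trick concrete, here is a minimal sketch of turning a 1-D waveform into a 2-D time-frequency image with a short-time Fourier transform. The frame length, hop size, and the toy 440 Hz input are illustrative assumptions, not anything Spotify actually uses:

```python
import numpy as np

def spectrogram(wave, frame_len=256, hop=128):
    """Magnitude STFT: slices the waveform into overlapping windowed
    frames and takes the FFT of each, yielding a 2-D (time x frequency)
    array where nearby cells are acoustically related -- exactly the
    locality assumption a CNN exploits."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft gives frame_len // 2 + 1 frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy input (assumed): one second of a 440 Hz sine at an 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (time frames, frequency bins)
```

In the resulting array the tone shows up as energy concentrated in one frequency row across all time frames; a 2-D convolution sliding over that image sees locally coherent structure, which raw sample-by-sample input lacks.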

Sander Dieleman, an ex-intern at Spotify (now at Google's DeepMind), wrote a blog post explaining his approach, which probably made it into Discover Weekly after he left.

Since then, there have been many improvements in RNNs and CNNs that probably make them better at consuming waveforms directly. Maybe Spotify has implemented some of them, but I suspect not; I think they still rely on CNNs over mel spectrograms.
