Anomaly detection in podcasting

Published in Acast Tech Blog · 5 min read · Apr 6, 2020

By Mariusz Strzelecki

Being a Data Engineer at Acast is not only about developing information pipelines and maintaining data processes. It’s also about understanding data flow and looking for patterns that allow the business to grow — turning that data into knowledge to quickly respond to clients’ needs.

We collect massive data streams from our internal tools and CDN. These are joined together, processed and used in a number of internal tools. Sometimes we’re asked, “Hey, do you see anything strange in the listening patterns for show X?” and we have to perform a set of queries on Athena to ensure the mp3 files are being delivered as they should, and that there are no unusual spikes in traffic.

But we wondered if we could automate checks like this by using a tool that would constantly monitor shows for us. And we did it.
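To make the idea of an automated check concrete, here is a minimal sketch of a spike detector over hourly listen counts. It is only an illustration, not a description of the tool we built: the function name `find_spikes`, the 24-hour window, and the threshold of 5 are all assumptions chosen for the example. It compares each hour against the median of the preceding window and uses the median absolute deviation (MAD) as a measure of spread, so that a single outlier barely shifts the baseline.

```python
from statistics import median


def find_spikes(hourly_listens, window=24, threshold=5.0):
    """Return indexes of hours whose count is far above the recent median.

    `window` and `threshold` are illustrative defaults, not tuned values.
    """
    spikes = []
    for i in range(window, len(hourly_listens)):
        history = hourly_listens[i - window:i]
        center = median(history)
        # Median absolute deviation: a spread estimate that outliers barely move.
        mad = median(abs(x - center) for x in history)
        # 1.4826 makes the MAD comparable to a standard deviation for
        # normally distributed data; max() guards against a zero spread.
        scale = max(1.4826 * mad, 1.0)
        if (hourly_listens[i] - center) / scale > threshold:
            spikes.append(i)
    return spikes


# Four quiet six-hour cycles followed by one suspicious hour.
counts = [100, 104, 98, 101, 97, 103] * 4 + [102, 900, 99]
flagged_hours = find_spikes(counts)
```

A plain rolling baseline like this knows nothing about time of day or day of week, so on real listening data it would probably need a seasonal baseline to avoid flagging ordinary peaks.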

Listening patterns

Podcasting’s data is driven by server logs. There’s a huge variety of applications you can use to listen to episodes, but there’s no standard for reporting that an episode was played or skipped after the first 10 seconds.

Acast offers podcast apps for web, mobile, and smart speakers, and we can collect first-party insights into the listening behaviour happening within them. However, for listens coming from all other podcast apps, the only useful information is a set of lines in the server logs showing the HTTP requests each app made for parts of the file. We parse these logs, link them together, and on that basis decide whether the episode was listened to or not. (If you’re interested in the detail, I recommend our article on Server-side metrics.)
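The linking step can be sketched roughly as follows. This is a simplified illustration, not our production logic: the `RangeRequest` fields, the `client_id` key, and the `min_fraction` threshold are all hypothetical, and a real rule for deciding what counts as a listen is more involved. The sketch groups partial-file requests by client and episode, merges overlapping byte ranges so re-requested bytes are not counted twice, and counts a listen when enough of the file was fetched.

```python
from collections import defaultdict
from typing import NamedTuple


class RangeRequest(NamedTuple):
    """One parsed log line: a client asked for bytes [start, end] of a file."""
    client_id: str   # hypothetical key, e.g. derived from IP and user agent
    episode_id: str
    start: int       # first byte requested (inclusive)
    end: int         # last byte requested (inclusive)


def merged_byte_count(ranges):
    """Count distinct bytes covered by possibly overlapping inclusive ranges."""
    total = 0
    covered_to = -1  # highest byte index already counted
    for start, end in sorted(ranges):
        if end <= covered_to:
            continue  # fully inside what was already counted
        total += end - max(start, covered_to + 1) + 1
        covered_to = end
    return total


def count_listens(requests, file_sizes, min_fraction=0.1):
    """Link requests per (client, episode) and apply an assumed threshold."""
    grouped = defaultdict(list)
    for r in requests:
        grouped[(r.client_id, r.episode_id)].append((r.start, r.end))

    listens = defaultdict(int)
    for (_client_id, episode_id), ranges in grouped.items():
        fraction = merged_byte_count(ranges) / file_sizes[episode_id]
        if fraction >= min_fraction:
            listens[episode_id] += 1
    return dict(listens)


requests = [
    RangeRequest("client-a", "ep1", 0, 499),
    RangeRequest("client-a", "ep1", 250, 999),  # overlaps the first request
    RangeRequest("client-b", "ep1", 0, 99),     # fetched only a tiny part
]
listens = count_listens(requests, file_sizes={"ep1": 5000})
```

Here `client-a` fetched 1,000 distinct bytes of a 5,000-byte file and is counted, while `client-b` fetched only 100 bytes and is not.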

When we review the number of listens for any podcast episode soon after publication, the charts typically look like this:

As we would expect, shortly after the episode is published, it gains a lot of attention — then over time the number of listens plateaus. And we observe daily and weekly patterns.

The chart on the right shows an episode targeted at US listeners and published late in the evening. It had some listens in the beginning, then there was a drop before suddenly, at around 3am, Apple devices started the auto download of episodes in the background — meaning a…