Anomaly detection in podcasting
By Mariusz Strzelecki
Being a Data Engineer at Acast is not only about developing data pipelines and maintaining data processes. It’s also about understanding how the data flows and looking for patterns that allow the business to grow, turning that data into knowledge so we can respond quickly to clients’ needs.
We collect massive data streams from our internal tools and CDN. These are joined together, processed, and used in a number of internal applications. Sometimes we’re asked, “Hey, do you see anything strange in the listening patterns for show X?” and we have to run a set of queries on Athena to make sure the mp3 files are being delivered as they should, and that there are no unusual spikes in traffic.
But we wondered if we could automate checks like this by using a tool that would constantly monitor shows for us. And we did it.
Listening patterns
Podcast data is driven by server logs. There’s a huge variety of applications you can use to listen to episodes, but there’s no standard for reporting that an episode was played, or skipped after the first 10 seconds.
Acast offers podcast apps for web, mobile, and smart speakers, and we can collect first-party insights into listening behaviours happening within them. However, looking at listens coming from all other podcast apps, the only useful information is a set of lines in the server logs showing HTTP requests made by each app for parts of the file. We parse these logs and link them together, and based on this say whether the episode was listened to or not. (If you’re interested in the detail, I recommend our article on Server-side metrics).
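To give a rough idea of what that linking looks like, here’s a minimal sketch that groups range requests per listener and episode and counts a listen once enough of the file has been delivered. The field names and the byte threshold are illustrative assumptions, not our actual server-side logic:

```python
from collections import defaultdict

# Hypothetical "enough of the file was delivered" threshold
MIN_BYTES_FOR_LISTEN = 1_000_000

def count_listens(parsed_log_lines):
    """Group HTTP range requests by (listener, episode) and count listens."""
    bytes_per_session = defaultdict(int)
    for line in parsed_log_lines:
        # Each parsed line is assumed to look like:
        # {"listener_id": "...", "episode_id": "...", "bytes_sent": 123456}
        key = (line["listener_id"], line["episode_id"])
        bytes_per_session[key] += line["bytes_sent"]

    # A session counts as a listen once enough of the mp3 was delivered
    return sum(
        1 for total in bytes_per_session.values()
        if total >= MIN_BYTES_FOR_LISTEN
    )
```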
When we review the number of listens for any podcast episode soon after publication, the charts typically look like this:
As we would expect, shortly after the episode is published, it gains a lot of attention — then over time the number of listens plateaus. And we observe daily and weekly patterns.
The chart on the right shows an episode targeted at US listeners and published late in the evening. It had some listens at the beginning, then a drop, before suddenly, at around 3am, Apple devices started auto-downloading episodes in the background, causing a small spike.
On some of the shows in the “daily news” category, we observe a slightly different trend. The episode is listened to for several hours after publication but, after 24 hours, when the next episode comes out, the number of listens to the original drops to nearly zero per hour.
After a few experiments, it was clear this data representation would not suit all our needs for anomaly detection. So we tried another model. Instead of looking at the number of listens each hour, we took the running total of that value over time (its definite integral). That gave us the following representations:
We started to call it the cumulative number of listens after publication. The above charts look like a graphical representation of a logarithmic function. Even the shape (which is different for the “storytelling” show on the left and the “news” show on the right) can be adjusted by just changing the base of the logarithm:
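In code, this transformation is just a running total of the hourly counts. A minimal sketch with made-up hourly numbers:

```python
import pandas as pd

# Hypothetical listens per hour since publication
hourly_listens = pd.Series([900, 400, 250, 180, 150, 130, 120, 110])

# The "cumulative number of listens after publication" is simply
# the running total of the hourly counts
cumulative_listens = hourly_listens.cumsum()
print(cumulative_listens.tolist())
# [900, 1300, 1550, 1730, 1880, 2010, 2130, 2240]
```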
Hello, Logarithm
After a few experiments with logarithmic function parameter fitting using SciPy’s curve_fit (a minimal fitting sketch follows the list below), we concluded the following:
- The episodes of a single show have a similar listening pattern (the same shape as the listening curve)
- Sometimes on storytelling shows, the shape is slightly different for episodes such as “trailer” or “Christmas special”
- Depending on the time the episode is released, the first few hours may see a slight shift compared to the normally expected shape
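As a rough illustration of that fitting step, here’s a minimal sketch using scipy.optimize.curve_fit on hypothetical cumulative data; the model form and starting parameters are assumptions, not our production code:

```python
import numpy as np
from scipy.optimize import curve_fit

def log_model(hours, scale, rate):
    """Logarithmic listening curve: scale * ln(1 + rate * hours).

    Tweaking `rate` bends the curve, which is equivalent to changing
    the base of the logarithm (up to a rescaling of `scale`).
    """
    return scale * np.log1p(rate * hours)

# Hypothetical cumulative listens per hour since publication
cumulative = np.array([900, 1300, 1550, 1730, 1880, 2010, 2130, 2240])
hours = np.arange(1, len(cumulative) + 1)

# Fit the curve parameters to the observed cumulative listens
params, _ = curve_fit(log_model, hours, cumulative, p0=[1000.0, 1.0])
expected = log_model(hours, *params)
```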
As a result of this analysis, we constructed two more curves showing lower and upper thresholds. Every day we scan the listens of Acast’s top 1,000 shows and, for all the new data, check whether the cumulative number of listens crosses the thresholds we’ve set.
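Continuing the sketch above, the daily check could look like the following; the ±20% band is a made-up placeholder, since the real thresholds are derived per show:

```python
# Hypothetical band around the expected curve from the previous sketch;
# the real thresholds are tuned per show rather than a flat +/-20%.
lower_threshold = 0.8 * expected
upper_threshold = 1.2 * expected

below = cumulative < lower_threshold   # possible delivery problem
above = cumulative > upper_threshold   # possible unexpected spike

if below.any() or above.any():
    print("Anomaly detected for this episode")
```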
Results time
Every day this gives us several reports on listening anomalies. We can manually distinguish several classes:
- Trendy topics
- Shows gaining audience
- Behaviour we can’t understand (yet)
We need to engage with our colleagues and our network of podcasters to understand these cases, because some listener behaviour can’t be explained just by looking at the numbers.
Integration
The application that scans the listens for anomalies runs on AWS Fargate, just after the data processing has finished. If there’s an anomaly we should be notified of, the process sends a notification to an open Slack channel that all Acast employees can join. This is also a place for discussion and comments.
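A minimal sketch of that notification step, using Slack’s incoming-webhook API with a placeholder webhook URL and a hypothetical message format:

```python
import json
import urllib.request

# Placeholder webhook URL; Slack incoming webhooks accept a JSON payload
# with a "text" field.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_slack(show, episode, kind):
    """Post an anomaly report to the open Slack channel."""
    message = f":rotating_light: {kind} anomaly detected for {show} / {episode}"
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```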
Summary
Listeners who subscribe to a show in their app are usually notified quickly about new episode releases, and tend to have the same listening behaviour when we look at variables like time of day, with people listening on the way to work, on the way back home, or during evening workouts, for example. As a result, the episodes of a given show have very similar listening patterns, allowing us to quickly spot anything that goes beyond “normal”.
Anomaly detection brings a lot of value to podcasting. It’s useful not only to spot issues with content delivery (when the lower threshold is crossed) or short-term spikes when a podcast is shared on social media (when the upper threshold is crossed); looking at the charts also tells us which topics are popular and which shows are gaining popularity episode after episode.