Get new podcast episodes faster with Breaker

Sandstorm · Published in Breaker · Dec 5, 2017 · 5 min read

If you use Breaker to subscribe to podcasts, you’ll now get new episodes 6–8 times faster than just a month ago. In this post, I’ll explain the challenges of refreshing podcast feeds, why new episode delivery used to be slower, and how we made the whole process faster using Sandstorm.

Breaker is now often significantly faster than Apple Podcasts at delivering new episodes.

Until recently, Breaker only fetched new episodes once per hour. Breaker users understandably wanted new episodes to show up faster than that. After I reached out to Erik at Breaker seeking general startup advice for Sandstorm, Erik realized that Breaker’s problem was a perfect fit for Sandstorm’s solution. So, we set out to shorten the new-episode fetch interval from one hour to ten minutes.

The challenge of refreshing podcast feeds

Podcasts are distributed via RSS feeds, published somewhere on the Internet. RSS is a flavor of XML, used for the syndication of content on the web. Originally, RSS was used to publish articles on a blog. Over time, RSS came to be used for any kind of syndicated content. Even Twitter had an RSS feed of each user’s tweets!

When the RSS 2.0 spec added the <enclosure> tag, it made it possible to attach a file to a feed entry—just as you might attach a file to an email. People started attaching MP3 (and other audio-formatted) files to feed entries. That’s how podcasting was born.
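To make that concrete, here is a minimal sketch of decoding an entry with an <enclosure> tag using Go’s standard encoding/xml package (Go becomes relevant later in this post). The sample feed snippet and struct fields are illustrative, not Breaker’s actual data model:

    package main

    import (
        "encoding/xml"
        "fmt"
    )

    // Enclosure mirrors RSS 2.0's <enclosure> tag: a file "attached" to an entry.
    type Enclosure struct {
        URL    string `xml:"url,attr"`
        Length int64  `xml:"length,attr"`
        Type   string `xml:"type,attr"`
    }

    // Item is one feed entry; for a podcast, the enclosure points at the audio file.
    type Item struct {
        Title     string    `xml:"title"`
        GUID      string    `xml:"guid"`
        Enclosure Enclosure `xml:"enclosure"`
    }

    func main() {
        raw := `<item>
          <title>Episode 42</title>
          <guid>ep-42</guid>
          <enclosure url="https://example.com/ep42.mp3" length="12345678" type="audio/mpeg"/>
        </item>`

        var item Item
        if err := xml.Unmarshal([]byte(raw), &item); err != nil {
            panic(err)
        }
        fmt.Println(item.Title, "->", item.Enclosure.URL)
        // Output: Episode 42 -> https://example.com/ep42.mp3
    }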

RSS feed readers, including podcast-listening apps, have to deliberately check for new content in feeds. They have to poll because there is no push in the RSS specification. The whole architecture of the web works this way: a site publishes some content; separately, clients (browsers, feed readers, podcast apps, etc.) come along and say “hey, can I get a copy of that content, please?” The inverse, push, would be far too expensive to do at scale: a website would need to know all of its subscribers and send each of them a notification every time it published new content. Instead, for podcasts, apps do the heavy lifting.

Like all podcast apps, Breaker has to periodically poll every RSS feed on the known Internet to check for new episodes. The check is a matter of comparing the last episode Breaker knows about in a given feed with the latest episodes according to the canonical website. If there are new ones (or updated ones), Breaker fetches the episode metadata (title, description, cover art, etc.) and saves it to its servers. At that point, a series of events kicks off. Most importantly to users, Breaker sends them notifications for the new episodes, and those episodes start downloading on their devices.
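In a hedged Go sketch, reusing the Item type from above and assuming (as is conventional for RSS) that items are ordered newest-first, the comparison boils down to:

    // newEpisodes returns the items that appeared after the last episode we
    // already know about. It assumes items are ordered newest-first, which is
    // the convention for RSS feeds.
    func newEpisodes(items []Item, lastKnownGUID string) []Item {
        var fresh []Item
        for _, it := range items {
            if it.GUID == lastKnownGUID {
                break // everything from here on is already in the database
            }
            fresh = append(fresh, it)
        }
        return fresh
    }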

Why new episodes used to be slower

Breaker uses Heroku for a lot of its server-side infrastructure. Heroku provides a service called Scheduler which is like Cron, but simpler. One of the constraints of Scheduler is that it can only be set to run at fairly coarse increments: every 10 minutes, once per hour, or once per day.

The job that was originally built to refresh all of the feeds was taking longer than 10 minutes to complete. So, while the first run was still going, the second run would get added to the Redis queue. This would repeat until Redis filled up, ran out of memory, and crashed. When Redis crashed, three things failed: new episode fetching, email sending, and notification pushing. So, at the time, the easiest and fastest fix was to increase the Scheduler interval from 10 minutes to 1 hour. It was a compromise, but it kept Breaker operational.

How we made Breaker faster with Sandstorm

Sandstorm is a network of compute resources for doing distributed computing. Sandstorm works best for work that can be split up into small jobs and run independently — work that is highly parallelizable. Fortunately, this Breaker job was a perfect match for Sandstorm. Here’s how it works.

Breaker gave us their list of a few hundred thousand podcast feed URLs and told us which RSS fields (title, artist, artwork, etc.) were important to them. We also set up a way for Breaker to ping Sandstorm whenever new podcast feed URLs are added to the Breaker database.

Originally, this background job was written in Ruby. Although we (at Breaker and Sandstorm) love Ruby, we decided to rewrite the feed fetching/parsing job in Go for its speed and inexpensive parallelism.

In the Sandstorm network, we process every block of data (in this case, podcast URLs) separately, running the same function on each block. This makes Breaker a perfect fit for Sandstorm, since each feed has to be polled separately and repeatedly. It’s also a good fit because some podcast servers are very slow to respond, so most of the time is spent waiting on network I/O.

Sandstorm fills up its Redis job queue with all feed URLs and the GUID of the last known episode from each feed. We also spin up a fleet of computers to do the actual work. Each of those computers is configured to launch over a thousand goroutines as Sandstorm workers. Each goroutine only works on one podcast feed URL at a time.
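Here is a rough sketch of that fan-out. A channel stands in for the Redis queue so the example is self-contained; the worker count and job shape are illustrative, and processFeed is sketched in the next section:

    package main

    import "sync"

    // job pairs a feed URL with the GUID of the last episode we saw in it.
    type job struct {
        FeedURL       string
        LastKnownGUID string
    }

    func main() {
        jobs := make(chan job, 1024) // stand-in for the Redis job queue

        var wg sync.WaitGroup
        for i := 0; i < 1000; i++ { // "over a thousand" goroutines per machine
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := range jobs {
                    _ = processFeed(j) // one feed URL at a time per goroutine
                }
            }()
        }

        // Fill the queue with every known feed URL and last-seen GUID.
        jobs <- job{FeedURL: "https://example.com/feed.xml", LastKnownGUID: "ep-41"}
        close(jobs)
        wg.Wait()
    }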

The Sandstorm worker (sketched in Go after this list):

  • fetches the feed URL
  • checks for new entries in the feed
  • saves the new episode metadata
  • posts a JSON blob of podcast metadata to a webhook on the Breaker server
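In a hedged Go sketch, reusing the job, Item, and newEpisodes definitions from above, those steps look like the following; the webhook URL and payload shape are made up for illustration, not Breaker’s actual API:

    import (
        "bytes"
        "encoding/json"
        "encoding/xml"
        "net/http"
    )

    // processFeed runs one worker pass over a single feed.
    func processFeed(j job) error {
        // 1. Fetch the feed URL.
        resp, err := http.Get(j.FeedURL)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        // 2. Check for new entries in the feed.
        var feed struct {
            Items []Item `xml:"channel>item"`
        }
        if err := xml.NewDecoder(resp.Body).Decode(&feed); err != nil {
            return err
        }
        fresh := newEpisodes(feed.Items, j.LastKnownGUID)
        if len(fresh) == 0 {
            return nil // nothing new in this feed
        }

        // 3. Save the new episode metadata (elided here), and...
        // 4. ...post a JSON blob to the Breaker webhook (URL is hypothetical).
        payload, err := json.Marshal(map[string]interface{}{
            "feed_url": j.FeedURL,
            "episodes": fresh,
        })
        if err != nil {
            return err
        }
        _, err = http.Post("https://example.com/breaker-webhook",
            "application/json", bytes.NewReader(payload))
        return err
    }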

The Breaker server then checks whether that episode already exists in its database. If so, no duplicate is created. If not, the episode is saved, and notifications are sent to users subscribed to that episode’s feed.
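Breaker’s server side isn’t written in Go, but the idempotency check amounts to a find-or-create keyed on the episode GUID. Roughly, with a map standing in for the database:

    // handleNewEpisode is an illustrative sketch of the webhook's idempotency:
    // episodes are keyed by GUID, so a redelivered webhook never creates duplicates.
    func handleNewEpisode(guid string, seen map[string]bool) {
        if seen[guid] {
            return // already saved; drop the duplicate silently
        }
        seen[guid] = true // save the episode (a database insert in reality)
        // ...then fan out push notifications to the feed's subscribers.
    }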

After the job queue is processed all the way down, we fill it back up and start all over again. The whole process takes less than ten minutes, typically only about seven. That’s the power of distributed computing for highly parallelizable tasks.

If you’d like to get new episodes faster than ever before, try out Breaker today!
