Tell Me Something I Don’t Know
Enriching and analyzing podcast metadata using Jupyter Notebooks
I enjoy listening to podcasts, but I can never keep up with all the episodes released on a daily or weekly basis. So my backlog is growing. A lot.
With about 800 episodes waiting to be heard, it’d take many binge listening sessions to catch up. And therein lies my dilemma: too many choices.
How do I attack the backlog? Listen to the episodes in chronological order? In reverse chronological? I settled on neither approach and instead opted to go with “whatever suits my interest/mood of the day”. Which created another problem: what are all these episodes about?
A podcast for every mood
Here’s how I went about tackling this nice little data engineering challenge using a simple Python notebook.
Why did I use a notebook? I’m not a data scientist and don’t aspire to be one. However, I do like to understand the tools they use and approaches they take. The best way to do that is to walk in their shoes.
Tuning in
Parser libraries such as Python’s feedparser make it easy to ingest web syndication feeds. Given a feed URL as input, such as WNYC’s Radiolab (http://feeds.wnyc.org/radiolab), these libraries retrieve and parse the feed, providing the raw metadata I needed to catalog each feed item. In this project, each web feed item describes a podcast episode; however, other feeds might describe other media types, such as a blog post or a video log.
For each feed item I’ve extracted five pieces of metadata: episode title, publication date, episode summary, tags, and URL.
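Here is a minimal sketch of that extraction step, not the notebook’s actual code. The function names are made up for illustration, and I’m assuming the episode URL lives in the entry’s `link` field (some feeds only carry it in an enclosure). feedparser entries behave like dictionaries, so the mapping function can be tried with a plain dict as well.

```python
def extract_episode(entry):
    """Reduce one parsed feed entry to the five fields listed above."""
    return {
        "title": entry.get("title", ""),
        "published": entry.get("published", ""),
        "summary": entry.get("summary", ""),
        # feedparser exposes tags as a list of dicts with a "term" key;
        # feeds without episode-specific tags simply yield an empty list.
        "tags": [tag["term"] for tag in entry.get("tags", []) if tag.get("term")],
        # Assumption: "link" holds the episode URL.
        "url": entry.get("link", ""),
    }


def load_episodes(feed_url):
    """Fetch and parse a feed, returning one dict per episode."""
    # Third-party library (pip install feedparser); imported here so that
    # extract_episode can be used without it.
    import feedparser

    feed = feedparser.parse(feed_url)
    return [extract_episode(entry) for entry in feed.entries]
```

Calling `load_episodes("http://feeds.wnyc.org/radiolab")` would then return a list of episode dictionaries, one per feed item.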
Analyzing various feeds, I quickly realized that some do not include episode-specific tags, making it hard to implement faceted search or to automatically categorize episodes.
I’ve also noticed that some episode summaries mentioned people, companies, and other entities of potential interest. Thankfully, IBM Watson provides a service (with a free tier) that I could use to derive the desired information from the episode summaries.
A way with words
Watson’s Natural Language Understanding service provides a simple API that you can use to extract concepts, keywords, categories and much more from text snippets. Having settled on using Python notebooks, I installed the watson-developer-cloud Python SDK.
A REST API is also available, as are SDKs for other languages, such as Node.js, Swift, and Java.
To extract the desired information from the episode summary (or the title if no summary was present), I passed the following payload to the /analyze API endpoint:
- Text snippet to be analyzed. The snippet can include HTML, which is convenient because some feeds do not provide plain text summaries.
- Names of the text analysis features to be applied. I selected Categories, Entities, and Keywords, but more are available. (Refer to the pricing plan for important details, and check which features are supported for each language.)
- Language of the text, encoded in ISO 639-1. If no language is specified, the service attempts to identify it by analyzing the text. Since I had mixed detection results for very short text snippets (four words or fewer), I explicitly passed in the feed’s language.
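The three payload parts above can be sketched as a plain dictionary. This is an illustration of the request body as I understand the REST API’s JSON format (`html`, `features`, `language`), not the SDK call used in the notebook; the helper names are hypothetical, and trimming a feed language such as `en-us` down to its two-letter code is my own assumption.

```python
def snippet_for(episode):
    """Use the episode summary, falling back to the title if there is none."""
    return episode.get("summary") or episode.get("title", "")


def build_analyze_payload(snippet, language=None):
    """Assemble a request body for the /analyze endpoint."""
    payload = {
        # "html" rather than "text", because some summaries contain markup.
        "html": snippet,
        # An empty options object requests a feature with its defaults.
        "features": {"categories": {}, "entities": {}, "keywords": {}},
    }
    if language:
        # Assumption: reduce values like "en-us" to the ISO 639-1 code "en".
        payload["language"] = language.split("-")[0].lower()
    return payload
```

Leaving `language` out of the payload lets the service attempt its own detection, which is the behavior described in the last bullet.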
The response includes the requested information, along with a relevance rating.
Applying some basic parsing to the response, I’ve appended the new metadata to each episode item.
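A rough sketch of that parsing step follows. The field names reflect my understanding of the service’s JSON response (categories carry a `label` and a `score`, keywords and entities a `text` and a `relevance`, entities also a `type`), so treat them as assumptions and check them against an actual response. `enrich_episode` is a made-up name.

```python
def enrich_episode(episode, response):
    """Return a copy of the episode with the analysis results attached."""
    enriched = dict(episode)
    enriched["categories"] = [
        {"label": c["label"], "score": c.get("score")}
        for c in response.get("categories", [])
    ]
    enriched["keywords"] = [
        {"text": k["text"], "relevance": k.get("relevance")}
        for k in response.get("keywords", [])
    ]
    enriched["entities"] = [
        {"text": e["text"], "type": e.get("type"), "relevance": e.get("relevance")}
        for e in response.get("entities", [])
    ]
    return enriched
```

Returning a copy instead of modifying the episode in place keeps the raw feed metadata around, which is handy when re-running the analysis with different features.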
Painting a picture with words (and bar charts)
With this information in place, it’s now easy to visually explore the podcast. For illustrative purposes, the notebook plots separate charts for tags, categories, and keywords.
Since each tag, category, and keyword is associated with one or more episodes (and their URLs), it’s now much easier to find something entertaining, thoughtful, or whatever the mind craves at any given time.
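The lookup from a tag, category, or keyword to its episodes can be sketched as a small inverted index, with the bar chart drawn from the counts. This is a simplified illustration rather than the notebook’s code: the function names are hypothetical, and I’m assuming each facet is stored on the episode as a plain list of strings.

```python
from collections import Counter, defaultdict


def index_by_facet(episodes, facet):
    """Map each facet value (e.g. a tag) to the URLs of its episodes."""
    index = defaultdict(list)
    for episode in episodes:
        # dict.fromkeys removes duplicates while keeping their order.
        for value in dict.fromkeys(episode.get(facet, [])):
            index[value].append(episode["url"])
    return dict(index)


def top_counts(index, n=10):
    """Return the n most frequent facet values with their episode counts."""
    return Counter({value: len(urls) for value, urls in index.items()}).most_common(n)


def plot_counts(counts, title):
    """Draw a horizontal bar chart from (value, count) pairs."""
    if not counts:
        return
    # Third-party library (pip install matplotlib), only needed for plotting.
    import matplotlib.pyplot as plt

    labels, values = zip(*counts)
    plt.barh(labels, values)
    plt.title(title)
    plt.tight_layout()
    plt.show()
```

For example, `plot_counts(top_counts(index_by_facet(episodes, "tags")), "Tags")` would chart the most common tags, while the index itself answers “which episodes are about this?”.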
So, this leaves one last question…
Who is Neil deGrasse Tyson?
For each episode, Watson’s entity analysis feature identified the entities mentioned in the summary. Each returned entity carries its detected type, such as Person, Company, Organization, or Location.
Taking advantage of Wikipedia’s search capabilities, I’ve added some code to the notebook that sends a search request for each entity with a type of interest (e.g. Person or Organization) and evaluates the returned HTTP status code.
If Wikipedia responds with HTTP code 200, the exact search term was not found.
If Wikipedia responds with HTTP code 302, the exact search term was found. However, the redirect URL may or may not contain the expected result, as illustrated in the following two examples:
Wikipedia search result for an ambiguous Wikipedia entry "Jad".
While this simple approach is not very reliable, it typically provides a meaningful starting point for subsequent searches. The response is recorded, and information about the entity is therefore only “a click away”.
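The status code check described above could look roughly like the following sketch. The search URL (`/wiki/Special:Search?search=...` on en.wikipedia.org), the User-Agent string, and the function names are my assumptions for illustration, not necessarily what the notebook sends. I use Python’s `http.client` here because it does not follow redirects on its own, so the 302 remains visible.

```python
import http.client
from urllib.parse import urlencode

WIKIPEDIA_HOST = "en.wikipedia.org"  # assumption: English Wikipedia


def interpret_search_response(status, location=None):
    """Translate the HTTP status code into a lookup result."""
    if status == 302:
        # Exact term found; the redirect target may still be a
        # disambiguation page rather than the intended article.
        return {"exact_match": True, "url": location}
    if status == 200:
        # A regular search results page: no exact match.
        return {"exact_match": False, "url": None}
    return {"exact_match": None, "url": None}  # anything else: unknown


def search_wikipedia(term):
    """Send one search request without following the redirect."""
    query = urlencode({"search": term})
    connection = http.client.HTTPSConnection(WIKIPEDIA_HOST, timeout=10)
    try:
        connection.request(
            "GET",
            "/wiki/Special:Search?" + query,
            headers={"User-Agent": "podcast-notebook-example/0.1"},
        )
        response = connection.getresponse()
        return interpret_search_response(
            response.status, response.getheader("Location")
        )
    finally:
        connection.close()
```

Storing the returned URL with the entity is what keeps the information “a click away”, even when the target turns out to be a disambiguation page.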
Parting words
The data engineering approach I’ve outlined provides input that could be useful in other scenarios. For example, one can create podcast profiles and build a recommendation engine for other episodes or other podcasts.
I invite you to explore this notebook using your favorite podcast web syndication feed. Maybe you’ll learn something you didn’t know. Follow the instructions in this GitHub repository to get started.