Tell Me Something I Don’t Know

Enriching and analyzing podcast metadata using Jupyter Notebooks

I enjoy listening to podcasts, but I can never keep up with all the episodes released on a daily or weekly basis. So my backlog is growing. A lot.

My current backlog in Podcast Addict: Where do I start???

With about 800 episodes waiting to be heard, it’d take many binge listening sessions to catch up. And therein lies my dilemma: too many choices.

How do I attack the backlog? Listen to the episodes in chronological order? In reverse chronological? I settled on neither approach and instead opted to go with “whatever suits my interest/mood of the day”. Which created another problem: what are all these episodes about?

A podcast for every mood

Here’s how I went about tackling this nice little data engineering challenge using a simple Python notebook.

Enriching a web syndication feed using Watson Natural Language Understanding and Wikipedia.

Why did I use a notebook? I’m not a data scientist and don’t aspire to be one. However, I do like to understand the tools they use and approaches they take. The best way to do that is to walk in their shoes.

Tuning in

Parser libraries such as Python’s feedparser make it easy to ingest web syndication feeds. Providing a feed URL as input, such as WNYC’s Radiolab (http://feeds.wnyc.org/radiolab), these libraries retrieve and parse the feed, providing the raw metadata I needed to catalog each feed item. In this project, each web feed item describes a podcast episode; however, other feeds might describe other media types, such as a blog post or a video log.

For each feed item I’ve extracted five pieces of metadata: episode title, publication date, episode summary, tags, and URL.

Collecting basic syndication feed metadata in a Pandas DataFrame.
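
As a rough sketch of this step (not the notebook's actual code), the function below pulls the five fields from a single feed item. It assumes feedparser's normalized entry keys (title, published, summary, tags, link). The helper names extract_item and load_feed are made up for illustration, and the sample entry is invented.

```python
def extract_item(entry):
    """Return the five metadata fields for one feed item.

    `entry` is expected to behave like a feedparser entry (a dict-like
    object). Missing fields fall back to empty values.
    """
    # Each tag is a dict-like object; keep only tags that carry a term.
    tags = [t.get("term") for t in entry.get("tags", []) if t.get("term")]
    return {
        "title": entry.get("title", ""),
        "published": entry.get("published", ""),
        "summary": entry.get("summary", ""),
        "tags": tags,
        "url": entry.get("link", ""),
    }


def load_feed(feed_url):
    """Fetch and parse a feed, returning one dict per episode."""
    import feedparser  # third-party: pip install feedparser

    feed = feedparser.parse(feed_url)
    return [extract_item(entry) for entry in feed.entries]


# Invented entry, shaped like feedparser output, to show the result.
sample_entry = {
    "title": "Example episode",
    "published": "Mon, 01 Jan 2018 00:00:00 GMT",
    "summary": "<p>A short summary.</p>",
    "tags": [{"term": "science"}, {"term": None}],
    "link": "http://example.com/episode-1",
}
item = extract_item(sample_entry)
print(item["tags"])
```

The list returned by load_feed("http://feeds.wnyc.org/radiolab") is a list of plain dictionaries, so it can be passed directly to pandas.DataFrame to get one row per episode.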

Analyzing various feeds, I quickly realized that some do not include episode-specific tags, making it hard to implement faceted search or to automatically categorize episodes.

Sample podcast containing no episode-specific tags. (X-axis: number of episodes, color-coded by year.)

I’ve also noticed that some episode summaries mentioned people, companies, and other entities of potential interest. Thankfully, IBM Watson provides a service (with a free tier) that I could use to derive the desired information from the episode summaries.

A way with words

Watson’s Natural Language Understanding service provides a simple API that you can use to extract concepts, keywords, categories and much more from text snippets. Having settled on using Python notebooks, I installed the watson-developer-cloud Python SDK.

A REST API and other SDKs are available for various other languages, such as Node.js, Swift, and Java.

To extract the desired information from the episode summary (or the title if no summary was present), I passed the following payload to the /analyze API endpoint:

  • Text snippet to be analyzed. The snippet can include HTML, which is convenient because some feeds do not provide plain text summaries.
  • Names of the text analysis features to be applied. I selected Categories, Entities, and Keywords, but more are available. (Refer to the pricing plan for important details, and check which features are supported for each language.)
  • Language of the text, encoded in ISO 639-1. If no language is specified, the service attempts to identify it by analyzing the text. Since I’ve had some mixed detection results for very short text snippets (four words or fewer), I’ve explicitly passed in the feed’s language.
Calling the Watson Natural Language Understanding API
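
The three parts of the payload listed above can be sketched as a plain dictionary. This is an illustration rather than the notebook's code: the field names (text, html, features, language) reflect my understanding of the service's JSON request format and should be checked against the API reference, and build_analyze_payload is a hypothetical helper. Sending the request (credentials, service URL, API version) is left out on purpose.

```python
import json


def build_analyze_payload(snippet, language=None, is_html=False):
    """Assemble a request body for the /analyze endpoint.

    Assumption: the service accepts the snippet under "text" (plain
    text) or "html" (markup), and each feature as a key in "features".
    """
    payload = {
        # An empty dict requests a feature with its default options.
        "features": {"categories": {}, "entities": {}, "keywords": {}},
    }
    payload["html" if is_html else "text"] = snippet
    if language:
        # ISO 639-1 code, e.g. "en". Omit it to let the service detect it.
        payload["language"] = language
    return payload


# Invented summary text, used only to show the resulting structure.
payload = build_analyze_payload(
    "Neil deGrasse Tyson talks about microbiome science.", language="en"
)
print(json.dumps(payload, indent=2))
```

Passing the feed's language explicitly, as in the example call, avoids relying on detection for very short summaries.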

The response includes the requested information, along with a relevance rating.

Response for “Neil Degrasse Tyson and some new microbiome science help answer the question — when we touch greatness how much of it stays with us?”

Applying some basic parsing to the response, I’ve appended the new metadata to each episode item.

Each enriched feed item now includes categories, entities and keywords
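
A minimal sketch of that parsing step follows. It assumes the response is a dictionary with categories (each with a label), entities (each with text, type, and relevance), and keywords (each with text and relevance). The function name enrich_item, the min_relevance cutoff, and all sample values are invented for illustration.

```python
def enrich_item(item, response, min_relevance=0.0):
    """Return a copy of `item` with categories, entities, and keywords added."""
    enriched = dict(item)
    enriched["categories"] = [
        c["label"] for c in response.get("categories", [])
    ]
    # Keep the entity type so that people and organizations can be
    # singled out later.
    enriched["entities"] = [
        {"text": e["text"], "type": e.get("type")}
        for e in response.get("entities", [])
        if e.get("relevance", 0.0) >= min_relevance
    ]
    enriched["keywords"] = [
        k["text"]
        for k in response.get("keywords", [])
        if k.get("relevance", 0.0) >= min_relevance
    ]
    return enriched


# Invented response, shaped like the assumed service output.
sample_response = {
    "categories": [{"label": "/science/biology", "score": 0.8}],
    "entities": [
        {"type": "Person", "text": "Neil deGrasse Tyson", "relevance": 0.9},
        {"type": "Location", "text": "Somewhere", "relevance": 0.1},
    ],
    "keywords": [{"text": "microbiome science", "relevance": 0.7}],
}
episode = {"title": "Example episode", "url": "http://example.com/episode-1"}
enriched = enrich_item(episode, sample_response, min_relevance=0.5)
print(enriched["entities"])
```

Filtering on relevance is optional. With the default cutoff of 0.0, everything the service returns is kept.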

Painting a picture with words (and bar charts)

With this information in place, it’s now easy to visually explore the podcast. For illustrative purposes, the notebook plots separate charts for tags, categories, and keywords:

Most frequently used tags. Generic tags provide little selectivity, rendering them unsuitable for faceted search in episodes from the same podcast.
Podcast classification results. I would not have guessed that law is such a prevalent topic. More research is needed to confirm how accurate the classifications are, given that they are based on short episode summaries.
Most frequently used keywords, based on Watson’s episode summary text analysis.

Since each tag, category, and keyword is associated with one or more episodes (and their URLs), it’s now much easier to find something entertaining, thoughtful, or whatever the mind craves at any given time.
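
One way to picture this association is an inverted index that maps each value to the episodes it appears in. The sketch below is an assumption about how such a lookup could be built, not the notebook's implementation. build_index and top_values are hypothetical helpers, and the sample episodes are invented.

```python
from collections import Counter, defaultdict


def build_index(items, field):
    """Map each value of `field` (e.g. "keywords") to a list of episode URLs."""
    index = defaultdict(list)
    for item in items:
        # sorted(set(...)) removes duplicates within a single episode.
        for value in sorted(set(item.get(field, []))):
            index[value].append(item["url"])
    return index


def top_values(index, n=10):
    """Return the n values that occur in the most episodes."""
    counts = Counter({value: len(urls) for value, urls in index.items()})
    return counts.most_common(n)


# Invented episodes, shaped like the enriched feed items.
episodes = [
    {"url": "http://example.com/ep-1", "keywords": ["microbiome", "science"]},
    {"url": "http://example.com/ep-2", "keywords": ["science"]},
]
keyword_index = build_index(episodes, "keywords")
print(top_values(keyword_index, n=1))
print(keyword_index["science"])
```

The counts returned by top_values are the kind of data a bar chart would plot, and the index itself answers "which episodes mention this keyword?"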

So, this leaves one last question…

Who is Neil deGrasse Tyson?

For each episode, Watson’s entity analysis feature identified the entities mentioned in the summary. Each returned entity includes its detected type, such as Person, Company, Organization, or Location.

Taking advantage of Wikipedia’s search capabilities, I’ve added some code to the notebook that sends a search request for each entity with a type of interest (e.g. Person or Organization) and evaluates the returned HTTP status code.

If Wikipedia responds with HTTP code 200, the exact search term was not found.

A Wikipedia search result for “Watson Data Lab”, which is not found.

If Wikipedia responds with HTTP code 302, the exact search term was found. However, the redirect URL may or may not contain the expected result, as illustrated in the following two examples:

Wikipedia search result for a unique Wikipedia entry, “Neil deGrasse Tyson”.
Wikipedia search result for an ambiguous Wikipedia entry, “Jad”.

While this simple approach is not very reliable, it typically provides a meaningful starting point for subsequent searches. The response is recorded, and information about the entity is therefore only “a click away”.
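
The status code check described above might look roughly like the following. This is a sketch under assumptions, not the notebook's code: it assumes the search is reachable at /w/index.php with search and title=Special:Search parameters, and the function names and User-Agent string are made up. It uses http.client from the standard library because that module does not follow redirects, so the 302 stays visible.

```python
import http.client
from urllib.parse import urlencode


def interpret_status(status, location=None):
    """Translate the HTTP status of a search request into a lookup result."""
    if status == 302:
        # Exact title match: the Location header points at the article,
        # which may still be a disambiguation page.
        return {"found": True, "url": location}
    if status == 200:
        # A regular search results page: no exact match.
        return {"found": False, "url": None}
    # Anything else (errors, rate limiting) is treated as unknown.
    return {"found": None, "url": None}


def wikipedia_lookup(term, host="en.wikipedia.org"):
    """Send one search request for `term` without following redirects."""
    query = urlencode({"title": "Special:Search", "search": term})
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request(
            "GET",
            "/w/index.php?" + query,
            headers={"User-Agent": "podcast-notebook-example/0.1"},
        )
        response = conn.getresponse()
        return interpret_status(response.status, response.getheader("Location"))
    finally:
        conn.close()


# No network call is made here; this only shows the interpretation step.
print(interpret_status(200))
```

Calling wikipedia_lookup("Neil deGrasse Tyson") would perform the actual request. Keeping the interpretation separate from the network call makes the rule easy to adjust if the site's behavior differs from what is assumed here.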

Entities along with their associations.

Parting words

The data engineering approach I’ve outlined provides input that could be useful in other scenarios. For example, one can create podcast profiles and build a recommendation engine for other episodes or other podcasts.

I invite you to explore this notebook using your favorite podcast web syndication feed. Maybe you’ll learn something you didn’t know. Follow the instructions in this GitHub repository to get started.