Extracting an n-gram word index for 675 hours of Podcast glory

Jan Degener
Jan 19, 2023


A couple of weeks ago the downloads for the Harmontown podcast stopped working. Apparently for everyone. But I needed them. Urgently. So while the remaining time on my previously downloaded episodes ran down like an hourglass, I decided I would download all 364 episodes once the server was working again. Eventually it was, and so I did.

This text is about how to replicate these steps so you can create your own transcript and index for any podcast you want. It’s all open source and works a lot better than I had initially hoped for!

If you’re only here for the transcripts: raw transcripts are here (one large file with everything and a folder with transcripts per episode). The index json created below can be found here.

The Harmontown Podcast

Have you ever tried to find something specific in a podcast and couldn’t? Simply because it’s impossible to listen through all of it just to find those 5 seconds of information. Wouldn’t it be great if there were some sort of transcript one could search for that little snippet? So now, since I had all the episodes, I decided to build my own text version and index, so I could look up everything to my heart’s content.

Downloading all episodes

This is relatively straightforward. There exists an RSS feed, which is basically an XML file. This file contains structured information on each episode, including download paths, release dates etc.

The following script fetches the XML file, parses out the mp3 download paths, prepends the release date and strips some special characters that cannot be used in filenames:

import os
from bs4 import BeautifulSoup
import requests
import pendulum

feed = "https://feeds.megaphone.fm/harmontown"
out_folder = "path_to_downloaded_mp3s/Harmontown/"

# get xml file, then parse for "item" (= one item equals one episode)
r = requests.get(feed).content
soup = BeautifulSoup(r, "xml")
items = soup.findAll("item")

# keep track of downloaded mp3s
existing_mp3s = os.listdir(out_folder)

for item in items:
    # replace special chars in title
    title = item.find("title").text.replace(" ", "_").replace("’", "").replace("?", "").replace("/", "_").replace(":", "_")

    # prepend formatted date to output file name
    org_date = item.find("pubDate").text
    date = pendulum.from_format(org_date[5:16], "DD MMM YYYY", tz=None).to_date_string()

    # skip downloading if file already exists
    mp3_title = date + "_" + title + ".mp3"
    if mp3_title in existing_mp3s:
        continue

    # actually download file and save to drive
    url = item.find("enclosure")["url"]
    mp3 = requests.get(url, allow_redirects=True).content
    with open(out_folder + mp3_title, "wb") as w:
        w.write(mp3)

Speech-to-Text with VOSK

I’ll be using VOSK as the primary speech-to-text tool here. Mainly because it supports several languages (which I may need in the future) and speaker detection, which I will touch on superficially below.

To replicate the following code, all you need is a folder full of mp3s, VOSK, pydub and ffmpeg. The latter two are only required to transform the mp3s to wav format to work with VOSK. You can use any other means to achieve that if that appears easier. To avoid unnecessary hardware writes, all of these transformations will be done in-memory with Python’s BytesIO file-like objects.

Let’s get started. First import all necessary packages and set the paths. This should be the only thing that needs to be altered to run on another system:

from vosk import Model, KaldiRecognizer, SpkModel
from pydub import AudioSegment
import io
import wave
import json
import numpy as np
import os

out_text_fol = 'path_to_project/transcripts/'
mp3_fol = 'path_to_project/Harmontown_Podcast_2012_2021/'
vosk_model_fp = 'path_to_models/vosk-model-en-us-0.42-gigaspeech'
AudioSegment.converter = 'path_to_ffmpeg_folder/bin/ffmpeg.exe'

A note on ffmpeg.exe: using it under Windows without a proper installation (simply unzipping the exe files and referring to their absolute paths as indicated above) led to some problems. The error can be a bit misleading, as it simply states “WindowsError: [Error 2] The system cannot find the file specified”. Apparently pydub tries to access not only ffmpeg.exe (which works fine with the absolute path), but also ffprobe.exe. However, setting a path to ffprobe.exe did not have any effect on my system, and setting the ffmpeg.exe path alone was not enough either. If you run into this problem, try the quickfix I’ve outlined over here.

Once that is done, simply throw the mp3s at the model and let it do its magic. The code will iterate through all mp3s and create an individual text output file in the out_text_fol defined above:

# build the model object once; loading the large model is slow,
# so there is no need to redo it for every file
model = Model(vosk_model_fp)

for mp3_fp in os.listdir(mp3_fol):

    # skip non-mp3 files
    if mp3_fp[-3:].lower() != "mp3":
        continue

    # skip already transcribed ones
    if mp3_fp[:-4] in [i[:-4] for i in os.listdir(out_text_fol)]:
        continue

    # read the mp3 file
    mp3_file = AudioSegment.from_mp3(mp3_fol + mp3_fp)

    # convert to mono wave file in memory
    memfile = io.BytesIO()
    mp3_file.export(memfile, 'wav')
    memfile.seek(0)
    wav_file = AudioSegment.from_wav(memfile)
    wav_file_1c = wav_file.set_channels(1)
    wav_file_1c = wav_file_1c.set_frame_rate(16000)

    # pass wav-file to the wave package in memory
    memfile = io.BytesIO()
    wav_file_1c.export(memfile, 'wav')
    memfile.seek(0)
    wf = wave.open(memfile, "rb")

    # initialize lists to hold results
    resultsList = []  # this will hold ALL model output (word likelihood etc.)
    textResList = []  # this will hold ONLY text output

    # build the recognizer object for this file's sample rate
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)

    # use this for non speaker recognition
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            recResult = rec.Result()
            resultDict = json.loads(recResult)
            resultsList.append(resultDict)
            # save the 'text' value from the dictionary into a list
            textResList.append(resultDict.get("text", ""))

    # process the "final" result, i.e. the last chunk of data that
    # was too small to be read as a full chunk
    resultDict = json.loads(rec.FinalResult())
    resultsList.append(resultDict)
    textResList.append(resultDict.get("text", ""))

    # write text portion of results to a file
    with open(out_text_fol + f"{mp3_fp[:-4]}.txt", 'w') as w:
        w.write('\n'.join(textResList))

This may take some time to finish. For a 12,000-word episode it took about 40 min of decoding. There seems to be a way to run the VOSK server with GPU support, but it does not seem to work properly from the standard Python installation, especially on Windows. As indicated by the discussion here, running it through Docker might enable using the batch mode with GPU support. However, I did not set this up, so for now it runs on CPU only. Using the largest language model probably doesn’t help speed-wise either, though I’d rather be more accurate here than faster. So for now I’ve set it up to run a couple of processes in parallel (manually setting different starting points in the input list). Caveman style… but it works ;)

Speaker Recognition (optional)

Because this is probably not super useful for Harmontown (given the variety of possible speakers per episode), I skipped speaker recognition for this podcast. I did, however, do a small intro on it for another podcast here. In the meantime, here is a short description of how to approach speaker recognition:

SetSpkModel()

While not immediately obvious, the VOSK API offers speaker recognition out of the box (see example code). Basically, there exists a second model that can be used in conjunction with the general speech recognition pipeline. First, the model itself needs to be downloaded: there is currently one available on the main model site, called vosk-model-spk-0.4.

Then, after loading the main speech-to-text model, add the speaker model to the pipeline:

spk_model = SpkModel("path_to_model/vosk-model-spk-0.4")
rec.SetSpkModel(spk_model)

Cosine Distances

The results output will now return not only the spoken text, but also a 128-dimensional vector representing the voice characteristics. A simple cosine similarity function then allows finding similar voices:

def cosine_dist(v1, v2):
    dot = np.dot(v1, v2)
    norm_vec1 = np.linalg.norm(v1)
    norm_vec2 = np.linalg.norm(v2)
    cosine_similarity = dot / (norm_vec1 * norm_vec2)
    return cosine_similarity

The closer the result is to 1, the more similar two vectors are (note: this function uses a slightly different formula than the one in the example code. The code above correctly returns 1 for identical vectors, while the example code returned values around 0).
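A quick sanity check of that claim, using small made-up vectors in place of real 128-dimensional speaker vectors:

```python
import numpy as np

def cosine_dist(v1, v2):
    # same similarity function as above
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, different magnitude
c = np.array([-1.0, 0.0, 1.0])

print(round(cosine_dist(a, a), 3))  # identical vectors -> 1.0
print(round(cosine_dist(a, b), 3))  # parallel vectors -> 1.0
print(round(cosine_dist(a, c), 3))  # -> 0.378
```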

Speaker Assignment

Now a bit of manual work is necessary. For this to work properly, one needs to keep a look-up table of known 128-dimensional vectors for each potential speaker, possibly even several of these vectors per speaker. As per the documentation, utterances should be longer than 4 seconds to produce reliable vectors.

So first I ran it over a 10-minute snippet of a random episode and found a couple of those occurrences for Dan Harmon. In general, it looks like similarities >0.6 are a good indicator that it’s actually him speaking, though I haven’t looked into it that deeply.

So the next steps would be: create a json file with speaker names and a couple of their 128-dimensional vectors, then write code that checks new utterances against all known vectors and assigns the name with the highest similarity, provided it is >0.6 as well. This probably needs more testing, but it is a point to jump off from.
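Those next steps could be sketched as follows. The look-up table, names and vectors are all hypothetical (and shortened to 3 dimensions purely for illustration; real VOSK speaker vectors have 128):

```python
import numpy as np

# hypothetical look-up table: speaker name -> list of known voice vectors
known_speakers = {
    "dan": [np.array([0.9, 0.1, 0.2]), np.array([0.8, 0.2, 0.1])],
    "spencer": [np.array([0.1, 0.9, 0.3])],
}

def cosine_sim(v1, v2):
    # same similarity function as in the section above
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def assign_speaker(spk_vec, threshold=0.6):
    """Return the best-matching speaker name, or None if nothing beats the threshold."""
    best_name, best_sim = None, threshold
    for name, vectors in known_speakers.items():
        for known_vec in vectors:
            sim = cosine_sim(spk_vec, known_vec)
            if sim > best_sim:
                best_name, best_sim = name, sim
    return best_name

# an utterance vector close to dan's known vectors
print(assign_speaker(np.array([0.85, 0.15, 0.15])))  # -> dan
```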

Building n-grams

Before starting one needs to consider two main aspects: preprocessing and final layout.

Preprocessing can be everything from word replacements (in this case the word Harmontown is sometimes not recognized; instead there are variations like harmattan, harman town etc.), to character substitution (special characters in particular), to stop-word removal and other NLP approaches like lower-casing everything (which the output luckily already is). I’ve decided to only remove special characters. Since I want the index to contain counts per episode as well, it might for example be interesting to compare the number of “I”s per episode.

As a layout I’ve decided on a json file with some meta information prepended, and then separate parts for each n-gram depth (1-gram, 2-gram, 3-gram…), where the n-grams are keys and {episode_number: count} mappings are values (to stay in Python dictionary terms). In general it should look like this:

{"meta": {
    "episode_names": {
        "1": "First Episode Name In Filename",
        "2": "Second Episode Name In Filename"
    },
    "episode_done": ["1", "2"],
    "episode_info": {
        "1": {
            "words": 12655,
            "time_min": 121.3
        },
        "2": {
            "words": 8554,
            "time_min": 119.3
        }
    }
},
"1-gram": {
    "I": {
        "1": 15,
        "87": 13
    },
    "harmon": {
        "2": 5,
        "3": 10
    }
},
"2-gram": {
    "six seasons": {
        "9": 15
    },
    "a movie": {
        "9": 15,
        "45": 3
    }
}
}

The meta part also contains a conversion from filenames to a preset numbering of episodes (basically a running count). All other episode references thus simply need to use this number instead of the entire title.

Creating n-grams

The following code lives separately from the extraction code above. The header looks as follows:

import os
import json
import re

transcripts_folder = 'path_to_transcripts/transcripts/'
mp3_fol = 'path_to_mp3s/Harmontown_Podcast_2012_2021/'
index_json_fp = 'path_to_final_output/index.json'

The mp3 folder is only necessary to extract some additional metadata like mp3 runtime. It is not needed for the main code to run.

The n-grams themselves are not that hard to create. Instead of pulling in an additional package like NLTK, the code uses this small function:

def generate_ngrams(s, n):
    # remove all non-alphanumeric characters,
    # then replace all newlines with a space
    s = re.sub(r'[^a-zA-Z0-9\s]', '', s)
    s = re.sub(r'[\n]', ' ', s)

    # break up sentences into tokens, ignore empty tokens
    tokens = [token for token in s.split(" ") if token != ""]

    # create the actual n-grams using zip and return everything as a list
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
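To see what the zip trick at the heart of that function does, here is a quick sketch on a toy sentence:

```python
tokens = "i dont know what it was".split()

# zip over staggered copies of the token list yields sliding windows of size n
n = 3
ngrams = [" ".join(t) for t in zip(*[tokens[i:] for i in range(n)])]
print(ngrams)
# -> ['i dont know', 'dont know what', 'know what it', 'what it was']
```

zip stops at the shortest of the staggered lists, so the windows never run past the end of the sentence.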

In addition, I will use the following function to extract the audio length from the mp3s. This would be possible with pydub as well, but pydub reads the entire file first and is very slow in doing so. So I’d rather use mutagen:

from mutagen.mp3 import MP3

def get_mp3_lem_from_metadata(mp3_fp):
    """Get length of mp3 in minutes"""

    audio = MP3(mp3_fp)
    length = audio.info.length  # in seconds

    return length / 60

Building the actual index

The rest of the code is pretty straightforward. Most of it handles keeping track of what has already been done:

# check for existing index file. If none exists, create it from scratch
if os.path.isfile(index_json_fp):
    with open(index_json_fp, "r", encoding="utf8") as r:
        index_dict = json.load(r)
else:
    index_dict = {'meta': {'episode_names': {},
                           'episode_done': [],
                           'episode_info': {}},
                  '1-gram': {},
                  '2-gram': {},
                  '3-gram': {},
                  '4-gram': {},
                  '5-gram': {}
                  }

# build the initial naming mapping, only do this once.
# Save names without filetype ending
if index_dict['meta']['episode_names'] == {}:
    cnt = 0
    for ep in os.listdir(mp3_fol):
        if ep[-3:].lower() == "mp3":
            index_dict['meta']['episode_names'][str(cnt)] = ep[:-4]
            cnt += 1

for txt_file in os.listdir(transcripts_folder):

    with open(transcripts_folder + txt_file, "r") as r:
        text = r.read()

    # recover the current episode ID by its name
    ep_num = [i for i in index_dict['meta']['episode_names']
              if index_dict['meta']['episode_names'][i] == txt_file[:-4]][0]

    # skip file if already in dict
    if ep_num in index_dict['meta']['episode_done']:
        continue

    # OPTIONAL: get some meta info for the episode and write it to the dict
    ep_words = len(text.split())
    ep_len = get_mp3_lem_from_metadata(mp3_fol + txt_file[:-3] + "mp3")
    index_dict['meta']['episode_info'][ep_num] = {'words': ep_words,
                                                  'time_min': round(ep_len, 2)}

    # iterate the different n-gram depths and count occurrences
    for ng_len in range(1, 6):
        ngrams = generate_ngrams(text, ng_len)

        for ng in ngrams:
            # check if ngram exists at all
            if ng in index_dict[f"{ng_len}-gram"]:
                # check if an entry for this episode exists
                if ep_num in index_dict[f"{ng_len}-gram"][ng]:
                    index_dict[f"{ng_len}-gram"][ng][ep_num] += 1
                else:
                    # if not, init the episode counter
                    index_dict[f"{ng_len}-gram"][ng][ep_num] = 1
            else:
                # if the ngram is new, init it with this episode
                index_dict[f"{ng_len}-gram"][ng] = {ep_num: 1}

    index_dict['meta']['episode_done'].append(ep_num)

# finally save dict to json
with open(index_json_fp, "w", encoding="utf8") as w:
    json.dump(index_dict, w)

And that’s it. Now we have a super nice and large index file where we can look up word counts, compare them across episodes, find weird 5-gram combos that appear again and again, and just look up those topics you remember but couldn’t pinpoint to a concrete episode.
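Looking something up in the finished index is then a matter of two dictionary accesses. A minimal sketch, using a tiny made-up stand-in for the real index.json (episode numbers and names are hypothetical):

```python
# a tiny stand-in for the real index_dict loaded from index.json
index_dict = {
    "meta": {"episode_names": {"9": "2013-07-15_Some_Episode",
                               "45": "2015-01-10_Another_Episode"}},
    "2-gram": {"a movie": {"9": 15, "45": 3}},
}

def lookup(index_dict, phrase):
    """Return {episode_name: count} for an n-gram phrase."""
    # the number of words in the phrase decides which n-gram part to search
    n = len(phrase.split())
    counts = index_dict.get(f"{n}-gram", {}).get(phrase, {})
    names = index_dict["meta"]["episode_names"]
    return {names[ep]: cnt for ep, cnt in counts.items()}

print(lookup(index_dict, "a movie"))
# -> {'2013-07-15_Some_Episode': 15, '2015-01-10_Another_Episode': 3}
```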

“Fun” Stats about Harmontown

Now that we have all the data, let’s have a look at some fun stats (a bit random, I was just playing around):

  • There is a total of 40,468.55 minutes of Harmontown. That is 674.5 hours or 28 days 2 hours 30 minutes. Those minutes contain 6,863,118 words… which is almost 12x the Lord of the Rings series — including The Hobbit.
  • The episode with the most words spoken (25,607, average is 18,854) is also the longest (147.8 min, average is 111.2 min): ep.120 2014–10–10_LIVE_in_Toronto_feat._Bobcat_Goldthwait!
  • However, the episode with the most words spoken per minute (200.9, average is 169.6) is ep.263 2017–10–25_Seventeen_Chicken_Boots
  • There is not a single episode — including the short preview on 2014–05–03 — that lacks the word “fuck”. And that is not even counting derivatives like fucking, fucked etc. At least a runner-up in that category is ep.210 2016–08–24_A_Haunted_House_With_A_Glass_Ceiling, which contains only 2x “fuck”, 1x “fucking” and 5x “fucked”. On average these three words appear 62x per episode. The most of these three fucks were given in ep.328 2019–04–04_Jeffs_Joke_Corner_ with 148x.
  • Top 5 n-grams (and their total counts) — this should probably be redone with stopwords removed:
    1-gram: i (241302), like (220635), the (218415), you (194998), and (189441)
    2-gram: you know (26447), i dont (25120), like like (21966), i was (19306), and i (18912)
    3-gram: i dont know (10038), like like like (6449), i was like (5045), a lot of (4359), it was like (3453)
    4-gram: like like like like (1896), i dont want to (1857), and i was like (1533), i dont know i (1421), i dont know what (1403)
    5-gram: like like like like like (516), i dont know i dont (442), yeah yeah yeah yeah yeah (400), dont know i dont know (281), no no no no no (278)
  • noteworthy 3-grams that appear in at most 5 episodes but at least 20x in one of those episodes (pretty random, but I was too lazy to come up with something more sophisticated):
    better off dead e56: 3, e137: 21, e143: 1, e145: 1, e150: 1
    lou diamond phillips e81: 1, e115: 1, e306: 1, e345: 25
    dollar shave club e215: 22, e216: 9, e217: 9, e219: 6, e220: 1
  • noteworthy 4-grams that appear more than 10x in a single episode and in more than 5 episodes:
    my name is john e38: 1, e135: 13, e173: 1, e175: 1, e204: 2, e218: 1, e227: 1, e238: 1, e275: 1, e290: 1, e322: 1, e343: 1, e356: 1
    king of the nerds e49: 5, e91: 1, e93: 1, e137: 11, e138: 2, e142: 1, e145: 1
    grand theft auto five e70: 11, e124: 1, e160: 2, e204: 1, e213: 1, e310: 1, e322: 1
    now you see me e201: 15, e203: 1, e209: 2, e233: 1, e254: 1
  • noteworthy 5-grams that appear more than 5x in a single episode and in more than 5 episodes:
    new kids on the block e3: 1, e17: 1, e105: 3, e209: 6, e353: 1
    god oh my god oh e177: 1, e182: 9, e223: 3, e282: 1, e315: 1
    to the break of dawn e224: 1, e231: 2, e233: 1, e247: 1, e285: 6, e300: 1, e357: 1
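The stopword cleanup mentioned for the top-5 lists above could look like this. A sketch with a hypothetical miniature 1-gram index and a tiny hard-coded stopword set (in practice, a proper list such as NLTK’s stopword corpus would be the natural choice):

```python
# tiny hypothetical stand-in for the "1-gram" part of the index
one_grams = {
    "i": {"1": 15},
    "the": {"1": 12},
    "like": {"1": 9},
    "harmon": {"2": 5},
}

# tiny hard-coded stopword set; in practice use e.g. nltk.corpus.stopwords
stopwords = {"i", "the", "you", "and", "a"}

# drop every n-gram whose key is a stopword
filtered = {ng: counts for ng, counts in one_grams.items()
            if ng not in stopwords}
print(filtered)
# -> {'like': {'1': 9}, 'harmon': {'2': 5}}
```

For 2-grams and deeper, one would instead check whether all (or most) tokens of the n-gram are stopwords before dropping it.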
