The new sample dataset from the web archives of the Library of Congress consists of 1000 audio files archived from various government websites: hearings, announcements, podcasts from members of Congress, all sorts of audio. I was thinking about how they might relate to one another and was curious about the repetition of phrases. In political rhetoric, anaphora is the device of repeating the same phrase over and over for impact, typically at the start of successive sentences or clauses. I also wanted to try out AWS Transcribe, a service that takes audio files and produces an automated transcript. So I used this dataset to see which words or phrases were repeated across the corpus.
The first step was to convert the audio to MP3, since the files come in all sorts of wild formats like old Real Media and Windows Media Audio. I used ffmpeg in this python script to batch convert them to MP3. This worked on about 700 of them; some of the ram files would not convert. I figured 700 was enough to get started and moved on.
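For a rough idea of what that kind of batch conversion looks like, here is a minimal sketch. It is not the actual script: the function names, the directory arguments, and the choice of the `libmp3lame` encoder at quality level 4 are all assumptions made for illustration. It assumes `ffmpeg` is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path) -> list:
    """Build the ffmpeg argument list for one file (hypothetical helper)."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(src),            # input file in whatever format it is
        "-vn",                     # drop any video stream, keep audio only
        "-codec:a", "libmp3lame",  # encode audio as MP3 (assumed encoder)
        "-q:a", "4",               # variable bitrate quality (assumed value)
        str(dst),
    ]


def convert_all(src_dir: Path, dst_dir: Path) -> list:
    """Convert every file in src_dir to MP3 and return the ones that failed."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(src_dir.iterdir()):
        if not src.is_file():
            continue
        dst = dst_dir / (src.stem + ".mp3")
        result = subprocess.run(build_ffmpeg_command(src, dst), capture_output=True)
        if result.returncode != 0:
            # Some formats simply will not convert; note them and keep going.
            failed.append(src)
    return failed

# Example (directory names are placeholders):
# failed = convert_all(Path("audio_raw"), Path("audio_mp3"))
```

Collecting the failures instead of stopping on the first error is what lets a run like this finish with a usable subset even when some of the files refuse to convert.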
AWS Transcribe works by uploading the audio files to an S3 bucket and then running an API command against each file. Once a job is complete, it provides a URL to download the transcript. I batch processed them in groups of 100 using this script, which just makes calls to the AWS command line tool. I was doing this during the breaks at a conference and wasn’t really paying attention. I knew the service was a little expensive, but at some point I got a billing alert email on my phone and knew I had overdone it. I ended up sending 500 of the 1000 files through the service, for a total of 289,167 seconds (about 80 hours) of audio, costing me close to $100(!). Pro tip: the AWS billing dashboard updates only about every 6 hours. But my loss can be your gain: you can download the 500 transcripts here.
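If you want to try something similar without the surprise bill, a sketch along these lines might help. It is not the original script: the helper names, the job naming, the `en-US` language code, and the per-minute rate you pass to `estimate_cost` are assumptions, so check the current pricing page before trusting any number it gives you. The command it builds is the AWS CLI's `aws transcribe start-transcription-job`, and it assumes the MP3s are already in S3.

```python
import subprocess


def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_transcribe_command(job_name, s3_uri):
    """Build the AWS CLI call that starts one transcription job."""
    return [
        "aws", "transcribe", "start-transcription-job",
        "--transcription-job-name", job_name,
        "--language-code", "en-US",          # assumed language
        "--media-format", "mp3",
        "--media", "MediaFileUri=" + s3_uri,
    ]


def estimate_cost(total_seconds, usd_per_minute):
    """Estimate the bill up front; the rate is something you look up yourself."""
    return total_seconds / 60 * usd_per_minute


def submit_batch(s3_uris, batch_number):
    """Start one transcription job per file in a batch (hypothetical naming)."""
    for index, uri in enumerate(s3_uris):
        job_name = "anaphora-{}-{}".format(batch_number, index)
        subprocess.run(build_transcribe_command(job_name, uri), check=True)

# Example (bucket and keys are placeholders):
# uris = ["s3://my-bucket/audio/0001.mp3", "s3://my-bucket/audio/0002.mp3"]
# for number, batch in enumerate(chunked(uris, 100)):
#     submit_batch(batch, number)
```

Running `estimate_cost` over the total duration of a batch before submitting it is the cheap safeguard here, since the billing dashboard lags well behind the jobs.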
The transcripts are pretty elaborate: each word is broken out with the time it was spoken, down to a fraction of a second. I wanted to compile the most often repeated n-grams, and I also used my Semlab DADAlytics NER toolkit to pull out named entities (also included in the above download). This script builds the n-grams.
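To give a sense of how the n-gram step can work, here is a minimal sketch, not the actual script. It assumes the usual shape of a Transcribe result file, where `results.items` is a list of entries with a `type`, string `start_time` and `end_time` values, and the recognized word under `alternatives[0].content`. The function names and the tiny `sample` transcript are made up for illustration.

```python
from collections import Counter


def words_from_transcript(transcript):
    """Pull out spoken words with their timings, skipping punctuation items."""
    words = []
    for item in transcript["results"]["items"]:
        if item.get("type") != "pronunciation":
            continue  # punctuation items carry no timestamps
        words.append({
            "text": item["alternatives"][0]["content"].lower(),
            "start": float(item["start_time"]),
            "end": float(item["end_time"]),
        })
    return words


def ngrams(words, n):
    """Yield (phrase, start, end) for every run of n consecutive words."""
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        phrase = " ".join(w["text"] for w in window)
        yield phrase, window[0]["start"], window[-1]["end"]


def count_ngrams(transcripts, n):
    """Count how often each n-gram appears across all transcripts."""
    counts = Counter()
    for transcript in transcripts:
        words = words_from_transcript(transcript)
        counts.update(phrase for phrase, _, _ in ngrams(words, n))
    return counts


def _word(text, start, end):
    """Build one made-up transcript item for the sample below."""
    return {"type": "pronunciation", "start_time": start, "end_time": end,
            "alternatives": [{"content": text}]}


# A made-up four-word transcript: "We will we will."
sample = {"results": {"items": [
    _word("We", "0.0", "0.2"), _word("will", "0.2", "0.5"),
    _word("we", "0.6", "0.8"), _word("will", "0.8", "1.1"),
    {"type": "punctuation", "alternatives": [{"content": "."}]},
]}}
```

Keeping the start and end times alongside each phrase matters for the next step, because those are the offsets needed to cut the matching audio out of the original file.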
The next step was to cut out the words I was interested in and save them together. This script cuts out the audio for each n-gram and base64 encodes the clips together into one big JSON file. This JSON file is then loaded by a website to play back the audio.
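The cutting and encoding can be sketched roughly like this. Again, this is an illustration under assumptions rather than the actual script: the helper names and the layout of the JSON (a phrase mapped to a list of base64 strings) are hypothetical, and the clip is cut with ffmpeg's `-ss` (start) and `-t` (duration) options.

```python
import base64
import json
import subprocess


def build_clip_command(src, start, end, dst):
    """Build an ffmpeg call that cuts the span [start, end] seconds out of src."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ss", str(start),        # where the clip begins, in seconds
        "-t", str(end - start),   # how long the clip lasts, in seconds
        dst,
    ]


def cut_clip(src, start, end, dst):
    """Run ffmpeg and return the bytes of the resulting clip."""
    subprocess.run(build_clip_command(src, start, end, dst),
                   check=True, capture_output=True)
    with open(dst, "rb") as handle:
        return handle.read()


def encode_audio(audio_bytes):
    """Base64 encode raw audio bytes into a JSON-safe ASCII string."""
    return base64.b64encode(audio_bytes).decode("ascii")


def build_payload(clips_by_phrase):
    """Turn {phrase: [clip bytes, ...]} into one JSON document (assumed layout)."""
    encoded = {
        phrase: [encode_audio(clip) for clip in clips]
        for phrase, clips in clips_by_phrase.items()
    }
    return json.dumps(encoded)
```

On the website side, a base64 string like this can be prefixed with `data:audio/mpeg;base64,` and handed to an audio element, which is one reason to bundle everything into a single JSON file instead of hosting thousands of tiny clips.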
The result is https://anaphora.glitch.me
It has a number of playback options and you can click on one of the words to see where it came from and play it again in the full context of the original sound file.
You can also directly link to favorites: