As part of Cortico’s mission to foster a healthy public sphere, we’re building tools for journalists to explore and understand voices from around the country. We’re especially interested in surfacing voices that are underrepresented in existing data sets.
Every day, thousands of conversations take place on talk radio, a medium with enormous reach and influence. According to Nielsen Media Research data reported by the Pew Research Center, 90% of Americans age 12 and older listened to broadcast radio in a given week in 2017; at any given time of day, news and talk stations commanded about 10% of total listenership.
Some of these conversations are local in scope, while others embrace national or international events. Some are in call-in shows that are syndicated across the country, while others are unique to a single radio station. Hearing these conversations gives our journalist partners a deeper understanding of a community’s local issues — as well as the residents’ reactions to national issues — than they could achieve by analyzing social media alone.
To this end, we’ve been working to make talk radio browsable and searchable using automatic speech recognition and diarization technology. The result is that talk radio is now one of the flagship data sources available in our public sphere search engine, Earshot.
Our software pipeline
We now process about 170 talk radio and public radio stations across the country, adding more than 4,000 hours of audio containing 30 million words of speech every day. The pipeline that takes talk radio into Earshot has seven distinct pieces:
- Ingestion of publicly accessible radio streams from around the country.
- Transcription of the audio files (speech-to-text) using the Kaldi [1] open source speech recognizer.
- Diarization of the audio files, in which we chop the transcripts into “speaker turns”, contiguous segments of words spoken by the same person in a conversation.
- Classification of the speaker turns by various facets such as whether the speech seems to be from a phone call or from the studio; whether it’s a male or female speaker; and whether it’s speech or music.
- Syndication grouping, in which we use the speech-to-text transcripts and the raw audio to determine which content is from syndicated radio shows and which content is specific to a single station.
- Indexing of the transcript data into a text-friendly database.
- Analysis of the data to find topics in the transcripts and surface interesting geospatial and temporal trends to Earshot users.
Each of the above steps is a world unto itself, rich with interesting computer science problems. We’ll take a look at diarization in this blog post and explore some of the other steps in future blog posts.
A Deep Dive on Diarization
With the speech transcripts we obtain using automatic speech recognition, talk radio becomes searchable. But along with characterizing what was said, we are also interested in who said it. Separating a speech stream by speaker is known as speaker diarization. Once a speech stream has been separated by speaker, we can then infer additional information such as the speaker gender and whether they are in the studio or calling in over the phone. To accomplish these tasks, we employ the LIUM speaker diarization toolkit [2,3] and have developed tools to adapt and improve the underlying acoustic models.
The basic idea in speaker diarization is to extract salient features from a voice that are unique to the speaker, and ideally independent of what’s being said. The salient features we use capture the unique mix of frequencies present in a speaker’s voice. In the same way that different musical instruments playing the same note are distinguishable, different speakers’ voices have their own characteristic mix of frequencies. The first step in speaker diarization is thus to convert the speech stream to a sequence of frequency snapshots. Technically, this is accomplished by extracting a sequence of mel-frequency cepstral coefficient (MFCC) features from the audio stream; MFCCs are a standard feature representation of audio used in speech applications.
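To make the "frequency snapshot" idea concrete, here is a minimal sketch of MFCC extraction in plain Python. It is an illustration only, not the feature extractor used by LIUM or Kaldi: the function names, the naive DFT, the 20 mel filters, and the 13 coefficients are all assumptions chosen for readability. Each frame is windowed, converted to a power spectrum, pooled through triangular mel-spaced filters, and decorrelated with a DCT.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum of one Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(max_mel * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = (sample_rate / 2.0) / (num_bins - 1)
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for b in range(num_bins):
            f = b * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mfcc(signal, sample_rate, frame_len, hop, num_filters=20, num_coeffs=13):
    """Return one list of cepstral coefficients per frame."""
    bank = mel_filterbank(num_filters, frame_len // 2 + 1, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = power_spectrum(signal[start:start + frame_len])
        # Small constant avoids log(0) when a filter covers no energy.
        log_energies = [math.log(sum(w * p for w, p in zip(filt, spec)) + 1e-10)
                        for filt in bank]
        # DCT-II decorrelates the log filterbank energies.
        coeffs = [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                      for j, e in enumerate(log_energies))
                  for c in range(num_coeffs)]
        features.append(coeffs)
    return features

# Demo: 0.1 s of a 440 Hz tone at 8 kHz (a synthetic stand-in for speech),
# cut into 25 ms frames every 10 ms.
rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(800)]
features = mfcc(tone, rate, frame_len=200, hop=80)
print(len(features), "frames x", len(features[0]), "coefficients")
```

A production system would use an FFT and a tuned feature configuration; the point here is only that each short frame of audio becomes a small vector describing its frequency content.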
The next step in the LIUM diarization pipeline is to identify contiguous segments of MFCC features that belong to a single speaker. This is accomplished by fitting multivariate Gaussians to short windows of MFCC features, and performing a statistical test to determine whether the MFCC features in adjacent windows are different. While adjacent windows will usually have some differences even if they are from the same speaker, the differences will be much larger when there is a speaker change. Finally, a clustering process is applied that groups segments together that appear to come from the same speaker. This is achieved by merging segments that are statistically similar and likely to be generated by the same speaker. A mixture of Gaussians is used to model each speaker.
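One common way to phrase the "are these two windows different?" test is the Bayesian information criterion (BIC): compare how well one Gaussian explains both windows against how well two separate Gaussians do, and penalize the extra parameters. The sketch below is a simplified illustration under stated assumptions, not LIUM's implementation: it uses diagonal covariances, a hypothetical `delta_bic` helper, and synthetic 4-dimensional "features" in place of real MFCCs.

```python
import math
import random

def log_det_diag(frames):
    """Log-determinant of the diagonal covariance fitted to the frames."""
    n = len(frames)
    total = 0.0
    for d in range(len(frames[0])):
        column = [f[d] for f in frames]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        total += math.log(var + 1e-8)
    return total

def delta_bic(left, right, penalty_weight=1.0):
    """Positive values suggest the two windows come from different speakers."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    dims = len(left[0])
    # Likelihood gain from modeling the windows with two Gaussians instead of one.
    gain = 0.5 * (n * log_det_diag(left + right)
                  - n1 * log_det_diag(left)
                  - n2 * log_det_diag(right))
    # The second diagonal Gaussian adds `dims` means and `dims` variances.
    penalty = 0.5 * penalty_weight * (2 * dims) * math.log(n)
    return gain - penalty

# Synthetic stand-ins for MFCC windows: speaker B's features have a shifted mean.
rng = random.Random(0)

def fake_frames(mean, count, dims=4):
    return [[rng.gauss(mean, 1.0) for _ in range(dims)] for _ in range(count)]

speaker_a = fake_frames(0.0, 100)
speaker_a_again = fake_frames(0.0, 100)
speaker_b = fake_frames(3.0, 100)

print("same speaker, change detected:", delta_bic(speaker_a, speaker_a_again) > 0)
print("new speaker, change detected: ", delta_bic(speaker_a, speaker_b) > 0)
```

The same score can drive the clustering step: repeatedly merge the pair of segments with the lowest score until no pair falls below zero.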
In general, the LIUM system produces good speaker segmentation on the talk radio data we have collected. But in addition to segmentation, we would also like to classify the speaker gender and band (i.e., whether the speaker is in the studio or calling in on the telephone). The LIUM system provides built-in gender and band classifiers that apply Gaussian mixture models over MFCCs to the segmented speech. While these classifiers were a good starting point, the models were tuned for audio and speech with different characteristics and produced too many errors for our task. For example, the model missed a lot of telephone speech, in particular mistaking female telephone speech for another category. So we embarked on a process of retraining the models using our own data. We developed a simple tool that played a speech segment (as determined by the diarization) to a human annotator, who then labeled the speaker gender and band using a few keystrokes. Our web-based tool was optimized for speed, and with several of us investing just a few hours of annotation effort, we labeled several hours of speech segments that we used to retrain the models. The outcome of these efforts was a model that increased overall accuracy by 10 percentage points (absolute), and crucially improved the recall of female telephone speech (and telephone speech in general) to an acceptable level.
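Overall accuracy and per-class recall can tell different stories, which is why both matter when a rare category such as female telephone speech is being missed. The following sketch shows how the two are computed from annotator labels and classifier output. The label names and the tiny label lists are made up for illustration; they are not our annotation data or results.

```python
from collections import Counter

def accuracy(reference, predicted):
    """Fraction of segments whose predicted label matches the annotator's."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)

def per_class_recall(reference, predicted):
    """For each true label, the fraction of its segments the classifier found."""
    correct = Counter()
    total = Counter()
    for ref, hyp in zip(reference, predicted):
        total[ref] += 1
        if ref == hyp:
            correct[ref] += 1
    return {label: correct[label] / total[label] for label in total}

# Hypothetical annotator labels for six segments (illustrative only).
reference = ["female-studio", "female-telephone", "female-telephone",
             "male-studio", "male-studio", "male-telephone"]
# Hypothetical classifier outputs before and after retraining.
before = ["female-studio", "female-studio", "male-telephone",
          "male-studio", "male-studio", "male-telephone"]
after = ["female-studio", "female-telephone", "female-studio",
         "male-studio", "male-studio", "male-telephone"]

for name, predicted in (("before", before), ("after", after)):
    recall = per_class_recall(reference, predicted)
    print(name, "accuracy:", round(accuracy(reference, predicted), 2),
          "female-telephone recall:", recall["female-telephone"])
```

In this toy example the first classifier looks tolerable on accuracy alone while finding none of the female telephone segments, which is the kind of failure that only per-class recall exposes.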
What’s diarization good for?
You can see from Figure 2 that diarization is a key component in our pipeline. For one thing, being able to segment audio by speaker turn gives our search engine a coherent chunk of speech to index and return to users. Without diarization, a radio transcript just looks like an endless stream of words, and it’s not clear how much context to show when search results are returned to the user. Also, having speaker boundaries lets us detect syndicated content more efficiently — when we’re comparing two pieces of audio to determine if they’re duplicates, we can line them up at speaker boundaries. Finally, the attributes we get from our classifier for each speaker turn (gender and telephone/studio, for now) are useful for prioritizing and filtering results in the search engine, and for deeper analysis of our ever-growing radio corpus.
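As a rough sketch of how speaker boundaries turn an endless word stream into indexable chunks, the code below assigns time-stamped words from a recognizer to diarization segments and emits one "speaker turn" per segment. The `Segment` and `Turn` types, the field names, and the sample data are hypothetical; they do not reflect the actual formats produced by Kaldi or LIUM.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # diarization cluster label

@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    words: list = field(default_factory=list)

def words_to_turns(words, segments):
    """Group (start_time, text) words into one turn per diarization segment."""
    segments = sorted(segments, key=lambda s: s.start)
    starts = [s.start for s in segments]
    turns = [Turn(s.speaker, s.start, s.end) for s in segments]
    for time, text in sorted(words):
        # Find the last segment that starts at or before this word.
        i = bisect.bisect_right(starts, time) - 1
        if i >= 0 and time < segments[i].end:
            turns[i].words.append(text)
    # Segments with no recognized words (e.g. music) produce no turn.
    return [t for t in turns if t.words]

# Hypothetical diarization output and recognizer output.
segments = [Segment(0.0, 2.0, "S0"), Segment(2.0, 5.0, "S1")]
words = [(0.2, "good"), (0.9, "morning"), (2.1, "thanks"),
         (2.8, "for"), (3.3, "having"), (3.9, "me")]

for turn in words_to_turns(words, segments):
    print(turn.speaker, f"{turn.start:.1f}-{turn.end:.1f}", " ".join(turn.words))
```

Each resulting turn is a self-contained unit that a search index can store and return with a sensible amount of context.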
Since we deployed our pipeline in May, the corpus has grown to more than a billion words, large enough that we can glean from it a few interesting facts about the dynamics between hosts and callers on the radio. Of the content on our stations, 89.4% is “studio” speech (from hosts) and 10.6% is telephone speech (from call-in show guests, for example). Speakers classified as female account for 22.2% of speaker turns (21.2% of words), while speakers classified as male account for 77.8% of speaker turns (78.7% of words). As shown in Figure 4, female speaker turns are both rarer and shorter (p < 0.01), and are slightly more common within telephone speech than studio speech. Deeper analysis of the corpus could better reveal the nature of these differences and other conversation dynamics on talk radio.
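The gap between a group's share of turns and its share of words is what indicates shorter turns. A minimal sketch of that bookkeeping follows; the `shares_by_label` helper and the four toy turns are assumptions for illustration and have nothing to do with the real corpus figures above.

```python
from collections import defaultdict

def shares_by_label(turns):
    """turns: iterable of (label, word_count).

    Returns {label: (share_of_turns, share_of_words)}.
    """
    turn_counts = defaultdict(int)
    word_counts = defaultdict(int)
    for label, n_words in turns:
        turn_counts[label] += 1
        word_counts[label] += n_words
    total_turns = sum(turn_counts.values())
    total_words = sum(word_counts.values())
    return {label: (turn_counts[label] / total_turns,
                    word_counts[label] / total_words)
            for label in turn_counts}

# Toy data: (classifier label, number of words in the turn).
toy_turns = [("female", 40), ("male", 60), ("male", 50), ("male", 50)]

for label, (turn_share, word_share) in shares_by_label(toy_turns).items():
    print(f"{label}: {turn_share:.1%} of turns, {word_share:.1%} of words")
```

When a label's word share is lower than its turn share, as for the "female" label in this toy data, its turns are shorter than average.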
We should note that all of these numbers may be influenced by our selection of 170 radio stations, which were chosen for geographic balance and for utility to our journalist partners. The conclusions may not hold generally for all radio stations in the United States, nor for any individual radio station.
Diarization is just one of the steps we need to get right in order to surface interesting voices to our Earshot users. In future posts we’ll discuss the transcription (speech-to-text) part of the pipeline and how we optimized our speech recognizer models for the domain of talk radio. We’ll also talk about audio fingerprinting, which lets us find syndicated content more easily within our corpus, and about the natural language processing tools we’ve written to extract meaning from this ever-growing corpus of talk radio transcripts. If you’re interested in learning more about the work we’re doing, follow our Medium account.
[1] D. Povey et al., “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa Village, Hawaii, USA, Aug. 2011.
[2] S. Meignier, T. Merlin, “LIUM SpkDiarization: An Open Source Toolkit For Diarization,” in Proc. CMU SPUD Workshop, Dallas (Texas, USA), March 2010.
[3] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin, S. Meignier, “An Open-source State-of-the-art Toolbox for Broadcast News Diarization,” Interspeech, Lyon (France), 25–29 Aug. 2013.