Case Study: Utilizing Podcasts for Topic Modeling

ChunYu Ko
The whispers of a data analyst
5 min read · Dec 17, 2023


Data Scientists Also Need to Join User Interviews

As a data analyst, I often participate in user interviews for our products. After each interview, we organize what the users have said, categorizing insights into topics such as user requirements or the user journey. This helps us think through different aspects of the product experience.

Some tricks to apply after a user interview

Reflecting on direct user feedback is immensely beneficial for data analysts. Most of the time, I deal with cold, impersonal data that lacks warmth and a human touch. Without that empathy for people and society, forming hypotheses is challenging, and without sound hypotheses it is difficult to use data effectively to validate business decisions.

I am comfortable with interviews conducted in English and Chinese, and I find many interesting viewpoints in them. Interviews in other languages, however, are more challenging. It is also impractical for me to attend every interview, and there is an abundance of historical interview data. Sifting through original recordings, audio, or transcripts to extract valuable information quickly is often a daunting task.

Therefore, my personal efficiency in such research has always been somewhat limited. However, I believe that as various excellent pre-trained models are being open-sourced, the line between quantitative and qualitative research is increasingly blurring. As data scientists and analysts, we must not confine ourselves to merely simple and well-organized data.

Analyzing the Content of Podcasts

I’ve actually wanted to perfect this workflow for many years, but it wasn’t until yesterday that I finally had the time to organize everything. I envisioned an engaging workflow that could extract key information from user interviews or my favorite podcasts through a series of NLP tools, thus eliminating the need to spend a lot of time understanding and storing this knowledge.

I used “Casual Inference Season 5 Episode 1” as an example. This is a podcast I really enjoy, hosted by two distinguished professors in biostatistics and epidemiology, Lucy D’Agostino McGowan and Ellie Murray, in collaboration with the American Journal of Epidemiology.

I finished listening to this episode last Wednesday, and the information it contained was incredibly inspiring to me. This episode was re-released in memory of the late great biostatistician Ralph B. D’Agostino, Sr., having been originally aired in 2021.

Interestingly, the guests in this episode were the father-son duo, Ralph D’Agostino Sr. and Ralph D’Agostino Jr. The host, Lucy D’Agostino McGowan, is the granddaughter of Ralph D’Agostino Sr. They are all experts who have made significant contributions to human health and public health.

About the workflow

Topics in this episode

The following workflow can be executed in Colab, but note that the code is unorganized and quite messy, so please use it with caution.

You can find my Jupyter Notebook here.

  1. Download the MP3 audio file of S5E1 from the official website.
  2. Use OpenAI’s “whisper-large-v3” model for Automatic Speech Recognition to transcribe each sentence with timestamps.
  3. Employ the pyannote/embedding model to convert each sentence into an audio embedding.
  4. Apply UMAP for dimensionality reduction of audio embeddings, followed by HDBSCAN for identifying different speakers.
  5. Use BERTopic, which combines models such as the Sentence Transformer “all-mpnet-base-v2”, UMAP, and HDBSCAN, for topic modeling of each sentence.
  6. Identify and merge topics with high similarity among the smaller topics, then organize sentences and topics by speaker to avoid fragmenting sentences that belong to the same topic.
  7. Use OpenAI’s API “gpt-3.5-turbo-instruct-0914” for summarizing important topics.
  8. Visualize the topics discussed by each speaker at different time points using Plotly.
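
The speaker-level regrouping in step 6 can be sketched in plain Python. This is my own minimal illustration, not the notebook’s exact code, and it assumes a hypothetical segment format where each ASR sentence already carries a speaker label and timestamps:

```python
from itertools import groupby

def merge_by_speaker(segments):
    """Merge consecutive segments spoken by the same speaker.

    Each segment is a dict with 'speaker', 'start', 'end', and 'text'.
    Returns one merged segment per uninterrupted speaker turn, so that
    topic modeling does not fragment sentences within the same turn.
    """
    merged = []
    for speaker, turn in groupby(segments, key=lambda s: s["speaker"]):
        turn = list(turn)
        merged.append({
            "speaker": speaker,
            "start": turn[0]["start"],
            "end": turn[-1]["end"],
            "text": " ".join(s["text"] for s in turn),
        })
    return merged

# Toy example: two sentences from speaker A, then one from speaker B
segments = [
    {"speaker": "A", "start": 0.0, "end": 1.2, "text": "Welcome back"},
    {"speaker": "A", "start": 1.2, "end": 2.0, "text": "to the show."},
    {"speaker": "B", "start": 2.0, "end": 3.5, "text": "Thanks for having me."},
]
print(merge_by_speaker(segments))
```

In the real workflow, the `speaker` labels would come from the HDBSCAN clusters over the audio embeddings (step 4) rather than being hand-assigned.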

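The “merge topics with high similarity” part of step 6 can likewise be sketched with cosine similarity over topic embeddings. This is a toy illustration using NumPy only; the 0.9 threshold and the two-dimensional vectors are assumptions, and in practice the vectors would be BERTopic’s topic embeddings:

```python
import numpy as np

def merge_similar_topics(topic_vecs, threshold=0.9):
    """Group topic ids whose cosine similarity exceeds the threshold.

    Uses a small union-find over the pairwise similarity matrix and
    returns a list of groups of topic indices to be merged.
    """
    vecs = np.asarray(topic_vecs, dtype=float)
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = len(vecs)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Two near-duplicate topic vectors and one distinct topic
vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(merge_similar_topics(vecs))
```

BERTopic also ships its own topic-reduction utilities, which would be the idiomatic choice in production; the sketch above just makes the merging criterion explicit.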
Result

Visualized Podcast

In the end, you will be able to see the following:

This Episode is about D’Agostino Family’s Contributions to Biostatistics and Epidemiology.

The key topics in this episode include:

  • Data Science and the Role of Statisticians in the Next 100 Years
  • Memorial Episode for Ralph D’Agostino Sr.
  • Collaboration and Versioning in Causal Inference
  • Framingham Heart Study and Risk Functions
  • The D’Agostino Family’s Contributions to Biostatistics and Epidemiology

Challenges

Cluster of Vocal Characteristics

There are some issues here.

  1. The code is not standardized, making it difficult to reuse. I think LangChain could be very helpful, but I haven’t had enough time to explore it yet.
  2. A systematic method to effectively store, visualize, and reuse both the simplified and the original audio information is still missing, as far as I know.
  3. The ASR occasionally produces sentences that are too short. I’ve made adjustments in the program, but this may affect the quality of the audio embeddings.
  4. There should be a more appropriate method for speaker diarization, such as the Google Speech-to-Text API or pyannote/speaker-diarization. That said, the method I used here is quite interesting, as it lets you literally see different vocal characteristics being captured and grouped, as shown above.
  5. Simplifying the number of topics in the Topic Model relies a bit on manual identification, which is not ideal.
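
For issue 3, the kind of adjustment I mean can be sketched as follows. This is a hedged illustration of one possible fix, not the notebook’s exact code: fold any segment shorter than a minimum duration into the preceding segment before computing audio embeddings, so each embedding gets enough signal:

```python
def merge_short_segments(segments, min_dur=1.0):
    """Fold segments shorter than min_dur seconds into the previous one.

    Each segment is a dict with 'start', 'end', and 'text'. Very short
    ASR outputs (fillers, clipped words) tend to yield noisy audio
    embeddings, so they are absorbed into their neighbor instead.
    """
    merged = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if merged and duration < min_dur:
            prev = merged[-1]
            prev["end"] = seg["end"]
            prev["text"] = prev["text"] + " " + seg["text"]
        else:
            merged.append(dict(seg))
    return merged

# Toy transcript with a 0.3-second filler in the middle
segs = [
    {"start": 0.0, "end": 2.0, "text": "So the Framingham Heart Study"},
    {"start": 2.0, "end": 2.3, "text": "uh"},
    {"start": 2.3, "end": 4.0, "text": "changed cardiovascular risk modeling."},
]
print(merge_short_segments(segs))
```

The 1.0-second cutoff is an assumption; the right value depends on how aggressively the ASR model splits sentences.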
