Qualitative Data Analysis: Python, Research Theory, Whisper, NVivo, and more

Howard Tobochnik
6 min read · Feb 4, 2024


This article contains three distinct sections. Each can be read individually. All Python code is contained in section 2.

Section 1: Introduction, Theory, and Real-Life Examples

Section 2: Python, Whisper, Speaker Diarization, and Transcribing Audio Files

Section 3: NVivo and other computer-assisted software for qualitative data analysis (coming soon…)

SECTION 1: Introduction, Theory, and Real-Life Examples

Computer-assisted software for qualitative data analysis is often overlooked.

This is not surprising for two reasons:

  1. Unlike quantitative data, which can give us the answer to complex questions in the form of a simple number (e.g. 42), qualitative data can’t be analyzed this way. You can’t take the average of a sentence and there’s no obvious algorithm for analyzing text.
  2. Qualitative data seems more manageable for the human mind. I can read a hundred interview transcripts, jot down several notes from each, and pick out the top five major themes. In fact, I can almost certainly do this more competently than even the most advanced AI model today.

And so it seems that there is no best practice for understanding the main takeaways from words, like there is for numbers.

How should you deal with qualitative data? Is there a best practice? Or at least common approaches?

One solution is to collect qualitative data in a quantitative form. Instead of asking open-ended survey questions such as, “how do you feel today?” where the answers can range from “superb” to “you know, it’s just one of those days,” we can use a Likert Scale: “On a scale from 1–10, how do you feel today?”

Another solution is using binary variables. Gender is often encoded as 1 or 0 for male/female. This can be extended to qualitative variables that have more than two values by creating several binary variables. For example, if we have clients in the US, Japan, Germany, Israel, and India, we would create five binary variables: US client = 1, non-US client = 0; Japanese client = 1, non-Japanese client = 0; etc. (This is called one-hot encoding.)
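To make this concrete, here is a minimal sketch of one-hot encoding with pandas, using a hypothetical client table (the column names are made up for illustration):

import pandas as pd

# Hypothetical client data; the country column is qualitative
clients = pd.DataFrame({
    "client_id": [101, 102, 103, 104, 105],
    "country": ["US", "Japan", "Germany", "Israel", "India"],
})

# One-hot encode the country column into five binary columns
encoded = pd.get_dummies(clients, columns=["country"], dtype=int)
print(encoded)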

A third solution, for longer and more complicated texts, is to count the number of times a certain word appears, or to classify each text by its main theme and then count how many fall into each group. This approach can be executed quite effectively using machine learning algorithms. For example, a 2022 study analyzed 200,000 US congressional speeches and 5,000 presidential communications since 1880 to determine whether feelings towards immigrants were becoming more positive or negative (the results may surprise you!). The researchers started with 17 million speeches, narrowed this population down to a sample of only those that mentioned immigration, and then categorized each as either positive or negative.
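As a toy illustration of the counting approach (this is not the method the congressional-speech study used, just a sketch with made-up responses):

from collections import Counter

# Hypothetical survey responses
responses = [
    "Shipping was slow and support never replied.",
    "Great product, fast shipping, would buy again.",
    "Support was helpful but the product broke after a week.",
]

# Count how many responses mention each keyword of interest
keywords = ["shipping", "support", "product"]
counts = Counter()
for text in responses:
    lower = text.lower()
    for kw in keywords:
        if kw in lower:
            counts[kw] += 1

print(counts)  # Counter({'shipping': 2, 'support': 2, 'product': 2})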

These are good solutions for certain data. But are you satisfied?

Perhaps not. The reason is that these approaches are not great at answering the real-life questions that we actually care about, especially the complex ones.

The goal of data analysis — for companies, governments, or almost anyone except for universities — is to create insights that can inform action. A company doesn’t really care how many customers are unsatisfied, but rather why they are unhappy and what the most cost-effective remedies are.

Many of these answers can be found in quantitative data. Netflix has personalized its product for every customer without using a single survey or any explicit feedback. Many companies rely on recommendation models built purely from implicit quantitative data: matching item attributes against a customer’s purchase or viewing history (content-based filtering) and exploiting similarities between users or items (collaborative filtering).
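To make the idea concrete, here is a minimal sketch of item-item collaborative filtering on a small, entirely hypothetical interaction matrix (an illustration of the general technique, not any company’s actual system):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical implicit-feedback matrix: rows are users, columns are items,
# 1 means the user watched/purchased the item, 0 means they did not
interactions = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
])

# Item-item collaborative filtering: score unseen items for user 0 by
# similarity to the items they already interacted with
item_similarity = cosine_similarity(interactions.T)
user = interactions[0]
scores = item_similarity @ user
scores[user == 1] = -np.inf  # don't recommend items already seen
print("Recommend item:", int(np.argmax(scores)))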

Still, there’s no substitute for human insight.

Last year, I was the primary research assistant for a qualitative academic research project. The focus of the study was to understand the motives, mechanisms, impact, and limitations of the ESG movement, which seeks to leverage investors, through ESG metrics and engagement, to pressure companies to reduce emissions.

Our main source of data was elite interviews with executives at large oil & gas companies and investment firms.

Before even getting to the challenge of analyzing these interviews (which I will give one approach to in section 3), there were several crucial steps:

  • Choosing the right people to interview.
  • Convincing these people to speak to us.
  • Writing interesting and useful questions.
  • Scheduling and carrying out the interviews.
  • Transcribing the audio files.
  • Cleaning the transcripts.

Section 2 will show the technical aspects of transcribing audio files into text files.

SECTION 2: Python, Whisper, Speaker Diarization, and Transcribing Audio Files

Much of this Python code can be found with better and more detailed documentation in various GitHub repositories (such as this one from Tel Aviv University, although in Hebrew).

However, I made some changes to either simplify or customize the process to my liking. I’m using Jupyter Notebook on macOS 14.2.1.

For our data source, I use YouTube videos of interviews. You should be able to copy the code below and run it after changing only the URL and the output path.

This code requires Waveform (.wav) audio files in mono format, meaning the file contains only one channel of audio. If you are using audio files from other sources (i.e. not downloaded from YouTube using the code I provide below), make sure they are in the correct form.
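If your recordings come from elsewhere (for example, a voice recorder or a video call), a sketch like the following, using the same ffmpeg-python chain as the code below with hypothetical file paths, should convert them to mono WAV:

import ffmpeg

# Hypothetical paths: convert an existing recording to 16-bit mono WAV
input_path = 'file_path/original_recording.m4a'
output_path = 'file_path/file_name.wav'

(
    ffmpeg
    .input(input_path)
    .output(output_path, format='wav', acodec='pcm_s16le', ac=1)  # ac=1 forces mono
    .run()
)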

In this code snippet, all you need to do is enter a URL to a YouTube video and the file path to where you want to save the file:

from pytube import YouTube
import ffmpeg

# This example uses an interview between Heather Cox Richardson and President Joe Biden
url = 'https://www.youtube.com/watch?v=tGRXnB_GQcM&t=13s'

yt = YouTube(url)

# Getting the highest quality audio stream URL
stream_url = yt.streams.filter(only_audio=True).order_by('abr').desc().first().url

# Specify your desired output path
output_path = 'file_path/file_name.wav'

# Run ffmpeg subprocess to convert the audio stream to mono WAV
(
    ffmpeg
    .input(stream_url)
    .output(output_path, format='wav', acodec='pcm_s16le', ac=1)  # ac=1 ensures mono output
    .run()
)

print(f"Audio has been saved to {output_path} in mono WAV format.")

Now you should see file_name.wav in the folder you chose.

Next, we will use a combination of Whisper (to transcribe) and pyannote.audio (to distinguish between different speakers, a step known as speaker diarization).

# import libraries
import whisper
import os
from os.path import join
import time
import datetime
import subprocess
import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Audio
from pyannote.core import Segment
import wave
import contextlib
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# load the pretrained speaker embedding model (runs on CPU here)
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=torch.device("cpu"))

# Set audio file location
audio_file = "file_path/file_name.wav"

# Folder where the transcription text files will be saved
transcription_folder_path = "/Users/howard/Audio/"

# Create the transcription folder if it does not exist
if not os.path.exists(transcription_folder_path):
    os.makedirs(transcription_folder_path)

# load whisper model - this step can take a long time.
# model choices include ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large']
model = whisper.load_model("medium.en")

lang = 'en'
num_speakers = 2

# start a timer to measure how long transcription takes
start_time = time.time()

# transcribe interview file
result = model.transcribe(audio_file, verbose=False, language=lang)  # to translate, add task='translate'
segments = result["segments"]

print(f"\033[1m--- Transcribed file in {time.time() - start_time:.1f} seconds ---")

# read basic properties of the WAV file
with contextlib.closing(wave.open(audio_file, 'r')) as f:
    frames = f.getnframes()
    rate = f.getframerate()
    duration = frames / float(rate)

print(f"\033[1m--- The file contains {frames} audio frames ---")
print(f"\033[1m--- The sampling frequency is {rate} Hz ---")
print(f"\033[1m--- The duration is {duration} seconds ---")

audio = Audio()

def segment_embedding(segment):
    # compute a speaker embedding for one Whisper segment
    start = segment["start"]
    # Whisper overshoots the end timestamp in the last segment
    end = min(duration, segment["end"])
    clip = Segment(start, end)
    waveform, sample_rate = audio.crop(audio_file, clip)
    return embedding_model(waveform[None])

embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
    embeddings[i] = segment_embedding(segment)
embeddings = np.nan_to_num(embeddings)

# cluster the segment embeddings into num_speakers groups
clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
labels = clustering.labels_
for i in range(len(segments)):
    segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)

def format_timestamp(secs):
    # format seconds as H:MM:SS (renamed so it does not shadow the time module)
    return datetime.timedelta(seconds=round(secs))

# Extract the name of the audio file without the extension
audio_filename = os.path.splitext(os.path.basename(audio_file))[0]

# Construct the path to the transcription file
transcription_file_path = os.path.join(transcription_folder_path, audio_filename + ".txt")

# write the transcript, adding a speaker header whenever the speaker changes
f = open(transcription_file_path, "w", encoding='UTF-8')
for (i, segment) in enumerate(segments):
    if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
        f.write("\n" + segment["speaker"] + ' ' + str(format_timestamp(segment["start"])) + '\n')
    f.write(segment["text"][1:] + ' ')  # drop the leading space Whisper adds to each segment
f.close()
print(f"\033[1m--- Finished writing to {transcription_file_path} ---")

The result, although not perfect, is a text file with a speaker label and timestamp at each change of speaker, followed by that speaker’s text. The voice-to-text transcription is impressively accurate. The speaker diarization is less accurate and may require a bit of cleaning up.
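One lightweight way to clean things up, offered here only as a hypothetical sketch that would run on the segments list before the file is written, is to reassign very short speaker turns that are sandwiched between turns from a single other speaker, since these are often diarization errors:

def smooth_speakers(segments, min_duration=2.0):
    # Reassign short turns surrounded by the same other speaker (heuristic, not part of the pipeline above)
    for i in range(1, len(segments) - 1):
        prev_spk = segments[i - 1]["speaker"]
        next_spk = segments[i + 1]["speaker"]
        length = segments[i]["end"] - segments[i]["start"]
        if segments[i]["speaker"] != prev_spk and prev_spk == next_spk and length < min_duration:
            segments[i]["speaker"] = prev_spk
    return segments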

SECTION 3: NVivo and other computer-assisted software for qualitative data analysis

Coming Soon…
