‘Wait…who said what?!’ Let’s Talk About How to Clean Up Meeting Transcripts from Those Early Days of COVID Zooms.
Remember those early days of COVID and the Zoom learning curve? Background noise, unmuted speakers, and frequent interruptions became routine in online work meetings as we all struggled to find our way with this new normal.
Since then, we’ve come a long way with both our online etiquette and Zoom’s features, including many transcription improvements. But despite these advancements, some of the old mishaps still endure in the meeting transcript archives, a challenge I recently had to tackle when a data project required me to reduce the noise of older transcripts with natural language processing.
Here is what I learned.
The Two Biggest Issues
The majority of issues in the transcript corpus stemmed from incorrect speaker identification. For example, a transcript would contain:
Alice: Hi, Adil!
Adil: How are you?
Adil: I am well. How about you?
Alice: I am great. Hello everyone. It’s great to see you Pedro
Adil: Hi all. Great to see you as well, Alice.
However, when compared with the video, it became apparent that what actually transpired was:
Alice: Hi, Adil! How are you?
Adil: I am well. How about you?
Alice: I am great. Hello everyone. It’s great to see you Pedro.
Pedro: Hi all. Great to see you as well, Alice.
There were two main issues with the utterances in our original transcripts.
First, the transcription software would split a single utterance into two and assign them to two speakers. This issue is illustrated in the example above when Alice’s utterance of “How are you?” gets incorrectly assigned to Adil.
The second issue is that the transcription software often struggled to detect the correct number of speakers on the call. The software transcribed the spoken words accurately, but people coughing, a pencil falling on the table, or a speaker pausing to collect their thoughts could all trigger a spurious change of speaker.
I needed to improve the transcript quality; now, it was a question of determining the best methodology.
How to Fix Transcripts Via Facial Detection
The original videos were Zoom calls, which offer two visual layouts.
The first, Tile View, presents video thumbnails of each meeting attendee on the screen; Zoom did not always denote who was speaking in this layout. The other layout, Active Speaker View, shows a single video of the individual Zoom identified as the speaker. For this exercise, we analyzed the Active Speaker View recordings, as Tile View would require a different methodology and could not be verified using a single frame.
You may wonder why the discussion focuses on what is on the screen rather than what the audio contains. Since none of the transcripts correctly assigned every utterance to the right speaker, I needed to create a verified test set for future model fine-tuning. Visually checking whether a set of images contains the same person wasn’t the issue; humans can do that almost instantly.
The more significant challenge was verifying if a set of audio clips contained the same voice, which is a far more time-consuming and error-prone task. For this, I turned to computer vision to correctly determine which speaker corresponded to an utterance.
I used open-source facial detection and comparison libraries to improve the transcripts. For each utterance, a single frame is extracted from the video at the corresponding timestamp. Then, all of the faces in that frame are detected using RetinaFace.
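As a rough illustration of those two steps, the sketch below grabs a frame at an utterance’s timestamp with OpenCV and runs detection through DeepFace’s RetinaFace backend. The file name, timestamp, and helper functions are illustrative assumptions, not the project’s actual code (which is linked at the end of this post).

```python
# Minimal sketch: grab one frame per utterance and detect faces with the
# RetinaFace backend exposed by the DeepFace library.
import cv2
from deepface import DeepFace

def extract_frame(video_path: str, timestamp_sec: float):
    """Return the video frame closest to the given timestamp (BGR array), or None."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, timestamp_sec * 1000)
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None

def detect_faces(frame):
    """Detect all faces in a frame using RetinaFace."""
    try:
        return DeepFace.extract_faces(img_path=frame, detector_backend="retinaface")
    except ValueError:
        # DeepFace raises when no face is found (e.g., a shared-screen slide).
        return []

# Example: one utterance starting 12.4 seconds into the recording.
frame = extract_frame("meeting.mp4", 12.4)
faces = detect_faces(frame) if frame is not None else []
print(f"Detected {len(faces)} face(s)")
```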
RetinaFace did a great job of avoiding phantom faces in our dataset. A phantom face, shown below, occurs when an object is incorrectly identified as a person’s face. RetinaFace also consistently detected faces when the speaker shared a presentation on their screen or when a speaker’s video feed was relegated to a small portion of the screen.
Using the ArcFace model, the extracted faces are compared. If a new face is identified, it is added to the speaker list; if the face matches a previously detected speaker, the utterance is assigned that speaker’s label. When multiple faces or no face was detected, we assigned the corresponding utterance to speaker_-1 so that it could be manually reviewed. Human verification was done to ensure the validity of the extracted data, as well as to build a test set that could be used for future improvements.
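Roughly, the matching loop might look like the sketch below, which reuses the frame and detections from the previous snippet. The reference-crop strategy, the use of DeepFace.verify with ArcFace and the "skip" detector backend (available in recent DeepFace releases), and the function names are assumptions made for illustration; the actual implementation is in the repository linked below.

```python
# Sketch of the comparison step: match each utterance's face against one
# representative crop per known speaker using ArcFace.
from deepface import DeepFace

UNKNOWN = "speaker_-1"  # fallback label for manual review

def crop_face(frame, detection):
    """Cut the detected face region out of the original frame."""
    area = detection["facial_area"]
    x, y, w, h = area["x"], area["y"], area["w"], area["h"]
    return frame[y:y + h, x:x + w]

def assign_speaker(face_crop, known_speakers: dict) -> str:
    """Match a face crop against known speakers; register it as new if unmatched."""
    for label, reference_crop in known_speakers.items():
        result = DeepFace.verify(
            img1_path=face_crop,
            img2_path=reference_crop,
            model_name="ArcFace",
            detector_backend="skip",  # inputs are already cropped faces
        )
        if result["verified"]:
            return label
    new_label = f"speaker_{len(known_speakers)}"
    known_speakers[new_label] = face_crop
    return new_label

def label_utterance(frame, detections, known_speakers: dict) -> str:
    """Label one utterance from the faces detected in its frame."""
    if len(detections) != 1:
        # Zero faces or multiple faces on screen: flag for manual review.
        return UNKNOWN
    return assign_speaker(crop_face(frame, detections[0]), known_speakers)
```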
Conclusion
Our transcript improvement script significantly reduces the time required to process and correct a transcript.
In general, once the transcript is processed, it takes less than five minutes to verify and correct the output. I save each frame in a folder corresponding to the speaker it is assigned to.
Later, we simply check that the images in each folder all correspond to the same person. Any image that was not correctly identified can be moved to the appropriate folder to correct the speaker assignment. Since each folder corresponds to one speaker, and each frame corresponds to an utterance, new speaker assignments can be generated directly from the directory structure, as sketched below.
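As a sketch of that last step, assuming each frame is saved as frames/&lt;speaker_label&gt;/&lt;utterance_id&gt;.jpg (a hypothetical naming scheme), the corrected mapping can be rebuilt with a short directory walk:

```python
# Rebuild speaker assignments after manual review, assuming frames are stored
# as frames/<speaker_label>/<utterance_id>.jpg (illustrative layout).
from pathlib import Path

def assignments_from_folders(root: str) -> dict:
    """Map utterance IDs to speaker labels based on the reviewed folder layout."""
    assignments = {}
    for speaker_dir in Path(root).iterdir():
        if not speaker_dir.is_dir():
            continue
        for frame_path in speaker_dir.glob("*.jpg"):
            utterance_id = frame_path.stem  # e.g. "0042"
            assignments[utterance_id] = speaker_dir.name
    return assignments

# Example: after moving any misfiled images, regenerate the mapping.
corrected = assignments_from_folders("frames")
```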
Below is a representation of what a transcript looks like before and after being run through our transcript improvement software. In this example, the original transcript identifies and assigns utterances to only four speakers. After reprocessing the transcript with the two DeepFace models, eleven speakers are identified. Quite a difference!
If you happen to have some active speaker Zoom transcripts lying around that you would like to improve, we have made the code available on GitHub.
Blog Contributors
Devon Amsterdam is the blog’s author. He is a data science consultant for The Rockefeller Foundation and is interested in mathematics and machine learning.
Chantelle Norton is the blog’s graphic designer. She is a multidisciplinary designer and artist. Since 2020, she has worked with The Rockefeller Foundation Data Science Team, creating images and infographics that simplify complex data.