Fano Labs Ranked Global Top 5 in the DIHARD III Competition: Diarization

Published in

Fano Labs

Mar 5, 2021

Diarization is the process of partitioning audio by speaker. It is required when we use Automatic Speech Recognition (ASR) to transcribe conversations from business phone calls (e.g. customer service hotlines) and meetings into text, because the recording system captures all speakers' voices on the same track. To determine precisely who said what and when, good speaker diarization technology is crucial.
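To make "who said what and when" concrete, here is a minimal sketch of how diarization output could be combined with ASR output. It assumes the diarizer returns speaker-labelled time segments and the ASR returns word-level timestamps; the `Segment` and `Word` types, the function names, and the sample data are all hypothetical and are not Fano Labs' system.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A diarization segment (hypothetical format): one speaker, one time span."""
    start: float
    end: float
    speaker: str


@dataclass
class Word:
    """An ASR word with timestamps in seconds (hypothetical format)."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time spans."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, segments):
    """Label each word with the speaker whose segment overlaps it the most."""
    labelled = []
    for w in words:
        best_speaker, best_overlap = "unknown", 0.0
        for s in segments:
            o = overlap(w.start, w.end, s.start, s.end)
            if o > best_overlap:
                best_speaker, best_overlap = s.speaker, o
        labelled.append((best_speaker, w.text))
    return labelled


def to_turns(labelled):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, text in labelled:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


# Made-up sample data for illustration only.
segments = [Segment(0.0, 2.0, "agent"), Segment(2.0, 4.5, "customer")]
words = [
    Word(0.1, 0.5, "How"), Word(0.6, 0.9, "can"), Word(1.0, 1.2, "I"),
    Word(1.3, 1.8, "help"), Word(2.2, 2.6, "My"), Word(2.7, 3.1, "card"),
    Word(3.2, 3.9, "expired"),
]
turns = to_turns(assign_speakers(words, segments))
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

Picking the segment with the largest overlap is only one simple choice. Overlapping speech, where two speakers talk at once, needs more careful handling than this sketch provides.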

Diarization also takes speech analytics to a new level. Enterprises want to run big data analysis on conversations to learn more about customer behavior and to find business insights or ways to improve their business and services. In addition, regulators have imposed many compliance policies on enterprises, especially financial institutions, which must ensure that their staff follow these policies to avoid penalties. Knowing precisely what customers and staff said therefore becomes even more important, which drives the need for better diarization technology.

Building good diarization technology is not only a matter of separating the speakers in a recording. In practice, we also have to handle severe background noise, side speech, overlapping speech, short sentences, and similar difficulties.

Fano Labs' research engineer, Mr. Leung Tsun Yat ("TY"), with the assistance of the Lead Speech Scientist, Dr. Lahiru Thilina Samarakoon, recently took part for the first time in the Third DIHARD Speech Diarization Challenge (DIHARD III), which attracted experts in this field from around the globe. The task evaluated in the challenge is speech diarization: determining "who spoke when" in a multi-speaker environment based only on audio recordings. Using the latest AI technologies for end-to-end speech diarization, TY ranked in the global top 5 in the "Diarization from Scratch" track, an outstanding achievement that shows Fano Labs has the expertise and capability to deliver world-class performance.

The challenge is intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. Speaker diarization is evaluated under two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch) and across 11 diverse domains. The domains span a range of recording conditions and interaction types, including read audiobooks, meeting speech, clinical interviews, web videos, and, for the first time, conversational telephone speech. Following this excellent result in diarization, Fano Labs will apply the technology to different solutions to meet customers' needs in different scenarios.
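The post does not say how entries are scored. Diarization systems are commonly compared by diarization error rate (DER): the share of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. The sketch below is a deliberately simplified, frame-based version and is an assumption for illustration, not the challenge's official scoring. It assumes no overlapping speech and that hypothesis speaker labels have already been mapped to reference labels, which a real scorer does itself. The function names and sample segments are hypothetical.

```python
def frame_labels(segments, duration, step=0.01):
    """Turn (start, end, speaker) segments into one label per frame (None = silence)."""
    n = int(round(duration / step))
    labels = [None] * n
    for start, end, speaker in segments:
        first = int(round(start / step))
        last = min(n, int(round(end / step)))
        for i in range(first, last):
            labels[i] = speaker
    return labels


def simple_der(reference, hypothesis, duration, step=0.01):
    """Simplified DER: (missed + false alarm + confusion) / reference speech frames.

    Assumes at most one speaker per frame and pre-mapped speaker labels.
    """
    ref = frame_labels(reference, duration, step)
    hyp = frame_labels(hypothesis, duration, step)
    missed = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    false_alarm = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    confusion = sum(
        1 for r, h in zip(ref, hyp) if r is not None and h is not None and r != h
    )
    total = sum(1 for r in ref if r is not None)
    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


# Made-up 10-second example: the system switches speakers one second late
# and stops detecting speech one second early.
reference = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
hypothesis = [(0.0, 6.0, "A"), (6.0, 9.0, "B")]
der = simple_der(reference, hypothesis, duration=10.0)
print(f"DER = {der:.1%}")
```

In the "diarization from scratch" condition a system must also find where speech occurs, so missed speech and false alarms count against it directly. With a reference speech segmentation, speaker confusion is the main error source.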

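Business phone calls (for example, customer service hotlines) and meetings are often recorded for follow-up or analysis. When we want to use Automatic Speech Recognition (ASR) to convert these recorded conversations into text, diarization is a critical step. During recording, the voices of all speakers in the conversation are captured on the same audio track. Diarization separates the voices of the different speakers on that track and accurately identifies who said what and when, so good diarization directly improves ASR accuracy.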

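At the same time, diarization can take speech analytics to a new level. More and more enterprises want to use big data analysis of their conversations with customers to understand customers' behavior and thinking, and from this to derive business insights or find parts of their business or services to improve. In addition, regulators have set many compliance policies for enterprises (especially financial institutions) to follow, and enterprises must ensure that their staff comply with these policies to avoid regulatory penalties. Knowing accurately what customers and staff said has therefore become more important, and market demand for diarization technology keeps growing.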

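In developing accurate and reliable diarization technology, the difficulty is not only in telling apart the voices of multiple speakers. In practice, the technology must also handle challenges such as background noise, side speech, overlapping speech, and short sentences.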

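Leung Tsun Yat (梁晉溢, "TY"), a research engineer at Fano Labs (有光科技), represented the company in its first entry to the Third DIHARD Speech Diarization Challenge (DIHARD III), with the assistance of Lead Speech Scientist Dr. Lahiru Thilina Samarakoon. The challenge scores accurate diarization, that is, working out "who said what and when" in a recording with multiple speakers. Using the latest artificial intelligence technology, TY performed diarization from scratch on the audio provided by the competition and placed in the global top 5. This outstanding result shows that Fano Labs has internationally leading expertise and capability to provide customers with professional consulting and services.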

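The challenge aims to improve the accuracy of diarization technology across different recording equipment, background noise, and conversational sources. Diarization is evaluated under 2 segmentation conditions and across 11 different audio domains, including audiobooks, meeting conversations, web videos, and, for the first time, telephone conversations. Having achieved an excellent result in diarization, Fano Labs will apply the technology in different solutions to help meet customers' needs in different scenarios.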


Written by Fano Labs