Machine Learning Solutions For English Speaking

Moon · SpeakBit · Dec 28, 2022

CaGOP Explained — By SpeakBit

This series explains how machine learning models can help non-native speakers improve their English pronunciation.

Concepts:

  1. CAPT: Computer-Assisted Pronunciation Training.
  2. SOTA: State of the art.
  3. SOTA mispronunciation detection models: based on either deep neural networks (DNN) or GOP scoring.
  4. ASR: Automatic Speech Recognition. It mainly includes an acoustic module, a decoding module, and a scoring module.
  5. Acoustic module: It converts speech into frame-level phonetic posterior probabilities.
  6. Decoding module: It force-aligns the posterior probabilities into phonetic segments.
  7. Scoring module: It scores each segment against its reference phoneme. The dominant method is GOP.
  8. GOP: Goodness of Pronunciation. It is a confidence measure for phonetic pronunciation, relating the test speech to an ASR model trained on native speech (a minimal sketch follows the list). Later, weighted GOP and confused-phoneme sets were proposed to improve hard cases where similar-sounding phonemes could not be identified correctly.
ASR Components
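To make the scoring module concrete, here is a minimal NumPy sketch of the classic posterior-based GOP for one force-aligned segment. The function name and the toy posterior matrix are illustrative; real systems obtain the posteriors from the acoustic module and may use a likelihood-ratio formulation instead.

```python
import numpy as np

def gop_score(posteriors: np.ndarray, phoneme_idx: int) -> float:
    """Posterior-based Goodness of Pronunciation for one segment.

    posteriors:  (num_frames, num_phonemes) frame-level phoneme
                 posterior probabilities from the acoustic module.
    phoneme_idx: index of the reference (canonical) phoneme.

    GOP = (1/T) * sum_t log( P(p | o_t) / max_q P(q | o_t) )
    Scores near 0 mean the reference phoneme dominates every frame;
    large negative scores suggest a mispronunciation.
    """
    eps = 1e-10
    target = posteriors[:, phoneme_idx]
    best = posteriors.max(axis=1)
    return float(np.mean(np.log((target + eps) / (best + eps))))

# Toy example: 5 frames, 4 phoneme classes, reference phoneme index 2.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(4), size=5)  # each row sums to 1
print(gop_score(post, phoneme_idx=2))
```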

Context-aware GOP Model (original paper)

The framework proposed by this paper

Brief: This paper improved the existing pronunciation scoring model by injecting contextual information.

Problem: The limitations of GOP models come from forced alignment (no transitions between phonemes are modeled) and from scoring isolated phonetic segments (no context effects such as liaison, omission, or incomplete plosive sounds).

Solution: The authors introduced two factors to compensate for these limitations: a transition factor and a duration factor. Together, the new method is called the Context-aware Goodness of Pronunciation (CaGOP) scoring model. It improved detection accuracy at both the phoneme and sentence levels.

Transition factor: It identifies the transitions between phonemes and uses them to weight the frame-wise GOP.
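As an illustration of the idea only (the paper's exact transition measure differs), the sketch below treats frames whose neighbors change sharply as transition frames and down-weights them when averaging the frame-wise GOP. The cosine-distance heuristic and both function names are assumptions of this sketch.

```python
import numpy as np

def transition_weights(features: np.ndarray) -> np.ndarray:
    """Down-weight frames that look like phoneme transitions.

    features: (num_frames, dim) acoustic features for one segment.
    Heuristic (illustrative, not the paper's formula): frames whose
    neighbors differ sharply are treated as transition frames and
    receive smaller weights in the frame-wise GOP average.
    """
    a, b = features[:-1], features[1:]
    # Cosine similarity between consecutive frames.
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-10)
    change = np.concatenate([[0.0], 1.0 - cos])  # high near transitions
    w = 1.0 - change / (change.max() + 1e-10)    # low weight at transitions
    return w / w.sum()                           # weights sum to 1

def weighted_gop(posteriors, phoneme_idx, features):
    """Frame-wise GOP averaged with transition-aware weights."""
    eps = 1e-10
    frame_gop = np.log((posteriors[:, phoneme_idx] + eps)
                       / (posteriors.max(axis=1) + eps))
    return float(np.dot(transition_weights(features), frame_gop))
```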

Duration factor: It is calculated with a self-attention-based phonetic duration model.
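Another hedged sketch: assuming a duration model (not shown here) has already predicted how many frames a native speaker would spend on this phoneme given its sentence context, one plausible duration factor penalizes deviation from that prediction. The Gaussian form in log-duration space is an assumption, not the paper's formula; in CaGOP, factors like these modulate the frame-wise GOP to produce the final score.

```python
import numpy as np

def duration_factor(observed_frames: int, predicted_frames: float,
                    sigma: float = 1.0) -> float:
    """Illustrative duration factor (not the paper's exact formula).

    predicted_frames comes from a self-attention duration model
    (assumed, not implemented here). The factor decays as the
    observed duration drifts from the prediction, measured in
    log-duration space so that halving and doubling are symmetric.
    """
    dev = np.log(observed_frames + 1e-10) - np.log(predicted_frames + 1e-10)
    return float(np.exp(-(dev ** 2) / (2 * sigma ** 2)))

# Hypothetical example: the aligner gave 12 frames, the model predicted 9.
print(duration_factor(observed_frames=12, predicted_frames=9.0))
```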

The problem of forced alignment: Most ASR solutions depend on forced alignment, where the entire speech sequence is split into phonetic segments corresponding to the reference phonemes. These segments (1) include transitions between phonemes and (2) ignore word-level context. Phoneme transitions have higher entropy because they carry potentially misleading information unrelated to the target phoneme, and word-level context can change the original sound of a phoneme. These three kinds of information (center phoneme, transition, longer context) should be handled differently to improve scoring accuracy.

Transitions between phonemes have higher entropy.
Forced alignment example in PyTorch: https://pytorch.org/audio/main/tutorials/forced_alignment_tutorial.html
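The tutorial above relies on a trained acoustic model; to show only the alignment step itself, here is a toy monotonic Viterbi aligner over frame-level log posteriors. The dynamic program and the function name are a simplified sketch of what production aligners do, not a drop-in replacement for them.

```python
import numpy as np

def force_align(log_probs: np.ndarray, phonemes: list) -> list:
    """Toy monotonic forced aligner (dynamic programming).

    log_probs: (T, num_phonemes) frame-level log posteriors.
    phonemes:  reference phoneme indices for the utterance, in order.
    Returns, for each frame, the index into `phonemes` it is assigned
    to, choosing the monotonic mapping with the highest total score.
    """
    T, S = log_probs.shape[0], len(phonemes)
    emit = log_probs[:, phonemes]          # (T, S) emission scores
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)     # 0 = stay, 1 = advance
    dp[0, 0] = emit[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            move = dp[t - 1, s - 1] if s > 0 else -np.inf
            if move > stay:
                dp[t, s], back[t, s] = move + emit[t, s], 1
            else:
                dp[t, s], back[t, s] = stay + emit[t, s], 0
    # Backtrace from the last phoneme at the last frame.
    path, s = [], S - 1
    for t in range(T - 1, -1, -1):
        path.append(s)
        s -= back[t, s]
    return path[::-1]

# Toy demo: 6 frames, 3 reference phonemes drawn from 4 classes.
rng = np.random.default_rng(1)
lp = np.log(rng.dirichlet(np.ones(4), size=6))
print(force_align(lp, [2, 0, 3]))  # e.g. [0, 0, 1, 1, 2, 2]
```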

Fun fact: TOEFL and AZELLA use CAPT.

If you are interested in automated language learning, check out SpeakBit.
