Machine Learning Solutions For English Speaking

Moon · SpeakBit · Dec 28, 2022

CaGOP Explained — By SpeakBit

This series explains how machine learning models can help non-native speakers improve their English pronunciation.

Concepts:

  1. CAPT: Computer-Assisted Pronunciation Training.
  2. SOTA: State of the art.
  3. SOTA mispronunciation detection models: based on either deep neural networks (DNN) or GOP scoring.
  4. ASR: Automatic Speech Recognition. It mainly includes an acoustic module, a decoding module, and a scoring module.
  5. Acoustic module: It converts speech into frame-level phonetic posterior probabilities.
  6. Decoding module: It force-aligns the posterior probabilities into phonetic segments.
  7. Scoring module: It scores each segment against its reference phoneme. The dominant method is GOP.
  8. GOP: Goodness of Pronunciation. It is a confidence measure for phonetic pronunciation, relating the test speech to an ASR model trained on native speech (a minimal sketch follows the list). Later, weighted GOP and confused-phoneme sets were proposed to improve hard cases where similar-sounding phonemes could not be identified correctly.
ASR Components
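To make the scoring module concrete, here is a minimal NumPy sketch of the classic posterior-based GOP for one force-aligned segment. The function name and the toy posterior matrix are illustrative; real systems obtain the posteriors from the acoustic module and may use a likelihood-ratio formulation instead.

```python
import numpy as np

def gop_score(posteriors: np.ndarray, phoneme_idx: int) -> float:
    """Posterior-based Goodness of Pronunciation for one segment.

    posteriors:  (num_frames, num_phonemes) frame-level phoneme
                 posterior probabilities from the acoustic module.
    phoneme_idx: index of the reference (canonical) phoneme.

    GOP = (1/T) * sum_t log( P(p | o_t) / max_q P(q | o_t) )
    Scores near 0 mean the reference phoneme dominates every frame;
    large negative scores suggest a mispronunciation.
    """
    eps = 1e-10
    target = posteriors[:, phoneme_idx]
    best = posteriors.max(axis=1)
    return float(np.mean(np.log((target + eps) / (best + eps))))

# Toy example: 5 frames, 4 phoneme classes, reference phoneme index 2.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(4), size=5)  # each row sums to 1
print(gop_score(post, phoneme_idx=2))
```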

Context-aware GOP Model (original paper)

The framework proposed by this paper

Brief: This paper improved the existing pronunciation scoring model by injecting contextual information.

Problem: The limitations of GOP models come from forced alignment (no transitions between phonemes are modeled) and from scoring isolated phonetic segments (no context effects such as liaison, omission, or incomplete plosive sounds).

Solution: The authors introduced two factors to compensate for these limitations: a transition factor and a duration factor. Together, the new method is called the Context-aware Goodness of Pronunciation (CaGOP) scoring model. It improved detection accuracy at both the phoneme and sentence levels.

Transition factor: It identifies the transitions between phonemes and uses them to weight the frame-wise GOP.
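As an illustration of the idea only (the paper's exact transition measure differs), the sketch below treats frames whose neighbors change sharply as transition frames and down-weights them when averaging the frame-wise GOP. The cosine-distance heuristic and both function names are assumptions of this sketch.

```python
import numpy as np

def transition_weights(features: np.ndarray) -> np.ndarray:
    """Down-weight frames that look like phoneme transitions.

    features: (num_frames, dim) acoustic features for one segment.
    Heuristic (illustrative, not the paper's formula): frames whose
    neighbors differ sharply are treated as transition frames and
    receive smaller weights in the frame-wise GOP average.
    """
    a, b = features[:-1], features[1:]
    # Cosine similarity between consecutive frames.
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-10)
    change = np.concatenate([[0.0], 1.0 - cos])  # high near transitions
    w = 1.0 - change / (change.max() + 1e-10)    # low weight at transitions
    return w / w.sum()                           # weights sum to 1

def weighted_gop(posteriors, phoneme_idx, features):
    """Frame-wise GOP averaged with transition-aware weights."""
    eps = 1e-10
    frame_gop = np.log((posteriors[:, phoneme_idx] + eps)
                       / (posteriors.max(axis=1) + eps))
    return float(np.dot(transition_weights(features), frame_gop))
```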

Duration factor: It is calculated with a self-attention-based phonetic duration model.
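Another hedged sketch: assuming a duration model (not shown here) has already predicted how many frames a native speaker would spend on this phoneme given its sentence context, one plausible duration factor penalizes deviation from that prediction. The Gaussian form in log-duration space is an assumption, not the paper's formula; in CaGOP, factors like these modulate the frame-wise GOP to produce the final score.

```python
import numpy as np

def duration_factor(observed_frames: int, predicted_frames: float,
                    sigma: float = 1.0) -> float:
    """Illustrative duration factor (not the paper's exact formula).

    predicted_frames comes from a self-attention duration model
    (assumed, not implemented here). The factor decays as the
    observed duration drifts from the prediction, measured in
    log-duration space so that halving and doubling are symmetric.
    """
    dev = np.log(observed_frames + 1e-10) - np.log(predicted_frames + 1e-10)
    return float(np.exp(-(dev ** 2) / (2 * sigma ** 2)))

# Hypothetical example: the aligner gave 12 frames, the model predicted 9.
print(duration_factor(observed_frames=12, predicted_frames=9.0))
```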

The problem of forced alignment: Most ASR solutions depend on forced alignment, where the entire speech sequence is split into phonetic segments corresponding to the reference phonemes. These segments (1) include transitions between phonemes and (2) ignore word-level context. Phoneme transitions have higher entropy because they carry potentially misleading information unrelated to the target phoneme, and word-level context can change the original sound of a phoneme. These three kinds of information (center phoneme, transition, longer context) should be handled differently to improve scoring accuracy.

Transitions between phonemes have higher entropy.
Forced alignment example in PyTorch: https://pytorch.org/audio/main/tutorials/forced_alignment_tutorial.html
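The tutorial above relies on a trained acoustic model; to show only the alignment step itself, here is a toy monotonic Viterbi aligner over frame-level log posteriors. The dynamic program and the function name are a simplified sketch of what production aligners do, not a drop-in replacement for them.

```python
import numpy as np

def force_align(log_probs: np.ndarray, phonemes: list) -> list:
    """Toy monotonic forced aligner (dynamic programming).

    log_probs: (T, num_phonemes) frame-level log posteriors.
    phonemes:  reference phoneme indices for the utterance, in order.
    Returns, for each frame, the index into `phonemes` it is assigned
    to, choosing the monotonic mapping with the highest total score.
    """
    T, S = log_probs.shape[0], len(phonemes)
    emit = log_probs[:, phonemes]          # (T, S) emission scores
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)     # 0 = stay, 1 = advance
    dp[0, 0] = emit[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            move = dp[t - 1, s - 1] if s > 0 else -np.inf
            if move > stay:
                dp[t, s], back[t, s] = move + emit[t, s], 1
            else:
                dp[t, s], back[t, s] = stay + emit[t, s], 0
    # Backtrace from the last phoneme at the last frame.
    path, s = [], S - 1
    for t in range(T - 1, -1, -1):
        path.append(s)
        s -= back[t, s]
    return path[::-1]

# Toy demo: 6 frames, 3 reference phonemes drawn from 4 classes.
rng = np.random.default_rng(1)
lp = np.log(rng.dirichlet(np.ones(4), size=6))
print(force_align(lp, [2, 0, 3]))  # e.g. [0, 0, 1, 1, 2, 2]
```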

Fun fact: TOEFL and AZELLA use CAPT.

If you are interested in automated language learning, check out SpeakBit.
