Understanding Mispronunciation Detection Systems Part 3
Table of contents
· Introduction
∘ Evolution of End-to-end architectures
∘ Evolution of End-to-end MDD architectures
· System Architecture
∘ CNN-RNN-CTC based architecture
∘ Hybrid CTC-ATT-based architecture
∘ SED-MDD architecture
∘ Transformer based architecture (Wav2Vec)
· Evaluation Metrics
∘ Hierarchical Evaluation Structure
∘ FRR, FAR and DER
∘ Precision, Recall and F-measure
∘ Diagnosis and Detection Accuracy
· Speech Datasets
∘ TIMIT
∘ LibriSpeech
∘ Speechocean762
∘ ISLE
∘ L2-Arctic
∘ Data Augmentation
· Tools
∘ Kaldi
∘ ESPnet
∘ Praat
∘ TextGridTools
· Summary
· References
· Online reference links
Introduction
Previous articles (Part 1 and Part 2) established that End-to-End ASR models are simpler to build and optimize than conventional ASR systems because they eliminate the need to train multiple modules separately. This article begins by exploring the evolution of End-to-End ASR and MDD systems and then examines some of the popular End-to-End architectures for MDD tasks. It goes on to describe the hierarchical evaluation structure for mispronunciation and its related metrics, reviews notable datasets for ASR and MDD tasks, and finally highlights the tools and technologies most popular among researchers working on these tasks.
Evolution of End-to-end architectures
End-to-end (E2E) architectures for speech recognition have evolved steadily over time.
- Initially, (Graves & Jaitly, 2014) presented a system based on the combination of the deep bidirectional LSTM recurrent neural network architecture and Connectionist Temporal Classification objective function. They demonstrated that character-level speech transcription can be performed by a recurrent neural network with minimal preprocessing and no explicit phonetic representation. They also introduced a novel objective function that allows the network to be directly optimised for word error rate, even in the absence of a lexicon or language model.
- While CTC with RNNs makes it feasible to train end-to-end speech recognition systems, such models are computationally expensive and sometimes difficult to train. (Zhang et al., 2017) proposed an end-to-end framework for sequence labelling that combines hierarchical CNNs with CTC and requires no recurrent connections. The proposed model is not only computationally efficient but can also learn the temporal relations needed for integration with CTC.
- (Chorowski et al., 2014) introduced the attention mechanism to speech recognition tasks for the first time, replacing the traditional Hidden Markov Model (HMM) based approach. (Chorowski et al., 2015) then presented a refined Attention-based Recurrent Sequence Generator (ARSG), a recurrent neural network that stochastically generates an output sequence from an input. It is based on a hybrid attention mechanism that combines content and location information to select the next position in the input sequence for decoding. The proposed model could recognise utterances much longer than the ones it was trained on. In addition, the deterministic nature of ARSG's alignment mechanism simplifies the beam search procedure, which allows for faster decoding.
- (Watanabe et al., 2017) proposed a hybrid CTC/attention end-to-end ASR, which effectively utilises the advantages of both architectures in training and decoding. During training, a multi-objective learning method is employed by attaching a CTC objective to an attention-based encoder network as a regularisation. This greatly reduces the number of irregularly aligned utterances without any heuristic search techniques. A joint decoding approach is used by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. This method has outperformed both CTC and an attention model on ASR tasks in real-world noisy conditions as well as in clean conditions. This work can potentially be applied to any sequence-to-sequence learning task.
Evolution of End-to-end MDD architectures
The evolution of ASR technologies also led to their adaptation into MDD systems, because the underlying model architecture is similar for both.
- (Leung et al., 2019) present a CNN-RNN-CTC approach to building an end-to-end speech recognition system for the task of MDD. The approach needs neither phonemic nor graphemic information, and force-alignment is not required. It is one of the first end-to-end models proposed for MDD and works well with the CU-CHLOE corpus spoken by Cantonese and Mandarin speakers. It is also widely used as a baseline when evaluating MDD systems.
- (Lo et al., 2020; Yan et al., 2020; Zhang et al., 2020) propose hybrid CTC and attention-based end-to-end architectures for MDD. (Zhang et al., 2020) introduce a dynamic adjustment method for the parameter α used by (Watanabe et al., 2017). (Yan et al., 2020) use an anti-phone collection together with a label-shuffling scheme to generate additional speech training data as a novel data-augmentation operation: for every utterance in the original training set, the label of each phone in the reference transcript is either kept unchanged or randomly replaced with an arbitrary anti-phone label. (Lo et al., 2020) perform input augmentation with text prompt information to make the resulting E2E model more tailored for MDD.
- (Feng et al., 2020) build SED-MDD, a sentence-dependent end-to-end model for MDD. It is the first model to use both linguistic and acoustic features to address the MDD problem. The model includes a sentence encoder that extracts robust sequential representations of a given sentence and a sequence labelling model with an attention mechanism. It is trained from scratch with random initialization, does not use phonological rules or forced alignment, and only requires audio files, transcriptions, and annotation files. The model is evaluated on two publicly available corpora, TIMIT and L2-Arctic, which makes it a strong baseline for researchers. (Fu et al., 2021) present a system similar to SED-MDD in which phoneme sequences, rather than character sequences, are fed to the sentence encoder; since MDD aims to detect phoneme-level errors, using phoneme sequences is logical. They also propose three simple data augmentation techniques that address the imbalance between positive and negative samples in the L2-Arctic dataset. This approach improves accuracy compared to CNN-RNN-CTC and SED-MDD.
- Wu et al. (2021) introduced two Transformer-based models for MDD. The first, T-1, uses a standard encoder-decoder architecture with MFCC input and cross-entropy loss. The second, T-2, is based on Wav2vec 2.0 with raw audio input and CTC loss. Both models significantly outperform previous models (AGPM and CNN-RNN-CTC) in free-phone recognition and MDD. Similarly, Xu et al. (2021) used unlabeled data with Wav2vec 2.0 pre-training to extract speech representations, adding convolutional and adaptive pooling layers to assess pronunciation and treating MDD as a binary classification task. Peng et al. (2021) also explored the Wav2vec 2.0 model for MDD, verifying its effectiveness on the TIMIT and L2-Arctic datasets and demonstrating competitive performance in ultra-low-resource scenarios.
- End-to-end (E2E) models are typically trained with a cross-entropy criterion that optimizes log-likelihood. However, this criterion does not align with the most common MDD evaluation metric, the F1 score. In response, Yan et al. (2021) propose training E2E MDD models with a discriminative objective function that directly maximizes the expected F1 score. Experiments on the L2-ARCTIC dataset show significant performance improvements over state-of-the-art E2E MDD approaches. The novel maximum F1-score criterion (MFC), demonstrated with a hybrid CTC-Attention model, thus proves effective in enhancing model performance.
System Architecture
Although the underlying model architectures for ASR and MDD are similar, the two are distinct problems. The following are some of the most recent and popular end-to-end architectures for building MDD systems.
CNN-RNN-CTC based architecture
This model architecture was proposed by (Leung et al., 2019) and is considered one of the baseline architectures for MDD tasks. The model comprises five main parts, depicted in the following figure.
The first part is the input layer, which accepts framewise acoustic features, followed by a batch normalization layer and a zero-padding layer to standardise utterance lengths within a batch. The second part includes four CNN layers, two max-pooling layers, and another batch normalization layer; these convolutional layers extract high-level acoustic features, which help reduce phone error rates and improve performance in noisy conditions. The third part is a bi-directional RNN that captures temporal acoustic features, using GRUs instead of LSTMs for simpler and faster training. The fourth part consists of MLP layers (time-distributed dense layers), concluding with a softmax layer for classification output. Lastly, the model includes a CTC output layer responsible for generating the predicted phoneme sequence.
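For concreteness, below is a minimal PyTorch sketch of this layer stack. The layer counts follow the description above, but the filter sizes, hidden dimensions, pooling choices, and phone-inventory size are illustrative assumptions, not the exact configuration of (Leung et al., 2019).

```python
import torch
import torch.nn as nn

class CnnRnnCtc(nn.Module):
    """Illustrative CNN-RNN-CTC stack; sizes are assumptions, not the paper's exact setup."""
    def __init__(self, n_mels=40, n_phones=42, rnn_hidden=256):
        super().__init__()
        self.norm = nn.BatchNorm2d(1)                        # input normalisation
        self.cnn = nn.Sequential(                            # four conv layers + two pools
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                            # pool over frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
            nn.BatchNorm2d(64),
        )
        self.rnn = nn.GRU(64 * (n_mels // 4), rnn_hidden,
                          batch_first=True, bidirectional=True)   # temporal modelling
        self.mlp = nn.Sequential(                            # time-distributed dense layers
            nn.Linear(2 * rnn_hidden, 256), nn.ReLU(),
            nn.Linear(256, n_phones + 1),                    # +1 for the CTC blank symbol
        )

    def forward(self, feats):                                # feats: (batch, time, n_mels)
        x = self.norm(feats.unsqueeze(1))                    # -> (batch, 1, time, n_mels)
        x = self.cnn(x)                                      # -> (batch, 64, time, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)                 # -> (batch, time, 64 * n_mels/4)
        x, _ = self.rnn(x)
        return self.mlp(x).log_softmax(-1)                   # per-frame phone log-probabilities

# Training would pair these log-probabilities with torch.nn.CTCLoss
# against the annotated phone sequences.
```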
Hybrid CTC-ATT-based architecture
(Lo et al., 2020; Yan et al., 2020; Zhang et al., 2020) used the hybrid CTC and attention architecture for MDD. While (Lo et al., 2020; Yan et al., 2020) applied data augmentation techniques, (Zhang et al., 2020) used a dynamically adjusted parameter with this architecture. Common to all these approaches is a multi-objective learning (MOL) framework, in which CTC plays an auxiliary role supporting the attention-based main branch. For further insights, please refer to Part 2 of this series.
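In this multi-objective setup, the shared encoder is trained with a weighted combination of the two losses. Following (Watanabe et al., 2017), the objective is commonly written as

$$
\mathcal{L}_{\text{MOL}} = \alpha\,\mathcal{L}_{\text{CTC}} + (1-\alpha)\,\mathcal{L}_{\text{Attention}}, \qquad 0 \le \alpha \le 1,
$$

where α controls the contribution of the CTC branch. (Zhang et al., 2020) adjust this parameter dynamically during training instead of fixing it, and decoding likewise combines the CTC and attention scores in a one-pass beam search.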
ESPnet is a framework that offers built-in support for implementing this architecture.
SED-MDD architecture
(Feng et al., 2020; Fu et al., 2021) introduce model architectures integrating acoustic and linguistic features (annotated text sequence) to enhance MDD outcomes. The core concepts underlying both architectures are similar, employing Sentence Encoders, Audio Encoders, and Decoder modules.
In (Feng et al., 2020), the sentence encoder uses a CNN-RNN module to encode the character sequence, while (Fu et al., 2021) employ a bidirectional LSTM. (Feng et al., 2020) use a GRU layer to encode the spectrogram, whereas (Fu et al., 2021) use CNN-RNN layers. Both studies use an attention-based decoder to generate the output phone sequence.
Another distinction is that (Feng et al., 2020) use character sequences as input to the sentence encoder, while (Fu et al., 2021) use phoneme sequences. To address data imbalance, (Fu et al., 2021) implement Simple Data Augmentation techniques. Both studies conducted experiments on TIMIT and L2-Arctic datasets.
Transformer based architecture (Wav2Vec)
(Wu et al., 2021; Xu et al., 2021; Peng et al., 2021) proposed this Transformer-based architecture for MDD. It is the most recent of the architectures discussed here and makes use of a pre-trained Wav2vec 2.0 model.
(Baevski et al., 2020) proposed the Wav2vec 2.0 model, which primarily comprises a CNN encoder, a Transformer contextualized network, and a quantization module. The model takes raw audio waveforms as input, with the CNN encoder learning latent speech representations. These latent representations are then discretized into quantized representations by the quantization module. The training objective is to use the output of the contextualized network to identify the correct quantized representation for each masked time step.
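As a rough illustration of how such a model is used downstream, the sketch below loads a pre-trained Wav2vec 2.0 checkpoint with the Hugging Face transformers library and decodes a recording with CTC. The checkpoint name and the input file are assumptions for illustration; the MDD papers cited above fine-tune their own models, typically with a phone-level vocabulary rather than characters, and then align the decoded phones against the canonical transcription to locate errors.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed checkpoint for illustration; MDD work fine-tunes wav2vec 2.0 on phone-labelled L2 speech.
MODEL_NAME = "facebook/wav2vec2-base-960h"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
model.eval()

waveform, sample_rate = torchaudio.load("learner_utterance.wav")    # hypothetical input file
if sample_rate != 16_000:                                           # wav2vec 2.0 expects 16 kHz audio
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits                      # (1, frames, vocab)

predicted_ids = logits.argmax(dim=-1)
hypothesis = processor.batch_decode(predicted_ids)[0]
print(hypothesis)   # compared against the canonical transcript to flag mispronunciations
```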
Evaluation Metrics
Mispronunciation detection determines whether the phones a speaker produces differ from the canonical pronunciations, while diagnosis identifies what was produced in place of the expected phones; together these form mispronunciation detection and diagnosis (MD&D).
Hierarchical Evaluation Structure
(Qian et al., 2010; Li et al., 2017; Leung et al., 2019) and several other researchers have used a hierarchical evaluation structure to assess the performance of MDD models, illustrated in the diagram below.
The expected outcomes for mispronunciation detection are True Acceptance and True Rejection. Unexpected outcomes include False Acceptance and False Rejection. The focus is primarily on True Rejection cases for mispronunciation diagnosis, particularly those involving Diagnostic Errors.
- True Acceptance (TA) is the number of phonemes annotated and recognized as correct pronunciation.
- True Rejection (TR) is the number of phonemes annotated and recognized as mispronunciation.
- False Rejection (FR) is the number of phonemes annotated as correct pronunciation but recognized as mispronunciation.
- False Acceptance (FA) is the number of phonemes annotated as mispronunciation but recognized as correct pronunciation.
- Correct Diagnosis (CD) is the number of phones correctly recognized as mispronunciations and correctly diagnosed as matching the annotated phonemes.
- Diagnosis Error (DE) is the number of phones correctly recognized as mispronunciations but incorrectly diagnosed as different from the annotated phonemes.
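A small sketch of how these counts can be accumulated is shown below, assuming the canonical, annotated (perceived), and recognized phone sequences have already been aligned one-to-one; the alignment itself usually requires an edit-distance step that is omitted here.

```python
from collections import Counter

def hierarchical_counts(canonical, annotated, recognized):
    """Accumulate TA/FR/FA/TR/CD/DE, assuming the three phone sequences are already aligned."""
    counts = Counter()
    for canon, annot, recog in zip(canonical, annotated, recognized):
        if annot == canon:                     # annotated as correctly pronounced
            counts["TA" if recog == canon else "FR"] += 1
        else:                                  # annotated as mispronounced
            if recog == canon:
                counts["FA"] += 1
            else:
                counts["TR"] += 1
                counts["CD" if recog == annot else "DE"] += 1
    return counts

# Toy example: /t/ was produced as /d/ and the system recognized /d/,
# giving one TA, one FR, and one TR that was correctly diagnosed (CD).
print(hierarchical_counts(["t", "ih", "p"], ["d", "ih", "p"], ["d", "ih", "b"]))
```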
FRR, FAR and DER
The False Rejection Rate (FRR), False Acceptance Rate (FAR) and Diagnosis Error Rate (DER) are widely used as the performance measures for Mispronunciation Detection tasks.
FRR — It is the percentage of instances where the system incorrectly identifies a correctly pronounced word as a mispronunciation. For mispronunciation detection, a high FRR means that the system is overly strict and rejects too many correct pronunciations as errors. This can lead to frustration for users who are actually pronouncing words correctly but are being flagged incorrectly by the system.
FAR — It is the percentage of instances where the system incorrectly accepts a mispronounced word as correctly pronounced. For mispronunciation detection, a high FAR indicates that the system is too lenient and fails to identify and correct actual mispronunciations. This reduces the effectiveness of the system in helping users improve their pronunciation.
DER — It is a metric that quantifies the overall accuracy of the system in diagnosing mispronunciations. DER provides a consolidated measure of the system’s performance in diagnosing mispronunciations, offering insights into its reliability and effectiveness in providing accurate feedback to users. A lower DER indicates that the system is more reliable in identifying and diagnosing mispronunciations, which is essential for providing accurate feedback to language learners or users aiming to improve their pronunciation.
These can be calculated as below:
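Using the counts defined in the hierarchical structure above, the standard formulations are:

$$
\text{FRR} = \frac{FR}{TA + FR}, \qquad
\text{FAR} = \frac{FA}{FA + TR}, \qquad
\text{DER} = \frac{DE}{CD + DE} = \frac{DE}{TR}
$$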
Precision, Recall and F-measure
Other fundamental metrics such as Precision, Recall and F-measure are also widely used as performance measures for mispronunciation detection.
Precision — It measures the proportion of correctly identified mispronunciations (TR) out of all instances that the system classified as mispronunciations (TR + FR). Precision quantifies how precise or accurate the system is when it identifies a phoneme as a mispronunciation. A high precision indicates that when the system flags a phoneme as a mispronunciation, it is likely to be correct. In the context of mispronunciation detection, high precision means fewer instances where correctly pronounced phonemes are incorrectly flagged as mispronunciations (lower FRR).
Recall (Sensitivity) — It measures the proportion of correctly identified mispronunciations (TR) out of all actual mispronunciations (TR + FA). Recall indicates how well the system captures all instances of mispronunciations. A high recall means the system effectively identifies most mispronounced phonemes among all mispronunciations present. In the context of mispronunciation detection, high recall means fewer instances of mispronounced phonemes going undetected (lower FAR).
F-measure (F1 Score) — It is the harmonic mean of Precision and Recall, providing a single metric to balance both metrics. It is particularly useful when you need to compare systems that have different Precision and Recall values. A higher F-measure indicates a better balance between precision and recall, which is desirable for mispronunciation detection systems to provide accurate and comprehensive feedback to users.
These can be calculated as below:
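In terms of the same counts, these take the standard form:

$$
\text{Precision} = \frac{TR}{TR + FR}, \qquad
\text{Recall} = \frac{TR}{TR + FA}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$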
Diagnosis and Detection Accuracy
Diagnosis Accuracy — It measures the proportion of correctly diagnosed mispronunciations (CD) out of all instances where the system attempted to diagnose a mispronunciation (CD + DE). It focuses specifically on the system’s ability to accurately diagnose and classify detected mispronunciations against a reference standard (annotated phonemes). It quantifies how well the system can correctly identify the type and location of mispronunciations, crucial for providing precise feedback to users aiming to improve their pronunciation.
Detection Accuracy — It measures the proportion of correctly detected instances (both correctly pronounced and mispronounced phonemes) out of all instances evaluated by the system. Detection Accuracy evaluates the overall performance of the system in correctly identifying both correctly pronounced phonemes (TA) and mispronounced phonemes (TR) within the entire dataset. It reflects how well the system distinguishes between correct and incorrect pronunciations overall, indicating its effectiveness as a tool for identifying pronunciation errors.
These can be calculated as below:
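Following the definitions above, these are computed as:

$$
\text{Diagnosis Accuracy} = \frac{CD}{CD + DE}, \qquad
\text{Detection Accuracy} = \frac{TA + TR}{TA + FR + FA + TR}
$$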
These metrics collectively provide a comprehensive evaluation of a mispronunciation detection and diagnosis system’s performance. They help developers and researchers understand how well the system distinguishes between correct and incorrect pronunciations, identifies mispronunciations accurately, and provides effective feedback to users. Improving these metrics involves refining algorithms, improving data quality, and optimizing the decision-making processes within the system to enhance its accuracy and usability in educational, language learning, or speech therapy contexts.
Speech Datasets
There are four critical dimensions of speech data that must be considered for ASR and MDD tasks, as they directly impact performance.
Dimensions of speech data
- First, vocabulary size is a key dimension. Certain ASR tasks can achieve extremely high accuracy, such as those with a limited vocabulary (e.g., recognizing “yes” versus “no”) or digit recognition tasks. However, open-ended tasks like transcribing videos or human conversations, which involve vocabularies of up to 60,000 words, present a much greater challenge.
- Second, the nature of the speech interaction is crucial. Speech can be categorized into two types: read speech and conversational speech. Read speech, where individuals read predefined sentences aloud (e.g., audiobooks), is relatively easier to recognize. This is because, in such scenarios, people tend to simplify their speech, speaking more slowly and clearly. Conversely, conversational speech, where two people are engaging in dialogue, is more complex to transcribe. Therefore, transcribing a business meeting is inherently more challenging than transcribing speech directed at digital assistants like Siri or Alexa.
- Third, the channel and noise level significantly affect recognition. Speech recorded in a quiet room with head-mounted microphones is easier to recognize than speech captured by a distant microphone in a noisy environment, such as a city street or a car with the window open. Real-world applications that detect mispronunciations must often contend with noisy speech data.
- Finally, accent and speaker-class characteristics are vital dimensions. Recognition is easier when the speaker uses the same dialect or variety that the system was trained on. Speech from speakers with regional or ethnic dialects, or children, can be particularly challenging to recognize if the system is trained exclusively on standard dialects or adult speakers.
Numerous publicly available and licensed datasets are utilized for ASR and MDD tasks. Below is a list of some of the most popular datasets for training ASR and MDD models.
TIMIT
The TIMIT dataset is one of the oldest speech datasets, created to provide data for the development and evaluation of ASR systems and for extracting acoustic features. TIMIT is the outcome of several sites working together under the sponsorship of the Defense Advanced Research Projects Agency — Information Science and Technology Office (DARPA-ISTO). The text corpus design was a joint effort among Texas Instruments (TI), the Massachusetts Institute of Technology (MIT) and the Stanford Research Institute (SRI). The speech was recorded at TI, transcribed at MIT, and then managed, verified, and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).
TIMIT comprises 6,300 sentences: 10 sentences spoken by each of 630 speakers representing 8 major dialect regions of the United States. The text material in the prompts consists of 2 dialect "shibboleth" sentences designed at SRI, 450 phonetically compact sentences developed at MIT, and 1,890 phonetically diverse sentences selected at TI. The dialect sentences (the SA sentences) were designed to reveal the speakers' dialectal variants and were read by all 630 speakers.
The phonetically compact sentences were crafted to ensure thorough coverage of phone pairs, with additional occurrences of phonetic contexts considered challenging or particularly significant.
Each speaker read five of these sentences (the SX sentences), and each text was spoken by seven different speakers. The phonetically diverse sentences (SI sentences) were chosen to enhance diversity in sentence types and phonetic contexts. The selection criteria maximized the variety of allophonic contexts within the texts. Each speaker read three of these sentences, with each sentence read by only one speaker. The following table summarizes the speech material in TIMIT.
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is not freely available for public use. It is distributed by the Linguistic Data Consortium (LDC), and access to the dataset requires a purchase or a subscription to the LDC. The cost and conditions for accessing the TIMIT dataset can vary, typically requiring affiliation with an academic institution, research organization, or commercial entity. For more information about purchasing or subscribing to the TIMIT dataset, you can visit the LDC’s catalogue entry for TIMIT here: https://catalog.ldc.upenn.edu/LDC93S1
Linguistic Data Consortium
The Linguistic Data Consortium (LDC) is an open consortium comprising universities, libraries, corporations, and government research labs, formed in 1992 to address a critical data shortage in language technology research. Initially serving as a repository and distribution centre for language resources, LDC has since expanded to create and distribute a wide range of language resources, support research programs, and conduct technology evaluations. Hosted by the University of Pennsylvania’s School of Arts and Sciences, LDC benefits from a strong foundation for research and outreach to a diverse member community.
LibriSpeech
LibriSpeech is a large open-source dataset consisting of over 1,000 hours of 16 kHz read-speech audiobooks from the LibriVox project, with transcripts aligned at the sentence level. It is divided into two portions: “clean,” which has higher recording quality and accents closer to US English, and “other,” which is more challenging. This division was made by running a speech recognizer on the audio, computing the Word Error Rate (WER) for each speaker, and categorizing recordings from lower-WER speakers as “clean” and higher-WER speakers as “other.”
LibriSpeech is a valuable resource for MDD research due to its high-quality speech and transcriptions. While it is primarily an ASR dataset, its applications in phoneme recognition, baseline model creation, transfer learning, and data augmentation make it a versatile tool for advancing MDD technology. (Wu et al., 2021; Xu et al., 2021; Peng et al., 2021) used LibriSpeech to pre-train the Wav2Vec 2.0 model to conduct MDD experiments.
OpenSLR
The LibriSpeech dataset can be downloaded from the OpenSLR website. OpenSLR is a website dedicated to hosting speech and language resources, including training corpora and software for speech recognition. It aims to provide a convenient platform for sharing and publicly downloading these resources.
Speechocean762
The speechocean762 corpus comprises 5,000 English sentences spoken by Mandarin-speaking non-natives, including both children and adults. Detailed age and gender information is provided. Five experts independently scored the pronunciations to ensure unbiased evaluation.
The text scripts, selected from daily life contexts, contain about 2,600 common English words. Speakers read the text accurately while holding their mobile phones 20 cm from their mouths in a quiet 3×3 meter room, using popular phone models like Apple, Samsung, Xiaomi, and Huawei. Each speaker read 20 sentences, resulting in a total audio duration of approximately 6 hours. The dataset is divided into training and test sets, each with 125 speakers, carefully selected based on gender, age, and English proficiency. Experts rated pronunciation proficiency into three levels: good, average, and poor.
This corpus, available for free download for both commercial and non-commercial purposes, supports pronunciation scoring research. It features annotations at the sentence, word, and phoneme levels and is accessible on the OpenSLR website, with a baseline system included in the Kaldi speech recognition toolkit.
ISLE
To develop advanced pronunciation training tools for second language learning, a comprehensive corpus of non-native speech data has been collected. This dataset includes nearly 18 hours of annotated speech from Italian and German learners of English, based on 250 utterances from typical language learning exercises. Annotations at both the word and phone levels highlight pronunciation errors, aiding in the development of detailed corrective feedback.
Key details of the corpus:
- 46 intermediate English learners (23 German and 23 Italian)
- Approximately 20 minutes of speech per speaker
- 11,484 utterances
- 1.92 GB of WAV files (4 CDs)
- 17 hours, 54 minutes, and 44 seconds of speech data
Annotations were performed by a team of linguists, with corrections first made at the word level, followed by automatic phone-level annotations, and then re-annotations to mark phone and stress errors.
The language material, chosen to avoid reported speech and foreign words, consists of 1,300 words (82 sentences) from a non-fictional, autobiographical text about the ascent of Mount Everest.
Developed collaboratively by researchers in Europe and the United States, the ISLE dataset is not freely available for public use. Access can be requested via the ELRA website.
L2-Arctic
The L2-Arctic dataset is a publicly available non-native English speech corpus designed to support research in mispronunciation detection, accent conversion, and voice conversion. The corpus contains 26,867 utterances from 24 non-native speakers whose L1s are Arabic, Chinese, Hindi, Korean, Spanish and Vietnamese. The overall length of the corpus is 27.1 hours, and the average duration of speech per L2 speaker is 67.7 minutes. The dataset includes over 238,702 word segments, an average of about 9 words per utterance, and over 851,830 phone segments (Zhao et al., 2018).
Human annotators manually evaluated 3,599 utterances, annotating 1,092 phoneme addition errors, 14,098 phoneme substitution errors and 3,420 phoneme deletion errors. The corpus contains the following information for each speaker:
- Speech recordings: more than one hour of recordings of phonetically balanced short sentences (approximately 1,132)
- Word level transcriptions: for each sentence, orthographic transcription and forced-aligned word boundaries are provided.
- Phoneme level transcriptions: for each sentence, a forced-aligned phonemic transcription produced with the Montreal Forced Aligner is provided.
- Manual annotations: a subset of 150 utterances is annotated with corrected word and phone boundaries, comprising 100 common utterances recorded by all speakers and 50 uncommon utterances containing phonemes that are difficult for each speaker to pronounce given their L1. These 150 utterances are tagged for phoneme substitution, deletion, and addition errors.
Every speaker’s data is organized in its subdirectory under the root folder. Each speaker’s directory is structured as follows:
- /wav: Containing audio files in WAV format, sampled at 44.1 kHz
- /transcript: Containing orthographic transcriptions, saved in TXT format
- /textgrid: Containing phoneme transcriptions generated from forced-alignment, saved in TextGrid format
- /annotation: Containing manual annotations, saved in TextGrid format
Annotations
The dataset uses the ARPAbet phoneme set for the phonetic transcriptions as well as for the error tags, to make computer processing easier.
Each manually annotated TextGrid file will always have a “words” and “phones” tier while some of them may have an additional tier that contains comments from the annotators.
The conditions for assigning a label to a phone segment are as follows (a small parsing sketch is shown after the list):
- For a correctly pronounced phoneme, the forced-alignment label remains unchanged.
- In the case of a phone substitution error, the forced-alignment label is replaced with a label following the template CPL,PPL,s, where CPL is the correct phoneme label (what should have been produced), PPL is the perceived phoneme label (what was actually produced), and s denotes a substitution error.
- When an additional phone is present where there should be none, it is a phone addition error, represented as sil,PPL,a, where sil stands for silence and a denotes an addition error.
- When a silent segment is found where there should be a phone, it is a phone deletion error, represented as CPL,sil,d, where d denotes a deletion error.
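Given this scheme, a manually annotated phone label can be unpacked with a few lines of Python. The helper below is a hypothetical sketch, assuming labels follow the CPL,PPL,tag convention described above; real annotations may carry stress digits or annotator comments that would need extra handling.

```python
def parse_l2arctic_label(label):
    """Parse an L2-Arctic-style phone annotation into (canonical, produced, error_type).

    Hypothetical helper assuming the "CPL,PPL,tag" convention: e.g. "Z,S,s" is a
    substitution, "sil,AH,a" an addition, "D,sil,d" a deletion, and a bare label
    such as "IY" a correctly pronounced phone.
    """
    parts = label.strip().split(",")
    if len(parts) == 1:                              # unchanged forced-alignment label
        return parts[0], parts[0], "correct"
    canonical, produced, tag = parts
    error_type = {"s": "substitution", "a": "addition", "d": "deletion"}[tag.lower()]
    return canonical, produced, error_type

print(parse_l2arctic_label("Z,S,s"))    # ('Z', 'S', 'substitution')
print(parse_l2arctic_label("AH"))       # ('AH', 'AH', 'correct')
```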
Exploratory data analysis
An Exploratory Data Analysis (EDA) of phoneme and pronunciation errors in the L2-Arctic dataset (Zhao et al., 2018) is shown in the following subsections. The results of this EDA help inform the choice of data augmentation techniques when training a model.
Phoneme set distribution
The figure below shows the distribution of phonemes in the dataset. The phonemes 'AH', 'N', 'T', 'IH' and 'D' are the five most frequent in the set.
Phoneme error distribution in the L2-Arctic dataset
The figure below shows the 20 most frequent phoneme substitution tags in the corpus. The most dominant substitution errors are 'Z->S', 'DH->D', 'IH->IY' and 'OW->AO'.
The next figure shows the phone deletion errors in the annotations. The most frequent phoneme deletions are 'D', 'T', and 'R'.
The final figure shows the phone addition errors in the annotations. The most frequent phoneme additions are 'AH', 'EH', 'R', 'AX', 'G' and 'IH'.
The task of MDD requires annotated data on mispronounced phones, which is notably scarce. Consequently, researchers often utilize multiple datasets to train MDD models. Initially, some datasets are used to bootstrap the training of E2E-based MDD models. These models are subsequently fine-tuned on datasets specifically designed for MDD tasks. Using multiple datasets requires a certain amount of data preprocessing. For instance, the TIMIT dataset is labelled with a 61-phone set, while L2-ARCTIC uses a 48-phone set. To unify the phone sets, the 61-phone set of TIMIT must be mapped to the 48-phone set of the L2-ARCTIC dataset.
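A sketch of that kind of preprocessing step is shown below. The mapping dictionary contains only a few illustrative entries from the commonly used TIMIT phone-folding tables; the complete 61-to-48 mapping should be taken from the literature or the toolkit recipe being followed.

```python
# Illustrative subset of a TIMIT phone-folding table; the full 61-to-48 mapping
# should be taken from the recipe or paper being reproduced.
TIMIT_FOLD = {
    "ux": "uw",     # fronted /uw/ folded into /uw/
    "ax-h": "ax",   # devoiced schwa folded into /ax/
    "hv": "hh",     # voiced /hh/ variant
    "eng": "ng",    # syllabic /ng/
}

def fold_phones(phones, mapping=TIMIT_FOLD):
    """Map a phone sequence onto a smaller phone set, leaving unmapped phones unchanged."""
    return [mapping.get(p, p) for p in phones]

print(fold_phones(["sh", "ux", "hv", "ih"]))   # ['sh', 'uw', 'hh', 'ih']
```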
Data Augmentation
MDD systems often adapt to new trends in ASR tasks, but they face unique challenges due to the need for additional data, specifically mispronounced words. Acquiring such data is further complicated by the requirement for accent-based audio samples across a diverse range of speaker ages.
Teaching correct pronunciation to English learners at a young age is more effective than teaching them at an older age. Therefore, obtaining child audio data is essential. However, acquiring audio data from children is significantly more challenging than acquiring it from adults.
An effective MDD system must identify both correctly and incorrectly pronounced phones, so addressing data imbalance is also crucial. Researchers therefore use data augmentation techniques to balance the counts of correctly and incorrectly pronounced samples.
To make better use of the available data for MDD tasks, (Feng et al., 2020) feed transcript sentences to a sentence encoder and speech data to an audio encoder, so as to capture both linguistic and acoustic features. (Fu et al., 2021) instead feed the corresponding phoneme sequence, rather than the character sequence, to the sentence encoder. Their study also proposes three easy data augmentation techniques to handle the imbalance between positive and negative samples in the L2-Arctic dataset; a sketch of the first technique is shown after the list:
1. Phoneme set-based (PS): Randomly select phonemes from the phoneme sequence of the prior text and replace each with a randomly chosen phoneme from the phoneme set, e.g., /eh/ → /hh/. Note that a “blank” symbol may be used instead of a phoneme, representing an INSERT type error. Conversely, a “blank” being replaced by a phoneme in the reading text signifies a DELETE type error.
2. Vowel and consonant set-based (VC): Based on L2-Arctic statistics, vowels are more likely to be mispronounced as other vowels and consonants as other consonants, with frequent substitutions such as /z/ → /s/. Therefore, phonemes are randomly selected from the phoneme sequences of the prior text, and each selected vowel is replaced with a randomly chosen vowel and each selected consonant with a randomly chosen consonant.
3. Confusing-pairs-based (CP): First, confused phoneme pairs in learner pronunciations are identified from the L2-Arctic portion of the training set. Then, phonemes are randomly selected from the reading text sequences; if a selected phoneme belongs to a confused pair, it is replaced at random with its corresponding confusing phoneme.
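The following is a minimal sketch of the first (phoneme set-based) strategy. The phoneme inventory, replacement probability, and blank symbol are illustrative assumptions rather than the exact procedure of (Fu et al., 2021).

```python
import random

# Illustrative ARPAbet subset; a real setup would use the full phoneme inventory.
PHONEME_SET = ["aa", "ae", "ah", "eh", "ih", "iy", "uw",
               "b", "d", "g", "k", "p", "s", "t", "z", "hh"]
BLANK = "<blank>"   # stands in for a missing phoneme, as described above

def augment_phoneme_set_based(phonemes, replace_prob=0.15, seed=None):
    """Phoneme set-based (PS) augmentation: randomly corrupt the canonical phoneme
    sequence so the model sees artificial mispronunciations during training."""
    rng = random.Random(seed)
    corrupted = []
    for p in phonemes:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(PHONEME_SET + [BLANK]))   # random phoneme or blank
        else:
            corrupted.append(p)
    return corrupted

canonical = ["hh", "eh", "l", "ow"]
print(augment_phoneme_set_based(canonical, seed=0))   # a randomly corrupted copy of the sequence
```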
Tools
Following is an overview of essential tools and frameworks used for ASR and MDD tasks. Kaldi and ESPnet provide foundational tools and frameworks for building robust speech recognition systems, which can be adapted for detecting mispronunciations by training models to recognize and classify pronunciation errors.
Praat and TextGridTools are used for detailed acoustic analysis and annotation of speech, enabling researchers and educators to identify specific phonetic errors and analyze pronunciation accuracy. Together, these tools contribute to the development of effective mispronunciation detection systems by providing capabilities for speech analysis and annotation management, essential for both research and practical applications in language learning and speech therapy.
Kaldi
Kaldi (Povey et al., 2011) is an open-source toolkit written in C++ for speech recognition-related tasks. It is primarily created for ASR researchers to provide extensible and flexible software for building speech recognition systems. It can generate fbank, mfcc and fMLLR features. One of the popular use cases is to generate acoustic features from raw waveforms for end-to-end speech recognition systems.
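Kaldi itself is driven by shell-based recipes rather than a Python API, but to illustrate the kind of features it produces, the sketch below uses torchaudio's Kaldi-compatible feature functions (torchaudio.compliance.kaldi). The input file name and parameter choices are assumptions; a 16 kHz mono recording is assumed, as in most ASR/MDD corpora.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Hypothetical input file, assumed to be 16 kHz mono.
waveform, sample_rate = torchaudio.load("utterance.wav")

# 80-dimensional log Mel filterbank features, the Kaldi-style front end
# commonly used by end-to-end recipes.
fbank = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)

# 13-dimensional MFCCs, the classic front end for GMM/HMM systems and forced alignment.
mfcc = kaldi.mfcc(waveform, num_ceps=13, sample_frequency=sample_rate)

print(fbank.shape, mfcc.shape)   # (num_frames, 80), (num_frames, 13)
```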
Kaldi provides recipes for training custom acoustic models on commonly used speech corpora such as TIMIT, LibriSpeech, the WSJ corpus and more. These recipes can serve as templates for training acoustic models on one's own speech data. To demonstrate the use of the speechocean762 corpus for pronunciation assessment, a baseline system (Hu et al., 2015) is published in the Kaldi speech recognition toolkit.
Kaldi has an active community of researchers and developers who contribute to its development and provide support. It also has extensive documentation and tutorials to help new users get started.
ESPnet
ESPnet is a powerful end-to-end speech processing toolkit encompassing speech recognition, text-to-speech, speech translation, speech enhancement, and spoken language understanding. Utilizing PyTorch and following Kaldi-style data processing and feature extraction, ESPnet provides a comprehensive setup for diverse speech-processing tasks.
Unlike Kaldi’s hybrid DNN/HMM architecture, ESPnet uses a unified neural network for end-to-end speech recognition, leveraging connectionist temporal classification (CTC) and attention-based encoder-decoder networks. The hybrid CTC/attention approach further enhances robustness and accelerates convergence.
Additionally, ESPnet supports advanced ASR techniques, including RNNLM fusion and fast CTC computation, and offers state-of-the-art recipes for major benchmarks such as WSJ, Librispeech, TED-LIUM, CSJ, AMI, HKUST Mandarin CTS, VoxForge, and CHiME-4/5. By providing these cutting-edge ASR setups, ESPnet drives the advancement of the end-to-end ASR field.
The above figure illustrates the streamlined flow of standard recipes in ESPnet, highlighting the simplicity brought by end-to-end ASR. Unlike traditional methods, it eliminates the need for lexicon preparation, finite state transducer (FST) compilation, HMM and Gaussian mixture model training/alignment, and lattice generation.
The standard recipe includes six stages:
- Data Preparation: Utilizes Kaldi data directory format and scripts (e.g., data_prep.sh).
- Feature Extraction: Uses Kaldi’s 80-dimensional log Mel features with pitch (83 dimensions total).
- Data Preparation for ESPnet: Converts Kaldi data directory information into a JSON file (data.json), excluding input features.
- Language Model Training: Trains a character-based RNNLM using Chainer or PyTorch (optional stage).
- End-to-End ASR Training: Trains a hybrid CTC/attention-based encoder-decoder using Chainer or PyTorch.
- Recognition: Performs speech recognition using the RNNLM and the end-to-end ASR model produced by the language model and ASR training stages.
Praat
Praat is a comprehensive, indispensable tool for speech and phonetic analysis, freely available to speech scientists. It has been developed since 1992 by Paul Boersma and David Weenink at the Institute of Phonetic Sciences, University of Amsterdam. Praat stands as a premier resource in the field of Phonetics.
Praat offers powerful features for speech and phonetic analysis:
- Speech Analysis: Analyze pitch, formant, intensity, and voice quality with access to spectrograms and cochleagrams.
- Speech Labeling: Widely used by linguists, Praat supports multi-level transcriptions and custom labelling using the International Phonetic Alphabet (IPA), with multi-language text-to-speech facilities for segmenting sound.
- Speech Synthesis: Generate simple sounds, and create speech from pitch curves and filters (acoustic synthesis) or muscle activities (articulatory synthesis).
- Speech Manipulation: Modify pitch, intensity, and duration of speech, with tools to alter pitch contours, intonation, and stress patterns for prosody research.
At the core of speech analysis is the transcript, which Praat stores in TextGrids. A Praat TextGrid file comprises a collection of independent annotation analyses for a given audio recording, organized into tiers. There are two types of tiers: IntervalTiers, used for annotating events with duration such as syllables, words, or utterances, and PointTiers, used for marking instantaneous events such as clicks, pitch-contour peaks, or claps.
TextGridTools
Praat is the de facto standard for phonetic analysis and speech transcription, and the widely used L2-Arctic dataset also provides its phonetic transcriptions in Praat's TextGrid format. Praat's strength lies in its programmability through the embedded scripting language "Praat script", which offers full access to Praat's functions and data structures and automates tedious tasks efficiently. However, despite its virtues and ease of use, Praat script lacks essential features of modern programming languages, such as return statements, iterators, and basic data structures like lists or hash tables. Furthermore, as a specialized language, it offers little functionality beyond Praat itself, for example for plotting or statistical analysis.
To address these shortcomings, TextGridTools was developed: a Python package designed to parse, manipulate, and query Praat annotations. TextGridTools covers all TextGrid-related objects, such as interval and point tiers, implemented as native Python classes with a clean API for attribute access. Leveraging Python's expressive syntax, TextGridTools enables more compact and human-readable code than Praat script. Users can perform analyses in one step, eliminating the need for Praat script as an exporting tool, and annotations can be accessed and processed directly with Python's data analysis libraries such as NumPy, SciPy, Matplotlib, RPy, and pandas.
TextGridTools mirrors the structure of Praat's TextGrid objects. The Python classes IntervalTier and PointTier replicate Praat's tier types, holding ordered lists of Interval or Point objects respectively. Multiple tiers are grouped within the TextGrid class. Shared structures and functionalities between IntervalTier and PointTier, as well as between Interval and Point, are inherited from the parent classes Tier and Annotation.
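As a short illustration, the sketch below reads an L2-Arctic-style annotation TextGrid with the tgt package and lists the phones tagged as errors. The file name and the tier name "phones" are assumptions that may differ between corpora.

```python
import tgt   # TextGridTools package (pip install tgt)

# Hypothetical annotation file; L2-Arctic stores manual annotations as TextGrids.
textgrid = tgt.io.read_textgrid("arctic_a0001.TextGrid")

# Tier name assumed to be "phones"; adjust to the tier names used in your corpus.
phone_tier = textgrid.get_tier_by_name("phones")

for interval in phone_tier.intervals:
    label = interval.text
    if "," in label:   # error labels follow the "CPL,PPL,tag" convention described earlier
        print(f"{interval.start_time:.2f}-{interval.end_time:.2f}s  {label}")
```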
TextGridTools is released under the GNU General Public License v3.0 and hosted on GitHub, empowering users to contribute their modifications and enhance the current functionality.
Summary
The article began by examining the evolution and current state of E2E ASR and MDD systems. E2E ASR models are favoured for their streamlined design and optimization advantages over traditional systems. The adaptation of ASR technologies for MDD tasks showcases significant advancements, including CNN-RNN-CTC, hybrid CTC-ATT architectures, SED-MDD models that integrate acoustic and linguistic features, and Transformer-based Wav2Vec 2.0 setups.
The article also details the hierarchical evaluation structure and crucial evaluation metrics for MDD systems, such as Precision, Recall, F1 Score, False Rejection Rate (FRR), False Acceptance Rate (FAR), and Diagnosis Error Rate (DER). Key dimensions of speech, including notable datasets like TIMIT, LibriSpeech, Speechocean762, ISLE, and L2-Arctic, are thoroughly discussed.
Additionally, essential tools for MDD system development are highlighted. These include Kaldi for acoustic feature generation and training, ESPnet for end-to-end speech processing with CTC and attention mechanisms, Praat for detailed phonetic analysis and annotation using TextGrids, and TextGridTools for efficient TextGrid manipulation via Python.
This three-part article series aims to equip researchers and AI enthusiasts with the foundational knowledge necessary to advance MDD systems.
References
Graves, A., & Jaitly, N. (2014b, June 18). Towards End-To-End Speech Recognition with Recurrent Neural Networks. PMLR. https://proceedings.mlr.press/v32/graves14.html
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Bengio, C. L. Y., & Courville, A. (2017, January 10). Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. arXiv.org. https://doi.org/10.48550/arXiv.1701.02720
Chorowski, J., Bahdanau, D., Cho, K., & Bengio, Y. (2014, December 4). End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. arXiv.org. https://arxiv.org/abs/1412.1602
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015, June 24). Attention-Based Models for Speech Recognition. arXiv.org. https://arxiv.org/abs/1506.07503
Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Hayashi, T. (2017, December). Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240–1253. https://doi.org/10.1109/jstsp.2017.2763455
Leung, W. K., Liu, X., & Meng, H. (2019, May). CNN-RNN-CTC Based End-to-end Mispronunciation Detection and Diagnosis. ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp.2019.8682654
Zhang, L., Zhao, Z., Ma, C., Shan, L., Sun, H., Jiang, L., Deng, S., & Gao, C. (2020, March 25). End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture. Sensors, 20(7), 1809. https://doi.org/10.3390/s20071809
Lo, T., Weng, S., Chang, H., & Chen, B. (2020). An effective End-to-End Modeling approach for mispronunciation detection. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2005.08440
Yan, B. C., Wu, M. C., Hung, H. T., & Chen, B. (2020, May 25). An End-to-End Mispronunciation Detection System for L2 English Speech Leveraging Novel Anti-Phone Modeling. arXiv.org. https://doi.org/10.48550/arXiv.2005.11950
Feng, Y., Fu, G., Chen, Q., & Chen, K. (2020, May). SED-MDD: Towards Sentence Dependent End-To-End Mispronunciation Detection and Diagnosis. ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp40776.2020.9052975
Fu, K., Lin, J., Ke, D., Xie, Y., Zhang, J., & Lin, B. (2021, April 17). A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques. arXiv.org. https://doi.org/10.48550/arXiv.2104.08428
Wu, M., Li, K., Leung, W. K., & Meng, H. (2021, August 30). Transformer Based End-to-End Mispronunciation Detection and Diagnosis. Interspeech 2021. https://doi.org/10.21437/interspeech.2021-1467
Xu, X., Kang, Y., Cao, S., Lin, B., & Ma, L. (2021). Explore wav2vec 2.0 for Mispronunciation Detection. https://doi.org/10.21437/interspeech.2021-777
Peng, L., Fu, K., Lin, B., Ke, D., & Zhan, J. (2021). A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis. https://doi.org/10.21437/interspeech.2021-1344
Yan, B. C., Jiang, S. W. F., Chao, F. A., & Chen, B. (2021, August 31). Maximum F1-score training for end-to-end mispronunciation detection and diagnosis of L2 English speech. arXiv.org. https://doi.org/10.48550/arXiv.2108.13816
Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Neural Information Processing Systems, 33, 12449–12460. https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf
Qian, X., Soong, F. K., & Meng, H. (2010). Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT). https://doi.org/10.21437/interspeech.2010-278
Li, K., Qian, X., & Meng, H. (2017). Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1), 193–207. https://doi.org/10.1109/taslp.2016.2621675
Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., & Zue, V. (1993d, January 1). TIMIT Acoustic-Phonetic Continuous Speech Corpus. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/SWVENO
Menzel, W., Atwell, E., Bonaventura, P., Herron, D., Howarth, P., Morton, R., & Souter, C. (2000). The ISLE corpus of non-native spoken English. http://www.comp.leeds.ac.uk/eric/menzel00lrec.pdf
Zhang, J., Zhang, Z., Wang, Y., Yan, Z., Song, Q., Huang, Y., Li, K., Povey, D., & Wang, Y. (2021, April 3). speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment. arXiv.org. https://arxiv.org/abs/2104.01378
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. https://doi.org/10.1109/icassp.2015.7178964
Zhao, G., Sonsaat, S., Silpachai, A., Lucic, I., Chukharev-Hudilainen, E., Levis, J., & Gutierrez-Osuna, R. (2018). L2-ARCTIC: A Non-native English Speech Corpus. https://doi.org/10.21437/interspeech.2018-1110
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Vesely, K. (2011). The Kaldi Speech Recognition Toolkit. https://publications.idiap.ch/downloads/papers/2012/Povey_ASRU2011_2011.pdf
Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., Heymann, J., Wiesner, M., Chen, N., Renduchintala, A., & Ochiai, T. (2018). ESPnet: End-to-End Speech Processing Toolkit. https://doi.org/10.21437/interspeech.2018-1456
Boersma, P., & Weenink, D. (2024). Praat: doing phonetics by computer [Computer program]. Version 6.4.13, retrieved 10 June 2024 from http://www.praat.org/
Buschmeier, H., & Wlodarczak, M. (2013b). TextGridTools: A TextGrid Processing and Analysis Toolkit for Python. https://doi.org/10.6084/m9.figshare.658837.v1
Online reference links
- This is an excellent resource by Stanford University for understanding Speech and Language Processing concepts.
- This online article is an exhaustive resource that reviews different deep-learning approaches for Speech recognition.