Non-Native Lexicon Expansion in Speech Recognition: Modeling accent at the word level

Aaryan Singh
11 min read · Oct 20, 2022


The lexicon is a crucial component of any automatic speech recognition system. Broadly speaking, a lexicon is a collection of a language's vocabulary: all of its words together with their pronunciations. Those pronunciations are written as phoneme sequences, i.e., sequences of the smallest units of speech. This blog aims to give an intuitive, basic understanding of the lexicon, of why non-native speakers' speech needs to be modeled separately, and finally of how an expanded non-native lexicon achieves that.

Role of Lexicon in ASR

First, the lexicon specifies which words, or lexical items, are known to the system. It then provides the means to build acoustic models for each entry: the lexicon essentially models the phoneme sequence of every word. The design of a lexicon depends on the selection of vocabulary items, and further on how we choose to represent those entries, e.g., in IPA or ARPABET, the two most popular ways of writing phoneme sequences.
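
To make this concrete, here is a minimal sketch of what a lexicon can look like in code; the words, ARPAbet pronunciations, and the lookup helper are purely illustrative and not taken from any particular toolkit.

```python
# A toy lexicon: each word maps to one or more ARPAbet phoneme sequences.
# The entries below are illustrative; a real lexicon (e.g., CMUdict) has ~130k words.
LEXICON = {
    "speech": [["S", "P", "IY1", "CH"]],
    "data":   [["D", "EY1", "T", "AH0"], ["D", "AE1", "T", "AH0"]],  # two accepted variants
    "read":   [["R", "IY1", "D"], ["R", "EH1", "D"]],  # homograph: present vs. past tense
}

def pronunciations(word):
    """Return every known phoneme sequence for a word (empty list if out of vocabulary)."""
    return LEXICON.get(word.lower(), [])

if __name__ == "__main__":
    for w in ["speech", "read", "lexicon"]:
        print(w, "->", pronunciations(w))
```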

The lexicon is a crucial component of the speech recognition pipeline since it provides a means of differentiating between word pronunciations and spellings. Many identically spelt words are pronounced differently depending on the context, and the spelling alone does not reveal the pronunciation. In this scenario, the state transition probabilities that the lexicon encodes between word/phoneme states are used to identify the appropriate pronunciation for a particular word in context.

How the Lexicon is used

A crucial component of the acoustic model is the lexicon. The full acoustic model is defined by taking the lexicon's pronunciation HMM and combining it with the feature vectors computed from the raw audio file. In total, the acoustic model is used to calculate P(O|W), the probability that an audio observation O corresponds to a particular word sequence W.

Lexicon in ASR pipeline

A language model, which calculates the prior probability P(W) of a word sequence in the source language, is then added to the mix. When these two models are combined, we can run a decoding process, using the Viterbi algorithm, to calculate the posterior probability P(W|O): the probability of any word sequence given our observation of the raw audio file. Speech recognition is then accomplished by selecting the word sequence that maximizes this posterior.

We started with the task of determining the probability of a word sequence given our observation of the audio file, which is P(W|O). By Bayes' rule,

P(W|O) = P(O|W) · P(W) / P(O)

P(O) does not change across candidate word sequences for a given utterance, since the feature vectors, and hence the observations O, stay the same. For the purpose of picking the best word sequence it can therefore be dropped, and the equation reduces to

P(W|O) ∝ P(O|W) · P(W)

For instance, take an audio clip. We segment that clip with a sliding window, producing a sequence of audio frames. For each frame we extract a 39-dimensional MFCC feature vector, so the whole clip becomes a sequence of feature vectors O = (O1, ..., ON), where N is the number of frames and each Oi has 39 dimensions. The likelihood P(O|W) can then be approximated by the lexicon and the acoustic model, the language model provides P(W), and BAM!, we get P(W|O) and pick the word sequence with the maximum probability.
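
As a rough sketch of this front end, the snippet below uses librosa to compute 13 MFCCs per frame and stacks them with their delta and delta-delta coefficients to get the familiar 39-dimensional vectors. The file name and frame settings are placeholders.

```python
import numpy as np
import librosa

# Load the audio (path is a placeholder) and resample to 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 static MFCCs per frame, using a 25 ms window and a 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

# First- and second-order differences (delta and delta-delta).
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

# Stack to 39 dimensions per frame: O = (O1, ..., ON), one column per frame.
O = np.concatenate([mfcc, d1, d2], axis=0)
print(O.shape)  # (39, N), where N is the number of frames
```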

Non-Nativity and its influences

The pronunciation of English words by a non-native speaker is strongly influenced by their native language and most often differs from that of native English speakers. Phonemic languages derive pronunciation directly from the spelling of a word. English, on the contrary, is alphabetic but highly non-phonemic. Hence, speakers of phonemic languages, whose pronunciation is guided by spelling, often pronounce English words differently from native English speakers. This mispronunciation is further amplified for speakers whose native language has a phoneme inventory that differs from that of English: such speakers generally replace an English phoneme with the closest phoneme in their native language.

More specifically, at the level of individual words, these mispronunciations/accents typically show up as deletions, insertions, and substitutions of phonemes relative to the canonical transcription of the word when it is spoken by a non-native speaker.
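
A tiny illustration of these edit operations, with made-up canonical and realized ARPAbet sequences (the actual substitutions and insertions a given speaker makes will of course differ):

```python
from difflib import SequenceMatcher

# Illustrative example: a vowel inserted before a consonant cluster.
canonical = ["DH", "AH0", "S", "K", "UW1", "L"]         # "the school"
realized  = ["DH", "AH0", "IH0", "S", "K", "UW1", "L"]  # "the ischool"

for op, i1, i2, j1, j2 in SequenceMatcher(a=canonical, b=realized).get_opcodes():
    if op != "equal":
        print(op, canonical[i1:i2], "->", realized[j1:j2])
# prints: insert [] -> ['IH0']
```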

Building an ASR system for such non-native, bilingual English speakers requires a lexicon that handles the influence of the first language on the native English lexicon. Pronunciation variation is one of the factors that leads to less-than-optimal performance in automatic speech recognition (ASR) systems. Empirically, an ASR model performs poorly when there is a mismatch between training and test conditions: training is done with a native English lexicon, while at test time the system runs in the real world on speech from non-native, bilingual speakers. In such a scenario, the ASR model performs poorly.

The problem of modeling pronunciation variation lies in accurately predicting the word pronunciations that occur in the test material. To achieve this, the pronunciation variants must first be obtained in some way or other. To this end, we need to expand the lexicon to account for non-native influences so that the ASR model performs better.

Ways to Expand the Lexicon

There are basically three ways of expanding the lexicon:

1. Using expert knowledge of the language (knowledge-based methods)

2. Data-driven methods

3. A hybrid approach encompassing both of the above

Knowledge-Based Methods

In a knowledge-based approach, information about pronunciations is derived from knowledge sources such as hand-crafted dictionaries or the linguistic literature. We identify phonological processes in the respective languages, which allow us to formulate rules with which pronunciation variants are generated. These phonological rules are derived from linguistic and phonological knowledge about known pronunciation variations in speech. The rules are context-dependent and are applied to the words in the baseline lexicon; the resulting variants are then unconditionally added to the lexicon.

Indian Speaker Specific variations

For instance, we take the CMU Pronouncing Dictionary as our initial native English lexicon. To incorporate non-native, speaker-specific variation, we need to know exactly where a rule applies, and that location can be found from the mapping between the letters and the phonemes in a word's pronunciation. In general, however, there is no one-to-one mapping between letters and phonemes, so we need a grapheme-to-phoneme (G2P) aligner for this purpose. Once the phonemes and graphemes are aligned, the rules are applied to each grapheme-phoneme pair by checking the context criteria in the rule table. The resulting pronunciation variants are added to the native lexicon, and we obtain the expanded non-native lexicon.
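
Below is a hedged sketch of how a rule from such a table could be applied once graphemes and phonemes are aligned. The aligned pairs, the rule, and the phone symbols are all made up for illustration; a real system would get the alignment from a G2P aligner and the rules from a curated table.

```python
# Output of a (hypothetical) G2P aligner for the word "water":
# each grapheme chunk is paired with the phoneme it produces.
aligned = [("w", "W"), ("a", "AO1"), ("t", "T"), ("er", "ER0")]

# Illustrative context-dependent rule: an intervocalic "t" realized as the
# canonical phone T is replaced by an alternative phone (here written "TT"
# purely as a stand-in symbol).
def rule_applies(i, pairs):
    vowels = {"AO1", "ER0", "AA1", "IY1", "AH0"}  # toy vowel set
    g, p = pairs[i]
    return (g == "t" and p == "T"
            and 0 < i < len(pairs) - 1
            and pairs[i - 1][1] in vowels
            and pairs[i + 1][1] in vowels)

def expand(word, pairs):
    """Yield the canonical pronunciation plus any rule-generated variants."""
    canonical = [p for _, p in pairs]
    yield word, canonical
    for i in range(len(pairs)):
        if rule_applies(i, pairs):
            variant = canonical.copy()
            variant[i] = "TT"
            yield word, variant  # this variant is added to the expanded lexicon

for w, pron in expand("water", aligned):
    print(w, " ".join(pron))
```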

The primary advantage of the knowledge-based approach is that it can be applied to all corpora, and especially to new words that were never introduced to the ASR system. However, there are some notable problems. The information in the linguistic literature is not exhaustive, so many rules, and consequently many pronunciation variants, are never captured, and many processes that occur in real speech are yet to be described. The method does not work well for low-resource language pairs, and expert knowledge is required to formulate the rules. Further, choices must be made as to which variants to include in the lexicon, and/or to incorporate at other stages of the recognition process, because expanding the lexicon through purely knowledge-based methods has been found to produce overly generic rules. The pronunciation variants then become too numerous, and the Word Error Rate (WER) actually increases with the number of variants, making the model perform worse than even the baseline: there is too much confusability, i.e., multiple words end up with very similar pronunciations, and the model gets confused over which pronunciation variant to choose.

Data-Driven Methods

Data-driven methods can be further classified into direct and indirect approaches. A direct data-driven approach derives pronunciation variants from a pronunciation training database and then adds those variants directly to the lexicon. When an ASR system employs a pronunciation dictionary adapted with a direct data-driven approach, unseen words might appear during testing, and such a mismatch in the pronunciation model between training and testing can degrade the performance of the system.

An indirect data-driven method, on the other hand, investigates pronunciation variability in the speech training data, derives variant rules, and applies those rules to the ASR pronunciation dictionary to compensate for the variability. The basic difference is that indirect methods generalize rules from the observed data and then plug those rules into the lexicon, which makes them more generalizable than direct methods, where the pronunciation variants themselves are added to the lexicon.

Typically, most data-driven approaches can be brought under one umbrella in terms of how they operate:

1. Generate alternate transcriptions using some algorithm.

2. Align the reference (canonical) and alternate transcriptions.

3. Derive initial rules from the alignment.

4. Prune those rules.

5. Finally, expand the lexicon using those rules.

The following framework explains this workflow in more detail, showing how the lexicon gets expanded in a data-driven setup:

Obtaining rules in a Data-Driven setup

Step 1. Each utterance in a non-native development set is recognized using a phoneme recognizer.

Step 2. The recognized phoneme sequence is aligned, using a dynamic programming algorithm, against the reference phoneme sequence produced by the native pronunciation dictionary, referred to as the reference transcription.

Step 3. Using the alignment results of Step 2, variant phoneme patterns are obtained.

Step 4. Pronunciation variation rules are then derived from the variant phoneme patterns using a decision tree.

Step 5. Finally, pronunciation variations are generated from the pronunciation variation rules, allowing the pronunciation dictionary to be adapted for non-native ASR.

To derive the pronunciation rules, we first perform phoneme recognition on each utterance in the non-native development set, which gives us an N-best list of phoneme sequences per utterance. However, the list contains no word boundaries, and these are required to separate within-word pronunciation variations from cross-word ones. To obtain the boundaries, the recognized phoneme sequence is aligned, on the basis of a dynamic programming algorithm, against the reference transcription, which does contain word boundaries. From the alignment between the recognized phoneme sequence and the reference transcription, a rule pattern is obtained if the following condition is satisfied:

X → Y / L2 L1 _ R1 R2   (phoneme pattern)

where X is the phoneme that is mapped to Y, and L1, L2 and R1, R2 are, respectively, the left and right context phonemes of X in the reference transcription.
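
As a rough sketch of Steps 2 and 3, the snippet below computes a standard edit-distance alignment between a reference phoneme sequence and a recognized one, then emits (L2, L1, X, R1, R2) → Y patterns at the substitution points. All sequences and symbols are toy examples; word-boundary handling and the N-best bookkeeping are omitted.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns a list of (ref_symbol_or_None, hyp_symbol_or_None)."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                              # deletion
                          D[i][j - 1] + 1,                              # insertion
                          D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # match/substitution
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            pairs.append((ref[i - 1], None)); i -= 1
        else:
            pairs.append((None, hyp[j - 1])); j -= 1
    return pairs[::-1]

def rule_patterns(ref, hyp):
    """Emit (L2, L1, X, R1, R2, Y) wherever reference X was realized as Y != X."""
    pairs = align(ref, hyp)
    ref_positions = [k for k, (r, _) in enumerate(pairs) if r is not None]
    patterns = []
    for pos, k in enumerate(ref_positions):
        x, y = pairs[k]
        if y is None or x == y:
            continue  # this sketch only extracts substitution patterns
        ctx = lambda off: ref[pos + off] if 0 <= pos + off < len(ref) else "#"
        patterns.append((ctx(-2), ctx(-1), x, ctx(1), ctx(2), y))
    return patterns

# Toy reference vs. recognized sequence (symbols are illustrative).
ref = ["B", "AE1", "N", "K", "ER0"]
hyp = ["B", "AE1", "N", "G", "ER0"]
print(rule_patterns(ref, hyp))  # [('AE1', 'N', 'K', 'ER0', '#', 'G')]
```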

It is rather difficult to separate genuine pronunciation variations from the substitution, deletion, and insertion errors produced by the phoneme recognizer. The recognition errors should therefore be as small as possible, and three subsequent processes are applied to reduce them. First, we perform a Viterbi search over the N-best lists. Second, we only keep a sentence or an isolated word from the development set if its phoneme recognition accuracy is above a predefined threshold. Third, if more than half of the neighboring phonemes of X in the pattern differ from the neighboring phonemes of the target phoneme Y, the rule pattern is removed from the rule pattern set.
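
A minimal sketch of the second and third filters, using the same toy context format as above; both thresholds and all symbols are arbitrary here.

```python
def keep_utterance(phone_accuracy, threshold=0.7):
    """Second filter: keep an utterance only if its phoneme recognition
    accuracy is above a predefined threshold."""
    return phone_accuracy >= threshold

def keep_pattern(ref_context, hyp_context):
    """Third filter: ref_context / hyp_context are the (L2, L1, R1, R2) neighbors
    of X in the reference and of Y in the recognized sequence; drop the pattern
    if more than half of them disagree."""
    mismatches = sum(r != h for r, h in zip(ref_context, hyp_context))
    return mismatches <= len(ref_context) // 2

print(keep_utterance(0.82))                                              # True
print(keep_pattern(("AE1", "N", "ER0", "#"), ("AE1", "N", "ER0", "#")))  # True
print(keep_pattern(("AE1", "N", "ER0", "#"), ("IY1", "M", "AA1", "#")))  # False
```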

After the rule patterns have been filtered for errors, pronunciation variation rules are constructed with decision trees. The attributes are the two left phonemes, L1 and L2, and the two right phonemes, R1 and R2, of the affected phoneme X, and the output class is the target phoneme; one decision tree is constructed per phoneme. Each decision tree is then converted into an equivalent set of rules by tracing each path from the root node to a leaf node. Finally, the native pronunciation dictionary is adapted with these derived rules, which results in an expanded pronunciation dictionary.
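
As a hedged sketch of this step: one tree per center phoneme, with the four context phonemes as one-hot encoded attributes and the realized phoneme as the class. The patterns below are fabricated, and a real setup would train on many thousands of them.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Fabricated rule patterns for the center phoneme "k": context -> realized phoneme.
patterns = [
    ({"L2": "AE1", "L1": "N", "R1": "ER0", "R2": "#"}, "g"),
    ({"L2": "IH1", "L1": "N", "R1": "ER0", "R2": "#"}, "g"),
    ({"L2": "#",   "L1": "S", "R1": "UW1", "R2": "L"}, "k"),
    ({"L2": "AH0", "L1": "S", "R1": "UW1", "R2": "L"}, "k"),
]

X_dicts, y = zip(*patterns)
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(X_dicts)            # one-hot encoding of L2, L1, R1, R2
tree = DecisionTreeClassifier(max_depth=4).fit(X, y)

# Each root-to-leaf path is one candidate pronunciation variation rule.
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
```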

For instance, consider a decision tree:

Decision Tree for centre phoneme k

For a given center phoneme 'k', the output value can change from 'k' to 'g' depending on L1 and R1. In other words, the output value can become 'g' instead of 'k' when L1 is 'n' or 'jv'; more precisely, it changes to 'g' when R1 is 'v' or 'U' and L1 is 'n' or 'jv'. The above procedure is applied to all the phonemes, resulting in a total of 40 decision trees. Next, by tracing each path from the root node to a leaf node, the decision tree shown above can be converted into the following set of rules:

Rule set for the center phoneme “k”

where N is the rule number (N = 1 in this example) and [Rule Accuracy] is the relative frequency with which the rule applies among all the rule patterns associated with the center phoneme 'k'. If no rule matches a rule pattern, the default rule is applied. After collecting all the rules obtained from the 40 decision trees, we apply a pruning technique to select the most effective ones: a rule is declared effective if its rule accuracy is greater than a given threshold. Finally, the pruned rules are applied to the lexicon, which is thereby expanded.
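
A small sketch of this final pruning-and-expansion step, assuming rules are stored as (center phoneme X, context, target Y, accuracy) tuples in the same toy format used above; the threshold, the rules, and the lexicon entry are illustrative.

```python
# Illustrative learned rules: (X, (L2, L1, R1, R2), Y, rule_accuracy).
rules = [
    ("K", ("AE1", "N", "ER0", "#"), "G", 0.62),
    ("T", ("AO1", "#", "ER0", "#"), "D", 0.18),  # low accuracy, will be pruned
]
THRESHOLD = 0.3
effective = [r for r in rules if r[3] >= THRESHOLD]  # keep only effective rules

def apply_rules(pron):
    """Generate variants of one pronunciation by firing every matching effective rule."""
    padded = ["#", "#"] + pron + ["#", "#"]
    for x, (l2, l1, r1, r2), y, _ in effective:
        for i in range(2, len(padded) - 2):
            if padded[i] == x and (padded[i - 2], padded[i - 1],
                                   padded[i + 1], padded[i + 2]) == (l2, l1, r1, r2):
                variant = pron.copy()
                variant[i - 2] = y
                yield variant

lexicon = {"banker": [["B", "AE1", "N", "K", "ER0"]]}
for word, prons in lexicon.items():
    for pron in list(prons):
        prons.extend(apply_rules(pron))
print(lexicon)  # the "G" variant is added alongside the canonical pronunciation
```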

In a data-driven approach, a great deal of confusability is introduced by errors in the automatic phonemic transcriptions. Since these transcriptions are the information source from which new variants are derived, incorrect variants may be created.

There are various other data-driven approaches to modeling the lexicon. Some assign weights to phonemes and use those weights, during substitution, insertion, and deletion, to obtain extra information about how good a pronunciation modification is, and then set a threshold that limits which pronunciation variants go into the lexicon. There are also approaches where transformer-based models are trained on non-native speech data and rules are derived from them using diversified decoding ideas. Links to those papers can be found at the bottom of this blog. This blog, however, aims at giving a basic, intuitive understanding of the topic rather than presenting state-of-the-art solutions.

Hybrid Approaches

Mixing both approaches has produced good results in recent times. Generally, the hybridisation is introduced through preference terms that are plugged into the scoring used by the data-driven approach. These preference terms are inspired by the knowledge-based literature: they ask the model to give higher weight to those pronunciation variants that are explained by rules from the literature, and the lexicon is then expanded accordingly.
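
One way to picture the hybrid idea is as a simple re-scoring of candidate variants, where a data-driven score is combined with a preference term that is positive whenever a knowledge-based rule explains the variant. The scoring function, weights, and candidates below are purely illustrative and not taken from any specific paper.

```python
def hybrid_score(data_score, explained_by_rule, preference=0.5):
    """Combine a data-driven confidence with a knowledge-based preference term."""
    return data_score + (preference if explained_by_rule else 0.0)

candidates = [
    ("W AO1 TT ER0", 0.40, True),   # variant predicted by a literature rule
    ("W AO1 D ER0",  0.45, False),  # variant seen in data only
]
ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2]), reverse=True)
print(ranked[0][0])  # the rule-backed variant ranks first after re-scoring
```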

These are some of the popular ways of expanding a native lexicon into a non-native one. After expanding the lexicon, we necessarily introduce checks on confusability scores to limit the model's confusability; researchers have devised a number of ways to define this confusability and algorithms to reduce it, but that goes beyond the scope of this blog.

References

  1. https://ieeexplore.ieee.org/document/4430114
  2. https://www.amazon.science/publications/l2-gen-a-neural-phoneme-paraphrasing-approach-to-l2-speech-synthesis-for-mispronunciation-diagnosis
  3. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9041230
