Automatic Pronunciation Intelligibility Assessment

Brij Mohan Lal Srivastava
Published in Viithiisys · 9 min read · Apr 27, 2018

Learning a new language isn’t easy. You must learn the words, the grammar, and the contexts in which sentences convey one meaning or another, and if you plan to speak with a native speaker, your pronunciation must be near perfect. All of this is hard to learn, and for popular languages there are many resources to assist a new learner. In this post, we focus on systems that can automatically evaluate the intelligibility of a speaker’s pronunciation and provide constructive feedback on how to correct the mistakes observed in the speech.

Pronunciation assessment systems have several use cases. They can be used to train children during their early years of language learning, or by an organization looking to hire voice talent. They can also be used to assess language proficiency in tests like TOEFL. Currently these tasks are performed with humans in the loop, which introduces subjectivity and bias into the scoring process. Automatic assessment can dramatically improve the learning process: students can learn at their own pace, without fear of being penalized for repeated mistakes. These systems can also adapt to a user’s needs and present material in diverse ways, easing learning through personalization.

Without further ado, I will now present the details of the APIA (automatic pronunciation intelligibility assessment) system which we (my mentor James Salsman and I) built as part of a project sponsored by Google Summer of Code 2017. The project was undertaken with the CMUSphinx organization, which maintains several speech-related open-source projects including Pocketsphinx, a lightweight automatic speech recognizer (ASR) built on hidden Markov models. ASR technology is quite popular these days, but its internals may not be well known. So, going forward, I will first give a brief overview of the ASR components that are crucial for APIA, then dive into the approach we adopted, and finally discuss the implementation and how you can create your own APIA for any language.

Automatic Speech Recognition

ASR is a vast research domain with several open problems still unsolved and entire conferences dedicated to solving them. It is one of the primary concerns of human-computer interaction and a playground for machine learning experts. I can barely scratch the surface in this post, but I will make sure to cover what is needed to understand APIA. For a better understanding, please refer to the additional literature mentioned at the end of the post.

A microphone receives the speech signal just like a human ear. The signal is processed and sent to algorithms that decipher the words hidden in it. Each part of the signal represents a sound which carries specific meaning in a language, like /ka/, /ba/, /na/, etc. These sounds are called phonemes. Every language has its own set of phonemes. Speech processing is complicated because these sounds vary slightly with speaker, age, accent, emotional state and, of course, language. If all languages shared the same set of sounds, a single ASR would suffice for the entire world. When a language is taught to a child, these sounds are among the first things they are introduced to.

A recording of the word ‘car’ and the segments of phonemes in the signal.

Many Indian languages are pronounced the way they are written, but English is not. In Hindi, क is pronounced the same everywhere regardless of its context (for example: कार, मकान, etc.), but an English letter can change its corresponding sound based on context (notice the sound of C in car and machine). Hence, for accurate APIA we must know the sounds that compose a word as well as how far the speaker deviates from the ideal pronunciation.

We will consider APIA for English, which is the most well-studied language for ASR and dominates the literature with a huge amount of data. To build an ASR, one must possess a large amount of audio data along with its text transcription. Another significant resource is a lexicon: a large table containing all the words in the language (or a selected few, if building an ASR for a small domain) along with the sequence of sounds composing each word. This table must be constructed carefully and requires manual effort by native-language experts.

E.g.: CAR → /k/ /aa/ /r/
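In code, a lexicon is just a mapping from words to phone sequences, in the spirit of the CMU pronouncing dictionary used by the English Pocketsphinx models. A minimal sketch in Python; the tiny table below is only illustrative:

```python
# A toy lexicon mapping words to their phoneme sequences,
# using CMU-style ARPAbet phone symbols (illustrative subset only).
LEXICON = {
    "CAR":     ["K", "AA", "R"],
    "CAT":     ["K", "AE", "T"],
    "MACHINE": ["M", "AH", "SH", "IY", "N"],  # note: here 'C' maps to SH, not K
}

def phones_for(word):
    """Look up the phoneme sequence for a word, or None if it is out of vocabulary."""
    return LEXICON.get(word.upper())

print(phones_for("car"))      # ['K', 'AA', 'R']
print(phones_for("machine"))  # ['M', 'AH', 'SH', 'IY', 'N']
```

Notice how the same letter C maps to different phones in CAR and MACHINE, which is exactly the context dependence discussed above.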

ASR technology has recently been evolving towards end-to-end speech recognition, where one need not have access to the lexicon or the sounds of a language: a neural network learns these sounds from the audio and the corresponding text alone. Traditionally, the speech signal is first converted to the most probable sequences of sounds, and then the sequence that makes the most sense according to the lexicon and the context is selected as the output.

Our approach to APIA

When an ASR receives a speech signal, it tries to figure out the sequence of phonemes, the duration for which each phoneme was spoken, and its own confidence score (how sure the ASR is that the sound was spoken in that segment of the signal). The ASR does not return just the most probable sequence; it also returns less probable alternatives, which reveal where it was confused. We use this information to extract relevant features for APIA. We could simply match the recognized phoneme sequence against the true sequence for the word in context, but that would not let us generate useful feedback for the speaker, nor would it tell us the exact nature of the errors they made.

There are a few terms we will use in our approach.

N-best list: The list of sound sequences output by the ASR, ranked by the ASR’s confidence in them. This list is called the N-best list, where N is chosen at the programmer’s discretion.
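For illustration, an N-best list can be thought of as a ranked list of (phoneme sequence, confidence) pairs; the sequences and scores below are made up:

```python
# A hypothetical 3-best list for an utterance of the word "cat".
# Each entry pairs a candidate phoneme sequence with the ASR's confidence score.
n_best = [
    (["K", "AE", "T"], 0.71),   # most probable hypothesis
    (["K", "EH", "T"], 0.18),
    (["G", "AE", "T"], 0.11),
]

def rank_of(sequence, n_best_list):
    """Return the 1-based rank of a phoneme sequence in the N-best list, or None."""
    for rank, (seq, _score) in enumerate(n_best_list, start=1):
        if seq == sequence:
            return rank
    return None

print(rank_of(["K", "AE", "T"], n_best))  # 1
```

The rank of the true sequence within such a list is exactly the kind of quantity our features are built from, as described later.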

Grammar: The ASR requires the set of sequences that are likely to be produced in the language, so that it can reduce the list of possibilities (cut down the search space) and speed up the computation. A grammar can contain sound sequences or word sequences. We will use sound sequences as our grammar, since we operate within a word. A grammar can be stated in a finite state grammar (FSG) format, where the probabilities of transitioning from the current sound to the next are the arcs of a graph.
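Conceptually, such a grammar is a graph whose states are phonemes and whose arcs carry transition probabilities. Pocketsphinx accepts grammars in its own FSG and JSGF formats; the dictionary-of-arcs below is only a conceptual sketch with made-up probabilities:

```python
# Conceptual finite-state grammar for the word "cat": states are phonemes,
# arcs carry transition probabilities (values here are illustrative only).
fsg = {
    "<start>": [("K", 1.0)],
    "K":       [("AE", 0.9), ("EH", 0.1)],   # allow a slightly off vowel
    "AE":      [("T", 1.0)],
    "T":       [("<end>", 1.0)],
}

def successors(state):
    """Return the possible next phonemes from a state with their probabilities."""
    return fsg.get(state, [])

print(successors("K"))  # [('AE', 0.9), ('EH', 0.1)]
```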

Our approach is to extract features relevant to APIA and then present them to a neural network along with ground-truth scores manually annotated by expert speakers. If you are not familiar with neural networks, you can safely think of one as a complex function that takes input values and, after being exposed to a lot of data to learn from, produces the relevant output.

f(x) = y

The form of this function is generic enough to learn a complex mapping by adjusting its parameters. During training, we present many examples of word features and their intelligibility scores (between 0 and 1) to adjust the parameters of the neural network. At test time, when a speaker pronounces a word, we extract features from the recording, present them to the trained network, and it produces the intelligibility score as output.
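As a concrete sketch, the scoring network can be a small feed-forward regressor in Keras (Keras is what the project uses for the per-word DNNs, as noted in the steps later in this post). The layer sizes, feature dimension and random placeholder data below are illustrative, not the exact configuration we trained:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

FEATURE_DIM = 16  # hypothetical length of the per-word feature vector

# A small feed-forward regressor: features in, intelligibility score (0..1) out.
model = Sequential([
    Dense(32, activation="relu", input_shape=(FEATURE_DIM,)),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),  # sigmoid keeps the score in [0, 1]
])
model.compile(optimizer="adam", loss="mse")

# Toy training data: rows of features with expert-annotated scores.
X_train = np.random.rand(100, FEATURE_DIM)   # placeholder features
y_train = np.random.rand(100)                # placeholder scores in [0, 1]
model.fit(X_train, y_train, epochs=10, batch_size=8, verbose=0)

# At test time: extract features from a new recording and predict its score.
x_new = np.random.rand(1, FEATURE_DIM)
print(float(model.predict(x_new)[0][0]))
```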

Let us now talk about these features, since they are the most important part of our approach.

The features are extracted through several passes of recognition. To perform a single pass, we need the audio and a grammar. For the word CAT, our grammar simply looks like this:

K AE T

This tells the ASR that the three sounds (phonemes) in the grammar above are present in the audio input. All it has to do is find their locations, durations and confidence scores. It also outputs the less likely sequences as an N-best list. After the first pass, we know the alignment of the phonemes with the audio signal. Now we break the phoneme sequence into triphones. For the given example:

SIL K AE

K AE T

AE T SIL

The SIL phoneme stands for silence. For a longer word, we get a longer list of triphones. Subsequent passes through the ASR are of three types.

Three types of feature extraction passes through the ASR

Substitution pass: The middle phoneme of the triphone is replaced by arbitrary phonemes of the language, so SIL K AE becomes SIL <some phoneme> AE. English has 39 phonemes, so we get 39 new substitution sequences. Each of these sequences is converted to a grammar, and the corresponding segment of audio is aligned with the new grammar. For each pass, we note the rank of the true phoneme in the N-best list. This number is normalized over all the passes, and we finally obtain a single number which becomes part of our list of features.

Insertion pass: Similarly, the insertion pass measures the likelihood that an arbitrary phoneme was inserted into the correct sequence, so SIL <some phoneme> K becomes the new grammar. Likewise, all the grammars are aligned and the rank of the true phoneme is noted in the N-best list.

Deletion pass: This pass checks whether the ASR thinks a phoneme is missing from the audio. Phonemes in the true sequence are omitted and the result is aligned with the audio segment.
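The grammar variants for these passes can be generated mechanically from the triphone list above. A minimal sketch in Python, assuming a short illustrative phone list in place of the full 39-phone English set (the function names and the SIL padding are my own choices):

```python
# Illustrative subset of the English phone set; the real system uses all 39 phones.
PHONEMES = ["AA", "AE", "AH", "B", "D", "EH", "K", "S", "T"]

def triphones(phones):
    """Expand a phoneme sequence into triphones, padding with SIL at both ends."""
    padded = ["SIL"] + list(phones) + ["SIL"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

def substitution_grammars(tri):
    """Replace the middle phoneme of a triphone with every phoneme in the set."""
    left, _mid, right = tri
    return [(left, p, right) for p in PHONEMES]

def insertion_grammars(tri):
    """Insert an arbitrary phoneme between the first two phonemes of the triphone."""
    left, mid, _right = tri
    return [(left, p, mid) for p in PHONEMES]

def deletion_grammar(tri):
    """Drop the middle phoneme to test whether the ASR prefers its absence."""
    left, _mid, right = tri
    return (left, right)

for tri in triphones(["K", "AE", "T"]):
    print(tri, "->", len(substitution_grammars(tri)), "substitution grammars")
# ('SIL', 'K', 'AE') -> 9 substitution grammars
# ('K', 'AE', 'T')   -> 9 substitution grammars
# ('AE', 'T', 'SIL') -> 9 substitution grammars
```

For every generated grammar, the corresponding audio segment is re-aligned by the ASR and the rank of the true phoneme is read off the N-best list, as described above.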

Further, we also obtain features inspired by the physiology of the human vocal tract. Researchers model the human voice apparatus as a series of connected pipes of varying cross-sectional area, as in the figure below (image courtesy: Macquarie University). Please refer to Azu’s blog for a detailed overview.

Representation of human vocal tract as the pipe model

The features we use are:

Place of articulation: The place in the vocal tract where the airflow may be obstructed by the tongue, lips or velum.

Closedness: This value indicates the proximity of the tongue to the roof of the mouth without creating a constriction.

Roundedness: This value indicates the shape of the lips while pronouncing a sound; for example, /o/ and /i/ cause the lips to be shaped differently.

Voicing: Try pronouncing /aa/ with your hand placed on your neck, then repeat with just a hiss (pronouncing /s/). Notice the vibration on your neck for /aa/ but not for /s/. That is your vocal folds vibrating; this vibration is called voicing.

We can predict these values for the phoneme in context using a lookup table or a predictor such as a neural network.
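A minimal sketch of such a lookup table, with entries based on standard phonetic descriptions of a few ARPAbet phones; the numeric encodings for closedness and roundedness are my own illustrative choices, not the exact values used in the project:

```python
# Per-phoneme articulatory features: place of articulation, closedness,
# roundedness and voicing. Numeric scales here are illustrative only.
ARTICULATORY = {
    #        place         closedness  roundedness  voiced
    "K":  ("velar",        None,       0.0,         False),
    "T":  ("alveolar",     None,       0.0,         False),
    "S":  ("alveolar",     None,       0.0,         False),
    "AA": ("back vowel",   0.1,        0.0,         True),
    "AE": ("front vowel",  0.2,        0.0,         True),
    "IY": ("front vowel",  0.9,        0.0,         True),
    "OW": ("back vowel",   0.6,        0.8,         True),
}

def articulatory_features(phone):
    """Return (place, closedness, roundedness, voiced) for a phoneme, if known."""
    return ARTICULATORY.get(phone)

print(articulatory_features("OW"))  # ('back vowel', 0.6, 0.8, True)
```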

Once we have all the features (the ASR-pass features plus the physiological features), we have enough information to deduce the types of errors the speaker made relative to the ground truth. We can then provide constructive feedback to the speaker about the exact nature of the errors. We can even pass this information to a 3-D model of the vocal tract and ask the speaker to correct themselves using visual feedback. KTH Royal Institute of Technology is working towards developing such interactive models of the vocal tract (see: link).
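As a rough illustration of how feedback of this kind could be generated, one can compare the articulatory features of the expected phoneme with those of the phoneme the ASR preferred; the rules, table values and phrasing below are hypothetical, not the exact feedback logic of the system:

```python
# Tiny articulatory table (same format as the lookup sketched earlier).
TABLE = {
    "K": ("velar",    None, 0.0, False),
    "T": ("alveolar", None, 0.0, False),
    "D": ("alveolar", None, 0.0, True),
}

def feedback(expected, observed, table):
    """Compare articulatory features of two phonemes and suggest a correction."""
    exp, obs = table.get(expected), table.get(observed)
    if exp is None or obs is None:
        return "No articulatory information available for this phoneme."
    hints = []
    if exp[0] != obs[0]:
        hints.append("move the constriction to the %s region" % exp[0])
    if exp[3] and not obs[3]:
        hints.append("let your vocal folds vibrate (voice the sound)")
    elif obs[3] and not exp[3]:
        hints.append("stop the vocal-fold vibration (devoice the sound)")
    if not hints:
        return "Pronunciation of /%s/ looks close; keep practising." % expected
    return "For /%s/: " % expected + "; ".join(hints) + "."

print(feedback("K", "T", TABLE))  # place-of-articulation hint
print(feedback("T", "D", TABLE))  # voicing hint
```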

Enough, show me the code!

The past sections covered the background needed to get you started with APIA. Now let’s jump into the implementation details. We extend Pocketsphinx (specifically the JavaScript build, pocketsphinx.js) as the ASR, since pre-trained English models are already available for it. We modified it so that we can directly obtain the features we require for APIA. Please see the following repository for the modified pocketsphinx.js:

https://github.com/brijmohan/pocketsphinx.js

The following GitHub repository hosts a live demo of the project and the rest of the implementation details:

https://github.com/brijmohan/iremedy/tree/gh-pages

In case you wish to implement APIA for your own language and a new set of words, you must do the following:

1. Train a Pocketsphinx model for your language

2. Compile the model using Emscripten to port it to JavaScript

3. Collect speech recordings and intelligibility scores for the new words

4. Train DNN models for each word using Keras and use keras.js to port them to JavaScript (a minimal sketch follows this list)

5. Finally, plug them into the iremedy project. That’s it!
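A minimal sketch of step 4, assuming the per-word features and scores from step 3 are already available; the word list, feature dimension and the load_word_data helper are hypothetical placeholders:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

FEATURE_DIM = 16           # hypothetical feature-vector length
WORDS = ["CAT", "CAR"]     # the words you collected recordings for

def load_word_data(word):
    """Placeholder: load (features, scores) collected for this word.

    In practice these come from the ASR passes and expert annotations;
    random arrays are used here only so the sketch runs end to end.
    """
    return np.random.rand(50, FEATURE_DIM), np.random.rand(50)

for word in WORDS:
    X, y = load_word_data(word)
    model = Sequential([
        Dense(32, activation="relu", input_shape=(FEATURE_DIM,)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=20, batch_size=8, verbose=0)
    # Save in HDF5 format; keras.js provides tooling to convert the saved
    # model into the format it loads in the browser.
    model.save("%s_intelligibility.h5" % word.lower())
```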

These steps might seem trivial, but there is plenty to do. Please feel free to contact me if you face any issues, or raise them on the GitHub page. If you are interested in more of the nerdy details, we wrote a paper on this work. Please check it here: Link

To go deeper into the details of ASR, please refer to the HTK Book: Link

I hope you liked the information I shared. I would love to have your suggestions and feedback to help me write better in the future.

Have fun learning!
