Phoneme-BERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript

This blog post describes our technical paper accepted at Interspeech 2021, Czechia.


The problem at hand

Automatic Speech Recognition (ASR) systems fail to transcribe real-life calls with 100% accuracy. The resulting insertion, substitution, and deletion errors degrade the performance of machine learning systems on downstream tasks such as intent and slot detection, entity recognition, and sentiment classification.

How can we develop a language model that is more robust to ASR errors and leads to better performance on downstream SLU tasks?

Proposed Setup


A phoneme is a perceptually distinct unit of sound; phoneme sequences are auxiliary information that can be extracted from any natural speech conversation in addition to the ASR transcript. In most cases, the phoneme error rate (PER) is much lower than the word error rate (WER). This means that phonemes are captured more accurately in the predicted sequence than the words themselves.
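As a concrete illustration of why PER can sit far below WER, here is a minimal edit-distance sketch. The utterance and its ARPAbet-style phonemes are hypothetical examples, not drawn from the paper's data:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))  # dp[j] = distance(ref[:i], hyp[:j])
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev_diag + cost)   # substitution / match
            prev_diag = cur
    return dp[-1]

# Reference utterance vs. an ASR hypothesis that split "invoice".
ref_words = "please send the invoice today".split()
hyp_words = "please send thee in voice today".split()

# The same pair at the phoneme level: only one phoneme differs.
ref_phones = "P L IY Z S EH N D DH AH IH N V OY S T AH D EY".split()
hyp_phones = "P L IY Z S EH N D DH IY IH N V OY S T AH D EY".split()

wer = edit_distance(ref_words, hyp_words) / len(ref_words)     # 3/5 = 0.60
per = edit_distance(ref_phones, hyp_phones) / len(ref_phones)  # 1/19 ≈ 0.05
```

Here the word-level errors ("thee", "in voice") barely register at the phoneme level, so a phoneme sequence preserves much of the signal the transcript loses.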

We hypothesize that a joint language model trained with phoneme sequence and ASR transcript could learn a phonetic-aware representation that is robust to noise and errors in transcripts.

To this end, we propose Phoneme-BERT, a BERT-style language model optimized with a joint training objective to predict masked tokens from ASR transcript and phoneme sequence.

Loss Function

To train Phoneme-BERT, we use three loss functions as described below:

  • ASR MLM loss: the masked language modeling (MLM) loss over tokens belonging to the ASR transcript.
  • Phoneme MLM loss: we mask BPE tokens in the phoneme sequence and set up an MLM task on top of it; this loss optimizes the prediction of masked phoneme tokens.
  • Joint MLM loss: in addition to the isolated ASR and phoneme MLM tasks, we concatenate the ASR and phoneme sequences and randomly mask tokens in either one. The model must predict the masked tokens by leveraging information from both sides.
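The three objectives share a single masking step, applied to the word sequence, the phoneme sequence, or their concatenation. The sketch below is a minimal illustration assuming BERT's usual 15% masking rate and simple `<s>`/`</s>` separators; the actual tokenization, special tokens, and masking rates are implementation details not spelled out here:

```python
import random

MASK = "[MASK]"

def mask_joint_sequence(word_tokens, phoneme_tokens, mask_prob=0.15, seed=0):
    """Concatenate word and phoneme tokens with separators and randomly
    mask positions in either segment. Returns (inputs, labels): labels
    keep the original token at masked positions and None elsewhere."""
    rng = random.Random(seed)
    tokens = ["<s>"] + word_tokens + ["</s>"] + phoneme_tokens + ["</s>"]
    inputs, labels = [], []
    for tok in tokens:
        if tok not in ("<s>", "</s>") and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)  # predicted from both word and phoneme context
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

The total pre-training loss is then the sum of the ASR, phoneme, and joint MLM cross-entropy losses computed at the masked positions.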

To predict a token masked in the word sequence, the model can either attend to surrounding word tokens or to the phoneme sequence, encouraging the model to align the word and phoneme representations, thus, making the word representations more phonetic-aware.

The joint modeling of words and phonemes lets the model leverage the phoneme context when the word context is insufficient to infer a masked token in the word sequence, and vice versa.


We generate noised data both for pre-training and for the downstream tasks. The proposed method is additionally evaluated on real-life speech data (Observe.AI’s sentiment classification task). A total of ~200k data points are used to pre-train the model on the ASR corpus, which we build from LibriSpeech together with a combination of Amazon reviews and the SQuAD dataset.

Since we need ASR transcripts, we use Amazon Polly to convert the raw texts to speech, add ambient noise and prosody variations to align the data with a real-life speech environment, and then convert the audio back to transcripts using Amazon Transcribe.
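The actual pipeline relies on Amazon Polly and Amazon Transcribe, which are not reproduced here. As a hypothetical stand-in for the noise-injection step, this sketch mixes zero-mean white noise into a waveform at a chosen signal-to-noise ratio:

```python
import math
import random

def add_ambient_noise(signal, snr_db, seed=0):
    """Mix white noise into a waveform at a target SNR (in dB).
    `signal` is a list of float samples; returns the noised waveform."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(noise_power)  # std-dev of the zero-mean noise
    return [s + rng.gauss(0.0, scale) for s in signal]
```

Real ambient-noise recordings and prosody controls (as used with Polly's synthesis) would replace the white noise here; the point is only that the corrupted audio, not the clean text, is what gets transcribed.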

For downstream tasks, we use SST-5 for sentiment classification, TREC for question classification, and ATIS for intent classification. We follow a similar pipeline to create ASR versions of the downstream datasets to evaluate the proposed setup.

Additionally, we also evaluate the performance of the proposed method on a real-life call center sentiment classification (Observe.AI’s dataset).

Phoneme generation

  • Use the listen-attend-spell (LAS) method to train a phoneme generator
  • Compare it with generating the phoneme sequence directly from the ASR transcript using the Phonemizer tool
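For intuition, the transcript-based route (the Phonemizer option) essentially amounts to a lexicon lookup. The tiny lexicon below is a hypothetical CMUdict-style stand-in, not the actual tool:

```python
# A toy CMUdict-style lexicon; a real system would use the full
# dictionary or a tool such as Phonemizer.
LEXICON = {
    "book":   ["B", "UH", "K"],
    "a":      ["AH"],
    "flight": ["F", "L", "AY", "T"],
}

def text_to_phonemes(text, lexicon=LEXICON, unk="<unk>"):
    """Map an ASR transcript to a flat phoneme sequence by lexicon lookup."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(lexicon.get(word, [unk]))
    return phonemes

text_to_phonemes("book a flight")
# → ['B', 'UH', 'K', 'AH', 'F', 'L', 'AY', 'T']
```

Note the trade-off: phonemes derived from the transcript inherit the ASR's word errors, whereas the LAS generator decodes phonemes directly from audio and so can stay accurate even where the transcript is wrong.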

We release the datasets at our GitHub space for future research use.


Comparison with default RoBERTa model

A RoBERTa model trained on clean English text, when used as the base language model, performed up to 5% worse than the proposed model, and degraded by as much as 15% compared to its own performance on clean text.

Comparison with RoBERTa model fine-tuned directly on downstream task

This is a natural experimental choice, especially when no ASR corpus is available to pre-train the model on general ASR transcripts. This setup improves performance by 1–2% F1 across the datasets compared to the previous method.

Impact of joint-training

The proposed PhonemeBERT performs up to 6% better than a RoBERTa model directly fine-tuned on the downstream task. Additionally, a model pre-trained only on word (ASR) transcripts trails PhonemeBERT by up to 2.5% F1. This shows that:

  • pre-training the model on ASR corpus is an important ingredient
  • pre-training jointly on ASR and phoneme transcripts further boosts the performance of the system, suggesting that the proposed method is better equipped to handle ASR errors and noise

Using PhonemeBERT in a low-resource downstream setup

A practical bottleneck is that many off-the-shelf ASR systems do not expose phoneme outputs.

Based on our evaluations under this constraint, we observe that if we use a pre-trained Phoneme-BERT encoder with only ASR transcript inputs for downstream tasks, we still get an improvement over a word-only model by 2.5% F1.

This indicates that Phoneme-BERT’s representations are phonetically aware: even in the absence of explicit phoneme inputs for downstream tasks, the model outperforms a word-only classification model.

Conclusions and Takeaways

  • Phoneme-BERT: A method to jointly model ASR transcripts and phoneme sequences using a BERT-based pre-training setup
  • Results show that joint language model in Phoneme-BERT can leverage phoneme sequences as complementary features, making it robust to ASR errors.
  • Pre-trained PhonemeBERT can be effectively used as word-only encoder in a low-resource downstream setup where phoneme sequences are not available, still producing better results than word-only language model.
  • We also release the datasets generated in this work for research use.




Observe.AI’s official engineering blog, with insights on artificial intelligence, machine learning, design, processes, and culture.

Ayush Kumar
