G2P_EN: A Machine Learning Model for Converting English Text to Phonemes

David Cochard
axinc-ai
Aug 16, 2024

This is an introduction to G2P_EN, a machine learning model that can be used with ailia SDK. You can easily use this model, along with many other ready-to-use ailia MODELS, to create AI applications with ailia SDK.

Overview

G2P_EN is an algorithm that converts English text into phonemes using a dictionary and a machine learning model. It is used in the English mode of GPT-SoVITS.

G2P stands for “Grapheme to Phoneme”, which is the process of converting text into phonemes. It is primarily used as a preprocessing step for Text-to-Speech (TTS) systems.

For example, if you have the text “To be or not to be, that is the question,” the result of the G2P process would be something like “T UW1 B IY1 AO1 R N AA1 T T UW1 B IY1 DH AE1 T IH1 Z DH AH0 K W EH1 S CH AH0 N.”

The algorithm

Traditionally, G2P has been implemented using a dictionary-based approach. However, a dictionary alone cannot convert words that it does not contain. G2P_EN addresses this issue by using a neural network to predict phonemes for unknown words.
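The dictionary-first, model-as-fallback idea can be sketched in a few lines. This is only an illustration: the function name `word_to_phonemes`, the toy dictionary, and the stand-in `fake_predict` are hypothetical and are not part of G2P_EN.

```python
from typing import Callable, Dict, List


def word_to_phonemes(
    word: str,
    dictionary: Dict[str, List[str]],
    predict: Callable[[str], List[str]],
) -> List[str]:
    """Return the dictionary entry if the word is known, else ask the model."""
    key = word.lower()
    if key in dictionary:
        return dictionary[key]
    return predict(key)


# Toy dictionary and a stand-in "model" (both hypothetical).
toy_dict = {"to": ["T", "UW1"], "be": ["B", "IY1"]}


def fake_predict(word: str) -> List[str]:
    # Placeholder: a real system would run the neural network here.
    return ["<predicted:" + word + ">"]


known = word_to_phonemes("To", toy_dict, fake_predict)
unknown = word_to_phonemes("activationist", toy_dict, fake_predict)
```

Here `known` comes straight from the dictionary, while `unknown` is routed to the fallback because the word has no entry.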

Formatting Input Text

Numbers written as digits in the input text are converted to English words. For example, “10” is replaced with “ten” and “10000” is replaced with “ten thousand”.

text = normalize_numbers(text)
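To give an idea of what this step does, here is a minimal sketch that spells out plain non-negative integers below one million. It is not the `normalize_numbers` function itself, which may handle more cases; the names `int_to_words` and `spell_out_numbers` are made up for this example.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Spell out 0 <= n < 1_000_000 in English words."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + int_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + int_to_words(rest) if rest else ""
    return int_to_words(thousands) + " thousand" + tail


def spell_out_numbers(text: str) -> str:
    """Replace standalone runs of 1 to 6 digits with their spelling."""
    return re.sub(r"\b\d{1,6}\b", lambda m: int_to_words(int(m.group())), text)
```

Longer digit runs, decimals, and digits glued to letters are left untouched by this sketch.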

The text is lowercased, and every character other than lowercase letters, spaces, and the symbols ' . , ? ! - is removed.

text = text.lower()
# Raw string avoids an invalid-escape warning for "\-"
text = re.sub(r"[^ a-z'.,?!\-]", "", text)
text = text.replace("i.e.", "that is")
text = text.replace("e.g.", "for example")

Common abbreviations are then expanded: the last two lines of the snippet above replace “i.e.” with “that is” and “e.g.” with “for example.” Contractions such as “I’m” are not expanded. The apostrophe survives the filter, so they are looked up in the dictionary as ordinary words.
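The cleaning steps of this section can be wrapped into one small function to see their combined effect. The function name `clean_text` and the sample sentence are only for illustration, and number normalization is assumed to have run beforehand.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, drop unsupported characters, expand two abbreviations."""
    text = text.lower()
    # Keep only spaces, a-z, and the symbols ' . , ? ! -
    text = re.sub(r"[^ a-z'.,?!\-]", "", text)
    text = text.replace("i.e.", "that is")
    text = text.replace("e.g.", "for example")
    return text


cleaned = clean_text("I'm @home, e.g. Reading #NLP!")
print(cleaned)  # i'm home, for example reading nlp!
```

Note that `@` and `#` disappear while the apostrophe in “i'm” is kept.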

Word tokenization

The text is split into words using the TweetTokenizer from NLTK. Word separation could also be done on spaces and Unicode punctuation, but the TweetTokenizer additionally handles special cases such as @name. However, since `@` is removed beforehand in G2P_EN, the TweetTokenizer might not be strictly necessary.
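As a rough illustration of the simpler alternative mentioned above, the sketch below splits already-cleaned text with a regular expression instead of the TweetTokenizer. It is a stand-in under the assumption that the text only contains a-z, spaces, and ' . , ? ! -, and it does not claim to reproduce the TweetTokenizer's behavior in general. The name `simple_tokenize` is hypothetical.

```python
import re
from typing import List

# A word is a run of letters/apostrophes, optionally joined by hyphens;
# any remaining punctuation mark becomes its own token.
TOKEN_RE = re.compile(r"[a-z']+(?:-[a-z']+)*|[.,?!\-]")


def simple_tokenize(text: str) -> List[str]:
    """Split cleaned, lowercased text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = simple_tokenize("i'm an activationist.")
print(tokens)  # ["i'm", 'an', 'activationist', '.']
```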

Usage of the dictionary

The tokenized words are converted into phonemes using two dictionaries: homographs.en and cmudict.

The homographs.en dictionary covers words whose pronunciation depends on the part of speech, and defines two possible pronunciations for each. After part-of-speech tagging with NLTK’s averaged perceptron tagger, the appropriate pronunciation is selected from the dictionary.

An example entry in homographs.en includes the text, pronunciation 1, pronunciation 2, and the part of speech. If the part of speech matches, pronunciation 1 is used; if it doesn’t, pronunciation 2 is applied.

ABUSE|AH0 B Y UW1 Z|AH0 B Y UW1 S|V
ABUSES|AH0 B Y UW1 Z IH0 Z|AH0 B Y UW1 S IH0 Z|V
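A minimal sketch of how such entries could be parsed and used is shown below. The helper names are hypothetical, and two details are assumptions made for this example: the part of speech is compared as a prefix of the tagger's output (so that a tag like `VB` matches `V`), and lines starting with `#` are skipped as comments.

```python
from typing import Dict, List, Tuple

HomographEntry = Tuple[List[str], List[str], str]


def parse_homographs(lines: List[str]) -> Dict[str, HomographEntry]:
    """Parse 'WORD|pron1|pron2|POS' lines into a lookup table."""
    table: Dict[str, HomographEntry] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumed comment/blank handling
        word, pron1, pron2, pos = line.split("|")
        table[word.lower()] = (pron1.split(), pron2.split(), pos)
    return table


def pick_pronunciation(
    word: str, pos_tag: str, table: Dict[str, HomographEntry]
) -> List[str]:
    """Use pronunciation 1 if the tag matches the entry's POS, else 2."""
    pron1, pron2, pos = table[word.lower()]
    return pron1 if pos_tag.startswith(pos) else pron2


table = parse_homographs([
    "ABUSE|AH0 B Y UW1 Z|AH0 B Y UW1 S|V",
    "ABUSES|AH0 B Y UW1 Z IH0 Z|AH0 B Y UW1 S IH0 Z|V",
])
as_verb = pick_pronunciation("abuse", "VB", table)
as_noun = pick_pronunciation("abuse", "NN", table)
```

With these entries, the verb reading ends in `Z` and the noun reading ends in `S`.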

An example of an entry in cmudict is shown below. When multiple pronunciations are defined, G2P_EN uses the first pronunciation listed.

ABUSE 1 AH0 B Y UW1 S
ABUSE 2 AH0 B Y UW1 Z
ABUSED 1 AH0 B Y UW1 Z D
ABUSER 1 AH0 B Y UW1 Z ER0
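The "first pronunciation wins" rule can be illustrated with a small parser for lines in the format shown above. This is a sketch, not the loader used by G2P_EN, and `parse_cmudict` is a hypothetical name.

```python
from typing import Dict, List


def parse_cmudict(lines: List[str]) -> Dict[str, List[List[str]]]:
    """Parse 'WORD index PH1 PH2 ...' lines, keeping pronunciations in order."""
    table: Dict[str, List[List[str]]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        word, _index, *phonemes = parts
        table.setdefault(word.lower(), []).append(phonemes)
    return table


cmu = parse_cmudict([
    "ABUSE 1 AH0 B Y UW1 S",
    "ABUSE 2 AH0 B Y UW1 Z",
    "ABUSED 1 AH0 B Y UW1 Z D",
    "ABUSER 1 AH0 B Y UW1 Z ER0",
])
first = cmu["abuse"][0]  # the first listed pronunciation is the one used
```

For “abuse”, the second pronunciation (ending in `Z`) is only reachable through homographs.en, since the cmudict lookup always takes index 0.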

Phoneme estimation

If a word is not found in the dictionaries, its pronunciation is predicted using a neural network. G2P_EN v1 was implemented in TensorFlow, but G2P_EN v2 is implemented in NumPy. It uses an Encoder-Decoder structure, decoding up to 20 symbols using Greedy Search.

The structure of the Encoder is as follows: it loops through each input character and calculates its embedding.

[Figure: structure of the Encoder]

The structure of the Decoder is as follows: it loops over the output phonemes one at a time and stops when symbol 3, the end-of-sequence token, appears. The loop runs for at most 20 steps.

[Figure: structure of the Decoder]
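The decoding loop described above can be sketched independently of the actual network. In the sketch below, `step` stands for one decoder step and is a hypothetical callable that returns a score per output symbol plus the updated state. The start symbol id of 2 is an assumption of this example; the stop symbol 3 and the limit of 20 come from the description above.

```python
from typing import Any, Callable, List, Sequence, Tuple

START_ID = 2    # assumed id of the start-of-sequence symbol
END_ID = 3      # stop symbol mentioned in the text
MAX_STEPS = 20  # maximum number of decoded symbols

StepFn = Callable[[int, Any], Tuple[Sequence[float], Any]]


def greedy_decode(step: StepFn, state: Any) -> List[int]:
    """Pick the highest-scoring symbol at each step until END_ID or the limit."""
    token = START_ID
    output: List[int] = []
    for _ in range(MAX_STEPS):
        scores, state = step(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == END_ID:
            break
        output.append(token)
    return output


def toy_step(prev_token: int, state: int) -> Tuple[List[float], int]:
    """Toy decoder that emits symbols 5, 7, then END_ID (state = step count)."""
    script = [5, 7, END_ID]
    target = script[min(state, len(script) - 1)]
    scores = [0.0] * 8
    scores[target] = 1.0
    return scores, state + 1


decoded = greedy_decode(toy_step, 0)
print(decoded)  # [5, 7]
```

Greedy search keeps only the single best symbol at each step, which is cheap and usually sufficient for short outputs like word pronunciations.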

Usage in ailia SDK

You can use G2P_EN with ailia SDK using the command below.

$ python3 g2p_en.py --input "I'm an activationist."

Here is an example of the output. The phrase “I’m an” is converted using the dictionary, while the word “activationist” has its phonemes predicted by the neural network.

Output : ['AY1', 'M', ' ', 'AE1', 'N', ' ', 'AE2', 'K', 'T', 'IH0', 'V', 'EY1', 'SH', 'AH0', 'N', 'IH0', 'S', 'T', ' ', '.']
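In the output above, words are separated by `' '` entries. If per-word phonemes are more convenient for downstream processing, the flat list can be regrouped with a small helper such as the one below. `group_by_word` is a hypothetical name and not part of the sample script.

```python
from typing import List


def group_by_word(phonemes: List[str]) -> List[List[str]]:
    """Split a flat phoneme list into one list per word, using ' ' as separator."""
    words: List[List[str]] = [[]]
    for p in phonemes:
        if p == " ":
            words.append([])
        else:
            words[-1].append(p)
    return [w for w in words if w]  # drop empty groups


output = ['AY1', 'M', ' ', 'AE1', 'N', ' ', 'AE2', 'K', 'T', 'IH0', 'V',
          'EY1', 'SH', 'AH0', 'N', 'IH0', 'S', 'T', ' ', '.']
grouped = group_by_word(output)
```

For this input, `grouped` holds four groups: “I’m”, “an”, “activationist”, and the final period.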

G2P_EN can be used in Python as well as C++.

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
