Fine-tuning Allosaurus — A Universal Phone Recognizer

Jittarin Kanjanaphairoj
Published in Super AI Engineer · Mar 27, 2021 · 4 min read

Photo by Scott Evans on Unsplash

That isn’t actually an Allosaurus, but the first thing that comes up when you search for Allosaurus on Google is probably going to be the dinosaur.

However, we’re not here today to talk about the dinosaur. We’re here to talk about the universal phone recognizer and how you can fine-tune it for your own use.

What can it do?

The allosaurus library can automatically extract phones from .wav files. It can also extract the times at which it detects each phone if you modify the source code a little. This may be useful if you’re working with speech.

The details of the model can be found in the paper here:

https://arxiv.org/pdf/2002.11800.pdf

The Repository

You can find the repository on GitHub at https://github.com/xinjli/allosaurus, or load up Google Colab and install the Allosaurus library with:

pip install allosaurus

Trying It Out

After installing allosaurus, you can test it out by importing read_recognizer, which extracts phones from your .wav files.

import allosaurus
from allosaurus.app import read_recognizer

model = read_recognizer()

The code above imports the library and creates a recognizer object, which we then use like this:

output = model.recognize(path_to_voice_file, 'tha')

model.recognize() takes at least one argument: the path to the voice file. The optional second argument is the id of the language whose phones we want it to recognize; in this case I want it to recognize Thai phones. (If you omit it, allosaurus falls back to its universal phone inventory.) The recognized phones are stored in output for later use.

This is an example of the output from a model customized to output Thai IPA:

k r u n aː pʰ ɔː b pʰ t iː h ɔː ŋ t w o t a r ŋ tʰ w o p aj kʰ a ʔ

Preparing the Data for Fine-tuning

First, create a data directory. Inside the data directory, create a train directory and a validate directory.
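
The resulting layout looks like this:

data/
├── train/
└── validate/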

Then we need to prepare the data for the model to train on.

You will need to find .wav voice files in your desired language. If you don’t know where to start, have a look at the Common Voice dataset; you can also donate your voice and help verify recordings for your own language here:

https://commonvoice.mozilla.org/

Common Voice’s audio format is .mp3, so you will have to convert the files to .wav first if you want to use that dataset. You can keep the converted voice files in the newly created data directory.
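
Here is a minimal conversion sketch, assuming the pydub library (which requires ffmpeg to be installed); the clips/ and data/train/ directories are placeholders for wherever your files live:

from pathlib import Path
from pydub import AudioSegment

# Convert every Common Voice .mp3 clip in clips/ into a 16 kHz mono .wav.
for mp3_path in Path("clips").glob("*.mp3"):
    audio = AudioSegment.from_mp3(mp3_path)
    audio = audio.set_frame_rate(16000).set_channels(1)  # 16 kHz mono is typical for speech models
    audio.export(Path("data/train") / (mp3_path.stem + ".wav"), format="wav")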

After you have your voice files, you will also need the list of phones that each voice file contains. You can produce this list if you have the transcript of each voice file, by converting the text into phones with your preferred library.

For Thai, we can use the PyThaiNLP library to convert from text to IPA by:

from pythainlp.transliterate import transliterate

ipa = transliterate(txt, engine="ipa")
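
As a rough sketch of applying this per utterance (the sentences dict and the whitespace split are my assumptions, not part of the library; how one IPA string segments into individual phones depends on your language and is worth checking by hand):

from pythainlp.transliterate import transliterate

# Hypothetical: sentences maps each utt_id to its Thai transcript.
sentences = {"utt_0001": "สวัสดี"}

transcriptions = {}
for utt_id, txt in sentences.items():
    ipa = transliterate(txt, engine="ipa")
    # transliterate returns a single IPA string; splitting on whitespace is
    # only a first pass and may need adjusting to get one token per phone.
    transcriptions[utt_id] = ipa.split()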

After we have the voice files and their transcriptions, we will then need to create two files in each of the train and validate directories.

Those two files will need to be named wave and text with no file extension.

The wave file will contain the paths to all of the wave files, and each line should be formatted like this:

utt_id path/to/wave/file.wav

The text file will contain the transcribed phones separated by spaces, and each line should be formatted like this:

utt_id phone1 phone2 phone3 .... phoneN

The utt_id on each line maps a wav file to its transcribed phones.
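
A minimal sketch of writing both files for the train split, assuming a hypothetical utterances dict that maps each utt_id to its wav path and phone list (all names here are placeholders):

# Hypothetical: utterances maps utt_id -> (path to wav file, list of phones).
utterances = {
    "utt_0001": ("data/train/utt_0001.wav", ["k", "r", "u"]),
}

with open("data/train/wave", "w", encoding="utf-8") as wave_file, \
     open("data/train/text", "w", encoding="utf-8") as text_file:
    for utt_id, (wav_path, phones) in sorted(utterances.items()):
        wave_file.write(f"{utt_id} {wav_path}\n")
        text_file.write(f"{utt_id} {' '.join(phones)}\n")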

After we have created the files, we need to prepare the audio and text features for training. We can prepare the audio features with:

python -m allosaurus.bin.prep_feat --model=<some_pretrained_model> --path=/path/to/your/directory

Run the command above once for the train directory and once for the validate directory.
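
For example, with the data/ layout from earlier:

python -m allosaurus.bin.prep_feat --model=<some_pretrained_model> --path=data/train
python -m allosaurus.bin.prep_feat --model=<some_pretrained_model> --path=data/validate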

Similarly, we can prepare the text features by:

python -m allosaurus.bin.prep_token --model=<some_pretrained_model> --lang=<your_target_language_id> --path=/path/to/your/directory

The path and model should be the same as in the previous command, and the target language id should be the three-letter ISO 639-3 id of your dataset’s language (tha for Thai).
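
Again, run it for both splits; for the Thai example:

python -m allosaurus.bin.prep_token --model=<some_pretrained_model> --lang=tha --path=data/train
python -m allosaurus.bin.prep_token --model=<some_pretrained_model> --lang=tha --path=data/validate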

Training the Model

The model can be trained by running the command below:

python -m allosaurus.bin.adapt_model --pretrained_model=<pretrained_model> --new_model=<your_new_model> --path=/path/to/your/data/directory --lang=<your_target_language_id> --device_id=<device_id> --epoch=<epoch>
  • pretrained_model should be the same as the model that you used to prepare the data.
  • new_model is the name of your new model; you will use this name to call your model later.
  • device_id is the id of the GPU used for training; use -1 if you don’t have a GPU.
  • lang is the same as when you prepared the data.
  • epoch is the number of epochs to train for.
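
For example, to adapt the pretrained model into a Thai model (the name thai_finetuned and the epoch count here are arbitrary choices; device_id=-1 trains on the CPU):

python -m allosaurus.bin.adapt_model --pretrained_model=<some_pretrained_model> --new_model=thai_finetuned --path=data --lang=tha --device_id=-1 --epoch=10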

After the training has finished you will have your new model to use!

Testing Your New Model

You can list your available models with this command:

python -m allosaurus.bin.list_model

You should see your newly trained model in the list now.

You can now import your new model:

import allosaurus
from allosaurus.app import read_recognizer

model = read_recognizer("your_new_model_name")

and test it out:

output = model.recognize(path_to_voice_file, 'tha')

Wrapping Up

Hopefully you can now fine-tune your own allosaurus model and make it recognize phones in your own language more accurately!
