Assembling a large German speech corpus

Sina Schaeffler
Linagora LABS
Oct 6, 2020

Today, there are many useful applications for Automatic Speech Recognition (ASR): in entertainment, in private life, in public spaces and also at work. An important step in ASR is the transcription of an audio file, i.e. speech recorded by a machine, into written text. Building this speech-to-text component always requires huge amounts of speech data, ideally with its transcription.

This data is needed for different purposes: most of it is used to tune the internal parameters of the chosen model to the data it will handle. The data used for this purpose is called the training set. Another, completely distinct set is used to test the resulting models on new data, in order to manually choose some external parameters of the model, called hyper-parameters. This is the validation set. A final data set, called the test set, completely independent from the others, is used to evaluate the final model.
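As an illustration, here is a minimal sketch of such a split, assuming a hypothetical file utt_ids.txt with one utterance identifier per line and the common 80/10/10 ratio (the file name and ratios are assumptions, not a description of the corpora below):

```bash
# Minimal sketch: random 80/10/10 split of a hypothetical utterance list.
shuf utt_ids.txt > shuffled.txt
total=$(wc -l < shuffled.txt)
n_train=$((total * 80 / 100))
n_valid=$((total * 10 / 100))

head -n "$n_train" shuffled.txt > train_ids.txt                                 # training set
tail -n +"$((n_train + 1))" shuffled.txt | head -n "$n_valid" > valid_ids.txt   # validation set
tail -n +"$((n_train + n_valid + 1))" shuffled.txt > test_ids.txt               # test set
```

In practice, the split is usually done per speaker rather than per utterance, so that no speaker appears in more than one of the three sets.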

This is where free and open-source applications run into difficulties: many large speech corpora are not free to use, and even those which are free come in a wide variety of structures, contents and formats. We therefore need to find such corpora, bring them into a single format, test them by training simple speech-to-text models on them, and finally join them to obtain a larger, more complete corpus.

All these steps were carried out to obtain a free German speech corpus, and are explained in the following.

Search for free German speech corpora

In order to train the kind of model we need, it is necessary to get a large number of small audio files (each containing only a few seconds of speech, usually one sentence) together with the transcription of each one. To train the model to adapt to different speakers, it is also useful to know which sentences were spoken by the same person. These requirements strongly limit the amount of accessible data, since the only collections of audio files suited to our use are those which have been specially formatted for it, for example by cutting them into single-sentence files.

As of July 2020, according to my research, there are only three large German corpora, each containing over 100 hours of speech, that are freely available on the internet. These are:

- Commonvoice: An open collaborative project of the Mozilla Foundation

- Caito (m-ailabs): An independent open-source project by software engineer Imdat Solak

- Voxforge: A site which offers open speech corpora in several languages

Even at first glance, the corpora are very different: their general structure and content differ, as do their size and the characteristics of their audio files:

Audio data per corpus (total: 1085 h of German speech data)

Content and format

It is clear that Commonvoice is by far the largest corpus of the three, as it contains more data than the other two combined.

Mozilla’s Commonvoice corpus contains almost two thirds of the available data

Even though all of these data sets are intended for training ASR models, each of them has some limitations or flaws.

Commonvoice, for instance, is a very large corpus of speech recorded in various conditions, as volunteers contribute to it using their own devices. This is very good for training a model usable in real life. However, since each sentence is typically checked for correctness by only two other volunteers, the quality of the transcriptions, and even the choice of sentences, is not always perfect.

Caito has several issues with the speakers. Since the data comes from audio books, and the only person associated with a book may be either the author or the reader, it is difficult to know for certain who reads which sentence. For two books it is noted that the speakers are a mix of male and female, but it is never specified who reads what, so this corpus is less useful for speaker adaptation.

Voxforge, finally, uses the same few sentences several times. Each reading was recorded by multiple microphones at the same time, with different quality and background noise. This, again, is good for obtaining a model usable in many different situations. However, its proposed train/validation/test separation is maybe not clean enough: several sentences of the training set also appear in the validation or test set, which biases the evaluation when a model trained on the training set is tested on one of the other sets. Another possible source of bias is the language model we used for the tests, which comes from CMU Sphinx; CMU Sphinx is related to Voxforge, so this model could overfit the Voxforge corpus.

So none of these datasets is perfect, but at least they exist, each has its advantages, and together they amount to over 1000 hours of audio data. It should therefore be possible to train decent German ASR models using only these freely accessible datasets, which is what we tried next.

Formatting, tests and comparison

In order to compare the different corpora, we used simple models provided by the Kaldi toolkit. As this toolkit requires the corpora to be in a very precise format, we converted all of them to this form.
Additionally, the Commonvoice and Caito corpora were split into train/validation/test sets containing, respectively, 80, 10 and 10 percent of the data. The already existing split of Voxforge was kept despite its flaws.
Once this work was finished, two very simple hidden Markov model / Gaussian mixture model (HMM-GMM) architectures, one monophone (i.e. assuming a sound's pronunciation does not depend on its neighbors) and one triphone (the sound is assumed to depend only on its two immediate neighbors), were successively trained on each of the training sets.
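For reference, here is a minimal sketch of the Kaldi data-directory layout each corpus was converted to; the utterance ID, speaker ID, audio path and sentence are invented for illustration:

```bash
# Sketch of a Kaldi data directory with one illustrative utterance.
mkdir -p data/train

cat > data/train/wav.scp <<'EOF'     # utterance ID -> audio file
spk001-utt0001 /corpora/example/clips/utt0001.wav
EOF

cat > data/train/text <<'EOF'        # utterance ID -> transcription
spk001-utt0001 guten morgen wie geht es dir
EOF

cat > data/train/utt2spk <<'EOF'     # utterance ID -> speaker ID
spk001-utt0001 spk001
EOF

# spk2utt can be derived from utt2spk, and Kaldi ships helpers to check the directory.
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
utils/validate_data_dir.sh --no-feats data/train
```

Knowing which utterances share a speaker (the utt2spk file) is exactly the information needed for the speaker adaptation mentioned earlier.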

The Kaldi toolkit facilitates testing the corpora

The results below were obtained by training a model with the steps/train_mono.sh script from kaldi/egs/wsj/s5 on the training set of each corpus and then decoding each of the validation sets with this model using utils/mkgraph.sh and steps/decode.sh from the same example recipe.
The decoding applies the model to the audio data of the chosen set and compares the resulting transcription to the reference transcription in order to find the percentage of wrongly transcribed words, called the word error rate (WER).
More precisely, it computes WER values for several decoding hyper-parameter settings, and the WERs reported below are always the best ones for the given decoding.
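Put together, one training/decoding round looks roughly like the following sketch, run from a copy of the kaldi/egs/wsj/s5 recipe directory. The data/lang directory (built from the CMU Sphinx dictionary and language model), the directory names and the omitted --nj/--cmd options are assumptions:

```bash
# Hedged sketch of one training/decoding round in the wsj/s5 recipe layout.
. ./path.sh

# MFCC features and cepstral mean/variance statistics for both sets.
for set in train valid; do
  steps/make_mfcc.sh data/$set exp/make_mfcc/$set mfcc
  steps/compute_cmvn_stats.sh data/$set exp/make_mfcc/$set mfcc
done

# Monophone training on the training set of one corpus.
steps/train_mono.sh data/train data/lang exp/mono

# Build the decoding graph and decode the validation set.
utils/mkgraph.sh data/lang exp/mono exp/mono/graph
steps/decode.sh exp/mono/graph data/valid exp/mono/decode_valid

# WER = (substitutions + deletions + insertions) / number of reference words;
# scoring produces one wer_* file per hyper-parameter setting, and we keep the best.
grep WER exp/mono/decode_valid/wer_* | utils/best_wer.sh
```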

For the simplest model, the monophone model, we got the following results:

WER of monophone training

The same table for triphone results looks like this:

WER of triphone training

First, the results of decoding with the models trained on Voxforge look too good to be true, especially in the monophone table. By investigating, we found out that the validation and training sets of Voxforge have many sentences in common, maybe too many. Additionally, the dictionary and the language model from CMU Sphinx used for this and the following tests overfit Voxforge, as the following table shows.

Perplexity and percentage of out-of-vocabulary words on the validation sets of the corpora

Aside from the bias of Voxforge, it is interesting that every model does best “within corpus”, i.e. it scores best on the validation set from the corpus on which it was trained. This is expected and probably due to similarities between files of the same corpus, both in content (and thus sentence type) and in recording conditions such as background noise or microphone type.

In conclusion, the model trained on Voxforge is almost certainly not as good as it looks, and the decoding on Voxforge/validation does not say much about a model's quality.

Merging the corpora

By combining the corpora, an even larger corpus can be created.
There are several small difficulties in the practical realization. For example, in order to keep this corpus easy to maintain and to update in case one of its component corpora is updated, it is better to keep their audio data separate and to merge only the transcriptions once they are in the same format. Once this is achieved, the previous tests can be run on the new, combined corpus.
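Assuming each formatted corpus is already a valid Kaldi data directory (the directory names below are invented), merging the metadata can be sketched with Kaldi's own helper script:

```bash
# Combine the per-corpus metadata into one training set; the audio files stay
# where they are and are only referenced by the paths in each wav.scp.
utils/combine_data.sh data/train_all \
  data/commonvoice/train data/caito/train data/voxforge/train

utils/validate_data_dir.sh --no-feats data/train_all
```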

WERs of monophone training

The combined model performs worse on each validation set than a model decoding in-corpus. However, it does better than the models trained on Caito or Commonvoice do on the validation set of the other corpus. Thus it seems that the model trained on the merged corpus gives more stable results. It would be interesting to check this by decoding an entirely new validation set, outside any of the initial corpora, with each of the models: if the merged model is more stable, it should give better results on a new corpus than any of the other models.

So it is possible to conclude that combining the corpora gave a better-performing and less specialized model. This is not very surprising, as using more data is likely to give better models, but it was still necessary to check it.

Another interesting detail is that the results of the merged-corpus model are often close to those of Commonvoice (except on Voxforge, where it inherits the bias of the Voxforge training set). This phenomenon is also explainable: since Commonvoice makes up a majority of the total data (677 of 1085 hours), it has more influence on the merged model than the other corpora.

As more data should help to train more accurate models, combining the corpora should give better results on the same tests than each of the individual corpora. Thus, we ran the same trainings as above (and an additional one using speaker adaptive training) on the merged corpus and decoded every model on each of the validation sets.
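These additional trainings can be sketched as follows, again in the style of the wsj/s5 recipe; the numbers of tree leaves and Gaussians and the directory names are illustrative, not the exact values we used:

```bash
# Triphone training on delta features, bootstrapped from the monophone model.
steps/align_si.sh data/train_all data/lang exp/mono exp/mono_ali
steps/train_deltas.sh 2000 10000 data/train_all data/lang exp/mono_ali exp/tri1

# Speaker adaptive training (SAT) on fMLLR-adapted features.
steps/align_si.sh data/train_all data/lang exp/tri1 exp/tri1_ali
steps/train_sat.sh 2500 15000 data/train_all data/lang exp/tri1_ali exp/tri2
```

Decoding with the SAT model then uses steps/decode_fmllr.sh instead of steps/decode.sh, so that the same speaker adaptation is applied at test time.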

WERs of a model trained on all corpora

Conclusion

From the three large German speech corpora found online, it was possible to create an even larger merged corpus. The tests showed that some biases need to be taken into account, but the combined corpus still seems well suited for training even better German ASR models.

In the tests described above, we used only models with an HMM-GMM (hidden Markov model / Gaussian mixture model) architecture with default hyper-parameters and a fixed language model (this, and the dictionary, came from CMU Sphinx).

Adapting the hyper-parameters or modifying the dictionary and the language model could already reduce the rather high error rates. Moreover, these corpora, and especially the merged corpus, can also be used to train more sophisticated models, for example ones using neural networks. With these techniques it should be possible to train high-performing German ASR models exclusively on freely available datasets.
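As a sketch of the kind of change meant here, a custom trigram language model could for instance be estimated from a German text corpus with the SRILM toolkit; SRILM, the file corpus.txt and the options below are our assumptions, not what was used in the tests above:

```bash
# Estimate a trigram language model with Kneser-Ney smoothing from a text corpus.
ngram-count -order 3 -kndiscount -interpolate -text corpus.txt -lm lm.arpa
```

The resulting ARPA file can then be converted into Kaldi's FST format and compared to the CMU Sphinx model on the same validation sets.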
