Why I’ve chosen to donate my (German) voice for mankind

Aug 11, 2020

OK, since I’m not the first man on the moon, the word “mankind” is a bit much. So what is this all about?

One of several tacotron2 training configs

Guude (Hi) from the “future”, Thorsten from 2023 here:
Did you know I have a YouTube channel (“Thorsten-Voice”) about open voice technology, with (hopefully) helpful tutorials and cool stuff? :-)
If you like voice tech, please feel free to subscribe to this amazing channel ;-).

If someone had asked me a year ago whether I’d like to contribute my voice to the community, I’d probably have asked why I should. So what happened within the last year? It wasn’t one big-bang event that led to this point, but rather a series of little things that made me comfortable with sharing my voice.

But let’s go much further back in time …
Ever since I was a young man, I have been fascinated by voice-controlled human-machine interaction as shown in various TV series such as Knight Rider and Star Trek. Being a technology enthusiast, I tried to write software like that. Long story short: I didn’t succeed, so I put the plan aside.

… Much time passed …

Decades later, Apple, Amazon, Google and Microsoft were much more successful at this and released their smart voice assistant products.

Well, I’m still a technology enthusiast, but I didn’t buy these gadgets right away, even though I was amazed by their possibilities on the engineering side. Why not? Technology is one aspect, while data privacy is another concern that shouldn’t be ignored completely. I know that collecting lots of data makes these companies’ products better, because more data means better machine learning training and better models. I’m not too critical of these devices and services, but I nevertheless checked whether there are any (preferably open source) smart voice assistant projects that run without a mandatory cloud connection. The truth is: yes, but …

… two major parts of a smart voice assistant still (mostly) require cloud services, because of the compute power needed, to work in decent quality:

stt (speech to text): spoken input from the human user
tts (text to speech): generated speech output to the human user

Just one warning: running a smart voice assistant completely offline (including stt and tts) in acceptable quality isn’t simple or “one-click”, and it (currently) requires lots of knowledge and compute power.

While looking for alternatives I found a project called MyCroft, which is open source, under active development and has a helpful community. By default its stt/tts is provided by cloud services too, but requests are anonymised through MyCroft’s proxy infrastructure, and you can switch to other cloud-based or local services such as deepspeech, kaldi, ... (stt) and mimic, pico, … (tts).

I decided to dive into tts (text to speech) first and found out that I can synthesize my own voice based on “some” recordings plus metadata describing what I’ve spoken. That sounded like a fun and nerdy spare-time project. I wasn’t sure whether it’s creepy when your own personal voice assistant sounds like yourself, but I was willing to give it a try, as I didn’t find any alternative in good quality that was free to use.

Free German tts model at an acceptable quality level

I found out that there seems to be a standard “paper” and implemented framework for this called “tacotron”, available in version 1 or 2. The first version gives lower quality but runs/trains with fewer compute resources, while version 2 (combined with a vocoder) produces much more realistic results.

These implementations require lots of “clean” audio recordings (wave files) and a csv-based metadata file that maps every recorded sentence to its text. For English there’s a public domain dataset called “ljspeech”, which is widely used in projects and model training. For the recording itself, I found out that MyCroft offers a tool called “mimic-recording-studio”, which manages your recordings based on a csv metadata file and makes recording really easy through a simple browser application.
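To make the idea of such a metadata file a bit more concrete: the ljspeech dataset ships a `metadata.csv` in which every line starts with the wav file id, followed by the spoken text, separated by a pipe character. Here is a minimal Python sketch that maps each recording to its sentence. It assumes that layout and a `wavs/` folder next to the metadata file; the function name `load_metadata` is made up for this example and is not part of any of the tools mentioned above.

```python
import csv
from pathlib import Path


def load_metadata(metadata_file):
    """Map each wav path to its spoken text (assumes 'id|text' per line)."""
    metadata_file = Path(metadata_file)
    wav_dir = metadata_file.parent / "wavs"  # assumed folder layout
    mapping = {}
    with metadata_file.open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: quotation marks inside a sentence are plain text
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip empty or malformed lines
            file_id, text = row[0], row[1]
            mapping[wav_dir / f"{file_id}.wav"] = text
    return mapping
```

With a mapping like this it is easy to check, before any training run, that every sentence in the metadata file really has a recording on disk (and the other way round).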

Although I’d read that a good microphone and a quiet room are required, I was somewhat naive, put on my old usb headset and started recording right in the room I was in.

So I started recording and had real fun watching the recording counter climb higher and higher. In the meantime I found out that it’s not enough to record just (let’s say) 100 phrases. To get an acceptable model after training we’re talking about a minimum of 15 hours of pure recordings. After having read around 7k phrases (or roughly 6 hours of pure audio) I shared my dataset and experiences with the Mozilla community (https://discourse.mozilla.org/t/contributing-my-german-voice-for-tts/). In this thread I received a lot of positive feedback and “thumbs up” emojis.

Thank you community for your feedback and support :-)

This was the first time I really realized that there was genuine interest in a free German voice dataset. From that point on I was determined to push things forward, for me and for the communities.

After a bit more than 14k recordings I decided to give training a try and started a training run with tacotron (version 1).

“Hey, it’s 2020 — it can’t take that long to train a tts model.”

Generally, machine learning processes (like this one) love gpu memory, but my graphics card has just 4 GB of gpu ram, which is way too little for good training performance. So the fallback was cpu-based training, which isn’t that efficient, but it worked. For my test I really hoped I’d get a model after some minutes or hours, but I was wrong about that.

It took time, and time, and time, and time, … I kept my pc running day and night for almost a week, and the training process constantly used 100% of the cpu. Sadly, the test audio samples generated during training were far from good, so I asked the Mozilla and MyCroft communities for help.

“When the documentation tells you to use a good recording setup, you really should take this seriously!”

Soon after asking the community for help I was contacted by a really nice guy from the MyCroft community called Dominik. He listened to some of my recorded files and found that their quality was heavily mixed. Some recordings were good, while others suffered from random noise, echoes and beeping tones in the background. Personally I hadn’t noticed this because I had never listened to my own recordings at “full volume”, but machine learning (of course) learns from ALL SOUNDS, including random background noise.

So Dominik cleaned up lots of my recordings while I borrowed a better microphone, built myself a better recording environment and continued with further recordings.

The number of posts in the Mozilla thread increased along with the number of recordings. Over the next months some other really nice guys joined my journey to provide a free-to-use German tts model. We created a group on the MyCroft chat and named ourselves after a quote from “Lord of the Rings”:

“the fellowship of the ri… — uh. free german tts model”

These guys supported me …

with gpu compute power
with know-how around machine learning
with audio optimization knowledge and tools
with know-how on German phonemes
and last but not least: with nice words

We trained (and still train) in various configurations, since tacotron offers plenty of configuration parameters and we are trying to find out which configuration fits this dataset best.

Phase 1 (dataset creation): completed

After almost six months of regular recording sessions the final dataset is finished (hopefully ;-) ).

It consists of

22,668 recorded phrases (wav files)
more than 23 hours of pure audio (no silence at the beginning or end)
sample rate: 22,050 Hz
mono
phrase length (min/avg/max): 2 / 52 / 180 chars
avg spoken chars per second: 14
sentences with question mark: 2,780
sentences with exclamation mark: 1,840
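For anyone who wants to compute figures like these for their own recordings, here is a small Python sketch. It is not the script that produced the numbers above, just one possible way to do it with the standard library; the function names and the `(text, duration)` input format are assumptions made for this example.

```python
import wave
from statistics import mean


def wav_duration_seconds(path):
    """Length of a PCM wav file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_stats(samples):
    """samples: iterable of (text, duration_in_seconds) pairs."""
    samples = list(samples)
    lengths = [len(text) for text, _ in samples]
    total_seconds = sum(duration for _, duration in samples)
    return {
        "phrases": len(samples),
        "hours": total_seconds / 3600,
        "chars_min_avg_max": (min(lengths), mean(lengths), max(lengths)),
        "chars_per_second": sum(lengths) / total_seconds,
        "questions": sum("?" in text for text, _ in samples),
        "exclamations": sum("!" in text for text, _ in samples),
    }
```

A quick plausibility check on the list above: a bit more than 23 hours spread over 22,668 phrases is about 3.7 seconds per phrase, and an average phrase of 52 chars spoken at 14 chars per second also takes about 3.7 seconds, so the numbers fit together.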

The dataset download url and further details are available on my GitHub page: https://github.com/thorstenMueller/deep-learning-german-tts/

Next steps

We (the fellowship of the free German tts model) are still experimenting with different taco2 configs and can hopefully provide a ready-to-use model to the community in the near future.

Who are these “fellowship” guys I’m talking about?

Please give a round of applause for…

domcross (https://github.com/domcross/)
eltocino (https://github.com/el-tocino/)
erogol (https://github.com/erogol/)
gras64 (https://github.com/gras64/)
krisgesling (https://github.com/krisgesling/)
nmstoker (https://github.com/nmstoker)
othiele (https://discourse.mozilla.org/u/othiele/summary)
repodiac (https://github.com/repodiac)

And of course: to all communities around the globe working for our open future.

Are there any downsides to contributing your voice?
But what if…

Your own voice is of course a very personal thing. Can I still use voice-activated systems in the future when my voice (original recordings and model) is public? To be honest, I’m not sure. Will I feel good if my online banking, my car or whatever else can be unlocked with my voice in the future? Will somebody who knows me personally use my voice to make “funny” phone calls to my friends, telling them mad things I’d never say to them? Surely a slightly uneasy feeling will remain.

These are all valid questions and concerns, but I decided to think positively and hope my voice contribution will be used in good ways and not abused.

I’d like to share some of my personal opinions with you:

I contribute my voice as a person who believes in a world where all people are equal, regardless of gender, sexual orientation, religion, skin color or the geocoordinates of their birthplace. A global world where everybody is warmly welcome anywhere on this planet, and where open and free knowledge and education are available to everyone.

So hopefully my voice is used in this spirit to make this world a better place for all of us :-). So please don’t use it for evil!

tl;dr: why I contributed my voice

I believe free models/tools for voice-based human <-> machine interaction are important
I love open source and wanted to make “my most personal” contribution
I like the idea that in the future, when I’m maybe no longer alive, a refrigerator will say that the milk is empty using my voice (on one hand it’s a little creepy, but hey, it’s cool too)
I wanted a German tts model to be freely available to everyone (from small community-driven projects and educational institutions to commercial use) without any license struggles
I often heard and read things like: Why isn’t there a German model available? Why hasn’t someone done this? This should be available … So I decided to be this mysterious “someone” people seem to be waiting for.
Because I can :-)

Links

Mozilla TTS
MyCroft Project
https://github.com/thorstenMueller/deep-learning-german-tts/
My Mozilla discourse thread
GitHub
Discourse Link

We’ll hear from each other in the future :-)
Thorsten

Written by Thorsten Müller