Why I’ve chosen to donate my (German) voice to mankind

One of several tacotron2 training configs

… Much time passed …

Decades later, Apple, Amazon, Google and Microsoft were much more successful at that and released their smart voice assistant products.

  • stt (speech to text): spoken input from the human user
  • tts (text to speech): generated speech output to the human user

Just one warning: running a smart voice assistant completely offline (including stt and tts) at an acceptable quality isn’t simple or “one-click”. It currently also requires a lot of knowledge and compute power.

When looking for alternatives I found a project called MyCroft, which is open source, under active development and has a helpful community. By default stt/tts is provided by cloud services too, but requests are anonymised through MyCroft’s proxy infrastructure, and you can switch to other models, either cloud-based or local services such as deepspeech, kaldi, ... (stt) and mimic, pico, … (tts).

Free German tts model at an acceptable quality level

I found out that there seems to be a standard “paper” and implemented framework for this called “tacotron”, available in versions 1 and 2. The first version delivers lower quality but runs/trains with fewer compute resources, while version 2 (combined with a vocoder) produces much more realistic results.

“Hey, it’s 2020 — it can’t take that long to train a tts model.”

Generally, machine learning processes (as relevant in this case) love gpu memory, but my graphics card has just 4 GB of gpu ram, and that’s way too low for good training performance. So the fallback was cpu-based training, which isn’t that efficient, but it worked. For my test I really hoped I’d get a model after some minutes or hours, but I was wrong about that.

“When the documentation tells you to use a good recording setup, you really should take this seriously!”

Soon after asking the community for help I was contacted by a really nice guy from the MyCroft community called Dominik. He listened to some of my recorded files and found that their quality was heavily mixed. Some recordings were good, while others had trouble with random noise, echoes and beeping tones in the background. Personally I hadn’t noticed this because I had never listened to my own recordings at “full volume”, but machine learning (of course) learns from ALL SOUNDS, including random background noise.

“the fellowship of the ri… — uh. free german tts model”

These guys supported me …

  • with gpu compute power
  • with know-how around machine learning stuff
  • with audio optimization knowledge and tools
  • with know-how on German phonemes
  • and last but not least — nice words

Phase 1 (dataset creation): completed

After almost six months of regular recording sessions the final dataset is finished (hopefully ;-) )

  • 22,668 recorded phrases (wav files)
  • more than 23 hours of pure audio (no silence at the beginning or end)
  • sample rate: 22,050 Hz
  • mono
  • phrase length (min/avg/max): 2 / 52 / 180 chars
  • avg spoken chars per second: 14
  • sentences with question mark: 2,780
  • sentences with exclamation mark: 1,840
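
For anyone who wants to compute similar figures for their own recordings, here is a minimal sketch of how such statistics could be calculated. It is not the script used for this dataset. The function names `wav_duration_seconds` and `dataset_stats` are made up for illustration, and it assumes you already have each phrase as a pair of transcript text and audio duration. It also assumes that "sentences with question mark" simply means the text contains a "?".

```python
import wave
from statistics import mean


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_stats(entries):
    """Summarise a dataset given as (text, duration_seconds) pairs.

    Hypothetical helper: the input layout is an assumption,
    not the format of the actual dataset.
    """
    entries = list(entries)
    lengths = [len(text) for text, _ in entries]
    total_seconds = sum(duration for _, duration in entries)
    return {
        "phrases": len(entries),
        "hours": total_seconds / 3600,
        "min_chars": min(lengths),
        "avg_chars": mean(lengths),
        "max_chars": max(lengths),
        # Speaking rate: all characters divided by all audio seconds.
        "chars_per_second": sum(lengths) / total_seconds,
        # Assumption: a phrase counts if the mark appears anywhere in it.
        "questions": sum("?" in text for text, _ in entries),
        "exclamations": sum("!" in text for text, _ in entries),
    }


if __name__ == "__main__":
    # Tiny made-up example; real durations would come from wav_duration_seconds().
    demo = [("Wie geht es dir?", 1.0), ("Hallo!", 0.5), ("Das ist ein Test.", 1.5)]
    for key, value in dataset_stats(demo).items():
        print(f"{key}: {value}")
```

As a rough plausibility check of the numbers above: 22,668 phrases × 52 chars on average ÷ 14 chars per second is about 84,000 seconds, or roughly 23.4 hours, which fits the "more than 23 hours" of audio (the averages are rounded, so this is only approximate).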

Who are these “fellowship” guys I’m talking about?

Please give a round of applause for…

Is there any downside to contributing your voice?

— but what if…

Your own voice is of course a very personal thing. Can I still use voice-activated systems in the future when my voice (original recordings and model) is public? To be honest, I’m not sure about that. Will I feel good if my online banking, my car or whatever else can be unlocked with my voice in the future? Will somebody who knows me personally use my voice to make “funny” phone calls to my friends, telling them mad things I’d never say to them? Surely a slightly queasy feeling will remain.

I contribute my voice as a person who believes in a world where all people are equal, regardless of gender, sexual orientation, religion, skin color and the geocoordinates of their birthplace. A global world where everybody is warmly welcome in any place on this planet, and where open and free knowledge and education are available to everyone.

So hopefully my voice will be used in this spirit to make this world a better place for all of us :-). Please don’t use it for evil!

tl;dr: why I contributed my voice

  • I believe free models/tools for voice-based human <-> machine interaction are important
  • I love open source and wanted to make “my most personal” contribution
  • I like the idea that in the future, when I’m maybe not alive any more, a refrigerator will say that the milk is empty using my voice (on one side it is a little creepy, but hey, it’s cool too)
  • I wanted a German tts model to be freely available to all people (from little community-driven projects, through educational institutions, to commercial use without any license struggles)
  • I often heard and read things like: Why isn’t there a German model available, why hasn’t someone done this, this should be available, … So I decided to be this mysterious “someone” people seem to be waiting for.
  • Because I can :-)

We’ll hear from each other in the future :-)




