I believe everyone has tried a Speech To Text (STT) system at least once in their lifetime, and most of us failed miserably, right?
However, how many of us have ever wondered how this amazing piece of technology actually works under the hood? Moreover, why doesn't it work properly with Indian accents? Let's dive deeper today!
How does a Speech To Text system work?
Typically, most STT systems record audio on the local device and send it to the cloud, where machine learning models analyze the audio and generate text. The text is sent back to the local device, where it can be used for many tasks, from asking Google Assistant a question to playing your favorite PewDiePie video (and don't forget to subscribe to PewDiePie). STT APIs are provided by Google Cloud, Azure, AWS, IBM Watson, Speechmatics, and others.
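The round trip above can be sketched in a few lines of Python. Everything here is a mock: the function names, the fake audio, and the canned transcript are placeholders for illustration, not any provider's real API.

```python
# A minimal sketch of the typical cloud STT round trip:
# record locally -> send to the cloud -> get text back.

def record_audio_from_device() -> bytes:
    """Pretend to capture raw PCM audio from the local microphone."""
    return b"\x00\x01" * 1600  # placeholder: ~0.1 s of fake 16 kHz audio

def transcribe_on_cloud(audio: bytes) -> str:
    """Stand-in for the cloud ML model that turns audio into text."""
    return "play my favorite video"  # a real API would return a transcript

def speech_to_text() -> str:
    audio = record_audio_from_device()  # 1. record on the local device
    text = transcribe_on_cloud(audio)   # 2. audio goes to the cloud, text comes back
    return text                         # 3. the device acts on the text

print(speech_to_text())  # → play my favorite video
```

With a real provider, `transcribe_on_cloud` would be an authenticated network call, which is exactly why the audio leaves your device in the first place.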
Then why Project Common Voice?
How the machine learning model itself works is out of scope for this article, so we won't discuss it here. But you should know that the audio sent from your device to the cloud stays there forever, and it is used to make the STT system more accurate.
- Privacy
Most users don't mind this, since they're getting a free service, but there is plenty of evidence of STT systems recording audio even when they're not in use. If you're a company that cares about your customers' privacy, you'd definitely want to avoid that.
- Internet Connectivity
Even though we're living in an era where 5G technology is knocking on our doors, 52% of the world's population still doesn't have access to the internet, mainly due to a lack of infrastructure. So there should be a way to use this cutting-edge technology offline as well.
Project Common Voice
Common Voice is a project by Mozilla to help machines learn how real humans speak, across variations in accent and language.
Mozilla is already working on another project called “DeepSpeech”, where we're creating an open-source Speech-To-Text engine; more importantly, this engine works offline as well.
There's a roadblock, though: training a Speech-To-Text engine requires a huge amount of voice data, along with its text annotations. Even if we're just talking about English, there are 160 distinct dialects of English throughout the world. That's where Common Voice comes into the picture: anyone can donate their voice or simply validate short audio clips (1–3 seconds). To make it accessible to everyone, we've launched 35 languages on the Common Voice portal as of now.
How can you contribute?
- Create an account at https://voice.mozilla.org
- You should see two options in the middle of the screen: Speak and Listen.
- Speak: If you'd like to donate your voice (one to two sentences per clip), just click the microphone icon and read the given text aloud (don't forget to grant microphone permission, if prompted).
- Listen: Listen to the audio by clicking the play icon and validate whether it matches the given text. Click ‘Yes’ if it's correct, ‘No’ otherwise. While listening, also check that there isn't much background noise and that the words are pronounced correctly.
You just made a contribution to open source; pat yourself on the back. All of the recorded voice data is publicly available for free. If you're a developer who'd like to use this data, just go to this link and download the voice data for your own speech recognition projects. Don't forget to share what you build with the community 😉.
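If you're curious what working with the downloaded data looks like: the release ships audio clips alongside TSV metadata files (e.g. validated.tsv) that pair each clip with its transcript and validation votes. The sketch below parses made-up rows in that shape; the column subset and filenames here are illustrative assumptions, not the exact schema of a given release.

```python
import csv
import io

# Made-up sample in the shape of a Common Voice metadata TSV:
# one row per clip, tab-separated, with transcript and vote counts.
sample_tsv = """path\tsentence\tup_votes\tdown_votes
clip_0001.mp3\tHello world\t3\t0
clip_0002.mp3\tDonate your voice\t2\t1
"""

def load_clips(tsv_text):
    """Yield (audio filename, transcript) pairs for clips the community approved."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Keep clips with more 'Yes' validations than 'No'.
        if int(row["up_votes"]) > int(row["down_votes"]):
            yield row["path"], row["sentence"]

pairs = list(load_clips(sample_tsv))
print(pairs)
# → [('clip_0001.mp3', 'Hello world'), ('clip_0002.mp3', 'Donate your voice')]
```

From here, each `(clip, transcript)` pair is exactly the kind of training example a Speech-To-Text engine like DeepSpeech needs.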
Common Voice Night
We at Mozilla Gujarat frequently host night-long meetups where we contribute to Common Voice for 8 hours straight. Isn't that amazing?
REPS EVENT: https://reps.mozilla.org/e/mozsquad-planning/
DATE & TIME:
August 17, 2019, 22:00 to August 18, 2019, 06:00
VENUE: Alka Society, Vadodara
Apart from open-source contributions at Mozilla, I'm a Microsoft Student Partner and a community member at GDG Baroda. I'd like to thank Mozilla and the MozillaIN community for giving me the chance and the resources to learn about VR/AR and Open Source.
This is me, Pratik Parmar signing off till the next tech adventure. Over and Out…