Sharing Our Common Voice — Mozilla Releases Second Largest Public Voice Data Set

Since the launch of Common Voice, we have collected hundreds of thousands of voice samples through our website and iOS app. Today, we are releasing a first version of that voice collection into the public domain.

From our beginning, Mozilla has relied on the creativity, compassion, and resourcefulness of people all over the world to help us build and promote the web as a global public resource accessible to all. This has been the foundation of our experimental work in the field of machine learning and speech recognition, and in building a large, high-quality voice data resource with Common Voice.

This collection contains nearly 400,000 recordings from 20,000 different people, resulting in around 500 hours of speech. To date it is already the second largest publicly available voice dataset that we know about, and people around the world are adding and validating new samples all the time!
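To put those figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the inputs are the rounded numbers quoted above ("nearly 400,000", "around 500 hours"), and the function name `dataset_averages` is made up for illustration, so treat the results as approximate averages rather than official statistics.

```python
def dataset_averages(recordings, speakers, hours):
    """Return (average seconds per recording, average recordings per speaker)."""
    avg_clip_seconds = hours * 3600 / recordings
    avg_clips_per_speaker = recordings / speakers
    return avg_clip_seconds, avg_clips_per_speaker


# Rounded figures from the announcement; the real counts differ slightly.
clip_seconds, clips_per_speaker = dataset_averages(
    recordings=400_000, speakers=20_000, hours=500
)
print(f"~{clip_seconds:.1f} s per recording, "
      f"~{clips_per_speaker:.0f} recordings per speaker")
```

With these rounded inputs, the average recording comes out at a few seconds long, which is consistent with contributors reading short sentences aloud.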

You can go download the data right now!

The Common Voice Download Page

Having ourselves experienced how difficult it can be to find publicly available data for our speech technology work, we also provide links to all the other large voice collections we know about on the site. And we are eager to continue growing the website as a central hub for voice data.

When we look at today’s voice ecosystem, we see many developers, makers, startups, and researchers who want to experiment with and build voice-enabled technologies. But most of us only have access to a fairly limited collection of voice data, an essential component for creating high-quality speech recognition engines. This voice data can cost upwards of tens of thousands of dollars and is insufficient in scale for creating speech recognition at a level people expect. By providing this new public dataset, we want to help overcome these barriers and make it easier to create new and better speech recognition systems (like our own Deep Speech). We’ve started with English, but we will soon support every language. With our parallel work on an open source speech-to-text engine, we hope to open up speech technology so that more people can get involved, innovate, and compete with the larger players.

Are you interested in learning about our open-source speech recognition project “Deep Speech” and how Common Voice data can be used to create better speech recognition products? Reuben Morais from Mozilla’s Machine Learning team just published an article about their “Journey to <10% Word Error Rate”. It provides a compelling summary of the challenges the team faced and the lessons they learned while working towards their first open-source speech recognition model, which has been released today on their GitHub repository!

We continue to welcome collaborators on Common Voice. Please reach out with any ideas that you have about how we can work together, to let us know how you are using the data, or to give us feedback on how this project could be more useful.

We’d like to thank Mycroft, SNIPS, Bangor University, LibriSpeech, VoxForge, TED-LIUM, Tatoeba.org, Mythic, SAP, and of course all our contributors on GitHub. We couldn’t have made this progress without you!

We are also constantly aiming to improve the quality of our dataset. Head on over to the Common Voice website now and help us verify the recordings, which is just as important as donating your voice.