Sonosco — Deep Speech Recognition Framework

Wiktor Jurasz
Sep 30, 2019 · 5 min read

Train your own speech recognition model…

…and apply it to a real-world robot!

Have you ever heard of Roboy? He is an anthropomimetic robot (a robot whose body closely mimics the human body) and one of the most fascinating research projects at the Technical University of Munich.


As part of Roboy’s lab course, our team of three spent the last six months developing a framework that not only trains top-notch speech recognition models but also makes it easy to compare them with each other and to apply ASR (automatic speech recognition) models in a robotics system with ROS.

High-Level Design of Sonosco

This article is about the Sonosco framework, which can be found in this GitHub repository.


What is the most important thing when training any deep speech recognition model or any other statistical model for that matter? It definitely is the data!

So the first thing you need to do is collect data, either by combining different publicly available datasets or by recording it yourself. (Yes, you heard that right… and you know what? We help you with that!)

With Sonosco you can download and catalogue (i.e. create manifest files for) many publicly available datasets with a single function call. You can also use our web app to acquire more data yourself (more on that later).

If you already have data prepared (i.e. audio files and their respective transcripts), Sonosco helps you prepare it for training; check our documentation to see how this is done.

Start training locally…

Now comes the fun part! For model training, Sonosco comes with predefined PyTorch modules for Listen Attend Spell (LAS), DeepSpeech2, and a sequence-to-sequence model with time-depth-separable convolutions.

The best part about Sonosco is that you can train your model in less than 20 lines of code! (Yes, you read that correctly.)

For this, we created a nice and extensible training process.

Analysis Object Model for the training process

The training process consists of an experiment, which keeps track of provenance, and a model trainer, which takes care of the whole training and validation process. The model trainer also computes metrics and applies callbacks, i.e. arbitrary code such as learning rate reduction or TensorBoard logging.

Here you see a code snippet on how to train the LAS model in less than 20 lines of code:
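To give a feel for the shape of that setup, here is a heavily simplified, stdlib-only stand-in for the experiment/trainer pattern described above (the class names, methods, and callback signature are illustrative assumptions, not Sonosco's actual API):

```python
class Experiment:
    """Tracks provenance: the config and metric history of one training run."""
    def __init__(self, name, config):
        self.name, self.config, self.history = name, config, []

class ModelTrainer:
    """Drives training/validation, computes metrics, fires callbacks."""
    def __init__(self, experiment, callbacks=()):
        self.experiment, self.callbacks = experiment, list(callbacks)

    def start_training(self, epochs):
        for epoch in range(epochs):
            # A real trainer would run forward/backward passes here;
            # we just record a dummy, decreasing loss per epoch.
            metrics = {"epoch": epoch, "loss": 1.0 / (epoch + 1)}
            self.experiment.history.append(metrics)
            for cb in self.callbacks:  # e.g. LR reduction, TensorBoard logging
                cb(self, metrics)

# "Train" in a few lines, mirroring the setup described in the article
logged = []
experiment = Experiment("las-demo", {"lr": 3e-4})
trainer = ModelTrainer(experiment,
                       callbacks=[lambda t, m: logged.append(m["loss"])])
trainer.start_training(epochs=3)
```

The point of the pattern is that everything about a run (config, metrics, callbacks) hangs off one experiment object, which is what makes provenance tracking and serialization possible.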

Full example in Jupyter Notebook here:

Maybe you noticed the serializer and deserializer? This is an awesome feature that lets you save the complete state of your training and…

…continue training on the cloud

or on any other computer, from exactly the point where it was stopped. Our serializer is capable of saving not only the model weights (as PyTorch does) but the whole model trainer with all its parameters. This allows you to start training locally, see how it goes, and then continue training on powerful machines in the cloud, and later load the serialized model into our inference pipeline.

If you want to try it out yourself, see the documentation.


Sonosco makes inference much easier by providing WebSocket and ROS integration. The WebSocket interface was created with Flask-SocketIO, and you can check out the implementation here. On top of it, we’ve built a frontend with Vue.js which allows you to compare models, as shown below.
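Conceptually, a Flask-SocketIO server maps named events to handler functions. This stdlib-only sketch mimics that registration pattern (the event name and the transcribe handler are invented for illustration; the real server runs actual ASR models):

```python
class EventBus:
    """Tiny stand-in for the Flask-SocketIO '@socketio.on(event)' pattern."""
    def __init__(self):
        self.handlers = {}

    def on(self, event):
        def register(fn):
            self.handlers[event] = fn
            return fn
        return register

    def emit(self, event, payload):
        return self.handlers[event](payload)

bus = EventBus()

@bus.on("transcribe")
def handle_transcribe(payload):
    # A real handler would run each selected ASR model on the audio;
    # here we just echo a placeholder transcript per requested model.
    return {model: f"<transcript of {payload['audio']}>"
            for model in payload["models"]}

result = bus.emit("transcribe", {"audio": "clip.wav", "models": ["las", "ds2"]})
```

Returning one transcript per model is what lets a frontend display several models' outputs side by side for comparison.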

Sonosco Transcription Server

You can find out in our quick start tutorial how to start the transcription server in a Docker container. However, for running it locally, you should:

1. Clone the repository

git clone

2. Install Sonosco and all its dependencies by running

pip install -e .

3. Build the frontend

# Change working directory
cd server/frontend
# Install node-modules
npm install
# Build frontend
npm run build

Inside the server directory, you should now see a dist folder containing the bundled Vue.js code. After this, you should be able to run the Flask backend from inside the server directory; it will automatically serve the newly built frontend on localhost:5000.

Of course, our frontend is just one of many applications that can be built on top of these REST and WebSocket interfaces. Beyond this, we also offer ROS integration (in case you’re working on a robotics project such as Roboy).

A few words about our transcription server...

Remember we said something about model comparison and collecting your own data? For this, Sonosco comes with a super intuitive transcription server that lets you compare your freshly trained models with each other.

You can transcribe your own voice and compare the transcriptions; when you correct a transcription and click “improve”, the audio with its transcription is saved to your computer, so you can easily collect your own test data.

Models, models, models…

Most of the GitHub repositories we’ve encountered either had no trained models that could be readily downloaded and used, or their models were trained on only one of the publicly available datasets.

We wanted to change this situation and make ASR more easily available to the open-source community. With this in mind, we took many different publicly available datasets and combined them into ~3000 hours of speech and text. This dataset was then used to train DeepSpeech2 and LAS, which we then released here! Yes, you heard that right: these models are freely available for your own use. So don’t wait, grab them while they’re still there. We are most proud of the LAS model, which achieved a CER (character error rate) of 5.3 on our validation dataset.
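For reference, CER is the character-level Levenshtein distance between hypothesis and reference, divided by the reference length; expressed as a percentage, a CER of 5.3 means roughly 5 character edits per 100 reference characters. A small stdlib implementation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edits / reference length, in %."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return 100.0 * prev[-1] / len(reference)
```

For example, transcribing “hello” as “hella” is one substitution over five reference characters, i.e. a CER of 20%.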

These models are automatically downloaded and made available on our Transcription server when you follow these steps.

Check out our documentation to find out more about this project, and the code itself if you’re really interested. If you have any questions, ask them in the comments or open an issue in the Sonosco repository.

For the future, we envision a general framework for speech that includes not only speech recognition but also speech synthesis.

Dealing with speech in machine learning is currently one of the most exciting topics, and we believe that Roboy will be a big part of it!

Thanks to the whole Roboy Team, especially Vagram Aiirian and Rafael Hostettler, for enabling us to work on this topic.

The Authors:
Yuriy Arabskyy, Wiktor Jurasz and Florian Lay

