Train your own speech recognition model…
…and apply it to a real-world robot!
Have you ever heard of Roboy? He is an anthropomimetic robot (a robot built to mimic the human body) and one of the most fascinating research projects at the Technical University of Munich.
Over the last six months, as part of Roboy's lab course, our team of three developed a framework that not only trains top-notch speech recognition models but also lets you compare them against each other and deploy ASR (automatic speech recognition) models in a robotics system with ROS.
This article is about the Sonosco framework, which can be found in this GitHub repository.
What is the most important thing when training any deep speech recognition model or any other statistical model for that matter? It definitely is the data!
So the first thing you need to do is collect data by either combining different publicly available datasets or recording it yourself. (Yes, you heard that right…. And you know what? We help you with that!)
With Sonosco you can download and catalogue (i.e. create manifest files) many publicly available data sets with one function call. You can also use our web app to acquire more data yourself (more on that later).
If you already have data prepared (i.e. audio files and their respective transcripts), Sonosco helps you prepare it for training; check our documentation to see how this is done.
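To make the manifest idea concrete, here is a minimal, self-contained sketch of what a manifest boils down to: a listing that pairs every audio file with its transcript. The function name and file layout here are illustrative assumptions, not Sonosco's actual API or format.

```python
import csv
from pathlib import Path

def build_manifest(data_dir: str, manifest_path: str) -> int:
    """Pair every .wav file with its same-named .txt transcript and
    write one 'audio_path,transcript_path' row per sample.
    Returns the number of samples written."""
    rows = []
    for wav in sorted(Path(data_dir).rglob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():  # skip audio files without a transcript
            rows.append((str(wav), str(txt)))
    with open(manifest_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

A training pipeline can then iterate over the manifest instead of scanning directories on every run.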
Start training locally….
Now comes the fun part! For model training, Sonosco comes with predefined PyTorch modules for Listen Attend Spell (LAS), DeepSpeech2, and a sequence-to-sequence model with time-depth-separable convolutions.
The best part about Sonosco is that you can train your model in less than 20 lines of code! (Yes, you heard that correctly.)
For this, we created a nice and extensible training process.
The training process consists of an experiment, which lets you keep track of provenance, and a model trainer, which takes care of the whole training and validation process. The model trainer also computes metrics and applies callbacks, i.e. arbitrary code such as learning-rate reduction or TensorBoard logging.
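To make the callback idea concrete, here is a minimal sketch in plain Python. The names (`ModelTrainer`, `reduce_lr`) and the trivial loss are illustrative assumptions, not Sonosco's actual classes; the point is that the trainer hands control to arbitrary callbacks after every epoch.

```python
from typing import Callable, List

class ModelTrainer:
    """Minimal trainer skeleton: runs epochs and invokes
    registered callbacks after each one."""
    def __init__(self, lr: float, callbacks: List[Callable] = None):
        self.lr = lr
        self.callbacks = callbacks or []
        self.history = []  # per-epoch validation loss

    def train(self, epochs: int):
        for epoch in range(epochs):
            val_loss = 1.0 / (epoch + 1)  # stand-in for real training
            self.history.append(val_loss)
            for cb in self.callbacks:  # e.g. LR schedule, logging
                cb(self, epoch, val_loss)

def reduce_lr(trainer, epoch, val_loss, factor=0.5, patience=2):
    """Example callback: halve the learning rate every `patience` epochs."""
    if (epoch + 1) % patience == 0:
        trainer.lr *= factor
```

Because callbacks receive the trainer itself, they can mutate any part of the training state, which is what makes the mechanism so flexible.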
Here you see a code snippet on how to train the LAS model in less than 20 lines of code:
Full example in Jupyter Notebook here: https://github.com/Roboy/sonosco/blob/master/demo/demo.ipynb
Maybe you noticed the serializer and deserializer?! This is an awesome feature that lets you save the complete state of your training and …
…continue training on the cloud
or on any other computer, from exactly the point where it stopped. Our serializer can save not only the model weights (as PyTorch does) but the whole model trainer with all its parameters. This allows you to start the training locally, see how it goes, continue training on powerful machines in the cloud, and later load the serialized model into our inference pipeline.
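The underlying idea can be sketched with a plain-Python stand-in (Sonosco's own serializer additionally handles PyTorch modules): persist the entire trainer state, not just the weights, so training can resume anywhere. All names here are hypothetical.

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class TrainerState:
    """Everything needed to resume training on another machine:
    weights *and* optimizer/loop state."""
    weights: dict
    lr: float
    epoch: int
    history: list = field(default_factory=list)

def save_trainer(state: TrainerState, path: str):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_trainer(path: str) -> TrainerState:
    with open(path, "rb") as f:
        return pickle.load(f)
```

Saving only `weights` would force a resumed run to restart its learning-rate schedule and epoch counter from scratch; serializing the full state avoids that.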
If you want to try it out yourself, see the documentation.
Sonosco makes inference much easier by providing WebSocket and ROS integration. The WebSocket interface was created with Flask-SocketIO, and you can check out the implementation here. On top of it, we built a frontend with Vue.js which allows you to compare models, as shown below.
Our quick start tutorial shows how to start the transcription server in a Docker container. To run it locally instead, you should:
1. Clone the repo
git clone https://github.com/Roboy/sonosco.git
2. Install Sonosco and all its dependencies by running
pip install -e .
3. Build the frontend
# Change working directory
cd server/frontend

# Install node modules
npm install

# Build the frontend
npm run build
Inside the server directory, you should now see a dist folder containing the bundled Vue.js code. After this, you should be able to run the Flask backend by simply executing
inside of the server directory. It will automatically serve the newly built frontend on localhost:5000.
Of course, our frontend is just one of many applications that can be built on top of these REST and WebSocket interfaces. Beyond this, we also offer ROS integration (in case you’re working on a robotics project such as Roboy).
A few words about our transcription server...
Remember we said something about model comparison and collecting your own data? For this, Sonosco comes with a super intuitive transcription server that lets you compare your freshly trained models against each other.
You can transcribe your own voice and compare the transcriptions. When you correct a transcription and click on “improve”, the audio together with its corrected transcription is saved to your computer, so you can easily collect your own test data.
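Saving such a pair is conceptually simple: store the audio clip and its corrected transcript under a shared name so the pair can later feed a training manifest. The sketch below is a hypothetical illustration, not the transcription server's actual storage code.

```python
import time
from pathlib import Path

def save_sample(audio_bytes: bytes, transcript: str, out_dir: str) -> str:
    """Store an audio clip and its corrected transcript under a shared
    filename stem, so the pair can later be catalogued in a manifest."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = f"sample_{int(time.time() * 1000)}"  # unique-enough name
    (out / f"{stem}.wav").write_bytes(audio_bytes)
    (out / f"{stem}.txt").write_text(transcript, encoding="utf-8")
    return stem
```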
Model, models, models, models…
Most of the GitHub repositories we encountered either had no trained models that could be readily downloaded and used, or their models had been trained on only one of the publicly available datasets.
We wanted to change this situation and make ASR more easily available to the open-source community. With this in mind, we took many different publicly available datasets and combined them into ~3000 hours of speech and text. This dataset was then used to train DeepSpeech2 and LAS, which we then released here! Yes, you heard me right, these models are available for your own use for free. So don’t wait and grab them as fast as you can while they are still there. We are most proud of the LAS model, which achieved a character error rate (CER) of 5.3 on our validation dataset.
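For reference, CER is the character-level edit (Levenshtein) distance between the hypothesis and the reference transcript, divided by the reference length. A small, self-contained way to compute it:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    r, h = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between r[:i-1] and h[:j]
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[len(h)] / max(len(r), 1)
```

For example, `cer("hello", "hxllo")` is 0.2: one substitution over five reference characters.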
These models are automatically downloaded and made available on our Transcription server when you follow these steps.
Check out our documentation to find out more about this project, and if you’re really interested, check out the code itself. If you have any questions, ask them in the comments or open an issue in the Sonosco repository.
For the future, we envision a general framework for speech, that not only includes speech recognition, but also speech synthesis.
Dealing with speech in machine learning is currently one of the most exciting topics, and we believe that Roboy will be a big part of it!
Thanks to the whole Roboy Team, especially Vagram Aiirian and Rafael Hostettler, for enabling us to work on this topic.