State of Voice Coding — 2017
About a year and a half ago, I began to develop debilitating Repetitive Strain Injury (RSI) while working at bambuu. I’ve chronicled what has helped me and what hasn’t, but there’s one thing I only briefly mentioned in the original blog post that I feel deserves further exploration: voice coding, by which I mean using your voice as an input device to write code. I feel voice coding has huge but unrealized potential, both for allowing people who are injured to code again and for helping people who are at risk.
RSI happens when you do the same thing over and over again, in this case hammering on the keyboard. It’s obvious that people who are now unable to type could benefit from coding hands-free, but everyone might benefit from going hands-free a day or two a week. Mixing it up keeps us healthy.
With that out of the way, what I want to do here is paint, in broad strokes, how the landscape for voice coding looks.
I assume most people who know of voice coding found out about it from Tavis Rudd’s excellent talk at PyCon, where he explains his own setup using Dragon NaturallySpeaking along with a language he created himself. The talk is just below and the demo starts at 8:50. I’d suggest you watch at least a minute of the demo to see how it looks and sounds.
For those of you who prefer reading, Tavis maps one-syllable words to special characters and actions, so “slap” means space, “dash some random thing” outputs “some-random-thing” in the editor. Navigation is performed through e.g. “up 10” to jump ten lines up.
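To make the idea concrete, here is a minimal Python sketch of such a command vocabulary. Only the three spoken forms mentioned above (“slap”, “dash some random thing” and “up 10”) come from the talk as I’ve described it; the function names, the action tuples and the rule that “dash” swallows the rest of the phrase are assumptions made for illustration, not Tavis’ actual grammar.

```python
# Toy interpreter for a one-syllable command vocabulary.
# Everything except the words "slap", "dash" and "up" is hypothetical.

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def parse_count(token):
    """Turn a token such as "10" or "ten" into an integer."""
    if token.isdigit():
        return int(token)
    if token in NUMBER_WORDS:
        return NUMBER_WORDS[token]
    raise ValueError(f"not a number: {token!r}")


def interpret(utterance):
    """Translate one spoken phrase into a list of (action, value) pairs."""
    words = utterance.lower().split()
    actions = []
    i = 0
    while i < len(words):
        word = words[i]
        if word == "slap":
            # "slap" types a single space.
            actions.append(("type", " "))
            i += 1
        elif word == "dash":
            # Assumption: "dash" joins every remaining word with hyphens.
            actions.append(("type", "-".join(words[i + 1:])))
            i = len(words)
        elif word == "up" and i + 1 < len(words):
            # "up" needs the following word to know how far to move.
            actions.append(("move_up", parse_count(words[i + 1])))
            i += 2
        else:
            raise ValueError(f"unknown command: {word!r}")
    return actions


print(interpret("dash some random thing"))
print(interpret("up 10"))
```

A real setup would hand these actions to something that emulates the keyboard; here they are only returned so the mapping itself is easy to inspect.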
Mapping one-syllable words to actions seems unnecessarily limiting to me, though. We’re missing out on all the rich meaning a sentence can carry when, instead of creating sentences like “Create a new dictionary called my dictionary”, we say things like “pam snake my dictionary ak par”. It’s not even much shorter, and it’s much harder to understand. More on that later!
Tavis accomplishes his impressive setup by combining Dragon NaturallySpeaking (a speech recognition program), some hefty Emacs magic and a self-created Dragonfly grammar.
Dragonfly is an open source framework for associating actions with voice commands. A Dragonfly grammar is a way to hook into Dragon NaturallySpeaking’s speech recognition and execute code for specific commands, usually by emulating keyboard actions.
I think this is how a great many people have attempted to voice-code. At least, there’s a wide variety of personal projects on GitHub taking this approach, such as code-by-voice and dragonfly-modules. However, using someone else’s grammar is usually difficult, as they’re rarely well documented and, as Tavis’ talk shows, usually rely on arbitrary words that are hard to memorize. You could of course write your own, but if you’re starting off with voice coding to prevent RSI, it seems cruel that you’d have to program the solution yourself.
So what ways are there to get started with voice coding? There are a few, and we’ve already skimmed the surface of the first, so let’s dive in.
The Do-It-Yourself approach
The DIY approach consists of writing your own Dragonfly grammars, and modding your editor with appropriate shortcuts and extensions to make your life easier. Generally DIY projects seem to use Dragon NaturallySpeaking to capture the speech, but Dragonfly does support Windows Speech Recognition as well.
Using Dragon has a wide variety of advantages, though. It comes with built-in support for dictating documents, writing emails and navigating around in Windows. It even has support for web browsing (though the experience is sometimes so-so).
There’s a disadvantage, however: Dragon NaturallySpeaking only runs on Windows (though projects like Aenea exist that let you relay its commands from Windows to a Linux machine).
For macOS the scenery is a little different. There is a version of Dragon for macOS, Dragon Professional, but it’s supposedly less capable at things like navigating around. So while it’ll work for voice coding, you’re still worse off than when using Dragon NaturallySpeaking on Windows.
With the DIY approach you get a lot of freedom, but you also get very little out of the box unless you’re on Windows. If you’re interested in pursuing this angle, some good kickoff points are the dragonfly-modules git repository and these blog posts.
Out-of-the-box solutions
Maybe you’re not willing to do it yourself. The good news is that out-of-the-box solutions exist. The bad news is that there aren’t very many and they’re quite similar. Let me try to give you an overview.
Drop-in Dragonfly macros:
Some people who took the DIY approach have been nice enough to put their macros on GitHub. Here are a few, though they’re usually fairly tightly coupled to their author’s workflow and sparsely documented.
Caster in particular stands out from the crowd, as it seems to be a feature-rich voice coding toolkit and, perhaps more importantly, it has actual documentation. Unfortunately, it looks like it hasn’t been updated for a few years.
There’s an old project called VoiceCode by the National Research Council of Canada.
It seems to have been a full solution for coding by voice that anyone could pick up and use. Looking at the source code, it’s close to Tavis’ setup, basing itself on Dragon and a heavily customized Emacs. Unfortunately, it seems to have been abandoned over five years ago, and the documentation pages aren’t hosted anymore.
Then there’s voicecode at voicecode.io, which confusingly enough has the exact same name. This is a Mac-only system that promises not only to let you code by voice but also to increase your productivity. It uses SmartNav to replace your mouse, and under the hood it runs Dragon to convert the speech to code. You can see a demonstration here.
If you view the demonstration, you’ll notice that this is similar to the way Tavis does it, with lots of strange, arbitrary one-syllable words that map to commands. Even with these shortcomings, I think voicecode.io is currently the most feature-rich out-of-the-box voice coding experience. It’s only for Mac so far though, and it does come with a reasonably hefty $300 price tag. And that’s without Dragon or SmartNav. A full setup here will probably cost you around a thousand dollars.
Silvius is the offspring of Dragon NaturallySpeaking and Aenea. It is built on an open source speech recognition toolkit called Kaldi and works both online and on small embedded devices. Silvius works by piping the microphone output to a server, and the server responds with the sentences it recognizes. The recognized speech is then run through a grammar that produces virtual keystrokes. Looking at Silvius, you might think it’d be slow, as the audio has to take a round trip to the server, but it actually seems surprisingly snappy.
I think the strong innovation in Silvius is that it relies on a platform-agnostic speech recognition algorithm. In the end, that might allow for something that works across all platforms.
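The flow described above (audio goes to a recognizer, recognized sentences go through a grammar, the grammar emits keystrokes) can be sketched roughly as follows. This is not Silvius code and uses none of its or Kaldi’s APIs: the recognizer is a stand-in that pretends the “audio” is already text, and the grammar table and all function names are hypothetical.

```python
from typing import Callable, Iterable, List

# Hypothetical grammar: spoken word -> name of the key to press.
GRAMMAR = {"slap": "space", "up": "Up", "down": "Down"}


def fake_recognizer(audio_chunk: bytes) -> str:
    """Stand-in for the remote recognition server.

    A real server would decode audio; here the chunk is simply UTF-8 text.
    """
    return audio_chunk.decode("utf-8")


def to_keystrokes(sentence: str) -> List[str]:
    """Run one recognized sentence through the grammar, skipping unknown words."""
    return [GRAMMAR[word] for word in sentence.lower().split() if word in GRAMMAR]


def run_pipeline(
    chunks: Iterable[bytes],
    recognize: Callable[[bytes], str] = fake_recognizer,
) -> List[str]:
    """Send each chunk to the recognizer and collect the resulting keystrokes."""
    keys: List[str] = []
    for chunk in chunks:
        keys.extend(to_keystrokes(recognize(chunk)))
    return keys


print(run_pipeline([b"slap up", b"down hello"]))
```

Because the recognizer is passed in as a plain function, the same grammar stage would work whether the recognition happens on a remote server or locally, which is the platform-agnostic property I find interesting.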
Vocola is a voice command language that allows you to map voice commands to keyboard commands and other functions, in a way that’s very reminiscent of AutoHotkey. I don’t think it’s particularly well suited for code, but I think it might be very good for surrounding tasks, e.g. opening applications.
Speech Recognition Engines
I like to refer to the thing that powers the actual speech recognition as the speech recognition engine. For example, Tavis uses Dragonfly to execute the keybindings that result in his actual code, but the engine translating speech to text is Dragon NaturallySpeaking.
The general pattern is that most commercial software uses Dragon NaturallySpeaking for speech recognition, as it appears to have the best accuracy of the available options, but it also comes with a hefty price tag of up to $300.
Some notable exceptions are Dragonfly, which supports Windows Speech Recognition, and Silvius, which uses Kaldi, seemingly the only offline platform-agnostic framework at the moment. A few days ago, on November 29th, Mozilla also launched the first release of Deep Speech and Common Voice, which I hope will become a viable alternative as well.
Complementary modes of interaction
I think there’s a lot of potential in voice as an input, but I’m not sure it’ll get us all the way there. If desktops were designed for voice, I think we could probably get all the way, but as of right now we still need to interact with regular desktop programs. Most of them are GUI-based, which means we’ll need to emulate a mouse.
There are a few possible ways to do this.
- Descriptive voice commands
For example, “Click the red button called send”. It would be possible to use visual recognition to work out which on-screen element the user means. Currently I don’t think anyone is taking this approach. The closest I could find was the SpeechStart+ add-on for Dragon, which lets the user enumerate clickable elements in most programs and then click on them via voice.
- Non-hand-controlled mouse
We’ve already been introduced to SmartNav, and a mouse that’s controlled by muscles other than those in the hands is an appealing option. Current alternatives are Headmouse Nano, Camera Mouse and SmartNav.
- Eye tracking
You’d think the most natural replacement for the mouse would be eye tracking, since we usually look at what we click on. Tobii has been making strides in this area, first with the Tobii EyeX and now with the Eye Tracker 4C. However, our eyes naturally drift around, so eye tracking will never be able to achieve the pixel precision that a mouse can. Still, I think eye tracking will soon be good enough for most tasks.
(Previously I’ve co-authored an academic paper about combining speech recognition and eye tracking for coding. Contact me if you want a copy. There’s also a quick demo here.)
As we’ve seen demonstrated a few times during this blog post, it’s definitely possible to navigate inside a file and output code with voice. As I’ve written about before, for me it’s usually navigation between symbols and files that is lacking, but I think this is only because the work hasn’t been done yet. I’m hoping that we’ll get there soon.
There’s also the problem of noise. A lot of people fear that their microphones will pick up random chatter and output gobbledygook, but with a good microphone this isn’t an issue. As you can see, Tavis talks to his computer without trouble in a room full of people, and I’ve personally demonstrated voice coding in a large room with more than 30 people having multiple conversations around me.
So while outside noise disturbing you isn’t a problem, the other way around could be. Talking to your computer might become a problem depending on your co-workers, office situation etc.
Anecdotally I’ve worked in an open office plan using voice coding for a few days and when I asked my co-workers, nobody said that it bothered them.
Even though we’ve made some strides in editing and navigating code — being a developer is so much more than our editor (unless you use Emacs, then it’s basically the same thing).
We use web browsers, terminals, file systems etc. If you’re on Windows you get support for a lot of the normal customer facing applications from Dragon, but if you’re not on Windows you’re out of luck. You can build support for web browsing through browser extensions but for most applications you’re stuck emulating keyboard input. For some applications that have good shortcut support, this is tolerable. For others, not having a mouse is very difficult.
Coding with voice isn’t real-time. When using a keyboard you get feedback after every keystroke, but voice coding usually consists of a sequence of words followed by a pause, after which the commands are executed.
The feedback only comes at the very end of the command sequence. This has some disadvantages: you can’t see whether you’ve done the right thing until the very end of the command. Unfortunately, this feedback delay is going to be hard to eliminate entirely.
Keystrokes are stand-alone in a way that voice commands are not. Imagine the voice command “up ten” — meant to move the caret ten lines up. Giving feedback before the end of this command is impossible, as the word said after “up” has an effect on what “up” means.
So while continuous real-time feedback is probably not possible, perhaps in the future we’ll see a hybrid approach that gives more continuous feedback during a command sequence, wherever it’s possible to detect that the following words can’t change the meaning of the previous ones. It’ll never be as real-time as it is with a keyboard, though.
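The waiting problem can be sketched in a few lines of Python. This is a toy model, not how any of the tools above work: the command names and the split between words that stand alone and words that need a number are assumptions made for the example. The parser is fed one word at a time and returns an action as soon as that action can no longer be changed by what comes next.

```python
# Toy model of incremental feedback for voice commands.
# The two vocabularies below are hypothetical.
NEEDS_NUMBER = {"up", "down"}   # must wait for a line count
STANDALONE = {"slap"}           # safe to execute immediately


class IncrementalParser:
    def __init__(self):
        # Command word that is still waiting for its number, if any.
        self.pending = None

    def feed(self, word):
        """Return an (action, argument) pair once it is final, else None."""
        if self.pending is not None:
            # The previous word needed a number; this word completes it.
            command, self.pending = self.pending, None
            return (command, int(word))
        if word in NEEDS_NUMBER:
            # No feedback is possible yet: the next word changes the meaning.
            self.pending = word
            return None
        if word in STANDALONE:
            return (word, None)
        raise ValueError(f"unknown command: {word!r}")


parser = IncrementalParser()
for spoken in ["slap", "up", "10"]:
    print(spoken, "->", parser.feed(spoken))
```

Feeding it “slap”, “up”, “10” gives an action straight away for “slap”, nothing for “up”, and the move only once “10” has arrived, which is exactly the delay a keyboard doesn’t have.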
It strikes me though, that no matter what approach you pick from this list, we’re still just emulating keyboards with our voices. Saying things like “ak bar pam slap” is nonsensical, and using the voice in a monotone way like that can even cause voice strain. We already have perfectly good spoken languages, and it seems insane to me that we’ll have to invent another one, just so we can use our voices as a keyboard.
I think voice coding is going to have its breakthrough soon, but I think it’ll come from parsing natural language. It’ll be me saying “Rename that dictionary above to my new dictionary”, rather than me uttering what primarily sounds like an incantation for black magic just to output some brackets. A little work has been done on this, but so far it has been primarily academic, with little real-world usage.
Hi. I’m Gustav. I blog about RSI and Voice Coding. If you see anything I’ve missed, feel free to write me at firstname.lastname@example.org or tweet at me. If you’re interested in getting the occasional news about RSI prevention and coding via voice, sign up for my newsletter.
This article was written with feedback from Tavis Rudd, Christian Brevik, Desi Mazdur and David King. I couldn’t have done it without them.