The state of voice: a how-to guide to Speech-to-Text software

8 min read · Nov 5, 2019

In the last few years I’ve seen more and more apps and services experimenting with using voice interaction, spearheaded by the likes of Amazon’s Alexa and Google’s Assistant. Soon, your voice could be as powerful as the buttons you press or the words you type.

“Speech-to-Text” software covers anything where speech goes in, is analysed, and converted into text. In the Lab at Citizens Advice we want to investigate how Speech-to-Text software could help the work that we do. Can we take advantage of its potential benefits?

Can we save time for our advisors in writing up notes, if a machine can take notes during an advice session?
Does this kind of software allow us to support a more diverse set of clients, by offering more ways to use their voice to get help from our service?
Can this software make it easier to input information into, and get answers from, our systems, allowing a more diverse group of volunteers and staff to join us?
Does this allow us to use more and smarter tools to analyse the things people ask us about so we can continue improving the support we offer?

Before diving into the project, like any researcher, I went looking for people who had done this before, to see what they had learned. In this blog post I want to share some of the things I learned along the way as we’ve begun clearing the weeds at the edge of the Speech-to-Text forest.

Start lo-fi to see if it’s valuable
Train the software
Train yourselves
Start small
Work out what success looks like, and measure it
Get your setup right
Consider security
Integrate the software into your team

You’re now entering the Speech-to-Text forest. Gif of someone moving through a forest in a computer simulation. Gif originally by Quasar Wei

Start lo-fi to see if it’s valuable

One of the first things I learned was that you don’t have to use the software in your first attempt to work out if there’s any value in using it. Instead, you can run what’s called a “Wizard of Oz” experiment. If you’ve watched the film or musical, you’ll know that the apparently magical Oz is really all just an illusion (if you haven’t, sorry to spoil it).

Researchers use this same idea to test new technology, by ‘faking’ the software bit with humans. With Speech-to-Text software, this basically means finding some people who can emulate what the system can do: to listen, understand what’s being said and type it out as a transcript. The user gets the same experience (they say things, they get written down and acted on), so they might not even know the difference!

Considering some of the hoops you might have to jump through to find, procure, pay for, and set up a new piece of software, this can be a very useful technique to test the idea quickly and cheaply.

If you want to learn more about that, this Wikipedia page could be a good start.

Think of the simplest way to get started! Picture of someone talking into a tin can and string phone. Image from Pixabay.

Train it

The accuracy of the transcripts this software produces varies hugely, but it can be improved by training. If, like us, you’re in a field that uses technical terms (like advice on debt or immigration in our case), the software might not pick up every word on the first try. Terms like “joint and several liability”, for example, might not get picked up immediately.

If you, or the people you work with, have accents and dialects that are different from those who made the software, you might also need to get the software to learn to pick them up. The BBC is already working on an ‘Alexa for British accents’, for example. Software will work out-of-the-box in some cases — but to be accurate enough when you need it, it takes time and resource to train it in the areas you need.
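One rough way to see which specialist terms need training is to compare a human-written transcript of a test recording with what the software produced. The sketch below is only an illustration: the function names, the glossary and the example sentences are all made up for this post, and it assumes you already have both versions of the transcript as plain text.

```python
import re


def normalise(text: str) -> str:
    """Lowercase the text and keep only words, so phrases compare cleanly."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def contains_phrase(text: str, phrase: str) -> bool:
    """True if the phrase appears in the text as whole words."""
    return f" {phrase} " in f" {text} "


def missed_terms(reference: str, transcript: str, glossary: list[str]) -> list[str]:
    """Return glossary terms that were spoken (they appear in the human
    reference) but are missing from the software's transcript."""
    ref, hyp = normalise(reference), normalise(transcript)
    missed = []
    for term in glossary:
        phrase = normalise(term)
        if contains_phrase(ref, phrase) and not contains_phrase(hyp, phrase):
            missed.append(term)
    return missed


# Hypothetical example data, not taken from a real advice session.
glossary = ["joint and several liability", "priority debt"]
reference = "You both have joint and several liability for this priority debt."
transcript = "You both have joint and several viability for this priority debt."

print(missed_terms(reference, transcript, glossary))
```

Terms that keep showing up in this list across several recordings are probably the ones worth adding to whatever custom vocabulary or training feature your chosen software offers.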

Train yourselves

As well as training the system, you’ll need to get your users familiar with it too so they can get the most benefit out of using it. There are a few different elements to think about if you’re setting aside time for people to try out the software:

Try out different ways of talking (louder or quieter, faster or slower).
Work out how the user is going to interact with it. Will they be watching the transcription as it’s being made (like subtitles), and if so how will that affect the way they talk to people?
Get familiar with talking for a microphone to pick up, and work out what things might stop the system from hearing you properly (drinking water, covering your mouth, not looking the right way).
If using it for phone calls, explore how call quality might affect how well the software works.
Confidence comes with time, and practice makes perfect.
Set up routines for checking and editing the transcript afterwards.
Learn, or add in, the keyboard shortcuts for punctuation, especially where you’re talking to a machine to make a short summary for others to read.

If Sesame Street taught us anything, it’s that you have to keep trying and working as a team to get the best result. Gif of Bert and Ernie, where Bert is struggling to carry some weights. Gif from Giphy.

Start small

It’s better to start with a small group of people. There could be a lot of frustration at the beginning, because of low accuracy rates, getting used to new systems and trying (and sometimes failing) with different ways of doing things. Before inviting your first group of users to try it out, you want to make sure they’ve got the patience and support from you to set the foundation for everything else to come.

Work out what success looks like, and measure it

You should also think about what you want to achieve and how you’ll measure if you achieved it. You’ve made this change for a reason, whether it’s to make things faster or to make your staff happier. To give you the best chance of reaching that goal, it’s useful to collect some simple statistics that can tell you how close you are to achieving it. If things are going well, then that’s great: keep going, and maybe start inviting more people to use the software. If things aren’t going so well, you can change course to see if a different approach can get you the results you’re looking for. If you don’t have this evidence available, it’s much harder to know if the technology is working for you, and to decide whether to keep going or stop it in its tracks.

From the various case studies I read, some of the metrics that people use include:

The time it takes to write a casenote, from when you start writing to when you finish writing.
The time it takes from the beginning of the conversation or appointment start to the note being finalised.
The accuracy of the note (are all the words transcribed, are they transcribed correctly).
The quality of actions afterwards.
The level of comfort and confidence users have with the system, and how much they want to continue using it.
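For the accuracy metric, a common way to put a number on it is the word error rate: the number of words you would have to substitute, delete or insert to turn the software's transcript into a correct one, divided by the number of words in the correct one. None of the case studies I read prescribed this exact code; it's a minimal sketch that assumes you have a human-checked reference transcript to compare against, and it does no clean-up of punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts,
    divided by the number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # word missing from the transcript
                curr[j - 1] + 1,     # extra word in the transcript
                prev[j - 1] + cost,  # match, or a wrongly transcribed word
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word ("a") and one wrong word ("dept").
wer = word_error_rate(
    "the client has a priority debt",
    "the client has priority dept",
)
print(f"{wer:.0%} of the words needed fixing")
```

A lower number is better, and tracking it week by week on the same kind of recordings gives you a simple way to see whether training is paying off.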

The key is to catch things early. As well as collecting some data, it’s worth scheduling some regular reviews to look at how the whole system is performing. Taking just thirty minutes every week to understand the change you’re making, why it’s working well (or not so well) and think of ways to improve the way it works could help you make the most of your investment.

Get the setup right

So you’ve set up the software, the people are trained: now to make sure the rooms you’re speaking in are ready for this. You might want to think about:

Wifi — many of these systems work over the internet, so having a strong, dependable Wifi connection is often important.
Hardware #1: Sound Cards — the laptop or desktop you’re using will have a sound card inside it, which turns the audio it hears from you into code that it can send to the software. Some older machines, with a lower quality sound card, might not be up to scratch for this.
Hardware #2: Processing Power — some older and less powerful machines might not have the capacity you need to do this well.
Dedicated microphone — a dedicated microphone can increase the quality of the audio being captured (especially if it’s a conversation between a few people).
Silence, please — make sure you’re in a relatively quiet space. If you can hear other people, the software probably can too.
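If you want something more objective than "it sounds quiet enough", you could record a few seconds of the empty room and a few seconds of someone talking, then compare how loud each recording is. This is a rough sketch using only Python's standard library. It assumes 16-bit PCM WAV recordings, and the function name and file names are made up for illustration.

```python
import array
import math
import sys
import wave


def level_dbfs(wav_file) -> float:
    """Return the average (RMS) level of a 16-bit PCM WAV recording in dBFS.

    0 dBFS is the loudest level the file can hold; quieter recordings
    give more negative numbers. Accepts a file path or a file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h", frames)  # signed 16-bit samples
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is stored little-endian
    if not samples:
        raise ValueError("recording is empty")

    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms / 32768)


# Hypothetical usage, with file names you would replace with your own:
# room = level_dbfs("empty_room.wav")
# speech = level_dbfs("advisor_talking.wav")
# print(f"Speech is {speech - room:.1f} dB louder than the room")
```

The bigger the gap between the speech recording and the empty-room recording, the easier a time the software is likely to have. What counts as a big enough gap will depend on your software, so treat the numbers as a way to compare rooms and microphones rather than as a pass or fail test.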

Make sure your office setup is suited to Speech-to-Text! Gif of someone in an office doing a forward roll onto a sofa and kicking a lamp over. Gif from Giphy.

Consider security

Some of the software we looked at relied on a web connection, while some of it could be installed onto your computer and work offline, without a connection. Whatever you choose, there’s a trade-off between accuracy and privacy. Connected software can make use of a network to transcribe what people say more accurately, but sharing information beyond your local network introduces risks that might not be acceptable. It’s important to choose the trade-off that works best for the content that will be processed by the software.

Integrate into your team

Think of all the things that make human teams successful (in particular, strong relationships) and use them to make sure you’ve set everything up so the software can work at its best. That includes things like:

Setting clear roles and responsibilities.
Making those public, so everyone in the team knows what to expect from the software and what they need to give it for it to be successful.
Being able to talk to or receive communication from other team mates.
Knowing its strengths and weaknesses, and building a team culture that reduces the risk of things going wrong and increases the chances of things going well.

Final thoughts

So that’s it! Your beginner’s guide to Speech-to-Text software. That was a lot to take in, right? Take a deep breath. Here’s a cat gif to relax you before the conclusion…

You made it! Gif of a cat flying gracefully through the air with a blue sky behind it. Gif from Giphy

It’s an exciting prospect. Imagine using the specialist skills of computers (listening to audio and very quickly turning it into text, which can send requests to other software or be read by humans) and the specialist skills of humans (empathy, solving complex problems) to do things faster and better than we ever have before!

It will take time, and patience, and tinkering, and creativity to get it right for your situation. It might not always work. In some case studies I read (like this one), the humans aided by machines performed worse than the humans that weren’t. But that’s fine: it’s not a silver bullet that solves every problem magically.

The key is finding the thing that will give you the most value, and then constantly improving how it works so you get the most out of it. If you start small and nimble, and it doesn’t work, you can just throw it away without much investment lost.

The team and I will be back soon with some of the results of our adventure into Speech-to-Text Software. Hopefully I can cut back a bit more of the forest for you so you can join us on the exploration!

If you want to read more about how different organisations have used Speech-to-Text software, you might want to look here:

Some NHS Trusts are using it to speed up communication.
East Renfrewshire Council have used it to speed up communication.
The Australian Institute of Health Innovation used it for writing Electronic Health Records.
The BBC are using it to make it easier to search their archives.
And The RNIB has a really good guide on how this software can help people with sight loss.