What is Speech to Text Software?

Published in takenote, Mar 27, 2019

Speech to text software bills itself as the catch-all solution to transcription services — delivering the cheap, easy and fast transcript of your dreams. But, is this true? What is ‘speech to text’ software anyway?

In a nutshell, speech to text software, or automatic speech recognition (ASR) software, is a computer program that uses linguistic algorithms to analyse audio signals and convert them into words, stored as Unicode text. In plain language, speech to text software ‘listens’ to audio and delivers an editable, verbatim transcript.

There are a large number of automatic transcription service providers online. Most offer prices that look very attractive to anyone familiar with human transcription services: around £0.10 per minute of recorded audio on average, and some are even free. Most promise accuracy rates of around 90%-95%. These figures, however, apply only to ‘clean’ recordings, which is critical to understand when assessing whether ASR software can meet your transcription needs.

Before you get overly excited and ditch your allocated transcription budget in favour of speech to text software, it is worth getting acquainted with this new technology. Here is a quick rundown regarding the truth about speech to text software, and how it stacks up against traditional human transcription services.

How Does Speech to Text Software Work?

There are multiple steps involved in converting speech into text. When you talk, you create a series of vibrations. These are converted into digital data by an analogue-to-digital converter (ADC). The ADC does this by sampling the sound, taking frequent, very precise measurements of the waves. The system then applies a filter to pick out the sounds that are relevant and to distinguish between frequencies. The speed of the speech is also adjusted and the volume normalised to a constant level.
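To make the sampling and volume steps more concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular ASR product works: the 16 kHz sampling rate, the 16-bit resolution, the 440 Hz test tone and the function names sample_tone and normalise_volume are all assumptions chosen for the example.

```python
import math

SAMPLE_RATE_HZ = 16_000  # assumed sampling rate (measurements per second)
BIT_DEPTH = 16           # assumed resolution of each measurement
MAX_INT = 2 ** (BIT_DEPTH - 1) - 1  # largest 16-bit sample value: 32767


def sample_tone(freq_hz, duration_ms, amplitude):
    """Measure a pure tone at regular intervals, roughly as an ADC would."""
    n_samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return [
        round(amplitude * MAX_INT
              * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ))
        for n in range(n_samples)
    ]


def normalise_volume(samples, target_peak=0.9):
    """Scale the samples so the loudest one sits at target_peak of full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak * MAX_INT / peak
    return [round(s * gain) for s in samples]


# A quiet 10 ms tone, standing in for a softly spoken sound.
quiet = sample_tone(freq_hz=440, duration_ms=10, amplitude=0.1)
levelled = normalise_volume(quiet)

print(len(quiet))  # 160 measurements for 10 ms of audio
print(max(abs(s) for s in quiet), max(abs(s) for s in levelled))
```

Ten milliseconds of sound already produces 160 measurements at this rate. After levelling, the quiet tone peaks at the same level a loud one would, so later stages do not have to cope with differences in volume.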

The next stage involves segmenting the signal into pieces lasting hundredths or thousandths of a second and matching these pieces to phonemes. There are around 40 phonemes in the English language. Each phoneme is then evaluated in relation to the phonemes around it, and the system runs the resulting network of phonemes through a complex mathematical model that compares it to known words, phrases and sentences. The system then creates text based on what it ‘believes’ the person said. This is either presented as a chunk of text or carried out as a computer command.
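The segmenting step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the 25-millisecond frame length, the 10-millisecond step between frames and the function name split_into_frames are example choices, not details taken from any specific product. Matching frames to phonemes is the hard part and is not shown here.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=25, hop_ms=10):
    """Cut a list of audio samples into short, overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    hop_len = sample_rate_hz * hop_ms // 1000      # samples between frame starts
    frames = []
    start = 0
    # Stop once a full frame no longer fits in the remaining samples.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# One second of (silent) audio at an assumed 16 kHz sampling rate.
one_second = [0] * 16_000
frames = split_into_frames(one_second, sample_rate_hz=16_000)

print(len(frames), len(frames[0]))  # prints: 98 400
```

With these settings, a single second of speech becomes 98 overlapping frames of 400 samples each, and every one of them has to be matched against the 40 or so phonemes. That gives a sense of how much work happens before any text appears.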


ASR/Speech to Text Software: the Good, the Bad and the Ugly

ASR may seem like a brilliant option on the surface. But, if you delve deeper, there are issues. When comparing ASR and human-based transcription services, it’s wise to explore the good, the bad, and the downright ugly.

Speech to Text Software: The Good

The most significant advantages offered by ASR are speed and cost. Automatic speech recognition (ASR) produces rapid results, and can even offer a real-time service in some cases. The associated price tag is also considerably lower than human services.

Some services charge per minute. Others charge a set subscription fee, and these generally cap the total amount of audio you can upload per month. No matter how you are charged, you can expect to pay around £0.07-£0.10 per minute of audio for an automatic transcription service.

A few services are free, although paying for transcription software is likely to get you slightly better results. This brings us to some of the problems with speech to text software.

Speech to Text Software: The Bad

One major limitation of ASR is that it can produce only verbatim text. In the absence of a human, the system is only capable of transcribing what is there. This means that you could end up with a transcript that doesn’t read brilliantly. When you speak, it’s very common to pause, to make noises such as ‘erm’, and to stumble on certain words. A verbatim text will include everything on the recording. Human services often offer to clean this up and deliver a much more readable transcript that still retains all of the detail and accuracy of the original recording. In fact, accuracy is something that your speech to text transcript might be missing anyway.

Speech to Text Software: The Ugly

The most concerning aspect of ASR is its accuracy. Even the best speech to text software rarely achieves accuracy rates over 80%, which often means that you have to spend time and effort making corrections and improvements. If there are ‘complicating’ factors, ASR can produce unintelligible results. To get a usable transcript from a speech to text service, you need ‘clean’ audio. That means a high-quality recording of people speaking slowly, one at a time, without strong accents and with little to no background noise.
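It helps to know what an accuracy figure like 80% can mean in practice. One common way to measure it, assumed here because providers do not always say how they calculate their figures, is to count the word-level substitutions, insertions and deletions needed to turn the software's output into a correct reference transcript. The sketch below does this in Python; the function names and the two example sentences are made up for illustration.

```python
def word_errors(reference, hypothesis):
    """Fewest word substitutions, insertions and deletions between two texts."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] is the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # a reference word was dropped
                current[j - 1] + 1,      # an extra word was inserted
                previous[j - 1] + cost,  # words match, or one was substituted
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Share of reference words transcribed correctly (1 minus the error rate)."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference transcript is empty")
    return 1 - word_errors(reference, hypothesis) / n_words


# Hypothetical example: what was said versus what the software produced.
reference = "please send the report to the finance team by friday"
hypothesis = "please send a report to finance team by friday"

print(word_errors(reference, hypothesis))             # prints: 2
print(f"{word_accuracy(reference, hypothesis):.0%}")  # prints: 80%
```

Two wrong words in a ten-word sentence is already 80% accuracy. Across an hour-long recording, that rate adds up to a great many corrections.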

How ASR Stacks up Against Human-Based Transcription Services

There are several key differences between speech to text software and human-based transcription services.

Cost

Cost is an important factor for many people, and human transcription services are significantly more expensive than ASR. Some ASR services are free, but most charge around £0.10 per minute. In contrast, human services usually have a fee of approximately £2 per minute. Lower rates may be available for long turnaround times. But, even if you can wait a week for your transcript, you will never be able to get a human-based service as cheaply as is standard with speech to text software.
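As a quick worked example, the Python sketch below applies the approximate rates quoted above to a one-hour recording. The recording length and the function name are assumptions for illustration; real quotes vary by provider and turnaround time.

```python
ASR_PENCE_PER_MIN = 10     # roughly £0.10 per minute, as quoted above
HUMAN_PENCE_PER_MIN = 200  # roughly £2 per minute, as quoted above


def transcription_cost_pounds(minutes, pence_per_min):
    """Cost in pounds of transcribing a recording at a per-minute rate."""
    return minutes * pence_per_min / 100


minutes = 60  # hypothetical one-hour interview
asr_cost = transcription_cost_pounds(minutes, ASR_PENCE_PER_MIN)
human_cost = transcription_cost_pounds(minutes, HUMAN_PENCE_PER_MIN)

print(f"ASR: £{asr_cost:.2f}, human: £{human_cost:.2f}, "
      f"difference: £{human_cost - asr_cost:.2f}")
# prints: ASR: £6.00, human: £120.00, difference: £114.00
```

The gap looks large, but the ASR figure does not include the time you spend correcting the transcript yourself.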

Time

Human services take much longer than ASR. In most cases, human services offer a turnaround of 12–24 hours, but some businesses may take days, or even a week, to return the finished product. ASR is much faster: it produces transcripts within seconds. If you need a transcription urgently, you’ll probably have to pay a premium for human-based transcription services.


Options and versatility

With ASR, you can only get a verbatim transcript — if the software is even up to the task from an accuracy perspective. Human-based services offer a much broader spectrum of options, including verbatim, intelligent verbatim, and summary. Verbatim is a word for word transcription, while intelligent verbatim will correct errors, eliminate pauses and ‘ums’ and ‘errs’ to deliver an edited version that will read much better. Summaries provide a brief overview, which can be incredibly useful when there is a lot of information to digest and process.

Confidence and quality

When you invest in human-based transcription services, you enjoy greater confidence in the quality of the product. Human services have quality control guarantees and generally deliver 99%+ accuracy rates, only failing to do so if the audio is completely indecipherable. Transcripts will be proofread, so you don’t need to devote your own time to checking the text or making changes. If you use ASR, you may find that you have to spend valuable time combing through the text looking for mistakes, fixing garbled text and removing words and unwanted sounds.

Summary: Speech to Text Delivers a Budget Solution, But It Is Not Yet on the Same Level as Human Services

Speech to text software offers an attractive budget solution for those looking for transcription services in a hurry. But it is not yet capable of matching the quality and accuracy of human-based transcription services. Because ASR is so cheap, and sometimes even free, it is worth experimenting with to see what kinds of results you can achieve. By trying different options, you can determine what sound quality you need to get intelligible results.

The speed and price of ASR are undoubtedly appealing, but there are flaws. Ultimately, these serve to highlight the benefits of investing in human-based services. ASR-produced texts are sometimes unintelligible, speech to text software only produces verbatim transcripts, and its accuracy rates are generally significantly lower than those of human-based services. To achieve a good-quality transcription with ASR, you need to invest in making a high-quality recording. But if you want a range of options, an accurate transcription, and unrivalled attention to detail, you will need to invest in a human-based service.

You have been reading a guide to speech to text software and how it compares to human-based transcription services. If you want more information on how to make the right choice when it comes to your transcripts, we have written an Ultimate Guide to Transcription Services just for you!

Originally published at info.takenotetyping.com.