5 Incredible AI Speech-to-Text Services You Can Benefit From Today

Artificial intelligence is more accessible than ever. Find out how you can harness its power in your personal and professional life.

Nathaniel McCarthy
ECFMG Engineering
8 min read · Aug 7, 2018

Today it is common knowledge that artificial intelligence has the potential to be a driving force in simplifying our lives, yet many individuals, companies, and even developers are reluctant to adopt this technology. Most of the time this hesitation is due not to fear of a robot apocalypse or sentient computers but to fear of complex implementation. How would you even go about creating a useful AI tool? Would you create a custom model and train it using thousands of data samples? Would you be able to recognize whether or not your algorithm is just making blind predictions? How would you measure whether your AI tool is improving, and at what point its performance plateaus? Hundreds of these questions serve as a barrier and impart the feeling of an extreme knowledge deficit for many people who could benefit from AI. To stop dawdling and start reaping these benefits, we need to start asking a different question: Do we have to create our own AI tools, or are there already tools out there that we can use?

I am currently a sophomore at Drexel University studying computer science and employed in a six-month software architecture co-op at the Educational Commission for Foreign Medical Graduates (ECFMG). Starting with zero AI experience, I have spent the last three months attempting to tackle this very question. Just like many, I had uncertainties, fears, and endless questions about understanding what AI is and how I can utilize it effectively in an enterprise environment. However, as I progressed in my AI research surrounding what larger corporations are currently doing with their AI tools and studied the logistics of using these tools as an individual, I found things weren’t so convoluted after all.

My research initially consisted of exploring which tech companies were using AI, yet I pivoted quickly upon learning that almost everyone has their hand in the game. Taking a different angle, I continued researching only tech companies that were profiting by offering their AI tools to other businesses, as they must have a leg up on the competition to be able to provide such a service. After a week of intense research, I realized that my findings were surprisingly conventional. Amazon, Microsoft, Google, and IBM all use AI in their daily operations and also let individuals test their enterprise offerings for free. These offerings range from simple things like speech-to-text and better search functionality to extremely complex video analysis involving facial recognition and speaker identification. Due to my inexperience with AI, I decided to begin by testing the capabilities of their speech-to-text offerings across the board. The five tools I continued to investigate during my co-op were as follows…

1) Microsoft’s Cognitive Services Video Indexer

2) Amazon Web Services’ Amazon Transcribe

3) IBM’s Watson Speech-to-Text API

4) Google’s Cloud Speech API

5) Microsoft’s Cognitive Services Speech-to-Text API

1) Microsoft’s Video Indexer and the Cognitive Services Suite

The Video Indexer is one of the most powerful and straightforward tools offered under Microsoft’s Cognitive Services Suite. It offers an easy-to-navigate web portal that anyone can use, along with excellent API documentation for programmers. Simply navigate to the portal page and click “sign in” at the top right to create an account or use an existing one. From here it is largely self-explanatory, but if you would like additional help, the getting started section provides explicit guidance. The most impressive aspect of the Video Indexer is that you can test a wide range of AI tools in one product, and it has additional developer features such as the ability to train custom models. It also has a very generous free trial compared to the other services and the largest pool of supported file types. All these features make it one of the easiest and cheapest services to use without sacrificing accuracy or flexibility.

Benefits:

  • 2nd most accurate transcriptions generated (Indistinguishably close to Amazon Transcribe)
  • Easiest service to learn and use (Someone with no technical background can figure it out)
  • Free Trial allows for 100 videos uploaded daily
  • Remarkable price for speech-to-text: $0.96 per hour ($0.00026/sec)
  • Largest pool of supported file types

Pitfalls:

  • 2nd longest processing time for files
  • Can become pricey if you want to use its full range of capabilities: $4.00 per hour ($0.0011/sec)

2) Amazon Transcribe and the AWS Machine Learning Suite

Amazon Transcribe is one of the many tools Amazon offers under its AWS Machine Learning suite. It offers an easy-to-use web interface in the AWS console for anyone, and an application programming interface (API) for code-savvy developers. Using the service is as easy as executing steps one and three under “Getting Started” in the developer documentation provided. Overall, this is another friendly option that almost anyone can use, and it produces the most accurate transcriptions, but it is also tied for the highest speech-to-text price of all the services, reinforcing a get-what-you-pay-for mentality.

Benefits:

  • Most accurate transcriptions of all the services
  • Second easiest service to use (Anyone can use but might take a few minutes to learn how to navigate)
  • Second largest pool of supported audio file types
  • During the first year of usage (AWS free tier), the first 60 minutes of audio transcription each month are free
  • Extremely easy to add in a custom word-list

Pitfalls:

  • Tied for the most expensive speech-to-text service: $1.45 per hour ($0.0004/sec)
  • Longest file processing time
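
The accuracy rankings in this article are given without a stated scoring method. One common way to compare a service's transcript against a hand-checked reference, offered here as an assumption and not as the method behind these rankings, is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the reference into the service's output, divided by the number of words in the reference. A minimal sketch, where `word_error_rate` is a hypothetical helper name:

```python
def word_error_rate(reference, hypothesis):
    """Return word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Illustrative strings only, not output from any of the services.
print(word_error_rate("the quick brown fox", "the quick brown box"))
```

A lower score is better, and running the same audio file through each service gives numbers that can be compared directly.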

3) IBM’s Watson Speech-to-Text API

Under IBM’s Watson suite resides their Speech-to-Text Application Programming Interface (API), which is extremely friendly for developers yet will be quite hard to navigate for anyone with no coding experience. If you do have some experience under your belt, the documentation and API reference provide quite a bit of guidance and allow you to test API calls from the site. The only thing you have to do is sign up for a “lite” (free) IBM Cloud account and then start reading. The service’s learning curve and the time you must spend before actually getting it working are justified by the efficiency of the service. For instance, if you want to recognize speech, all you have to do is POST (an HTTP method) a supported file type to the endpoint for speech recognition. While this takes some time to learn, the request states exactly what you want from the service, and as a result the file is processed in about a tenth of the time it would take the Video Indexer or Transcribe services. In short, this API offers the best balance of speed and accuracy, but it is time-consuming to implement and lacks an intuitive interface.
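
To make the POST step concrete, here is a minimal sketch that builds such a request using only Python's standard library, without sending it. The service URL and API key are placeholders, and the `/v1/recognize` path and the `apikey` basic-auth scheme reflect my understanding of IBM's documentation, so treat them as assumptions and check the current API reference. `build_recognize_request` is a hypothetical helper name.

```python
import base64
import urllib.request

# Placeholders: substitute the URL and API key from your own IBM Cloud
# service credentials. These values are assumptions for illustration.
SERVICE_URL = "https://example.invalid/speech-to-text/api"
API_KEY = "your-api-key"


def build_recognize_request(audio_bytes, content_type="audio/flac"):
    """Build (but do not send) a POST request carrying raw audio bytes."""
    token = base64.b64encode(f"apikey:{API_KEY}".encode()).decode()
    return urllib.request.Request(
        url=f"{SERVICE_URL}/v1/recognize",  # assumed recognition path
        data=audio_bytes,
        method="POST",
        headers={
            "Content-Type": content_type,  # tells the service the file type
            "Authorization": f"Basic {token}",
        },
    )


request = build_recognize_request(b"fake-audio-bytes")
print(request.get_method(), request.full_url)
```

With real credentials and real audio bytes read from a file, passing the request to `urllib.request.urlopen` would send it, and the transcription would come back in the response body.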

Benefits:

  • Very close third to Indexer in transcription accuracy
  • Has best transcription accuracy/speed ratio (Best transcriptions in shortest amount of time)
  • Unlimited lite-plan with 100 minutes of transcriptions free every month
  • The cost of the service goes down if you plan on transcribing over 250,000 minutes of audio each month. It can get as low as $0.60 per hour ($0.00016/sec) if you transcribe over 1,000,000 minutes of audio, or approximately 2 years’ worth of audio/video.

Pitfalls:

  • Requires programming knowledge and skills to use effectively, and takes time to learn and implement
  • The standard price for the first 250,000 minutes is $0.02 per minute, or about $1.20 per hour ($0.00033/sec), making it the third most expensive

4) Google’s Cloud Speech API

Google’s Speech Application Programming Interface (API), offered through their cloud service platform, bridges the gap between functionality and usability. Even though it has an initial learning curve, it offers excellent documentation with easy-to-follow examples. The documentation is so good that someone with no coding experience can use the service through the built-in cloud command line interface (CLI). This CLI is what bridges the gap: it allows less flexibility than the API but offers similar functionality, and the documentation makes it easy to use.

Benefits:

  • Once you have signed up, you receive a year of free-tier usage and $300 in Google Cloud credit.
  • Only Microsoft’s Speech API can match its file processing efficiency.
  • Even if you are not a programmer, it still offers the flexibility, efficiency, and malleability of an API through the Google Cloud command line interface (CLI)
  • Many features and extensions are included, allowing for extreme flexibility

Pitfalls:

  • Has the second worst transcription accuracy of the five services, and the accuracy gap between IBM’s Speech API and Google’s is quite large.
  • Regrettably, this API costs the same as Amazon Transcribe at $1.45 per hour ($0.0004/sec) and is more laborious to use.
  • Upgrading to use the enhanced audio model doubles the cost ($0.0008/sec)

5) Microsoft’s Cognitive Services Speech API

Finally, there is Microsoft’s Cognitive Services Speech Application Programming Interface (API), my initial venture into AI. This API’s functionality fell short of all the other services, even though it holds the award for the fastest processing speed: its transcription accuracy was poor, and not all of its documentation is up to date. Specifically, most of the software development kit (SDK) documentation is quite old compared to the updates Microsoft is rolling out for the web documentation. Due to these documentation issues, I would not recommend this service to anyone other than a capable programmer. While long transcriptions are inaccurate, it provides a quick and dirty solution for voice command recognition. Thus, this service is less of a speech service and more of a voice command service, as that’s where it excels. Overall, it produces useful short transcriptions and has the cheapest pricing for extended use, but it suffers from a large initial learning curve and long adoption time.

Benefits:

  • Fastest file processing speed
  • It is the cheapest service at $0.50 per hour ($0.00014/sec)
  • Translation and custom word-lists offered which can improve speech command accuracy greatly

Pitfalls:

  • Very short 30-day free trial for testing
  • Poor transcription accuracy
  • SDK documentation outdated
  • Must have programming knowledge and skills

Summary

Overall Comparison
Relevant Feature List
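
The per-hour speech-to-text prices quoted in the sections above can be compared for a given workload with a short calculation. The sketch below uses the August 2018 figures from this article and ignores free tiers, volume discounts, and premium features. The function names are hypothetical, and the IBM entry is the per-hour equivalent of the roughly $0.00033/sec rate quoted above.

```python
# Base speech-to-text prices in USD per hour of audio, as quoted in this
# article (August 2018). Current prices will differ.
PRICE_PER_HOUR_USD = {
    "Microsoft Video Indexer": 0.96,
    "Amazon Transcribe": 1.45,
    "IBM Watson Speech-to-Text": 1.20,  # about $0.00033/sec, first tier
    "Google Cloud Speech": 1.45,
    "Microsoft Speech API": 0.50,
}


def transcription_cost(service, audio_seconds):
    """Cost in USD for the given amount of audio, billed per second."""
    per_second = PRICE_PER_HOUR_USD[service] / 3600
    return per_second * audio_seconds


def rank_by_cost(audio_seconds):
    """Return (service, cost) pairs sorted from cheapest to most expensive."""
    costs = {
        name: transcription_cost(name, audio_seconds)
        for name in PRICE_PER_HOUR_USD
    }
    return sorted(costs.items(), key=lambda item: item[1])


# Example workload: ten hours of recorded audio.
for name, cost in rank_by_cost(10 * 3600):
    print(f"{name:28s} ${cost:6.2f}")
```

Price is only one axis of the comparison, so a ranking like this is best read alongside the accuracy and processing-time notes for each service.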

With all that said, for my purposes I only looked at speech-to-text, yet there are many more offerings, such as text-to-speech, speech or text translation between languages, text recognition from an image, personality insights, and even emotion recognition from photos of facial features. The value of all these tools lies not only in their current state but also in how they change over time. Through our constant use of these AI tools, we introduce a crowd-sourced stream of data that is used to improve the existing infrastructure. Today the video analysis tools could miss your face in a video, and by tomorrow they could pick it out of a crowd. It is this notion of consistent development that explains why we should all strive to incorporate AI tools into our lives right now and not down the road. Being able to see the current state of AI will bring solace to those dreading the robot takeover, as it looks a lot further down the road than we are currently estimating. Along with this peace of mind, you will also discover feelings of insight and wonder, as seeing the tools’ rapid improvement and ease of use is something of a modern-day marvel.
