Learn the Process to Build an Enduring AI Foundation in 15 Minutes

Learn how to build a robust foundation for speech AI in your company, and gain a current snapshot of today’s AI tools and performance.

Nathaniel McCarthy
ECFMG Engineering
15 min read · Aug 22, 2018

--

Comparing Speech AI Services

The initial step in building a robust foundation for artificial intelligence is to evaluate and compare what is currently offered on the market. As of this writing, I have investigated four of the large-scale AI providers, Amazon, Google, IBM, and Microsoft, and compared their speech-to-text services. It should be noted that these are not the only tools out there; other notable names include Nuance, Baidu, AT&T, Apple, and Cisco, and I will continue to update this article as I explore these other providers.

Specifically, I investigated Amazon Web Services’ (AWS) Transcribe, Google’s Speech API, IBM Watson’s Speech-to-Text API, and Microsoft’s Speech API, and I also dove into Microsoft’s Video Indexer. While Microsoft’s Video Indexer is technically a video analysis tool, I noticed its accuracy is quite different from their Speech API and felt the need to report it after my initial investigation. All five of these services have free trial periods or can be used for free indefinitely, and they are a great starting point for learning about the different AI offerings on the market. For the purposes of testing, only the base functionality of each service was reviewed, though more expensive advanced offerings are available.

Scoring

After researching which AI services are openly offered, the next step is to create a standard for scoring each service that could potentially be incorporated. This standard has to evaluate these AI services not only empirically but also analytically, so we can justify our final design decisions to ourselves and others. In trying to generate criteria for comparing the services to each other, four defining factors were reviewed…

  • Quality
  • Time Investment
  • Pricing
  • Flexibility and Future State

…that being said, it is quite difficult to judge the quality or future state of speech-to-text services. Beyond developing a standard for scoring speech AI empirically, a standard for scoring each service’s accuracy must be used to judge current quality and then be re-evaluated over time to analyze future state. The National Institute of Standards and Technology (NIST) created the score lite (sclite) tool for just this purpose.

The sclite tool is part of the NIST Scoring Toolkit (SCTK), a collection of software tools designed to score benchmark test evaluations of Automatic Speech Recognition (ASR) systems. The sclite tool is a flexible dynamic-programming alignment engine used to score output from an ASR system by comparing it to a correct reference text. After alignment, sclite generates a variety of summaries as well as detailed scoring reports, which give a purely analytic view of each speech-to-text service’s performance. Evaluating the four defining factors alongside the reports generated with the NIST sclite tool provides an excellent in-depth comparison. Below are my findings, with output from the sclite tool when scored against the same video across all services, along with the transcript from each service. The correct original transcript follows to help you make your own visual comparisons.
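If you want to reproduce this kind of scoring yourself, a typical run compares a hypothesis transcript against a reference transcript, both in .trn format. The sketch below just shells out to the tool from Python; the flags shown are common defaults (check `sclite -h` in your SCTK install, as option spellings can vary), and it assumes sclite is on your PATH.

```python
import subprocess

def sclite_command(ref_path, hyp_path, report="sum"):
    """Build an sclite command line: score the hypothesis transcript
    (service output) against the correct reference, both in .trn format."""
    return [
        "sclite",
        "-r", ref_path, "trn",   # reference transcript file + format
        "-h", hyp_path, "trn",   # hypothesis transcript file + format
        "-i", "rm",              # utterance-id convention for the tags
        "-o", report, "stdout",  # emit the scoring report on stdout
    ]

def score(ref_path, hyp_path):
    """Run sclite and return its report text (requires SCTK installed)."""
    result = subprocess.run(sclite_command(ref_path, hyp_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The summary report this prints is the same kind of output shown in the screenshots below.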

Correct, Original Video Transcript

A Few Notes on the Scoring…

  • These are not the EXACT transcriptions received from the services, as the format had to be modified to receive accurate sentence error percentages when using the sclite tool. Note the tags at the end of each line and how the lines are split based on the original transcript’s context.
  • I have changed certain words so that if a transcription service had a different spelling for a colloquial word, such as “OK” versus “Okay”, it is still counted as a correct word by the sclite tool.
  • This is only A SINGLE video example out of the many I ran, so if I describe a different picture than what is shown in the screenshots, it comes from experience with each service, not pure conjecture.
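To illustrate the first note above: sclite expects each transcript line to end with an utterance-id tag in parentheses so that reference and hypothesis lines can be aligned one-to-one. A tiny helper like this does the reformatting; the speaker label and numbering scheme here are my own choices, not a requirement of the tool.

```python
def to_trn(sentences, speaker="spk1"):
    """Append an utterance-id tag, e.g. '(spk1-0001)', to each sentence
    so sclite can align reference and hypothesis transcripts line by line."""
    return ["%s (%s-%04d)" % (text.lower(), speaker, i)
            for i, text in enumerate(sentences, start=1)]

# Example: sentences split based on the original transcript's context
lines = to_trn(["okay let us begin", "thank you for coming"])
```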

The Future State Development Mindset

Armed with this information and the groundwork out of the way, the next step is to look at your company’s future state and decide what tools fit into it. If you are looking for a short-term command interpreter for a company application, maybe Microsoft’s Speech API is the perfect solution for you. Or maybe you have a completely different need, such as an extremely stable application where cost matters less, and decide upon Amazon Transcribe. If you want the most bang for your buck, then Microsoft’s Video Indexer or IBM’s API might be ideal. No matter the use case your company desires, there is a tool out there for you; the real question is which tools will continue to perform reliably in the future?

Well, what if you didn’t have to choose? Because all these products have APIs working behind the scenes, it is possible to build an application that incorporates all of them. This is the true way to build a foundation for not just speech-to-text but any AI service you want to take advantage of. This way you can decide which service to use at any time by scoring and benchmarking the services at given intervals throughout the year. Amazon might reign supreme in speech-to-text accuracy this year, but in four years it may have the worst performance. The key to building a foundation for AI is building a platform that delivers the functionality of each service yet can be adjusted based on fluctuations in AI performance.

Quality Comparison

Amazon Transcribe

Transcription from Service
Output From Sclite tool
  • Performance — Best scoring percentage on average out of all the sample tests run. As you can see, even its worst performance is quite accurate.
  • Feature(s) — Easy to add word-priority lists
  • Technical durability — Because it is already in the lead in accuracy, and Amazon has such a large focus on future AI development, this service will exist and adapt well in the future. However, it will probably only integrate well with other Amazon products.
  • Usability — Ranked 2nd in ease of use, and 4th in documentation usability
  • Aesthetics — Bland Graphical User Interface (GUI), but only two of the five services had GUI interfaces at all.
  • Perceived quality — Amazon Alexa has existed since 2014, and Amazon has been using the data it collects to train their services.
  • Value for money — Average, as the transcription is very good but the price is very high in comparison to other services

Google Speech

Transcription from Service
Output from Sclite Tool
  • Performance — Fourth overall in accuracy. It generally produces very readable transcripts, but it misses words easily, so clear speakers and loud volume help transcription accuracy.
  • Feature(s) — automatic language detection, word-priority lists, Speaker Identification
  • Technical durability — Extremely durable; Google doesn’t generally release a product until it has some inherent value. For example, Google Home was released in 2016, approximately two years after Alexa, but can still compete with it even though Google is working with much less field data. Because of this rapid quality acceleration, its future state will most likely be very stable as they continuously gather data.
  • Usability — 3rd in ease of use, 3rd in documentation
  • Aesthetics — Easy-to-navigate cloud dashboard that includes a Command Line Interface (CLI) from which you can access the API if you are not a programmer; no GUI.
  • Perceived quality — Google Home was released in 2016 and is considered similar to Amazon’s Alexa. Google’s speech products seem to be much better at deciphering colloquial communication, whereas Amazon’s strengths lie in commands and breadth of language interpretation.
  • Value for money — Probably not worth it currently, but it is definitely the service to watch for in terms of growth as things move fast when driven by Google.

IBM Watson

Transcription from Service
Output from Sclite Tool
  • Performance — Third-best average accuracy of all the services. Has the most efficient transcription-time-to-accuracy ratio.
  • Feature(s) — Speaker Identification, word-priority lists, custom speech model support
  • Technical durability — Watson was one of the first in the AI game, and they are currently the only ones to offer their free tier indefinitely, so they are certainly future-state oriented. If you are an individual developer, I highly recommend this API for speech-to-text investigation.
  • Usability — 4th in ease of use, 2nd best documentation; you can test API calls straight from the documentation
  • Aesthetics — Easy-to-navigate cloud dashboard; no GUI.
  • Perceived quality — Lesser known for their speech-to-text, however they have always been a major player in the machine learning and AI fields
  • Value for money — 2nd best value of the five, and definitely the best API.

Microsoft Speech

Transcription from Service
Output from Sclite Tool
  • Performance — Worst Transcription Accuracy
  • Feature(s) — Intent Analysis
  • Technical durability — Probably going to be deprecated or updated rapidly with Microsoft’s new AI initiatives, as their Video Indexer and other services currently have better-quality speech-to-text. I believe the reasoning behind this is to keep their quality speech-to-text models out of the reach of the public, while offering this one to developers for short-command interpretation in apps.
  • Usability — 5th in ease of use, 5th in documentation quality (this is mostly due to the lack of support for their Software Development Kit (SDK) documentation; their normal documentation for the service is just as updated as their other AI products)
  • Aesthetics — No GUI.
  • Perceived Quality — Microsoft is one of the fastest growing in the AI field and is continually making huge efforts to push their AI products through the Azure cloud platform.
  • Value for money — This is actually quite a valuable service; however, its value lies in short-command interpretation rather than full-blown transcription, an area where the other services trump it.

Microsoft Video Indexer

Transcription from Service
Output from Sclite Tool
  • Performance — Second-best transcriptions on average
  • Feature(s) — Not just a speech-to-text service, includes facial recognition, intent analysis, sentiment analysis, and much more.
  • Technical Durability — Microsoft has one of the largest supporting AI teams and continues to be a leader in the AI field. This service will be around for a long, long time.
  • Usability — Extremely easy to use, very nice graphical interface, 1st in documentation quality.
  • Aesthetics — Easy-to-use, attractive GUI, and an extremely easy-to-use API.
  • Perceived quality — Microsoft is a leader in AI efforts and constantly works to make them more accessible to the public and businesses.
  • Value for money — Best value for your money if you are looking for a full-blown transcription service.

Time Investment

*These approximations depend on multiple factors, such as the skill and motivation of the developers implementing the technology, and should be taken with a grain of salt. That being said, they provide a very good indicator of the time you should allot to learning each product.

Amazon Transcribe

  • Research and Training Time — This will vary depending on how you want to use the service. If you are just looking to get transcripts once in a while, then the research and training time will be around 5 minutes, as the GUI is very self-explanatory. However, if you are looking to automate the process, it will take some time, as learning how to use the API in unison with Amazon storage can be a hurdle. I recommend starting with learning how to create and destroy an S3 bucket, upload and delete files in it, and then learning how to use the Transcribe functions on a bucket. Overall this process could take 1–5 days.
  • Implementation Lead Time — Similar to the research and training time, it will be extremely quick to implement acquiring transcriptions every now and then, as you can simply use the GUI. The GUI does not take more than 5 minutes to upload the files, and then you simply wait for the final transcription. Using the API to fully automate the process will take more time and will require that you customize the functionality of both the storage and Transcribe APIs. A good estimate for understanding and automating its base functionality would be 2–7 days, but all of this depends on what you are trying to achieve with the service.
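The bucket-then-transcribe workflow described above can be sketched with boto3, Amazon's Python SDK. The bucket, key, and job names here are placeholders, `MediaFormat` must match your actual file type, and nothing talks to AWS until you call `transcribe_file` with real credentials configured.

```python
def make_job_params(job_name, bucket, key, media_format="mp3"):
    """Assemble the keyword arguments for start_transcription_job."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "MediaFormat": media_format,
        "Media": {"MediaFileUri": "s3://%s/%s" % (bucket, key)},
    }

def transcribe_file(job_name, bucket, key, local_path):
    """Upload an audio file to an existing S3 bucket, then start a
    Transcribe job on it (requires boto3 and AWS credentials)."""
    import boto3  # AWS SDK for Python
    boto3.client("s3").upload_file(local_path, bucket, key)
    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(**make_job_params(job_name, bucket, key))
    # Poll get_transcription_job(TranscriptionJobName=job_name) until the
    # TranscriptionJobStatus is COMPLETED, then fetch the transcript URI.
```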

Google Speech

  • Research and Training Time — There are also two methods to get started with the Google Speech service. Option one is to read their documentation and learn how to transcribe files from their cloud command-line tool for the web. This can be pretty easy to pick up, as the documentation gives a very easy tutorial, and can probably be learned in 15–30 minutes. Option two is the fully automatic option, where you learn their API; how much time that takes will depend on the level of customization you are looking for. Much like with Amazon, you will have to create a Google storage bucket, upload the file, and then transcribe it by running speech recognition on it. To fully understand all the quirks of the service I would allocate 1–4 days of research.
  • Implementation Lead Time — The cloud command-line interface allows you to start transcribing files in 5–10 minutes. The API will take a bit longer to fully implement; allocating 1–7 days of coding and testing will be necessary to achieve minimum results.
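For the API route, the shape of the code looks roughly like this, using the google-cloud-speech client library. The client library's surface has changed across versions, so treat this as a sketch rather than copy-paste; the bucket and file names are placeholders, and the file must already be uploaded to Cloud Storage.

```python
def gcs_uri(bucket, blob):
    """Build the gs:// URI Google Speech expects for Cloud Storage files."""
    return "gs://%s/%s" % (bucket, blob)

def transcribe_gcs(bucket, blob):
    """Start an asynchronous recognition job on a file in a Cloud Storage
    bucket (requires google-cloud-speech and Google Cloud credentials)."""
    from google.cloud import speech  # third-party client library
    client = speech.SpeechClient()
    config = {"language_code": "en-US"}   # add encoding/sample rate as needed
    audio = {"uri": gcs_uri(bucket, blob)}
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result()         # blocks until transcription finishes
    return " ".join(r.alternatives[0].transcript for r in response.results)
```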

IBM Watson

  • Research and Training Time — The research and training time for this API should be shortened greatly compared to the other APIs, as the documentation is extremely good. Also, unlike the other services, you do not need to create any cloud storage to utilize its functionality; all you have to do is make a POST call, including the file you want to transcribe, to the correct endpoint. It should take approximately 1–3 days to understand this API’s nuances.
  • Implementation Lead Time — Implementing this and getting it fully automated will take the least time of all the APIs due to the stellar documentation. It should only take 1–5 days to implement the minimal functionality of this service.
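That single POST call looks roughly like the following. The host shown matches the service's hosting at the time of writing and may differ for your region or instance, and whether you authenticate with username/password or an IAM API key depends on your plan; treat the whole thing as a hedged sketch.

```python
def watson_endpoint(host="https://stream.watsonplatform.net"):
    """Speech-to-Text recognize endpoint (host may vary per instance)."""
    return host + "/speech-to-text/api/v1/recognize"

def transcribe(audio_path, username, password):
    """POST an audio file directly to Watson; no cloud storage needed."""
    import requests  # third-party HTTP library
    with open(audio_path, "rb") as f:
        resp = requests.post(
            watson_endpoint(),
            data=f,
            headers={"Content-Type": "audio/flac"},
            auth=(username, password),
        )
    resp.raise_for_status()
    return resp.json()  # transcript alternatives with confidence scores
```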

Microsoft Speech

  • Research and Training Time — The way you decide to learn this API, through HTTP and REST calls or through their software development kits (SDKs), will greatly influence your research time. Their SDK documentation is not nearly as developed or updated as their base HTTP and REST tutorials. Thus, the SDK route will probably take 2–5 days to learn, while the HTTP and REST route will take 1–4 days.
  • Implementation Lead Time — If you choose to use their SDKs, implementing this API will take around 2–8 days. Otherwise, the HTTP and REST route can be implemented in a 1–5 day time-frame.
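A skeleton of the HTTP/REST route is below. The endpoint URL and header names reflect the service as it existed when I tested it and may have moved under Azure's Cognitive Services umbrella since, so double-check against the current documentation before relying on it.

```python
def speech_url(mode="interactive", language="en-US"):
    """Build the REST endpoint URL; 'interactive' mode targets the short
    commands this service is best at. Host/path may have changed since."""
    return ("https://speech.platform.bing.com/speech/recognition/"
            "%s/cognitiveservices/v1?language=%s&format=simple"
            % (mode, language))

def recognize(audio_path, subscription_key):
    """Send one short utterance via REST (requires the requests library
    and a valid subscription key)."""
    import requests  # third-party HTTP library
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
    }
    with open(audio_path, "rb") as f:
        resp = requests.post(speech_url(), headers=headers, data=f)
    resp.raise_for_status()
    return resp.json().get("DisplayText")
```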

Microsoft Video Indexer

  • Research and Training Time — Much like Amazon’s, the GUI and API will have two very different learning time-frames. The GUI could probably be picked up in less than 5 minutes. The API, on the other hand, will take more time; even with its extremely good documentation, it will probably take 1–2 days to learn in full.
  • Implementation Lead Time — Implementing the GUI version takes the same amount of time as the Amazon GUI, so in less than five minutes you can be done with your first transcription. The API will take some time; however, due to the documentation, it is extremely easy to do. It will only take around 1–3 days to fully implement the functionality of this API.
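The API side boils down to one upload call plus polling. The endpoint and parameter names below are from the version of the API I worked with and may well have changed (the service was still evolving rapidly), so treat this purely as a sketch of the shape of the integration.

```python
def upload_params(video_name, video_url, privacy="Private", language="en-US"):
    """Query parameters for the Video Indexer upload call (parameter names
    are from the API version tested here and may differ in newer versions)."""
    return {"name": video_name, "videoUrl": video_url,
            "privacy": privacy, "language": language}

def upload_video(video_name, video_url, api_key):
    """Ask Video Indexer to ingest a publicly reachable video URL."""
    import requests  # third-party HTTP library
    resp = requests.post(
        "https://videobreakdown.azure-api.net/Breakdowns/Api/Partner/Breakdowns",
        params=upload_params(video_name, video_url),
        headers={"Ocp-Apim-Subscription-Key": api_key},
    )
    resp.raise_for_status()
    return resp.json()  # an id you poll until the transcript is ready
```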

Pricing (Billed On a Monthly Basis)

Amazon Transcribe

  • Free Trial Period — Year of free tier access that gives you 60 minutes of transcription free each month
  • Running Cost — $1.45 per hour ($0.0004/sec)

Google Speech

  • Free Trial Period — Year of free cloud access with $300 in credits, and the first 60 minutes of uploaded transcriptions each month is free.
  • Speech Model Running Cost — $1.45 per hour ($0.0004/sec)
  • Video Model Running Cost (Advanced) — $2.88 per hour ($0.0008/sec)

IBM Watson

  • Free Trial Period — The only truly indefinite free trial; you get 100 minutes of transcription free per month.
  • Initial Cost — $1.12 per hour ($0.00033/sec)
  • Over 250,000 Minutes — $0.90 per hour ($0.00025/sec)
  • Over 500,000 Minutes — $0.76 per hour ($0.00021/sec)
  • Over 1,000,000 Minutes — $0.61 per hour ($0.00017/sec)

Microsoft Speech

  • Free Trial Period — Year of free service and $200 credit, and you get up to 5 hours of transcription free each month
  • Running Cost — $0.50 per hour ($0.00014/sec)

Microsoft Video Indexer

  • Free Trial Period — Infinite, capped by limiting your video uploads to 100 a day (as of right now, changes in this policy have been mentioned for the future)
  • Running Cost — $0.96 per hour ($0.00026/sec)
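For a rough apples-to-apples comparison, the base per-hour rates above can be dropped into a small calculator. Note this deliberately ignores free-tier minutes, Google's per-request rounding, and IBM's volume discounts, so it only approximates a real bill.

```python
# Base hourly rates from the pricing sections above (USD, first tier)
RATES_PER_HOUR = {
    "Amazon Transcribe": 1.45,
    "Google Speech": 1.45,
    "IBM Watson": 1.12,
    "Microsoft Speech": 0.50,
    "Microsoft Video Indexer": 0.96,
}

def monthly_cost(hours):
    """Estimated monthly bill per service for a given number of audio hours."""
    return {name: round(rate * hours, 2) for name, rate in RATES_PER_HOUR.items()}

costs = monthly_cost(100)  # e.g. 100 hours of audio per month
```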

Flexibility and Future State

Amazon Transcribe

  • Supported SDKs — Java, .NET, Node.js, PHP, Python, Ruby, JavaScript/Browser, Go, C++, AWS IoT SDK.
  • Supported Mobile SDKs — Android, iOS, React-Native, Web/Javascript/AWS amplify library, AWS mobile SDK.
  • Flexibility and Modification — Highly integrated with other AWS services. Word lists and custom speech models allow for easy customization, yet require a larger initial time investment.
  • Volume — Can only accept audio under two hours in length. There is no max size in terms of audio content that can be processed in the monthly billing period.
  • Future State — As this is currently the leader in best average accuracy I expect to see this trend hold as Amazon’s reputation for growth and support have not wavered.

Google Speech

  • Supported SDKs (Client Libraries) — .NET, Go, Java, Node.js, PHP, Python, Ruby.
  • Flexibility and Modification — Has the largest feature list among the APIs, including automatic language detection and translation, speaker identification, keyword spotting, and much more.
  • Volume — Each request is rounded up to the nearest multiple of 15 seconds (for example, two seven-second requests become 30 seconds of transcribed audio), and you are capped each month at 1,000,000 minutes.
  • Future State — Google is currently not up to par with the other services’ accuracy; however, it does seem to produce some very semantically accurate transcriptions. Given Google’s reputation for releasing products slower yet more robustly built, I believe great AI improvement will be made in the coming years. An AI’s accuracy and functionality are influenced by its base algorithms, but the sample size of data you have is just as important. Google in its current state has extremely good base algorithms, but they have not yet collected enough data to excel in the speech-to-text field.
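The 15-second rounding in the Volume point above is easy to model, and worth doing if your workload is many short clips:

```python
import math

def billed_seconds(request_seconds):
    """Google bills each request rounded up to the nearest 15 seconds."""
    return math.ceil(request_seconds / 15) * 15

# Two separate seven-second requests are billed as 15 s each, 30 s total,
# so batching short clips into one request can cut the billed time.
total = billed_seconds(7) + billed_seconds(7)
```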

IBM Watson

  • Supported SDKs — Node.js, Python, Swift, Java, Unity, .NET, Ruby, Apache OpenWhisk, Web/JavaScript, Salesforce, React, and much more (open-source GitHub repo)
  • Flexibility and Modification — Has fewer additional features than the other services, yet it can be integrated into other applications easily.
  • Volume — No cap, and the price gets cheaper the more you send each month. Thus it is the best option for large-volume transcription needs.
  • Future State — As IBM was one of the very first to really showcase the power of AI to the public through Watson, I do not see this service lagging behind any of the others in terms of cost. However, it does not have many additional features, and the IBM cloud does not offer the same variety of integrated services. So if powerful new AI services are created, you might not be able to access them through the IBM cloud unless you build an application that manages your calls to it.

Microsoft Speech

  • Supported SDKs — C++, .NET, Web/Javascript
  • Supported Mobile SDKs — Java
  • Flexibility and Modification — Very little flexibility; the only real standalone feature it has is intent analysis. This allows it to pull semantic meaning from transcribed text, in the same way you tell your digital assistants what to do and they understand what you mean (Google Home, Cortana, Alexa, etc.).
  • Volume — No currently listed cap
  • Future State — This API is probably going to be deprecated or upgraded to match the newer services Microsoft offers such as their video indexer.

Microsoft Video Indexer

  • Supported SDKs — No true SDK at this point, but as it is a Microsoft product, C# examples are given in the documentation
  • Flexibility and Modification — This service provides a crazy number of smaller AI services meshed together. However, it does not offer much individual customization for each service under the hood.
  • Volume — No currently listed cap
  • Future State — This is currently an extremely powerful tool, as it is easy for anyone to use and also has the second-best transcription accuracy among the five services. It may seem less useful than the other services from the perspective of additional features and the ability to modify it. Yet the very fact that it is already most of Microsoft’s AI offerings integrated together is quite amazing, as it indicates that the more AI services they have, the more they will work to provide products that weave them together, allowing for a one-time purchase of a wide array of AI offerings.
