How to optimise automatic speech recognition performance using MKL

Michael Tansini
Published in Speechmatics
Aug 12, 2020 · 8 min read

Note: This blog entry contains a number of acronyms. If you start to feel lost, there is a glossary at the bottom of the page.

At Speechmatics, we’re constantly looking for ways to improve the accuracy and efficiency of our language products. We do this in multiple ways:

  1. Gathering more useful data. Language packs are trained on data from a variety of sources, and it's important that this data reflects our customers' use cases. For example, we need to be able to recognise speech in noisy or poor-quality environments. Sometimes we'll also be more specific and fix known bugs or issues for particular languages or use cases
  2. Improving the tools, algorithms, and processes by which we create a language pack. This is carried out on an ongoing basis
  3. Optimising the efficiency of our products to the limit on the recommended hardware

This article focuses on point number 3 and discusses the improvements Speechmatics has carried out this quarter. We continue to ensure our technology takes advantage of improvements in computing performance. One way Speechmatics measures performance is via a metric called real-time factor or RTF, which is the time taken to transcribe the audio divided by the duration of the audio.

A value of 1 would mean the file takes as long to transcribe as its actual duration (e.g. a file that is an hour long takes an hour to transcribe). Our service-level agreement for batch transcription requires an RTF of at most 0.5, and we always meet it (that hour-long file then takes half an hour or less).
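As a concrete illustration (the numbers here are invented purely for the example), the calculation is just a ratio:

```python
def real_time_factor(transcription_seconds: float, audio_seconds: float) -> float:
    """RTF = time taken to transcribe the audio / duration of the audio."""
    return transcription_seconds / audio_seconds

# A one-hour file transcribed in 24 minutes:
rtf = real_time_factor(transcription_seconds=24 * 60, audio_seconds=60 * 60)
print(f"RTF = {rtf:.2f}")             # 0.40
print("Meets 0.5 SLA:", rtf <= 0.5)   # True
```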

RTF and speech accuracy are not separate concerns; they are closely entwined. We can use improvements in RTF to offset the cost of deploying larger, more accurate language packs, and so continue to improve ASR accuracy over a wider range of use cases, commercial sectors, and sound qualities.

Changing the internal build for our ASR deployments

As part of our latest release, we’ve changed the internal build of our methods of deployment — containers, appliances and cloud offering — to lay the foundations for greater accuracy improvements in the future.

Our changes concern how our ASR software interacts with the most fundamental part of computing — the core processor chipset that converts every action we do in information technology into billions of mathematical operations per second.

This number-crunching involves a huge number of mathematical operations, implemented by a software library called BLAS (Basic Linear Algebra Subprograms, if you want the full mouthful!), and the better these routines perform, the more efficient our software becomes. We've carried out performance testing with our batch and real-time products on various commonly available CPUs to understand the optimal settings on Intel hardware.
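You can see the same idea in any BLAS-backed numerical library. NumPy, for instance, delegates matrix multiplication to whichever BLAS it was built against (MKL, ATLAS, OpenBLAS, and so on); this is just a quick way to see the dependency for yourself, not a view into how our ASR stack is structured internally:

```python
import numpy as np

# Show which BLAS implementation this NumPy build is linked against
# (MKL, ATLAS, OpenBLAS, ...).
np.show_config()

# A single call like this dispatches to the underlying BLAS routine (GEMM),
# which is where the bulk of the number-crunching happens.
a = np.random.rand(2048, 2048).astype(np.float32)
b = np.random.rand(2048, 2048).astype(np.float32)
c = a @ b
```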

At first, this doesn’t sound terribly exciting, but out of small acorns, mighty oaks grow.

Continuing to optimise our performance with the MKL software library

As part of performance testing, we're using a software library within our products known as MKL (Intel Math Kernel Library). MKL, provided by Intel, is a collection of commonly used mathematical routines that have been specifically optimised for performance on Intel processors. This is particularly true where the chipset supports a set of instructions called Advanced Vector Extensions (AVX). AVX enables these mathematical calculations to be done in parallel rather than in series, and thus at a far higher rate than anything that came before.

Why is this relevant?

To run Speechmatics’ technology you must be using hardware that supports AVX, but these instructions have been present in Intel chipsets since 2011, and Intel now supports the subsequent extensions, AVX2 and AVX-512, which promise further improvements in compute capability. Part of our work was to understand how much better AVX2 and AVX-512 actually are, and which implementation of AVX would provide the greatest improvements.
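If you want to check what your own hardware supports before deploying, the CPU's feature flags will tell you. A minimal check on Linux (reading /proc/cpuinfo; this is purely illustrative and not part of our product) might look like this:

```python
def supported_avx_variants(cpuinfo_path: str = "/proc/cpuinfo") -> set:
    """Return the AVX-related feature flags reported by the CPU (Linux only)."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = line.split(":", 1)[1].split()
                return {flag for flag in flags if flag.startswith("avx")}
    return set()

print(supported_avx_variants())
# e.g. {'avx', 'avx2'} on older chips, plus 'avx512f' and friends on newer ones
```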

To summarise:

The process of automatic speech recognition (ASR) makes heavy use of a mathematical software library, BLAS. Intel's MKL library includes a version of BLAS optimised for Intel processors, which we can use when running Speechmatics’ software on Intel hardware. The calculations needed for ASR run faster on Intel processors that support AVX, giving Speechmatics’ technology more headroom to enhance language accuracy. We could simply implement this and go home for tea and medals, but it's always worth being more rigorous than that. We need to test these assumptions, not just internally, but on the latest, most commonly deployed hardware.

Many customers use the two largest cloud providers, Amazon Web Services and Microsoft Azure, to deploy our ASR technology on virtual machines. These machines can vary widely in size and resources. We have to make sure we don't see poor performance on any particular machine, as well as identify the machines on which we perform best. We also had to ensure that the most common machines in each environment did not generate any unwanted resource spikes. And as we provide a cloud offering ourselves, we had to ensure any updated technology would not impact the quality of transcription our customers expect, or cause any unwanted downtime.

Implementation

During implementation, we actually found that AVX-512 was not as effective as AVX2, because the power requirements of AVX-512 prevent the processor from boosting its clock speed, an effect that has been noted under certain circumstances. As a result, the performance gains from AVX-512 are not as noticeable as those from AVX2. The team therefore configured our products to always use AVX2, as this is now a baseline on most machines. If a machine doesn't have AVX2 (a possibility for customers on older hardware) we fall back to AVX, to ensure backwards compatibility. Based on this logic, we tested the latest build of our containers and appliances with an extensive performance plan.
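Our products handle this selection internally, but as an illustration of the idea: MKL can be capped to a given instruction set via its MKL_ENABLE_INSTRUCTIONS environment variable, so a rough sketch of the "prefer AVX2, fall back to AVX" logic on Linux might look like this:

```python
import os

def cpu_flags() -> set:
    """CPU feature flags from /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()

# Prefer AVX2; fall back to plain AVX on older hardware.
if "avx2" in flags:
    os.environ["MKL_ENABLE_INSTRUCTIONS"] = "AVX2"
elif "avx" in flags:
    os.environ["MKL_ENABLE_INSTRUCTIONS"] = "AVX"
else:
    raise RuntimeError("AVX support is required to run this software")

# MKL reads the variable when it is first loaded, for example via an
# MKL-backed NumPy import that happens after this point.
```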

We explicitly tested a wide variety of Intel-based machines, with a cursory check on AMD, as we generally recommend customers run our software on Intel. We also focused on testing in AWS and Azure, with a sanity check using VMware in our own development environment, since most of our customers deploy our products on one of these major hosting platforms.

We experimented with different AWS EC2 and Azure DS virtual machine sizes and measured average memory usage, CPU consumption, and transcription performance on each one. On each virtual machine we repeatedly processed files of under five minutes, ten minutes, and one hour in length, using our latest Global English language pack compiled with MKL and an older model compiled with ATLAS for comparison.
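The shape of such a benchmark is simple. The sketch below is illustrative only: transcribe_file() stands in for a call into our transcription stack, and the file list is hypothetical.

```python
import statistics
import time

def transcribe_file(path: str) -> None:
    """Placeholder for a call into the transcription stack under test."""
    ...

# Hypothetical test set: (path, audio duration in seconds)
test_files = [
    ("short_4min.wav", 4 * 60),
    ("medium_10min.wav", 10 * 60),
    ("long_60min.wav", 60 * 60),
]

REPEATS = 5
for path, audio_seconds in test_files:
    rtfs = []
    for _ in range(REPEATS):
        start = time.monotonic()
        transcribe_file(path)
        rtfs.append((time.monotonic() - start) / audio_seconds)
    print(f"{path}: mean RTF {statistics.mean(rtfs):.2f}")
```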

What we found

Our findings were as follows:

  1. We see overall performance improvements when using MKL combined with AVX2 across files of all lengths, and also when using our Custom Dictionary feature
  2. On average across AWS and Azure, we see slightly lower memory usage from Speechmatics’ technology when using MKL, although in extreme circumstances the 99th percentile can be slightly higher
  3. Memory and CPU usage are slightly higher if you are using our Custom Dictionary feature to capture additional vocabulary items not recognised by our standard language models. However, if you are following our existing resource requirements when provisioning our on-premises software, transcription should work as normal, and you do not need to take additional action

For comparison between Azure and AWS we noted the following for batch processing:

  1. On Azure, the best average performance on batch processing for short, medium and long files was on an Azure Fs_v2. The difference between this and other Azure virtual machines is relatively small
  2. On AWS, the best average performance on batch processing for short, medium, and long files was achieved on an AWS M5.large EC2 instance. The difference between this and other AWS instances can be very large. When choosing your VM, the best performance comes from virtual machine types with Intel chipsets. Some AWS instances offer AMD instead; Speechmatics still works on these, but is not optimised for them. If you are required to use AMD for other reasons, we still see a performance improvement of roughly half that documented below

Below is some of the data for the best-performing machine types for batch, which shows up to just over a 25% improvement in RTF on the longer files. Improvements are rounded to the nearest whole number:

AWS M5.large — batch processing performance

Azure Fsv2 — batch processing performance

Real-time processing performance

Measuring real-time performance improvements is slightly trickier. There are generally two uses for real-time: streaming, where audio is transcribed as it is spoken (here RTF cannot be under 1, because transcription can only proceed as fast as the audio arrives), and transcribing existing files in real-time, for which the definition of RTF above still applies.

In the latter case, overall we see the following improvements averaged across machine types. Please note that RTF in real-time depends on multiple factors, so the choice of machine here is less important than your resource set-up. If you use hyper-threading (which increases the resources available, at a slight financial cost) you could potentially achieve even faster RTF than we display here.

What the future holds

Looking forward, we will continue to experiment and integrate new ways of improving our technology regularly, as well as talking about our latest improvements. We’re also experimenting with ways of stress-testing our technology to ensure it can scale rapidly and sustainably while ensuring the quality of service. All of this contributes to making Speechmatics’ any-context speech recognition engine a constant leader in its field.

Glossary

AMD: An alternative chip manufacturer to Intel

ASR: Automatic speech recognition. What Speechmatics does best!

ATLAS: An open-source mathematical software library providing BLAS, used by Speechmatics up until its latest release

AVX: Advanced Vector Extensions, an extension of the Intel x86 architecture since 2011, also adopted by AMD at a later date. These allow larger, more complex calculations to be performed in parallel and are required for any hardware running Speechmatics software. AVX2 and AVX-512 are now used as well

AWS: Amazon Web Services, one of the largest cloud hosting vendors on the planet

Azure: Microsoft’s hosting platform, also one of the largest platforms on the planet

BLAS: Stands for Basic Linear Algebra Subprograms, a specification that prescribes a set of low-level routines for performing common linear algebra operations originally defined as a FORTRAN library in 1979

Intel: A technology company that produces a wide range of hardware and software products integral to computing, some of which form the focus of this blog

Intel MKL: Intel Math Kernel Library. A software library responsible for BLAS that replaces ATLAS in all Speechmatics software from the July 2020 release onwards

RTF: Real-time factor. An internal metric used by Speechmatics to judge transcription speed when processing audio files
