**PLEASE NOTE: This article contains deprecated benchmarks due to the release of newer versions of these boards. Your results may differ from results below.**
Interacting with our everyday devices using spoken language is something we have been fantasising about for decades. Indeed, in many regards, voice is the epitome of human-machine symbiosis. No instruction manuals, no learning curve, no accidental tap. Just express your will, and the machine shall serve you. In our quest to make technology disappear, voice has a key role to play. But before projecting ourselves too quickly into Tony Stark-Jarvis relationships, there are a few technical hurdles that need to get out of the way.
At Snips, we are building a voice platform allowing anyone to add privacy-preserving, AI-powered voice assistants to their connected devices. But in order to perform to their full potential, such devices need appropriate hardware, and in particular, a microphone providing a consistently high-quality audio signal as input to our system.
Many of the devices that we target (e.g. home assistants, speakers, entertainment hubs, coffee machines, room control) typically sit somewhere in a room, at a certain distance away from us. And just like the remote control allows us to switch TV channels while staying on the couch, we want our voice-controlled devices to understand what we are saying without having to walk up close to them and start shouting. Furthermore, we don’t want our devices to start triggering commands when we are not explicitly asking them to do so, so we expect some tolerance to noise, music or conversations others might be having in the room. Summing up, a good microphone for our purposes would:
- allow the user to speak anywhere in the room, that is, from long distances, and from any angle
- be resistant to various kinds of noise (music, conversations, random sounds)
Finding a good microphone matching these criteria was a bit of a challenge. We started off using cheap, generic USB microphones, but they quickly turned out to be unsuited for the task. They only capture sound coming from up close, and from a specific direction. In this article, we share with you our experience using various advanced audio capture devices in the form of microphone arrays. The microphone arrays themselves do not solve the problem of understanding what we are saying to our devices, but they certainly are an essential component!
How we tested the microphones
Disclaimer: this article has been written primarily for makers. We followed the guides offered by each microphone array manufacturer. For some equipment we could not adjust the input gain so this might affect the performance of these microphones.
What to test for
Many of today’s voice-enabled devices work in the following way: they remain passive until the user pronounces a special wake word, or hotword, such as “Alexa”, “OK Google” or “Hey Snips”, which tells the device to start listening carefully to what the user is saying. Once in active listening mode, the device attempts to transcribe the audio signal into text (so-called Automatic Speech Recognition, or ASR for short), the goal being to subsequently understand what the user is asking for, and act accordingly. For instance:
What is the weather like in Copenhagen?
Hotword detection and ASR are two distinct problems, and they are usually treated independently. Good hotword detection software must have high recall and high precision: it should always detect a hotword when it is spoken, and it should absolutely not detect a hotword when it has not been spoken. Indeed, we don’t want to have to repeat ourselves when triggering an interaction, and we don’t want our device to start listening when we didn’t ask it to.
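To make the two terms concrete, here is a minimal sketch of how recall and precision could be computed from the outcome counts of a hotword test session. The `HotwordCounts` class and the numbers are hypothetical and only serve as an illustration; they are not measurements from this benchmark.

```python
from dataclasses import dataclass


@dataclass
class HotwordCounts:
    """Outcome counts from a hypothetical hotword test session."""
    true_positives: int   # hotword spoken and detected
    false_negatives: int  # hotword spoken but missed
    false_positives: int  # detector fired although no hotword was spoken


def recall(c: HotwordCounts) -> float:
    """Share of spoken hotwords that were actually detected."""
    spoken = c.true_positives + c.false_negatives
    return c.true_positives / spoken if spoken else 0.0


def precision(c: HotwordCounts) -> float:
    """Share of detections that correspond to a spoken hotword."""
    fired = c.true_positives + c.false_positives
    return c.true_positives / fired if fired else 0.0


# Illustrative numbers only, not results from this article.
counts = HotwordCounts(true_positives=23, false_negatives=2, false_positives=1)
print(f"recall={recall(counts):.2f} precision={precision(counts):.2f}")
```

Low recall means having to repeat yourself; low precision means the device wakes up when nobody asked it to.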
Similarly, the ASR processes the audio signal and translates it into the corresponding words. A spoken sentence is nothing more than a sequence of phonemes. Therefore a fair bit of the complexity lies in capturing a reasonably good audio signal so that the ASR can differentiate between phonemes and construct the most accurate text sentence. If noise levels are too high and the audio saturates, the ASR could misunderstand a word.
There are various factors which affect the quality of hotword detection and ASR. Some words are simply not suitable as hotwords, for instance if they are difficult to pronounce (“rural”) and hence difficult to detect, or if they are too close to words you would commonly use (“hey”), as they would constantly trigger the device when you are having a conversation nearby. Good acoustic models are trained so as to be robust to ambient noise, sound levels, variation in pronunciations and more. However, if the microphone can consistently provide a clean audio signal regardless of the situation, it will drastically improve the performance of hotword detection and ASR, and hence of the end user experience. That is why microphone arrays usually feature a dedicated chip, a Digital Signal Processor (DSP), for performing things like noise reduction, echo cancellation and beamforming.
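As an illustration of what beamforming means, here is a minimal sketch of the textbook delay-and-sum approach on synthetic data. This is an assumption for illustration only: we do not know which algorithms the DSPs on these boards actually run, and the scene below (tone frequency, delays, noise level) is made up.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   samples by which the target sound lags on each microphone.
    """
    n = len(channels[0]) - max(delays)
    out = []
    for t in range(n):
        # Read each channel 'delay' samples later so the target lines up.
        aligned = [ch[t + d] for ch, d in zip(channels, delays)]
        out.append(sum(aligned) / len(channels))
    return out


def noise_power(estimate, clean):
    """Mean squared difference between an estimate and the clean signal."""
    return sum((e - c) ** 2 for e, c in zip(estimate, clean)) / len(estimate)


# Synthetic scene (hypothetical): a 440 Hz tone reaches 4 mics with
# different lags, and each mic adds its own independent noise.
rng = random.Random(0)
rate, length = 16000, 2000
delays = [0, 3, 6, 9]
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(length)]
channels = []
for d in delays:
    channels.append([(tone[t - d] if t >= d else 0.0) + rng.gauss(0, 0.5)
                     for t in range(length)])

beam = delay_and_sum(channels, delays)
clean = tone[:len(beam)]
single = channels[0][:len(beam)]
print("noise power, one mic   :", round(noise_power(single, clean), 3))
print("noise power, 4-mic beam:", round(noise_power(beam, clean), 3))
```

Because the target sound adds up coherently once the channels are aligned while the independent noise does not, the averaged output is cleaner than any single microphone.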
Objectively benchmarking microphone arrays is not straightforward, as results will of course depend on the underlying acoustic models. For instance, we started our experiments using the default, thoroughly trained, “Snowboy” hotword from Kitt.ai. Results varied a lot across the microphones. Some performed well, others very poorly, even in a silent setting. In the meantime, we had built our own hotword detection engine, and trained it with a custom hotword, “Hey Snips”. We performed the experiments again, and to our surprise, all microphone arrays were now performing exceptionally well. So in fact, the first experiments were not representative of the quality of the microphones, but rather of some deficiencies inherent to Snowboy.
The following microphone arrays were tested:
- Seeed ReSpeaker Mic Array
- Conexant 4-Mic Development Kit
- Microsemi AcuEdge
- MATRIX Creator
- MiniDSP UMA-8
- PlayStation Eye
And for comparison, we also added a single USB microphone, the Tonor Stereo Condenser Microphone.
We created a test bench with the seven devices, aligned in a row. Each device was connected to a Raspberry Pi 3. All the Raspberries have a freshly installed Raspbian Jessie Lite image. We used the Snips Voice Platform, which includes hotword detection and ASR, and proceeded with the documented configuration for each mic (see below for references).
We dedicated a quiet room in the office to this. The room is about 20 square meters (210 sq ft) in size, and features some dampening materials such as a lounge chair, two bean bags, as well as a floor carpet. To make sure that every microphone array was recording at the same level, we manually fixed the gain of each array at around 80% using
The real recording dB levels vary depending on the mic settings (the gain of the PS3 Eye can’t be changed, for instance). To ensure identical conditions, voice queries were prerecorded, and subsequently played from a speaker, in the middle of the room, at a fixed volume corresponding to that of a human speaking normally.
Experiment 1: We measured the rate at which a hotword was successfully detected as distance increased between the microphone and the speaker. Distance ranged from 0.5 meters (1.6 ft) to 5 meters (16 ft), and for each distance, the hotword was repeated 25 times at 3 second intervals.
Experiment 2: We measured the rate at which a hotword was successfully detected as the incidence angle (tilt) at which the sound reached the microphones was varied. The speaker was fixed, and the bench was gradually elevated in a radial movement, keeping the distance to the speaker fixed at 1.5 meters (5 ft).
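With 25 repetitions per condition, each detection rate carries some statistical uncertainty. The sketch below shows one way to tally a success rate and attach an approximate 95% Wilson score interval to it. The interval is our suggestion for readers reproducing the experiments rather than something computed for this article, and the tallies are hypothetical.

```python
import math


def detection_rate(detected: int, trials: int) -> float:
    """Fraction of played hotwords that were detected."""
    return detected / trials


def wilson_interval(detected: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a success proportion."""
    p = detected / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    spread = math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    half = z * spread / denom
    return centre - half, centre + half


# Hypothetical tallies: distance in meters -> hotwords detected out of 25.
tallies = {0.5: 25, 2.0: 24, 5.0: 22}
for distance, detected in tallies.items():
    low, high = wilson_interval(detected, 25)
    print(f"{distance} m: {detection_rate(detected, 25):.0%} "
          f"(95% interval {low:.0%} to {high:.0%})")
```

With so few trials the intervals are wide, so small differences between two microphones at a given distance should be read with caution.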
Results of the experiments
Globally, all the microphone arrays we benchmarked performed well, both for hotword detection and for ASR, at varying distances and tilts. The DSPs all feature excellent noise cancellation, and as a result, adding white noise did not have a significant effect on success rate. Performance took a small hit with background music, but with some common optimisations, as discussed below, we can still obtain excellent results.
Hotword detection rate: effect of distance
As distance is increased, the success rate unsurprisingly drops slightly for all microphone arrays, but performance remains very high even from five meters away. The generic USB mic is the exception: its performance deteriorates rapidly with distance. With background music, this pattern is blurred.
Hotword detection rate: effect of tilt
As we see, tilt does not have a significant effect on performance. Microphone arrays deliver on their promise: good audio capture from all spatial directions.
Seeed ReSpeaker Mic Array
$79 on Seeed Studio
The ReSpeaker features an XMOS XVSM-2000 DSP chip, providing excellent performance for wake word detection: 98% success rate in both a silent room and a room with white noise, indicating excellent noise reduction capabilities.
It offers better performance than the MiniDSP despite the fact that they use the same chip. This is most likely due to the latest firmware update. ReSpeaker support is also better: they have an active GitHub repo providing regular upgrades and fixes, and their Getting Started Guide is particularly helpful.
The small form factor is a bonus, and so is the LED ring, an important component in a home assistant because it provides essential visual cues that help the interaction. Furthermore, unlike the MATRIX Creator or the Conexant, both the microphones and the LEDs can be accessed with a single USB cable, rather than through the more cumbersome GPIO ports. This allows for more versatile configurations, and reduces the amount of space required.
The ReSpeaker is surprisingly easy to set up. It is immediately detected by the Raspberry Pi, and the .asoundrc configuration is straightforward.
It also works out of the box on Windows, macOS and Linux machines, which is neat!
Unfortunately, we discovered that the firmware needs to be updated: some of the ReSpeakers we received had mics adding noise to the audio, resulting in poor performance.
MATRIX Creator
$99 on the MATRIX Store
The MATRIX Creator is much more than a microphone array. In fact, with its temperature, ultraviolet, humidity and pressure sensors, its gyroscope, accelerometer, magnetometer and NFC chip, it is more of a general-purpose IoT prototyping tool.
The MATRIX Creator features an 8-microphone MEMS array and an ARM Cortex M3 microcontroller. This is a very powerful setup. Not only does the board manage to capture audio in very high quality, it also allows you to control a bunch of microcontrollers, courtesy of the ARM Cortex chip. It is basically a dream come true for any maker wanting to build a robot.
It attaches to the Raspberry Pi via the GPIO port, and is quite big compared to the ReSpeaker.
The main issue we encountered is with the MATRIX driver installed on the Raspberry Pi. It requires a specific and complex .asoundrc configuration. It merges the audio stream outputs of each mic into a single logical ALSA output. Alas, it doesn’t work with our platform. It also lacks support for PortAudio, since it is not viewed as a single hardware audio device. We’ve been in contact with the MATRIX support team, who are super helpful and are working hard to provide a solution to this specific issue.
The main strength of the Creator is its configurability. For instance, you can specify each microphone’s polar angle, or you can implement your own noise cancellation algorithm and beamforming.
The MATRIX team has been working on a new microphone array, the MATRIX Voice, which is a cheaper version of the Creator dedicated to audio capture. We are looking forward to receiving it in late summer, and adding it to this benchmark.
Conexant 4-mic Development Kit
$349 from Arrow
The Conexant features some impressive specs: it has the same chip as the Amazon Echo devices, preloaded with the Echo hotword.
For setting it up, we followed the Amazon AVS tutorial, and it worked well with Alexa skills. The “Hey Alexa” hotword works superbly, but this is because it is coded directly in the chip.
And this is where the issues start. Conexant have an exclusive deal with Amazon, and the device is commercialised entirely with the aim of getting developers to ship Amazon Alexa on their device.
If you want to use the Conexant for what it is, namely a microphone array, without the whole Amazon Alexa package, it gets tricky. First of all, it only works on a Raspberry Pi 2. Second, you need to patch the Raspbian OS, and rebuild the kernel on this Pi, which takes a long time (several hours).
It won’t easily let you work with the onboard LEDs. We ordered several models, and some of them had to be returned due to faulty LEDs and mics. The documentation is sparse, and the two boards are quite bulky. Connections between the two boards were already made when the kit arrived, and there is no explanation of what they do.
Note: the chip used by the Conexant, the Conexant CX20924, is available standalone, which is an option if you want to create your own microphone array.
MiniDSP UMA-8
$95 from MiniDSP
The MiniDSP features the same XMOS XVSM-2000 chip as the ReSpeaker. However, it did not reach the same level of hotword detection performance as the ReSpeaker, which is probably due to the firmware.
Performance is still very good, with 94% of hotwords being detected in a silent room. In a room with background music, performance decreased significantly, with only about 68% of hotwords detected.
The MiniDSP also features a GUI for controlling various advanced parameters in real time. It is only available for Windows, but we wish other manufacturers went down the same road, as it is a powerful way to enhance the performance of the MiniDSP.
If you want to integrate this board inside an enclosure, having the ability to fine-tune these parameters is a big win!
Microsemi AcuEdge ZLK38AVS
$95 from Arrow
The Microsemi is a linear array with 3 mics designed to be easily integrated with Amazon Alexa.
The mics are located on the lower side of the board. This can be surprising, and although it doesn’t affect performance, you’ll have to keep it in mind if you want to put your Pi in an enclosure.
No specific setup is required: plug it into the GPIO port and it will be recognized as a recording device by the Pi.
Performance-wise, it’s quite similar to the MiniDSP, though it doesn’t have as many settings to play with. It has been engineered specifically for Alexa and features the Sensory chip for the wake-word engine.
PlayStation 3 Eye
$7 on Amazon
The PlayStation 3 Eye was a pleasant surprise in terms of performance. It sports four microphones, and works out of the box, even on a Raspberry Pi, via a simple USB connection. It is ridiculously cheap compared to the alternatives, which makes it ideal for rough prototypes.
Performance is excellent, with hotword success rates similar to the rest of the lineup, staying solidly above 90% at all distances and tilt angles in a silent and white-noise environment, and dropping only a few percent when music is playing.
Unfortunately, this device has been discontinued by Sony, and no information is available regarding internal specs. Documentation and code samples are non-existent, which limits its potential to become a serious candidate for a microphone array. It is nevertheless an excellent choice for hacky setups and prototypes where the final product does not need to be sleek, compact and polished.
Tonor Stereo Condenser Microphone
$14 on Amazon
For the sake of comparison with single microphones, we included the Tonor in the benchmark.
As a single microphone, it works well. It is plug-and-play, and audio capture is fine, but only when the user is speaking directly into it. The whole point of microphone arrays is to be able to reliably capture audio from a distance, and with the Tonor, we can observe this difference sharply. As soon as we are more than 0.5 meters away, performance deteriorates rapidly, and the microphone becomes pretty much useless at 3 meters and beyond.
We do have a few Tonors in the office, as they are cheap, robust, and work perfectly when used at a desk with a Raspberry Pi.
Benchmarking microphone arrays has been an extremely useful experiment for us to perform, and our initial goal of finding an excellent far-field audio capture solution has been achieved. Globally, all the microphone arrays we tested perform well, both in a silent setting and with white noise. Performance is also acceptable when background music is playing, although there is room for improvement. This can be achieved in various ways. For instance, the microphone array can be placed so that it captures as little signal as possible from the music source, which is why the speaker in the Amazon Echo or Google Home sits in a plane perpendicular to the microphone array. Alternatively, if the music source can be monitored, we can subtract the music signal from the recorded signal.
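To give an idea of what subtracting a monitored music signal can look like, here is a minimal sketch using a normalised least-mean-squares (NLMS) adaptive filter, a common technique for this kind of reference cancellation. Choosing NLMS is our assumption for the sake of illustration; it is not necessarily what any of the boards above implement. The room response, signal levels and the `nlms_cancel` helper are all made up.

```python
import math
import random


def nlms_cancel(mic, reference, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptively filtered copy of `reference` from `mic`.

    Returns the residual signal, i.e. what is left of the microphone
    signal once the estimated music contribution has been removed.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for m, r in zip(mic, reference):
        history = [r] + history[:-1]
        estimate = sum(w * x for w, x in zip(weights, history))
        error = m - estimate  # mic signal minus the estimated music
        norm = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / norm
                   for w, x in zip(weights, history)]
        residual.append(error)
    return residual


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


# Synthetic scene (hypothetical): "music" is noise that reaches the mic
# through a short made-up room response; "voice" is a quiet tone.
rng = random.Random(1)
n = 4000
music = [rng.uniform(-1, 1) for _ in range(n)]
room = [0.6, 0.3, -0.2]
at_mic = [sum(h * music[t - k] for k, h in enumerate(room) if t >= k)
          for t in range(n)]
voice = [0.05 * math.sin(2 * math.pi * 300 * t / 16000) for t in range(n)]
mic = [a + v for a, v in zip(at_mic, voice)]

cleaned = nlms_cancel(mic, music)
# Compare power on the second half, after the filter has had time to adapt.
print("power before:", round(power(mic[2000:]), 4))
print("power after :", round(power(cleaned[2000:]), 4))
```

The filter learns how the music is transformed on its way to the microphone, so most of what remains after subtraction is the voice.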
Despite similar performance, our choice of microphone array is the ReSpeaker. It is relatively cheap, has a great form factor and is easy to set up, with the caveat that the firmware needs to be updated on arrival.
The runner-up to the ReSpeaker is the MiniDSP. Its extensive parameter fine-tuning is a big plus, and it features the same XMOS chip as the ReSpeaker. Even though their performance differs, we think that we can achieve similar results with the MiniDSP after fine-tuning its parameters. For this benchmark, however, we wanted the easiest plug & play experience possible, and in that respect the MiniDSP falls short.
Unfortunately, we were not able to make the MATRIX Creator work with our setup, as it outputs audio as a file rather than an output device, and our hotword detector and ASR, which run in a Docker container, cannot access the file. This is not an issue with the MATRIX per se, and we are figuring out ways to circumvent this. We’ve been able to test the MATRIX solely for audio capture, and it looks very promising. The MATRIX team is also preparing a new, dedicated microphone array, the MATRIX Voice. We are eagerly waiting to test it and include it in this benchmark.
We hope this guide is helpful to the community of makers trying to build voice-powered assistants. We’d love to have your feedback, and hear about your experiments using microphone arrays. If you have another microphone that you would like us to include in our benchmark, let us know! We also like to feature cool hacks, so don’t hesitate to reach out on Discord if you have something worth sharing.
If you are interested in building your own, privacy-enabled voice assistants, we have some tutorials that you might find interesting, for instance for building a voice-enabled speaker, or an assistant to control your home IoT.