Enhancing Voice Recognition with a Matrix Microphone Array

Kalonji Bankole
Aug 17, 2017

One of my previous posts guides readers through configuring a Raspberry Pi to serve as a voice-controlled home automation hub. That implementation uses a simple USB microphone, which works decently enough, but is noticeably less accurate when called from a distance or in noisy environments.

To address this problem, I investigated the potential benefits of adding a microphone array to the system. A microphone array consists of multiple microphones that capture audio simultaneously; as the audio is captured, digital signal processing (DSP) algorithms can be applied to cancel out echoes, determine the direction of individual sound sources, reduce background noise, and so on. Microphone arrays are providing a much needed boost to the performance of voice-enabled devices, as these processing methods can improve the accuracy of speech-to-text services.

There are a few microphone arrays on the market that are directly compatible with a Raspberry Pi, such as the UMA-8, ReSpeaker, and Matrix Creator. I ultimately chose the Matrix Creator because it offers quite a few additional sensors and components: temperature, pressure, UV, and IR sensors, an accelerometer, a gyroscope, Zigbee support, an FPGA, and more.

To get started, I mounted the Matrix board to the top of the Raspberry Pi’s GPIO pins as seen in the picture below.

The next steps were to clone the hardware abstraction layer repository, and follow the README steps to install dependencies and build the demos. This repository includes implementations for beamforming, computing direction of arrival, collecting samples from the microphone array, and querying the board’s sensors.

After the dependencies were installed and the code samples were built, I was able to record from the microphone array with ./matrix-creator-hal/build/demos/micarray_recorder . This command records 10 seconds of audio from each channel and stores the results in 9 separate raw audio files named mic_16000_s16le_channel_${channel_number}.raw . Eight of the generated raw files correspond to the individual mic array channels, and the ninth contains the beamforming result. The raw files can be converted to a more widely supported format like WAV or FLAC with the following command.

sox -r 16000 -c 1 -e signed -b 16 mic_16000_s16le_channel_8.raw beamforming_result.wav
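
If you’d rather skip sox, the same conversion can be done with a few lines of Python. This is just an illustrative sketch, assuming NumPy and SciPy are available: it reads the 16 kHz, mono, signed 16-bit little-endian raw capture and writes it back out as a WAV file.

import numpy as np
from scipy.io import wavfile

# Read the raw capture (16 kHz, mono, signed 16-bit little-endian samples)
samples = np.fromfile("mic_16000_s16le_channel_8.raw", dtype="<i2")

# Write the samples out as a standard WAV file at the same sample rate
wavfile.write("beamforming_result.wav", 16000, samples)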

After confirming that the mic array was working, the next steps were to configure a wakeword and see how the audio quality compared to a generic USB microphone. A wakeword service listens for a specific phrase, like “Alexa” for the Amazon Echo or “Ok Google” for Google Home. Once the phrase is detected, the microphone begins recording the user’s voice and forwards the audio to a transcription service. I used a service named Snowboy, which takes 3 recordings of a custom wakeword and generates a model that can be used offline. After transferring my model to the Raspberry Pi, I was able to verify it worked with the Python example here.
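
For reference, the detection loop looks roughly like the sketch below, adapted from Snowboy’s bundled Python demo. The model filename is a placeholder for whatever model you generated, and the callback is where command recording would be kicked off.

import signal
import snowboydecoder

interrupted = False

def signal_handler(sig, frame):
    # Allow Ctrl+C to stop the detection loop cleanly
    global interrupted
    interrupted = True

def detected_callback():
    # This is where recording / transcribing the command would start
    print("Wakeword detected")

signal.signal(signal.SIGINT, signal_handler)

# "my_wakeword.pmdl" is a placeholder for the model generated by Snowboy
detector = snowboydecoder.HotwordDetector("my_wakeword.pmdl", sensitivity=0.5)
detector.start(detected_callback=detected_callback,
               interrupt_check=lambda: interrupted,
               sleep_time=0.03)
detector.terminate()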

After configuring the wakeword, I went on to compare the audio quality of the mic array and the USB mic by recording audio on both simultaneously. After the audio was recorded, it was uploaded to Watson’s Speech to Text service, and the transcription results and their confidence scores were returned as a JSON object (example below).

## Record and transcribe from USB microphone
$ rec /tmp/test-usb.wav
$ curl -u "{username}":"{password}" --header "Content-Type: audio/wav" --data-binary "@/tmp/test-usb.wav" "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
## Record from Mic Array
$ ~/matrix-creator-hal/build/demos/micarray_recorder
## Convert Raw to Wav
$ sox -r 16000 -c 1 -e signed -b 16 mic_16000_s16le_channel_8.raw beamforming_result.wav
## Transcribe mic array recording
$ curl -u "{username}":"{password}" --header "Content-Type: audio/wav" --data-binary "@beamforming_result.wav" "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
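
The same request can also be made from Python. The sketch below assumes the requests library is installed; it posts a WAV file to the same endpoint used in the curl commands above and prints each transcript with its confidence from the JSON response (the results/alternatives/transcript/confidence fields follow Watson’s Speech to Text response format).

import requests

STT_URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

def transcribe(wav_path, username, password):
    # Post the WAV file to Watson Speech to Text using HTTP basic auth
    with open(wav_path, "rb") as audio:
        response = requests.post(STT_URL,
                                 auth=(username, password),
                                 headers={"Content-Type": "audio/wav"},
                                 data=audio)
    response.raise_for_status()
    # Each result holds one or more alternatives, each with a transcript
    # and a confidence score
    for result in response.json().get("results", []):
        best = result["alternatives"][0]
        print(best["transcript"], best.get("confidence"))

transcribe("beamforming_result.wav", "{username}", "{password}")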

To test the performance, I repeatedly spoke a sample phrase, “Turn on the light”, and compared the speech-to-text results from the audio captured by each microphone. I found that the USB microphone was somewhat accurate at short distances (no more than a few feet), but had difficulty detecting the wakeword beyond about 5 ft. On average, the confidence of the transcription results was much lower with the USB microphone.

The microphone array was much more responsive to the wakeword and could be triggered from more than 20 ft away. The audio it captured also sounded louder and clearer, with minimal background noise. In total, I ran 40 tests at varying distances from the microphones (1, 5, 10, and 15 ft).

The script in the repository linked directly below can be used to install the Matrix Creator dependencies.

How does this work?

If we take a single microphone, we’ll see that it’s equally sensitive to sound coming from all directions. In our case, we’d rather our recording device listen in the direction of the user while they’re speaking commands and block out background noise coming from other sound sources. To listen in a particular direction without physically pointing the microphone towards the user, we can create a microphone array by taking multiple microphones and arranging them in some symmetrical pattern (linear, circular, spherical, etc.). Once the microphones are arranged, we’ll need to take note of the known location of each microphone in relation to some reference point, which in our case is the center of the board.

Matrix Creator Mic Locations (source)

Once we have our array set up, we’ll need a system to run DSP algorithms and make sense of the incoming audio. One of the most commonly used methods is “beamforming”, which focuses on signals coming from a specific direction while suppressing signals coming from other directions.

Omnidirectional vs Focused listening pattern (source)

“Delay and sum” beamforming is one of the most commonly used methods because it’s relatively straightforward to implement compared to other beamforming solutions.

To determine which location the microphone array’s beam should “steer” towards, the system first needs a direction of arrival algorithm. The implementation in the matrix_hal repository returns two values: a “polar” angle and an “azimuthal” angle.

Diagram of spherical coordinates (source)
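
To make that concrete, here is a rough sketch of how those two angles, together with the known microphone positions, translate into per-microphone delays. The coordinates in the example are made up for illustration and are not the actual Matrix Creator geometry.

import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def steering_delays(mic_positions, azimuth, polar):
    # Unit vector pointing from the array centre towards the sound source,
    # built from the azimuthal and polar angles (spherical coordinates)
    direction = np.array([np.sin(polar) * np.cos(azimuth),
                          np.sin(polar) * np.sin(azimuth),
                          np.cos(polar)])
    # Relative arrival time at each microphone for a plane wave from that
    # direction: mics with a larger projection onto the arrival direction
    # are closer to the source and hear the wavefront earlier
    arrival = -(mic_positions @ direction) / SPEED_OF_SOUND
    return arrival - arrival.min()  # offsets relative to the earliest mic

# Made-up microphone coordinates (metres, relative to the board centre)
mics = np.array([[0.05, 0.0, 0.0], [-0.05, 0.0, 0.0],
                 [0.0, 0.05, 0.0], [0.0, -0.05, 0.0]])
print(steering_delays(mics, azimuth=np.pi / 4, polar=np.pi / 2))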

The amount of time it takes for the user’s voice to travel to each microphone in the array depends on each microphone’s location. So if we view each incoming signal on an x-y axis (x = time, y = amplitude), the signals will appear similar, but with different times of arrival along the x-axis. Delay-and-sum beamforming works by time-shifting the audio observed at each channel to compensate for the propagation delay. The shifted signals are then added together and averaged, which generally results in an improved signal-to-noise ratio. Below is a visualization of the audio recorded at each channel of the Matrix microphone array (left side) and the beamforming result (right side).

Matrix Creator Beamforming Visualization
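
For illustration, here is a stripped-down delay-and-sum sketch in Python. It is not the Matrix Creator implementation (that is linked below); it simply loads the raw channel files produced by micarray_recorder and assumes the per-microphone delays have already been estimated, for example with the steering_delays() sketch above.

import numpy as np

SAMPLE_RATE = 16000

def delay_and_sum(channels, delays, sample_rate=SAMPLE_RATE):
    # channels: (num_mics, num_samples) array; delays: non-negative
    # per-microphone arrival offsets in seconds
    num_samples = channels.shape[1]
    shifts = np.round(np.asarray(delays) * sample_rate).astype(int)
    aligned = np.zeros_like(channels, dtype=float)
    for i, shift in enumerate(shifts):
        # Advance each channel by its delay so every copy of the target
        # signal lines up in time; the tail is zero-padded
        aligned[i, :num_samples - shift] = channels[i, shift:]
    # Averaging the aligned channels: the speech adds up coherently while
    # uncorrelated noise partially cancels, improving the signal-to-noise ratio
    return aligned.mean(axis=0)

# Load the eight raw channel captures (16 kHz, signed 16-bit little-endian)
raw = [np.fromfile("mic_16000_s16le_channel_%d.raw" % ch, dtype="<i2")
       for ch in range(8)]
length = min(len(r) for r in raw)
channels = np.stack([r[:length] for r in raw]).astype(float)

# Zero delays reduce this to a plain average; real delays would come from
# the direction-of-arrival step
result = delay_and_sum(channels, np.zeros(8))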

The beamforming implementation can be found in the Matrix Creator Github repository here.

Next Steps:

  • Use alternative open source implementations, such as those in the FRIDA and ReSpeaker GitHub repositories.
  • Investigate AI solutions aimed at the “cocktail party problem”: singling out a specific voice in a noisy environment.


Kalonji Bankole is a developer advocate on IBM’s emerging technology team. Day to day, he works with open technologies such as Ansible, MQTT, and OpenWhisk.