Implementing a Research Prototype of a Next-generation Voice Assistant
By Vikranth Srivatsa, Bhuvan Basireddy, and Nathan Malkin
This article provides background and architectural details for a research prototype of a new type of smart speaker, which has been newly released to the research and open-source communities. The system described in this article can be found on GitHub.
Smart speakers trade privacy for quick access to information. Although companies promise to protect users’ data, there is no guarantee about how much sensitive information these devices can access, since the microphone can always be on. Currently, smart speakers limit themselves to recording data after hearing “wake words” (such as “Alexa” or “Hey, Google”), but to provide a more complete experience, companies may start to continuously record users.
For users that enjoy the convenience of smart speakers and other “always listening” devices, restrictions on the type and availability of data should be standardized and protected. For this purpose, we describe below a passive listening system architecture and an open-source implementation of a smart speaker that allows users to restrict access and analysis of data.
The market for smart speakers is growing rapidly, with the largest companies, such as Amazon and Google, dedicating considerable resources to their research and development. The current market revenue from smart speakers such as Google Nest and Amazon Echo is estimated to be 15 to 30 billion dollars, according to a report by Statista. Companies also make money from smart assistant software that is integrated in many popular devices.
Despite their popularity, smart speakers have raised a variety of privacy concerns. Most of these speakers are located in private spaces, such as homes, creating the potential for sensitive data to be collected and misused by the manufacturer. Even if the data is stored only on the device, hardware vulnerabilities can expose the device to further exploits. For example, in 2020, researchers from the University of Michigan found that firing a laser at an Alexa device could cause it to activate.
Current voice assistants, such as Siri, Google Assistant, and Alexa, work by using wake words that activate the recording and processing engine. We believe that, in the future, we may see assistants expand beyond wake words into “passive listening” (systems that constantly listen and process audio). Compared to a wake word, passive listening devices can record context. In a conversation, there is a lot of contextual information provided that could be used to improve the user experience without explicitly activating the smart speaker. For example, an entire conversation about movies or restaurants could precede the ultimate query to buy tickets or book a table. This information could be recorded and processed by smart speakers to provide more contextually salient responses to users’ commands and questions.
The shift away from wake words to passive listening creates new privacy concerns. Currently, most classification and analysis of user speech require audio to be sent and processed in the cloud (see policies for Alexa and Google). If this is extended to passive listening, a constant stream of possibly sensitive audio could be sent to remote servers. Even if the cloud computation is processed on encrypted information, third-party developers (such as developers of Alexa skills, which provide extra functionality not offered by the cloud provider) require access to the audio, which can cause privacy concerns. Any audio clip has the potential to be sensitive, as it may include arguments between people, private health data, or other personal information.
At present, wake words are usually detected in a custom-built chip that is designed to improve privacy and efficiently detect when a word is pronounced. The machine learning models for these wake words are usually created by repeatedly pronouncing phrases, converting and cleaning these with signal processing, and then using a prediction system to determine the presence of a wake word. However, a wake word wouldn’t be practical in a model that requires passive listening, a system that constantly records and processes in the background and doesn’t require active user input.
A full architecture of the different skills (i.e., applications) and smart speaker pipeline is required to fully study the effects of passive listening systems. In this work, we: 1) provide a secure architecture for researching these next-generation smart speakers; 2) develop passive listening skills to study the effects of privacy restrictions; 3) provide an overview of tradeoffs with the current open-source smart speaker technology; and 4) provide an open-source framework to extend and test.
Building an open-source smart speaker requires tremendous engineering effort, particularly if aiming to replicate the performance of products designed by companies such as Google, Amazon, and Apple, as these companies have far greater access to data and processing power. Open-source, easy-to-use systems are necessary for researchers to study and analyze the effects of voice assistants. For our use case, the architecture needs to be robust enough to insert privacy restrictions into the pipeline.
During our exploration, we considered extending some pre-existing smart speaker systems and architectures. Mycroft, for example, is an open-source, privacy-based smart speaker that supports skills and wake word detection. The issue with this architecture, as with other open-source smart speakers, is that it is only activated by wake words. The Mycroft skill system, for example, only activates skills if the phrases match a list of words. But this makes it challenging to implement a passive listening skill, which would have a more complicated input than a pre-written list of phrases. For example, a passive listening-based skill might try to detect phrases that contain grocery items, but applied to varying context and locations, which could be used to create a shopping list for purchasing groceries later.
The architecture of our voice assistant is split into two main components: 1) the server with the client, and 2) skills (independent applications that run on the processed text/audio from the client). The client handles the processing of audio information and converting it into text phrases to be processed. This information is then sent to the skills for processing and returning the results. Each part of this pipeline is described further below.
The general pipeline for a smart assistant starts with a microphone on a client device that continuously records the environment. From the user’s raw audio, speech is then detected using a voice activity detection (VAD) system, which separates the audio into phrases. This is then sent into a transcription system. Then, these transcribed phrases, and possibly audio chunks, are sent to a local or cloud-hosted server that has different passive listening skills enabled.
The different skills can process the text and return results that can be saved in a database stored on the client. The text processing could involve classifying the data or keeping track of certain details; for example, a skill could record the location of every restaurant mentioned. A website that runs on the client device, accessible via an endpoint on the host, displays the list of recordings made, a dashboard of the current skills’ output, and a list of permission options. The dashboard also sends queries to the skills to get information to present to the user. Because the skills run as separate services, they can respond asynchronously, at different times and rates depending on the processing involved.
With the front-end user client, the user can have a standardized way of enabling and disabling permissions. This is also useful for users to audit the conversations by viewing what was recorded in a web-based graphical user interface, as well as what skills, such as classification, occurred on their conversations. With our setup, users can also choose which skills are activated.
We now cover some of the architecture details in greater depth.
Architecture: Audio Processing
Audio processing is the first step in an audio classification pipeline. The audio recorded from the microphone may have issues, such as background noise. There are many signal processing approaches for handling this, such as applying a Kalman filter to the signal. Other issues may occur if multiple people are talking in the same conversation and identifying the relevant speaker is difficult (the “cocktail party problem”). The current version of our voice assistant prototype does not distinguish voices from different speakers.
There are other approaches to finding the audio the assistant cares about. One is audio fingerprinting, which involves recording the audio clip, computing the spectrum of the audio clip, and then peak matching. This is similar to approaches used by Shazam, a popular song detection app. Audio fingerprinting was not integrated into our pipeline, but a system was implemented during testing for detecting clips using the same scheme.
With our current architecture, we use voice activity detection, which is commonly used in popular voice applications like Discord. Options for voice detection include machine learning-based systems (which make predictions based on previous clips of audio data), human frequency detection (as the human voice normally falls within a known frequency range), and energy indicators (which measure the absolute amplitude of certain audio ranges). These techniques differ in how aggressively they classify audio as “speech”. The tool we decided to use is the webrtcvad Python library, which uses Gaussian mixture models and hypothesis testing to determine if a clip of audio is speech. (The detector behind webrtcvad was originally developed by Google for WebRTC and was later extracted from Chromium.)
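To make the idea of splitting audio into phrases concrete, here is a minimal sketch of the simplest option listed above, an energy indicator. It is not the Gaussian mixture approach used by the library in our pipeline, and the function names, frame length, threshold, and silence tolerance are all illustrative assumptions rather than values taken from the prototype.

```python
import math

def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0

def split_phrases(samples, frame_len=160, threshold=500.0, max_silence=3):
    """Group consecutive loud frames into phrases.

    Returns a list of (start_sample, end_sample) pairs. A phrase is closed
    once more than `max_silence` consecutive frames fall below `threshold`.
    All default values are assumptions chosen for illustration.
    """
    phrases = []
    start = None          # start index of the phrase being built, if any
    last_voiced_end = 0   # end index of the most recent loud frame
    silent = 0            # consecutive quiet frames seen inside a phrase
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if rms(frame) >= threshold:
            if start is None:
                start = i
            last_voiced_end = i + len(frame)
            silent = 0
        elif start is not None:
            silent += 1
            if silent > max_silence:
                phrases.append((start, last_voiced_end))
                start = None
                silent = 0
    if start is not None:
        phrases.append((start, last_voiced_end))
    return phrases
```

Raising the threshold or lowering the silence tolerance makes the detector more aggressive about discarding audio, which is the kind of tradeoff described above.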
One issue with building a passive listening system is the lack of good speech-to-text transcription. An error in the transcription phase can propagate down the smart speaker pipeline. Companies like Google and Apple have large R&D teams exclusively dedicated to the transcription and classification phases. Although using a cloud-based transcription service would probably be more accurate and faster, it would require potentially sensitive audio information to be sent and processed by the cloud system, which could be misused and logged.
The open-source alternatives we considered were DeepSpeech (which is trained on LibriSpeech, a large English corpus) and Wav2Letter. DeepSpeech is a Mozilla project based on a recurrent neural network that is trained on spectrograms of the audio and maps to an n-gram language model. Wav2Letter is a Facebook-supported research project that uses a modified convolutional network to predict the text. From our experiments, we found that DeepSpeech struggled more with faster rates of speech, which might occur naturally in a conversation. Wav2Letter worked better, getting closer to the true text, but failed at times to distinguish similar words or phrases. We tested a number of other transcription systems, such as Kaldi and Vosk, but we were unable to achieve the quality we wanted in our effort to develop a smart speaker pipeline comparable to those used in commercial systems.
Our alternate transcription system leverages Google Chrome’s built-in accessibility feature, which provides transcription. This uses Google models downloaded onto the device, but runs fully locally. The direct Google Speech to Text web API is not an option, as it requires access to the internet. Instead, the recorded WAV files are opened in the browser, and the overlaid text is then extracted with traditional optical character recognition (OCR). OCR is a far more mature problem than speech-to-text. We tried a few offline OCR libraries, such as Tesseract, but found the EasyOCR library to be the most accurate.
We compute the bounding box on the Chrome webpage and take repeated screenshots of the area. However, sampling the same text repeatedly raises an issue: because it is a real-time system, the transcription feature tends to correct the transcription over time. To account for this, we oversample the text to make sure no information is lost. Then, we repeatedly merge common sub-phrases to deduplicate the sampled text.
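The deduplication step can be illustrated with a short sketch. It assumes each screenshot yields a string of words and that consecutive samples overlap at their boundary; the function name is hypothetical, and the exact-match rule is a simplification, since text that the transcriber later corrects would need fuzzy matching.

```python
def merge_snapshots(snapshots):
    """Merge successive OCR samples of a scrolling caption into one transcript.

    For each new sample, find the longest suffix of the transcript so far
    that equals a prefix of the sample, and append only the remaining words.
    """
    merged = []
    for text in snapshots:
        words = text.split()
        overlap = 0
        # Try the longest possible overlap first, then shrink.
        for k in range(min(len(merged), len(words)), 0, -1):
            if merged[-k:] == words[:k]:
                overlap = k
                break
        merged.extend(words[overlap:])
    return " ".join(merged)


if __name__ == "__main__":
    samples = ["book a table", "a table for two", "for two at seven"]
    print(merge_snapshots(samples))
```

Oversampling makes this approach more reliable, because frequent samples are more likely to share a boundary of identical words.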
Classification and Sample Passive Listening Skills
To communicate with the skills’ backend, we created a system to add skills to the smart assistant. Passive listening skills, unlike skills from wake words, have access to much more information and context. In order to show the flexibility of a passive listening system, we built the following applications.
First, we implemented a general classification system, which considers many different intents drawn from different problem domains (e.g., business and time). Such a system is important for a passive listening device, which could be exposed to a lot of out-of-scope speech and must determine whether text is relevant. In our algorithm, we slide a rolling window over the text and check whether each chunk matches a known intent. Our system handles over 200 intents, such as Weather, Shopping, and Finance. In order to handle out-of-scope or irrelevant data, we added a dataset of movie conversations and Amazon Store reviews as out-of-scope examples. The dataset is then sent through a word encoder (to produce word embeddings) and a sentence encoder (to encode sentence meaning). In our experiments, we found that InferSent with GloVe worked well. Our architecture is a simple multi-layered convolutional network that uses the text embeddings. For some intents, the dataset was too small, so we employed strategies such as resampling. With minimal effort, we were able to get accuracy over 90%.
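The rolling-window idea can be sketched as follows. The keyword-based classifier here is a hypothetical stand-in for the trained convolutional model, and the window size, intent names, and score threshold are assumptions made only for illustration.

```python
from collections import deque

def toy_classifier(text):
    """Hypothetical stand-in for a trained intent model: returns (intent, score)."""
    keywords = {
        "weather": {"rain", "sunny", "forecast"},
        "shopping": {"buy", "milk", "groceries"},
    }
    words = set(text.lower().split())
    best, best_hits = "out_of_scope", 0
    for intent, vocab in keywords.items():
        hits = len(words & vocab)
        if hits > best_hits:
            best, best_hits = intent, hits
    return best, best_hits

def rolling_intents(phrases, classify=toy_classifier, window=3, min_score=1):
    """Yield (window_text, intent) for each window that looks in-scope."""
    recent = deque(maxlen=window)  # oldest phrase drops out automatically
    for phrase in phrases:
        recent.append(phrase)
        text = " ".join(recent)
        intent, score = classify(text)
        if intent != "out_of_scope" and score >= min_score:
            yield text, intent
```

Because the window keeps the most recent phrases together, an intent that is spread across several short utterances can still be detected, while unrelated chatter is classified as out of scope and dropped.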
Another popular domain for passive listening systems could be a shopping list that keeps track of shopping items. Using a similar approach to the general classifier, we trained a shopping classifier to detect shopping intents. We then search through a grocery list of possible products. We also use a Q&A classifier to check if, in context, the grocery item is mentioned to be running out. In order to extract and identify the text, we used classical natural language processing techniques, such as cosine vector similarity. Compared to a wake word-based system, a passive listening device can actually collect and keep track of this information.
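As an illustration of matching a spoken phrase against a product list with cosine similarity, the sketch below compares bags of character trigrams. The choice of trigram vectors, the function names, and the similarity threshold are assumptions made for a self-contained example; other vector representations, such as word embeddings, would use the same similarity formula.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Bag of character n-grams, padded so short words still produce features."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_product(candidate, products, min_similarity=0.5):
    """Return the closest product name, or None if nothing is similar enough."""
    vec = char_ngrams(candidate)
    best = max(products, key=lambda p: cosine(vec, char_ngrams(p)), default=None)
    if best is None or cosine(vec, char_ngrams(best)) < min_similarity:
        return None
    return best
```

The threshold keeps unrelated words from being added to the shopping list; in this sketch, a transcribed "milks" would still match "milk", while a word with no shared trigrams matches nothing.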
A common use case for traditional voice assistants is asking about the weather. If a device were passively listening, it could keep track of locations mentioned earlier in the conversation and, on request, provide the weather in all of them. Then, the location does not need to be mentioned explicitly in the query, as the system keeps track of the context of the previous conversation. In order to handle larger contexts, we use popular context-based language models such as Hugging Face question-answering models. Although this model might be accurate, we also use classical NLP algorithms, such as grammar lexicon analysis, to identify locations. If a location appears in both the model and classical NLP, we treat it as a match for a valid location for which to query the weather.
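The agreement rule between the two extractors can be sketched in a few lines. The function names are hypothetical, and the inputs are assumed to be plain lists of location strings produced by the question-answering model and by the classical NLP pass.

```python
def normalize(name):
    """Lowercase and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.lower().split())

def agreed_locations(model_locations, rule_locations):
    """Keep only locations reported by both extractors, without duplicates.

    The order and original spelling follow `model_locations`.
    """
    rule_set = {normalize(name) for name in rule_locations}
    seen = set()
    agreed = []
    for name in model_locations:
        key = normalize(name)
        if key in rule_set and key not in seen:
            seen.add(key)
            agreed.append(name.strip())
    return agreed
```

Requiring agreement trades recall for precision: a location found by only one method is dropped, which reduces the chance of querying the weather for a word that was misread as a place name.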
Users’ ability to control and audit skills is an important part of the pipeline. When the assistant records some details, the user can choose to audit these requests, as well as how the third-party applications are handling them. For example, if a service wants to connect to the internet to provide further details or results, this can be blocked through the client-side permission system, while certain connections are individually allowed through. If location permission is given, for example, the weather skill is allowed to execute by querying the OpenWeather API. When permission to access the internet or classify the data is denied, the request is stopped on the assistant itself, which prevents data from being accidentally sent to the skill. We built an auditing front-end dashboard to visually display the permissions created and blocked.
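A minimal sketch of this kind of client-side gate is shown below. The class name, permission labels, and audit log format are hypothetical and are not taken from the prototype; the point is only that the check happens before any text reaches the skill.

```python
from dataclasses import dataclass, field

@dataclass
class PermissionGate:
    """Tracks which permissions the user has granted to each skill (illustrative)."""
    granted: dict = field(default_factory=dict)    # skill name -> set of permissions
    audit_log: list = field(default_factory=list)  # (skill, permission, allowed) tuples

    def grant(self, skill, permission):
        self.granted.setdefault(skill, set()).add(permission)

    def revoke(self, skill, permission):
        self.granted.get(skill, set()).discard(permission)

    def dispatch(self, skill, required, handler, text):
        """Call `handler(text)` only if every required permission is granted."""
        allowed = self.granted.get(skill, set())
        missing = [p for p in required if p not in allowed]
        for p in required:
            # Record every check so a dashboard could display it later.
            self.audit_log.append((skill, p, p in allowed))
        if missing:
            return None  # blocked on the assistant side; the skill never sees the text
        return handler(text)
```

For example, a weather skill that requires both "location" and "internet" would receive nothing until the user grants both, and each blocked attempt would remain visible in the audit log.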
Future Work/Other Considerations
Even with our passive listening model, there are still some limitations with our system. When running all these skills locally, there are performance constraints with running on small devices. The different skills and the speech model required can get very large.
Another consideration is how long to hold the context for the application. Currently, we handle this with a rolling window, but a passive listening application operating at larger scale would be better served by a query system.
In this project, we implemented an architecture and a strategy for handling privacy in passive listening systems. We have also provided an open-source system for future researchers to implement new skills in order to test passive listening configurations.
About the Authors
Vikranth Srivatsa, Bhuvan Basireddy, and Nathan Malkin (2020 recipient of the Cal Cybersecurity Fellowship) are researchers in the UC Berkeley Department of Electrical Engineering and Computer Science (EECS). Other contributors to the system described in this article include Hrithik Datta, Prakash Srivastava, and Varun Jadia. This research was funded in part by the UC Berkeley Center for Long-Term Cybersecurity.