TL, DR; In this post, I will introduce you our security study on voice assistant platforms (Amazon Echo and Google Home). This research reveals a set of attack scenarios for both platforms, along with several promising countermeasures to address those issues. For details, you can also see the research paper published on IEEE Security&Privacy 2019 (acceptance rate is 12%). The research has also won 3rd place in CSAW’19 Applied Research Competition (3 out of 80), which serves as a good recognition of both research and industrial values. In this post, you will learn:
- What are voice assistant platforms?
- Attack scenarios introduced in this paper: Voice Squatting and Voice Masquerading. Both attacks have been reported to the vendors before publishing the paper, and both vendors acknowledged the attacks.
- Detection and prevention techniques proposed in this study. The authors have shared their defense techniques with the vendors, and also filed a set of patents which belong to Indiana University Bloomington.
- Analysis of the fundamental issues of voice user interface, which make those attacks practical and inherent.
Voice Assistant Platforms
In recent years, voice assistant devices are getting increasingly popular, especially Amazon Echo and Google Home. Using those devices, users can carry out a set of tasks including playing music, controlling IoT devices, sending or receiving money, and accessing personal medical information.
So, how it works? and how can those tasks be executed? Briefly, those platforms work like a service router/proxy. Every time a user gives a voice command, it will be captured, and transferred back to the platform server. And the platform server will apply a set of AI techniques (Speech and NLP) aiming to understand the given voice command, and decide the exact skill to handle further interactions. One thing to note here is most skills are provided by third parties, instead of the platform itself. And, such an architecture actually create some significant attack space, as introduced below.
Attack Scenarios: Voice Squatting & Voice Masquerading
The first attack scenario is called voice squatting, wherein, the attackers can mislead the platform and get the opportunity to handle users’ interactions. As shown below, when a user want to send money through PayPal, PayPal skill should be started to handle user’s further interactions such as asking for the recipient and the amount of money to send. However, the attacker may confuse the platform by registering a competing skill through some effective channels, and instead of the PayPal skill, the malicious skill may be started to handle further conversation with the user. What makes things even worse is that there is no deterministic indicators for users to distinguish PayPal from the malicious skill.
What makes things even scary is a malicious skill cannot only masquerade itself as another skill, but also the platform itself. Specifically, every time, users finish the interaction with a given skill, either the skill quits itself by sending quitting signals to the platform, or the users say some quitting voice commands such as “Alexa, stop”. However, researchers found that not all commonly used quitting keywords will be recognized by the platform. And if a quitting keyword is not supported by the platform, the platform will continue to pass future voice commands from the user to the current skill. This allows a malicious skill to recognize those ignored quitting voice commands, and pretend to have quitted, and masquerade itself as the platform to handle next voice commands from users.
Also, what if users want to start another skill when interacting with the current skill? Unfortunately, by the time when the study was carried out, this is not well supported by the platform, which means the corresponding voice command (“Alexa, switch to PayPal please”) will still be passed to the current skill to handle, which allows the current skill to pretend switching, resulting in the same consequence as voice squatting.
How to carry out those attacks in the real world?
You may wonder how realistic are those attacks? Apparently, voice masquerading requires the attack skill to be able to handle users’ interactions. Therefore, let’s focus on voice squatting as it will give attackers the opportunity to further carry out voice masquerading. The researchers have found two effective channels to achieve voice squatting and have demonstrated how easy it is to carry out such attack.
Consider you are the attacker and you want to voice squat Capital One (a popular finance skill). In other words, when users want to start Capital One, you want your attack skill can get started instead. As shown below, One way to achieve this is through name extending where you register your attack skill with invocation names such as “Capital One Please”, “My Capital One”, and “Capital One App”. The basic rule is to insert some commonly used prefixes (my, me, a, the) and suffixes (app, skill, please). As the platforms enforce a strategy of matching the longest invocation name, when users say “open my capital one”, instead of Capital One, your attack skill registered as “Capital One Please” can be started. Another channel is to compose invocation names with similar pronunciation as the victim invocation name. To compete with Capital One, you can name your skill as “Captain One” or “Capitol One”. Actually, the researchers have developed a tool to automatically generate, for a given invocation name, hundreds or even thousands of competitive names with similar pronunciations. And they can even make a competing invocation name looks much different from the victim invocation name, from the text format.
The authors have also demonstrated how effective both channels are when carrying out voice squatting attacks. As shown below, they register those attack skills, try to invoke victim skill as common users. and observe which skill, the victim one, or their attack one is actually started. And we can see both channels are found to be very effective for Amazon Echo. For Google Home, most prefixes/suffixes don’t work except for the suffix “app”, while attack skills with similar pronunciations were found to be more effective on Google Home platform.
Detection and Defense
The authors also designed a detector to identify real-world attack cases, as well as some defense components to dynamically block suspicious behaviors from skills, and to better understand users’ intentions.
Detecting Attack Skills
The researchers designed a detector consisting of a set of machine learning models. Briefly, for a given skill, its pronunciation similarity to existing skills will be calculated to decide if it is competing with some existing skills. For more technical details, please refer to the paper.
Also, the authors collected a set of commonly used prefixes and suffixes through their user study. They then generated a set of variations for existing skills by adding those prefixes and suffixes. If a given skill is found to be an variation of another existing skill, it is considered as suspicious of voice squatting through name extending.
They then applied their detector to the skill market of Amazon Alexa consisting of tens of thousands of skills. Results are shown as below. Surprisingly, almost 20% of skills have at least one competing skill with similar pronunciation. Also, another 3-hundred skills were found to be variation of another skill through name extending. For example, “dog fact” is a popular skill, and there is actually another skill named “me a dog fact”. As aforementioned, when users try to invoke dog fact by saying “open me a dog fact”, it is very likely “me a dog fact” will be started.
While the detector can help prevent voice squatting attack, how should we prevent the voice masquerading attack? The authors provided their solution consisting of two components, one is user intention classifier (UIC) to dynamically decide whether users want to switch to another skill or quit back to the platform. The other component is skill response checker (SRC) which will dynamically inspect response from skill servers, to make sure the skill is not pretending the system. Still, refer to the paper for more details.
Fundamental Issues of Voice User Interface
Two important factors make those attacks some kind of inherent in voice user interfaces. One is the ambiguity of speaking natural languages. Even if not considering the personal speaking styles and accents of different users, the same sentence can represent multiple semantics while the same semantic can be expressed with different syntactic structures. To accurately understand users’ intention underlying a given voice command, roundtrips of verifications seems unavoidable, which is not well implemented in both platforms (likely because of concerns on bad user experience).
Secondly, voice user interfaces suffer from low visibility. Most users are found to know little about existence of 3rd-party skills. Also, it is difficult for users to distinguish different skills, if only leveraging information from the voice channel.
Overall, It seems very challenging to address either issue, without making the interaction process length, and inefficient.
Overall, this security study reveals two important attack scenarios for voice assistant platforms, along with several defense techniques. However, there are still a lot work left. One is to automatically verify whether those competing skills have malicious behaviors. The other is to design and evaluate better context switching indicators which are both user-friendly and effective.