Our goal at Snips has always been to create voice assistants that are Private by Design: we should know nothing about what our users are doing. This zero-data approach creates a number of challenges, however, since training machine learning models requires data in the first place. That need is an excuse often heard from other voice assistants for centralizing our personal data.
At Snips, we are solving this problem through a combination of 3 factors:
- embedded processing of the voice query;
- decentralized data generation to train assistants before they are live;
- decentralized machine learning to improve the assistant over time.
In this post, we will show how we are solving the Privacy vs. AI conundrum using cryptography and a token, applying it to both analytics and machine learning to make them 100% Private by Design.
Decentralized Private Analytics
Before we can dive into the details of how we can securely and privately decentralize the training of voice assistants, we first need to look at a simpler problem: decentralized private analytics.
Analytics is a very important tool for any serious developer wanting to build a high quality app. As the famous saying goes, “you can only improve what you can measure”. But analytics also means tracking user behavior, and aggregating it on a server somewhere, leading to a massive privacy breach. Since this was not acceptable for us at Snips, we created a new technology that enables us to get statistics on a group of users without knowing anything about them individually.
To achieve this, we looked at a number of different approaches, from homomorphic encryption to differential privacy. But it was by using secure multi-party computation (MPC) with Secret Sharing that we ended up being able to efficiently and securely sum vectors over a set of users, without disclosing the contribution of any of them (aka private analytics). Our original research was published over a year ago along with an open source library called SDA (Secure Distributed Aggregation), and has now been adapted to incentivize participants to behave honestly through the use of a token.
Here is how the method works:
- Secret padding. Users encrypt their usage data by adding a random secret pad key, before sending it to the developer requesting the analytics. Because the developer doesn’t know the secret pad key, they have no way to know what the user has actually done.
- Secret sharing. The secret pads from users are distributed to a set of processing nodes called “clerks”, which then aggregate them securely. Anyone can be a clerk simply by staking Snips tokens, including devices users are running their assistants on. Clerks are then chosen with a probability proportional to the log of their stake, and are penalized if cheating or not responding in time. To improve security and resilience to clerks going offline, pads are shared with clerks using Shamir’s Secret Sharing. For added security, zero-knowledge proofs can also be used to publicly verify that the right shares are being sent.
- Reconstruction. Once each clerk has aggregated the partial shares they received from all the users, they send their result to the developer who can then reconstruct the sum of all pads. Subtracting this from the sum of the padded usage data originally sent by the users then yields the desired analytics. Once the process is complete, the developer then pays clerks in tokens for having processed the data, with the amount being proportional to how much data they processed.
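The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SDA library's actual API: the field modulus `PRIME`, the function names, and the choice of 5 clerks with a threshold of 3 are all hypothetical, and token payments, clerk penalties and zero-knowledge proofs are left out.

```python
import secrets

PRIME = 2**61 - 1  # assumed field modulus; must exceed any true aggregate


def share_secret(secret, n_clerks, threshold):
    """Shamir-share `secret`: any `threshold` of the n_clerks shares recover it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_clerks + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the shared value by Lagrange interpolation at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total


def private_sum(user_vectors, n_clerks=5, threshold=3):
    dim = len(user_vectors[0])
    masked_sum = [0] * dim                               # held by the developer
    clerk_sums = [[0] * dim for _ in range(n_clerks)]    # one row per clerk
    for vec in user_vectors:
        for k, value in enumerate(vec):
            pad = secrets.randbelow(PRIME)               # secret padding
            masked_sum[k] = (masked_sum[k] + value + pad) % PRIME
            for c, (_, y) in enumerate(share_secret(pad, n_clerks, threshold)):
                clerk_sums[c][k] = (clerk_sums[c][k] + y) % PRIME
    result = []
    for k in range(dim):
        # The developer only needs `threshold` clerks to answer.
        shares = [(c + 1, clerk_sums[c][k]) for c in range(threshold)]
        result.append((masked_sum[k] - reconstruct(shares)) % PRIME)
    return result
```

For example, `private_sum([[1, 0, 2], [0, 3, 1], [4, 1, 0]])` returns `[5, 4, 3]`. The sketch works because Shamir shares are linear: the sum of the shares a clerk receives is itself a valid share of the sum of all pads, so clerks never need to see an individual pad.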
Using this protocol, privately computing analytics to monitor 100 different events over 250K users only requires clerks to download a 1MB file, and run 6 minutes of background task computations on a device as light as a Raspberry Pi 3. This means any device — from a mobile phone to a smart speaker — can be a clerk and earn tokens by participating in this processing.
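Since any staking device can act as a clerk, the stake-weighted selection mentioned in the secret sharing step can be illustrated as follows. This is only a sketch: the post says selection is proportional to the log of the stake, and the use of `log(1 + stake)` (to keep weights positive for small stakes), the sampling without replacement, and all names here are assumptions.

```python
import math
import random


def selection_weights(stakes):
    """Weight each clerk by log(1 + stake); the '+ 1' is an assumption."""
    return {clerk: math.log1p(stake) for clerk, stake in stakes.items()}


def choose_clerks(stakes, k, rng=random):
    """Draw k distinct clerks, each draw weighted by the log of the stake."""
    pool = selection_weights(stakes)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # a clerk cannot be selected twice
    return chosen
```

With a logarithmic weighting, a clerk staking 1,000 tokens is less than three times as likely to be drawn as one staking 10 tokens, which limits how much influence a large stake can buy.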
Our solution makes sure no personal contributions get leaked, and only the aggregated statistics are made available. This is a significant step forward in making analytics more private, but it is known that identifiability issues may still arise. To take extreme examples, atypical individuals may still stand out from aggregated distributions, or aggregated statistics may be reverse-engineered when computed twice on populations differing only by one individual. In this case, computing the difference of the aggregated statistics gives access to the user’s individual contribution.
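The differencing scenario is easy to reproduce. In this small sketch, the user names and event counts are made up for illustration; the point is only that two exact aggregates over populations differing by one person reveal that person's data.

```python
def aggregate(contributions):
    """Sum per-event counts over all users (what the developer learns)."""
    return [sum(column) for column in zip(*contributions.values())]


def differencing_attack(agg_with, agg_without):
    """Subtract two aggregates that differ by exactly one user."""
    return [a - b for a, b in zip(agg_with, agg_without)]


# Hypothetical usage counts for three events.
everyone = {"alice": [3, 0, 1], "bob": [1, 2, 0], "carol": [0, 5, 2]}
without_carol = {name: v for name, v in everyone.items() if name != "carol"}

leaked = differencing_attack(aggregate(everyone), aggregate(without_carol))
```

Here `leaked` equals `[0, 5, 2]`, which is exactly carol's individual contribution, even though each aggregate on its own was computed privately.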
The solution to these problems is called differential privacy. The idea behind it is that you want to add noise to the user data, such that you cannot learn anything certain at the individual level, while still observing meaningful signal at the aggregate level. This is typically what’s done for analytics in Chrome, which relies on an algorithm called RAPPOR. Before individual contributions are sent out from our browsers, a bit of noise is added, making sure Google does not learn definite information about its users. Because noise is added locally, the solution is called local differential privacy.
One thing to notice is that the small bits of noise added by each user accumulate. As a consequence, the output received by the developer can become very noisy, meaning they also need a lot more data to get a clear signal.
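To make the local approach concrete, here is a sketch of randomized response, the basic building block that RAPPOR extends. It is not RAPPOR itself: the single-bit setting, the function names and the parameter `epsilon` (the privacy budget) are assumptions chosen for illustration.

```python
import math
import random


def randomize_bit(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return bit if rng.random() < p_truth else 1 - bit


def estimate_count(reports, epsilon):
    """Unbiased estimate of how many users truly hold a 1, from noisy reports."""
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    # E[sum(reports)] = c * p + (n - c) * (1 - p); solve for c.
    return (sum(reports) - n * (1 - p)) / (2 * p - 1)


rng = random.Random(0)
true_bits = [1] * 6000 + [0] * 14000          # hypothetical population
reports = [randomize_bit(b, 1.0, rng) for b in true_bits]
estimate = estimate_count(reports, 1.0)
```

Each individual report is deniable, since a user may have flipped their bit, yet `estimate` lands close to the true count of 6,000 because the flips average out over many users.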
To tackle this problem, there is an alternative called global differential privacy. If one or several aggregators are trusted with aggregating the signal, or do it in an encrypted way like in SDA, the clerks can add noise instead of the users. Doing so can still give the same level of privacy, with significantly less noise being added to the data. An experiment we ran shows that using global differential privacy with 1,000 users offers prediction performance similar to that of RAPPOR with 1,000,000 users, while guaranteeing the same level of differential privacy.
What this means is that the combination of private aggregation and global differential privacy brings voice app users and developers levels of privacy and accuracy similar to what Google Chrome can provide for its analytics, relying on millions of users. Beyond pure decentralization, our Private Analytics solution also democratizes the use of differentially private analytics!
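The gap between the two approaches can be illustrated with a simple count query in which each user contributes at most 1. The sketch below uses the standard Laplace mechanism on the aggregate and compares its noise to single-bit randomized response as a stand-in for a local scheme. It does not reproduce the experiment mentioned above, and the noise is added to a plain sum here, whereas clerks would add it inside the secure aggregation; the function names are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def global_dp_count(true_count, epsilon, rng, sensitivity=1.0):
    """Add noise once, to the aggregate (Laplace mechanism)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def global_noise_std(epsilon, sensitivity=1.0):
    """Standard deviation of the Laplace noise; independent of user count."""
    return math.sqrt(2) * sensitivity / epsilon


def local_noise_std(n_users, epsilon):
    """Standard deviation of a randomized-response count estimate."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return math.sqrt(n_users * p * (1 - p)) / (2 * p - 1)


noisy = global_dp_count(6000, epsilon=1.0, rng=random.Random(0))
```

The global noise has a fixed standard deviation for a given `epsilon`, while the local noise grows with the square root of the number of users, which is why the global variant needs far fewer users to reach a clear signal.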
Decentralized Machine Learning
Analytics can be key in identifying performance issues or design errors in a voice app. To solve performance issues, training models on end user data will eventually bring the best performance. Is it possible to train a model on end user data without forcing this data to leave the device it sits on? Yes. It’s called Decentralized Machine Learning.
The goal of Decentralized Machine Learning (also called Federated Learning) is to train a neural network by computing gradient updates locally on the device of each user, before aggregating them securely via a network of clerks. Since aggregation of gradients is simply a summation, we can reuse the same protocol that we developed for analytics, thereby ensuring complete privacy.
The process is as follows:
- Annotation. Users annotate their own data using the Snips app or any other available tool. This is done privately such that nobody can access their data but them.
- Secret padding. Their annotated data is then used to update the gradient of the neural network running locally on their device (remember Snips runs 100% on device!). Using the same protocol as for analytics, the updated gradient is then encrypted by adding a secret pad key, and sent to the developer requesting the updated neural network.
- Secret sharing. The secret pads from users are then aggregated securely by clerks following the exact same protocol as for analytics.
- Reconstruction. The developer then performs the reconstruction operations, again following the same protocol as for analytics, but this time paying both the clerks and the users who contributed their data.
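The process above can be sketched for a toy linear model. This is an illustration under several assumptions rather than the Snips implementation: the model and squared-error loss, the fixed-point precision `SCALE`, the modulus `PRIME` and the learning rate are all hypothetical, and the clerks are abstracted into a single `pad_sum` that stands in for what they would jointly reconstruct.

```python
import secrets

PRIME = 2**61 - 1   # assumed field modulus
SCALE = 10**6       # assumed fixed-point precision for encoding floats


def local_gradient(weights, examples):
    """Mean squared-error gradient of y ~ w.x on one user's private data."""
    grad = [0.0] * len(weights)
    for x, y in examples:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for k, xi in enumerate(x):
            grad[k] += 2 * err * xi / len(examples)
    return grad


def encode(value):
    """Map a float to a field element (negatives wrap around PRIME)."""
    return round(value * SCALE) % PRIME


def decode(element):
    """Inverse of encode: the upper half of the field is read as negative."""
    if element > PRIME // 2:
        element -= PRIME
    return element / SCALE


def federated_step(weights, users_data, lr=0.1):
    dim = len(weights)
    masked_sum = [0] * dim   # what the developer receives from users
    pad_sum = [0] * dim      # stand-in for the clerks' reconstructed pad sum
    for examples in users_data:
        grad = local_gradient(weights, examples)   # never leaves the device
        for k, g in enumerate(grad):
            pad = secrets.randbelow(PRIME)
            masked_sum[k] = (masked_sum[k] + encode(g) + pad) % PRIME
            pad_sum[k] = (pad_sum[k] + pad) % PRIME
    avg = [decode((m - p) % PRIME) / len(users_data)
           for m, p in zip(masked_sum, pad_sum)]
    return [w - lr * g for w, g in zip(weights, avg)]
```

As a usage example, two users whose data follows `y = 2x` can jointly fit the model without revealing their examples:

```python
users = [[([1.0], 2.0)], [([2.0], 4.0)]]
w = [0.0]
for _ in range(30):
    w = federated_step(w, users)
```

After these rounds `w[0]` is close to 2, and the developer has only ever seen padded gradients and their aggregate.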
There is a key challenge that is often overlooked: most decentralized machine learning solutions, like OpenMined, DML or BigAI, assume that users have pre-labelled data on their devices. To train a supervised model, you need descriptions of situations (X’s), and the corresponding right choices that you’d like an AI to make autonomously (Y’s). These X’s and Y’s are strictly required for most practical machine learning applications, and make a strict distinction between problems that can today be addressed in a decentralized way, and those that cannot.
When it comes to a problem like wake word detection for example, there is only implicit feedback for false positives. The user may curse, complain, or ignore the assistant, but there is no option for them to directly say “this was a false positive”. And generally, there is no feedback either when a wake word detector fails to detect the user saying the wake word. To solve this problem, we are working on interfaces for the user to supervise the decisions made by the Artificial Intelligence.
There are several issues to solve with this approach, however. First, users could contribute bad data, since they supervise their own data and get paid for contributing. Validating contributions turns out to be a fairly challenging task, as involving third-party validators poses privacy concerns, and there is no machine learning model that can be trusted to detect fraud with 100% accuracy. Nevertheless, a validation model, potentially larger than the one used to run inference in the voice app, can be used to detect systematic perplexity with regard to the data contributed by a user. In this case, the user can simply be put aside from the contributor pool, temporarily or not.
Beyond incentivizing supervision and making contributions to the global gradient public, there are a series of other challenges to be addressed to make Federated Machine Learning a reality. These include minimizing the number of round trips between the devices and the server, the convergence of the optimization algorithm, etc. Those are challenges we are going to be working on in the coming year, collaborating with partnering researchers from LIP6 and INRIA.
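One possible way to act on such a validation model is sketched below. Everything here is hypothetical: the post does not specify how perplexity is measured or thresholded, so the sketch assumes the validation model yields one loss score per user per round, and flags users whose scores stay above a threshold for several consecutive rounds.

```python
def flag_contributors(loss_history, threshold, min_rounds=3):
    """Return users whose last `min_rounds` validation losses all exceed threshold.

    loss_history maps a user id to the list of per-round losses that an
    assumed validation model assigned to that user's contributions.
    """
    flagged = set()
    for user, losses in loss_history.items():
        recent = losses[-min_rounds:]
        # Require a full window so that one bad round is not enough to flag.
        if len(recent) == min_rounds and all(l > threshold for l in recent):
            flagged.add(user)
    return flagged


# Hypothetical scores: "u2" is consistently surprising to the validation model.
history = {
    "u1": [0.4, 2.9, 0.5, 0.3],
    "u2": [2.5, 3.1, 2.8, 3.4],
    "u3": [3.0, 3.2],
}
suspects = flag_contributors(history, threshold=2.0)
```

Here `suspects` is `{"u2"}`: a single noisy round ("u1") or a short history ("u3") does not remove a user from the contributor pool, which matches the idea of looking for systematic rather than occasional perplexity.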
You can find more information on our approach to federated learning in this presentation.