Decentralized Data Generation

Rand Hindi
Jun 25, 2018 · 6 min read


A voice app is an action performed in response to a natural language query from the user. It can be anything from making coffee to finding a movie to watch. Creating a new app requires three steps (sketched in code after the list):

  • defining the structure of what the assistant should understand (the “Intents”).
  • giving examples of how a user would ask something, and using them to train the intents (the “Training Set”).
  • writing the code that will execute the action the assistant just understood.
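
As a concrete (and purely illustrative) example, the sketch below shows what these three ingredients might look like for a coffee-machine app. The schema, names, and bracketed slot annotations are hypothetical, not the actual Snips format:

```python
# Hypothetical intent definition for a coffee-machine voice app.
# The schema below is illustrative only, not the actual Snips format.
make_coffee_intent = {
    "intent": "MakeCoffee",                      # 1. structure to understand
    "slots": {
        "coffee_type": ["espresso", "latte", "americano"],
        "quantity": ["one", "two", "a couple of"],
    },
    "training_examples": [                       # 2. how users might ask
        "make me [one](quantity) [espresso](coffee_type)",
        "brew [two](quantity) [americano](coffee_type)s please",
        "I'd like [a couple of](quantity) [latte](coffee_type)s",
    ],
}

def handle_make_coffee(slots):                   # 3. code executing the action
    print(f"Brewing {slots['quantity']} {slots['coffee_type']}...")

handle_make_coffee({"quantity": "one", "coffee_type": "espresso"})
```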

This means that each app will need to be trained with examples representing how people would interact with it. However, since most assistants are completely new products, the data to train them does not exist. Even if you already have millions of users, if they have never spoken to their coffee machine, you will not have coffee machine data. As such, the biggest issue facing anyone building a new app is finding the data to train it before the app is actually published.

To deal with that issue, app developers do two things: first, they manually input a bunch of examples of how they expect people to talk to their assistant, and second, they track how real users talk to their assistant, improving performance over time. However, there are multiple issues with this approach:

  • there is only a limited number of examples anyone can think of, leading to voice apps being trained on only 30–50 examples on average. This is why apps are usually terrible when they get launched: they simply didn’t have enough data to learn from!
  • tracking user data means invading their privacy, as what they say can be highly sensitive.
  • the data needs to be manually annotated (you need to say which part of the sentence is which parameter), which is a very tedious, error-prone process (illustrated in the sketch below).
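
To make that last point concrete, here is a minimal sketch of what manually annotating a single query might involve. The span-based format is a hypothetical example, not a specific Snips schema:

```python
# Manually annotating one sentence: every slot value must be located by
# hand, typically as character offsets. Doing this for thousands of
# queries is tedious and error-prone (off-by-one spans, missed slots...).
annotated_query = {
    "text": "turn on the lights in the living room",
    "intent": "SwitchLightsOn",
    "slots": [
        # (start, end) character offsets into "text" (hypothetical format)
        {"slot": "room", "value": "living room", "range": (26, 37)},
    ],
}

# Sanity check that the offsets actually match the annotated value:
text = annotated_query["text"]
for s in annotated_query["slots"]:
    start, end = s["range"]
    assert text[start:end] == s["value"], f"bad span for {s['slot']}"
```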

To solve these issues, we came up with a radically new concept: Data Generation. In a nutshell, the idea is to create “fake” user data by generating thousands of training examples from a handful given initially.
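
As a rough intuition (the sketch below is a naive recombination of patterns and slot values, not Snips’ actual generation model), the combinatorial idea looks like this:

```python
import itertools

# Naive illustration of data generation: recombine sentence patterns and
# slot values harvested from a handful of seed examples. The real system
# is more sophisticated; this only conveys the combinatorial idea.
patterns = [
    "turn on the lights in the {room}",
    "switch the {room} lights on",
    "{room} lights on please",
]
rooms = ["living room", "kitchen", "bedroom", "office", "hallway"]

generated = [p.format(room=r) for p, r in itertools.product(patterns, rooms)]
print(len(generated))  # 15 queries from 3 patterns x 5 slot values
print(generated[:3])
```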

[Figure: How data generation works]

This yields a number of advantages:

  • it vastly simplifies the life of the developers, as they no longer have to spend hours figuring out all the training examples.
  • it preserves the privacy of end users, as their data no longer needs to be tracked.
  • it solves the day-0 quality problem, as assistants can now be trained with the equivalent of months of user data, before they are even launched.

[Figure: Average improvement of NLU performance after data generation]

Our data generation tool has been available as a centralized, fiat-based SaaS for a few months, and has been used by all our enterprise customers. Our goal is now to decentralize it, using the upcoming Snips token to incentivize “workers” to produce high-quality training data.

The first step is to hire and onboard workers. This is done via a bounty program, where they are asked to generate data for internal Snips tasks in exchange for tokens. This lets workers start earning tokens while learning how the process works. Anyone can be a worker, including users of Snips AIR. Once a worker has accumulated enough experience, they can start participating in data generation for developers.

Generating data then works as follows:

  1. An app developer creates a new data generation campaign by describing the intent and giving a small number of example queries. The tasks consist of a description of the intent and specific items to include in the sentence. For example, to generate smart lights data, the worker will receive a task saying “Formulate: Switch Lights On, Include: Living Room”. The price for each generated query is set by the developer, with a minimum price based on the average time per task and a fair hourly wage for workers.
  2. Workers wanting to participate in a campaign must stake tokens, and are chosen with a probability proportional to their stake. To ensure linguistic diversity in the generated sentences and prevent the wealthiest workers from being selected too frequently, the log of the stake is used when computing this probability, yielding the following formula: P(worker_i) = log(stake_i) / Σ_j log(stake_j) (a runnable sketch of this selection follows the list). As the number of campaigns increases, workers will be able to earn more tokens. As a result, we expect stakes to increase before reaching an equilibrium where a large number of workers have similar probabilities of being picked.
  3. A worker is then chosen for each task (called a generator), and asked to generate a query that must include the items specified. In the case of the smart lights example, this could be “Turn on the lights in the living room” or any variant such as “Living room lights on”.
  4. Three other workers, called validators, are then chosen following the same protocol, and asked to vote to confirm the formulation, spelling, and intent of the generated query. If the query is valid, the generator gets paid. If it is invalid, the stake of the generator is confiscated until they complete at least 3 consecutive correct tasks. Similarly, if a validator does not agree with the other two, their stake is confiscated until they validate 3 consecutive tasks without being contradicted (see the staking sketch after this list). This ensures both generators and validators behave honestly.
  5. A further machine-learning-based cross-validation is performed off-chain as a precautionary measure, enabling developers to spot ambiguous queries and to blacklist workers who cheated but were not caught by the previous validation steps. It is important to note that using machine learning alone as a validation step would not yield the same diversity and robustness, which is why it is only used as a final step.
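
To make step 2 concrete, here is a minimal sketch of log-weighted worker selection. Function names are hypothetical, and stakes are assumed to be greater than 1 so that log(stake) is positive; this illustrates the formula above, not the actual on-chain implementation:

```python
import math
import random

def pick_worker(stakes):
    """Pick one worker with P(i) = log(stake_i) / sum_j log(stake_j).

    `stakes` maps worker id -> staked tokens; we assume stake > 1 so
    that every log-weight is positive.
    """
    weights = {w: math.log(s) for w, s in stakes.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    cumulative = 0.0
    for worker, weight in weights.items():
        cumulative += weight
        if r <= cumulative:
            return worker
    return worker  # fallback for floating-point rounding

stakes = {"alice": 10, "bob": 100, "carol": 1000}
print(pick_worker(stakes))
```

Note how the log dampens wealth: carol stakes 100x more than alice, but her log-weight (~6.9) is only about three times alice’s (~2.3), so she is only about three times as likely to be picked.

Similarly, the confiscation-and-recovery rule from step 4 amounts to a small state machine per worker. The class below is a hedged sketch, assuming “confiscated” means frozen until the worker recovers it; all names are illustrative:

```python
class WorkerRecord:
    """Tracks the stake-confiscation rule from step 4 (illustrative names)."""

    RECOVERY_STREAK = 3  # consecutive correct tasks needed to recover stake

    def __init__(self, stake):
        self.stake = stake
        self.stake_frozen = False
        self.streak = 0

    def record_task(self, correct):
        if correct:
            self.streak += 1
            if self.stake_frozen and self.streak >= self.RECOVERY_STREAK:
                self.stake_frozen = False  # stake released after 3 in a row
        else:
            self.streak = 0
            self.stake_frozen = True  # invalid work: stake confiscated

w = WorkerRecord(stake=50)
w.record_task(False)       # invalid generation -> stake frozen
for _ in range(3):
    w.record_task(True)    # three consecutive valid tasks
assert not w.stake_frozen  # stake recovered
```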

Data generation is an important tool for developers to create high-quality voice apps. Thus, the more consumers adopt Snips, the more we can expect the number of data generation campaigns to increase, and with it the total transaction volume. With the worker staking mechanisms in place, we can expect this to contribute to a sound economy around the Snips token, while simultaneously solving the issue of data quality.

Finally, and importantly, data generation for natural language queries is just a first step. We have already started using the same technique to generate voice samples for training wake words. Our goal is to eventually enable any AI-related annotation task to be conducted in the same way, turning our data generation tool (and the Snips token) into a general-purpose AI platform.

In the next posts, we will explain how we are building the Snips AIR blockchain, and how the token will be used for decentralized encrypted analytics, decentralized federated machine learning, and for our token-curated voice app store!

If you liked this article and care about Privacy, smash that clap button, then tweet everyone 👉👉 randhindi & snips

Interested in understanding what’s going on in tech these days? Follow my instagram stories @randhindi 🤓
