Can we make the data-buying process in crowdsourcing better with smart contracts, ERC20, and bonding curves?

Ko
Sotuu
Jun 5, 2019

AI is eating the world. Everybody knows it.

Companies use crowdsourcing platforms to purchase annotated data, or to have crowd workers label datasets. To tell the truth, we ourselves use a crowdsourcing platform to acquire datasets.

The process is simple.

Crowd workers create labels for datasets to earn small amounts of money.
For example, if we were building an image recognition AI and the price of one set of data were 10 cents, a crowd worker would earn 10 cents for each labeling task.

A company that wants an annotated dataset pays the crowdsourcing platform, and the platform company pays each crowd worker.

While this process works okay, at least to some extent, my question is:

Can we make this process better with blockchain, smart contracts, ERC20, and bonding curves?

Motivation

We all know AI is dominated by large companies, and having open alternatives is a good thing for our society.

Since AI is not just about source code but about data and computation too, I think using blockchain, smart contracts, ERC20, and bonding curves may make sense.

I think the bonding curve idea proposed by Simon de la Rouviere could be very helpful for revolutionizing crowdsourcing-based AI data acquisition.

BTW, if you are new to the idea of bonding curve tokens, please check out some articles, like this one.

Anyway, this is a thought experiment. The objective is to develop a mechanism/system for AI and AI datasets to improve and grow autonomously and sustainably with ERC20 and bonding curves.

Can we align incentives toward common goals?

Since the goal of a company is to make as much money as possible with limited resources, there is a fundamental conflict between data-buying companies and crowd workers.

Companies try to acquire just barely enough data, because they have budget limitations unless they are Google or Facebook.

But the problem is that more data is always better in deep learning, as Andrew Ng said on YouTube.

And because every project has competitors to some extent, every project has to keep improving its service. In other words, ultimately there is no "good enough" AI or "large enough" dataset.

So, theoretically, if a company could pursue more data without any budget limitation, it could train better models with the biggest datasets and make more money.

And if a company achieves bigger success, it can afford to pay crowd workers more.

What if the data is owned by a smart contract?

My proposal is to use a smart contract, instead of a company, as the data owner.

How it works, in its simplest form

When a crowd worker (we call them a data contributor) contributes data, they earn tokens (ERC20) instead of fiat money.

The data is owned and managed by the smart contract. Since the contract can keep minting new tokens for new data, there is no budget limitation.

The contract makes money/tokens by selling access to the data, or in other ways.
Companies that want to train an AI model can access the data as data consumers.

Data contributors (crowd workers) can convert the ERC20 tokens into other tokens like Ether, DAI, or another ERC20 at any time on the smart contract.

The price is calculated based on the value the smart contract is holding, the total supply of the token, and a formula written in the contract.
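To make the mechanics concrete, here is a minimal Python sketch of the flow described above: mint tokens for data, collect revenue into a reserve, and let contributors redeem at a price derived from the reserve and total supply. This is a toy model, not real Solidity, and the pro-rata formula (`price = reserve / supply`) is just my assumption; it is the simplest of many possible formulas.

```python
class DataContract:
    """Toy model of the proposed data-owning smart contract.

    State: a reserve of a bonded token (e.g. DAI) and a supply of the
    reward ERC20. All names here are illustrative assumptions.
    """

    def __init__(self):
        self.reserve = 0.0   # bonded-token value held by the contract
        self.supply = 0.0    # total supply of the reward ERC20
        self.balances = {}   # reward-token balance per contributor

    def contribute_data(self, contributor, reward_tokens):
        # Minting for new data: no budget limit, supply just grows.
        self.supply += reward_tokens
        self.balances[contributor] = self.balances.get(contributor, 0.0) + reward_tokens

    def sell_access(self, payment):
        # Revenue from data consumers goes into the reserve.
        self.reserve += payment

    def price(self):
        # Assumed formula: each token redeems a pro-rata share of the reserve.
        return self.reserve / self.supply if self.supply else 0.0

    def redeem(self, contributor, amount):
        # Convert reward tokens back into the bonded token: burn and pay out.
        assert self.balances.get(contributor, 0.0) >= amount
        payout = amount * self.price()
        self.balances[contributor] -= amount
        self.supply -= amount
        self.reserve -= payout
        return payout
```

For example, with 100 tokens minted for data and 50 DAI of access revenue in the reserve, each token redeems for 0.5 DAI, and a redemption leaves the price for the remaining holders unchanged. Note that under this particular formula, minting tokens for new data dilutes the price when the reserve is non-empty; choosing a formula without that side effect is exactly the open design question raised at the end of this post.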

Benefits

  • Keep acquiring data forever
    Since there is no budget limitation, the project can keep growing its datasets forever.
  • Easy to share
    Because the data is owned by a contract, it is easier for multiple parties to share it.
  • Persistence
    The data can remain accessible even after a company stops its AI project, as long as the data is on decentralized storage.
  • Easy to connect
    Since contracts can talk to other contracts, there are flexible ways for the contract to earn tokens. (E.g., our data project could potentially become a data provider in a decentralized data ecosystem like Ocean Protocol.)
  • Possibility of distributing more value to data contributors
    Data contributors may earn more value/tokens, because the AI project could achieve bigger success with bigger datasets.

Bonding curve example

Data contributors can convert the ERC20 token into the bonded token (another ERC20, Ether, or DAI) at any time, based on a formula.
The price is calculated from the formula, the value held in the contract, and the total supply of the token.

When the smart contract earns tokens (revenue), the price of the ERC20 token goes up.

When a data contributor contributes a new set of data, the contract mints new tokens.
In that case, the price stays the same.

Ecosystem

Since there are many kinds of AI, there are many kinds of datasets.
Each dataset can have its own ERC20 token, bonded with a token like Ether, DAI, or another ERC20.
A data contract can earn tokens either by connecting to other blockchain-based data marketplaces and computing services, or through traditional company services.

A real world experiment

We are experimenting with the idea on a Japanese speech recognition AI as part of the Sotuu project.

So far we have released an app where data contributors can contribute voice data to earn tokens. But after the token bubble burst, tokens are no longer attractive as an incentive.
So we purchased 100 hours of voice data through a crowdsourcing platform and trained a toy speech recognition model.

Our next step is to acquire at least 150 more hours of data, train a model good enough for some production use cases, and then restart the blockchain part by donating the dataset and the AI to the contract. Our hypothesis is that if the contract has revenue, no matter how small, the token model (mechanism) may work.

Another possibility instead of a bonding curve

We could instead distribute value/tokens to token holders in proportion to their holdings whenever the contract earns tokens. In that case, the contract does not hold value, and we would need a marketplace for token holders to buy and sell the token, because the contract cannot calculate the price.

This implementation is even more interesting because data contributors keep earning tokens as long as the dataset/AI is doing well.

But I have no idea about the legal aspects of this implementation.
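The dividend alternative can be sketched just as simply: whenever revenue arrives, split it among current token holders in proportion to their balances. Again this is a toy Python model, not Solidity; a real contract would use a pull-payment / dividend-points pattern rather than looping over holders, since iterating over an unbounded set of addresses on-chain would run out of gas.

```python
def distribute_revenue(balances, total_supply, revenue):
    """Pro-rata dividend model (illustrative sketch).

    Whenever the contract earns `revenue` in the bonded token, each
    holder receives a share proportional to their fraction of the
    total reward-token supply. Nothing accumulates in the contract.
    """
    return {holder: revenue * bal / total_supply
            for holder, bal in balances.items()}
```

For example, with Alice holding 60 tokens and Bob holding 40 out of a supply of 100, a revenue of 10 DAI would pay out 6 DAI to Alice and 4 DAI to Bob.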

BTW, please give me your feedback, especially about…

  • What kind of equation/formula is good for attracting data contributors and for sustainability?
    I don't think a linear function is the best formula to attract data contributors and sustain the entire system.
  • What do you think about the model of distributing tokens to holders whenever the contract earns tokens?
  • Which token is ideal as the bonding (main/key) token?
    The bonding token is the token in which the contract earns revenue; data contributors convert the reward token into the bonding token on the contract. We could issue our own key token, like Ocean Protocol does, and having our own token has two benefits: we can incentivize parties with the token, which is important because we don't have revenue yet, and our own token would be helpful if we ever needed to raise money (we are not thinking about fundraising or an ICO at all).
    But data contributors and data providers probably want Ether or DAI, and ultimately fiat, instead of our ERC20 token.
  • Any suggestions for dataset types?
    What kind of data is interesting or suitable for this kind of project? How about datasets for autonomous vehicles?
  • Any feedback is precious!
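On the formula question, one way to compare candidates is to look at how much reserve it takes to move the supply along different curves. The sketch below numerically integrates two illustrative price functions (a linear curve and a square-root curve; both are my guesses, not proposals from this post). Under the linear curve, doubling the supply quadruples the reserve required, while under the square-root curve it only grows by a factor of about 2.83, so late buyers are priced out less aggressively.

```python
import math

# Two candidate bonding-curve price functions: price as a function of
# current token supply s. Purely illustrative; the constant k is arbitrary.
def linear(s, k=0.01):
    return k * s

def sqrt_curve(s, k=0.01):
    return k * math.sqrt(s)

def buy_cost(price_fn, s0, s1, steps=10000):
    # Midpoint-rule integral of the price curve: the reserve a buyer
    # must pay to move the supply from s0 to s1 on a bonding curve.
    ds = (s1 - s0) / steps
    return sum(price_fn(s0 + (i + 0.5) * ds) for i in range(steps)) * ds
```

Comparing `buy_cost(linear, 0, 100)` with `buy_cost(linear, 0, 200)` against the same pair for `sqrt_curve` shows the 4x-versus-2.83x growth difference directly; which shape actually attracts contributors while keeping the system solvent is exactly the open question above.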

BTW, I'm strongly influenced by Ocean Protocol, a protocol for decentralized data marketplaces and more. Their vision is awesome.

If you have an idea for an AI data project, let's talk. I'll prepare a smart contract for it.

The original slides I made are here.

Thanks!
