In search of decentralized data markets

Here are some introductory notes about smart contracts and related technology leading toward decentralized data markets. Introductory, as in “I’m trying to learn about this stuff.” Specifically, I got really interested in the notion of decentralized data markets because of: “Introducing Computable Labs”, “Tokenized Data Markets”, the computable.js repo on GitHub, and other related work from the Computable team.

However, many of the blockchain tutorials that I’ve found online seem … well, they could use more improvement. Here are notes about what I’ve found along the way, as a summary of some of the history, which resources seem to help, interesting ideas that experts mentioned in conversation, etc.

Nick Szabo wrote the original article “Smart Contracts” in 1994, which appeared in Extropy magazine #16 in 1996. An updated version of that article is available online. Szabo has degrees in both computer science and law, and one might think of a smart contract as a contemporary mingling of the two disciplines. In essence, smart contracts are computer programs running atop a blockchain which simulate a trusted, decentralized third party.

You’ll probably see mention of DApps, which is an abbreviation for decentralized applications. Ethereum introduced a “next generation” platform for smart contracts and DApps, described in their white paper by Vitalik Buterin in 2013. The article “DAOs, DACs, DAs and More: An Incomplete Terminology Guide” by Buterin on the Ethereum blog in 2014 explains much of the related terminology, although it didn’t include the term “DApp”. The “yellow paper” by Gavin Wood in 2016 provides more formal specs for how Ethereum runs. Answers posted online also give good descriptions of DApps, and the prediction market Augur is one popular example. State of the DApps provides statistics about available DApps, by category. It’s fascinating to see what gets listed there, especially the work in progress that may never result in finished products. Notice how, when compared with the numbers on Apple’s App Store, there’s still quite a greenfield for DApps.

You may also run across TCRs: Mike Goldin wrote the “Token-Curated Registries 1.0” article in 2017, along with a public GitHub repo as a reference implementation. To wit,

“Token-curated registries are decentrally-curated lists with intrinsic economic incentives for token holders to curate the list’s contents judiciously.”

In other words, a TCR is a curated list which can be monetized for those who provide content. TCRs are built atop smart contracts, where a DApp creates the UX needed to reach a broader audience of customers. Pulling these pieces together and going up one more level of abstraction, a decentralized data market can be based on DApps which handle transactions among buyers and sellers of datasets, where the datasets are defined by TCRs. See the article “Tokenized Data Markets” for more details.
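To make the incentive mechanics a bit more concrete, here is a minimal in-memory sketch of a token-curated registry in Python. It is my own toy, not Goldin’s reference implementation: the class and method names (`ToyRegistry`, `apply`, `challenge`, `vote`, `resolve`) are hypothetical, votes are cast in the open rather than through any commit-and-reveal scheme, there are no time windows, and ties are assumed to favor the applicant.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Listing:
    owner: str
    deposit: int
    status: str = "pending"            # becomes "listed" or "rejected"
    challenger: Optional[str] = None
    votes_keep: int = 0
    votes_remove: int = 0
    voters: set = field(default_factory=set)


class ToyRegistry:
    """In-memory toy registry; every name here is hypothetical."""

    def __init__(self, min_deposit, balances):
        self.min_deposit = min_deposit
        self.balances = dict(balances)   # token holder -> token count
        self.listings = {}               # entry name -> Listing

    def _stake(self, who, amount):
        if self.balances.get(who, 0) < amount:
            raise ValueError(f"{who} lacks {amount} tokens")
        self.balances[who] -= amount

    def apply(self, owner, name, deposit):
        if deposit < self.min_deposit or name in self.listings:
            raise ValueError("application refused")
        self._stake(owner, deposit)
        self.listings[name] = Listing(owner, deposit)

    def challenge(self, challenger, name):
        listing = self.listings[name]
        if listing.challenger is not None:
            raise ValueError("already challenged")
        self._stake(challenger, listing.deposit)   # match the owner's stake
        listing.challenger = challenger

    def vote(self, voter, name, keep):
        listing = self.listings[name]
        if voter in listing.voters:
            raise ValueError("already voted")
        listing.voters.add(voter)
        weight = self.balances.get(voter, 0)       # token-weighted vote
        if keep:
            listing.votes_keep += weight
        else:
            listing.votes_remove += weight

    def resolve(self, name):
        listing = self.listings[name]
        if listing.challenger is None:
            listing.status = "listed"
        elif listing.votes_keep >= listing.votes_remove:   # ties keep the entry
            # Challenge failed: the owner collects the challenger's stake.
            self.balances[listing.owner] += listing.deposit
            listing.status = "listed"
        else:
            # Challenge succeeded: the challenger collects both stakes.
            self.balances[listing.challenger] += 2 * listing.deposit
            listing.deposit = 0
            listing.status = "rejected"
        listing.challenger = None
        return listing.status


registry = ToyRegistry(
    min_deposit=10,
    balances={"alice": 50, "bob": 50, "carol": 30, "dave": 5},
)
registry.apply("alice", "weather-dataset", deposit=10)
registry.challenge("bob", "weather-dataset")
registry.vote("carol", "weather-dataset", keep=True)
registry.vote("dave", "weather-dataset", keep=False)
outcome = registry.resolve("weather-dataset")
print(outcome, registry.balances)
```

The point of the sketch is the economics: applying and challenging both cost tokens, so a frivolous application or a frivolous challenge is a way to lose money, which is what nudges token holders to curate judiciously.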

To recap:

  • decentralized: not stored all in one place, nor controlled by any one party
  • blockchain: blocks of data which are decentralized and cryptographically linked, providing a distributed ledger; for example, Bitcoin or Ethereum
  • smart contract: immutable code stored on a blockchain which can digitally “execute” the terms of a contract to perform some defined transactions
  • TCR: a token-curated registry, i.e., a curated list where changes to the list are managed by the transactions defined by a set of smart contracts
  • DApp: a decentralized application which creates a UX for leveraging a set of smart contracts, such as a TCR
  • decentralized data market: a data market which connects buyers and sellers of datasets using DApps, where the datasets are defined on TCRs
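The “cryptographically linked” part of that blockchain definition is easier to see in code. Here is a minimal Python sketch, under heavy assumptions: it only shows hash-linking, where each block records the hash of the block before it. It leaves out everything that makes a real chain decentralized (peers, consensus, mining or staking), and the function names are my own.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's full contents, including its link to the previous block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, data):
    """Add a block that points at the hash of the current last block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "data": data, "prev_hash": prev_hash})


def is_valid(chain):
    """Check that every block still points at the true hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for entry in ["alice pays bob 5", "bob pays carol 2", "carol pays dave 1"]:
    append_block(chain, entry)

valid_before = is_valid(chain)

# Quietly rewrite history in the middle of the ledger.
chain[1]["data"] = "bob pays mallory 200"
valid_after = is_valid(chain)

print(valid_before, valid_after)
```

Editing an old block changes its hash, so the next block’s `prev_hash` no longer matches and the tampering shows up on inspection. That is the property the rest of the stack leans on.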

Okay, that’s a good starting point for building decentralized data markets.

What’s under the hood? How does this new technology work? My first impression from trying to read about blockchain, cryptocurrencies, etc., was that there’s a ton of new verbiage introduced, lots of complex papers to read and absorb, but a dearth of “A connects to B, which produces C” explanation. Here goes.

For those who have less coding experience… A smart contract is a small computer program written for a virtual machine, such as the one Ethereum defines. If you’ve heard of the Java programming language, it is similarly based on a virtual machine. Code for a smart contract gets put on the blockchain, and when run during a transaction, it operates on data contained in that transaction. For public blockchains, these transactions have built-in transparency and accountability, and (in theory) clear ways to address disputes.

For those who have more coding experience… Ethereum defines a virtual machine (EVM), and the bytecode which runs in the EVM in turn lives on the Ethereum blockchain. You can write source code for this in Solidity, using tools such as the Remix IDE to generate the bytecode. You can write other code to interact with it via Web3, such as the Web3.js API for JavaScript.
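For a rough feel of “bytecode running in a virtual machine, operating on transaction data”, here is a toy stack machine in Python. To be clear about the assumptions: the four opcodes, their gas costs, and the `run` function are invented for illustration and are not real EVM opcodes or prices. What the sketch shares with the EVM is the general shape: a stack-based interpreter, a fixed program, inputs supplied by the transaction, and a gas meter that halts execution when the budget runs out.

```python
# Hypothetical opcodes and gas prices, for illustration only.
GAS_COST = {"PUSH": 1, "LOAD": 2, "ADD": 1, "MUL": 2}


def run(program, tx_data, gas_limit):
    """Execute (opcode, argument) pairs; return (result, gas_used)."""
    stack = []
    gas_used = 0
    for op, arg in program:
        gas_used += GAS_COST[op]
        if gas_used > gas_limit:
            raise RuntimeError("out of gas")
        if op == "PUSH":            # push a constant baked into the program
            stack.append(arg)
        elif op == "LOAD":          # push a value supplied by the transaction
            stack.append(tx_data[arg])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop(), gas_used


# A "deployed" program is a tuple, so it cannot be edited in place.
# It computes: quantity * 3 + fee
PRICE_CONTRACT = (
    ("LOAD", "quantity"),
    ("PUSH", 3),
    ("MUL", None),
    ("LOAD", "fee"),
    ("ADD", None),
)

result, gas_used = run(PRICE_CONTRACT, {"quantity": 4, "fee": 2}, gas_limit=20)
print(result, gas_used)
```

In practice you would write Solidity and let a compiler produce the bytecode; the sketch only illustrates why every instruction has a price and why the caller has to bring enough gas.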

[Figure: example source code for smart contracts, written in Solidity]

Here’s a nuance: EVM bytecode on the blockchain is immutable. Once deployed, those smart contracts don’t change — not without lots of problems, most of which probably incur too much risk or cost. Your smart contracts may be responsible for determining the flow of lots of money … however, forget about debugging, patching, iterating, etc.

Any way you slice it, get the code for your smart contracts right before you launch or you’re opening up to potential attacks which can cause you to lose lots of that aforementioned money. Sure, there are testnets for Ethereum where you can try to debug a DApp, but once deployed “Agile” is not an appropriate process.

Oddly, this is sort of an apotheosis for “containers” and “light-weight virtualization” if you glance quickly from just the proper angle. Note that library calls in the EVM bytecode may require hand edits to fix their “addresses” — somewhat like manually linking post-assembler object code, if “serverless” had been invented circa 1985. Caveat haquer.

Great, so far we’ve been talking about a thing that’s painfully slow, expensive, the “property” is public anywho, and oh yeah one silly mistake may lose millions. That all sounds rather negative, eh?

While these characterizations may cause some to run screaming, there’s a good analogy found in hardware design. Building chips is tricky business. Large amounts of capital expenditures go into each project, way up front. While a contemporary fab may cost billions, and be staffed by hundreds of people, for most chip designs there’s no guarantee it’ll run correctly until physical chips come out of the fab that, well, run correctly. Large chip manufacturers may even outsource initial tests (called “first silicon”) to specialty consulting firms armed with logic probes and nerves of steel. That work is hard and entails a crazy amount of risk.

Even so, once a chip gets designed and built correctly, it becomes much cheaper/faster/better to scale production. Once that chip gets designed into popular use cases, it becomes a nifty way to earn money mostly by cranking out more copies. Which is what NVidia seems to be doing right about now with GPUs vis-a-vis deep learning, but I digress.

To summarize, hardware involves lots of capex up front, lots of wait-and-see, lots of risk; however, as soon as the fab begins manufacturing your chips correctly, that risky business all becomes a vehicle for lots of commerce [read: profit]. Which is all fine and dandy in the world of business since that kind of risk is about possibly losing money, not loads of people getting run over by trucks. Or something.

Back to analogies, the process for programming smart contracts and developing TCRs is more akin to hardware (especially the dark arts of writing microprocessor firmware, which once upon a time was my day job) than it is to developing websites. Even so, that analogy isn’t perfect. Contemporary microprocessor chips run blindingly fast, while Ethereum … not so much, not quite yet. Many details, such as scaling and transaction rates, still need to get worked out.

For example, do you want to store your actual data on a blockchain? Or would it be better to use the blockchain to store pointers to your dataset? Hint: if your dataset is larger than megabytes, transaction rates and computing power become woefully problematic. A fundamental trade-off between transaction throughput and the degree of centralization is explored in “Blockchains don’t scale. Not today, at least. But there’s hope.” by Preethi Kasireddy.
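Some back-of-the-envelope arithmetic shows why. The sketch below assumes the long-standing Ethereum fee of 20,000 gas to write one new 32-byte word of contract storage; the fee schedule has been adjusted by network upgrades, so treat that as a ballpark. The gas price (10 gwei) and the block gas limit (8,000,000) are purely illustrative assumptions on my part, and the function names are my own.

```python
import math

GAS_PER_NEW_WORD = 20_000   # gas to write one new 32-byte storage word (ballpark)
WORD_BYTES = 32


def storage_gas(num_bytes):
    """Gas needed to write num_bytes into fresh contract storage."""
    words = math.ceil(num_bytes / WORD_BYTES)
    return words * GAS_PER_NEW_WORD


def cost_in_ether(gas, gas_price_gwei):
    """Convert a gas amount into ether at a given gas price (1 ether = 1e9 gwei)."""
    return gas * gas_price_gwei / 1e9


one_mebibyte = 1024 * 1024
gas = storage_gas(one_mebibyte)

# Illustrative assumptions, not current network values.
assumed_gas_price_gwei = 10
assumed_block_gas_limit = 8_000_000

ether = cost_in_ether(gas, assumed_gas_price_gwei)
blocks = math.ceil(gas / assumed_block_gas_limit)

print(f"{gas:,} gas, about {ether:.4f} ether, spread over at least {blocks} full blocks")
```

Under those assumptions a single mebibyte needs 655,360,000 gas, which is several ether and dozens of blocks filled with nothing but your data. Multiply by a few thousand for a modest dataset and the “store pointers” option starts to look very attractive.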

Going with the “store pointers” option, consider that you might use a decentralized data market to store addresses to Amazon S3 buckets, which in turn hold large datasets, which are relatively cost-effective and simple to access. One trade-off, in that case, would be whether or not you trust Amazon to manage the data. As a transnational corporation with market capitalization trending toward a trillion dollars and beyond, ostensibly it is more centralized than Ethereum? That’s a good question. In many use cases, S3 and similar cloud services may be the best choices available.

But are there other ways to dial that knob between fully decentralized and fully centralized? Characteristic properties of datasets such as privacy, storage, reliability, verification, etc., almost all have gradients for decentralization, ways to dial that knob up or down. As an alternative to cloud storage APIs, IPFS is an example of decentralized storage: a peer-to-peer, content-addressed file system that is often used alongside blockchains, although it is not itself a blockchain. However, taking other factors such as cost, privacy, reliability, etc., into consideration, which alternative fits best for your intended use case? The crystal ball says many interesting vendors will emerge to explore this solution space.
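One way to see how storage itself can move along that knob is content addressing, the idea behind IPFS-style systems: data is requested by a hash of its contents rather than by a location, so any peer holding the bytes can serve them. The Python sketch below is a toy under stated assumptions. Real IPFS splits files into chunks, links them in a Merkle DAG, and uses self-describing content identifiers; here a bare SHA-256 hex digest stands in for the address, and `ContentStore` and `fetch` are names I made up.

```python
import hashlib


class ContentStore:
    """One peer's local store, keyed by the hash of each blob."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address):
        data = self._blobs[address]          # KeyError if this peer lacks it
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored bytes do not match their address")
        return data


def fetch(address, peers):
    """Ask each peer in turn; accept the first copy that matches the address."""
    for peer in peers:
        try:
            return peer.get(address)
        except (KeyError, ValueError):
            continue
    raise LookupError("no peer holds valid data for this address")


node_a, node_b, node_c = ContentStore(), ContentStore(), ContentStore()
dataset = b"gene,expression\nBRCA1,0.42\n"

address = node_b.put(dataset)
same_address = node_c.put(dataset)      # identical content, identical address

# node_a never stored the data, so the lookup falls through to node_b.
copy = fetch(address, [node_a, node_b, node_c])
print(address == same_address, copy == dataset)
```

Because the address is derived from the content, there is no single host to trust or to lose. The trade-off is that someone still has to keep a copy online, which is where cost and reliability come back into the picture.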

When it comes to smart contracts, TCRs, DApps, etc., there are plenty of open questions to resolve, better crypto methods to research, use cases to identify, code to audit, best practices to pioneer, etc. Meanwhile, there are compelling reasons for this technology, and we’ll look at examples below.

Translated: people working in this field have lots of interesting work ahead!

While many rush toward blockchain, cryptocurrencies, etc., almost solely as investment vehicles [read: speculation, and possibly “fluff”], smart contracts provide interesting ways for multiple parties to participate together in transactions.

Clearly, those are not the kind of transactions where your awesome-sauce React mobile web app updates a password field in some row of a table in a MySQL relational database. In other words, while mobile apps typically rely on fast transactions performed on relatively cost-effective databases, DApps are utterly different creatures. The important distinction is that smart contracts serve use cases where commits are relatively slow and the database is quite expensive. Intentionally slow and expensive.

But tell me, why on Earth would anyone ever conceivably want to do that?!?

Consider how real estate escrow lends a helpful analogy. When you buy a home there are more than two parties involved: generally, both the buyer and the seller have agents representing them. Both probably have different banks. Credit agencies (3-ish) get involved when there’s a loan. Perhaps there’s a second mortgage or other liens too? Then a whole range of governmental agencies become engaged, at all levels (federal, state, county, city), to make certain that zoning laws and building codes are properly enforced, ownership transfers correctly, taxes get assessed and paid, etc. Depending on where the property is located, a host of agency inspectors or independent contractors go out to check the property and report back about the state of the home’s foundation, pest management, risk of flood damage, risk of fires, risk of earthquake damage, etc. Legal firms may get involved, e.g., tax attorneys, other advisors, trustees, etc. Just to add more fun, an entirely different bank may buy out your loan within days (minutes? seconds?) of the transaction settling. Surprise!

Meanwhile, that whole tangle described above typically runs through an escrow company, staffed by friendly people who love to triple-check each and every signature or initial and date on each and every page. For a significant fee. Afterwards, parts of the “transaction” find their way into the public record.

Good luck getting your home purchase to settle in less than weeks. Why? Because lots of people and organizations (I’m counting about a dozen, at a minimum, for a typical residential real estate transaction) have skin in the game. Everyone involved must make sure that absolutely nothing is horked when the transaction finally “commits”, or else they may be the ones who get fined, incur additional fees, or otherwise lose money. LOTS of money. The alternatives aren’t good: lawsuits fly, people lose their licenses, etc. Because of shared risks, like fires. Or buildings collapsing. Or power lines which don’t have access for repair crews. Risks which are regulated, for good reasons.
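The escrow pattern can be boiled down to a few lines of Python, as a sketch only: nothing commits until every party with skin in the game has signed off, and every step lands in a record that anyone can read. The `Escrow` class, its methods, and the party names are hypothetical, and a real smart contract would also move funds and check cryptographic signatures, which this toy skips.

```python
class Escrow:
    """Toy multi-party escrow: commit only after every required party approves."""

    def __init__(self, required_parties):
        self.required = frozenset(required_parties)
        self.approvals = set()
        self.committed = False
        self.public_record = []      # append-only log of what happened

    def approve(self, party):
        if party not in self.required:
            raise ValueError(f"{party} is not a party to this transaction")
        if self.committed:
            raise RuntimeError("transaction already committed")
        self.approvals.add(party)
        self.public_record.append(f"{party} approved")

    def commit(self):
        missing = self.required - self.approvals
        if missing:
            self.public_record.append(
                "commit refused, waiting on: " + ", ".join(sorted(missing))
            )
            return False
        self.committed = True
        self.public_record.append("committed")
        return True


sale = Escrow(["buyer", "seller", "buyer_bank", "county_recorder", "inspector"])

sale.approve("buyer")
sale.approve("seller")
too_early = sale.commit()            # three parties have not signed yet

for party in ["buyer_bank", "county_recorder", "inspector"]:
    sale.approve(party)
settled = sale.commit()

print(too_early, settled, len(sale.public_record))
```

Slow is the point here: the commit waits on the slowest required party, and the record of who approved what is there for everyone to inspect afterwards.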

In short, while you may speculate by acquiring real estate or by day-trading stocks, most likely you don’t want the title for your resort condo to transfer ownership as quickly as shares in your E*TRADE portfolio. If you do, call me — got some beach property near the Everglades to, um, offer…

Okay, we’ve established how there are good reasons why we’d need transactions based on “painfully slow commits within an extremely expensive database”, where results go into the public record. How does this translate into examples of decentralized data markets?

Here’s a question for discussion … How do you feel about self-driving automobiles? How about self-driving school buses, or self-driving fully loaded semi-truck-and-trailers? Do you trust any particular corporation to build highly proprietary machine learning models for the myriad of sensors required in those bleeding edge AI-based vehicles? Will that same corporation also get their AI vehicles to work 100% correctly the first time out on the roads in production? Because that’s a lot of risk. Shared risk.

When we talk through concerns about self-driving vehicles, we’re often talking about centralized risks. Imagine that Google takes self-driving vehicles to market first. Suppose they build a cost-effective self-driving car with hundreds of smart sensors, loads of AI — and the company has invested years to put together huge proprietary datasets to train all those machine learning models. In that case, we’re centralizing a ginormous amount of risk within Google, although if all goes well they’ll reap ginormous rewards through profits.

Perhaps though, for some definition of “ginormous”, there isn’t any good balance of intellectual property, risks, profits all centralized within one corporation? In our looming future, we’ll be getting just that — except that the centralized risks will be Google multiplied by Uber, Ford, GM, Daimler, Volkswagen, Renault, Ferrari, Toyota, Hyundai, plus a few dozen Chinese brands that may not have names yet. Each with centralized intellectual property for training ML models such that their cars never ever make horrible, unimaginable mistakes which squish lots of people.

Let me know how that’s working out for you. Or, rather, us.

Frankly, I’d feel a lot more comfortable sending my kids off to school in a self-driving bus if the machine learning models hadn’t been trained solely on Google’s proprietary data. Instead, let’s get every possible edge case understood by mingling Google’s training data with that from the other manufacturers. Instead of just California state regulators working on “oversight”, let’s include US federal regulators, EU regulators, China’s national regulators, and so on. BTW, this isn’t intended to pick on Google, although they are a likely subject for much of this conjecture. Let’s create decentralized datasets for training critical ML models such that no single party could obscure that level of risk, regardless of how much safety testing they invest in their proprietary products.

Let’s have a decentralized data market for those kinds of datasets, such that mission-critical ML models can be trained on data which many different auditors and other third parties can scrutinize. That way, Google and their unicorn friends can still receive their fair share of the rewards, and frankly they’ll get better data to build better products anyway. Meanwhile, the rest of us may enjoy more peace of mind when crossing the street.

Granted, this level of data infrastructure will require more than smart contracts and DApps running on Ethereum. It’ll require much work at the leading edge of economics, game theory, etc., and it’s what we must do, collectively, to ensure a safer, saner future. Speaking from a data science perspective, I’m painfully aware that collecting and curating large datasets correctly, meaning error-free datasets, is quite difficult, even for the most expert teams. I don’t inherently trust any one corporation to get that right, let alone the siloed actions of each of the several manufacturers who’ll send self-driving vehicles out on the roads in quantity. After all, several of these firms are valued in the hundreds of billions of dollars or more, with loads of competing interests.

That’s one place where decentralized data markets make sense. Cancer research is another, and genomics research with anonymized data is compelling in general. Online fraud and other forms of cyberthreats come to mind as large, difficult, risky propositions — especially with high-stakes elections looming — which no one corporation should be entrusted to “get right” based on proprietary methods. Credit scores are another concern: there’s a whole slew of issues where machine learning meets ethics meets racial bias, at scale, and those matters must be fixed ASAP.

In my opinion, that design pattern — replacing proprietary datasets which centralize risk with decentralized data markets — applies in almost every regulated industry. For each troubling case involving the potential (mis)management of datasets, I’d much rather get “multiple eyes on target” — participation by multiple/competing vendors, plus their oversight agencies, along with watchdog media teams overseeing the overseers, plus the public. That could help make both the shared risks and their corresponding rewards more transparent and accountable.

Chatting briefly with an interested MP recently, I have a hunch that EU regulators may be discussing similar notions. Large initiatives are already forcing us to come to terms with “shared risks” vs. “technology unicorns having free rein” vs. “ethics and polity” vs. “ways to use technology more effectively”. Translated: GDPR was a warm-up round. As with GDPR, California regulators probably won’t be far behind, and the rest of the US typically follows the West Coast in these matters.

Hopefully this article introduced some basic components (the “what”) involved in creating decentralized data markets. Plus, there’s plenty of reading material linked, if you want deep-dives on particular points.

The question now is: Do the kinds of use cases mentioned above (the “why”) resonate with your needs and perspectives? Let us know.

Stay tuned for more articles about building (the “how”) these components from open source, plus example use cases.

Kudos to the @ComputableLabs team; the ideas presented here come from many discussions there.