How A Trustless Data Lakehouse Will Transform Your Web3 Data Strategy

5 min read · Mar 11, 2024

We need to talk about trust.

Because ‘Trust me, bro’ isn’t a great strategy for your data management.

Blockchains are a trust-minimized computing paradigm, and the promise of a decentralized web is what attracted developers to build an ecosystem that was free of intermediaries. It was one where everyone could participate in a digital economy powered by open protocols around payments, financial instruments (DeFi), digital property rights (NFTs), and community engagement (DAOs).

But there’s an elephant in the room that needs to be addressed — the data platforms powering many dApps are, in fact, not very decentralized. This irony is not lost on most developers, but before we get into that, let’s recap.

There’s a growing need for trust-minimization

In our last post, we wrote about how Web3 lacks deep personalization even though the open nature of blockchains makes them a treasure trove of data-driven insights. With the right data processing engines, like Hyperline, product teams can easily find trading opportunities, detect anomalies and fraud, make quick and informed decisions, and build products with high engagement.

But enabling such personalization is still hard because developers need to host nodes, combine data from disparate vendors, and invest heavily in dedicated data teams to even get started. And so, we ended our last post with a few questions:

  1. Why should product teams be forced to trust the data they source from different vendors?
  2. Shouldn’t developers be able to build apps faster using insights from a shared data layer?

The root of the problem lies in the fact that when applications built on the foundational premise of Web3 rely on data from vendors, they run into issues that can compromise their integrity, security, and value proposition.

Traditional platforms provide data integrity and security through rigorous internal practices and regulatory compliance. Most of the data platforms being built in Web3 have relied on these same approaches, which ultimately require some level of subjectivity. But in an open ecosystem powered by blockchains, why do we need to rely on centralized and subjective evaluations when, at its core, on-chain data should be public and trustless?

Unlike traditional systems, Web3 is characterized by often anonymous, distributed, and untrusted participants around the world. This is primarily why decentralized blockchains like Bitcoin and Ethereum are designed on the principles of trust-minimized computing and provide strong security guarantees by building a Byzantine-Fault-Tolerant state-machine. This isn’t the only approach though — other models of trust-minimized computing like encryption, zero-knowledge proofs, watermarking, and secure enclaves have been evolving fast in the last few years.

But while this secures on-chain data, offline data processing still relies on third-party vendors to collect, normalize, and expose this data. Most applications, including wallets, don’t run their own nodes and instead rely on third-party node providers to enable their complete functionality.

The risks of using centralized off-chain data sources

Naturally, off-chain data that’s not trust-minimized poses a lot of problems:

Data Integrity Issues:

Trust-minimizing protocols make sure that data remains unchanged and accurate, right from the source to the destination. Without these protocols, there’s a risk of data being manipulated, even with good intentions, by the centralized providers that collect and serve it. Data provenance will always be a critical challenge when building any data product, but Web3 benefits by having its data on a public ledger, and you shouldn’t compromise on those strong guarantees with your current vendors.

These challenges result in errors and compromise the reliability of the application, leading to a long-term impact on utility, user trust, and loyalty.
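To make "unchanged from the source to the destination" concrete, here is a minimal sketch of one generic check: hash a record the same way on both ends and compare the digests. The record fields, the JSON encoding, and the function names are our own assumptions for illustration, and real chains use their own hashing and encoding rules.

```python
import hashlib
import hmac
import json


def canonical_digest(record: dict) -> str:
    """Return a SHA-256 hex digest of a record using a stable JSON encoding."""
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def matches_trusted_digest(record: dict, trusted_digest: str) -> bool:
    """True only if the vendor-supplied record hashes to the trusted digest."""
    return hmac.compare_digest(canonical_digest(record), trusted_digest)


# Hypothetical transfer record as it looked at the source.
source_record = {"block": 19000000, "from": "0xabc", "to": "0xdef", "value": 5}
trusted = canonical_digest(source_record)

# The same record after a vendor changed one field along the way.
vendor_record = dict(source_record, value=50)

print(matches_trusted_digest(source_record, trusted))  # True
print(matches_trusted_digest(vendor_record, trusted))  # False
```

The check is only as strong as the place the trusted digest comes from, which is exactly why the provenance of that digest matters.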

Security Breaches:

This is obvious but worth stating: centralized systems are desirable targets for cyberattacks. A breach can expose sensitive user data, cause financial losses, or let an attacker manipulate app data. As more and more applications build on these data sources, they become larger targets for adversarial attacks. Decentralized systems, on the other hand, distribute data across numerous nodes, making it significantly harder for attackers to compromise the system completely.

Single Point of Failure:

Platforms that are based partially on centralized data have a higher risk of becoming bottlenecks and single points of failure. If the data source provider experiences downtime or technical issues, this can cripple all dependent functionality, causing service disruptions that affect the user experience.

Censorship and Access Control:

This goes back to the vision of an open protocol and why the promise of Web3 was different. Centralized platforms can restrict or modify data access either due to regulatory pressures or their own changing policies. Not only does this limit data availability, it also undermines the permissionless nature of Web3.

We believe that a unified data layer serves everyone.

At Hyperline, we believe that offline data processing should also follow trust-minimization practices wherever possible.

While one way to build confidence in your blockchain data is by running your own nodes that participate in the network, this doesn’t eliminate all of your data integrity concerns. Relying on any individual node is risky because even an honest one is prone to network partitions, where part of the network carries a different view of the state of on-chain data.
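One simple way to reduce dependence on a single node's view is to ask several independent sources for the same block and only accept a hash that enough of them agree on. The sketch below assumes hypothetical source names and hash values, and it is a generic illustration, not a description of how any particular platform does this.

```python
from collections import Counter


def agreed_block_hash(reports: dict[str, str], quorum: int) -> str:
    """Return the block hash reported by at least `quorum` sources.

    `reports` maps a source name to the block hash it returned for one height.
    Raises ValueError when no hash reaches the quorum, for example during a
    partition or when a source serves stale or altered data.
    """
    if not reports:
        raise ValueError("no reports to compare")
    block_hash, votes = Counter(reports.values()).most_common(1)[0]
    if votes < quorum:
        raise ValueError(f"no quorum: best hash has {votes} of {len(reports)} votes")
    return block_hash


# Hypothetical responses for the same block height.
reports = {
    "own_node": "0xaaa",
    "provider_a": "0xaaa",
    "provider_b": "0xbbb",  # partitioned or lagging
}

print(agreed_block_hash(reports, quorum=2))  # 0xaaa
```

Pick a quorum greater than half the number of sources so that two different hashes can never both qualify. Agreement among sources raises confidence but is not a cryptographic proof.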

So what’s the solution? It’s what we’re calling a Trustless Data Lakehouse: a platform that provides shared infrastructure for building trust-minimized, on-chain data products.

Our vision is to allow end users to cryptographically verify consensus data and build mission-critical applications without relying on centralized data vendors when possible. And all of this is made available through a robust compute platform that saves you the time and effort of standing up your own complicated data platform.
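As a general illustration of what "cryptographically verify" can mean, one widely used building block is a Merkle inclusion proof: given a trusted root hash, anyone can check that a specific item belongs to the committed set without trusting whoever served it. The sketch below is a toy SHA-256 tree with made-up transaction bytes. It is not Hyperline's design and it is not production-grade (for example, it has no leaf/node domain separation).

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level: list[bytes]) -> list[bytes]:
    """Hash adjacent pairs; `level` must have even length."""
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: list[bytes]) -> bytes:
    """Root of a simple binary Merkle tree; an odd node is paired with itself."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the bool is True if the sibling is on the right."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from the leaf and compare it with the trusted root."""
    node = h(leaf)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root


# Hypothetical transaction payloads.
txs = [b"tx-a", b"tx-b", b"tx-c"]
root = merkle_root(txs)
proof = merkle_proof(txs, 2)

print(verify(b"tx-c", proof, root))  # True
print(verify(b"tx-x", proof, root))  # False
```

The verifier needs only the item, a short list of sibling hashes, and a root it already trusts, so the party serving the data does not have to be trusted.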

How does it work? That’s a topic for a future post.

In the meantime, imagine a world where you could focus completely on your product without worrying about the overheads of your data strategy. What would you build?

PS: Curious to know more about what we’re building at Hyperline? Get in touch and schedule a demo!