Data Lakes Explained

Comparing Data Lakes and Subgraphs

PARSIQ
6 min read · Nov 19, 2022

Welcome to the second post in the series Data Lakes Explained. In the first post, we explained what Data Lakes are and how they complement and extend the scope of our Tsunami API.

There, we learned that Data Lakes are highly customizable pools that extract and process specific sets of data from a blockchain, allowing developers to query that data as easily and efficiently as possible. The kind of custom, concrete data support for Web3 offered by Data Lakes isn't available anywhere else on the market. We're proud to be offering something that is at once unique and essential.

However, this isn't to say there aren't alternatives on the market that share important similarities with Data Lakes. For example, The Graph's Subgraphs also serve the purpose of collecting specific sets of data.

So, you might be wondering things like…

What’s the difference between Data Lakes and Subgraphs? What makes Data Lakes unique? Why opt for a Data Lake?

These are great questions! And this post is designed to address them head on.

So, let’s take another dip into Data Lakes!

Comparing Data Lakes and Subgraphs

While there are interesting similarities between Data Lakes and Subgraphs, there are also crucial points of difference. Chief among them is that The Graph offers a decentralized solution, while PARSIQ's is centralized. Of course, a central tenet of the blockchain ethos is decentralization. What benefit is there, then, in providing a centralized solution?

We agree that decentralization is a vital aspect of Web3 technology, containing much of its revolutionary core. So, we would certainly not deny the importance of decentralization. Yet, at the same time, we do not believe that a commitment to decentralized technology means that everything related to Web3 must be decentralized. Efficiency should not be sacrificed merely for the sake of an attachment to a concept.

Oftentimes, what is called for is patient reflection on the needs of the technology moving forward. Currently, Web3 data needs workable — accessible, usable, and easy — solutions, and not just decentralized ones. It’s our conviction that a proper balance between decentralized and centralized solutions will be key for ushering in widespread adoption of blockchain technology.

It is also important for us to be clear that all of the data we provide has already been validated on chain, so you can rest assured of its accuracy. We recognize the importance of decentralization, which is why we're working hard to preserve its revolutionary character. To maintain our commitment to the ethos of decentralization, we'll be bringing a Proof-of-Consistency mechanism into our offerings, assuring our clients that the data remains true and accurate, just as it is on chain.

Having said this, what benefit does centralization offer in this case?

Let’s consider three reasons.

First, there are the issues of speed and efficiency.

Similar to our Data Lakes, The Graph provides platforms with Subgraphs that host only the data relevant to that platform. Yet, in order to do this, their decentralized solution involves coordinating multiple types and sources of data with one another, requiring constant points of connection between, for example, indexers and nodes.

With this type of solution, developers must frequently resync their nodes and reindex their data, in order to ensure that everything is accurate and up to date. All of this takes valuable time and resources.

With Data Lakes, you'll never have to worry about roadblocks standing between you and the data you're looking for. We've already indexed all of the data for you — going back to the genesis block! It is simply there waiting for you, saving you time and resources that are better spent achieving your goals.
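To give a rough sense of what this looks like from a developer's seat, here is a minimal sketch of fetching pre-indexed historical data over HTTP. The endpoint, path, query parameters, and record shape below are placeholders invented for illustration; they are not the actual Tsunami API or Data Lake interface (see https://network-docs.parsiq.net/ for the real documentation).

```typescript
// Minimal sketch of querying pre-indexed Data Lake data over HTTP.
// The base URL, path, and parameters are hypothetical placeholders,
// not the actual PARSIQ API surface.

interface TransferRecord {
  blockNumber: number;
  txHash: string;
  from: string;
  to: string;
  value: string; // raw token amount as a decimal string
}

async function fetchTransfers(
  apiKey: string,
  contract: string,
  fromBlock: number,
  toBlock: number
): Promise<TransferRecord[]> {
  // Because every block from genesis onward is assumed to be indexed already,
  // a single range query returns historical data directly; no local node,
  // no syncing, no re-indexing.
  const url =
    `https://datalake.example.com/v1/transfers` +
    `?contract=${contract}&from_block=${fromBlock}&to_block=${toBlock}`;

  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });

  if (!response.ok) {
    throw new Error(`Data Lake query failed: ${response.status}`);
  }
  return (await response.json()) as TransferRecord[];
}

// Usage (illustrative): pull every transfer for a contract across a block range.
// const transfers = await fetchTransfers(API_KEY, "0xContract...", 0, 16000000);
```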

Second, there is the issue of reliability.

Decentralized solutions (like Subgraphs) are not always as reliable as our Data Lakes. For instance, it is up to indexers to keep a Subgraph running efficiently; they benefit by earning rewards in the process. But what happens when those indexers decide to move to another, more lucrative Subgraph? The consistency and reliability of the data is lost!

With Data Lakes, teams can rest assured that the data they need will always be ready and available, instantly accessible! In addition, PARSIQ Network's Data Lakes are fault tolerant. This means that if a Data Lake were ever to cease functioning (however unlikely), it could be brought back online, resuming from its previous state and restoring any missing blocks or other data.

Everything about this process is automated, so clients won't need to worry about the state of their Data Lake(s). Even more than this, if there are ever glitches or technical errors — say, if 10 blocks dropped off the chain — then we could easily make the necessary calculations, providing reliable data in their place with little to no effort.
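To illustrate the idea, the sketch below shows one way automated gap detection and backfill could work in principle. The BlockStore interface and both helper functions are hypothetical; this is a conceptual example of the "find the missing blocks, then restore them" logic described above, not PARSIQ's actual implementation.

```typescript
// Conceptual sketch of automated gap detection and backfill for an indexed
// block store. The BlockStore interface is a hypothetical abstraction.

interface BlockStore {
  // Returns the block numbers currently present in the given range.
  listIndexedBlocks(fromBlock: number, toBlock: number): Promise<number[]>;
  // Re-fetches a block from the chain and writes it back into the store.
  reindexBlock(blockNumber: number): Promise<void>;
}

// Find block numbers missing from the store in [fromBlock, toBlock].
async function findMissingBlocks(
  store: BlockStore,
  fromBlock: number,
  toBlock: number
): Promise<number[]> {
  const present = new Set(await store.listIndexedBlocks(fromBlock, toBlock));
  const missing: number[] = [];
  for (let n = fromBlock; n <= toBlock; n++) {
    if (!present.has(n)) missing.push(n);
  }
  return missing;
}

// Detect and repair gaps, e.g. 10 blocks dropped during an outage.
async function backfillGaps(
  store: BlockStore,
  fromBlock: number,
  toBlock: number
): Promise<void> {
  const missing = await findMissingBlocks(store, fromBlock, toBlock);
  for (const blockNumber of missing) {
    await store.reindexBlock(blockNumber); // restore each missing block
  }
}
```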

Third, there is the issue of customizability.

Unlike Subgraphs, Data Lakes do more than just extract, collect, and process data — as we’ve noted in the previous post in this series, they also offer a high degree of customization! The real payoff of this customization is not just that it makes jobs easier for developers on the back end (and it does do that!); what it also does is allow for a better, more frictionless experience for users on the front end.

Again, request response times are not always consistent when working with nodes — as any Web3 developer would confirm, nodes tend to drop from time to time, making them neither the easiest nor the most reliable source of data. This is not to suggest that nodes are unimportant! Rather, it is only to point out that, when looking at the case at hand (i.e., ease for both back-end and front-end development), they are not the top choice.

Yes, decentralization is an extremely powerful innovation. But this does not mean it needs to be leveraged to solve every problem faced on the blockchain. The type of infrastructure required to support a highly refined user experience must be not just as fast as possible, but also as simple and as reliable as possible. And this is exactly what PARSIQ offers with our Data Lakes.

For a more detailed look at how the customization of Data Lakes is a game changer for business logic, check out this post about how PARSIQ Enables Businesses to Build on the Blockchain.

These are all important differences to consider when comparing Data Lakes with other similar solutions, such as Subgraphs. To sum it all up, when it comes to speed, efficiency, reliability, and customizability, Data Lakes offer a level of service that can’t be beat.

What PARSIQ is offering with Data Lakes is unique. But not only that! It’s also direly needed in the world of Web3, at this momentous time of expansion and adoption.

After all… how else will blockchain technology gain traction, unless it's founded upon data solutions that are accessible, customizable, scalable, and easy to use?

Be sure to stay tuned for the next post in this series, where we explain what types of Data Lakes can exist and who can use them!

Are you or your team ready to get started using Data Lakes?

Check out our docs here: https://network-docs.parsiq.net/

Or, contact our team by visiting: https://parsiq.net/lakes

About

PARSIQ is a full-suite data network for building the backend of all Web3 dApps & protocols. The Tsunami API provides blockchain protocols and their clients (e.g. protocol-oriented dApps) with real-time data and historical data querying abilities.

Website | Blog | Twitter | Telegram | Discord | Reddit | YouTube | Link3.to
