Building a blockchain data ingestion service (Part 1)

Massimo Luraschi
SQD Blog
9 min read · Dec 11, 2022
The story of how we built a blockchain data ingestion service

This post explains how we developed the Archive service at Subsquid, responsible for ingesting on-chain data and serving it to ETL processes upon request. It covers the problems we observed with similar products and the technical design we pursued to solve them. The result is a service that can scale via replication and that enables much more efficient, and consequently faster, ETLs.

What is an indexing middleware?

As usual, it helps to start with a bit of glossary and a few definitions. In the context of blockchain, what’s described in this post is often referred to as an indexing middleware.

Let’s start by introducing the middleware term.

In short, middleware can be thought of as connective tissue for applications. The JDBC API for accessing a database is considered middleware; in web servers, authentication is often implemented as a middleware (Django has middlewares for all sorts of pre- and post-processing of web requests).

The concept of indexing should be easier to introduce.

It arguably derives from the index of a book, which lets readers quickly find the page for a given topic. In computer science, it is mostly known in the context of databases, where it describes auxiliary references to certain (indexed) columns that speed up data retrieval.
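To make the idea concrete, here is a tiny, purely illustrative sketch (not tied to any particular database): an index is just a precomputed lookup structure that replaces a full scan.

// Purely illustrative: an "index" is a precomputed lookup structure
type Row = { sender: string; amount: bigint }

const rows: Row[] = [
  { sender: "alice", amount: 10n },
  { sender: "bob", amount: 5n },
]

// Without an index: scan every row (O(n))
const scanned = rows.filter(r => r.sender === "alice")

// With an index: build it once, then look up directly
const bySender = new Map<string, Row[]>()
for (const r of rows) {
  const list = bySender.get(r.sender) ?? []
  list.push(r)
  bySender.set(r.sender, list)
}
const indexed = bySender.get("alice") ?? []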

So, in conclusion, an indexing middleware, in the context of blockchain, is an entity that sits between two applications and provides a better way of fetching blockchain data.

Why is it useful?

Without getting into the nitty-gritty of how Patricia-Merkle trees work, there are multiple reasons why an indexing middleware is, and will remain, necessary when dealing with blockchain data:

  • Every data storage solution has its pros and cons, and blockchain storage was designed to prove data integrity, not to serve web requests
  • Oftentimes, a web application’s backend is designed with the frontend in mind, or in conjunction with it. Smart Contracts are developed to serve their purpose and prove their execution, which means they are often not very “frontend-friendly”
  • Blockchain data is stored sequentially in blocks: to track an account’s transaction history, one would have to traverse the blocks, like flipping through a book’s pages (see the sketch after this list)
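To illustrate the last point, here is a rough sketch (the getBlock function and the block shape are hypothetical placeholders, not a real node API) of what reconstructing one account’s transfer history looks like when all you have is sequential blocks:

// Rough sketch of naive block-by-block traversal; getBlock() and the block
// shape are hypothetical placeholders, not a real RPC client
type NaiveTransfer = { from: string; to: string; amount: bigint }
type NaiveBlock = { height: number; transfers: NaiveTransfer[] }

async function transferHistory(
  account: string,
  head: number,
  getBlock: (height: number) => Promise<NaiveBlock>
): Promise<NaiveTransfer[]> {
  const history: NaiveTransfer[] = []
  // one round trip per block: fine for a handful, hopeless for millions
  for (let height = 0; height <= head; height++) {
    const block = await getBlock(height)
    history.push(...block.transfers.filter(t => t.from === account || t.to === account))
  }
  return history
}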

So for these (and likely many more) reasons, the need for indexing middlewares started to arise, in tandem with the growing complexity of smart contract use cases and the maturing of Decentralized Applications.

Issues with initial design

The issues mentioned so far started to surface at the dawn of smart contract blockchains; we are not exploring new territory here. The difference is that, for lack of better tools, developers mostly started building their own in-house indexing solutions. And some still do (more on why in the next paragraph).

At some point this became such a necessity that indexing frameworks started to appear. The problem is that they usually tackle the blockchain data indexing challenge by letting developers build a monolithic project that connects directly to a blockchain’s RPC node, sequentially requesting blocks and applying processing logic, as in a typical ETL pipeline. Most of the time this is done to build, for example, APIs with a data schema that is decoupled from smart contract data structures.

It works, it serves the purpose and allows us to build better and more nuanced DApps. But the developer experience is astonishingly poor.

Because of the poor performance of the direct RPC connection, syncing times are incredibly slow. And since the blockchain holds tons of historical data (the blocks already minted) and keeps growing, the indexer has a lot of catching up to do while new blocks are being written.

Developing such an ETL inevitably involves a trial-and-error phase, not to mention updates and additions of new features down the line. For these reasons, a developer will likely have to wait days (if not weeks) to see the effects of a data change. It’s worse than going back in time to programming in the ’70s, with cartridges and time-shares on the university servers.

A different approach

A data pipeline, divided into stages

When our CEO and co-founder approached indexing for Substrate, the absence of available tooling forced him to start from scratch. But, in doing so, he also analysed the flaws of the initial implementations, and the issues they caused.

The conclusion was that a typical data pipeline is segmented into different stages, and no one says they have to be executed by the same entity. As a matter of fact, traditionally, in most cases they are not.

Hence the idea of separating the ingestion and normalisation of on-chain data into its own shared service, which took the name of Archive.

The first stage of a blockchain data pipeline: ingestion

At this point, a quick introduction to Substrate is necessary, as it has its own unique design and terminology. This may not concern Ethereum (or EVM-chain) developers, but it’s useful for understanding the rest of the article (a rough sketch of how these pieces fit together follows the list):

  • Transactions are referred to as Extrinsics
  • Extrinsics are a request to execute a Runtime Call, which is the equivalent of a smart contract function
  • Calls can often trigger an Event, which signals their execution
  • The Runtime can have its own Storage, which can be used to store data in the blockchain and persist it between blocks
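As a rough mental model (illustrative TypeScript only, not the actual SDK or Archive types), the relationships look something like this:

// Illustrative only: a simplified mental model of the Substrate entities,
// not the actual types used by the SDK or the Archive
interface Call {
  name: string                  // e.g. "Balances.transfer"
  args: Record<string, unknown>
}

interface Extrinsic {           // roughly, a "transaction"
  hash: string
  call: Call                    // the Runtime Call it asks to execute
}

interface Event {
  name: string                  // e.g. "Balances.Transfer"
  args: Record<string, unknown>
  extrinsicHash?: string        // the Extrinsic whose execution emitted it, if any
}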

The concept of Archives translated into the first generation of the service: a series of servers (one for each Substrate chain we started ingesting: Polkadot, Kusama, Moonbeam, Astar, …), each made of two main components: an ingester and a GraphQL server.

The first component, as the name suggests, is in charge of ingesting the network’s data block by block. It also performs the first data transformation of the pipeline: it normalises the data, recognises the blockchain’s entities (Events, Extrinsics, Storage, …) and saves each piece of data under the correct category. It’s worth pointing out that, in this initial version, data is stored in a Postgres database.
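Conceptually (and only conceptually: this is a hand-wavy sketch with placeholder functions, not the actual ingester code), the loop looks something like this:

// Hand-wavy sketch of the ingestion loop; fetchBlock, normalise and db
// are placeholders, not the actual ingester implementation
type NormalisedBlock = {
  events: unknown[]
  extrinsics: unknown[]
  calls: unknown[]
}

async function ingest(
  fetchBlock: (height: number) => Promise<unknown>,
  normalise: (raw: unknown) => NormalisedBlock,
  db: { insert: (table: string, rows: unknown[]) => Promise<void> },
  from: number,
  to: number
): Promise<void> {
  for (let height = from; height <= to; height++) {
    const raw = await fetchBlock(height)                  // pull the raw block from the chain node
    const { events, extrinsics, calls } = normalise(raw)  // recognise and categorise the entities
    await db.insert("events", events)                     // each category goes to its own table
    await db.insert("extrinsics", extrinsics)
    await db.insert("calls", calls)
  }
}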

The second component, on the other hand, is simply responsible for providing an interface to clients in subsequent stages of the pipeline and serving their requests for blockchain data. GraphQL was a clear winner here: it keeps the design simple and light, with a single endpoint, and supports arbitrary, flexible filtering and querying, without having to continuously redesign and iteratively add new paths, as one would with a REST API. The Archive ingests the complete blockchain history, stores it in its database already categorised, and has it ready for any client’s requests.
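From a client’s perspective, the interaction looks roughly like the sketch below. The endpoint URL and the exact query fields are illustrative placeholders, not the real Archive schema; the point is the shape of it: one endpoint, one flexible query.

// Illustrative only: the URL and the query fields are placeholders,
// not the real Archive GraphQL schema
const query = `
  query {
    events(where: {name_eq: "Balances.Transfer"}, limit: 10) {
      name
      args
      block { height timestamp }
    }
  }
`

async function fetchEvents(archiveUrl: string): Promise<unknown> {
  const response = await fetch(archiveUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  })
  return (await response.json()).data
}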

This created the foundation for a set of libraries to interface with the service, and the Subsquid SDK was born. This framework allowed developers to build ETL projects with NodeJS.

The later stages of a blockchain data pipeline: transformation

Typically, when one of these processes is launched, the blockchain has already generated some blocks (in many cases a lot, depending on how long the network has been running), so the ETL has to “catch up”. It needs to synchronise.

To speed this up, we set out to design a publish-subscribe model. Using the SDK libraries, developers could instantiate the SubstrateProcessor class and subscribe to certain Events or Extrinsics, which generated queries to the Archive requesting only the specified data. Thanks to the ingestion and normalisation phase, the Archive did not have to source information elsewhere. And a GraphQL server is much better suited than RPC nodes for serving these requests. The performance gain was massive.

// this code is from an outdated version of the SDK, don't use it
// (imports added for context; package names as of the old SDK)
import { SubstrateProcessor } from "@subsquid/substrate-processor";
import { lookupArchive } from "@subsquid/archive-registry";

const processor = new SubstrateProcessor("kusama_balances");
processor.setBatchSize(500);
processor.setDataSource({
  archive: lookupArchive("kusama")[0].url,
  chain: "wss://kusama-rpc.polkadot.io",
});
// handleBalanceEvent is a developer-defined event handler
processor.addEventHandler("balances.Transfer", handleBalanceEvent);
processor.run();

Finally, thanks to the automatically generated interfaces, developers can use the power of TypeScript to extract blockchain information with the right types for the data stored in Events or Extrinsics. The cherry on top is resilience to Runtime upgrades: an Event can be encoded in different ways at different points in the blockchain’s history, because a Runtime upgrade changed its data structure. Our SDK takes this into account, recognising the Runtime version and returning Event data with the correct structure and typings.

// this code is from an outdated version of the SDK, don't use it
// Generated by squid typegen; each isVxxxx/asVxxxx pair corresponds to one
// historical Runtime version of the event.
// (imports added for context, in the layout the generated files used at the time)
import assert from 'assert'
import { EventContext } from './support'
import * as v9130 from './v9130'

export class BalancesTransferEvent {
  constructor(private ctx: EventContext) {
    assert(this.ctx.event.name === 'balances.Transfer')
  }

  get isV1020(): boolean {
    return this.ctx._chain.getEventHash('balances.Transfer') === '72e6f0d399a72f77551d560f52df25d757e0643d0192b3bc837cbd91b6f36b27'
  }

  get asV1020(): [Uint8Array, Uint8Array, bigint, bigint] {
    assert(this.isV1020)
    return this.ctx._chain.decodeEvent(this.ctx.event)
  }

  get isV1050(): boolean {
    return this.ctx._chain.getEventHash('balances.Transfer') === 'dad2bcdca357505fa3c7832085d0db53ce6f902bd9f5b52823ee8791d351872c'
  }

  get asV1050(): [Uint8Array, Uint8Array, bigint] {
    assert(this.isV1050)
    return this.ctx._chain.decodeEvent(this.ctx.event)
  }

  get isV9130(): boolean {
    return this.ctx._chain.getEventHash('balances.Transfer') === '0ffdf35c495114c2d42a8bf6c241483fd5334ca0198662e14480ad040f1e3a66'
  }

  get asV9130(): {from: v9130.AccountId32, to: v9130.AccountId32, amount: bigint} {
    assert(this.isV9130)
    return this.ctx._chain.decodeEvent(this.ctx.event)
  }
}
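Tying the two snippets together, the handleBalanceEvent handler registered earlier could use the generated class roughly like this. This is a hedged reconstruction in the style of the same outdated SDK, shown only to illustrate how the version checks are meant to be used:

// Hedged reconstruction, old-SDK style: pick the decoder matching the
// Runtime version the event was emitted under
async function handleBalanceEvent(ctx: EventContext): Promise<void> {
  const event = new BalancesTransferEvent(ctx)
  if (event.isV1020) {
    const [from, to, amount] = event.asV1020   // oldest layout: positional args (plus a fee field)
    // ...build and persist a Transfer entity from (from, to, amount)
  } else if (event.isV1050) {
    const [from, to, amount] = event.asV1050   // the fee field was dropped by a Runtime upgrade
    // ...
  } else if (event.isV9130) {
    const { from, to, amount } = event.asV9130 // newest layout: named args
    // ...
  }
}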

This modular architecture proved successful when the first projects started adopting it and we received positive feedback on how much development time it saved, simply by shortening the build-and-test cycle, thanks to shorter sync time.

Furthermore, the Runtime upgrade resilience saved a lot of headaches, and the strong typing protected developers from issues stemming from bad type conversions. Not to mention the SDK’s focus on flexibility, which was a conscious choice from the beginning.

Second generation: FireSquid 🔥🦑

While our growth team was busy building adoption, our core development team was not sitting idle. A plan was already in place to secure the scaling of the architecture and future-proof the framework. The speed gains from our modular architecture were, in fact, just a start, and ideas on how to improve it had already started circulating while the first generation was being developed.

The second generation tackled two aspects, mainly:

  • Storage solution: Postgres is a great general-purpose database, but it cannot keep up with the ever-increasing growth of blockchains and the sheer data volume of big L1s such as Ethereum
  • Single entity processing: the SubstrateProcessor makes batch requests to the Archive to save on network overhead, but it then processes the received items one by one

So, taking the opportunity to rework these two aspects, the team started to redesign the framework and infrastructure from the ground up, incorporating all the feedback collected over the previous months. The result was a major release of pretty much every library in our SDK, and a new generation of Archives. In fact, the changes to the core libraries were breaking, and we could not upgrade the Archives seamlessly while maintaining legacy compatibility.

We had to reach out to all users, alert them of the changes, write migration guides and, in the meantime, keep the previous generation of Archives running to support them. The interfaces for developers changed only slightly, but the main difficulty was adjusting to a paradigm shift in development:

Developers had to start thinking vectorially

With the new SubstrateBatchProcessor class, items returned by the Archive were processed in batches, not one by one as before. For developers, this meant transitioning from siloed processing of one item type at a time to processing a bundle of non-homogeneous items, keeping them in memory, and waiting until all items in a bundle were processed before persisting them in the database.

// `processor` here is a configured SubstrateBatchProcessor (as described above);
// imports are indicative of a typical squid project layout
import { In } from "typeorm"
import { TypeormDatabase } from "@subsquid/typeorm-store"
import { Account, Transfer } from "../model"

processor.run(new TypeormDatabase(), async ctx => {
  // extract all items from the bundle
  // (in this case they are homogeneous, but they might not be)
  let transfersData = getTransfers(ctx)
  let accountIds = new Set<string>()
  for (let t of transfersData) {
    accountIds.add(t.from)
    accountIds.add(t.to)
  }
  let accounts = await ctx.store.findBy(Account, {id: In([...accountIds])}).then(accounts => {
    return new Map(accounts.map(a => [a.id, a]))
  })
  let transfers: Transfer[] = []
  for (let t of transfersData) {
    let {id, blockNumber, timestamp, extrinsicHash, amount, fee} = t
    let from = getAccount(accounts, t.from)
    let to = getAccount(accounts, t.to)
    transfers.push(new Transfer({
      id,
      blockNumber,
      timestamp,
      extrinsicHash,
      from,
      to,
      amount,
      fee
    }))
  }
  // after all the entities in the bundle have been processed and prepared, save them in batch
  await ctx.store.save(Array.from(accounts.values()))
  await ctx.store.insert(transfers)
})
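The getTransfers and getAccount helpers used above are project-level code and were omitted from the snippet. A plausible shape for getAccount (an assumption, not the original code) is a small get-or-create lookup against the in-memory map:

// Assumption: a plausible getAccount helper, not the original code.
// Account is the project's generated TypeORM entity (e.g. imported from "../model")
function getAccount(accounts: Map<string, Account>, id: string): Account {
  let account = accounts.get(id)
  if (account == null) {
    account = new Account({ id })
    accounts.set(id, account)
  }
  return account
}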

In exchange, syncing speed increased by orders of magnitude, up to 50,000 blocks per second.

You can learn more about Subsquid, the Archive service, and how to use the SDK by visiting our docs.

The next article will discuss the follow-up evolutionary stage, and the plans for improving upon it, which are already in motion.
