PoCo Series #9 — Indexing smart contract activity. It’s harder than you think!

Published in iExec, Jul 1, 2020

Today, we’ll discuss how iExec indexes the activity of its smart contracts, and why this process is both complex and very important. We’ll also explain why we use TheGraph to do so. For anyone who missed the recent news from TheGraph, you can read more here.

💡 Want to learn more about iExec? Check out iExec Academy!

iExec Academy aggregates all content related to the project. You’ll find articles, tech documentation, videos, interactive demos, and much more! Whether you are a beginner or an expert, a developer or crypto-enthusiast, you’ll find what you are looking for on iExec Academy!

📚➡️ https://academy.iex.ec

In the previous articles of this series, we extensively discussed iExec’s use of smart contracts to enforce the PoCo protocol in a trustless and decentralized manner. Their design impacts the entire platform, as they set the rules on what can and cannot be done, by whom, and what effect these actions have on the participants’ balance and reputation.

Previously, the series discussed the latest developments that will improve the modularity and future-proofing of the iExec platform. In this article, we will cover how smart contracts handle data, and how their activity can be indexed.

For an app to work well, and to integrate within a hybrid on-chain/off-chain ecosystem, it is not enough to just implement your logic in smart contracts. You also need to understand how these smart contracts store data and trigger events.

Smart Contract Auditability

Smart contracts are often presented as “auditable by design”. What this usually means is that both the code and all transactions are publicly available, so anyone can observe the state of the contract and understand what everybody else has been doing. While this is technically true, the code and transactions can be difficult to follow and process. Contracts may include accessors that give read-access to some of the data, but even if there is no accessor, it is still possible to read anything that is part of a smart contract’s storage. This explains why a smart contract cannot securely hold a private key or produce a cryptographic signature. To access the state of a smart contract, you can either look it up in an Ethereum node or rebuild it by synchronizing your own node, using the history of all past transactions.

However, the state of a smart contract is not everything. Past states (the balance of <xxx> at block height <yyy>) and past events (Transfer from <xxx> to <yyy> of value <zzz> in transaction <www>) also provide crucial information about a smart contract’s activity. Storage, memory, calldata, logdata … there are many memory spaces that smart contracts have read/write access to. Before going into the many challenges of smart contract indexation, we first have to go into the mechanisms that power the EVM.

How is data stored on-chain, how can I access it?

People often think of the blockchain as a simple distributed ledger, composed of blocks that people can submit transactions to. However, an Ethereum (EVM) blockchain is actually a lot more than this. First, we have to differentiate the following:

Calldata is the name given to the transaction’s data field. It is mostly used when deploying new smart contracts or submitting transactions to existing smart contracts. It is the input of smart contract calls.
Memory is the name given to all the temporary variables used during the execution of smart contracts. It is only used during the resolution of a call, and it is neither stored nor accessible after the call is complete.
Storage is the name of the long-term memory space that each smart contract can read from and write to. This is the most expensive (in terms of gas) memory space to interact with. It contains all the state/variables that smart contracts have to keep between transactions. A good example would be ERC20’s balances.
Logdata is the name given to the memory space where event logs are stored.

Calldata and logdata are included in the blocks, which guarantees their long-term availability. Because they cannot be read or overwritten by later transactions, they can be stored on slow, cheap devices such as hard drives. Storage, on the other hand, can be read from and written to by future transactions. This means nodes have to keep it in fast (RAM/SSD) storage, which explains the high cost of writing to storage, and the refund one can get when cleaning it up.
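
To make these four spaces concrete, here is a toy Python model of an ERC20-style transfer. It is only a sketch for illustration, not EVM code: the `ToyToken` class and the dictionary layout are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyToken:
    """Toy stand-in for an ERC20-like contract (illustrative, not EVM code)."""

    # Storage: persists between calls, readable and writable by later calls.
    storage: dict = field(default_factory=dict)
    # Logdata: append-only record of events, never read back by contracts.
    logs: list = field(default_factory=list)

    def transfer(self, calldata: dict) -> None:
        # Calldata: the read-only input of the call.
        sender, to, value = calldata["from"], calldata["to"], calldata["value"]
        # Memory: temporary variables that vanish when the call returns.
        sender_balance = self.storage.get(sender, 0)
        if sender_balance < value:
            raise ValueError("insufficient balance")
        # Only the resulting balances are written to storage...
        self.storage[sender] = sender_balance - value
        self.storage[to] = self.storage.get(to, 0) + value
        # ...while the fact that a transfer happened goes to the logs.
        self.logs.append({"event": "Transfer", "from": sender, "to": to, "value": value})


token = ToyToken(storage={"alice": 100})
token.transfer({"from": "alice", "to": "bob", "value": 30})
print(token.storage)  # the current state only
print(token.logs)     # the history of what happened
```

After the call, storage only knows the current balances, while the log keeps the trace of the transfer itself.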

Much more could be said about the cost of storage, including the difference between clean and dirty storage slots. I would encourage any (aspiring) smart contract developers out there to learn about this, as well as the security consequences of gas cost changes.

What are the challenges?

As previously mentioned, storage is very expensive, which means that it should only be used for values that either have to be accessed by further calls or should be easily accessible through accessors. Putting everything in storage is a bad (and very expensive) option.

To come back to the example of ERC20 tokens: here, you want the balances to be in storage but not the history of all transfers.

Transfers are represented as events (in logdata). They are much less expensive but are more difficult to retrieve. You need to know which transaction/block to look for, and if they are old, some nodes might not hold them anymore (a full node is required).
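
A rough back-of-the-envelope calculation shows how large the price gap between events and storage is. The gas constants below are taken from the Ethereum fee schedule (Yellow Paper, Appendix G) and can change with hard forks, so treat the figures as indicative. The three-slot storage layout is an assumption made for the comparison, and memory expansion costs are ignored.

```python
# Gas constants from the Ethereum fee schedule (Yellow Paper, Appendix G).
# They can change with hard forks, so the results are indicative only.
G_SSET = 20_000     # writing a non-zero value to an empty storage slot
G_LOG = 375         # base cost of a LOG operation
G_LOGTOPIC = 375    # cost per topic
G_LOGDATA = 8       # cost per byte of log data


def storage_cost(new_slots: int) -> int:
    """Gas needed to fill `new_slots` previously empty storage slots."""
    return new_slots * G_SSET


def log_cost(topics: int, data_bytes: int) -> int:
    """Gas needed to emit one event (memory expansion ignored)."""
    return G_LOG + topics * G_LOGTOPIC + data_bytes * G_LOGDATA


# An ERC20 Transfer event: 3 topics (event signature, from, to)
# and 32 bytes of data (the value).
event_gas = log_cost(topics=3, data_bytes=32)
# Assumption: the same record kept in storage takes 3 fresh slots.
storage_gas = storage_cost(new_slots=3)

print(event_gas, storage_gas)  # 1756 60000
```

Under these assumptions, recording a transfer as an event is more than thirty times cheaper than keeping the same record in storage.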

Past states are even more difficult to retrieve. The transactions (calldata) and events (logdata) of old blocks are not enough: you need to know the storage state as it was at a certain block height. This is only stored by archive nodes, which are expensive to run due to the large amount of storage they need.
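
The difference between a node that keeps every past state and one that prunes old states can be sketched with a toy model. This is a deliberate simplification: real clients store state far more compactly than full copies, and the `ToyNode` class and its `keep_last` parameter are hypothetical names introduced for this example.

```python
import copy


class ToyNode:
    """Keeps one storage snapshot per block; keep_last=None mimics an archive node."""

    def __init__(self, keep_last=None):
        self.keep_last = keep_last
        self.snapshots = {}  # block height -> copy of the storage at that height

    def commit(self, height: int, storage: dict) -> None:
        self.snapshots[height] = copy.deepcopy(storage)
        if self.keep_last is not None:
            # Pruning: drop every snapshot older than the last `keep_last` blocks.
            stale = [h for h in self.snapshots if h <= height - self.keep_last]
            for old in stale:
                del self.snapshots[old]

    def balance_at(self, account: str, height: int) -> int:
        if height not in self.snapshots:
            raise LookupError(f"state at block {height} was pruned")
        return self.snapshots[height].get(account, 0)


archive, pruned = ToyNode(), ToyNode(keep_last=2)
storage = {"alice": 100}
for height in range(1, 6):
    storage["alice"] -= 10  # one transfer out of alice's account per block
    for node in (archive, pruned):
        node.commit(height, storage)

print(archive.balance_at("alice", 2))  # 80
print(sorted(pruned.snapshots))        # [4, 5]
```

The archive node can answer for any height, at the price of storing everything. The pruning node stays small, but asking it for the balance at block 2 raises a `LookupError`.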

Rather than constantly looking for past events or past states, many dApps prefer building ad-hoc, centralized databases that are fed from on-chain activity and, in turn, provide structured data to front-ends. The underlying smart contracts are still trustless, but auditing them requires either very technical skills or trusting a centralized source.

iExec is using this pattern to feed frontends like the iExec Explorer and the marketplace.

iExec’s explorer, displaying blockchain data indexed by the existing ‘iExec watcher’.

Indexing blockchain data into a structured database may seem simple at first glance, but it is a very complex endeavor!

We usually see a blockchain as a linear data structure, where new blocks are added to an existing chain of blocks. To index that, you just need to build a transition function that processes a transaction, run it on every transaction in every new block, and be done with it. Right?
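
In its naive form, that idea fits in a few lines. The sketch below folds Transfer events into a balances table, block after block. The block and event layout is an assumption made for the example, and it does not yet deal with any of the issues discussed next.

```python
from collections import defaultdict

ZERO = "0x0"  # minting is conventionally a Transfer from the zero address


def apply_event(db: dict, event: dict) -> None:
    """Transition function: fold one Transfer event into the balances table."""
    if event["from"] != ZERO:
        db[event["from"]] -= event["value"]
    db[event["to"]] += event["value"]


def index_chain(blocks: list) -> dict:
    """Run the transition function on every event of every block, in order."""
    db = defaultdict(int)
    for block in blocks:
        for event in block["events"]:
            apply_event(db, event)
    return dict(db)


blocks = [
    {"number": 1, "events": [{"from": ZERO, "to": "alice", "value": 100}]},
    {"number": 2, "events": [{"from": "alice", "to": "bob", "value": 30}]},
]
print(index_chain(blocks))  # {'alice': 70, 'bob': 30}
```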

The first issue you might have is that this transition function might need to read storage to make sense of some calldata/logdata. Therefore, if you want to rebuild the database in the future (for example after having improved your database schema and your transition function) you will have to read blockchain storage in the past. This requires an archive node, which is easy to configure but requires a lot of resources to run, and a very long time to synchronize.

Another issue is that of ‘blockchain reorgs’!

A blockchain is not quite the linear structure people may imagine. It becomes linear once blocks have received a certain number of confirmations, but the latest blocks you receive might end up not being part of the chain. Reorgs happen, blocks get dropped, and the database you are using to store blockchain activity has to react to that. This means reverting your transition function, or reloading a previous state and restarting from that point.
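
One possible way to react to a reorg is to keep a snapshot of the database before each block and roll back until the incoming block’s parent matches the current head. The sketch below illustrates that approach under simplifying assumptions: blocks are plain dictionaries with hypothetical `hash`, `parent` and `deltas` fields, and the history is unbounded, whereas a real indexer would only keep the last few blocks. It is not how iExec’s watcher or TheGraph is implemented.

```python
import copy


class ReorgAwareIndexer:
    def __init__(self):
        self.db = {}
        # Stack of (block hash, snapshot of db taken before applying that block).
        self.history = []

    def head(self) -> str:
        return self.history[-1][0] if self.history else "genesis"

    def on_block(self, block: dict) -> None:
        # Roll back until the new block's parent is our current head.
        while self.head() != block["parent"]:
            if not self.history:
                raise RuntimeError("unknown parent: a full resync is needed")
            _, snapshot = self.history.pop()
            self.db = snapshot
        # Remember the state before this block, then apply its changes.
        self.history.append((block["hash"], copy.deepcopy(self.db)))
        for account, delta in block["deltas"].items():
            self.db[account] = self.db.get(account, 0) + delta


indexer = ReorgAwareIndexer()
indexer.on_block({"hash": "a1", "parent": "genesis", "deltas": {"alice": 100}})
indexer.on_block({"hash": "b2", "parent": "a1", "deltas": {"alice": -30, "bob": 30}})
# A competing block replaces b2: same parent, different content.
indexer.on_block({"hash": "c2", "parent": "a1", "deltas": {"alice": -50, "carol": 50}})
print(indexer.db)  # {'alice': 50, 'carol': 50}
```

The transfer to bob disappears from the database together with the dropped block, which is exactly what a front-end reading this database should see.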

This type of event is actually quite common, particularly on some test networks. Ignoring it can corrupt your entire database, which would then need an archive node (and a long time) to resync. While uncle blocks and blockchain reorgs are not such a big issue on mainnet, they are scary on networks such as Goerli (which has been unstable and has seen big reorgs). Overall, running your stack on an unstable chain is a good stress test. If you can’t manage the reorgs, don’t blame the blockchain, blame your stack for not dealing correctly with them!

What if I don’t want or don’t know how to build this tooling?

As mentioned, many dApps experience the same issue. While many started indexing on-chain activity through dedicated tools, there are now off-the-shelf solutions that will do everything for you!

TheGraph is one of these solutions.

If I’m presenting TheGraph in an iExec article, it’s because I think it’s a great tool that addresses a real issue that iExec, and many other projects, experience. We are aware that the ENS app also uses it, and for good reason:

TheGraph separates the logic. You don’t have to understand and deal with events such as block reorgs. TheGraph does that for you, so you can focus on your application’s logic.
It’s easy to build your database structure (using the GraphQL format) and the handlers that will process blockchain events and update the database accordingly. Together, these constitute a subgraph (a view for one app that lives within TheGraph’s ecosystem).
Database schema and mapping for each subgraph are hosted on IPFS, so anyone can start their own database and ingestor node, making the solution decentralized.
You can either use their hosted services or run your own nodes, and running your own node is easy (as long as you have access to an archive node).

These features make TheGraph a perfect fit for iExec. It allowed us to express our smart contract logic in clear and concise modules, make them available to everyone to maintain the highest level of decentralization, and deploy our own infrastructure to serve our frontends. This last point is particularly important, as we want to make sure we can run our entire stack with no strings attached to any other project. It also means we can use the TheGraph toolkit not only on the public Ethereum networks but also on private sidechains.

What does iExec’s TheGraph infrastructure look like?

TheGraph provides sources, executables and docker images for their software, making it very easy to run your own node. Yet, running it at scale is not as simple as one would think. After discussing their infrastructure with them, we came up with a design that fits iExec’s needs.

Security is key at iExec. We don’t want to expose our offices to any DoS, so we host all public services on cloud infrastructure. This includes our website, front ends (marketplace, explorer, …), and services like the iExec ‘watcher’. We also rely on Infura’s archive nodes.

We wanted to keep this design for our subgraphs, but we also wanted to index events on the sidechain, which is hosted on secured machines. These nodes are protected by firewalls that only a few whitelisted IPs, including our offices, can go through. This is why we run an archive node and a TheGraph ingestor locally, pushing structured data to a public database. Meanwhile, we run an ingestor for the public networks on a machine that is close (low latency) to the database. A publicly available graphnode answers all queries from users, regardless of the blockchain they are asking about.

A (simplified) overview of the iExec and TheGraph hosted services infrastructures.

This design allows us to feed the database with data from both the public blockchains (mainnet and testnets) and our sidechain, with no modification to our firewall policy and no inbound connection to our offices. It also means that if one of the ingestors fails, the other can continue its indexing work, and the database is still accessible through the public API. The database is backed up regularly by an external service (not shown on the figure).

At the same time, the subgraphs for the public blockchains (mainnet and testnets) are also available on TheGraph’s hosted service.

On top of all that, a load balancer provides a unified entry point to both our infrastructure and TheGraph’s hosted service. This provides redundancy for the subgraphs indexing the public blockchains, as they are indexed on both sides. The iExec sidechain subgraph is the only one not replicated, as TheGraph has no support for our sidechain.
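
The redundancy argument can be illustrated with a small failover routine: try one deployment, and fall back to the other if it does not answer. This is only a client-side sketch of the idea, not the configuration of iExec’s load balancer. The URLs are placeholders, and `fake_send` is a stand-in transport invented for the example.

```python
from typing import Callable, Sequence


def query_with_fallback(endpoints: Sequence[str], query: str,
                        send: Callable[[str, str], dict]) -> dict:
    """Try each endpoint in order and return the first successful answer."""
    errors = []
    for url in endpoints:
        try:
            return send(url, query)
        except ConnectionError as exc:
            errors.append(f"{url}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))


def fake_send(url: str, query: str) -> dict:
    """Stand-in transport for the example: the first deployment is 'down'."""
    if "primary" in url:
        raise ConnectionError("timeout")
    return {"served_by": url, "query": query}


result = query_with_fallback(
    ["https://primary.example.invalid", "https://secondary.example.invalid"],
    "{ placeholder }",
    fake_send,
)
print(result["served_by"])  # https://secondary.example.invalid
```

A subgraph indexed on one side only, like the sidechain one, has a single entry in that list and therefore no fallback.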

Our TheGraph infrastructure monitoring tool, https://graphnode-monitoring.research.iex.ec/

The interfaces to query the V3 subgraph for mainnet are as follows:

iExec PoCo V3 on thegraph.com (hosted service)
iExec PoCo V3 on thegraph.iex.ec (iExec deployment)
iExec PoCo V3 on thegraph.redirect.iex.ec (load balanced)

We also have subgraphs for Kovan, Goerli, Bellecour and Viviani, as well as for the upcoming V5.

What is the point of all this?

Transitioning from the iExec ‘watcher’ to TheGraph will take time. The querying interface is different, and we are still evaluating the stability of our TheGraph deployment. However, this transition could have many benefits:

Expanding the database to support additional mechanisms (for example when adding new modules in V5) is a much smoother process using TheGraph than with our dedicated solution. It can also be done by the smart contract developers, reducing the need for the front end developers to understand the inner specificities of the smart contracts.
The GraphQL interface is easy to work with, which helps when developing analysis and visualization tools.
The GraphQL interface is publicly available, meaning anyone can fetch the data their app might need.
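
As an illustration of how little is needed to query such an interface, here is a minimal sketch that builds a GraphQL request with Python’s standard library. The endpoint URL is a placeholder, and the `deals` entity and its `id` field are hypothetical names: check the schema of the subgraph you target for the real ones.

```python
import json
import urllib.request

# Placeholder URL: substitute the query endpoint of the subgraph you target.
ENDPOINT = "https://example.invalid/subgraphs/name/placeholder"

# Hypothetical entity and field names, used for illustration only.
QUERY = """
query LatestDeals($first: Int!) {
  deals(first: $first) {
    id
  }
}
"""


def build_request(endpoint: str, query: str, variables: dict) -> urllib.request.Request:
    """Build the JSON POST request conventionally used for GraphQL over HTTP."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(ENDPOINT, QUERY, {"first": 5})
print(request.get_method(), request.full_url)
# To actually send it: json.load(urllib.request.urlopen(request))
```

The request is only built here, not sent, so the example runs without network access.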

We believe having access to detailed metrics on the platform activity is necessary for some dApp developers that want an in-depth integration with iExec. As such, this service will complement the SDK, explorer and marketplace that are already available.

Thanks for reading! Interested in following iExec?
V5 will be released this month (July)!

This year, the iExec V5 milestone will be split into two separate releases, both announcing new tech and business developments. The July release will bring the DeFi tools, while some details on the future roadmap will be shared in the September release. We’ve been keeping a lot of work under the radar in 2020, which should translate into some pretty exciting announcements!

Follow the project and keep an eye out for the features and news associated with both milestones:

Website • Slack • Telegram • Twitter • YouTube • GitHub • Technical Documentation