Indexing Ethereum — the data consistency problem

Casian Lacatusu
Alethio
Aug 22, 2019 · 4 min read

When it comes to working with the Ethereum blockchain, accessing data, let alone analyzing and understanding it, can be quite difficult. At Alethio, we built block explorers to provide an easier view into this data, but the basic data available through JSON-RPC was not enough.

We wanted the ability to search through the data and to access things you can't get directly from the node (like the list of transactions for an account). We wanted all this data to be available at a glance. That's why we've built data pipelines and more advanced explorers that make all this possible.
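To make that limitation concrete, here is a minimal Go sketch of what answering "show me this account's transactions" looks like when all you have is the node's JSON-RPC: you scan every block yourself and filter. (The endpoint URL, block range, and address are placeholders, and the error handling is deliberately thin.)

```go
// Sketch: building an account's transaction list without an index
// means scanning every block. Assumes a node at localhost:8545.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

type tx struct {
	Hash string `json:"hash"`
	From string `json:"from"`
	To   string `json:"to"`
}

type block struct {
	Number       string `json:"number"`
	Transactions []tx   `json:"transactions"`
}

// getBlock fetches one block, with full transaction objects, over JSON-RPC.
func getBlock(n uint64) (*block, error) {
	body, _ := json.Marshal(map[string]interface{}{
		"jsonrpc": "2.0", "id": 1,
		"method": "eth_getBlockByNumber",
		"params": []interface{}{fmt.Sprintf("0x%x", n), true},
	})
	resp, err := http.Post("http://localhost:8545", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Result *block `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Result, nil
}

func main() {
	account := strings.ToLower("0x0000000000000000000000000000000000000000") // placeholder
	// Scanning every block to answer one question is exactly the cost
	// a pre-built index avoids.
	for n := uint64(0); n < 100; n++ {
		b, err := getBlock(n)
		if err != nil || b == nil {
			continue
		}
		for _, t := range b.Transactions {
			if strings.ToLower(t.From) == account || strings.ToLower(t.To) == account {
				fmt.Println("found tx", t.Hash, "in block", b.Number)
			}
		}
	}
}
```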

But how do you know the data you are looking at is consistent? How do you know that when you browse an account's list of transactions, nothing is missing? And just to make things a little more interesting, how do you know this data doesn't contain stale blocks left behind by chain reorganizations? Glad you asked!

While building our data pipeline at Alethio we had to find answers to all of the above questions. We've committed ourselves to providing consistent, up-to-date, and reorg-proof data, and so we've gone down to the tiniest details to make sure we accomplish that.

So, how do we do it?

Introducing Coriolis — the pipeline supervisor

Coriolis is an internal (for now) audit system which continuously monitors the scraping processes and their results, detecting any anomalies that might occur and employing its self-healing abilities in order to keep us — the developers — sane. We're able to sleep like babies knowing that Coriolis has our backs.

The problems we’re trying to solve

Missing data
Even though we considered this theoretically impossible, we learned pretty early on that reality doesn't always match expectations. We were terrified (not really) to discover that on some days, when the stars aligned, our pipeline missed a block. Maybe the node crashed, or one of the scrapers was out chasing the postman, or something else entirely; what's certain is that it happened. This was the first issue that led to Coriolis.
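The fix starts with a simple invariant: indexed block numbers must be contiguous. Below is a minimal sketch of that gap check in Go (the function name and the in-memory slice are illustrative, not Alethio's actual code; in practice the indexed numbers would come from a database).

```go
// Sketch: detect blocks the pipeline missed by walking the sorted
// list of block numbers already indexed and finding the holes.
package main

import "fmt"

// findGaps returns every block number absent from a sorted slice of
// indexed block numbers, between its first and last element.
func findGaps(indexed []uint64) []uint64 {
	var gaps []uint64
	for i := 1; i < len(indexed); i++ {
		for n := indexed[i-1] + 1; n < indexed[i]; n++ {
			gaps = append(gaps, n)
		}
	}
	return gaps
}

func main() {
	// Block 8 never made it into the store: maybe the node crashed,
	// maybe a scraper was out chasing the postman.
	indexed := []uint64{5, 6, 7, 9, 10}
	fmt.Println("missing blocks:", findGaps(indexed)) // [8]
}
```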

Reorganized data
After we implemented reorg handling in our pipeline, we noticed that occasionally the correct block wouldn't be processed — most frequently because the node didn't send it, but sometimes because something else went wrong. Coriolis and its little Workers are there to ensure the hash chain is never broken.
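The invariant here is the hash chain itself: the parentHash stored for block N must equal the hash stored for block N-1, otherwise a reorg has left stale data behind. A minimal sketch of that check, with illustrative types rather than Alethio's actual schema:

```go
// Sketch: verify the stored hash chain is unbroken; any block whose
// parentHash doesn't match its predecessor's hash must be re-fetched.
package main

import "fmt"

type storedBlock struct {
	Number     uint64
	Hash       string
	ParentHash string
}

// brokenLinks returns the numbers of blocks whose parentHash does not
// match the hash stored for the previous block.
func brokenLinks(chain []storedBlock) []uint64 {
	var broken []uint64
	for i := 1; i < len(chain); i++ {
		if chain[i].ParentHash != chain[i-1].Hash {
			broken = append(broken, chain[i].Number)
		}
	}
	return broken
}

func main() {
	chain := []storedBlock{
		{Number: 100, Hash: "0xaaa", ParentHash: "0x999"},
		{Number: 101, Hash: "0xbbb", ParentHash: "0xaaa"},
		{Number: 102, Hash: "0xccc", ParentHash: "0xdead"}, // stale: points at a reorged-out block
	}
	fmt.Println("re-fetch blocks:", brokenLinks(chain)) // [102]
}
```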

Multiple data silos to check
Because we have a microservices-based architecture, each of our tools does one very specific job, which leads to multiple specialized silos of data — some databases hold balance data, others hold on-chain information, and so on. Coriolis, itself designed as a loosely coupled set of microservices communicating through gRPC, is able to watch all of them for anomalies.
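One way to keep many specialized silos checkable is to give every worker the same tiny surface. The Go sketch below is purely illustrative (the real Workers report over gRPC; these names are hypothetical): each silo implements a common checker interface, and the supervisor polls them all uniformly.

```go
// Sketch: one interface per silo so the supervisor can poll balances,
// on-chain data, etc. the same way. Hypothetical names throughout.
package main

import (
	"context"
	"fmt"
)

// Anomaly is one problem a worker found in its silo.
type Anomaly struct {
	Silo   string
	Detail string
}

// SiloChecker is the shape every per-silo worker implements.
type SiloChecker interface {
	Name() string
	Check(ctx context.Context) ([]Anomaly, error)
}

// superviseOnce runs every checker and collects whatever they report.
func superviseOnce(ctx context.Context, checkers []SiloChecker) []Anomaly {
	var all []Anomaly
	for _, c := range checkers {
		found, err := c.Check(ctx)
		if err != nil {
			all = append(all, Anomaly{c.Name(), "check failed: " + err.Error()})
			continue
		}
		all = append(all, found...)
	}
	return all
}

// fakeChecker stands in for a real worker that would talk gRPC.
type fakeChecker struct{ name string }

func (f fakeChecker) Name() string { return f.name }
func (f fakeChecker) Check(ctx context.Context) ([]Anomaly, error) {
	return []Anomaly{{f.name, "example anomaly"}}, nil
}

func main() {
	checkers := []SiloChecker{fakeChecker{"balances"}, fakeChecker{"on-chain"}}
	for _, a := range superviseOnce(context.Background(), checkers) {
		fmt.Println(a.Silo+":", a.Detail)
	}
}
```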

Self-healing mechanisms
Keeping track of all those issues and fixing them manually wouldn't be feasible for a human being, let alone a developer. Everybody knows we're lazy. Since we couldn't train our monkeys in time to do it, we had to instruct Coriolis to take action by itself.
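In sketch form, self-healing is just a bounded retry of a corrective action, with escalation as a last resort. The code below is a hypothetical illustration, not the actual Coriolis logic:

```go
// Sketch: map an anomaly to a corrective action and retry it a bounded
// number of times; if it still fails, hand the problem to alerting.
package main

import (
	"errors"
	"fmt"
)

// heal tries a corrective action up to maxTries times and reports
// whether the anomaly was fixed.
func heal(fix func() error, maxTries int) bool {
	for i := 0; i < maxTries; i++ {
		if err := fix(); err == nil {
			return true
		}
	}
	return false
}

func main() {
	attempts := 0
	// Example corrective action: re-scrape a missing block from the node.
	refetchBlock := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("node not ready")
		}
		return nil
	}
	if heal(refetchBlock, 5) {
		fmt.Println("anomaly fixed after", attempts, "attempts")
	} else {
		fmt.Println("escalate: needs a human") // handled by alerting, below
	}
}
```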

Alerting capabilities
For really out-of-the-ordinary events, usually when the self-healing mechanisms can't fix the problem or its effects are delayed (just like when you combine antibiotics with alcohol), we've instructed Coriolis to cry for help. Those are the cases where we have to intervene manually (although I can't remember when the last time was…).
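A simple way to express "the effects are delayed" is a grace period: an anomaly that self-healing hasn't cleared within some window gets escalated to a human. Again, a hypothetical sketch rather than Alethio's actual policy:

```go
// Sketch: escalate any anomaly still unresolved past a grace period.
package main

import (
	"fmt"
	"time"
)

type openAnomaly struct {
	Detail   string
	OpenedAt time.Time
	Resolved bool
}

// needsAlert flags anomalies still unresolved past the grace period.
func needsAlert(a openAnomaly, grace time.Duration, now time.Time) bool {
	return !a.Resolved && now.Sub(a.OpenedAt) > grace
}

func main() {
	now := time.Now()
	stuck := openAnomaly{"block 8 still missing", now.Add(-2 * time.Hour), false}
	if needsAlert(stuck, time.Hour, now) {
		// In practice this would page someone or post to a channel.
		fmt.Println("ALERT:", stuck.Detail)
	}
}
```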

Overview dashboard
Checking a database or a pile of logs just to see if everything works as intended wouldn't be much fun. Plus, it required effort. So we've built a cool dashboard with two main attractions: the list of currently connected workers and their status, and a filterable timeline of events. This way even our non-technical friends can understand what Coriolis does.

Coriolis dashboard

Looking into the future

We’re constantly maintaining Coriolis and adding new features in our mission of providing correct and consistent data to the people using all our products. It also represents one of the building blocks in our long term strategy for providing the proof that everything is exactly like it should be. No missing data, no incorrect data, no data that’s been tampered with. More details on that will be available in the coming months.

Since Coriolis is an integrated part of the pipeline, we're planning to include it in deployments on any kind of Ethereum-based network, be it public or private. We think it is a really great tool for checking on the pipeline and understanding its status.

We will be releasing more features soon, and we welcome your suggestions on what new features or enhancements you'd like to see. Head over to aleth.io for access today! Follow us on Twitter @AlethioEthStats! Interested in exploring an integration? Drop us a note at partnerships@aleth.io — we'd love to work with you!
