How to Query Balances for all Ethereum Addresses in BigQuery

Recently we announced beta availability of transaction traces (aka internal transactions) in our public Ethereum dataset in BigQuery: https://bigquery.cloud.google.com/table/bigquery-public-data:ethereum_blockchain.traces (updated daily). They allow querying all Ethereum addresses with their balances.

Below is the query that retrieves the top 10 addresses:

Run it in BigQuery

Here is the result, on the day of writing (as csv file):

Here is the high-level overview of the workflow we used to get this result:

  1. Export all traces from Parity node to BigQuery using Ethereum ETL.
  2. Enrich the traces with the status field which accounts for parent trace failures.
  3. Add genesis and daofork traces.
  4. Write an SQL to calculate the balances.

Along the way I had to solve a few interesting challenges, which I will uncover in the following sections.

Calculating Trace Statuses

As explained in the Parity’s Trace Module Wiki:

The traceAddress field of all returned traces, gives the exact location in the call trace [index in root, index in first CALL, index in second CALL, …].

i.e. if the trace is:

A
CALLs B
CALLs G
CALLs C
CALLs G

then it should look something like:

[ {A: []}, {B: [0]}, {G: [0, 0]}, {C: [1]}, {G: [1, 0]} ]

Parity also provides an error string for each trace in case it failed. Unfortunately it doesn’t account for the top-level trace failures. For example, if the call to C in the example above succeeded, but the top-level call to A failed due to say out-of-gas error, then the trace for the call to C will not return any error with it. This is problematic because any state changes in the call, including Ether transfers, are reverted. We want to filter out such nested calls that look like successes in the traces but actually were reverted due to top-level trace failures.

Your first thought may be to reconstruct the call tree from the flat traces returned by Parity, for each call traverse up the tree and see if any top-level calls failed, set the status for each call accordingly.

Another solution, which I ended up using, is to first query all failed traces, then find all their child traces and mark them as failures. This is easily done in BigQuery with the following SQL: https://github.com/blockchain-etl/ethereum-etl-airflow/blob/master/dags/resources/stages/enrich/sqls/traces.sql

The status field is 0 when the trace succeeded, and 1 when it failed due to any operation that can cause the call itself or any top-level call to revert. This is consistent with the status field in transaction receipts.

Handling Irregular State Changes in DAO Fork

If you simply query balances from the traces returned by Parity, these two addresses will come up at the top:

In reality, the above addresses have only 7 Wei each, on the day of writing. If you google them you will encounter this post about the DAO fork by Vitalik Buterin https://blog.ethereum.org/2016/07/20/hard-fork-completed/. Here is the summary:

Block 1920000 contained the execution of an irregular state change which transferred ~12 million ETH from the “Dark DAO” and “Whitehat DAO” contracts into the WithdrawDAO recovery contract.

The full specification of the changes is in this EIP https://github.com/ethereum/EIPs/blob/master/EIPS/eip-779.md.

These irregular state changes don’t appear in Parity traces, so I had to query them with this SQL: https://github.com/blockchain-etl/ethereum-etl-airflow/blob/master/sqls/daofork_traces.sql. The result is then hardcoded in the Ethereum ETL tool and ingested to the BigQuery dataset.

trace_type column for these irregular traces is daofork. Another special type of traces that I had to add to the table is genesis, which are the initial Ether allocations. They are not associated with any transactions and thus not returned in transactions or traces APIs.

Limitations

Johannes Pfeffer from alethio helped identify addresses for which the balances shown in the query results doesn’t match with the balances reported by the Ethereum nodes. Here is the list of the addresses:

0x4509008d923ef571fc1d29fd66d3135fa02f0b64
0xe5449e9a4f31c38d926b76f76571e5d0b143ef5d
0x0000000000000000000000000000000000000001
0x1f78775c8260df084f9a0e5fbdf06487b875ac4d
0x0000000000000000000000000000000000000003

The presumable root cause is this issue in Parity https://github.com/paritytech/parity-ethereum/issues/7765, which results in that the calls to precompiled contracts are missing from API results.

We will need to resync our Parity nodes, rerun the ETL, after which this issue will be resolved.

Challenge

It’s possible to query running Ether balances for every address on every date. Try to compose a query that will return the number of addresses with non-zero balance, and plot it over time. Post your SQL in the comments. (Solution)

Also read: