GSoC 2018 Stories 02: Parse Ethereum Raw Blocks

Umesh Prabushitha Jayasinghe
May 27, 2018

Image Courtesy : Jamie Skella’s post

This is my second article on my GSoC 2018 experience. It is almost the end of the second week since the coding period officially started. Plenty of interesting things happened during this period, and there was a lot to learn too. As discussed with the SCoRe organization mentors, the first phase of the Etherbeat project is to develop a C++ parser that extracts data from Ethereum’s LevelDB blockchain storage.

Resources and articles on Ethereum’s LevelDB storage are limited, since most of them use an RPC endpoint to extract data. I mostly had to read the Go-Ethereum code (the official and most popular Ethereum client, written in Go) to understand the underlying mechanism, and I got my concerns clarified through the ethereum-go forum. It took me several days to understand how to generate keys, work with bytes, and handle RLP encoding and decoding.

OK, now let’s start from the beginning. We’re not going to use RPC or IPC endpoints to work with the blockchain; instead, we interact directly with the blockchain files. The Ethereum blockchain is saved in a LevelDB database. Once connected to LevelDB, you can iterate through the keys and values to see how the data is really stored. At first I had no clue what this data meant, and I initially thought the entries were dummy values.

Actually, the keys follow a specific format.

Block Hash Key (10 bytes) =>
byte(‘h’) + 8 byte big endian notation block number + byte(‘n’)

Block Header Key (41 bytes) =>
byte(‘h’) + 8 byte big endian notation block number + 32 byte block hash

Block Body Key (41 bytes) =>
byte(‘b’) + 8 byte big endian notation block number + 32 byte block hash

For example, the byte string of the block hash key for block number 28 (non-printable bytes written as octal escapes) would be:
Block Hash Key = “h\000\000\000\000\000\000\000\034n”
Converted to its hex representation, the key looks like this:
Hex(Block Hash Key) = 68000000000000001c6e, where 68 is the hex value of ‘h’, 000000000000001c is 28 as an 8-byte big-endian number, and 6e is the hex value of ‘n’.

The raw byte string, not its hex representation, should be passed as the key to LevelDB.
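The three key layouts above can be sketched in a few lines of Python. This is only an illustration of the byte layout, not the project’s C++ code; the function names are made up for this example.

```python
import struct


def block_hash_key(number: int) -> bytes:
    """'h' + 8-byte big-endian block number + 'n' (10 bytes)."""
    return b"h" + struct.pack(">Q", number) + b"n"


def block_header_key(number: int, block_hash: bytes) -> bytes:
    """'h' + 8-byte big-endian block number + 32-byte block hash (41 bytes)."""
    return b"h" + struct.pack(">Q", number) + block_hash


def block_body_key(number: int, block_hash: bytes) -> bytes:
    """'b' + 8-byte big-endian block number + 32-byte block hash (41 bytes)."""
    return b"b" + struct.pack(">Q", number) + block_hash


if __name__ == "__main__":
    # Hex form of the block hash key for block 28, as in the example above.
    print(block_hash_key(28).hex())
```

`struct.pack(">Q", number)` produces the 8-byte big-endian form of the block number, so `block_hash_key(28).hex()` gives the same hex string as the example.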

In LevelDB, a specific value corresponds to each of these keys.

By using the Block Hash Key, we can get the block hash.
By using the Block Header Key, we can get the RLP encoded header of the block.
By using the Block Body Key, we can get the RLP encoded body (transactions and ommers) of the block.

Pseudocode to extract the data:

block_number = 28
hash_key = create_hash_key(block_number)
block_hash = leveldb_get(hash_key)
header_key = create_header_key(block_number, block_hash)
rlp_header = leveldb_get(header_key)
body_key = create_body_key(block_number, block_hash)
rlp_body = leveldb_get(body_key)

Once we obtain the RLP-encoded header and the RLP-encoded body, we need to decode them to find the positions of the meaningful data. RLP decoding gives you the offset and length of each item, so you have to slice the raw byte array accordingly.
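To make the “offset and length” idea concrete, here is a minimal Python sketch of reading RLP prefixes. It follows the standard RLP prefix rules, but it is not the parser from this project: the function names are invented for this example, and it does no bounds checking or validation of malformed input.

```python
from typing import List, Tuple

# (is_list, payload_offset, payload_length)
Item = Tuple[bool, int, int]


def rlp_header(buf: bytes, pos: int) -> Item:
    """Describe the RLP item that starts at buf[pos]."""
    prefix = buf[pos]
    if prefix < 0x80:      # a single byte below 0x80 is its own payload
        return False, pos, 1
    if prefix <= 0xB7:     # short string: 0 to 55 payload bytes
        return False, pos + 1, prefix - 0x80
    if prefix <= 0xBF:     # long string: the next n bytes hold the length
        n = prefix - 0xB7
        return False, pos + 1 + n, int.from_bytes(buf[pos + 1:pos + 1 + n], "big")
    if prefix <= 0xF7:     # short list: 0 to 55 payload bytes
        return True, pos + 1, prefix - 0xC0
    n = prefix - 0xF7      # long list: the next n bytes hold the length
    return True, pos + 1 + n, int.from_bytes(buf[pos + 1:pos + 1 + n], "big")


def rlp_children(buf: bytes, pos: int = 0) -> List[Item]:
    """Return the header of every direct child of the list at buf[pos]."""
    is_list, offset, length = rlp_header(buf, pos)
    if not is_list:
        raise ValueError("item is not an RLP list")
    children = []
    cursor, end = offset, offset + length
    while cursor < end:
        child = rlp_header(buf, cursor)
        children.append(child)
        cursor = child[1] + child[2]   # jump past this child's payload
    return children


if __name__ == "__main__":
    encoded = bytes.fromhex("c88363617483646f67")   # RLP of ["cat", "dog"]
    for _, offset, length in rlp_children(encoded):
        print(encoded[offset:offset + length])
```

Each child is described only by where its payload starts and how long it is, so the caller slices the original buffer instead of copying data around. A nested list (such as the list of transactions in a block body) can be walked by calling `rlp_children` again at the child’s own starting position.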

In the RLP-decoded block header you’ll find 15 elements (parentHash, sha3Uncles, beneficiary, stateRoot, transactionsRoot, receiptsRoot, logsBloom, difficulty, number, gasLimit, gasUsed, timestamp, extraData, mixHash, nonce). In the RLP-decoded block body, the first two elements are the list of transactions and the list of ommers.
A single transaction has 9 elements (nonce, gasPrice, gasLimit, to, value, init | data, v, r, s), but you won’t find the sender’s address there. Instead you’ll find the signature-related values (v, r, s), which you can use to generate the address.

Generating Sender Ethereum Address from a Transaction

Here’s how we can get the sender’s public key, and then the sender’s address, from a transaction. (Thanks to this Ref)

  1. Take v, r and s by RLP decoding the transaction
  2. Get the hash that was signed: the Keccak-256 hash of the RLP-encoded transaction without the signature values (for EIP-155 transactions the chain ID is part of the encoding)
  3. Use ECDSA with the secp256k1 curve to recover the public key from that hash and the v, r, s values
  4. Take the Keccak-256 hash of the public key
  5. Take the last 40 hex characters / 20 bytes of this Keccak-256 hash. In other words, drop the first 24 hex characters / 12 bytes. These 40 characters / 20 bytes are the address. When prefixed with 0x it becomes 42 characters long.

Note that Ethereum does not use the standardized SHA-3 (SHA3-256); it uses the original Keccak-256 for hashing, and the two produce different digests.
To recover the public key of the sender (the transaction signer), the Elliptic Curve Digital Signature Algorithm (ECDSA) with the secp256k1 curve is used.
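Steps 4 and 5 can be sketched in Python as follows. Python’s `hashlib` only ships the standardized SHA-3, so this sketch includes a small, unoptimized Keccak-256 written from the public Keccak description; it is an illustration rather than the code used in the project. The function names are made up for this example, and the public key is assumed to be the 64-byte uncompressed form (X followed by Y, without the leading 0x04 byte). Public key recovery (step 3) is not covered here.

```python
import hashlib

MASK64 = (1 << 64) - 1


def _rol(value: int, n: int) -> int:
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((value << n) | (value >> (64 - n))) & MASK64


def _keccak_f1600(lanes):
    """Keccak-f[1600] on 25 lanes stored as lanes[x + 5 * y]."""
    r = 1
    for _ in range(24):
        # theta
        c = [lanes[x] ^ lanes[x + 5] ^ lanes[x + 10] ^ lanes[x + 15] ^ lanes[x + 20]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [lanes[i] ^ d[i % 5] for i in range(25)]
        # rho and pi
        x, y = 1, 0
        current = lanes[1]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x + 5 * y] = lanes[x + 5 * y], _rol(current, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = lanes[5 * y:5 * y + 5]
            for x in range(5):
                lanes[x + 5 * y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota (round constant generated by a small LFSR)
        for j in range(7):
            r = ((r << 1) ^ ((r >> 7) * 0x71)) % 256
            if r & 2:
                lanes[0] ^= 1 << ((1 << j) - 1)
    return lanes


def keccak256(data: bytes, suffix: int = 0x01) -> bytes:
    """Keccak-256 (suffix 0x01). Passing suffix 0x06 gives SHA3-256 instead."""
    rate = 136  # bytes absorbed per permutation for a 256-bit digest
    padded = bytearray(data) + bytes([suffix])
    padded += bytes(-len(padded) % rate)
    padded[-1] |= 0x80
    lanes = [0] * 25
    for start in range(0, len(padded), rate):
        block = padded[start:start + rate]
        for i in range(rate // 8):
            lanes[i] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        lanes = _keccak_f1600(lanes)
    return b"".join(lane.to_bytes(8, "little") for lane in lanes[:4])


def address_from_public_key(public_key: bytes) -> str:
    """Steps 4 and 5: hash the 64-byte public key, keep the last 20 bytes."""
    return "0x" + keccak256(public_key)[-20:].hex()


# hashlib's sha3_256 uses the SHA-3 padding, so its digest differs from Keccak-256.
assert keccak256(b"") != hashlib.sha3_256(b"").digest()
```

The only difference between the two hash functions in this sketch is the padding byte (0x01 for Keccak-256, 0x06 for SHA3-256), which is why a library’s “sha3” function cannot simply be swapped in when computing Ethereum hashes or addresses.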

I’ll explain how to implement the above 5 steps in a future article. It took almost 2 weeks to implement everything I’ve explained in this article, for which I had to read research papers and several code bases.

Click here for the Previous Article
