Image source: www.behance.net by pixego

Data structure in Ethereum | Episode 1: Recursive Length Prefix (RLP) Encoding/Decoding.

Phan Sơn Tự
Published in
8 min readJan 3, 2018

--

There are literally a lot of papers, blogs which explain how Ethereum organizes its data, but they all seem to be so disconnected and really hard to get an overall picture. In order to help you and confirm my understanding by the way, this series will explain one by one problem of data structure in Ethereum.

Discover and review best Blockchain softwares

I will break out this subject into 5 main episodes and 1 extra espisode (I will call it the 1+ espisode):

  1. Recursive Length Prefix (RLP) Encoding/Decoding. And 1+ for Hex Prefix Encoding.
  2. Trie — Radix and Merkel.
  3. Trie — Patricia.
  4. Examples.
  5. State Tree Pruning.

First of all, we are going to make sense about RLP, so what is the purpose of RLP in Ethereum?

Recursive Length Prefix (RLP)

In computer science, data serialization is necessary for many complex data forms to be stored or transmitted in only one formal format. Because of that, RLP is an encoding/decoding algorithm that helps Ethereum to serialize data and possible to reconstruct them quickly.

RLP Encoding

As Ethereum mentioned, the RLP encoding function takes in an item. An item is defined as follows

  • A string (will be converted to byte array) is an item
  • A list of items is an item

For example, all of objects below are items:

  • “dog”
  • []
  • [“dog”]
  • [[], “dog”, [“cat”], “ ”]

RLD encoding is defined as follow:

  1. If input is a single byte in the [0x00, 0x7f] range, so itself is RLP encoding.
  2. If input is non-value (uint(0), []byte{}, string(“”), empty pointer …), RLP encoding is 0x80. Notice that0x00 value byte is not non-value.
  3. If input is a special byte in [0x80, 0xff] range, RLP encoding will concatenates 0x81 with the byte, [0x81, the_byte].
  4. If input is a string with 2–55 bytes long, RLP encoding consists of a single byte with value 0x80 plus the length of the string in bytes and then array of hex value of string. It’s easy to see that the first byte is in [0x82, 0xb7] range.
    For example: “hello world” = [0x8b, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64] , because “hello world” has 11 bytes in dec or 0x0b in hex, so the first byte of RLP encoding is 0x80 + 0x0b = 0x8b , after that we concatenate the bytes of “hello word”.
  5. If input is a string with more than 55 bytes long, RLP encoding consists of 3 parts from the left to the right. The first part is a single byte with value 0xb7 plus the length in bytes of the second part. The second part is hex value of the length of the string. The last one is the string in bytes. The range of the first byte is [0xb8, 0xbf].
    For example: a string with 1024 “a” characters, so the encoding is “aaa…” = [0xb9, 0x04, 0x00, 0x61, 0x61, …]. As we can see, from the forth element of array 0x61 to the end is the string in bytes and this is the third part. The second part is 0x04, 0x00 and it is the length of the string 0x0400 = 1024. The first part is 0xb9 = 0xb7 + 0x02 with 0x02 being the length of the second part.
  6. If input is an empty array, RLP encoding is a single byte 0xc0.
  7. If input is a list with total payload in 0–55 bytes long, RLP encoding consists of a single byte with value 0xc0 plus the length of the list and then the concatenation of RLP encodings of the items in list. The range of the first byte is [0xc1, 0xf7].
    For example: [“hello”, “world”] = [0xcc, 0x85, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x85, 0x77, 0x6f, 0x72, 0x6c, 0x64]. In this RLP encoding, [0x85, 0x68, 0x65, 0x6c, 0x6c, 0x6f] is RLP encoding of “hello”, [0x85, 0x77, 0x6f, 0x72, 0x6c, 0x64] is RLP encoding of “world” and 0xcc = 0xc0 + 0x0c with 0x0c = 0x06 + 0x06 being the length of total payload.
  8. If input is a list with total payload more than 55 bytes long, RLP encoding includes 3 parts. The first one is a single byte with value 0xf7 plus the length in bytes of the second part. The second part is the length of total payload. The last part is the concatenation of RLP encodings of the items in list. The range of the first byte is [0xf8, 0xff] .
  9. One more thing, it is not mentioned in wiki Ethereum but in Golang source code. With boolean type, true = 0x01 and false = 0x80 .

RLP Decoding

RLP decoding is easier when you figure out how RLP encoding works. Actually, RLP decoding just receives encoded input and decodes the type, the length of that data.

  1. According to the first byte of input, RLP decoding analyses data the type, the length of the actual data and offset.
  2. According to the type and the offset of data, decode data correspondingly.
  3. Continue to decode the rest of the input if still possible.

RLP decoding is fully explained in wiki Ethereum and I don’t wanna waste our time on repeating something unnecessarily. I will put the references below.

When you don’t know something, just skip it by telling that it is easy. Hah.

Diving into RLP

Actually, Ethereum wiki has explained RLP extremely easy to understand, so I just reminded them by my writing style and what I expect in this article is about diving into RLP and getting deep understanding. Hmmm, again 😑.

One more thing, all of idea here is just my personal points of view and it may be misunderstood. So, you should recheck it, convince by yourself and then believe it if correct.

As what I read from wiki Ethereum, RLP just focuses on byte, string and list. Some extra types of data such as big number, boolean, pointer, slide… are based on which programming language we use to implement RLP.

According to the document, the first byte of encoded data decides which the type of data.

  • [0x00, 0x7f]: byte
  • [0x80, 0xbf]: string
  • [0xc0, 0xff]: list

The first question, Why don’t we use a fixed prefix instead of a dynamic prefix?

First of all, you can see in RLP, sometime the data needs some bytes to describe the type, the length of data, but sometime the data also shows its type, the length by itself.

The main reason is to save the memory space.

If we try to use a fixed prefix, we would add them in every single input that we wanna encode and in some situations, the main data is even shorter than the prefix.

You can tell that it will become simpler to read, but it’s only applied to human. In case of computer, it cannot differentiate which one more complex. Computer just runs the code with the care of computational complexity, and in this case, I’m pretty sure that 2 source codes are the same computational complexity.

Furthermore, if it is fixed, so how many bytes will we use? We are not sure. So it’s unnecessary.

The second question, Why did they choose 0x7f, 0x80, 0xbf, 0xc0 as checkpoints?

Just think about it in a sequence. We don’t wanna use any prefix with encoding a single byte, because it will be double (or triple, or more) the memory to store the encoded data if we use a fixed prefix as what I explained in the first question. So we need to determine a range in which, the byte is encoded by itself.

ASCII Table.

It may be not accidental when 0x7f was chosen. The ASCII has used 7 bits to encode 128 single characters and it corresponds with 0x7f . I believe that is the reason why [0x00, 0x7f] was chosen. However, what is RLP encoding of byte with 0x80 value?
The answer is we add a prefix, RLP_encode(0x80) = [0x81, 0x80] .

After that, about string and list, we have no choice and must use prefix. It’s distinct when they divided the rest of range into halves. [0x80, 0xbf] for string encoding and [0xc0, 0xff] for list encoding.

The third question, Why must we use a range to describe a type instead of the only one value of byte, I believe that one value of byte is enough?

Yup, one value of byte is enough to represent a type of data but we need to know how long the data is to get the offset. In order to do that, we must add more prefixes if we just use one value of byte to represent the type of data.

For now, you understand that we will get two problems. Firstly, assuming that we use 0x80 prefix for string and 0x81 for list, so we waste this byte on storing just 2 values while it still can do more. Secondly, it seems that we are trying to fix the prefix again (one byte for type, some bytes for the length of data) and as what I discussed on the first question, it may waste memory a lot.

We choose a range of bytes to not only encode the type of data but also the length of data.

The fourth question, What would we do when the length of data is out of range of prefix?

That’s an understandable question. We add more dynamic prefixes after the first byte to represent the length of data. For example, with the [0x80, 0xbf] range of string type, according to the strategy we have done above, we divide this range into halves, one ([0x80, 0xb7]) for string with length in range and one ([0xb8, 0xbf]) for string with the length out of range. The same goes with list.

--

--

Phan Sơn Tự
Coinmonks

A lucky guy was born in the Age of Cryptocurrency Boom