2D Reed Solomon Encoded Merkle Tree Construction

Published in

Xord

6 min readNov 29, 2022

Introduction:

In our previous article Merkle Tree construction we had briefly discussed the formation of the Merkle Tree using the Reed Solomon Code. In this article we will look into the depth of how a 2k by 2k matrix is constructed in light clients.

Reed Solomon Erasure Coding:

In order to form a merkle tree using the reed solomon coding we first decide the size of our Galois Field that is 16 bit. The size plays an important role in coding it helps us to form the generator function which then results in the parity added data. For a detailed description about Reed Solomon code you can read our article here.

2D Reed Solomon Encoded Merkle Tree:

The roots of Namespace Merkle Tree is divided in the four quadrant where Q0 has the original data Q1 has parity data of rows of Q0 and Q2 has the parity data for column of Q0 so the question that arises in our mind is how are we forming it in 2k by 2k dimension there is no explanation for it yet. So the answer is a simple trick of maths. Let’s discuss it below.

In the Q0 quadrant we have our data divided into shares that are associated with namespace ID, whose data will be erasure-coded and committed to NameSpace Merkel Tree. A shares raw data is interpreted in different ways depending upon your namespace ID which has the following raw data.

The first part of the share is the namespace ID
Next we have a share reserved byte that is represented as * which represents the canonically serialised first request that starts in the share or 0 if there is none as the 1 byte big endian unsigned integer.
Remaining we have the requested data in which we have done some zero padding i.e adding zero at the end of the data to achieve the desired length of the byte to complete the size of 256 byte which can be shown below.

Erasure Coding of Original data:

First of all, the data in Q0 is carefully divided in the equal shares and is made sure that all shares lie in the set of 16 bit Galois fields (which is the length of field we decide to perform Reed Solomon coding) which gives us the matrix as follows.

Where txs are the transaction data and msg1 and msg2 stands for the messages which is the interaction between the block producer and transaction sender which is signed by the transaction sender. And then broken into the parts so that it can be laid out in the matrix

For the above matrix we decide on a generator polynomial that has alpha as the primitive roots and can also be written in the form of a matrix. The generator matrix helps us to add parity to the original data and also it helps to recover the loss from the parity. Generator matrix looks like below.

Which is obtained by obtaining a primitive polynomial. Now let us suppose that we have our following matrix as the original data in the GF(2⁸) modulo 0x11D

Now we create a matrix that can encode any matrix of size 4xn here we have n = 8 therefore we create a matrix of size 8x4 which can form polynomial of size n and has n-1 roots hence our parity encoded matrix is as follows:

Since we now know that Vandermonde matrix is an alternative approach to lagrange interpolation method it can interpolate the polynomials by expressing them in matrix form.

Also we know that polynomials are unique and therefore it is easy to generate polynomials using the determinant of the matrix or second approach is to form a generator matrix by taking the upper four rows since we have four rows in original data therefore we are considering only four rows. Which helps us yield the following matrix.

Since we have obtained our extended matrix we will now multiply our original data and we will have the following result

Recovering the lost data:

Now let us assume that we have lost our data, let say row 0 and 4 from the extended matrix. Hence our encoding matrix and original matrix becomes as follows

Since we know that A*(A)^(-1) = I. Therefore we invert the encoded submatrix and multiply the inverse with encoded matrix to check that our answer is correct so we get

As we have obtained an identity matrix therefore we are safe to say that we have a correct inverse matrix. Now in order to obtain our lost data we will multiply our inverse matrix with the extended data which will result in the restoration of the original matrix.

In actual process what we do is that we only multiply the extended data with the sub inverse row of the generator matrix as shown below

Once we have constructed the data matrix row we will also generate parity in the same way

Now it comes in mind that how are we extending our data horizontally so it’s not that of moon maths it’s basically we take another vandermonde matrix and apply the transpose of the matrix this how we get our generator function. Once we have our generator matrix we then continue to the same work that is explained above.

The advantage of Reed Solomon code is that the probability of an error remaining in the decoded data is much lower than the probability of an error if Reed Solomon is not used. But the approach of encoding a data in matrix form is not efficient since the Vandermonde matrix fails on the data of large size therefore in order to have a data more secure I believe lagrange interpolation would be a better approach to the problem