Exploring a Merkle tree database

Chase Smith
Published in Proxima.one
Aug 16, 2021

The B+ Merkle tree can be leveraged to improve performance and efficiency by incorporating database file I/O directly, instead of building on top of a separate storage engine. The feature set of the B+ Merkle tree includes multi-layer queries and relational objects, and extensions for high-level queries such as multi-key filters and relational queries can be built with the same general structural logic. Let's look at how that works.

Existing Solutions for Complex Database Feature Set

Existing solutions focus on the data structure itself without regard to implementation or feature set (e.g. a Patricia trie built on LevelDB), with improvements and features built on top. This layering suffers various performance penalties, since every operation on the structure is translated into key-value operations on the underlying store. Extending the feature set to filters would suffer from the same issues.

Proxima DB Node

The ProximaDB server implements logs, snapshots, and structured data with relational queries, exposed through a GraphQL endpoint.

The database writer is responsible for bulk updates to the database index server. The document writer pushes document updates and removals to the server, and also specifies the encoding format for the database server. This encoding format determines how the data can be read and queried.
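As a rough illustration, a document writer along these lines might look like the following TypeScript sketch; the interface names and shapes are our own assumptions for the sketch, not ProximaDB's actual API.

type Encoding = "json" | "blob";

interface Operation {
  kind: "put" | "remove";
  table: string;
  id: string;
  document?: unknown; // omitted for removals
}

interface DocumentWriter {
  // Declare how a table's values are encoded before writing to it.
  setEncoding(table: string, encoding: Encoding): Promise<void>;
  // Push a batch of updates/removals to the database index server.
  bulkWrite(ops: Operation[]): Promise<void>;
}

async function pushUpdates(writer: DocumentWriter): Promise<void> {
  await writer.setEncoding("accounts", "json");
  await writer.bulkWrite([
    { kind: "put", table: "accounts", id: "1", document: { address: "0xabc" } },
    { kind: "remove", table: "accounts", id: "2" },
  ]);
}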

In the case of structured data, it is possible to set up an API that can query in an intelligent manner. We do this using a GraphQL server for querying, and for pushing operations, proof constructions, and audits to the database index. By pushing the query-planning logic into the GraphQL server, it is possible to extend the functionality of the server.

Database Index

The database index maintains a flat-file system with a separate, radix-based index. In this manner, the data indices have the ability to traverse a group of storage endpoints. The flat-file system is an improvement upon the approach that uses LevelDB/RocksDB as a storage layer because it reduces the number of disk I/O operations and avoids the random node lookups required for each traversal of the index. The RocksDB approach performs a key-value lookup for every node in the tree, while the flat-file system only performs lookups for values. This means the flat-file system scales better with the size of the database without sacrificing I/O speed or performance. The only tradeoff is the engineering cost of implementing the flat-file system for a database index.
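A minimal sketch of the contrast, assuming (our assumption, not stated in the article) that the radix index is held in memory and maps keys to byte offsets in a flat value file:

import { openSync, readSync, closeSync } from "node:fs";

// LevelDB/RocksDB layout: get(nodeKey) hits the store once per tree node
// on the path, so a single lookup costs O(depth) random disk reads.

// Flat-file layout: the radix index resolves the key in memory, and the
// only disk read is a single positioned read of the value itself.
interface RadixIndex {
  // Returns the location of the value in the flat file, or null if absent.
  lookup(key: string): { offset: number; length: number } | null;
}

function readValue(index: RadixIndex, valueFile: string, key: string): Buffer | null {
  const loc = index.lookup(key); // in-memory traversal, no disk I/O
  if (loc === null) return null;
  const fd = openSync(valueFile, "r");
  try {
    const buf = Buffer.alloc(loc.length);
    readSync(fd, buf, 0, loc.length, loc.offset); // one disk read per value
    return buf;
  } finally {
    closeSync(fd);
  }
}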

The Basic Proxima Database Node

The Proxima database server is responsible for maintaining individual tables, storing snapshots, and maintaining the logs of the database. The server is responsible for the data stored within a set of volumes, and relies on an authenticated data structure within its index.

Data Encoding

The encoding of the data for each table can be set so that the table is treated as a document database (JSON), or so that values are stored in blob format for images and other assets. Objects and object models can be defined with specific hashing functions for the associated proofs or audits. This requires more development, but it has the potential to be used in instances where data and verification are costly, such as on-chain smart contracts.
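For instance, a per-table encoding configuration might look like this sketch; the names are hypothetical, and an on-chain deployment would likely swap in a different hash function (e.g. keccak256 via a library), which we do not show here.

import { createHash } from "node:crypto";

type HashFn = (data: Buffer) => Buffer;

interface TableEncoding {
  format: "json" | "blob"; // document database vs. raw assets
  hash: HashFn;            // used to build proof/audit leaves for this table
}

const sha256: HashFn = (data) => createHash("sha256").update(data).digest();

const accounts: TableEncoding = { format: "json", hash: sha256 };
const images: TableEncoding = { format: "blob", hash: sha256 };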

Tables

In order to use this system effectively, several other components have been added to the database server that are responsible for index and table writes.

Logs

Logging is handled within the index of the database: logs are pushed to an additional index, and the logs are then batch-written as a whole.
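A minimal sketch of that batching behavior, with entirely hypothetical names:

interface LogEntry {
  op: "put" | "remove";
  table: string;
  id: string;
  payload?: unknown;
}

class LogIndex {
  private pending: LogEntry[] = [];

  push(entry: LogEntry): void {
    this.pending.push(entry);
  }

  // Batch-write the accumulated entries as a whole, then clear the buffer.
  async flush(write: (batch: LogEntry[]) => Promise<void>): Promise<void> {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    await write(batch);
  }
}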

Snapshots

The database stores snapshot data associated with each table. Each table keeps track of metadata that includes: snapshot root, operation offset, logs, storage, encoding, prev, and config. This enables a set of basic requests: get, range, and filter (if it is a document database). For deeper queries we need to look at the GraphQL API and query planning for schema structuring.
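Expressed as a type, the metadata list above might look like the following; the field types are our guesses for illustration, not a published spec.

interface TableMetadata {
  snapshotRoot: string;      // Merkle root of the latest snapshot
  operationOffset: number;   // position in the operation log at snapshot time
  logs: string;              // reference to the table's log index
  storage: string;           // storage volume/endpoint reference
  encoding: "json" | "blob"; // table encoding (see Data Encoding above)
  prev: string | null;       // reference to the previous snapshot
  config: Record<string, unknown>;
}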

Proxima GraphQL Distributed Database

Data modeling and Reference Relations

The schema for the document database is designed in GraphQL, and the correct indexes are generated according to these rules. The GraphQL schema requires no knowledge of the encoding scheme of the underlying data. The document writer writes to an encoder, and the GraphQL layer then enables querying. Object references in the Proxima GraphQL DB are represented by ids and their associated index. In the example of accounts, this means that balances are represented as balance ids, and the balance entity is requested from a separate table. By using object references we can connect objects from different servers, and run tables for the same server on separate machines. This is possible because the GraphQL server manages the query planning and has an understanding of the object model being queried.

Account Example: Modeling the data

Goal: use the ProximaDB server to read and write data regarding tokens.

Write the schema for Account, Balance, Token, Transaction, Chain, and Block as a GraphQL schema:

type Account {
  id: ID
  address: String
  balances: [Balance]
  transactions: [Transaction]
}
type Balance {
  id: ID
  lastTransaction: Transaction
  tokenId: String
  accountId: String
  amount: Float
}
type Token {
  id: ID
  address: String
  symbol: String
  name: String
  decimals: Int
  supply: Float
}
type Transaction {
  id: ID
  block: Block
  hash: String
  tokenId: String
  from: String
  to: String
  amount: Float
  timestamp: Float
}
type Chain {
  id: ID
  chainId: String
  latestBlock: Block
  blockNumber: Int
}
type Block {
  id: ID
  hash: String
  chainId: String
  number: Int
  timestamp: Int
}

This is then used to generate the GraphQL endpoint and the corresponding table indexes.

Account Example: Basic Updates

If we are using the same account structure, then it is possible to use it along with a document writer to push documents into the database server, as in the steps below (a usage sketch follows the list).

  1. Generate content updates and write them to the database as operations
  2. A bulk write of the documents is used
  3. The database writes the operations to the log index for User, Balance, and Token Contract
  4. If the update is the 1,000th operation, the database stores a snapshot with the id of the current operation within a snapshot index
  5. The content is updated and the user can now begin reading the data
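Put together, the client-side flow might look like this sketch, reusing the hypothetical DocumentWriter shape from earlier; the 1,000-operation snapshot interval comes from step 4 and is handled server-side.

interface Operation { kind: "put" | "remove"; table: string; id: string; document?: unknown }
interface DocumentWriter { bulkWrite(ops: Operation[]): Promise<void> }

const SNAPSHOT_INTERVAL = 1000; // step 4: the server snapshots every 1,000th operation

async function applyUpdates(writer: DocumentWriter, ops: Operation[]): Promise<void> {
  // Steps 1-2: generate the content updates and bulk-write them as operations.
  await writer.bulkWrite(ops);
  // Step 3 happens server-side: the operations are appended to the log index.
  // Step 4 also happens server-side at every SNAPSHOT_INTERVAL-th operation.
  // Step 5: once the write returns, the content is readable.
}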

Multi-layer queries

A user who wants data matching the document schema can make a request to the generated GraphQL endpoint (get, query, relational). The multi-layer queries are limited to requesting data; data transformations (e.g. a sum of balances) are left to the client side. Larger aggregations of data can be computed statically as part of the Proxima streaming process, and these aggregations can then be audited through the ProximaDB auditing system. Multi-layer queries traverse the object graph, requesting each object reference along with a potential proof associated with that reference. This presents the same bottleneck that other document stores face: the time of the query is directly connected to the complexity of the query itself.
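A sketch of that traversal, using an illustrative (not actual) index-client interface: each layer of the object graph costs one reference lookup, plus an optional proof.

interface Proof { root: string; proof: string }
interface Fetched<T> { value: T; proof?: Proof }

interface IndexClient {
  // One lookup per object reference; the proof is returned when requested.
  get<T>(table: string, id: string, prove: boolean): Promise<Fetched<T>>;
}

interface Balance { id: string; tokenId: string; accountId: string; amount: number }
interface Account { id: string; balanceIds: string[] }

// Resolve an account and its balances layer by layer; query time grows
// with the number of references the query touches.
async function resolveAccount(db: IndexClient, id: string, prove: boolean) {
  const account = await db.get<Account>("accounts", id, prove);
  const balances = await Promise.all(
    account.value.balanceIds.map((bid) => db.get<Balance>("balances", bid, prove)),
  );
  return { account, balances };
}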

Account Example: Multilayer Queries

Given the database index structure, which has the ability to filter and provide range proofs, it is possible to construct operations that query relations between objects. We want our account information, which means the account, its transactions, and its balances. The query looks like this:

query {
  Account(id: "1", prove: true) {
    id
    balances {
      id
      tokenId
      accountId
      lastTransaction {
        id
        timestamp
        amount
        hash
      }
      amount
      proof {
        root
        proof
      }
    }
    proof {
      root
      proof
    }
  }
}

Verification

The GraphQL request can take audit or prove as header extensions, and returns the audit and proof in the response. The structure of the query and proof can be shown as follows (a verification sketch follows the diagram):

[account proof] -> [balance proofs] -> [token proof]
                -> [transaction proofs] -> [token proof]
                -> [blockchain proof]
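Client-side, verification walks this chain from a trusted root downward. A hedged sketch, assuming each proof verifies against the root established by the layer above it (the concrete proof format is not specified here):

interface Proof { root: string; proof: string }

type Verifier = (proof: Proof, expectedRoot: string) => boolean;

// Each layer's proof must check out against the root established by the
// layer above it, all the way up to a trusted (e.g. blockchain) root.
function verifyChain(proofs: Proof[], trustedRoot: string, verify: Verifier): boolean {
  let expected = trustedRoot;
  for (const p of proofs) {
    if (!verify(p, expected)) return false;
    expected = p.root; // the next layer is anchored in this layer's root
  }
  return true;
}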

The query can be executed using multiple indices, whether hosted on the same machine or on different ones, so long as each index can be reached. In the GraphQL playground, the result includes both the requested data and the accompanying proofs.

Challenges and Bottlenecks

This approach can scale up, and it works especially well in highly distributed systems (e.g. peer-to-peer) because these indices scale horizontally. This means that the bottleneck lies in the access of data and the complexity of the queries themselves.

A call for a performant authenticated data structure: the B+ Merkle tree

To optimize performance and efficiency, it is necessary to implement a database that uses the indexing structure of the B+ Merkle tree. This can be done with a flat-file system, which will enable efficient proofs through:

  • Multi-key filters
  • Query planning and Proof compression
  • Structuring and Encryption of Data

Conclusion

In summary, bringing the data index into the database itself opens up the path to optimized query planning and extended features, including privacy and structured data. These components consider only a single node; in our next article we discuss extensions for partitioning data among light clients.
