Scaling Blockchain Storage
Taming the monolithic database
GoChain has been successful in increasing blockchain performance through block size reduction, significant code optimizations, and Proof of Reputation consensus, however, that causes another issue—more transactions per second means that the storage size grows at the same rate.
The storage layer originally used the
go-ethereum implementation where all data is stored in a single
goleveldb database. Because
goleveldb is a log-structured merge tree, there wasn’t an easy way to offload old key/value pairs into cold storage because writes are spread across numerous files for the entire keyspace.
The GoChain testnet regularly needed to be reset because disks would fill and there was no way to segment the data. This was annoying for testnet users but it was a complete non-starter for mainnet. We needed another approach.
Building the solution
The new storage engine approach separates our database into multiple tables, each of which is optimized and segmented based on its type of data. This segmentation let us build fast, optimized storage for cold block data.
The original storage implementation used a concept of tables internally, however, it was done using a single byte prefix on every key. For example, every block body was stored with a key that starts with a
b and every block header was stored with a key that starts with an
In our new engine, we separated out 4 tables into their own separate databases:
body— Block body data.
header— Block header data.
receipt— Block receipt data.
global— All other data.
Block data is typically immutable once it becomes old so separating tables for each type of block data allows us to do some additional optimizations.
Now that we have per-table databases, we want to start optimizing our block related data so that we can offload chunks into cold storage so we don’t run out of disk space.
The first step to this is to segment ranges of blocks by block number. We currently group blocks into 1,024-block ranges. Each of these ranges is separated into its own
goleveldb database. As older ranges age out, we compact the files and back them up to an S3-compatible provider. This lets us keep hot block data mutable and on-disk for fast access but still lets us retrieve older data as-needed.
Immutable file format
The LSM format of
goleveldb works well for write-intensive applications, however, it has several unwanted features for cold data:
- Uses many on-disk files to represent the key/value space.
- Read access is inefficient—typically
- Duplicate key entries can exist until levels are compacted.
Since we know that our data within our range is immutable, we can repackage our LSM data into a more efficient format. For this we developed a simple, single file format that includes three sections:
headersection includes basic file section information and a checksum.
datasection writes all key/value pairs in sequential order.
indexsection writes a Robin-Hood Hash-based index which allows
O(n)lookups for entries.
This approach also allows us to easily upload a single file to our S3-compatible storage instead of trying to tar and untar an entire
goleveldb database directory for storage.
Results & Performance
As with all benchmarks, take these with a grain of salt. These numbers were generated from our load generation tool on testnet. Results can vary depending on a number of factors such as transaction size & complexity.
Simply pushing data off to a secondary storage system isn’t enough if the cost doesn’t scale appropriately. We chose to interface with S3-compatible providers because they are ubiquitous, cheap, and—for all practical purposes—don’t run out of space.
As of this writing, the testnet cluster has
1.6M blocks. The headers, body, & receipts for each block requires an average of
348KB. Current Amazon S3 pricing is
$0.023 GB/mo which means our per-block storage costs are
$0.0000076 block/mo. Storage for a single node currently costs approximately
In addition to cost analysis, we don’t want the compaction to our immutable file format to be a performance bottleneck. Because the file format is relatively simple, compacting segments is extremely quick and performance mostly correlates to the total data size.
Below are the average compaction times for each table type:
- Avg per-block header compaction:
- Avg per-block body compaction:
- Avg per-block receipt compaction:
Overall the new storage engine has been successful. We’ve removed the upper bound on node block data by segmenting and compacting our different types of data. We’ve been able to keep costs low for our long term storage but we are investigating additional cost savings through the use of compression and alternate storage classes such as nearline and coldline storage.