Running a full EOS node and tuning ZFS

As of today, EOS mainnet is almost 6 months old, and it’s growing fast. The blockchain is served by nodeos daemons running in a peer-to-peer network. Typically, a Block Producer runs nodeos servers and provides public API endpoints for users.

One of the public APIs is provided by history_plugin. It demands substantial hardware resources and maintenance effort, it tends to crash from time to time, and a full recovery may take days. Several ongoing projects aim to provide an alternative to this plugin: 1, 2, 3.

There are several typical reasons for running your own nodeos server, such as:

  1. Your interactive application needs a fast response from an API node, and those provided by BPs are too slow or too far away.
  2. You want to run specific plugins, such as the ZMQ plugin or the like, to export real-time events from the blockchain and process them in your applications.
  3. You need a nearby peer-to-peer node to feed your other nodeos instances with block data.

ZFS is a feature-rich filesystem developed by Sun Microsystems, and it is now available on Linux. In particular, Ubuntu 18.04 includes it in the standard OS distribution, and only a few packages need to be installed. Among other features, ZFS allows quick and lightweight snapshots, and fast rollbacks to existing snapshots. It also supports compression and a record size that can be adjusted to suit the application.

So, in order to run a nodeos server in November 2018, you need a physical or virtual server with at least 16GB RAM, some 50GB storage for the operating system, and a 500GB block storage volume for a ZFS pool.
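As a rough sketch of preparing the pool on Ubuntu 18.04: the pool name `zdata` and the device `/dev/sdb` are assumptions made for illustration, not values from this article, so substitute your own. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
set -eu

# Assumptions (not from the article): pool name and block device.
POOL="${POOL:-zdata}"
DISK="${DISK:-/dev/sdb}"

# Print each command instead of executing it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_pool() {
    # ZFS userland tools from the standard Ubuntu 18.04 repositories.
    run apt-get install -y zfsutils-linux
    # Create the pool on the 500GB block storage volume.
    run zpool create "$POOL" "$DISK"
    # Verify that the pool is online.
    run zpool status "$POOL"
}

setup_pool
```

Review the printed commands first, then run the script as root with `DRY_RUN=0` to apply them.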

I’m using an LXC container in my scenario, but it’s optional: nodeos can also run in the main OS. LXC, however, lets you isolate each function in its own container, and it makes it easy to copy the whole container to a different server.

Several ZFS filesystems are created inside the pool, as follows:

  • LXC container’s operating system. This is a standard ZFS filesystem with recordsize=128k, primarycache=all. You may choose to encrypt it if needed.
  • state data for nodeos. This is a large sparse file representing the EOS state memory. Typically you would allocate 32GB, and the file occupies about 3GB of disk space. The state file is a Linux shared memory segment mapped to a file. I could not find information on how Linux reads and writes it, and strace does not show these file operations. Assuming that the standard Linux memory page is 4KB, it makes sense to allocate it with recordsize=128k, primarycache=metadata. As its content already represents RAM state, it makes sense to disable full content caching and keep only metadata caching.
  • blocks data for nodeos. This folder contains a large log of blockchain blocks (around 80GB at the time of writing) and a few supplementary files. If nodeos replays the whole blockchain, it reads the log in 8KB segments. If it’s already in sync with the mainnet, it appends random pieces of data between 3KB and 30KB. When a P2P neighbor requests blocks, nodeos also reads them in 8KB segments. LZ4 is a very fast compression algorithm supported by ZFS, and it gives a 1.46x compression factor for the block data. Thus, the ZFS parameters are: recordsize=8k, compression=lz4, primarycache=all. It makes sense to cache the content because multiple P2P neighbors may request the same block data. Also, if by accident your recordsize is 128k, metadata-only caching reduces the performance significantly, as the whole 128k record has to be read multiple times.
  • My scenario also uses a MySQL database in the same container. MySQL data typically compresses by a factor of 2, and the binlog compression factor exceeds 4x. Compression reduces the I/O load on the disk, thus increasing the overall server performance.
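The layout above could be created roughly as follows. This is a sketch under assumptions: the pool name `zdata` and the dataset names are hypothetical, and `compression=lz4` for the MySQL dataset is my guess, since only the achieved ratios are given above. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
set -eu

# Assumptions (not from the article): pool name and dataset names.
POOL="${POOL:-zdata}"

# Print each command instead of executing it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

create_datasets() {
    # Container OS: values from the list above.
    run zfs create -o recordsize=128k -o primarycache=all "$POOL/lxc"
    # nodeos state: cache metadata only, the content already mirrors RAM.
    run zfs create -o recordsize=128k -o primarycache=metadata "$POOL/eos_state"
    # nodeos blocks: 8k records to match the read size, LZ4, full caching.
    run zfs create -o recordsize=8k -o compression=lz4 -o primarycache=all "$POOL/eos_blocks"
    # MySQL: lz4 is an assumption, the algorithm is not named in the article.
    run zfs create -o compression=lz4 "$POOL/mysql"
}

show_ratios() {
    # Report the achieved compression ratio for every dataset in the pool.
    run zfs get -r compressratio "$POOL"
}

create_datasets
show_ratios
```

Once data has been written, the `compressratio` output is the place to check whether your own figures are close to the 1.46x and 2x factors quoted above.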

ZFS is very good at managing snapshots. You can stop nodeos in the middle of its work (assuming it’s not replaying), and create a ZFS snapshot of its blocks and state data. Then, after any abnormal crash, you can return to that intermediate point without having to rebuild the whole state again. But before rolling back, it’s important to flush the shared memory cache, otherwise your data will become corrupted:

sync; echo 3 > /proc/sys/vm/drop_caches
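Put together, the snapshot and rollback steps might look like the sketch below. The dataset names `zdata/eos_state` and `zdata/eos_blocks` are hypothetical, and how you stop nodeos depends on your setup, so that step is left as a comment. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
set -eu

# Assumptions (not from the article): dataset names for state and blocks.
STATE_FS="${STATE_FS:-zdata/eos_state}"
BLOCKS_FS="${BLOCKS_FS:-zdata/eos_blocks}"

# Print each command instead of executing it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Stop nodeos before calling either function.

take_snapshot() {
    tag="$1"
    # Snapshot state and blocks under the same tag so they stay paired.
    run zfs snapshot "$STATE_FS@$tag"
    run zfs snapshot "$BLOCKS_FS@$tag"
}

roll_back() {
    tag="$1"
    # Flush dirty pages and drop the page cache before touching the data.
    run sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    # Plain "zfs rollback" only accepts the most recent snapshot;
    # going further back needs -r, which destroys the newer snapshots.
    run zfs rollback "$STATE_FS@$tag"
    run zfs rollback "$BLOCKS_FS@$tag"
}

take_snapshot "$(date +%Y%m%d-%H%M)"
```

To return to a saved point, call `roll_back` with the tag printed by `zfs list -t snapshot`, then start nodeos again.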

Full installation scenario: https://gist.github.com/cc32d9/04b66b732bec9aade93abd4a1b5a715e