This post is meant for members of the Orbs Universe: Guardians, Validators, Delegators, and everyone else interested in monitoring the Orbs network.
Hi everyone! Ido here. One of the important milestones we hit in April, post-launch, was bringing our monitoring system into production. An efficient monitoring system is crucial for optimizing all aspects of the Orbs blockchain. This update covers pre-production load testing and post-production performance monitoring; both use cases employ the Orbs monitoring dashboard as a key component.
In the weeks leading up to the planned launch of the Orbs network, the Orbs core team ran stability tests on various network configurations. The purpose of this testing was twofold: to identify bugs that manifest only after the system has been online for a long time, and to test the load limits of the system.
We wanted to share the progress of the stability tests with the whole team, so we decided to create a dashboard displayed on a TV at the Orbs Ltd. office, where everyone could see real-time metrics of the running tests.
The requirement was to let everyone have a casual glance once in a while and immediately spot anomalies. Digging deeper into problems is left to other tools.
As a starting point, we already had a /metrics endpoint on the Orbs node, which prints the node's metrics in JSON format. For example, see the Orbs Audit node metrics.
Here you can see sample metrics from a validator node in production — it shows the current block height as 452631:
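As a rough illustration only (the key names below are hypothetical, not the actual Orbs metrics schema; the block height value is the one mentioned above), such a JSON payload might look like:

```json
{
  "BlockStorage.BlockHeight": { "Name": "BlockStorage.BlockHeight", "Value": 452631 },
  "TransactionPool.PendingTransactions": { "Name": "TransactionPool.PendingTransactions", "Value": 17 },
  "Runtime.HeapAlloc.Bytes": { "Name": "Runtime.HeapAlloc.Bytes", "Value": 41256960 }
}
```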
For the presentation layer, we evaluated several dashboard solutions. Geckoboard is a great-looking dashboard, especially on a TV screen, and if your use case is covered by what it supports out of the box, it is highly recommended. In our case we needed features that were not available, such as displaying a time frame of less than one day, and performing calculations over the data directly in the dashboard.
Another solution we tried was DataDog, which showed initial promise, as it supports scraping JSON metrics directly from a URL and performing calculations over them. However, we were not satisfied with the visual quality on a TV screen (specifically the lack of a Dark Mode), and decided to keep looking.
Eventually, we decided to use the Standard Plan of Grafana Cloud with a Prometheus backend for these reasons:
- Hosted solution — no need to maintain a Grafana server
- Renders beautifully on a TV screen
- Flexible in performing calculations over incoming data in real time (together with Prometheus)
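As an example of the kind of in-dashboard calculation we needed, Prometheus queries (PromQL) can derive rates from raw samples directly in a Grafana panel. The metric and label names here are hypothetical, not the actual Orbs schema:

```promql
# Blocks committed per second over the last minute, derived from a
# hypothetical monotonically increasing block_height gauge per node:
rate(block_height{node="validator-1"}[1m])
```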
As for actually displaying the data, we started with a TV attached to an off-the-shelf Android TV device, but several browsers we tested on the Android TV failed to render any of our dashboards correctly, and it was not possible to disable the screen saver and have the dashboard load on startup without hacking the device.
So we decided to go with a Raspberry Pi 3 B+, which is a mean little machine, capable of rendering the graphs without a hitch. Since it's a Linux machine with a wide community around it, it was easy to find solutions and guides on setting up Grafana to our liking.
Here is our real monitoring TV (complete with window-glare):
Let’s go over the monitoring architecture we eventually came up with:
In the Orbs test networks, the Orbs Nodes are machines on AWS operated by Orbs Ltd. In production, the Orbs nodes are managed by their respective owners. Most of the Orbs nodes in production are on AWS but it doesn’t matter so long as their /metrics endpoint is accessible.
The metrics processor is a Node.js server running on an AWS machine. It passively listens for requests on its own /metrics endpoint; when called, it executes HTTP POST requests to /metrics of every node, collects the metrics, converts them to Prometheus format, and returns the data.
Conversion to Prometheus format is done with the help of a Node.js Prometheus client library.
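Our processor uses the Node.js client library, but the core of the conversion step is simple enough to sketch. Below is a minimal Go version (matching the node's own language) that turns a flat map of JSON metrics into Prometheus text exposition format; the metric names and `node` label are illustrative assumptions, not the actual Orbs schema:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toPromText converts a flat map of JSON metrics (e.g. parsed from a
// node's /metrics endpoint) into Prometheus text exposition format,
// attaching the source node as a label.
func toPromText(node string, metrics map[string]float64) string {
	names := make([]string, 0, len(metrics))
	for name := range metrics {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic output order

	var b strings.Builder
	for _, name := range names {
		// Prometheus metric names may not contain dots.
		promName := strings.ReplaceAll(name, ".", "_")
		fmt.Fprintf(&b, "%s{node=%q} %g\n", promName, node, metrics[name])
	}
	return b.String()
}

func main() {
	sample := map[string]float64{
		"BlockStorage.BlockHeight":            452631,
		"TransactionPool.PendingTransactions": 17,
	}
	fmt.Print(toPromText("fca820", sample))
}
```

A real processor would additionally emit `# TYPE` hints and handle counters vs. gauges, which is what the client library takes care of.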
We initially planned to configure the Hosted Grafana's Prometheus database to periodically sample the metrics-processor, but unfortunately it is push-only: you have to push data into it rather than have it actively scrape targets (such as the metrics-processor) to pull data.
The solution was to create a second Prometheus database to act as a bridge. We do not need to store data on that Prometheus — its only purpose in life is to periodically pull data from metrics-processor and push it to the Hosted Prometheus.
For the Prometheus bridge, we use the predefined Prometheus Docker image with some simple configuration. It runs on the same AWS machine as the metrics-processor and is configured to restart automatically, so if the machine restarts, Prometheus starts with it. So while it is a somewhat unwanted addition to the architecture, it requires very little maintenance.
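A bridge configuration along these lines can be expressed in a standard `prometheus.yml` using `scrape_configs` to pull from the metrics-processor and `remote_write` to push into the hosted instance. The target port, endpoint URL, and credentials below are placeholders, not our actual values:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: orbs-metrics-processor
    static_configs:
      - targets: ['localhost:3000']   # metrics-processor /metrics endpoint

# Forward every scraped sample to the Hosted Prometheus on Grafana Cloud.
remote_write:
  - url: https://<your-grafana-cloud-endpoint>/api/prom/push
    basic_auth:
      username: '<grafana-cloud-instance-id>'
      password: '<grafana-cloud-api-key>'
```

Running the official `prom/prometheus` image with Docker's restart policy (e.g. `--restart unless-stopped`) gives the automatic-restart behavior described above.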
The Hosted Prometheus on the Grafana Cloud is where the actual metrics data is stored, so all tests data (and now production data) is easily presentable on Grafana.
We created two main dashboards:
An “Orbs Production” dashboard, which gives a high-level view suitable for TV display: a casual glance once in a while should confirm that everything “looks right”.
Sample of the Orbs Production dashboard:
An “Orbs DevOps” dashboard, which contains much more information and is useful for digging deeper when a problem is suspected.
Sample of the Orbs DevOps dashboard:
We plan to introduce a Prometheus Go client library into the Orbs node to allow it to output its metrics directly in Prometheus format in addition to the existing JSON format.
During the last month before the launch of the Orbs network, the Orbs core team continuously ran long-running tests (“stability tests”) with the sole purpose of spotting bugs that manifest only after the system has been running for some time. We suspected some bugs had eluded the developers in the otherwise very well tested code — the team has a suite of unit, component and integration tests, and the whole system was written with strict adherence to TDD principles.
A typical stability configuration was a 22-node network (the expected maximum number of nodes in production) running about 10 transactions per second (tps) with the Lean Helix consensus algorithm and the proposed production configuration. The team did not try to optimize performance or hit resource bounds — only to detect anomalies resulting from prolonged runtime. The team did discover and fix a memory leak, caused by goroutines that never terminated, which could not have been detected otherwise.
The method of testing was to let the test run, track it with the dashboards described above, and if an anomaly came up, enable logs on one or more machines and investigate using logz.io. Error logs are written to logz.io by default, but Info logs are not, as they are very verbose.
Logz.io takes a log line and extracts properties in addition to the basic message. This lets you filter on properties — for example, show messages only from node “fca820”, or only messages for block height 1, and so forth. When debugging a flow of activity, filtering on properties like these is essential for making sense of the large number of logs generated.
Here is a sample log line with properties:
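As a stand-in for the screenshot, a structured log line of this kind might look like the following. The field names are illustrative assumptions; only the node ID and block height values come from the examples above:

```json
{
  "level": "info",
  "timestamp": "2019-04-28T09:12:33.412Z",
  "node": "fca820",
  "service": "consensus-algo",
  "block-height": 1,
  "message": "generated new block proposal"
}
```

With properties extracted like this, a Kibana-style filter such as `node: fca820 AND block-height: 1` narrows the stream to exactly the flow under investigation.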
The purpose of load testing is to identify the load limitations of specific network configurations. This takes place after stability tests are completed. The monitoring of load tests is identical to that described above. The Orbs Load Generator is a Go process that uses the Orbs Client SDK in Go to generate the load.
The load generator process sends transactions at a given rate (transactions per second — tps), split evenly between all machines. Its present implementation sends a burst of n transactions, sleeps for a second, and repeats. This is not exactly how sending at a specific tps should work, and the team plans to develop improvements on this implementation.
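The burst-then-sleep pacing described above can be sketched in Go (the load generator's language). This is a simplified sketch, not the actual implementation: `send` stands in for the real Orbs Client SDK call, and the even split across nodes is an assumption from the description above:

```go
package main

import (
	"fmt"
	"time"
)

// sendBursts approximates a fixed transactions-per-second rate the way the
// load generator does: each second it fires `tps` transactions, split evenly
// across `nodes` targets, then sleeps out the rest of the second.
func sendBursts(tps, nodes, seconds int, send func(node, txIndex int)) {
	perNode := tps / nodes
	for s := 0; s < seconds; s++ {
		start := time.Now()
		for n := 0; n < nodes; n++ {
			for i := 0; i < perNode; i++ {
				send(n, i)
			}
		}
		// Sleep out the remainder of the second. A true fixed-rate sender
		// would instead space transactions evenly within the second, which
		// is the kind of improvement mentioned above.
		if d := time.Second - time.Since(start); d > 0 {
			time.Sleep(d)
		}
	}
}

func main() {
	count := 0
	sendBursts(400, 4, 1, func(node, tx int) { count++ })
	fmt.Println("sent:", count)
}
```

The burst pattern explains the jitter one can see at high rates: all n transactions land at the start of each second rather than being spread across it.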
The Orbs core team tested 3 different network configurations:
- 4-node network on the same AWS region (m4.xlarge in us-east-2)
- 8-node network (m4.xlarge in us-east-2 and us-east-1). Same as #1, only with more nodes.
- 15-node network — actual production nodes (m4.xlarge, multi-region). These are the actual production machines, operated by the Validators. The test ran on a separate virtual chain from the production virtual chain: it used the same AWS EC2 machines as production, but different Docker containers (each virtual chain runs in its own Docker container), so the production virtual chain is not polluted with load-test inputs.
The detailed results can be found below. Based on the results, we have concluded the following:
- The Orbs network’s present load-test configuration is network-bound. At higher tps rates the Load Generator starts to receive HTTP 503 “Service Unavailable” and other errors. The Orbs core team suspects this is related to limitations inherent in the Docker version we are using, though this is still under investigation. The 4-node network was capped at about 400 tps (100 tps per node); pushing beyond that simply generated multiple errors on the client side and had little additional effect on the nodes (they simply did not process more than 100 tps each).
- During testing, the Orbs core team moved to an 8-node network to reduce the per-machine tps load. This allowed increased tps (about 500 tps almost without errors). Remember that the more nodes there are, the more traffic flows between them, since each node converses with all the others during transaction forwarding, node sync and consensus; this explains why doubling the node count does not double the tps. The core team plowed on and increased the load, eventually reaching 912 tps of committed transactions — that is, transactions in blocks that passed consensus. If the goal is a negligible error rate on the client side, 500 tps is the practical upper bound of such a network. Once this network limit is resolved, we expect the tps limit to increase, as no other bounds were detected even at 912 tps.
- The production configuration (the real 15-node network) ran smoothly at 200 tps. The Orbs core team did not press it further at the time, both to avoid affecting the production virtual chain and because a Validator election was due soon after the test concluded and the team did not want to impact it either.
- CPU and Disk I/O were not limiting factors in any of these tests — CPU never exceeded 25% even at max tps, and disk I/O was about 200 Kb/s max.
- Blocks per second (as committed by consensus) was not a limiting factor during the tests. Initially, as tps went up, the blocks/sec rate went up as well; but as tps kept climbing, blocks/sec peaked and then started declining. We assume the reason is that under higher load, more transactions are waiting in the Transaction Pool's pending pool when a new block collects them, so blocks are larger and take longer to pass around the network — and by the time the next block needs to be generated, even more transactions are waiting for it, and so on.
- When the load dropped back to 100 tps, the system resumed normal behavior — it did not crash or enter an unstable state due to the very high load it had previously sustained.
- At higher tps the dashboard showed very jittery data (see graph below). We suspect this is an artifact of the monitoring architecture rather than actual jittery behavior: at times, monitoring data stopped being collected (even fetching /metrics manually sometimes returned an HTTP 500 error), so the dips in the graph are unreliable:
Detailed results (posted here as raw data from a single load test, take with a grain of salt):
Going forward, the Orbs core team plans to integrate Prometheus metrics into the node itself, eliminating the JSON-to-Prometheus conversion and allowing histograms to be generated right in the node.
Originally published at https://www.orbs.com on May 13, 2019.