Whatever software you run, you will probably need a monitoring and alerting system watching it, so you know when it is about to collapse or when it is already down. To do that, you need to enable some metrics on your system, so you can investigate what happened, how to solve the problem and how to prevent it from happening again.
Neo4J is no different from any piece of software, and it comes with the ability to export a bunch of customisable metrics to different systems, so you can visualise, monitor and watch for those values.
To do so, Neo4J provides some configuration properties to be added to the neo4j.conf file, so you can enable the metrics you are interested in and choose where they go. In Neo4J 4.1, the relevant defaults are metrics.csv.enabled=true, metrics.csv.interval=3s, metrics.csv.rotation.keep_number=7 and metrics.csv.rotation.size=10.00MiB.
That means that, by default, Neo4J has CSV metrics enabled, writes them to disk every 3 seconds, and keeps 7+1 files per CSV it creates (the working CSV and the 7 historic files), each file being up to 10 MiB. But how does Neo4J spill metrics to disk? Could this default configuration become a disk space problem? Let’s find out.
Defaults can be dangerous
Neo4J has all the metrics enabled by default, which creates the following file structure inside your metrics folder:
So, with metrics.bolt.messages.enabled=true alone, Neo4J generates 11 CSV files, one per metric defined in the metrics reference. In other words, Neo4J creates a separate file for each metric within a metric category. Since all metrics are enabled by default, Neo4J ends up with 98 CSV files for all its metrics.
Some quick maths: 98 metrics, keeping 7+1 files for each metric, at up to 10 MiB per file, means the default disk space Neo4J can use for metrics is 98*(7+1)*10 = 7840 MiB. That is roughly 8 GB of disk space in Neo4J 4.1 for metrics alone!
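The same worst-case arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are hypothetical, and it assumes every CSV file grows to the full rotation size.

```python
def csv_metrics_disk_mib(metric_count, keep_number, rotation_size_mib):
    """Worst-case disk usage (in MiB) of CSV metrics.

    Assumes every file reaches the rotation size, and that each metric
    keeps `keep_number` historic files plus one working file.
    """
    files_per_metric = keep_number + 1  # historic files + working file
    return metric_count * files_per_metric * rotation_size_mib


if __name__ == "__main__":
    # Values taken from the text: 98 metrics, 7 historic files, 10 MiB each.
    total_mib = csv_metrics_disk_mib(metric_count=98, keep_number=7, rotation_size_mib=10)
    print(f"{total_mib} MiB, about {total_mib / 1024:.2f} GiB")
```

Changing `metric_count` to the number of metrics in the categories you actually enable gives the worst case for your own setup.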
Store useful metrics…
Depending on your use case, you may not need all the metrics, or you might not need metrics at all. It is even possible that your metrics are sent to Prometheus, so you don’t need to sink them to disk. Either way, you need to decide which metrics you want to use. For example, if you are running a single-instance Neo4J, you don’t need the metrics.neo4j.causal_clustering metrics, as you are not running a causal cluster.
Keep in mind, though, that Neo4J doesn’t offer fine-grained configuration for metrics: you can only enable or disable all the metrics within a category. So if you are interested in the <prefix>.causal_clustering.core.in_flight_cache.misses metric, you need to set metrics.neo4j.causal_clustering.enabled=true, which enables all 20 causal clustering metrics. You cannot have only the metrics you are interested in.
Therefore, in this case, just to get the in_flight_cache.misses metric, you will have 20 metrics * (7+1) files per metric * 10 MiB per file = 1600 MiB worth of metrics.
… and keep them for a useful time
Another important aspect of metrics storage is how long you need to keep them. Do you need, or find useful, metrics from, let’s say, three weeks ago? Maybe, maybe not, but Neo4J’s configuration has no parameter to set a metrics expiration time. So how do you set up the right configuration parameters for your use case?
First of all, you need to know how many bytes a single metric line takes. As there is no official documentation about it, you can run this script in your $NEO4J_HOME to find out:
This script will let you know, for each metric, which fields you are getting, an example of the values and how many bytes each metric line uses. An example output could be this (Neo4J 3.5.12):
Here we can see some interesting things:
- The neo4j.bolt.connections_closed metric uses about 63 bytes.
- The neo4j.bolt.connections_idle metric uses about 13 bytes.
- The neo4j.causal_clustering.core.message_processing_timer.append_entries_request metric uses about 165 bytes.
- Some metric values don’t always have the same byte size. For example, the neo4j.bolt.connections_idle value is 1, but it could be 100, which changes the byte size of the metric’s line.
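A per-line measurement of this kind can be sketched in Python. This is a minimal illustration rather than the script referred to above, and it rests on a few assumptions: the CSV files sit in a metrics directory under $NEO4J_HOME, the first line of each file is a header, and the last line is a representative sample. The function names are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def describe_metric_file(csv_path: Path) -> dict:
    """Return the header fields, the last sample line and its size in bytes."""
    lines = csv_path.read_text(encoding="utf-8").splitlines()
    header = lines[0] if lines else ""
    last = lines[-1] if len(lines) > 1 else ""
    return {
        "metric": csv_path.stem,  # file name without the .csv suffix
        "fields": header.split(",") if header else [],
        "example": last,
        # +1 accounts for the newline that terminates the line on disk
        "bytes_per_line": len(last.encode("utf-8")) + 1 if last else 0,
    }


def describe_metrics_dir(metrics_dir: Path) -> list[dict]:
    """Describe every CSV file found directly inside metrics_dir."""
    return [describe_metric_file(p) for p in sorted(metrics_dir.glob("*.csv"))]


if __name__ == "__main__":
    # Assumption: run from $NEO4J_HOME, with metrics stored in ./metrics
    for info in describe_metrics_dir(Path("metrics")):
        print(f'{info["metric"]}: {info["bytes_per_line"]} bytes/line')
        print(f'  fields:  {", ".join(info["fields"])}')
        print(f'  example: {info["example"]}')
```

Because line sizes fluctuate with the values, it is safer to treat the result as an estimate and round it up before sizing anything.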
With this information, let’s say that you want to keep the metrics for neo4j.bolt.connections_closed for 4 days, so you have the weekend and a Monday or Friday bank holiday covered. How much disk space would it use? Let’s do the calculations:
- 63 bytes per line.
- metrics.csv.interval is set by default to 3s.
- Wanting to keep at least 4 days worth of metrics.
4 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute * 64 bytes per line (the 63 bytes measured, rounded up) / 3 seconds per line = 7372800 bytes, about 7.03 MiB of disk space used just for the neo4j.bolt.connections_closed metric, keeping it for 4 days.
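The retention arithmetic can be written as a small Python helper, which makes it easy to redo for other metrics or intervals. The function name and parameters are hypothetical; it assumes one line is written per interval and that every line has the same size.

```python
MIB = 1024 * 1024


def retention_bytes(days, interval_seconds, bytes_per_line):
    """Bytes written by one metric over `days`, one line every `interval_seconds`."""
    lines_written = days * 24 * 60 * 60 / interval_seconds
    return lines_written * bytes_per_line


if __name__ == "__main__":
    # Values from the text: 4 days, 3-second interval, 64 bytes per line.
    needed = retention_bytes(days=4, interval_seconds=3, bytes_per_line=64)
    print(f"{needed:.0f} bytes = {needed / MIB:.2f} MiB")
```

Rounding the result up to the next whole MiB gives a rotation size with some headroom for lines that grow over time.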
Now, it would be tempting to set metrics.csv.rotation.keep_number=1, so you have at most two files per metric: the working file and one historic file. You need about 7 MiB, so rounding the rotation size up to 8.00 MiB leaves some room for the byte fluctuation, and even if a rotation has just happened and the working file is brand new, you still have one full 8 MiB historic file. That gives you the metric you want for the time you want, regardless of the size and retention of the other metrics that you don’t care about. With this configuration, and only metrics.bolt.messages.enabled=true, you would have 11 metrics * 8 MiB * (1+1) files = 176 MiB of disk space usage.
But you are likely interested in more than one of the 98 metrics Neo4J provides, for example neo4j.bolt.connections_running (~13 bytes per line) and neo4j.causal_clustering.core.message_processing_timer (~169 bytes per line).
If you want to keep at least 4 days worth of each metric, you need to base the calculations on the heaviest metric you are interested in, as Neo4J doesn’t allow setting an individual file size per metric. That way you will have at least those 4 days for the heaviest metric, and more for the rest.
Therefore, for those two metrics, we take the neo4j.causal_clustering.core.message_processing_timer metric and perform the calculations for it: 4 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute * 169 bytes per line / 3 seconds per line = 19468800 bytes, ~19 MiB. You will need at least 20 MiB of historic files per metric to have those 4 days for all your useful metrics, so, for example, you could configure Neo4J like this:
metrics.csv.rotation.keep_number=2
metrics.csv.rotation.size=10.00MiB
# 11 metrics
metrics.bolt.messages.enabled=true
# 20 metrics
metrics.neo4j.causal_clustering.enabled=true
With this configuration, with two categories enabled and two metrics we want to keep for up to 4 days at a 3-second interval, we would end up using (11+20)*(2+1)*10 = 930 MiB of disk. This can be improved, for example by setting metrics.csv.rotation.keep_number=4 and metrics.csv.rotation.size=5.00MiB.
So, with this configuration, we have (11+20)*(4+1)*5 = 775 MiB. We still have 20 MiB of historic files per metric, but the working file is now only up to 5.00 MiB instead of 10 MiB, so increasing the number of files while lowering the file size reduces the total disk usage.
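The trade-off between the number of historic files and the file size can be explored with a short Python sketch. The helper below is hypothetical and makes two assumptions: only the historic files count towards the guaranteed history (the working file may be nearly empty right after a rotation), and every file can grow to the full rotation size.

```python
MIB = 1024 * 1024


def evaluate_rotation(metric_count, keep_number, size_mib, needed_bytes):
    """Return (covers_retention, worst_case_total_mib) for one rotation setup."""
    # Guaranteed history: the historic files only, in bytes.
    guaranteed_bytes = keep_number * size_mib * MIB
    # Worst-case disk usage: historic files plus the working file, per metric.
    total_mib = metric_count * (keep_number + 1) * size_mib
    return guaranteed_bytes >= needed_bytes, total_mib


if __name__ == "__main__":
    needed = 19468800  # bytes for 4 days of the heaviest metric, from the text
    metrics_enabled = 11 + 20  # the two categories enabled in the text
    for keep_number, size_mib in [(2, 10), (4, 5), (10, 2)]:
        ok, total = evaluate_rotation(metrics_enabled, keep_number, size_mib, needed)
        print(f"keep_number={keep_number} size={size_mib}MiB covers={ok} total={total}MiB")
```

For a fixed amount of guaranteed history, the worst-case total shrinks as keep_number grows, because the extra working file becomes a smaller share of the whole.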
- Enable just the needed metric categories, as, by default, CSV metrics can use up to 8GB worth of disk space.
- Take into account that you cannot enable or disable specific metrics. You can only enable and disable categories, so all the metrics within a category will be enabled or disabled together.
- Make a decision about how long you want to keep the metrics.
- Calculate how much disk space you will need to keep your desired metrics for the time you need them, and configure Neo4J’s metrics parameters accordingly.