How to efficiently monitor Neo4j and identify queries that may be causing performance issues (Part 1)

Salvatore Furnari
Agile Lab Engineering
7 min read · Sep 20, 2024

Introduction

Monitoring is an essential element of every database, tool, or component. In this article, I will explain how I monitor the Neo4j database in a clustered environment using tools such as Dynatrace and Kibana, along with the best practices we follow. This is a two-part article: in the first part I discuss monitoring Neo4j with Dynatrace, and in the second part I show the practices I used to optimise badly performing queries.

Our beloved database

One of our customers utilizes an on-premise Neo4j three-node cluster. Each node is equipped with 256 GB of RAM and 32 cores. This particular database contains millions of nodes and is accessed by hundreds of microservices and dozens of solutions. For this reason, monitoring the resource usage of the cluster is crucial. This allows us to identify if a microservice is not performing well or if a solution is consuming excessive resources, which can lead to delays for other services.

In recent months, our efforts have been focused on leveraging existing resources to keep track of every query executed on Neo4j. Before delving into our solution, it is important to understand how memory is managed and structured in Neo4j.

How memory is managed in Neo4j

From a low-level perspective, Neo4j is implemented as a Java process, and therefore it runs on a Java Virtual Machine (JVM). Its memory is divided into two main areas:

  • On-Heap memory: This portion of Neo4j’s memory is where the runtime data resides: query execution state, transaction state, node and relationship objects, indexing data structures, and other data needed for query processing and transaction management. Properly managing the on-heap memory is important for ensuring efficient query execution and maintaining the overall performance of the Neo4j database.
  • Off-Heap memory: This can be divided into three categories: the page cache, transaction state (when it is configured to live off-heap), and other native memory such as network buffers. The largest portion of off-heap memory is typically allocated to the page cache, which holds all the cached graph data, including indexes. A configuration sketch of how these areas are sized follows below.
Figure 1: Memory allocation in Neo4j
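As a rough illustration of how these areas are sized, the snippet below shows the relevant neo4j.conf settings. The values are placeholders rather than recommendations, and the setting names are those of Neo4j 4.x (Neo4j 5 moves them under the server.memory.* namespace):

```
# neo4j.conf — illustrative values only; size these against your own workload.

# On-heap memory for query execution, transaction state and runtime objects.
# Keeping the heap at or below ~31g lets the JVM use compressed object pointers.
dbms.memory.heap.initial_size=31g
dbms.memory.heap.max_size=31g

# Off-heap page cache for the graph data and indexes kept in memory.
dbms.memory.pagecache.size=160g

# Whatever RAM is left stays with the OS, network buffers and other native
# memory, so heap + page cache should not consume the whole machine.
```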

In this article, we will delve deeper into the heap memory as its optimization plays a crucial role in improving the speed and efficiency of queries executed within the cluster.

The JVM heap is a dynamic memory area in which Neo4j stores instantiated Java objects. Memory allocation and management for these objects are handled automatically by a garbage collector, which takes care of deleting objects that are no longer referenced. The heap is divided into two main generations: the young generation and the old generation.

Newly created objects are allocated in the young generation. If these objects remain live (in use) for a sufficient period, they are subsequently moved to the old generation. The young generation benefits from a minor garbage collection process, known as a minor GC, which is responsible for reclaiming short-lived objects that are no longer in use. This helps maintain the performance and efficiency of the young generation.

On the other hand, the old generation contains objects that have survived multiple minor GC cycles and have demonstrated their longevity. The old generation undergoes a major garbage collection process, known as a major GC or full GC, which reclaims memory from long-lived objects that are no longer needed. When a generation fills up, the garbage collector performs a collection, during which all other threads in the process are paused, including those serving network connections. It is therefore critical that garbage collection pauses last as little as possible.

The young generation is quick to collect, since the pause time correlates with the live set of objects. In the old generation, pause times roughly correlate with the size of the heap. For this reason, the heap should ideally be sized and tuned in such a way that transaction and query state never makes it to the old generation.

When the old generation of the JVM heap becomes full, garbage collection can result in lengthy pause times, lasting minutes in some cases. During this time, the affected node is unavailable, and the overall cluster may experience performance degradation or disruptions. This can have a significant impact on the system’s availability and responsiveness. When an offline node comes back online, it needs to synchronize its databases with the other nodes in the cluster.

This synchronization process can be disk-intensive, as it involves transferring and reconciling data between the offline node and the rest of the cluster. During this synchronization, the node is unavailable for both read and write operations.

To reiterate, all the problems mentioned above are caused by long garbage collection times. To mitigate them, it is crucial to ensure that the old generation of the heap never fills up. By preventing this, we can minimize garbage collection pauses, reduce the time required for synchronization, and avoid interruptions caused by leader re-election.

Tuning heap memory

If the new generation is too small, short-lived objects may be moved to the old generation too soon. This is called premature promotion and will slow the database down by increasing the frequency of old generation garbage collection cycles (we want to avoid this).

If the new generation is too big, the garbage collector may decide that the old generation does not have enough space to fit all the objects it expects to promote from the new to the old generation. This turns new generation garbage collection cycles into old generation garbage collection cycles, again slowing the database down.

Correct heap sizing therefore depends heavily on the specific workload, and it is crucial to monitor the load of the cluster for days, or even weeks, before settling on a configuration.
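While tuning, it helps to make the garbage collector’s behaviour observable before changing anything. The sketch below is one way to do that in neo4j.conf, assuming Neo4j 4.x setting names (Neo4j 5 uses server.logs.gc.* and server.jvm.additional) and the default G1 collector; the pause-time target is only an illustrative starting point to validate against days of real load:

```
# Write a dedicated GC log so the frequency and duration of young and old
# generation collections can be inspected over time.
dbms.logs.gc.enabled=true
dbms.logs.gc.options=-Xlog:gc*,safepoint,age*=trace

# Example JVM flag passed through to the process: ask G1 to aim for shorter pauses.
dbms.jvm.additional=-XX:MaxGCPauseMillis=200
```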

And here’s where Dynatrace comes in

Dynatrace

Dynatrace is a widely used observability tool. It follows a client-server paradigm: data is collected by the OneAgent component installed on each machine and visualized in a centralized dashboard, offering insights into processes, resource usage, downtime, file descriptors and, most valuably for us, JVM metrics such as garbage collection times and long garbage collection events. This makes it an essential tool for monitoring and optimizing Neo4j clusters.
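Besides the dashboard, the same JVM metrics can be pulled programmatically through the Dynatrace Metrics API v2, for example to correlate them with our own logs. The sketch below is a minimal, hedged example: the environment URL, the token and in particular the metric selector are placeholders, so look up the exact metric keys available in your tenant (for instance via the /api/v2/metrics endpoint) before relying on it:

```python
import requests

DT_ENV = "https://{your-environment}.live.dynatrace.com"  # placeholder tenant URL
API_TOKEN = "dt0c01.XXXX"  # placeholder token with the metrics.read scope

response = requests.get(
    f"{DT_ENV}/api/v2/metrics/query",
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    params={
        # Assumed metric key for JVM GC suspension time; verify it in your tenant.
        "metricSelector": "builtin:tech.jvm.memory.gc.suspensionTime",
        "from": "now-2h",
        "resolution": "5m",
    },
    timeout=30,
)
response.raise_for_status()

for series in response.json().get("result", []):
    print(series["metricId"], series.get("data", []))
```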

In Figure 2, the summary for node n1 during a moment of medium/heavy load shows that Dynatrace (DT) detected a garbage collection process initiated by the node as the young generation memory started to fill up. It’s important to remember that objects surviving multiple garbage collection cycles are eventually moved to the old generation. However, in the depicted figure, it can be observed that the old generation filled up after the event. This resulted in a long garbage collection event that caused the unavailability of the node for several minutes.

Figure 2: Node with high load in DT dashboard

With this information in hand, it is now time to address and resolve these challenges.

Query monitoring

In our environment, as in many others, users and applications interact with the Neo4j database through microservices deployed on a Kubernetes cluster. These microservices are typically developed using Java and the Spring Boot framework, which provide a robust and scalable environment for building and deploying them.

Monitoring the performance and response time of each query in the microservices is a crucial aspect for optimizing the system. The approach of adding a unique string, such as a correlationId, as a comment in the query is a practical method to track and associate relevant information with each query execution.

By including the correlationId in the queries, you can easily identify and trace individual queries in the Neo4j logs. This allows you to retrieve valuable metrics such as response time, CPU time, user information, and most importantly, the amount of heap memory allocated for each query.
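As a minimal sketch of the idea (written here with the official Neo4j Python driver for brevity, whereas the real services are Java/Spring Boot; the connection details and the run_tracked helper are purely illustrative), the correlationId is simply prepended to the Cypher text as a comment, so it appears verbatim in query.log and in Kibana:

```python
from uuid import uuid4
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://neo4j-cluster:7687", auth=("neo4j", "secret"))

def run_tracked(cypher: str, **params):
    # Embed a unique correlationId as a Cypher comment: Neo4j logs the query
    # text as-is, so the id can later be searched in query.log or Kibana.
    correlation_id = str(uuid4())
    tracked_query = f"/* correlationId: {correlation_id} */ {cypher}"
    with driver.session() as session:
        records = [record.data() for record in session.run(tracked_query, **params)]
    return correlation_id, records

cid, people = run_tracked(
    "MATCH (p:Person) WHERE p.name = $name RETURN p LIMIT 10", name="Alice"
)
print(cid, people)
```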

I built a Python script that parses the Neo4j query logs, collects statistics and stores them in a JSON file sorted by memory allocation. This gives me a convenient way to identify the most memory-intensive queries on a specific day, so that the analysis can focus on those queries and on optimizing their resource usage. Since the Neo4j logs have a retention policy and, in our setup, do not include the parameters of a query, I use Kibana to retrieve this information and start the analysis of a given query.
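A condensed sketch of that script is shown below. The two regular expressions are assumptions about the query.log layout (the elapsed time in milliseconds and, with allocation logging enabled, the allocated bytes reported as "N B"), so they need to be adapted to the exact format produced by your Neo4j version and logging settings; the log directory is also just an example path:

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Assumed patterns for elapsed time / allocated bytes and for the correlationId
# comment added by the services; adapt them to your query.log format.
TIMING_ALLOC_RE = re.compile(r"(\d+) ms.*?(\d+) B")
CORRELATION_RE = re.compile(r"/\* correlationId: ([0-9a-f-]+) \*/")

def collect_stats(log_dir: str, output_file: str = "query_stats.json") -> None:
    stats = defaultdict(lambda: {"executions": 0, "total_ms": 0, "total_bytes": 0})

    for log_file in Path(log_dir).glob("query.log*"):
        for line in log_file.open(errors="replace"):
            timing = TIMING_ALLOC_RE.search(line)
            corr = CORRELATION_RE.search(line)
            if not (timing and corr):
                continue
            entry = stats[corr.group(1)]
            entry["executions"] += 1
            entry["total_ms"] += int(timing.group(1))
            entry["total_bytes"] += int(timing.group(2))

    # Rank by allocated heap so the most memory-hungry queries come first.
    ranked = sorted(stats.items(), key=lambda kv: kv[1]["total_bytes"], reverse=True)
    Path(output_file).write_text(json.dumps(dict(ranked), indent=2))

collect_stats("/var/lib/neo4j/logs")  # example path to the Neo4j logs directory
```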

Conclusions

Through the use of various tools and the identification of queries with poor performance, I initiated a process to refactor them. Refactoring queries to improve their performance can have a significant impact on the overall efficiency and response time of your system.

Even a ten percent reduction in query response time can result in substantial time savings, potentially shortening processes that used to take hours to a much shorter duration. Achieving a response time reduction of up to 70% for a specific query demonstrates the effectiveness of the refactoring process.

Figure 3: Machine overview in DT dashboard
