Exploring Methods of Cypher Query Optimisations

Boosting Performance for Knowledge Graphs with Neo4j APOC Library

Research Graph
8 min read · Mar 19, 2024
Image created in https://www.canva.com/


Introduction

A knowledge graph (graph database) captures information about the main entities in a domain and the relationships between them. It can act as an augmented feature store for connected data, giving the means to compute, access, and operationalise structural features. Unlike traditional databases that use tabular structures, knowledge graphs organise data into nodes and edges, creating a flexible model. Neo4j is a graph database management system that embraces this graph-based model, organising data as nodes, properties, and relationships. Cypher is Neo4j’s declarative query language for extracting useful information from the knowledge graph. It is like SQL for graphs, and was inspired by SQL, so it allows you to focus on what data you want out of the graph rather than how to retrieve it.

Neo4j graph database. image from Neo4j website.

Query optimisation refers to the process of choosing the most efficient execution strategy for a database query in order to improve its performance. Simply put, it involves finding the most efficient way to retrieve data from a database. The optimisation process includes analysing the query, identifying the most efficient execution plan, and executing that plan to retrieve the data, thereby maximising database performance.

Query optimisation is important because it has a significant impact on database applications, especially when the database contains many nodes and relationships. A poorly written query can take a long time to return data, which leads to poor application performance and a bad user experience.

This article explores various ways in which queries can be optimised to run effectively on the database. It focuses in particular on APOC methods and delves into a relevant use case.

Background

The lifecycle of a Cypher query

Cypher Query lifecycle. Image from Neo4j,2024 https://neo4j.com/docs/cypher-manual/current/planning-and-tuning/execution-plans/

A Cypher query starts as a declarative query string that describes the graph pattern to match in the database. The query string goes through parsing and rewriting. During parsing, the query is broken down into its constituent parts and validated syntactically. Rewriting applies optimisations or transformations to the query for efficiency or correctness. The parsed and rewritten query then passes through the query optimiser (planner), which produces an optimal logical plan given the current database state. Finally, the logical plan is turned into a physical plan, which specifies the actual operations and steps the database will execute to retrieve the results.

Three things that help in understanding Cypher execution:

  1. The Neo4j database maps data files from disk into the page cache. When you execute a query and the data is not in the cache, it is loaded from disk. Queries served from the page cache are much faster than queries that must read from disk.
  2. When Cypher receives a query, it compiles an execution plan that is stored in the query cache and reused when the same query is executed again. However, if the cache exceeds its limit, or if the data in the database has changed significantly, the plan is evicted from the cache.
  3. A query run for the first time will therefore be slower than subsequent runs.

Four factors that may slow down query performance:

  1. Big result sets: When a query returns a lot of data, it takes longer to finish.
  2. Locking: The database places locks on nodes and relationships when changing their values or connections. When several queries write to the same node or relationship, each must wait for the other writes to complete.
  3. Query load: Queries may also be slow when the server has too many queries to process at the same time, because they must wait for CPU resources.
  4. Tuning queries: Queries may be slow because the Cypher statement is not optimal. The goal of query tuning is to reach the lowest possible number of db hits.
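As a hedged illustration of the tuning point above (the `Person` label, `id` property, and index name here are assumptions for the example, not from the original text), anchoring a MATCH to a label and an indexed property lets the planner avoid scanning every node, which directly lowers db hits:

```cypher
// Untuned: no label, so the planner falls back to AllNodesScan and
// filters every node in the database — many db hits on a large graph.
MATCH (p) WHERE p.id = 42 RETURN p;

// Tuned: with a label and an index, the planner can use NodeIndexSeek
// and touch only the matching node.
CREATE INDEX person_id IF NOT EXISTS FOR (p:Person) ON (p.id);
MATCH (p:Person) WHERE p.id = 42 RETURN p;
```

Running both variants under PROFILE makes the difference in db hits visible in the execution plan.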

Ways to optimise the query

  1. Using parameters: Constructing a fresh Cypher query string for every request leads to suboptimal performance; in our case, the Neo4j desktop app crashed after around 1,500 such requests. Every new query sent to the Neo4j database must be transformed into an execution plan, but Neo4j caches the last 1,000 queries for reuse, so the optimisation step can be skipped for a cached query. The solution is to write one parameterised query and reuse it for multiple requests. Instead of a Python for loop issuing one query per item, this method uses Cypher’s UNWIND and a Python f-string with Cypher parameters. The new method does not require emptying or restarting the database: only the first execution needs to be optimised, and after that the cached plan is reused. This is a significant improvement; constructing a new query per request proved inefficient and problematic, while the backend function combined with the new method successfully processed more than 5,000 requests that previously crashed with the old method.
  2. Avoid errors in variable names: If a Cypher query references a variable that has not been properly initialised or defined, it can lead to unexpected behaviour, such as scanning the entire node space in search of matches. This can severely impact query performance, especially in large databases.
  3. Reduce the query’s working set as early as possible: The main idea is to use queries that are small and constant in shape, so that they can be cached; each such query can then update anything from a single property to a whole subgraph. Some of these methods are described here:
  • One method is UNWIND, which turns a batch of data into individual rows. We can use it to create nodes with properties, merge nodes with properties, and look up relationships. For lookups, we can look up by id using a map format where the keys are node or relationship ids; this is more compact and faster for id lookups.
  • We can also use FOREACH, which iterates over all items in a list and executes updates for each of them. This helps create data dynamically based on inputs.
  • Another method is to use the APOC procedures. The APOC procedure library has many useful applications, including creating nodes or relationships with dynamic labels and properties, batching transactions or iterating over updates, and functions for creating and manipulating maps to be set as properties.
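Combining parameters (item 1) with UNWIND (the first bullet above), a single cached query can process a whole batch per request. The parameter name `rows` and the `Person` properties are illustrative assumptions, not the author’s exact query:

```cypher
// One parameterised query, sent repeatedly with different $rows payloads.
// The execution plan is compiled once, cached, and reused on every call.
UNWIND $rows AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name;
```

From a driver, `$rows` would be passed as a list of maps, e.g. `[{id: 1, name: "Ann"}, {id: 2, name: "Bob"}]`, so the query text never changes between requests.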
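A sketch of the FOREACH approach from the second bullet, with an assumed `KNOWS` relationship type and `visited` property: FOREACH runs its mutating clause once per list element inside a single query.

```cypher
// Mark every node along a matched path. FOREACH executes the SET clause
// once for each element of nodes(path) within the same transaction.
MATCH path = (a:Person {id: 1})-[:KNOWS*..3]->(b:Person)
FOREACH (n IN nodes(path) | SET n.visited = true);
```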

Use case of using APOC

Introduction to APOC

The APOC library consists of about 450 procedures and is widely used in areas such as data integration, graph algorithms, and data conversion.

For APOC library installation: APOC Installation

An Introduction Guide to: APOC User Guide

There are some built-in procedures in APOC:

APOC Built-in procedures. Source APOC documentation.

I will focus on the built-in periodic procedures of APOC:

APOC Built in Periodic Procedures. Source APOC documentation.

Efficiently Updating and Inserting Data With apoc.periodic.iterate

Understanding the Cypher query (creating 100,000 Person nodes)

Cypher query to create Person:

CALL apoc.periodic.iterate(
  "UNWIND range(1, 100000) AS id RETURN id",
  "CREATE (p:Person {id: id})",
  {batchSize: 10000, iterateList: true, parallel: true})

The first statement produces the input rows that the second statement works on; the second statement does the actual work, such as updating, changing, or removing things from the graph. `batchSize` sets the number of rows processed per transaction, `iterateList` indicates whether each batch is passed to the inner statement as a single list (processed in one Cypher statement) rather than row by row, and `parallel` controls whether batches are executed in parallel.
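For comparison, the non-APOC version implied by the benchmark below runs the same UNWIND in one large transaction. This reconstruction is an assumption rather than the author’s exact query:

```cypher
// Single-transaction version: all 100,000 CREATEs commit at once,
// holding more memory and locks than the batched APOC variant.
UNWIND range(1, 100000) AS id
CREATE (p:Person {id: id});
```

With `apoc.periodic.iterate` the same work is split into ten transactions of 10,000 rows each, which keeps individual transactions small.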

To analyse the query, I will use PROFILE, which executes the query and returns the results together with the execution plan. This tracks how many rows pass through each operator and how much each operator interacts with the storage layer to retrieve the necessary data. To use it, simply prepend the keyword PROFILE to the query.
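For example, profiling a simple count over the nodes created above might look like this (the query itself is illustrative):

```cypher
// Prepending PROFILE runs the query and attaches per-operator
// statistics (rows, db hits) to the returned execution plan.
PROFILE
MATCH (p:Person)
RETURN count(p);
```

The resulting plan lists each operator with its Estimated Rows and DB Hits columns, which are the figures compared in the screenshots below.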

Comparing the execution time of the original method and the APOC procedure:

The execution time with APOC is lower than with the old methods.

Cypher query result with unwind.
Cypher query result with APOC and Unwind.
Execution plan
Cypher query result with APOC and match.
Cypher query result with match.
Execution plan.

The Estimated Rows column details the number of rows each operator is expected to produce. These estimates are calculated from graph statistics and are important to the query planner when formulating and comparing plans. Db hits represent the work the storage layer does to fetch data from the Neo4j database, and db hits are the main metric considered here. Execution time matters too, but it is largely a result of environmental factors such as bandwidth, memory, and traffic volume. From these examples, we can see that the APOC method is better because it has fewer db hits.

Conclusion

In conclusion, optimising Cypher queries is crucial for improving database performance, especially in scenarios with large datasets. Through careful analysis and implementation of optimisation techniques, we can enhance the efficiency of database applications.
