Cypher Query Optimisations

Published in

Neo4j Developer Blog

9 min readNov 28, 2018

Make your slow cypher queries running fast

In this post I will explain about common causes of query performance degradation in the Neo4j server. This is a follow up on the Meet the Query Log Analyzer story earlier this week.

Query log Analyzer App in the Neo4j Desktop

The Query Log Analyzer App helps you to find quickly the slow queries on the Neo4j server.

Important Things To Remember About Cypher Execution

Caching and Disk IO
The Neo4j database maps the data files from disk to the Page Cache. When you do a query and the data is not in Cache the data will be loaded from disk. The second time the query is executed the data can be read from Cache, which will result in a faster query. Any query that needs the same data will now benefit from having the data in cache.
If possible make sure that your Page Cache can contain the database.

Query Planning
When the Cypher engine receives a query string it compiles an execution plan for it. This execution plan is stored in the Query Cache and can be reused when the same query is fired again. The plan will be removed from that cache if the cache exceeds its limits or the data in the database has changed too much, so that estimates made during planning don’t hold true anymore. Using parameters instead of literal values will allow use of that cache, otherwise the query has to be re-parsed time and again.

First Query Execution
Running a query for the first time, will always be slower than running the query for the second time due to cache misses, and query planning as described above.

Checking the Query Plan
You can check queries by prefixing the query with the EXPLAIN or PROFILE keyword. When you find a slow query you can profile/explain a query by copying the query and the eventual parameters from the Query Log tab to the Neo4j Browser and use explain/profile to check how the query will be/is executed.

Query Checks

In the following sections we handle common causes which slow down the query performance. Note that there can be multiple causes which slow down the query performance, and even good queries can slow down because bad queries taking all the system resources.

Query Parameters

Are query parameters used for the queries where they should be? When you do not use query parameters every time a value changes the database will make a new plan for the query.

Example queries without parameters, each time a new plan is generated:

MATCH …. WHERE n.name = 'John' RETURN …
MATCH …. WHERE n.name = 'Ellis' RETURN …`

Example queries with parameters, the same plan is used again:

Parameters: { name : "John" }
MATCH …. WHERE n.name = $name RETURN …
Parameters: { name : "Ellis" }
MATCH …. WHERE n.name = $name RETURN …

Index Usage

When a query is unexpectedly slow it may be caused by a typo in the query which make the query planner decide to not use an Index. Use explain in the Neo4j browser to see if all the correct indexes are used in the query and change the query if needed.

Big Result Sets (and slow networks)

When your query is returning a lot of data, for instance megabytes of data, then the query will take a longer time to finish. This will be worse when you have also a slow network connection between the client and the server or the client is not consuming the results quickly enough.

These may be valid queries, however returning a lot of data in one query triggers me always to do a functional check on why you need these big result sets. Maybe a functional redesign is needed.

Note that when you load a lot of data to populate a cache in your application layer, you may do things ‘double’. The database is also caching the data.

Locking

The database places lock’s on nodes and relationships when you change Node values (lock on Node) or add and remove relationships (lock on start and end Node and Relationships) or change Relationship values (lock on Relationship). The Lock will be released when the transaction is committed or rolled back.

When at the same time multiple write queries are executed on the database and they are writing on the same Node/Relationship structure than the writes have to wait for the other writes to complete.

This is desired behaviour of a database but how can we optimise the writes to avoid unnecessary lock’s.

An example model can be as follows:

Assume that in your application you want to add 50 new persons to a specific Group. You will then fire 50 calls to the database to add each person to that specific group in the same time frame (In this example calls as are send parallel to the database each having their own transaction). The first call will place a lock, and while the first call is being executed the other calls have to wait until the lock is removed. Then you will get the following pattern (the green is executing the query, with a lock, blocking the other queries):

This can be avoided to send a list with 50 Persons in a parameter in one cypher statement with the use of UNWIND.

Example (parameters: personUuid list, groupUuid)

MATCH (g:Group { uuid: $groupUuid })
UNWIND $personUuidList as personUuid
MATCH (p:Person { uuid : personUuid })
MERGE (p)-[:IS_MEMBER]->(g)

This will be only one transaction on the server to process all the 50 members. Which is fast, and it has not to wait for locks.

With the Query Log Analyzer you can find this scenario by using the ‘Highlight’ function on a query. This will mark the lines in the query log file yellow for this query, and they tend to form ‘yellow’ blocks, with almost the same log time.

Query Load

Besides Locking, queries may also be slow because the query has to wait for cpu-resources for execution because the server has too many queries to process at the same time.

With the Query Log Analyzer you can use the ‘Highlight’ function and inspect the query log lines around the same log date time. Another option can be the Timeline which shows all the queries in the log file over time.

The amount of queries which can be handled sufficiently on a server is dependent on the single query execution time and the functional requirements. When you have a lot long running queries than the amount of queries per second the server can handle will be lower. Therefore it is very important that all the queries in the application are tuned to be as fast as possible, and that the server has enough capacity to handle the queries.

Tuning Queries

Queries may be slow because the cypher statement is not optimal. Therefore it is important to analyse the query log when you develop your application. Then in an early state you can find inefficient queries. The goal of query tuning is to get the lowest possible amount of db hits. In the following non-exhaustive list I mention some tips to improve your queries:

Errors in variable names
When you have an error in the variable names in the cypher statement, you can have unexpected results. If there is in the query a reference to a variable which is not initialised like ‘…MATCH (a)-[:’ or ‘…MERGE (a)-[:’ than this may result in a full node scan, which can slow down your query a lot. Use explain to check the query plan for your query.

Missing colon ‘:’ for a Label or Relationship Type
This is easy to check, if you miss the colon then the label or relationship type will be seen as a variable name, which results in full node scans etc. as described in the previous section, e.g. MATCH (Person) ... instead of MATCH (:Person) ... .

Try to reduce the query working set as soon as possible
There is a lot to say about this depending on the query. In general you can ask yourself the following things:

Can I move a distinct to an earlier point in the query?
Can I move a limit to an earlier point in the query?
Can I use collect on places in the query to reduce the amount of rows to be processed during execution of the query?
Do I use order by on the right place in the query?

Multiple UNWIND statements
Multiple UNWIND statements after each other will lead to cartesian products of the contents of the arrays you are unwinding. This is shown in the cypher statement below.

Multiple OPTIONAL MATCH statements
OPTIONAL MATCH is a power full possibility in Cypher, however it should be handled with care. When there are multiple OPTIONAL MATCH statements in one query then there may be a cartesian product which gives the database a lot of work. Assume we have the following model and cypher statement:

MATCH (a)
OPTIONAL MATCH (a)->(b)->(c )
OPTIONAL MATCH (a)->(d)->(e)
OPTIONAL MATCH (a)->(f)
...

Each OPTIONAL MATCH results in a stream of paths. So when for one (a) the (a)->(b)->(c ) path produced 10 entries, the (a)->(d)->(e) path produces 10 entries and the (a)->(f) path produces 10 entries this will result in 10 * 10 * 10 = 1000 intermediate rows per (a) node.

Depending on the requirements there are several ways to handle this properly:

Use WITH and COLLECT and DISTINCT to reduce the intermediate results

MATCH (a)
 OPTIONAL MATCH (a)->(b)->(c )
 WITH a, collect(DISTINCT b) as bb, collect(DISTINCT c) as cc
 OPTIONAL MATCH (a)->(d)->(e)
 WITH a, bb, cc, 
 collect(DISTINCT d) as dd, 
 collect(DISTINCT e) as ee
 ...

Use Pattern Comprehension
When a nested structure needs to be returned then you should use Pattern Comprehension (3.2.11.3) where you can ‘execute’ cypher patterns to build up the a tree after the RETURN statement. This is very fast:

MATCH (a)
 RETURN 
     { a:a,
       blist : [ (a)-->(b) | {b:b, clist : [(b)-->(c) | c ]],
       dlist : [ (a)-->(d) | {d:d, elist : [(d)-->(e) | e ]],
       flist : [ (a)-->(f) | f]}

Instead of returning the full nodes, you can also use map-projections to only return a subset of properties:
RETURN a { .name, b: b { .*, c: [(b)-->(c) | c ]}}

DISTINCT and Returning tree structures
When you return properties or create a tree structure with Pattern Comprehension then you should not use that in combination with DISTINCT or aggregation like this:

MATCH (a:Label)
...
RETURN DISTINCT a { and build up your nested tree structure with Pattern Comprehension and Map Projections}

It is much more efficient for the database to do the DISTINCT operation on a node or a simple variable before building the JSON structure:

MATCH (a:Label)
...
WITH DISTINCT a
RETURN a { ... }

Links

If you have questions regarding the query performance, you can always head to the #help-cypher channel on the Neo4j Users Slack or on the neo4j community.