Efficient Neo4j Data Import Using Cypher-Scripts
Fast Export/Import Operations with apoc.export.cypher and cypher-shell
The idea is to allow a user to export all data, a subgraph, the result of a query, or a collection of paths into an importable cypher-script, which you can then import again by piping it to cypher-shell on the command line. This is very similar to using SQL-based import scripts.
cat script.cypher | bin/cypher-shell -u username -p password
Please note that for maximum performance you need to use cypher-shell version 1.1.9 or later, which gained client-side parameter parsing instead of sending the parameters to the server to be evaluated. You can install it as a standalone binary if your Neo4j server installation comes with an older version.
On another note:
If you are experimenting with imports that are failing, you can add the
--debug
command line parameter to see which statement was executed last and caused the failure. Also check the memory configuration of your Neo4j instance; you might want to increase the heap to 2–4GB with the dbms.memory.heap.max_size=2G
setting. And provide more memory to cypher-shell itself by prefixing the command with: JAVA_OPTS=-Xmx4G bin/cypher-shell …
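Putting those options together, a debug-friendly import run could look like this sketch (username, password, file name and heap sizes are placeholders, adjust them to your setup):
# give cypher-shell itself 4GB of client-side heap and print the last executed statement on failure
JAVA_OPTS=-Xmx4G bin/cypher-shell -u neo4j -p secret --debug < script.cypher
# on the server side, set dbms.memory.heap.max_size=2G (or more) in the Neo4j configuration and restart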
Neo4j’s APOC library contains a variety of procedures that make it easy to export and import data with Neo4j. In this article, we’ll focus on exporting cypher-script using the apoc.export.cypher.*
procedures, which changed significantly in the last APOC release in order to improve import performance with cypher-shell.
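For example, the same procedure family also lets you export just the result of a query. A sketch (the query and the file name are made up for illustration; check the APOC documentation of your version for the exact signature):
CALL apoc.export.cypher.query(
  "MATCH (f:Foo)-[r:KNOWS]->(b:Bar) RETURN f, r, b",
  "export/subgraph.cypher",
  {format:'cypher-shell'});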
Before the improvements
Before the improvements, the procedure behaved in the following way.
Starting from the following command:
CALL apoc.export.cypher.all({fileName},{config})
The cypher-script output of the operation was the following:
begin
CREATE (:Foo:`UNIQUE IMPORT LABEL` {name:"foo", `UNIQUE IMPORT ID`:0});
CREATE (:Bar {name:"bar", age:42});
CREATE (:Bar:`UNIQUE IMPORT LABEL` {age:12, `UNIQUE IMPORT ID`:2});
commit
begin
CREATE INDEX ON :Foo(name);
CREATE CONSTRAINT ON (node:Bar) ASSERT node.name IS UNIQUE;
CREATE CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`) ASSERT node.`UNIQUE IMPORT ID` IS UNIQUE;
commit
schema await
begin
MATCH (n1:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`:0}), (n2:Bar{name:"bar"})
CREATE (n1)-[:KNOWS]->(n2);
commit
begin
MATCH (n:`UNIQUE IMPORT LABEL`) WITH n LIMIT 20000 REMOVE n:`UNIQUE IMPORT LABEL` REMOVE n.`UNIQUE IMPORT ID`;
commit
begin
DROP CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`) ASSERT node.`UNIQUE IMPORT ID` IS UNIQUE;
commit
As you can see, there is one `CREATE` statement for each node:
CREATE (:Foo:`UNIQUE IMPORT LABEL` {name:"foo", `UNIQUE IMPORT ID`:0});
and one for each relationship:
MATCH (n1:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`:0}), (n2:Bar{name:"bar"})
CREATE (n1)-[:KNOWS]->(n2);
Those UNIQUE IMPORT LABEL
and UNIQUE IMPORT ID
are used when the node in question has no unique constraint defined on any of its labels. In that case we introduce an artificial label and ID so the nodes can be looked up later when creating the relationships. The import script creates a constraint for them upfront and removes both the constraint and the additional information at the end.
The statements are wrapped in batch-transactions of configurable size. It was the best we could do as a “generic” approach which works everywhere.
But it is not the most efficient way of creating such scripts:
- hard coding literal values instead of using parameters (despite auto-parameterization in Cypher)
- using a single statement per individual update
We only realized when testing this that the limiting factor for running these queries is the performance of the Cypher parser: it gets strained a lot by millions of statements, and the database spent more than 99% of its time just parsing them.
Other mistakes one could make in such an update script:
- generating large, complex statements (hundreds of lines)
- sending HUGE numbers (millions) of updates in a single transaction, which will cause out-of-memory issues
Better Approach
You want queries that are small enough, constant in their shape (so the execution plan can be cached), and that use parameters, so that even the string representation is exactly the same.
Each statement should be able to update anything from a single property to a whole set of nodes, but it has to stay the same in overall structure for caching.
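To make the contrast concrete, here is a minimal sketch (the Bar label and its properties mirror the export above; the parameter names are illustrative):
// literal values: every statement is a different string that has to be parsed and planned anew
CREATE (:Bar {name:"bar", age:42});
// parameterized: the statement text never changes, so its plan is compiled once and reused
CREATE (n:Bar {name: $name}) SET n += $properties;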
UNWIND to the Rescue
To achieve a more efficient statement structure, you can prefix your regular “single-entity-update-statement” with an UNWIND that turns a batch-list of data entries (up to 10k or 50k elements) into individual rows, each of which contains the information for one of the (more or less complex) updates.
Such large update lists only work well when you use a Neo4j driver and provide the parameters from your programming language. In cypher-shell those update-list parameters also have to be parsed, which is why we keep them smaller there (20 to 100 elements, but configurable).
Usually you send in a $batch
parameter of data as a list of maps, which is then applied in a compact statement that is properly compiled and cached, as it has a fixed structure.
General Idea
{batch: [{row1},{row2},{row3},...]}
UNWIND $batch as row
// now perform updates with the data in each "row" variable
From that perspective it’s very similar to a LOAD CSV statement.
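For example, a batched upsert of the Bar nodes from the example data could look like this sketch (the $batch parameter and the use of MERGE are illustrative; the generated export scripts use CREATE):
{batch: [{name:"bar", properties:{age:42}}, {name:"bar2", properties:{age:44}}]}
UNWIND $batch AS row
MERGE (n:Bar {name: row.name})
SET n += row.properties;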
If you want to go deeper into this kind of optimization, please read this article by Michael Hunger.
The New Behaviour
Starting from the last APOC release we introduced the new optimization as the default behaviour of apoc.export.cypher,
so if you call the following procedure:
CALL apoc.export.cypher.all({fileName},{config})
The export script will look something like this:
:BEGIN
CREATE INDEX ON :Bar(first_name,last_name);
CREATE INDEX ON :Foo(name);
CREATE CONSTRAINT ON (node:Bar) ASSERT node.name IS UNIQUE;
CREATE CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`)
ASSERT node.`UNIQUE IMPORT ID` IS UNIQUE;
:COMMIT
:SCHEMA AWAIT
:BEGIN
UNWIND [{_id:3, properties:{age:12}}] as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Bar;
UNWIND [{_id:2, properties:{age:12}}] as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Bar:Person;
UNWIND [{_id:0, properties:{born:date('2018-10-31'), name:"foo"}}, {_id:4, properties:{born:date('2017-09-29'), name:"foo2"}}] as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Foo;
UNWIND [{name:"bar", properties:{age:42}}, {name:"bar2", properties:{age:44}}] as row
CREATE (n:Bar{name: row.name}) SET n += row.properties;
UNWIND [{_id:6, properties:{age:99}}] as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties;
:COMMIT
:BEGIN
UNWIND [{start: {_id:0}, end: {name:"bar"}, properties:{since:2016}}, {start: {_id:4}, end: {name:"bar2"}, properties:{since:2015}}] as row
MATCH (start:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row.start._id})
MATCH (end:Bar{name: row.end.name})
CREATE (start)-[r:KNOWS]->(end) SET r += row.properties;
:COMMIT
:BEGIN
MATCH (n:`UNIQUE IMPORT LABEL`) WITH n LIMIT 20000 REMOVE n:`UNIQUE IMPORT LABEL` REMOVE n.`UNIQUE IMPORT ID`;
:COMMIT
:BEGIN
DROP CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`) ASSERT (node.`UNIQUE IMPORT ID`) IS UNIQUE;
:COMMIT
As you can see, the script has significantly changed.
For nodes it is turned into the discussed UNWIND structure:
UNWIND [
{_id:0, properties:{born:date('2018-10-31'), name:"foo"}},
{_id:4, properties:{born:date('2017-09-29'), name:"foo2"}}] as row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id})
SET n += row.properties SET n:Foo;
As well as for relationships:
UNWIND [
{start: {_id:0}, end: {name:"bar"}, properties:{since:2016}}, {start: {_id:4}, end: {name:"bar2"}, properties:{since:2015}}] as row
MATCH (start:`UNIQUE IMPORT LABEL`
{`UNIQUE IMPORT ID`: row.start._id})
MATCH (end:Bar {name: row.end.name})
CREATE (start)-[r:KNOWS]->(end) SET r += row.properties;
But this is not enough yet, because we are still hard-coding literal values instead of using parameters.
Cypher-Shell’s parameter support to the rescue!
In order to leverage query parametrization and speed up the import process (again), we can use the cypher-shell mode that allows using query parameters.
In order to export the data in a cypher-shell compatible format, we can execute the following procedure call:
CALL apoc.export.cypher.all({file},{format:'cypher-shell',
useOptimizations:{type:'unwind_batch_params',unwindBatchSize: 2}})
That produces the following output script:
:begin
CREATE INDEX ON :Bar(first_name,last_name);
CREATE INDEX ON :Foo(name);
CREATE CONSTRAINT ON (node:Bar) ASSERT (node.name) IS UNIQUE;
CREATE CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`)
ASSERT (node.`UNIQUE IMPORT ID`) IS UNIQUE;
:commit
CALL db.awaitIndex(':Foo(name)');
CALL db.awaitIndex(':Bar(first_name,last_name)');
CALL db.awaitIndex(':Bar(name)');
:param rows => [{_id:4, properties:{age:12}}, {_id:5, properties:{age:4}}]
:begin
UNWIND $rows AS row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Bar;
:commit
:param rows => [{_id:0, properties:{born:date('2018-10-31'), name:"foo"}}, {_id:1, properties:{born:date('2017-09-29'), name:"foo2"}}]
:begin
UNWIND $rows AS row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Foo;
:commit
:param rows => [{_id:2, properties:{born:date('2016-03-12'), name:"foo3"}}]
:begin
UNWIND $rows AS row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Foo;
:commit
:param rows => [{name:"bar", properties:{age:42}}, {name:"bar2", properties:{age:44}}]
:begin
UNWIND $rows AS row
CREATE (n:Bar{name: row.name}) SET n += row.properties;
:commit
:param rows => [{start: {_id:0}, end: {name:"bar"}, properties:{since:2016}}]
:begin
UNWIND $rows AS row
MATCH (start:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row.start._id})
MATCH (end:Bar{name: row.end.name})
CREATE (start)-[r:KNOWS]->(end) SET r += row.properties;
:commit
:begin
MATCH (n:`UNIQUE IMPORT LABEL`) WITH n LIMIT 2 REMOVE n:`UNIQUE IMPORT LABEL` REMOVE n.`UNIQUE IMPORT ID`;
:commit
:begin
MATCH (n:`UNIQUE IMPORT LABEL`) WITH n LIMIT 2 REMOVE n:`UNIQUE IMPORT LABEL` REMOVE n.`UNIQUE IMPORT ID`;
:commit
:begin
MATCH (n:`UNIQUE IMPORT LABEL`) WITH n LIMIT 2 REMOVE n:`UNIQUE IMPORT LABEL` REMOVE n.`UNIQUE IMPORT ID`;
:commit
:begin
DROP CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`) ASSERT (node.`UNIQUE IMPORT ID`) IS UNIQUE;
:commit
As you can see, in cypher-shell it’s possible to define a parameter in this way:
:param name => expression
so in our case:
:param rows => [{_id:4, properties:{age:12}}, {_id:5, properties:{age:4}}]
So we defined a parameter rows
that can be used in a Cypher query with the $
prefix, in the following way:
UNWIND $rows AS row
CREATE (n:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`: row._id}) SET n += row.properties SET n:Bar;
In previous versions, cypher-shell, much like Neo4j Browser, evaluated those expressions by sending them with a prefixed RETURN
over the wire to the server, to be evaluated by the server-side Cypher engine. That introduced a network roundtrip per parameter, prohibited their use inside a client-side transaction, and strained the Cypher parser even more. That’s why our first experiments with parameters were actually a bit slower than the ones without.
Thanks to the Cypher team, cypher-shell gained a client-side expression parser in version 1.1.9, so most of those drawbacks are now gone and the import speed improved a lot.
The Benchmarks
We used the following dataset (thanks to Alberto de Lazzari for providing it) for our benchmark.
The dataset is composed of:
- 4713605 nodes
- 4549134 relationships
The Benchmarks: Exporting the cypher-script
$ time ./bin/cypher-shell -u neo4j -p andrea "call apoc.export.cypher.all('import/exportDataCypherShellOld.cypher',{format:'cypher-shell', useOptimizations: {type: 'none'}, batchSize:100})"
real 1m46.790s
user 0m1.384s
sys 0m0.229s
$ time ./bin/cypher-shell -u neo4j -p andrea "call apoc.export.cypher.all('import/exportDataCypherShell.cypher',{format:'cypher-shell', useOptimizations: {type: 'unwind_batch', unwindBatchSize: 20}, batchSize:100})"
real 1m41.065s
user 0m1.349s
sys 0m0.214s
$ time ./bin/cypher-shell -u neo4j -p andrea "call apoc.export.cypher.all('import/exportDataCypherShellParams.cypher',{format:'cypher-shell', useOptimizations: {type: 'unwind_batch_params', unwindBatchSize:100}})"
real 1m33.585s
user 0m1.387s
sys 0m0.222s
As you can see, the export times are quite comparable, with the parameterized export being slightly more efficient. All of them complete in around 100 seconds.
The Benchmarks: Importing the generated cypher-script
$ time ./bin/cypher-shell -u neo4j -p andrea < 'import/exportDataCypherShellOld.cypher' > 'import/output.exportDataCypherShellOld.log'
real 252m33.279s
user 13m53.566s
sys 6m3.110s
$ time ./cypher-shell -u neo4j -p andrea < 'import/exportDataCypherShell.cypher' > 'import/output.exportDataCypherShell.log'
real 110m27.697s
user 2m48.452s
sys 0m55.657s
$ time ./bin/cypher-shell -u neo4j -p andrea < 'import/exportDataCypherShellParams.cypher' > 'import/output.exportDataCypherShellParams.log'
real 33m14.835s
user 26m12.681s
sys 0m46.270s
As you can see, the new UNWIND-powered scripts are more efficient to import; in particular:
- the simple UNWIND script is ~2.3x faster than the older export.
- the UNWIND with parametrization is ~7.6x faster than the older export and ~3.3x faster than the simple UNWIND script
So you can import ~4.7M nodes and ~4.5M relationships using plain cypher-scripts in 33 minutes instead of 4 hours 12 minutes. There is still room for improvement in terms of parser performance, but it is a great outcome of the work of the Cypher team on cypher-shell and of our optimizations.
APOC procedures provide many options for exporting/importing data with Neo4j. Today we showed how we significantly improved the import performance by using the combination of UNWIND with parametrization in cypher-shell.
If you run into any issues or have thoughts about improving our work, please raise a GitHub issue.