Methods for Loading Data into a Remote Neo4j Instance — Part 1

Andrew Jefferson
May 26 · 5 min read

There are quite a few ways to import your own data into a Neo4j instance, but they require that either:

  • you have the source data on the same filesystem as the Neo4j instance
  • or the data can be loaded from a url accessible to the Neo4j instance

I’ll provide some solutions to deal with the situation where you can’t put the source data on the same filesystem as the Neo4j instance and it’s not already located on a url that the Neo4j instance can access.

There are plenty of reasons why you might not be able to load your source data onto the same filesystem as the Neo4j instance. For example, your database administrator might not let you access the server’s filesystem, or you might be using a managed Neo4j service such as Neo4j Cloud or Neo4j Sandbox, where you cannot access the instance’s filesystem.

The three solutions I will cover are:

Part 1: (This article) Making a local directory accessible over the internet using a tunneling service — in this case ngrok.io (quick and free but insecure!)

Part 2: Use cloud storage to securely load data into Neo4j. Loading the files from your local machine into a cloud storage service and using secure time-limited signed urls to access them (slower but secure)

Part 3: (Upcoming article) Import data into a local Neo4j instance and use export-to-cypher to upload the resulting graph to a remote Neo4j instance.

To demonstrate the approach I will use the Cypher LOAD CSV command to load data from a csv file. The approach will work just fine with other Neo4j commands such as the apoc.load family of commands.

If you want to follow the example but don’t have a remote Neo4j instance handy you can create a Neo4j Sandbox Instance (choose a “Blank Sandbox”) and use the Neo4j Browser to run Cypher queries.

Neo4j Blank Sandbox

I have a csv file containing the numbers between 1 and 1,000,000 along with each number written out as words in English. It has a header row and two columns, value and text.

If you’re following along at home you can download the mini csv above which just has the numbers 1 to 10 or you can generate the full csv for yourself using this Python script.
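In case the script is not to hand, here is a minimal sketch of how such a file could be generated. The column names value and text are taken from the Cypher statement used in this article; the function names and the exact wording style (hyphenated tens, no "and") are my own assumptions and may differ from the original file.

```python
import csv

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety",
]


def number_to_words(n):
    """Spell out an integer between 0 and 1,000,000 in English."""
    if not 0 <= n <= 1_000_000:
        raise ValueError("n must be between 0 and 1,000,000")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    if n < 1_000_000:
        thousands, rest = divmod(n, 1000)
        return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")
    return "one million"


def write_numbers_csv(path, upper=1_000_000):
    """Write a csv with a header row and one row per number from 1 to upper."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["value", "text"])  # header names assumed from the Cypher statement
        for n in range(1, upper + 1):
            writer.writerow([n, number_to_words(n)])
```

Calling write_numbers_csv("numbers.csv") would produce the full file, and passing upper=10 gives a mini version with just the numbers 1 to 10.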

I want to load this data from csv into Neo4j, creating a node for each number (i.e. each row in the csv). The Cypher command I will use to load the data is:

LOAD CSV WITH HEADERS FROM "<url>" AS row
CREATE (:Number {value: toInteger(row.value), text: row.text});

Pretty straightforward: all I need to do is create a url that allows the remote Neo4j instance to access my csv file.

Publishing a local directory to the world wide web using ngrok and Python

How we will read local csv files from a remote neo4j server

Using ngrok and Python will allow us to get a public url (something like https://abs123.ngrok.io/mydata.csv) that we can use to read the csv file from our local computer — without needing to upload the file to online storage.

This method is insecure and publishes your data so that it is accessible to anyone on the internet for the duration of the loading process. Because of that, this method must only be used with non-sensitive data or data that is already publicly available.

You will need:

  1. The ngrok cli tool, which you can download here (you don’t need to create an ngrok account)
  2. Python (2 or 3)
  3. All the source data you want to use in a single directory (everything inside that directory will be published online)

TL;DR: if you just want a magic script to do it, check out this gist: https://gist.github.com/eastlondoner/615acdaad2167f5e0a166f23bebbafd4

Start a shell and navigate to the directory that contains the data. Then serve the contents of the directory on port 8080 using Python’s builtin http server:

cd data
# Try using python3 first, if that fails try python2
python3 -m http.server 8080 || python2 -m SimpleHTTPServer 8080
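If you would rather start the server from a Python script than from the shell, the same idea can be sketched as below. This is a minimal sketch assuming Python 3.7 or later; the serve_directory name is my own, and unlike the command above it binds only to 127.0.0.1, which is enough for a tunnel running on the same machine.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8080):
    """Serve `directory` over HTTP from a background thread and return the server."""
    # The `directory` argument of SimpleHTTPRequestHandler needs Python 3.7+.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to localhost only; the tunnel connects to this port locally.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Something like server = serve_directory("data") would start serving, and server.shutdown() followed by server.server_close() stops it again. The thread is a daemon, so the script has to stay alive for as long as the data is being loaded.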

Now you should be able to open up http://localhost:8080/ and see the directory that’s being hosted.

http://localhost:8080/

Double check that you’re happy to publish everything that’s in there: we’re about to put it all on the internet.
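For a directory with more than a handful of files, a quick listing can make that check easier. Here is a small sketch (the list_published_files name is hypothetical) that walks the directory and reports every file the http server would expose, including hidden files and anything in subdirectories.

```python
import os


def list_published_files(directory):
    """Return sorted (relative path, size in bytes) pairs for every file under `directory`."""
    published = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full_path = os.path.join(root, name)
            relative_path = os.path.relpath(full_path, directory)
            published.append((relative_path, os.path.getsize(full_path)))
    return sorted(published)
```

Running it from inside the data directory with list_published_files(".") and printing the result shows exactly what is about to go online.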

Leave that shell running the http server and start a new shell. Use ngrok to tunnel your local port 8080 to a public url:

# Set the ngrok region to use - your data will be tunneled through here so it's important to pick one close to you.
# us - USA, eu - Europe, ap - Asia/Pacific, au - Australia
REGION="us"
ngrok http -region="${REGION}" -inspect=false 8080

The ngrok command line UI should fire up and show the public URL for your tunnel:

ngrok command line UI

The address that looks like https://<random id>.ngrok.io or https://<random id>.<region>.ngrok.io is the public address for your tunnel.

You’ll get a new random address each time you run ngrok. Copy the public address of your tunnel to the clipboard; we will need it next.
n.b. You can control the address you receive if you sign up for an account at ngrok.com.

If you navigate to the public address you should see that your local directory server is now available on the public url.

Your data is out there for everyone to see!
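Before pointing Neo4j at the tunnel, it can be worth checking from Python that the csv really is reachable and has the header row you expect. The sketch below uses only the standard library; peek_csv is a name I made up, and the url you pass in is whatever public address ngrok gave you plus the file name.

```python
import urllib.request


def peek_csv(url, max_lines=3):
    """Return the first `max_lines` lines of the text file served at `url`."""
    lines = []
    with urllib.request.urlopen(url, timeout=30) as response:
        # The response object can be iterated line by line, so only the
        # start of a large file needs to be read.
        for raw_line in response:
            if len(lines) >= max_lines:
                break
            lines.append(raw_line.decode("utf-8").rstrip("\r\n"))
    return lines
```

For the numbers file, the first entry of peek_csv("https://<your tunnel url>/numbers.csv") should be the header line value,text.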

Now that our csv files are published online, we can read them from our remote Neo4j instance. Just take the public address of your tunnel and plug it into the Cypher statement:

LOAD CSV WITH HEADERS FROM "https://<your tunnel url>/numbers.csv" AS row
CREATE (:Number {value: toInteger(row.value), text: row.text});

Added 1000000 labels, created 1000000 nodes, set 2000000 properties

Success!
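If you have several files to load, plugging the tunnel address into the statement by hand gets tedious. The step above can be sketched as a small helper; build_load_csv_statement is a hypothetical name, and the node label and property names are just the ones from this example.

```python
from urllib.parse import quote


def build_load_csv_statement(tunnel_url, filename):
    """Build the LOAD CSV statement for a file served from the tunnel."""
    # Percent-encode the file name so that spaces and similar characters
    # still give a valid url.
    csv_url = tunnel_url.rstrip("/") + "/" + quote(filename)
    # toInteger is the current name of the integer conversion function in Cypher.
    return (
        'LOAD CSV WITH HEADERS FROM "' + csv_url + '" AS row\n'
        "CREATE (:Number {value: toInteger(row.value), text: row.text});"
    )
```

The returned string can be pasted into the Neo4j Browser, or passed to whichever Neo4j client you normally use.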

Don’t forget to terminate (Ctrl + c) both the Python http server and the ngrok processes when you’re done.

Next up: Part 2

Find out How to use cloud storage to securely load data into Neo4j

Thanks to David Allen and David Mack.