Automate Neo4j Database Construction with Python

Tutorial detailing how to import data and create your graph database via a Python script.

Madison Gipson
Dec 14, 2020 · 4 min read

A recent project of mine has involved creating a massive (upwards of a million nodes and relationships) graph database in Neo4j. When there’s that much to process, having an automated script comes in clutch. This article outlines some tips I have for anyone in a similar situation.

Code is linked throughout, and please comment with any additional tips!

Prerequisites

  1. Python needs to be installed (preferably Python 3; recent py2neo releases no longer support Python 2)
  2. Neo4j database should be created, with any plugins installed and settings configured
    (I use the APOC and Graph Data Science Library (GDSL) plugins, and increase the max heap size to 3GB; the heap change is shown after this list)
  3. Cypher queries that will build the database as you’d like
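
For reference, the heap increase from step 2 is a one-line edit to the database's neo4j.conf file (the exact path varies by install):

dbms.memory.heap.max_size=3G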

Setup: Python Directory & Virtual Environment

  1. Navigate to the graph’s directory. The directory on my Mac looks like /Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-asdfjkl-1234-etc/installation-4.1.1/. I pin this to my sidebar for easy access.
  2. Create a python directory within that installation directory.
  3. Optional: inside the python directory, create a virtual environment. I really like working with a venv because I can easily share a requirements.txt of all the required packages with teammates or carry it over to other projects. This resource helped me successfully set up virtual environments on both my Mac and PC.
  4. Once your virtual environment is created and activated, you can install the py2neo package via pip, as shown after this list.
  5. If your Cypher queries import data from any files, go ahead and place those files in the import directory (should be within the same installation directory).
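
For steps 3 and 4, a minimal setup sketch on macOS/Linux (on Windows, activate with venv\Scripts\activate instead):

python3 -m venv venv
source venv/bin/activate
pip install py2neo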

Coding Time

Note: the full code for this project is in the following repository, so you can reference the context of the gists below or check out the data I was importing.

Script Skeleton

Our main Python script that imports the data or builds the nodes and relationships will live inside that handy Python folder and virtual environment.

This code block does the bare minimum:
1. Import py2neo
2. Establish a database connection
3. Run a single query
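
A minimal sketch of that skeleton; the URI, username, and password below are placeholders for your own connection details:

from py2neo import Graph  # 1. import py2neo

# 2. establish a database connection (placeholder URI and credentials)
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# 3. run a single query and fetch the results as a list of dicts
result = graph.run("MATCH (n) RETURN count(n) AS node_count").data()
print(result)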

With py2neo, you can run Cypher queries as-is or use py2neo's functions to perform similar actions. My personal preference is running Cypher queries as-is, because I can simply copy and paste the code to and from the Neo4j browser.

If you want something super simple, you could repeat the graph.run() call with your different queries and it would get the job done. However, there are some things I discovered when developing my own script that helped tremendously; keep reading if you want to see if these will be useful to you.

Batching & Auto-Committing

I leveraged periodic commits (aka batching) because I kept running into memory issues with my heavier-lifting queries.

If you do run in batches, you'll want to use an open transaction (accomplished with the begin(), run(), commit() sequence in the code below) instead of a closed transaction (a standalone run() call).
This is because a closed transaction auto-commits: if you combine batching with a closed transaction, it will execute and commit your first batch but never continue on to the remaining batches. Definitely not what we want.

# Without Periodic Commits
graph.run("""LOAD CSV WITH HEADERS
FROM 'file:///womeninstem.csv' AS line
MERGE (person:Person { name: line.personLabel })
ON CREATE SET person.birth = line.birthdate, person.death = line.deathdate""")

# With Periodic Commits
query_list = []
query_list.append("""CALL apoc.periodic.iterate("
LOAD CSV WITH HEADERS
FROM 'file:///womeninstem.csv' AS line
RETURN line
","
MERGE (person:Person { name: line.personLabel })
ON CREATE SET person.birth = line.birthdate, person.death = line.deathdate", {batchSize:1000, parallel:false, retries: 10}) YIELD operations""")

# would have more queries in query_list
for query in query_list:
    g = graph.begin()  # open transaction
    result = g.run(query).to_data_frame()  # execute query
    g.commit()  # close transaction

Query Failure Alerts

The thing about running a Cypher query through Python code instead of in the browser is that it doesn't always tell you when a query fails; it will alert you of syntax errors, but that's about it.

For that reason, it’s important to have something implemented that alerts you if a query doesn’t finish completely.

query_list = []
query_list.append("""CALL apoc.periodic.iterate("
LOAD CSV WITH HEADERS
FROM 'file:///womeninstem.csv' AS line
RETURN line
","
MERGE (person:Person { name: line.personLabel })
ON CREATE SET person.birth = line.birthdate, person.death = line.deathdate", {batchSize:1000, parallel:false, retries: 10}) YIELD operations""")

# would have more queries in query_list
for query in query_list:
    g = graph.begin()  # open transaction
    result = g.run(query).to_data_frame()  # execute query
    try:
        if result['operations'][0]['failed'] > 0:  # at least one batch failed
            print('Could not finish query')
        else:  # every batch succeeded
            g.commit()  # close transaction
            print('Completed query')
    except Exception:  # no 'operations' column, i.e. a non-batched query that ran without error
        g.commit()  # close transaction
        print('Completed query')

Parameterizing Queries

I wrote a short separate article on this. Long story short: you tack a simple .format() substitution onto the end of the Cypher query.
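
A minimal sketch of that substitution (the values here are illustrative). One gotcha: any literal curly braces in the Cypher must be doubled so .format() doesn't treat them as placeholders:

# literal Cypher braces are doubled ({{ }}) so .format() leaves them alone
query = """
MERGE (person:Person {{ name: '{name}' }})
ON CREATE SET person.birth = '{birth}'
""".format(name='Ada Lovelace', birth='1815-12-10')

graph.run(query)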

There are a few other tricks I have up my sleeve for constructing a Neo4j database (parallelizing queries, a simple GUI to display script progress, etc.), but they deserve their own articles, which I haven't written yet. I plan to link them here once I do!

If you’d like to check out the larger project from which this was pulled, see this repository.
