Published in Analytics Vidhya
Automate Neo4j Database Construction with Python

Tutorial detailing how to import data and create your graph database via a Python script.

Photo by Christopher Gower on Unsplash

A recent project of mine has involved creating a massive (upwards of a million nodes and relationships) graph database in Neo4j. When there’s that much to process, having an automated script comes in clutch. This article outlines some tips I have for anyone in a similar situation.

Code is linked throughout, and please comment any additional tips!

Pre-Requisites

  1. Python needs to be installed (Python 3 is recommended; recent py2neo releases no longer support Python 2.7)
  2. Neo4j database should be created, with any plugins or settings you need in place
    (I use the APOC and Graph Data Science (GDS) plugins, and increase the max heap size to 3GB)
  3. Cypher queries that will build the database the way you'd like

Setup: Python Directory & Virtual Environment

  1. Navigate to the graph’s directory. The directory on my Mac looks like /Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-asdfjkl-1234-etc/installation-4.1.1/. I pin this to my sidebar for easy access.
  2. Create a python directory within that installation directory.
  3. Optional: inside the python directory, create a virtual environment. I really like working with a venv because I can easily share the requirements.txt, with all of the required packages, with teammates or transfer it to other projects. This resource helped me successfully set up virtual environments on both my Mac and PC.
  4. Once your virtual environment is created, you can install the py2neo package via pip.
  5. If your Cypher queries import data from any files, go ahead and place those files in the import directory (should be within the same installation directory).

Coding Time

Note: the full code for this project is in the following repository, so you can reference the context of the gists below or check out the data I was importing.

Script Skeleton

Our main Python script that imports the data or builds the nodes and relationships will live inside that handy Python folder and virtual environment.

This code block does the bare minimum:
1. Import py2neo
2. Establish a database connection
3. Run a single query
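The original gist covers these steps; as a rough sketch, assuming a local Neo4j instance (the URI, user, and password below are placeholders for your own setup):

```python
def connect(uri="bolt://localhost:7687", user="neo4j", password="password"):
    """Establish a database connection (all arguments here are placeholders)."""
    from py2neo import Graph  # pip install py2neo
    return Graph(uri, auth=(user, password))

def run_single_query(graph, query):
    """Run one Cypher query and return its records as a list of dicts."""
    return graph.run(query).data()

# usage, assuming a running Neo4j instance:
# graph = connect(password="your-password")
# print(run_single_query(graph, "MATCH (n) RETURN count(n) AS nodes"))
```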

With Py2Neo, you can run Cypher queries as-is or use the Py2Neo functions to perform similar actions. My personal preference is running Cypher queries as-is because I can simply copy and paste the code to and from the Neo4j browser.

If you want something super simple, you could repeat the graph.run() call with your different queries and it would get the job done. However, there are some things I discovered when developing my own script that helped tremendously; keep reading if you want to see if these will be useful to you.

Batching & Auto-Committing

I leveraged periodic commits (aka batching) because I kept running into memory issues with my heavier-lifting queries.

If you do run in batches, you'll want to use an open transaction (the begin(), run(), commit() sequence in the code below) instead of a closed transaction (a standalone run() call). A closed transaction auto-commits, so if you combine batching with a closed transaction, it will execute and commit your first batch but never continue on to the remaining batches. Definitely not what we want.

# Without periodic commits
graph.run("""LOAD CSV WITH HEADERS
FROM 'file:///womeninstem.csv' AS line
MERGE (person:Person { name: line.personLabel })
ON CREATE SET person.birth = line.birthdate, person.death = line.deathdate""")

# With periodic commits
query_list = []
query_list.append("""CALL apoc.periodic.iterate("
LOAD CSV WITH HEADERS
FROM 'file:///womeninstem.csv' AS line
RETURN line
","
MERGE (person:Person { name: line.personLabel })
ON CREATE SET person.birth = line.birthdate, person.death = line.deathdate", {batchSize:1000, parallel:false, retries: 10}) YIELD operations""")

# would have more queries in query_list
for query in query_list:
    g = graph.begin()                      # open transaction
    result = g.run(query).to_data_frame()  # execute query
    g.commit()                             # close transaction

Query Failure Alerts

The thing about running a Cypher query through Python code instead of in the browser is that it doesn't always tell you if the query fails: it will alert you to syntax errors, but that's about it.

For that reason, it’s important to have something implemented that alerts you if a query doesn’t finish completely.

# query_list built the same way as above
for query in query_list:
    g = graph.begin()                      # open transaction
    result = g.run(query).to_data_frame()  # execute query
    try:
        if result['operations'][0]['failed'] > 0:  # at least one batch failed
            print('Could not finish query')
        else:  # every batch succeeded
            g.commit()  # close transaction
            print('Completed query')
    except (KeyError, IndexError):
        # query didn't yield batch statistics (not an apoc.periodic.iterate
        # call), so there's nothing to check; commit as usual
        g.commit()  # close transaction
        print('Completed query')
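The failure check can also be factored into a small helper. This is a sketch, assuming `result` is indexable the way the frame returned by to_data_frame() is for an apoc.periodic.iterate query:

```python
def batches_failed(result):
    """Return True if the apoc.periodic.iterate stats report any failed batch.

    Queries that don't yield an `operations` column are treated as
    having succeeded, since there are no batch statistics to check.
    """
    try:
        return result['operations'][0]['failed'] > 0
    except (KeyError, IndexError, TypeError):
        return False

# illustrative mocked results:
print(batches_failed({'operations': [{'total': 5, 'failed': 2}]}))  # True
print(batches_failed({'operations': [{'total': 5, 'failed': 0}]}))  # False
```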

Parameterizing Queries

I wrote a short separate article on this. Long story short, you tack a simple .format() substitution onto the end of the Cypher query.
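As a quick illustration (the file name and label here are just examples), note that Cypher's literal curly braces need to be doubled so .format() doesn't treat them as placeholders:

```python
# Template for a LOAD CSV query; {{ }} escapes Cypher's map braces
query_template = """LOAD CSV WITH HEADERS
FROM 'file:///{file_name}' AS line
MERGE (n:{label} {{ name: line.personLabel }})"""

query = query_template.format(file_name="womeninstem.csv", label="Person")
print(query)
```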

There are a few other tricks up my sleeve for constructing a Neo4j database (parallelizing queries, a simple GUI to display script progress, etc.), but they deserve their own articles. I plan to link them here once those are written!

If you’d like to check out the larger project from which this was pulled, see this repository.


Madison Gipson
Data Scientist & SWE | NASA | BS Computer Science & MBA Student