Loading Data into Neo4j Using pyneoinstance (ScoobyGraph, Part 2)

Published in

Neo4j Developer Blog

8 min readFeb 20, 2024

Are you interested in stepping into the transformative world of graphs? Do you want to begin working with graphs to uncover connections, visualize insights, and even ground your Large Language Models (LLMs)?

Do you want to try all of this with a fun dataset full of Scooby-Doo monsters?

Welcome to the ScoobyGraph series! In this tutorial, we are exploring how to use graphs and graph analysis with a childhood favorite: Scooby Doo. If you missed part 1 you can read it here to learn how we developed the Cypher queries we will be using in this tutorial.

In Part 1, we explored how to turn the Scooby Doo csv into a Neo4j graph using Cypher (the preferred graph query language) in the Neo4j command line. In Part 2, we are going to take those Cypher queries from Part 1 and instead use them to upload the data straight from Python using a library called pyneoinstance. I will explain how to use pyneoinstance and show you how it makes uploading data to Neo4j in Python extremely easy.

pyneoinstance

While there are multiple ways to ingest data into Neo4j, my personal favorite is pyneoinstance library. I prefer pyneoinstance because it works very well with pandas and makes your calls feel very Pythonic! For an in-depth article about how to use pyneoinstance, see Alex Gilmore’s article here!

# Python imports
import pandas as pd
import pyneoinstance

Separating Python and Cypher with config.yaml

One of the cleanest ways to ingest code to Neo4j using Cypher is by having two separate files: your python script and a .yaml file. It makes your code cleaner, easier to maintain, and it is easier to work with Cypher code in the .yaml file rather than in a Python script or notebook.

1. Connect to the Neo4j Database

When I create a .yaml file, I first load in my Neo4j URI, username, and password as shown below.

# config.yaml: Neo4j Credentials
credentials:
  uri: 'insert uri here'
  user: neo4j
  password: 'password'

The pyneoinstance library provides a convenient function to load in the yaml file easily. Then you just need to access your credentials in the file and create a Neo4j connection.

# Python file - connection to Neo4j

# Retrieving the Neo4j connection credentials from the config.yaml file
configs=pyneoinstance.load_yaml_file('config.yaml')
creds=configs['credentials']

# Establishing the Neo4j connection
neo4j=pyneoinstance.Neo4jInstance(creds['uri'],creds['user'],creds['password'])

Now that we have established a connection to the database, we can begin loading the data into Neo4j.

2. Add Cypher Queries to YAML file

Before loading your data into any graph database, you should have an idea of how you want the data structured as a graph (what data will be nodes, what will be the relationships between those nodes, and what data will become properties of the nodes and edges). Here is the data model from Part 1:

We will work on loading the data one section at a time. Loading your data into the graph in smaller pieces makes development and debugging much easier.

pyneoinstance is extremely useful because it has different methods for loading data with write Cypher queries. You can run a string Cypher query, or you can use the execute_write_query_with_data() function to run Cypher queries against a data frame. We are going to use the latter method to load data from our data frame.

We’ll begin with writing the necessary Cypher code to load in the Series and Episodes nodes, as well as the IS_SERIES relationship. (As a reminder: if you aren’t familiar with Cypher, check out the free resources on GraphAcademy!)

The key to writing Cypher queries to run on data frames with pyneoinstance is to begin your Cypher query with the following:

WITH $rows AS rows
UNWIND rows AS row

Then, you can use row.column_name to access the values in your data frame throughout your Cypher query, just like what we did in Part 1 with LOAD CSV. For example, the Cypher statement to load Series nodes would look like this:

WITH $rows AS rows
UNWIND rows AS row
MERGE (s:Series {seriesName:row.series_name})

Now that we have a general understanding of how pyneoinstance works, we can store Cypher queries in the config.yaml file. We will take the queries we developed in Part 1, but instead of using LOAD CSV we will use the WITH $rows as rows UNWIND rows as row. Here is the config file with the Cypher queries for series, episodes, and monster nodes.

cypher:
  load_nodes:

    # loading in the series nodes
    load_series: | 
      WITH $rows AS rows
      UNWIND rows AS row
      MERGE (s:Series {seriesName:row.series_name})
    
    # loading in episode nodes and connecting them to existing series nodes
    load_episodes: |
      WITH $rows AS rows
      UNWIND rows AS row
      MATCH (s:Series {seriesName:row.series_name})
      MERGE (e:Episode {episodeTitle:row.title})
      SET e.season = row.season, e.network = row.network,
      e.imdb = row.imdb, e.engagement = row.engagement,
      e.runtime = row.runtime, e.format = row.format
      MERGE (s) <-[:IN_SERIES] - (e)# Loading in monsters, subtypes, types, species, and the relationships between them
    
    # loading in the monster, subtype, type, and species nodes
    # as well as the relationships between them
    load_monsters: |
          WITH $rows AS rows
          UNWIND rows AS row
          MERGE (m:Monster {name:row.monster_name})
          SET m.real = row.monster_real, m.gender = row.monster_gender
          WITH m,row
          MERGE (st:Subtype {name:row.monster_subtype})
          MERGE (t:Type {name:row.monster_type})
          MERGE (s:Species {name:row.monster_species})
          WITH m,st,t,s,row
          MATCH (e:Episode {episodeTitle:row.title})
          MERGE (m) -[:APPEARS_IN] -> (e)
          MERGE (m) -[:IS_SUBTYPE]->(st)
          MERGE (st)-[:IS_TYPE]->(t)
          MERGE (t)-[:IS_SPECIES]->(s)

3. Run Queries in Python to Load Data

First, load the data into Python.

# Python - Load in the Scooby Doo data
df = pd.read_csv('https://raw.githubusercontent.com/CayleyCausey/ScoobyGraph/main/scoobydoo_new.csv')

Now that we have the data in Python and the Cypher code to load the series and episodes in our YAML file, we can upload the data using the funcction pyneoinstance.Neo4jinstance.execute_write_query_with_data(). We set the query parameter to the Cypher query string in the config.yaml file, and data parameter to the dataframe we loaded in earlier.

For the monsters query, I changed the data parameter to only load in monster rows that are not null (because a small amount of the Scooby-Doo media does not include any monsters).

# Python - loading nodes
# Load series nodes
neo4j.execute_write_query_with_data(query =configs['cypher']['load_nodes']['load_series'],data=df)
# Load episode nodes and IN_SERIES relationships
neo4j.execute_write_query_with_data(query = configs['cypher']['load_nodes']['load_episodes'],data = df)
# Load monster, subtype, type, and species nodes/relationships
neo4j.execute_write_query_with_data(query = configs['cypher']['load_nodes']['load_monsters'],data = df[~df.monster_name.isna()])

4. Continue adding to the YAML file and calling the Cypher in Python until all the data is loaded

Now we’ve loaded most of the data, but there are still some missing pieces. For example, the main characters (Fred, Daphne, Velma, Scooby, and Shaggy) need to be added to the database. We can add the Cypher to add main characters to our config.yaml file, which now looks like this:

credentials:
  uri: 'insert uri here'
  user: neo4j
  password: 'password'
cypher:
  load_nodes:
    load_series: |
      WITH $rows AS rows
      UNWIND rows AS row
      MERGE (s:Series {seriesName:row.series_name})
    load_episodes: |
      WITH $rows AS rows
      UNWIND rows AS row
      MATCH (s:Series {seriesName:row.series_name})
      MERGE (e:Episode {episodeTitle:row.title})
      SET e.season = row.season, e.network = row.network,
      e.imdb = row.imdb, e.engagement = row.engagement,
      e.runtime = row.runtime, e.format = row.format
      MERGE (s) <-[:IN_SERIES] - (e)
    load_monsters: |
      WITH $rows AS rows
      UNWIND rows AS row
      MERGE (m:Monster {name:row.monster_name})
      SET m.real = row.monster_real, m.gender = row.monster_gender
      WITH m,row
      MERGE (st:Subtype {name:row.monster_subtype})
      MERGE (t:Type {name:row.monster_type})
      MERGE (s:Species {name:row.monster_species})
      WITH m,st,t,s,row
      MATCH (e:Episode {episodeTitle:row.title})
      MERGE (m) -[:APPEARS_IN] -> (e)
      MERGE (m) -[:IS_SUBTYPE]->(st)
      MERGE (st)-[:IS_TYPE]->(t)
      MERGE (t)-[:IS_SPECIES]->(s)
    load_main_characters: |
      MERGE (:MainCharacter {name:"Fred"})
      MERGE (:MainCharacter {name:"Daphne"})
      MERGE (:MainCharacter {name:"Shaggy"})
      MERGE (:MainCharacter {name:"Scooby"})
      MERGE (:MainCharacter {name:"Velma"})

Because the main characters are not being loaded by a dataframe, we can just run them with the function execute_write_query().

# Python script: loading main characters
# Loading the main characters
neo4j.execute_write_query(configs['cypher']['load_nodes']['load_main_characters'])

Now that we’ve loaded all the nodes, we can load in the rest of the relationships. We will add the APPEARS_WITH code as we did in Part 1, but for the UNMASKED, CAPTURED, and CAUGHT relationships we will make a change from Part 1. Now that we are working in Python, we can simplify the Cypher statements and manipulate the data more in Python. This leads us to the following complete config.yaml:

Complete config.yaml file

credentials:
  uri: 'insert uri here'
  user: neo4j
  password: 'password'
cypher:
  load_nodes:
    load_series: |
      WITH $rows AS rows
      UNWIND rows AS row
      MERGE (s:Series {seriesName:row.series_name})
    load_episodes: |
      WITH $rows AS rows
      UNWIND rows AS row
      MATCH (s:Series {seriesName:row.series_name})
      MERGE (e:Episode {episodeTitle:row.title})
      SET e.season = row.season, e.network = row.network,
      e.imdb = row.imdb, e.engagement = row.engagement,
      e.runtime = row.runtime, e.format = row.format
      MERGE (s) <-[:IN_SERIES] - (e)
    load_monsters: |
      WITH $rows AS rows
      UNWIND rows AS row
      MERGE (m:Monster {name:row.monster_name})
      SET m.real = row.monster_real, m.gender = row.monster_gender
      WITH m,row
      MERGE (st:Subtype {name:row.monster_subtype})
      MERGE (t:Type {name:row.monster_type})
      MERGE (s:Species {name:row.monster_species})
      WITH m,st,t,s,row
      MATCH (e:Episode {episodeTitle:row.title})
      MERGE (m) -[:APPEARS_IN] -> (e)
      MERGE (m) -[:IS_SUBTYPE]->(st)
      MERGE (st)-[:IS_TYPE]->(t)
      MERGE (t)-[:IS_SPECIES]->(s)
    load_main_characters: |
      MERGE (:MainCharacter {name:"Fred"})
      MERGE (:MainCharacter {name:"Daphne"})
      MERGE (:MainCharacter {name:"Shaggy"})
      MERGE (:MainCharacter {name:"Scooby"})
      MERGE (:MainCharacter {name:"Velma"})
  load_relationships:
    CAUGHT: |
      WITH $rows AS rows
      UNWIND rows AS row
      MATCH (m:Monster {name:row.monster_name})
      MATCH (mc:MainCharacter {name:$name})
      MERGE (mc) -[:CAUGHT]->(m)
    UNMASKED: |
      WITH $rows AS rows
      UNWIND rows AS row
      MATCH (m:Monster {name:row.monster_name})
      MATCH (mc:MainCharacter {name:$name})
      MERGE (mc) -[:UNMASKED]->(m)
    CAPTURED: |
      WITH $rows AS rows
      UNWIND rows AS row
      MATCH (m:Monster {name:row.monster_name})
      MATCH (mc:MainCharacter {name:$name})
      MERGE (m) -[:CAPTURED]->(mc)
    APPEARS_WITH: |
      MATCH (m:Monster) - [:APPEARS_IN] -> (e:Episode) <- [:APPEARS_IN] - (m1:Monster)
      MERGE (m) - [:APPEARS_WITH] -> (m1)

We create the APPEARS_WITH relationships the same way we created main character nodes:

# Python - Creating APPEARS_WITH relationships
neo4j.execute_write_query(configs['cypher']['load_relationships']['APPEARS_WITH'])

To create the UNMASKED, CAPTURED, and CAUGHT relationships, we will create a dictionary to map the names in the csvs to the names in the Main Character nodes (e.g. fred -> Fred), find the relevant columns, and then load the data into Neo4j.

In this write we are using a variable in our Cypher statement called $name, which we set with “parameters” in execute_write_query_with_data. We also partition the data into 2 groups for each of these calls to reduce load on the database.

# Python script: monster-main character connections

# Adding in CAUGHT, UNMASKED, and CAPTURED relationships
adjusted_names = {'fred':'Fred',
                  'daphnie':'Daphne',
                  'velma': 'Velma',
                  'shaggy': 'Shaggy',
                  'scooby': 'Scooby'}

for col in df.columns:

    # Finding the caught columns
    if 'caught' in col:
        if col.split('_')[1] in adjusted_names:
            name = adjusted_names[col.split('_')[1]] 
            result = neo4j.execute_write_query_with_data(query = configs['cypher']['load_relationships']['CAUGHT'],data = df[df[col]==True],partitions=2,parameters={"name":name})
            print(result)
    
    # Finding the unmask columns
    elif 'unmask' in col:
        if col.split('_')[1] in adjusted_names:
            name = adjusted_names[col.split('_')[1]]
            result = neo4j.execute_write_query_with_data(query = configs['cypher']['load_relationships']['UNMASKED'],data = df[df[col]==True],partitions=2,parameters={"name":name})
            print(result)

    # Finding the capture columns
    elif 'capture' in col:
        if col.split('_')[1] in adjusted_names:
            name = adjusted_names[col.split('_')[1]]
            result = neo4j.execute_write_query_with_data(query = configs['cypher']['load_relationships']['CAPTURED'],data = df[df[col]==True],partitions=2,parameters={"name":name})
            print(result)

Once you’ve run all of that, you’re done! You have successfully created your ScoobyGraph using pandas data frames and pyneoinstance. Now you have an understanding of how to create more graphs from dataframes in the future.