Building a Graph Database with Vector Embeddings: A Python Tutorial with Neo4j and Embeddings

A Real-World Example of Creating Embeddings with Neo4j, LangChain, and Neo4jVector

Shivam Sharma
The Deep Hub
9 min read · May 24, 2024

In today’s data-driven world, traditional relational databases often struggle with data whose value lies in the relationships between entries. Graph databases address this by storing and querying data as a graph. In this article, I will cover the concept of graph databases, introduce Neo4j, and show how to create embeddings for fast retrieval.

Introduction to Graph Databases

Graph databases are gaining significant traction due to their ability to represent data through relationships rather than traditional tabular formats. Unlike conventional databases, which store data in tables, graph databases store data as a graph consisting of nodes and edges. Nodes represent entities (such as Persons or Places), while edges represent relationships between these nodes. Both nodes and relationships can have properties to store additional metadata.

Graph databases excel at finding connections and patterns in various fields, such as social networks and recommendation engines, and even tracing money in money laundering cases. They facilitate complex queries, such as finding the shortest path between nodes, identifying strong or weak points in a network, and analyzing node connections based on distance.

Because relationships are stored directly rather than reconstructed through joins at query time, graph databases can traverse large volumes of connected data efficiently.

Consider a simple example where nodes represent users, places, cuisines, and locations. The relationships between these nodes might include “IS FRIEND OF,” “LIKES,” “SERVES,” and “LOCATED IN.” For instance:

  • User 1, User 2, and User 3 are nodes representing individual users.
  • Place, Cuisine, and Location are other nodes representing specific entities.
  • Relationships like “IS FRIEND OF” connect User 1 and User 2, while “LIKES” might connect User 1 to a particular Cuisine, and “SERVES” connects a Place to a Cuisine.
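To make the example concrete, here is a minimal sketch of such a graph as plain Python data, with no database involved. The concrete names (such as "Trattoria Roma" and "Italian") and the helper functions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Each edge is a (source, relationship, target) triple.
# The concrete names are illustrative assumptions, not data from the article.
EDGES = [
    ("User 1", "IS_FRIEND_OF", "User 2"),
    ("User 2", "IS_FRIEND_OF", "User 3"),
    ("User 1", "LIKES", "Italian"),
    ("User 2", "LIKES", "Thai"),
    ("Trattoria Roma", "SERVES", "Italian"),
    ("Trattoria Roma", "LOCATED_IN", "Rome"),
]


def build_index(edges):
    """Group edge targets by (source, relationship) for quick lookup."""
    index = defaultdict(list)
    for source, relationship, target in edges:
        index[(source, relationship)].append(target)
    return index


def places_serving_liked_cuisines(user, edges):
    """Follow LIKES from the user, then find places that SERVE those cuisines."""
    index = build_index(edges)
    liked = set(index[(user, "LIKES")])
    return sorted(
        source
        for source, relationship, target in edges
        if relationship == "SERVES" and target in liked
    )


print(places_serving_liked_cuisines("User 1", EDGES))
```

A graph database answers the same kind of question, but it stores the edges natively and lets us express the traversal as a query instead of hand-written loops.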

Introducing Neo4j

Neo4j is a widely used graph database that stands out for its high performance, scalability, and open-source nature. At its core, Neo4j stores data in the form of nodes, relationships, and properties, making it an ideal choice for handling complex data relationships.

However, what makes Neo4j truly powerful is its ability to establish relationships between nodes. In the example we build later in this article, the Paper node has relationships with multiple authors through the “AUTHORED_BY” relationship and with the Version node through the “HAS_VERSION” relationship. This allows us to query the data flexibly and efficiently using Cypher, Neo4j’s query language.

Cypher is a declarative query language that allows us to query the data in our Neo4j database. With Cypher, we can write queries that traverse the relationships between nodes, making it easy to extract insights and patterns from our data. Whether we need to find all the authors who have written papers in a specific category or identify the latest version of a paper, Cypher lets us express the question directly.

Let us write a simple Cypher query to find all the papers written by the author “Theran Louis.”
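A query along these lines might look like the sketch below, wrapped in Python to match the rest of the article. It assumes the Paper and Author labels and the AUTHORED_BY relationship defined later on, and that author names are stored as "Last First" strings without extra whitespace. The helper name papers_by_author is hypothetical. neomodel is imported inside the function so that the query text can be inspected without a running database.

```python
# Cypher: match Paper nodes linked to an Author node with the given name.
PAPERS_BY_AUTHOR = """
MATCH (p:Paper)-[:AUTHORED_BY]->(a:Author {name: $name})
RETURN p.title AS title
"""


def papers_by_author(name):
    """Return the titles of all papers written by `name`.

    Needs a running Neo4j instance and a configured neomodel connection.
    """
    from neomodel import db  # imported lazily; only needed when executed

    rows, _columns = db.cypher_query(PAPERS_BY_AUTHOR, {"name": name})
    return [row[0] for row in rows]


# Example call (requires a populated database):
# papers_by_author("Theran Louis")
```

Passing the name as the $name parameter, rather than formatting it into the query string, keeps the query reusable and avoids quoting problems.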

Setting Up Neo4j

We need to set up Neo4j Desktop to build the graph database and interact with it using Python. Follow these steps to get started:

  1. Download Neo4j Desktop for your operating system from the official Neo4j download page.
  2. Open Neo4j Desktop after installation, and create a new project by clicking the “New Project” button. Name your project (in my case, “AuthorPublication”).
  3. Within the newly created project, click the “Add” button located at the top right corner.
  4. Choose “Local Database” and follow the prompts to create a new database. Enter a name for this database, and set a password.
  5. Once your database is created, it appears under the project. Click the “Start” button to launch the database.
  6. Wait for the database to start, then press “Open” to interact with it using Cypher queries.

Building the Graph Database in Python

To build our graph database, we will use an open dataset of authors and papers. We have a collection of JSON records in the following format:

{
    "id": "0704.0002",
    "submitter": "Louis Theran",
    "authors": "Ileana Streinu and Louis Theran",
    "title": "Sparsity-certifying Graph Decompositions",
    "comments": "To appear in Graphs and Combinatorics",
    "journal-ref": null,
    "doi": null,
    "report-no": null,
    "categories": "math.CO cs.CG",
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "abstract": " We describe a new algorithm, the $(k,\\ell)$-pebble game with colors, and use\nit obtain a characterization of the family of $(k,\\ell)$-sparse graphs and\nalgorithmic solutions to a family of problems concerning tree decompositions of\ngraphs. Special instances of sparse graphs appear in rigidity theory and have\nreceived increased attention in recent years. In particular, our colored\npebbles generalize and strengthen the previous results of Lee and Streinu and\ngive a new proof of the Tutte-Nash-Williams characterization of arboricity. We\nalso present a new decomposition that certifies sparsity based on the\n$(k,\\ell)$-pebble game with colors. Our work also exposes connections between\npebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and\nWestermann and Hendrickson.\n",
    "versions": [
        {
            "version": "v1",
            "created": "Sat, 31 Mar 2007 02:26:18 GMT"
        },
        {
            "version": "v2",
            "created": "Sat, 13 Dec 2008 17:26:00 GMT"
        }
    ],
    "update_date": "2008-12-13",
    "authors_parsed": [
        ["Streinu", "Ileana", ""],
        ["Theran", "Louis", ""]
    ]
}
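Before touching the database, it can help to see how the fields of such a record map to plain Python values. The sketch below parses a trimmed copy of the record above. The helper name summarize_record is hypothetical, and taking the last entry of "versions" as the latest one assumes versions are listed oldest first, as they are in this record.

```python
import json

# A trimmed copy of the record above; only the fields used here are kept.
RAW_LINE = json.dumps({
    "id": "0704.0002",
    "title": "Sparsity-certifying Graph Decompositions",
    "categories": "math.CO cs.CG",
    "versions": [
        {"version": "v1", "created": "Sat, 31 Mar 2007 02:26:18 GMT"},
        {"version": "v2", "created": "Sat, 13 Dec 2008 17:26:00 GMT"},
    ],
    "authors_parsed": [["Streinu", "Ileana", ""], ["Theran", "Louis", ""]],
})


def summarize_record(line):
    """Parse one JSON line and pull out the fields that become nodes."""
    record = json.loads(line)
    return {
        "id": record["id"],
        # Categories are space-separated inside a single string.
        "categories": record["categories"].split(),
        # Each author is [last, first, suffix]; skip empty parts when joining.
        "authors": [
            " ".join(part for part in parts if part)
            for parts in record["authors_parsed"]
        ],
        # Assumption: versions are listed oldest first, so the last is the latest.
        "latest_version": record["versions"][-1]["version"],
    }


print(summarize_record(RAW_LINE))
```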

We’ll interact with the Neo4j database using the neomodel library, which provides a Pythonic way to work with Neo4j.

First, we’ll import the necessary libraries and configure the connection to the Neo4j database. The neomodel library uses the Bolt protocol, typically served on port 7687, to communicate with Neo4j. It expects the connection URL in this format: bolt://username:password@localhost:port_number

import json
from neomodel import StructuredNode, StringProperty, RelationshipTo, config

# Configure the database connection
config.DATABASE_URL = 'bolt://neo4j:password@localhost:7687'
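If the password contains characters such as "@" or ":", they would clash with the URL format above. One possible way around this is to percent-encode the credentials before assembling the URL, as sketched below. The helper name build_bolt_url is hypothetical, and the sketch assumes the receiving library decodes percent-encoded credentials, which is worth checking against the neomodel documentation for your version.

```python
from urllib.parse import quote


def build_bolt_url(username, password, host="localhost", port=7687):
    """Build a bolt:// URL, percent-encoding the credentials."""
    # safe="" makes quote() encode every reserved character, including "@" and ":".
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"bolt://{user}:{secret}@{host}:{port}"


print(build_bolt_url("neo4j", "password"))
```

The result can then be assigned to config.DATABASE_URL in place of the hard-coded string.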

Next, we’ll define the schema for our graph database. We’ll create three node types: Paper, Author, and Version. The Paper node will have relationships with Author nodes (AUTHORED_BY) and Version nodes (HAS_VERSION). Each class represents a node label, and its properties correspond to the attributes of that label. The neomodel library allows us to mark a property as unique using the unique_index flag.

class Paper(StructuredNode):
    uid = StringProperty(unique_index=True)
    submitter = StringProperty()
    title = StringProperty()
    comments = StringProperty()
    journal_ref = StringProperty()
    doi = StringProperty()
    report_no = StringProperty()
    categories = StringProperty()
    abstract = StringProperty()
    update_date = StringProperty()

    authors = RelationshipTo("Author", "AUTHORED_BY")
    versions = RelationshipTo("Version", "HAS_VERSION")

class Author(StructuredNode):
    name = StringProperty(unique_index=True)

class Version(StructuredNode):
    version = StringProperty()
    created = StringProperty()

Loading Data into Neo4j

To load our dataset into the Neo4j graph database, we need a function that creates nodes and relationships based on the JSON structure shown above. This function uses the neomodel library to interact with the Neo4j database.

We start by defining a function create_nodes_and_relationships that takes a single JSON object as input and creates nodes and relationships in the database.

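def create_nodes_and_relationships(data):
    # Create the Paper node from the top-level fields of the record
    paper = Paper(uid=data['id'], submitter=data['submitter'],
                  title=data['title'], comments=data['comments'],
                  journal_ref=data['journal-ref'], doi=data['doi'],
                  report_no=data['report-no'], categories=data['categories'],
                  abstract=data['abstract'], update_date=data['update_date']).save()

    # authors_parsed holds [last, first, suffix]; skip empty parts when joining
    for author in data['authors_parsed']:
        name = " ".join(part for part in author if part)
        # Reuse the existing Author node if this name was already stored
        author_node = Author.get_or_create({'name': name})[0]
        paper.authors.connect(author_node)

    # Create one Version node per entry and link it to the paper
    for version in data['versions']:
        version_node = Version(version=version['version'],
                               created=version['created']).save()
        paper.versions.connect(version_node)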

Next, we need to iterate over each JSON object in our dataset file and use the function to create nodes and relationships in the database. This is done by opening the file and reading each line as a separate JSON object.

import json

# open the JSON file and read each line
with open("archive/arxiv-metadata-oai-snapshot.json") as user_file:
    for line in user_file:
        try:
            # load the JSON data and create nodes and relationships
            create_nodes_and_relationships(json.loads(line))
        except Exception as e:
            print(e)
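Since a metadata snapshot like this can contain a very large number of records, it can help to try the loader on a small sample first. The sketch below is one way to do that: a generator that parses one JSON object per line, skips blank or malformed lines, and stops after an optional limit. The name iter_records and the in-memory sample are hypothetical and used only for illustration.

```python
import io
import json


def iter_records(lines, limit=None):
    """Yield parsed JSON objects from an iterable of lines.

    Blank and malformed lines are skipped; at most `limit` records are yielded.
    """
    parsed = 0
    for line_number, line in enumerate(lines, start=1):
        if limit is not None and parsed >= limit:
            return
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            print(f"Skipping line {line_number}: {error}")
            continue
        parsed += 1
        yield record


# In-memory stand-in for the dataset file, including one bad and one blank line.
sample = io.StringIO('{"id": "1"}\nnot json\n\n{"id": "2"}\n{"id": "3"}\n')
first_two = list(iter_records(sample, limit=2))
print([record["id"] for record in first_two])
```

With the real file, the same generator could wrap the open file handle, for example by looping over iter_records(user_file, limit=1000) and passing each record to create_nodes_and_relationships.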

We have now loaded our dataset into the Neo4j graph database. The database contains a network of interconnected nodes representing papers, authors, and versions.

Creating Embeddings

To enhance the capabilities of our Neo4j graph database, we’ll create embeddings using the Neo4jVector wrapper. This enables us to perform operations involving vectors, such as similarity searches and faster retrieval of relevant text.

Neo4jVector acts as a bridge between our Neo4j graph and vector operations. It facilitates the creation of embeddings from textual data, allowing us to extract meaningful representations of our graph’s nodes and relationships.

Let’s dive into the process of generating embeddings for our existing graph. We’ll use the from_existing_graph method provided by Neo4jVector. This method takes text from our database, calculates embeddings, and stores them back in the database. For this task, I will use OpenAIEmbeddings, which expects the API key in the OPENAI_API_KEY environment variable; load_dotenv() reads it from a local .env file.

from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
import os
from dotenv import load_dotenv

load_dotenv()

# Create the vectorstore for our existing graph
paper_graph = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
    index_name="paper_index",
    node_label="Paper",
    text_node_properties=["abstract", "title"],
    embedding_node_property="paper_embedding",
)

Understanding the Parameters

  • embedding: The embedding model used to compute the vectors (here, OpenAIEmbeddings).
  • url: The URL of our Neo4j database.
  • username/password: Credentials for accessing the Neo4j database.
  • index_name: The name of the vector index built over the embeddings.
  • node_label: The label of the nodes for which we want to create embeddings (in our case, “Paper”).
  • text_node_properties: The node properties containing the text from which embeddings will be generated.
  • embedding_node_property: The name of the node property in which the embeddings will be stored.

Now that we have created the embeddings, we can take a closer look at them in the Neo4j Browser: each Paper node has a paper_embedding property that contains its embedding vector.

Performing Similarity Search

Now that we have generated embeddings for our graph data, let’s explore how we can use them to perform similarity searches. This functionality allows us to find text documents that are semantically similar to a given query.

We will start by conducting a similarity search for a specific text on the embeddings we created earlier. This involves using the similarity_search method provided by Neo4jVector.

from pprint import pprint

result = paper_graph.similarity_search("dark matter field fluid model")
pprint(result[0].page_content)

We provide a query text, and the vector index searches (using cosine similarity by default) for documents whose embeddings are closest to the embedding of the query. The result contains the most similar documents, including their content.
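To see what the cosine metric measures, here is a minimal pure-Python sketch. The three-dimensional vectors and document names are made up for illustration; real embeddings have far more dimensions, and in practice the comparison runs inside the database's vector index rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, documents):
    """Return document names sorted from most to least similar to the query."""
    return sorted(
        documents,
        key=lambda name: cosine_similarity(query_vector, documents[name]),
        reverse=True,
    )


# Toy "embeddings"; the values are illustrative assumptions.
documents = {
    "dark matter": [0.9, 0.1, 0.8],
    "graph theory": [0.0, 1.0, 0.1],
    "star formation": [0.5, 0.5, 0.5],
}
print(rank_by_similarity([1.0, 0.0, 1.0], documents))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated (orthogonal) vectors score close to 0, regardless of their length.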

Because the index and the embeddings are now stored in the database, we can reconnect to them later without recomputing anything. The from_existing_index method of Neo4jVector lets us do this.

paper_store = Neo4jVector.from_existing_index(
    OpenAIEmbeddings(),
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
    index_name="paper_index",
    text_node_property="abstract"
)

result = paper_store.similarity_search("We discuss the results from the combined IRAC and MIPS c2d Spitzer Legacy observations of the Serpens star-forming region. In particular we present")
pprint(result[0].page_content)

In conclusion, by using embeddings in our Neo4j database, we’ve unlocked advanced capabilities such as similarity search. This helps us efficiently navigate our data and extract insights from it based on semantic similarity.

In this article, we’ve explored the fundamentals of graph databases, delved into the process of setting up Neo4j, and demonstrated how to create embeddings for our data. We’ve also learned how to perform similarity searches, which opens up a world of possibilities for data exploration and analysis. Whether it’s finding related documents, detecting patterns, or making recommendations, the combination of graph databases and embeddings offers a powerful toolkit for data scientists and analysts. With the knowledge gained from this article, you’re now ready to dive deeper into the world of graph-based data analysis and leverage these techniques in your own projects.

You can find the code and data here: https://github.com/shivamsharma00/GraphDatabase

And if you like this article, please connect with me on LinkedIn: https://www.linkedin.com/in/shivamsharma00/

Also, please show your appreciation by clapping on this article.
