Building Knowledge Graphs from Scratch Using Neo4j and Vertex AI

Rubens Zimbres
19 min read · Mar 18, 2024


Recently I watched the course “Knowledge Graphs for RAG” by Andrew Ng and Andreas Kollegger, available for free at deeplearning.ai. The course builds Knowledge Graphs from scratch, defining nodes and relationships obtained from SEC (U.S. Securities and Exchange Commission) forms. This article reproduces their results, concatenating pieces of code and visualizing the resulting Knowledge Graph through queries in Neo4j Workspace. I will also give you something that was not presented in the course: generating embeddings with Google Cloud Vertex AI instead of OpenAI, plus some cool graph visualizations in Neo4j Workspace.

As you can see in my other article, “Augmenting Gemini-1.0-Pro with Knowledge Graphs via LangChain”, a Knowledge Graph is a structured representation of knowledge, typically in the form of entities (such as people, places, documents or concepts) and the relationships between them. It’s a powerful way to model complex domains and understand the connections between different pieces of information. Knowledge Graphs have numerous applications, including semantic search, recommendation systems, network analysis, and more. They are also an excellent option for RAG (Retrieval Augmented Generation), as they impose a structure of concepts rather than a simple overlap of randomly sized chunks. In my previous article, I noticed that hallucinations decrease a lot with Knowledge Graphs (KGs) when compared to plain chunk-based RAG.

Knowledge Graphs can be used in different industries:

  1. Healthcare: In healthcare, Knowledge Graphs can be utilized to integrate and analyze vast amounts of medical data including patient records, research findings, drug interactions, and treatment protocols, in order to identify patterns, correlations, and propose new treatments. KGs can also aid in drug discovery by mapping relationships between genes, proteins, diseases, and drugs, accelerating the development of new therapies.
  2. Cybersecurity: Knowledge Graphs can play a crucial role in threat detection, incident response, and vulnerability management. By integrating data from various sources such as network logs, security alerts, threat intelligence feeds, and historical attack data, organizations can build a comprehensive understanding of their IT infrastructure, potential security threats and their location. Cyber attack patterns can be identified, and security alerts can be prioritized in a SIEM based on their relevance and severity.
  3. Content Recommendation and Optimization: Knowledge Graphs help marketers understand the relationships between different pieces of content, such as blog posts, videos, product descriptions, and user reviews. By analyzing these connections, marketers can optimize content discovery and recommendation systems, ensuring that users are presented with relevant and engaging content across various channels.

Neo4j is a popular graph database management system that is specifically designed for storing, querying, and analyzing graph-structured data. It excels at managing interconnected data and is well-suited for building Knowledge Graphs due to its native graph storage and processing capabilities.

Here’s how Neo4j can help build Knowledge Graphs:

Graph Storage: Neo4j stores data in nodes, which represent entities, and relationships, which represent connections/links between entities. This structure is ideal for representing complex relationships found in Knowledge Graphs.

Relationships as Core Data: In Neo4j, relationships carry properties and can be traversed efficiently. This allows for rich modeling of relationships between entities, capturing various nuances of the data.

Graph Query Language (Cypher): Neo4j uses Cypher as its query language. Cypher is specifically designed for expressing patterns in graph data and performing graph operations. It’s a declarative language that allows users to specify what they want to retrieve from the graph. It is SQL-like, but it has a learning curve, as it is specialized knowledge.

Traversal and Pattern Matching: With Cypher, you can traverse the graph to find patterns, relationships, and paths between entities. This is crucial for querying Knowledge Graphs to extract meaningful insights and answer complex questions.

Scalability: Neo4j is highly scalable and can handle large-scale Knowledge Graphs with millions or even billions of nodes and relationships. It offers features such as clustering and sharding to distribute data across multiple servers for performance and fault tolerance. Here, as we will use a free account, our processing power is limited.

Integration: Neo4j provides various integration options, allowing you to ingest data from different sources such as databases, .dump files, CSV, TSV or APIs. This makes it easier to populate your Knowledge Graph with relevant data from diverse sources.

Visualization and Analysis: Neo4j provides tools for visualizing and analyzing graph data, which is awesome: users can explore the structure of the Knowledge Graph, identify patterns, and gain insights through interactive visualizations.

This article is extensive, there is A LOT of code, and the troubleshooting is not trivial, given that a Google search sometimes brings you this:

It looks like we are at the right place =) Besides, it is very useful. I won’t give you the pip installs here, so you will have to figure out how to set up the Python environment. Check my article mentioned above for more info.

Let’s start. First, go to https://neo4j.com/, sign up for the free account and create a new instance:

Then, open it and connect:

You will get a clean instance with the default database neo4j:

A hint: if you mess up the database, you can reset it to blank at any time. It takes a few minutes to restart, but then you are good to go.

From here on, the article is split into three parts:

  1. Create the database, import a JSON file (SEC Form 10-K), create nodes and relationships from the Form 10-K plus a Vector Index, and query with LangChain and Vertex AI.
  2. Import a CSV file (SEC Form 13), adding the companies that filed this form and their investments in NETAPP INC.
  3. Final part: assemble the whole KG, query it with LangChain and Vertex AI, and get tangible results.

Part 1

Start a Jupyter notebook in the environment you created for LangChain, Neo4j and Vertex AI, then authenticate on Google Cloud Platform:

gcloud auth login

Now we define some variables and import libraries. This assumes you have a key.json service account file from Google Cloud (go to IAM/Service Accounts) stored in your Secret Manager under the name GOOGLE_APPLICATION_CREDENTIALS.

import os
import json
import textwrap
import warnings

from google.cloud import secretmanager
from google.cloud import aiplatform
import vertexai

# LangChain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain

warnings.simplefilter('ignore')

NEO4J_URI = "neo4j+s://YOUR-DATABASE-NUMBER.databases.neo4j.io"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "YOUR-NEO4J-PASSWORD"
NEO4J_DATABASE = "neo4j"

def access_secret_version(secret_version_id):
    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(name=secret_version_id)
    return response.payload.data.decode('UTF-8')

secret_version_id = "projects/YOUR-PROJECT-NUMBER/secrets/GOOGLE_APPLICATION_CREDENTIALS/versions/latest"

key = access_secret_version(secret_version_id)
# make the credentials visible to the Google client libraries
# (assuming the secret stores the path to the key.json file)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key

vertexai.init(project='YOUR-PROJECT', location='us-central1')

# the Neo4j connection used by every kg.query() call below
kg = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE,
)

Run the following command in the Neo4j terminal:

SHOW DATABASES YIELD name

Get the 0000950170-23-027948.json file (the Form 10-K) from the course. The data can also be obtained through the EDGAR system at the SEC and prepared:

Now we will define Neo4j variables:

## CONSTRUCT KG

VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'

json_file = json.load(open("0000950170-23-027948.json"))
json_file['item1'][0:200]

>Item 1. \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments.

Now we create the RecursiveCharacterTextSplitter to build the chunks with LangChain: here, a chunk size of 1000 with a chunk overlap of 500 yielded far better results than 2000/200.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=500,
    length_function=len,
    is_separator_regex=False,
)
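To build intuition for what chunk_overlap does, here is a toy character-window splitter. It is a simplification of my own, not LangChain's actual algorithm: the real RecursiveCharacterTextSplitter first tries to break on paragraph and sentence separators before falling back to raw characters.

```python
def sliding_chunks(text, size, overlap):
    """Toy splitter: each chunk repeats the last `overlap` characters
    of the previous one, so neighboring chunks share context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# every chunk shares 2 characters with its neighbor
print(sliding_chunks("abcdefghij", size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap is what lets a sentence cut at a chunk boundary still appear whole in the adjacent chunk, which is why 1000/500 retrieved better context than 2000/200 here.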

Below is the function to split the JSON data. We use [:30] to limit computing demand, as we are working with a free instance:

def split_json_data(file):
    chunks = []  # use this to accumulate chunk records
    # extract the form id from the source file name
    file_name = "0000950170-23-027948.json"
    form_id = file_name[0:-6] + "1"
    for item in ['item1', 'item1a', 'item7', 'item7a']:  # pull these keys from the json
        print(f'Processing {item} from {file_name}')
        item_text = file[item]  # grab the text of the item
        item_text_chunks = splitter.split_text(item_text)  # split the text into chunks
        chunk_seq_id = 0
        for chunk in item_text_chunks[:30]:  # only take the first 30 chunks
            # finally, construct a record with metadata and the chunk text
            chunks.append({
                'text': chunk,
                # metadata from looping...
                'f10kItem': item,
                'chunkSeqId': chunk_seq_id,
                # constructed metadata...
                'formId': form_id,  # pulled from the filename
                'chunkId': f'{form_id}-{item}-chunk{chunk_seq_id:04d}',
                # metadata from file...
                'names': file['names'],
                'cik': file['cik'],
                'cusip6': file['cusip6'],
                'source': file['source'],
            })
            chunk_seq_id += 1
        print(f'\tSplit into {chunk_seq_id} chunks')
    return chunks
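A small aside on the chunkId built above: the {chunk_seq_id:04d} zero padding keeps the lexicographic order of the IDs equal to the reading order, which makes them easy to sort and debug:

```python
# form_id as derived from the filename above
form_id, item = '0000950170-23-027941', 'item1'
chunk_ids = [f'{form_id}-{item}-chunk{i:04d}' for i in (0, 1, 10)]
print(chunk_ids[-1])  # 0000950170-23-027941-item1-chunk0010
# without the 04d padding, 'chunk10' would sort before 'chunk2'
assert chunk_ids == sorted(chunk_ids)
```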

Then we create the chunks:

## CREATE CHUNKS

json_chunks=split_json_data(json_file)
json_chunks[0]

Now we create the nodes, each one representing a specific chunk, with Neo4j’s SQL-like syntax (Cypher). Each chunk will have eight properties (names, formId, cik, cusip6, source, f10kItem, chunkSeqId and text) that match the keys of the JSON file.

## CREATE NODE WITH PROPERTIES

merge_chunk_node = """
MERGE(mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
ON CREATE SET
mergedChunk.names = $chunkParam.names,
mergedChunk.formId = $chunkParam.formId,
mergedChunk.cik = $chunkParam.cik,
mergedChunk.cusip6 = $chunkParam.cusip6,
mergedChunk.source = $chunkParam.source,
mergedChunk.f10kItem = $chunkParam.f10kItem,
mergedChunk.chunkSeqId = $chunkParam.chunkSeqId,
mergedChunk.text = $chunkParam.text
RETURN mergedChunk
"""

Let’s try to understand this Cypher:

MERGE(mergedChunk:Chunk {chunkId: $chunkParam.chunkId}): This line creates or matches a node labeled Chunk whose chunkId property equals the value of $chunkParam.chunkId. If a node with this chunkId and label already exists, it is matched; if it doesn’t exist, a new node with this chunkId and the label Chunk is created.

ON CREATE SET: This part of the query is executed only if a new node is created as a result of the MERGE operation. It sets properties on the newly created or merged node.

mergedChunk.names = $chunkParam.names … and so on

These lines set various properties (names, formId, cik, cusip6, source, f10kItem, chunkSeqId, text) on the mergedChunk node based on values provided in $chunkParam variables.

RETURN mergedChunk: This line returns the mergedChunk node after the MERGE operation, whether it was matched (already existed) or newly created.
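If the MERGE semantics are still opaque, they can be mimicked in plain Python with dict.setdefault. This is an analogy of mine, not how Neo4j is implemented:

```python
# MERGE ... ON CREATE SET behaves like dict.setdefault: properties are
# set only when the node is first created; an existing node with the
# same chunkId is matched and left untouched
nodes = {}

def merge_chunk(chunk):
    created = chunk['chunkId'] not in nodes
    nodes.setdefault(chunk['chunkId'], dict(chunk))
    return created

print(merge_chunk({'chunkId': 'c0', 'text': 'hello'}))    # True  -> created
print(merge_chunk({'chunkId': 'c0', 'text': 'changed'}))  # False -> matched
print(nodes['c0']['text'])  # hello (ON CREATE SET did not run again)
```

This idempotence is why the population loop below can be re-run safely without duplicating chunk nodes.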

For a Cypher Cheat Sheet, go to this link. Neo4j also has a Cypher Workbench, that can be accessed here.

Once the node query with its eight properties is defined, we populate the graph with the chunks from the JSON file, whose keys match the properties:

### POPULATE NODES WITH CHUNKS


node_count = 0
for chunk in json_chunks:
    print(f"Creating `:Chunk` node for chunk ID {chunk['chunkId']}")
    kg.query(merge_chunk_node, params={'chunkParam': chunk})
    node_count += 1
print(f"Created {node_count} nodes")

Then, we create a Vector Index for the chunks, with vector dimension 768 and cosine similarity to retrieve the top_k results:

## CREATE VECTOR INDEX

kg.query("""
CREATE VECTOR INDEX `form_10k_chunks` IF NOT EXISTS
FOR (c:Chunk) ON (c.textEmbedding)
OPTIONS { indexConfig: {
`vector.dimensions`: 768,
`vector.similarity_function`: 'cosine'
}}
""")
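Under the hood, querying this index amounts to ranking chunk embeddings by cosine similarity to the question embedding. Here is a tiny pure-Python sketch with made-up 2-d vectors; the real index works on the 768-d Vertex AI embeddings and uses approximate search:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query, vectors, k):
    """Rank chunk names by similarity to the query vector."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:k]

# toy 2-d "embeddings" standing in for real chunk vectors
vecs = {'chunk0': [1.0, 0.0], 'chunk1': [0.7, 0.7], 'chunk2': [0.0, 1.0]}
print(top_k([1.0, 0.1], vecs, k=2))  # ['chunk0', 'chunk1']
```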

Let’s take a look at the indexes. Here, you can also use the terminal in Neo4j Workspace:

## SHOW INDEXES

kg.query("""
SHOW VECTOR INDEXES
"""
)

[{'id': 2,
  'name': 'form_10k_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': None}]

Now we need the Google Cloud access token. Remember, Google Cloud access tokens have a set expiration time, which by default is one hour but can be longer depending on the configuration. So if you try to run the same notebook tomorrow, it won’t work and you will have to re-generate the access token. Run:

!gcloud auth print-access-token

Copy the output and paste it in place of “YOUR-GOOGLE-CLOUD-TOKEN”, along with your GCP project, because now we will populate the index with Vertex AI embeddings.

## POPULATE INDEX

kg.query("""
MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
WITH chunk, genai.vector.encode(
chunk.text,
"VertexAI",{token: "YOUR-GOOGLE-CLOUD-TOKEN", projectId: 'YOUR-PROJECT'})
AS vector
CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
""")

Let’s take a look at the KG schema: we don’t have relationships yet, only nodes.

## GET SCHEMA

kg.refresh_schema()
print(kg.schema)

Let’s test the Vector Index:

def neo4j_vector_search(question):
    vector_search_query = """
    WITH genai.vector.encode(
        $question,
        "VertexAI", {token: "YOUR-GOOGLE-CLOUD-TOKEN", projectId: 'YOUR-PROJECT'})
    AS question_embedding
    CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding)
    YIELD node, score
    RETURN score, node.text AS text
    """
    similar = kg.query(vector_search_query,
                       params={'question': question,
                               'index_name': VECTOR_INDEX_NAME,
                               'top_k': 10})
    return similar

search_results = neo4j_vector_search(
    'In a single sentence, tell me about Netapp.'
)
search_results[0]

Let’s take a look at the nodes/chunks in Neo4j Workspace: they are disconnected from each other.

Let’s read the form-level properties from a chunk and use them to create a Form node:

cypher = """
MATCH (anyChunk:Chunk)
WITH anyChunk
RETURN anyChunk { .names, .source, .formId, .cik, .cusip6 } as formInfo
"""
form_info_list = kg.query(cypher)

node_form = form_info_list[0]['formInfo']
node_form

cypher = """
MERGE (f:Form {formId: $formInfoParam.formId })
ON CREATE
SET f.names = $formInfoParam.names
SET f.source = $formInfoParam.source
SET f.cik = $formInfoParam.cik
SET f.cusip6 = $formInfoParam.cusip6
"""

kg.query(cypher, params={'formInfoParam': node_form})

Now we will order the chunks of the Form by section, in sequence: chunk0000, chunk0001, chunk0002:

cypher = """
MATCH (from_same_section:Chunk)
WHERE from_same_section.formId = $formIdParam
AND from_same_section.f10kItem = $f10kItemParam
RETURN from_same_section { .formId, .f10kItem, .chunkId, .chunkSeqId }
ORDER BY from_same_section.chunkSeqId ASC
LIMIT 10
"""

kg.query(cypher, params={'formIdParam': node_form['formId'],
'f10kItemParam': 'item1'})

Now we create a NEXT link (relationship) between consecutive chunk nodes; it will show up in the KG schema in the format (:Chunk)-[:NEXT]->(:Chunk). Note that the names Chunk and NEXT can be any words you choose.

cypher = """
MATCH (from_same_section:Chunk)
WHERE from_same_section.formId = $formIdParam
AND from_same_section.f10kItem = $f10kItemParam
WITH from_same_section
ORDER BY from_same_section.chunkSeqId ASC
WITH collect(from_same_section) as section_chunk_list
CALL apoc.nodes.link(
section_chunk_list,
"NEXT",
{avoidDuplicates: true}
)
RETURN size(section_chunk_list)
"""

kg.query(cypher, params={'formIdParam': node_form['formId'],
'f10kItemParam': 'item1'})
kg.refresh_schema()
print(kg.schema)
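Conceptually, apoc.nodes.link takes the list of chunks ordered by chunkSeqId and chains each one to its successor. In plain Python terms (my own sketch, not APOC's implementation):

```python
# pair each chunk with its successor, which is what apoc.nodes.link
# materializes as NEXT relationships in the graph
def link_next(ordered_ids):
    return [(a, 'NEXT', b) for a, b in zip(ordered_ids, ordered_ids[1:])]

section = sorted(['chunk0002', 'chunk0000', 'chunk0001'])
print(link_next(section))
# [('chunk0000', 'NEXT', 'chunk0001'), ('chunk0001', 'NEXT', 'chunk0002')]
```

Note that n chunks yield n-1 NEXT relationships, forming a linked list per section.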

We now have the first relationship:

Let’s create the ordered relationships between ALL the Form 10-K sections.

## CREATE ALL RELATIONSHIPS BETWEEN FORM 10 SECTIONS

cypher = """
MATCH (from_same_section:Chunk)
WHERE from_same_section.formId = $formIdParam
AND from_same_section.f10kItem = $f10kItemParam
WITH from_same_section
ORDER BY from_same_section.chunkSeqId ASC
WITH collect(from_same_section) as section_chunk_list
CALL apoc.nodes.link(
section_chunk_list,
"NEXT",
{avoidDuplicates: true}
)
RETURN size(section_chunk_list)
"""
for form10kItemName in ['item1', 'item1a', 'item7', 'item7a']:
    kg.query(cypher, params={'formIdParam': node_form['formId'],
                             'f10kItemParam': form10kItemName})

Check the sequence created: it runs counterclockwise in the picture.

Now that we have created the chunk relationships inside the Form 10-K, we connect these chunks to their parent node, the Form 10-K, as PART_OF. Note the connections being formed: Chunk PART_OF Form.

## Connect chunks to their parent form with a PART_OF relationship

cypher = """
MATCH (c:Chunk), (f:Form)
WHERE c.formId = f.formId
MERGE (c)-[newRelationship:PART_OF]->(f)
RETURN count(newRelationship)
"""

kg.query(cypher)

Next we will do the following:

  • Use the SECTION relationship between the Form and a Chunk to find the first chunk of a section
  • Get the following chunk through a NEXT match
  • Build a window of chunks connected by NEXT hops, returning the length of the path and the list of chunks
  • Use variable-length patterns ([:NEXT*0..1]) so the window can extend from zero to one hop on each side, and keep the longest match

## Match `NEXT` relationships with variable length, from 0 to 1 hops
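One caveat: the next query traverses a SECTION relationship from the Form to the first chunk of each section, which the code above has not created yet. In the course this is done with a Cypher along these lines (a sketch from memory of the course material; the f10kItem property on the relationship is assumed to mirror the chunk property):

```cypher
MATCH (first:Chunk), (f:Form)
WHERE first.formId = f.formId
  AND first.chunkSeqId = 0
WITH first, f
MERGE (f)-[r:SECTION {f10kItem: first.f10kItem}]->(first)
RETURN count(r)
```

Run it through kg.query() as before; it should create one SECTION relationship per section.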

cypher = """
MATCH (f:Form)-[r:SECTION]->(first:Chunk)
WHERE f.formId = $formIdParam
AND r.f10kItem = $f10kItemParam
RETURN first.chunkId as chunkId, first.text as text
"""

first_chunk_info = kg.query(cypher, params={
'formIdParam': node_form['formId'],
'f10kItemParam': 'item1'
})[0]

cypher = """
MATCH (first:Chunk)-[:NEXT]->(nextChunk:Chunk)
WHERE first.chunkId = $chunkIdParam
RETURN nextChunk.chunkId as chunkId, nextChunk.text as text
"""

kg.query(cypher,
params={'chunkIdParam': first_chunk_info['chunkId']})

next_chunk_info = kg.query(cypher, params={
'chunkIdParam': first_chunk_info['chunkId']
})[0]

cypher = """
MATCH window = (c1:Chunk)-[:NEXT]->(c2:Chunk)-[:NEXT]->(c3:Chunk)
WHERE c1.chunkId = $chunkIdParam
RETURN length(window) as windowPathLength
"""

kg.query(cypher,
params={'chunkIdParam': next_chunk_info['chunkId']})

## RETURN THE WINDOW OF CHUNKS AROUND A GIVEN CHUNK

cypher = """
MATCH window=(c1:Chunk)-[:NEXT]->(c2:Chunk)-[:NEXT]->(c3:Chunk)
WHERE c2.chunkId = $chunkIdParam
RETURN nodes(window) as chunkList
"""
# pull the chunk ID from the first
kg.query(cypher,
params={'chunkIdParam': first_chunk_info['chunkId']})

## MATCH NEXT WITH A VARIABLE-LENGTH PATTERN

cypher = """
MATCH window=
(:Chunk)-[:NEXT*0..1]->(c:Chunk)-[:NEXT*0..1]->(:Chunk)
WHERE c.chunkId = $chunkIdParam
RETURN length(window)
"""

kg.query(cypher,
params={'chunkIdParam': first_chunk_info['chunkId']})


cypher = """
MATCH window=
(:Chunk)-[:NEXT*0..1]->(c:Chunk)-[:NEXT*0..1]->(:Chunk)
WHERE c.chunkId = $chunkIdParam
WITH window as longestChunkWindow
ORDER BY length(window) DESC LIMIT 1
RETURN length(longestChunkWindow)
"""

kg.query(cypher,
params={'chunkIdParam': first_chunk_info['chunkId']})
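The *0..1 bounds explain why the query above orders by length and keeps only the longest window: the pattern can match windows of several lengths around the same chunk. A one-liner to enumerate them, assuming each side independently contributes zero or one hop:

```python
# [:NEXT*0..1] on each side of c: the window may extend 0 or 1 hop
# before c and 0 or 1 hop after c, so matched path lengths vary
possible_lengths = sorted({before + after for before in (0, 1) for after in (0, 1)})
print(possible_lengths)  # [0, 1, 2]
```

ORDER BY length(window) DESC LIMIT 1 then picks the 2-hop window whenever the chunk has neighbors on both sides.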

With the NEXT and PART_OF statements, we get this Graph in Neo4j Workspace:

MATCH p=(Company)-[:NEXT|PART_OF*]->() 
RETURN DISTINCT p
LIMIT 25;

If we select one of the nodes, we will see its properties (key-value in JSON data):

The chunks of the Form are now ordered and all chunks are PART_OF the Form.

Let’s test with LangChain what we’ve done so far: we create a Cypher and define the Vector Store and the Retriever:

from langchain_community.llms import VertexAI
from langchain_google_vertexai import VertexAIEmbeddings

retrieval_query_window = """
MATCH window=
(:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window as longestWindow
ORDER BY length(window) DESC LIMIT 1
WITH nodes(longestWindow) as chunkList, node, score
UNWIND chunkList as chunkRows
WITH collect(chunkRows.text) as textList, node, score
RETURN apoc.text.join(textList, " \n ") as text,
score,
node {.source} AS metadata
"""
vector_store_extra_text = Neo4jVector.from_existing_index(
embedding=VertexAIEmbeddings(),
url=NEO4J_URI,
username=NEO4J_USERNAME,
password=NEO4J_PASSWORD,
database="neo4j",
index_name=VECTOR_INDEX_NAME,
text_node_property=VECTOR_SOURCE_PROPERTY,
retrieval_query=retrieval_query_window,
)

# Create a retriever from the vector store
retriever = vector_store_extra_text.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
chain = RetrievalQAWithSourcesChain.from_chain_type(
VertexAI(temperature=0),
chain_type="stuff",
retriever=retriever
)

chain('Who is Netapp ?')

{'question': 'Who is Netapp ?',
 'answer': ' Netapp is a data management company.\n',
 'sources': 'https://www.sec.gov/Archives/edgar/data/1002047/000095017023027948/0000950170-23-027948-index.htm'}

Part 2

Now we will work on the form13.csv file from the course, which looks like this:

Let’s turn it into a dictionary:

## ADD COLLECTION FORMS 13s

import csv

all_form13s = []

with open('form13.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:  # each row will be a dictionary
        all_form13s.append(row)

all_form13s[0:3]

[{'source': 'https://sec.gov/Archives/edgar/data/1000275/0001140361-23-039575.txt',
  'managerCik': '1000275',
  'managerAddress': 'ROYAL BANK PLAZA, 200 BAY STREET, TORONTO, A6, M5J2J5',
  'managerName': 'Royal Bank of Canada',
  'reportCalendarOrQuarter': '2023-06-30',
  'cusip6': '64110D',
  'cusip': '64110D104',
  'companyName': 'NETAPP INC',
  'value': '64395000000.0',
  'shares': '842850'},
 ...]
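Note that csv.DictReader yields every field as a string. That is why the OWNS_STOCK_IN Cypher further down casts with toFloat() and toInteger() before storing numeric properties; in Python terms:

```python
# fields from the first Form 13 row, all strings as read from the CSV
row = {'value': '64395000000.0', 'shares': '842850'}
value = float(row['value'])   # equivalent of toFloat($ownsParam.value)
shares = int(row['shares'])   # equivalent of toInteger($ownsParam.shares)
print(value, shares)  # 64395000000.0 842850
```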

Now we will create the company node in the graph (NETAPP INC) and copy its names property from the Form 10-K:

first_form13 = all_form13s[0]

cypher = """
MERGE (com:Company {cusip6: $cusip6})
ON CREATE
SET com.companyName = $companyName,
com.cusip = $cusip
"""

kg.query(cypher, params={
'cusip6':first_form13['cusip6'],
'companyName':first_form13['companyName'],
'cusip':first_form13['cusip']
})

cypher = """
MATCH (com:Company), (form:Form)
WHERE com.cusip6 = form.cusip6
RETURN com.companyName, form.names
"""

kg.query(cypher)

cypher = """
MATCH (com:Company), (form:Form)
WHERE com.cusip6 = form.cusip6
SET com.names = form.names
"""

kg.query(cypher)

Now we create a FILED relationship between the company and the Form node, meaning that the company filed that form, matching on the cusip6 identification field.

kg.query("""
MATCH (com:Company), (form:Form)
WHERE com.cusip6 = form.cusip6
MERGE (com)-[:FILED]->(form)
""")

Then we create the Manager nodes for all the firms that filed a Form 13 to report their investment in NETAPP, starting with a uniqueness constraint on managerCik:

kg.query("""
CREATE CONSTRAINT unique_manager
IF NOT EXISTS
FOR (n:Manager)
REQUIRE n.managerCik IS UNIQUE
""")

Create also a fulltext index of manager names to allow for text search beyond literal matching:

kg.query("""
CREATE FULLTEXT INDEX fullTextManagerNames
IF NOT EXISTS
FOR (mgr:Manager)
ON EACH [mgr.managerName]
""")

Query the fulltext index to check if ‘royal bank’ returns ‘Royal Bank of Canada’:

kg.query("""
CALL db.index.fulltext.queryNodes("fullTextManagerNames",
"royal bank") YIELD node, score
RETURN node.managerName, score
""")
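What the fulltext index buys us, sketched in plain Python. This is a toy of mine: the Lucene-backed fulltext index in Neo4j additionally handles relevance scoring, analyzers and fuzzy matching.

```python
# token-based, case-insensitive matching instead of exact equality
managers = ['Royal Bank of Canada', 'BlackRock Inc.', 'VANGUARD GROUP INC']

def fulltext_query(query):
    terms = query.lower().split()
    return [m for m in managers if all(t in m.lower() for t in terms)]

print(fulltext_query('royal bank'))  # ['Royal Bank of Canada']
```

An exact-match WHERE clause on 'royal bank' would have returned nothing here.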

Now, create Manager nodes for all the firms that filed a Form 13:

cypher = """
MERGE (mgr:Manager {managerCik: $managerParam.managerCik})
ON CREATE
SET mgr.managerName = $managerParam.managerName,
mgr.managerAddress = $managerParam.managerAddress
"""
# loop through all Form 13s
for form13 in all_form13s:
    kg.query(cypher, params={'managerParam': form13})

Note the light green nodes around Form 10 chunks in Neo4j Workspace. The Cypher is:

MATCH (n) OPTIONAL MATCH (n)-[r]-(m) RETURN COLLECT(n) AS nodes, COLLECT(r) AS relationships

At this time, Chunks of Form 10K are connected in sequence to the NETAPP node, but the companies that invest in NETAPP shares are isolated.

To see the investment Royal Bank of Canada has in NETAPP INC, run this cypher:

first_form13 = all_form13s[0]

cypher = """
MATCH (mgr:Manager {managerCik: $investmentParam.managerCik}),
(com:Company {cusip6: $investmentParam.cusip6})
RETURN mgr.managerName, com.companyName, $investmentParam as investment
"""

kg.query(cypher, params={
'investmentParam': first_form13
})

Great. Now let’s do the following:

  • Match companies with managers based on data in the Form 13
  • Create an OWNS_STOCK_IN relationship between the manager and the company. We start with a single manager who filed the first Form 13 in the list:
cypher = """
MATCH (mgr:Manager {managerCik: $ownsParam.managerCik}),
(com:Company {cusip6: $ownsParam.cusip6})
MERGE (mgr)-[owns:OWNS_STOCK_IN {
reportCalendarOrQuarter: $ownsParam.reportCalendarOrQuarter
}]->(com)
ON CREATE
SET owns.value = toFloat($ownsParam.value),
owns.shares = toInteger($ownsParam.shares)
RETURN mgr.managerName, owns.reportCalendarOrQuarter, com.companyName
"""

kg.query(cypher, params={ 'ownsParam': first_form13 })

kg.query("""
MATCH (mgr:Manager {managerCik: $ownsParam.managerCik})
-[owns:OWNS_STOCK_IN]->
(com:Company {cusip6: $ownsParam.cusip6})
RETURN owns { .shares, .value }
""", params={ 'ownsParam': first_form13 })

We also set the OWNS_STOCK_IN properties value and shares from the form data, looping through ALL Form 13s:

cypher = """
MATCH (mgr:Manager {managerCik: $ownsParam.managerCik}),
(com:Company {cusip6: $ownsParam.cusip6})
MERGE (mgr)-[owns:OWNS_STOCK_IN {
reportCalendarOrQuarter: $ownsParam.reportCalendarOrQuarter
}]->(com)
ON CREATE
SET owns.value = toFloat($ownsParam.value),
owns.shares = toInteger($ownsParam.shares)
"""

# loop through all Form 13s
for form13 in all_form13s:
    kg.query(cypher, params={'ownsParam': form13})

At this point, let’s check how our KG schema looks:

kg.refresh_schema()
print(textwrap.fill(kg.schema, 60))

The relationships created are the following:

(:Chunk)-[:NEXT]->(:Chunk)
(:Chunk)-[:PART_OF]->(:Form)
(:Form)-[:SECTION]->(:Chunk)
(:Company)-[:FILED]->(:Form)
(:Manager)-[:OWNS_STOCK_IN]->(:Company)
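Reading this schema as an adjacency map makes it clear why the final questions work: starting from a Manager, we can reach the Form 10-K text chunks. A small sketch, with label and relationship names taken from the schema above:

```python
# the KG schema as an adjacency map: label -> [(relationship, target)]
schema = {
    'Chunk':   [('NEXT', 'Chunk'), ('PART_OF', 'Form')],
    'Form':    [('SECTION', 'Chunk')],
    'Company': [('FILED', 'Form')],
    'Manager': [('OWNS_STOCK_IN', 'Company')],
}

def reachable(start):
    """Depth-first traversal: which labels can be reached from start?"""
    seen, stack = set(), [start]
    while stack:
        for _, target in schema.get(stack.pop(), []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

print(sorted(reachable('Manager')))  # ['Chunk', 'Company', 'Form']
```

Every node label is reachable from Manager, which is the path the question-answering chain exploits: investor to company to form to text.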

What does our Knowledge Graph look like? Run the following Cypher: there is now a connection between the two clusters of data.

MATCH (n)
OPTIONAL MATCH (n)-[r]-(m)
RETURN COLLECT(DISTINCT n) AS nodes, COLLECT(DISTINCT r) AS relationships

Limit the companies to 25 and take a closer look at this part of the KG:

MATCH p=()-[:OWNS_STOCK_IN]->() RETURN p LIMIT 25;

Ok, now let’s collect the chunkId values and, after that, build up a path (relationship) from a Form 10-K chunk to companies and managers:

cypher = """
MATCH (chunk:Chunk)
RETURN chunk.chunkId as chunkId
"""

chunk_rows = kg.query(cypher)
chunk_first_row = chunk_rows[0]
ref_chunk_id = chunk_first_row['chunkId']
cypher = """
MATCH (:Chunk {chunkId: $chunkIdParam})-[:PART_OF]->(f:Form)
RETURN f.source
"""

for i in range(0, len(chunk_rows)):
    chunk_first_row = chunk_rows[i]
    kg.query(cypher, params={'chunkIdParam': chunk_first_row['chunkId']})

We do the same for the Company that FILED the Form:

cypher = """
MATCH (:Chunk {chunkId: $chunkIdParam})-[:PART_OF]->(f:Form),
(com:Company)-[:FILED]->(f)
RETURN com.companyName as name
"""

for i in range(0, len(chunk_rows)):
    chunk_first_row = chunk_rows[i]
    kg.query(cypher, params={'chunkIdParam': chunk_first_row['chunkId']})

In order to get the number of investors, we run this Cypher:

cypher = """
MATCH (:Chunk {chunkId: $chunkIdParam})-[:PART_OF]->(f:Form),
(com:Company)-[:FILED]->(f),
(mgr:Manager)-[:OWNS_STOCK_IN]->(com)
RETURN com.companyName,
count(mgr.managerName) as numberOfinvestors
"""

for i in range(0, len(chunk_rows)):
    chunk_first_row = chunk_rows[i]
    kg.query(cypher, params={'chunkIdParam': chunk_first_row['chunkId']})

Let’s get a list of the managers that invest (that own shares) in NETAPP INC:

cypher = """
MATCH (:Chunk {chunkId: $chunkIdParam})-[:PART_OF]->(f:Form),
(com:Company)-[:FILED]->(f),
(mgr:Manager)-[owns:OWNS_STOCK_IN]->(com)
RETURN mgr.managerName + " owns " + owns.shares +
" shares of " + com.companyName +
" at a value of $" +
apoc.number.format(toInteger(owns.value)) AS text
LIMIT 10
"""
kg.query(cypher, params={
'chunkIdParam': ref_chunk_id
})
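The apoc.number.format call adds thousands separators to the dollar value; Python's ',' format spec does the same, which is handy for sanity-checking the output locally (values taken from the first Form 13 row):

```python
# reproduce the Cypher string concatenation for one investor
value, shares, mgr = 64395000000.0, '842850', 'Royal Bank of Canada'
text = f"{mgr} owns {shares} shares at a value of ${int(value):,}"
print(text)
# Royal Bank of Canada owns 842850 shares at a value of $64,395,000,000
```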

Final Part

Now that everything is ready, let’s query the whole Knowledge Graph: we provide instructions for Vertex AI to respond, plus some example Cyphers, so that LangChain generates a Cypher query and gives us the result based on it:

from langchain.prompts.prompt import PromptTemplate
from langchain.chains import GraphCypherQAChain


prompt_template = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}

Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(mgrAddress:Address)
WHERE mgrAddress.city = 'San Francisco'
RETURN mgr.managerName

# What investment firms are near Santa Clara?
MATCH (address:Address)
WHERE address.city = "Santa Clara"
MATCH (mgr:Manager)-[:LOCATED_AT]->(managerAddress:Address)
WHERE point.distance(address.location,
managerAddress.location) < 10000
RETURN mgr.managerName, mgr.managerAddress


# What does Palo Alto Networks do?
CALL db.index.fulltext.queryNodes(
"fullTextCompanyNames",
"Palo Alto Networks"
) YIELD node, score
WITH node as com
MATCH (com)-[:FILED]->(f:Form),
(f)-[s:SECTION]->(c:Chunk)
WHERE s.f10kItem = "item1"
RETURN c.text

# Give me a list of 10 companies and the value invested by them in NETAPP INC.
MATCH
(com:Company)-[:FILED]->(f),
(mgr:Manager)-[owns:OWNS_STOCK_IN]->(com)
RETURN mgr.managerName + " owns " + owns.shares +
" shares of " + com.companyName +
" at a value of $" +
apoc.number.format(toInteger(owns.value)) AS text
LIMIT 10

The question is:
{question}"""

cypher_prompt = PromptTemplate(
input_variables=["schema", "question"],
template=prompt_template
)

cypherChain = GraphCypherQAChain.from_llm(
VertexAI(temperature=0),
graph=kg,
verbose=True,
cypher_prompt=cypher_prompt,
)

Let’s run the question and check the answer and the cypher generated:

cypherChain.run("How many shares of NETAPP INC does Royal Bank of Canada own?")

Let’s compare the text answer with our previous generated list of investors:

Great. Now, let’s check if the generated cypher works when run in Neo4j Workspace:

842850. Again, correct.

Now, let’s ask about the structure of the KG itself: note that I didn’t say NETAPP INC.

The answer and the Cypher are right, again:

Another one, also right:

We can also query the chunks of Form 10K:

However, with this public dataset we cannot traverse the graph much further, as there are no contextual connections like: how much money did BlueXP Sync generate in additional investments for NETAPP? Internal documents would make this possible.

In wrapping up, while there’s definitely room for tinkering and tidying up the code, completing a Minimum Viable Graph was my goal. We’ve nailed down the essentials — building databases, importing and playing with JSON and CSV files, setting up nodes and connections, and making some seriously cool visualizations. Now, the goal is to keep fine-tuning our Knowledge Graph and tapping into its power for smarter insights and decisions.

Google ML Developer Programs team supported this work by providing Google Cloud Credits


Rubens Zimbres

I’m a Senior Data Scientist and Google Developer Expert in ML and GCP. I love studying NLP algos and Cloud Infra. CompTIA Security +. PhD. www.rubenszimbres.phd