How To Build a Graph Database off Business Communications

Manuel Maguga Darbinian
Feb 9, 2024


A few days ago a colleague of mine sent me this YouTube video from Johannes Jolkkonen.

The video is about Advanced RAG with Knowledge Graphs, which is, to this day, a very hot topic in the scene. You can easily tell by the number of resources and papers out there addressing it (if you want to read good-quality content about RAG over graph databases, I would also strongly advise you to have a look at Tomaz Bratanic's articles; I find all of his work fascinating).

Anyway, what caught my attention in the video was the “what” Johannes was feeding into the sample Neo4j graph database: Slack messages, project briefs, and profiles of people. These would then be seeded into the graph and represented in the form of nodes, properties and relationships. This aligned with recent work I'm undertaking to build a knowledge graph from business communications.

To give you an example: if you work, or have worked, in any medium to large corporation that uses Slack as its internal communication tool, you might be familiar with the anxiety of spending so much time navigating channels to keep yourself updated, filtering through the noise to find and digest relevant information.
Thankfully, we live in the age of the AI renaissance, and we can leverage the benefits it brings to supercharge companies and individuals and optimise their time and resources.

How do you get started?

One thing I learned over the past year is that you don't need to be an engineer to write code and test use cases (I'm not an engineer myself). You only need ChatGPT Plus and any open-source library to make use of the code. In my case, I like to use Langchain.

PS: I would advise you to follow some intro coding courses to get familiar with the basics. Otherwise it will be like trying to drive a Ferrari for the first time without ever having driven a car in your life.

High-level flow of a hypothetical AI Product X

Step 1: Fetch the content from a specific Slack channel

The first thing we want AI Product X (the product we are building) to do is fetch the content from a specified Slack channel. To achieve this, we make use of the Slack API and run the code below, together with some helper functions that fetch messages and their threads. We also specify start_of_period and end_of_period to define the time window.

import os
from datetime import datetime, timedelta

# Constants for Slack API
CHANNEL_ID = 'YOUR_CHANNEL_ID'  # Replace with your channel ID
API_URL = 'https://slack.com/api/conversations.history'
API_URL_FOR_REPLIES = 'https://slack.com/api/conversations.replies'
CHANNEL_INFO_URL = 'https://slack.com/api/conversations.info'
SLACK_TOKEN = os.getenv('SLACK_API_TOKEN')  # Ensure SLACK_API_TOKEN is set in your environment variables
HEADERS = {'Authorization': f'Bearer {SLACK_TOKEN}'}

(...)
start_of_period = date - timedelta(days=2)
end_of_period = date + timedelta(days=0)
(...)

# Main Execution
if __name__ == "__main__":
    today = datetime.now()
    daily_messages = fetch_messages(CHANNEL_ID, today)
    print_messages_with_threads(daily_messages, CHANNEL_ID)

The output will be a raw markdown Slack history, together with the Slack handle of the person writing each message.
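The fetching helpers themselves are omitted above, but a minimal version of fetch_messages could look roughly like the sketch below. It reuses the constants defined earlier; the function shape is my own assumption, and pagination, error handling and message formatting are left out.

import requests
from datetime import timedelta

def fetch_messages(channel_id, date):
    """Fetch channel messages for a two-day window ending at `date`, including thread replies."""
    start_of_period = date - timedelta(days=2)
    end_of_period = date

    params = {
        'channel': channel_id,
        'oldest': str(start_of_period.timestamp()),
        'latest': str(end_of_period.timestamp()),
        'limit': 200,
    }
    messages = requests.get(API_URL, headers=HEADERS, params=params).json().get('messages', [])

    # For each message that started a thread, also pull its replies
    for message in messages:
        if message.get('reply_count'):
            reply_params = {'channel': channel_id, 'ts': message['ts']}
            replies = requests.get(API_URL_FOR_REPLIES, headers=HEADERS, params=reply_params)
            message['thread'] = replies.json().get('messages', [])

    return messages

A real implementation would also handle pagination via the response cursor and respect Slack's rate limits.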

Step 2: Use AI to generate the “Human Log”

Making use of OpenAI's chat completions API, we pass the daily_messages fetched above in the user message, together with a system message describing the task:

system_message = f"""
Task: As a diligent AI assistant, you are tasked with reviewing a sequence of Slack messages and their threads, so as to distill them into topics, discussion points, and decisions made. This is what we call a 'Human log'.

Instructions:
- Segregate the content into distinct topics, assigning a unique identifier to each for traceability purposes.
- Within each topic, capture the key discussion points or concerns raised by the participants in relation to that topic.
- For each discussion point, note any decision or conclusion, along with the accountable person's Slack handle.
- Format your response in JSON, adhering to the schema below:

Schema for Response:
{{
  "human_log": {{
    "slack_channel_id": "{channel_id}",
    "slack_channel": "{channel_name}",
    "topic": [
      {{
        "topicId": "Unique identifier for the topic",
        "description": "Main subject of the discussion",
        "discussionPoint": [
          {{
            "discussionPointId": "Unique identifier for the discussion point",
            "description": "Main discussion point related to the topic",
            "decision": [
              {{
                "decisionId": "Unique identifier for the decision",
                "description": "The conclusion or decision made regarding the discussion point.",
                "ACCOUNTABLE": "Slack handle (@...) of the person responsible for the decision"
              }}
            ]
          }}
        ]
      }}
    ]
  }}
}}

Notes:
- Use timestamps and content hashes to generate the unique identifiers.
- Keywords or phrases that are frequently mentioned may indicate the main topic.
- Requests for feedback, opinions, or thoughts typically highlight discussion points.
- Decisive language such as "we will", "we have decided", or "let's proceed with" is indicative of decisions.
"""

import openai

# Prepare the messages for the assistant
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"Slack messages:\n{daily_messages}"}
]

# Directly using the OpenAI client as specified
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # assumes your key is exported as OPENAI_API_KEY
client = openai.OpenAI(api_key=OPENAI_API_KEY)
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=messages,
    temperature=0.3
)

human_log = response.choices[0].message.content
print(human_log)

Running this cell will return a detailed, JSON-formatted human log with topics, discussions, decisions, and the accountable person for each decision. You have now leveraged AI to extract valuable, actionable business information from all the clutter in Slack. You could argue that a standardised process to generate and store this type of information could bring a lot of benefits in the areas of decision making, change management, tracking, and the overall harmonisation and optimisation of business activities.
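Before storing the result anywhere, it is worth parsing and sanity-checking the model's output. A minimal sketch, assuming the model returned valid JSON (with gpt-4-0125-preview you could also pass response_format={"type": "json_object"} to make that more likely):

import json

raw = human_log.strip()
# The model occasionally wraps its JSON in markdown fences; strip them if present
if raw.startswith("```"):
    raw = raw.strip("`")
    raw = raw[4:] if raw.startswith("json") else raw

human_log_dict = json.loads(raw)
topics = human_log_dict["human_log"]["topic"]
print(f"Extracted {len(topics)} topics from the channel history")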

The question now is: how and where would you store all of this valuable human log data? The answer might already be clear to you if you have read this far: a graph database.

Step 3: Populate a Graph Database with the Human Logs

We now make use of Langchain's Neo4j DB QA chain to: 1) connect to a sample database; and 2) start feeding it the human log data.

from langchain.chains import GraphCypherQAChain
from langchain_community.graphs import Neo4jGraph
from langchain_openai import ChatOpenAI

graph = Neo4jGraph(
    url="bolt://localhost:7687", username="neo4j", password="pleaseletmein"
)

Once connected, we start pushing each of the JSON human log items into the graph through functions and Cypher statements (here is a Cypher cheat sheet for reference).

The below is a generic example that demonstrates how to:

  • Insert or merge nodes into the Neo4j graph based on unique identifiers and additional attributes.
  • Optionally create relationships between nodes, linking newly inserted or matched nodes to other nodes within the graph.

from neo4j import GraphDatabase
import traceback

def push_data_to_graph(graph, data_dict):
    """
    Inserts data nodes and their relationships into a Neo4j graph based on the provided dictionary.

    Parameters:
    - graph: The Neo4j graph object for database interaction.
    - data_dict: A dictionary containing the data to be inserted into the graph.
    """
    try:
        # Iterate over items in the data dictionary
        for item in data_dict["items"]:
            # Extract data for the node
            nodeId = item["id"]
            description = item["description"]

            # Insert or merge the data node
            cypher_query_insert = """
            MERGE (n:DataNode {id: $nodeId, description: $description})
            RETURN n
            """
            params_insert = {"nodeId": nodeId, "description": description}
            graph.query(cypher_query_insert, params=params_insert)
            print(f"DataNode merged (created or matched): {nodeId}")

            # Example of linking to another node (if applicable)
            # This section can be customised based on the relationship and target node type
            relatedNodeId = item.get("relatedNodeId")
            if relatedNodeId:
                cypher_query_link = """
                MATCH (n:DataNode {id: $nodeId})
                MATCH (r:RelatedNode {id: $relatedNodeId})  // Adjust node type as needed
                MERGE (n)-[rel:RELATED_TO]->(r)             // Adjust relationship type as needed
                RETURN type(rel), n, r
                """
                params_link = {"nodeId": nodeId, "relatedNodeId": relatedNodeId}
                result_link = graph.query(cypher_query_link, params=params_link)
                if result_link:
                    print(f"Relationship RELATED_TO created between DataNode {nodeId} and RelatedNode {relatedNodeId}.")

    except Exception as e:
        print(f"An error occurred while pushing data to the graph: {e}")
        traceback.print_exc()  # Exception handling can be more sophisticated based on needs

# Example usage:
# Assuming `graph` is your initialized Neo4jGraph object and `data_dict` is your data to be inserted
# push_data_to_graph(graph, data_dict)
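The generic example above uses a placeholder DataNode label. For the human log itself, a sketch along the following lines could walk the parsed human_log_dict from Step 2 and merge each channel, topic, discussion point, decision and accountable person; the node labels, relationship types and function name here are my own assumptions rather than a fixed schema:

def push_human_log_to_graph(graph, human_log_dict):
    """Merge topics, discussion points, decisions and accountable people from a human log dict."""
    log = human_log_dict["human_log"]

    # Merge the channel node once
    graph.query(
        "MERGE (c:Channel {id: $id}) SET c.name = $name",
        params={"id": log["slack_channel_id"], "name": log["slack_channel"]},
    )

    for topic in log["topic"]:
        graph.query(
            """
            MATCH (c:Channel {id: $channelId})
            MERGE (t:Topic {id: $topicId}) SET t.description = $description
            MERGE (c)-[:HAS_TOPIC]->(t)
            """,
            params={"channelId": log["slack_channel_id"],
                    "topicId": topic["topicId"],
                    "description": topic["description"]},
        )
        for point in topic.get("discussionPoint", []):
            graph.query(
                """
                MATCH (t:Topic {id: $topicId})
                MERGE (d:DiscussionPoint {id: $pointId}) SET d.description = $description
                MERGE (t)-[:HAS_DISCUSSION_POINT]->(d)
                """,
                params={"topicId": topic["topicId"],
                        "pointId": point["discussionPointId"],
                        "description": point["description"]},
            )
            for decision in point.get("decision", []):
                graph.query(
                    """
                    MATCH (d:DiscussionPoint {id: $pointId})
                    MERGE (dec:Decision {id: $decisionId}) SET dec.description = $description
                    MERGE (p:Person {handle: $handle})
                    MERGE (d)-[:RESULTED_IN]->(dec)
                    MERGE (dec)-[:ACCOUNTABLE]->(p)
                    """,
                    params={"pointId": point["discussionPointId"],
                            "decisionId": decision["decisionId"],
                            "description": decision["description"],
                            "handle": decision["ACCOUNTABLE"]},
                )

Merging on the stable identifiers from the schema, rather than creating fresh nodes on every run, is what keeps the graph free of duplicates when the same channel is processed day after day.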

Once we run the functions to seed our graph with the data, we end up generating our first human-logs graph database, as seen in the image below.
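With the graph populated, the GraphCypherQAChain and ChatOpenAI imported earlier can then be used to ask questions over it in natural language. A rough sketch (the question and the @maria handle are purely illustrative, and the exact chain interface varies a bit between Langchain versions):

llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)
chain = GraphCypherQAChain.from_llm(llm, graph=graph, verbose=True)

# The chain generates a Cypher query from the question, runs it, and summarises the result
result = chain.invoke({"query": "Which decisions is @maria accountable for, and under which topics?"})
print(result)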

Conclusion

To make this product powerful, we need to take standardisation and reuse of nodes, properties and relationship types into consideration, so as to avoid duplication. Otherwise we would recreate the very clutter we set out to solve in the first place.
Once that is solved, you could imagine a future where, leveraging a graph database with an AI on top of it, we create the foundations for smart companies and individuals: systems that not only provide us with valuable information, but also help us create and turbocharge our work.
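One concrete way to enforce that reuse in Neo4j is through uniqueness constraints, so repeated runs merge into existing nodes instead of duplicating them. A sketch, assuming Neo4j 5.x Cypher syntax and the hypothetical labels used above:

# One node per identifier: repeated seeding runs will MERGE into existing nodes
for label, key in [("Channel", "id"), ("Topic", "id"),
                   ("DiscussionPoint", "id"), ("Decision", "id"), ("Person", "handle")]:
    graph.query(
        f"CREATE CONSTRAINT {label.lower()}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{key} IS UNIQUE"
    )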

This is my first article on Medium. I hope you enjoy it, and please feel free to share your thoughts and even challenge them! Always up for a good conversation.


Manuel Maguga Darbinian

Product & AI consultant on a mission to create value through leveraging AI and latest tech innovation.