Importing Your Unstructured Triples into WhyHow.AI — Notebook Demonstration

Chia Jeng Yang
Published in WhyHow.AI
5 min read · Aug 24, 2024

We are excited to announce that you can now import your own triples into WhyHow.AI’s Knowledge Graph Studio (Beta) for further processing and management, especially for orchestrating multiple modular graphs in multi-agent workflows. To use this feature, you will need access to the Platform Beta (Beta Access codes can be found below).

What this means is that if you created triples through a range of other processes — whether through LangChain’s LLMGraphTransformer package, LlamaIndex’s Property Graph package, or your own internal pipelines — you can still leverage WhyHow.AI’s modular graph infrastructure. The main advantages of doing so are:

  • Modular Graph management for Multi-Graph Multi-Agent workflows
  • Multiplayer Graph Creation through Public Graph Links
  • Personalized Human-In-The-Loop Entity Resolution and Entity Extraction

To illustrate how to do this, we have a Notebook (which can be found here, and is explained below) that demonstrates the process of creating graphs from LangChain’s LLMGraphTransformer package with an Amazon 10-K financial report. If you want to skip to what the resultant graph looks like, check it out here.

As we are database agnostic, you can also export the resultant Graph in Cypher to any external graph database of your choice.
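As a rough illustration of what such an export involves, the sketch below renders triples as Cypher MERGE statements that could be run against a graph database such as Neo4j. The tuple layout, label names, and `name` property are illustrative assumptions for this sketch, not the exact output of our exporter:

```python
def triples_to_cypher(triples):
    """Render (head, head_label, relation, tail, tail_label) tuples as Cypher MERGE statements."""
    statements = []
    for head, head_label, relation, tail, tail_label in triples:
        statements.append(
            f'MERGE (h:{head_label} {{name: "{head}"}}) '
            f'MERGE (t:{tail_label} {{name: "{tail}"}}) '
            f"MERGE (h)-[:{relation}]->(t)"
        )
    return "\n".join(statements)


# Example: one triple from a hypothetical 10-K graph
cypher = triples_to_cypher(
    [("Amazon", "Company", "POSES_RISK", "Competition", "RiskFactor")]
)
print(cypher)
```

Using MERGE rather than CREATE keeps the export idempotent, so re-running the statements against the same database will not duplicate nodes or relationships.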

Specific problems we are aiming to solve through this process are:

  • I am a developer building a multi-agent workflow that needs to call upon a range of structured data. I want to be able to segregate my graphs by different domains, document sources or types (lexical vs semantic) and be able to independently call each graph for multi-graph, multi-agent workflows.
  • I am a developer who has built my own sophisticated data processing pipeline to create triples. I want to hand these graphs over to the domain expert (who may not be technical) to make sure that the triples and graphs are accurate.
  • I am a developer who wants to be able to use WhyHow.AI’s structured and unstructured querying (i.e. specify specific entity types, entities and relations programmatically) packages against the single or range of graphs I have.
  • I am a domain expert who is tasked with cleaning up a graph that’s been mostly generated by a developer, and want to create nodes, triples and entity resolution with a UI.
  • I am an LLM. I enjoy pretty graphs and structured data because people get mad at me when I only use vectors to retrieve relevant information and construct answers. I want to plug into WhyHow.AI’s Studio Platform for more reliable context injection.

Importing LangChain-Created Triples into WhyHow

import itertools
import os
import pickle

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

from whyhow import WhyHow, Node, Relation, Triple
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
llm = ChatOpenAI(model="gpt-4o")
llm_transformer = LLMGraphTransformer(llm=llm)

Load Text from Selected File

We load the 2024 Amazon 10-K filing here.

filepath = "data/amazon_10K_2024.pdf"
loader = PyPDFLoader(filepath)
docs = loader.load()

Process Document

We then split the document into chunks for processing.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)

Convert Processed Text to Triples

We then specify the nodes and relationships we would like to see, using LangChain’s schema format.

# Select the entity types and relations you want for your triples
allowed_nodes = ["Company", "Risk Factor", "Legal Proceeding", "Business Segment"]
allowed_relationships = ["AFFECTS", "INVOLVED_IN", "WORKED_AT", "POSES_RISK"]

llm_transformer_props = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=allowed_nodes,
    allowed_relationships=allowed_relationships,
)

# aconvert_to_graph_documents is async, so it must be awaited
graph_documents_props = await llm_transformer_props.aconvert_to_graph_documents(split_docs)

# Flatten the per-document relationships into a single list of triples
flat_triples = list(
    itertools.chain.from_iterable(doc.relationships for doc in graph_documents_props)
)

[Optional] Store the Triples

Optionally, you can store the triples independently of loading them into WhyHow.

# Serialize the list and write it to the file
with open('langchain_triples.pkl', 'wb') as file:
    pickle.dump(flat_triples, file)

WhyHow Integration

Here, we load our WhyHow API key and initialize the workspace.

# Initialise the client with your WhyHow API key
client = WhyHow(api_key=os.environ.get("WHYHOW_API_KEY"), base_url="https://api-test.whyhow.ai/")

workspace = client.workspaces.create(name="Amazon 10-K Analysis")
# or, if you already have a workspace
# workspace = client.workspaces.get(workspace_id="")

[Optional] Load the Triples

We can load the triples we saved in the optional step.

with open('langchain_triples.pkl', 'rb') as file:
    flat_triples = pickle.load(file)

Preprocess the Triples

Here, we convert the triples into a format that the WhyHow platform can accept.

def format_triple(triple):
    """
    Format the LangChain triple into the desired structure.

    Args:
        triple: An object containing source, target, and type attributes.

    Returns:
        Triple: A Triple object with formatted head, relation, and tail.
    """
    # Extract source and target from the triple
    source = triple.source
    target = triple.target

    # Create and return a formatted Triple object
    return Triple(
        head=Node(name=source.id, label=source.type),  # Head node with source id and type
        relation=Relation(name=triple.type),  # Relation with triple type
        tail=Node(name=target.id, label=target.type),  # Tail node with target id and type
    )

# Format every LangChain triple for the WhyHow SDK
formatted_triples = [format_triple(triple) for triple in flat_triples]

# View the first 3 triples
formatted_triples[:3]
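To make the mapping concrete, here is the same transformation applied to a minimal stand-in for a LangChain Relationship. `types.SimpleNamespace` replaces the real LangChain classes and plain dicts replace the whyhow `Triple`/`Node`/`Relation` objects, so this sketch runs without either library installed:

```python
from types import SimpleNamespace

# Stand-in for a LangChain Relationship: source/target nodes with id and type
mock_triple = SimpleNamespace(
    source=SimpleNamespace(id="Amazon", type="Company"),
    target=SimpleNamespace(id="AWS", type="Business Segment"),
    type="INVOLVED_IN",
)


def format_triple_dict(triple):
    """Same mapping as format_triple above, but returning plain dicts."""
    return {
        "head": {"name": triple.source.id, "label": triple.source.type},
        "relation": {"name": triple.type},
        "tail": {"name": triple.target.id, "label": triple.target.type},
    }


formatted = format_triple_dict(mock_triple)
print(formatted)
```

The key point is that LangChain stores the entity name in `.id` and the entity type in `.type`, which map onto WhyHow’s node `name` and `label` fields respectively.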

Create and Query the Graph

We then create the graphs by loading in the formatted triples into the specified workspace, and naming the graph.

graph = client.graphs.create_graph_from_triples(
    workspace_id=workspace.workspace_id,
    triples=formatted_triples,
    name="Amazon 10-K Graph",
)

We can also query the graph through the SDK, using either our Structured or Unstructured query endpoints.

# Query graph for Amazon's business segments
question = "What are Amazon's main business segments?"

query_response = client.graphs.query_unstructured(
    graph_id=graph.graph_id,
    query=question,
)

print(f"LLM Response: {query_response.answer}")
print(f"Returned Triples: {query_response.triples}")

What does this graph look like? You can see and interact with the graph made from these LangChain triples at this link here. Unlike our Graph Creation SDK, LangChain’s graph package doesn’t let us natively link vector chunks to triples, so this import brings in triples only and leaves our Chunk section blank; we are working on a package that will let you import and tie chunks as well.

We also believe that the quality of your schema (i.e. how structured it is, and how much it reflects the underlying content) affects the quality of the graph output. We are working on a more automated system to help you clean up your schema if you have built an unconstrained schema graph, so stay tuned for that!

WhyHow.AI’s Knowledge Graph Studio Platform (currently in Beta) is the easiest way to build Agentic & RAG-Native Knowledge Graphs, combining workflows from developers and non-technical domain experts.

If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG for accuracy, memory and determinism, we’d love to chat at team@whyhow.ai, or follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism and knowledge graphs in RAG on our Discord.

For our Medium subscribers — One-Time Use Access Codes (If they are not working, they have been claimed already):

8d1f74f4-41b8-471c-a815-695c5e057499

7da80eb9-7613-4df8-9056-946adb6e4e0e

697c225b-cf75-47e5-8ae7-48f8ab0e4b68
