GraphRAG with Databricks and PuppyGraph

Ajmal Aziz
7 min read · Aug 30, 2024


A demonstration of how compound AI systems increase the reliability and accuracy of LLM outputs.

Note: Code to be released on GitHub shortly. Many thanks to the PuppyGraph Engineering Team for the collaboration on this post.

Introduction

Large Language Models (LLMs) have surged in popularity since OpenAI released ChatGPT to the public nearly two years ago. This rapid adoption has spurred enterprises to explore how they can leverage LLMs with their private data to improve the productivity of their workforce.

The primary architecture for introducing LLMs to external (and typically private) data is Retrieval Augmented Generation (RAG). In its simplest form, RAG injects information into the LLM prompt to provide context that the LLM was not exposed to during training. The primary vehicle for this has been to compute a similarity metric between a user query and embeddings of partitioned chunks of knowledge stored in a vector database, providing just-in-time context to the LLM at response time.

Whilst these similarity-based approaches to knowledge augmentation have proved useful for simple scenarios, where a query vector closely matches the embeddings of chunks of data, they fall short when it comes to answering questions about the nature of the information itself. We demonstrate this shortcoming in the mining domain, using a dataset generated with mining subject matter experts.

Why mining? In mining maintenance, the key performance indicator for Reliability teams is the Mean Time To Recovery (MTTR). MTTR after a failure can depend on various factors, such as the availability of staff and equipment, or the speed of root cause analysis. The driver we will focus on is the speed of diagnosis. Teams typically spend quite a bit of time looking through user manuals and previous work orders to arrive at a diagnosis for an asset’s failure. If we can help the teams working on mine sites (typically in arid conditions) sift through this trove of information faster, we can greatly improve MTTR.

As a point of reference, a 15 minute improvement to MTTR can lead to millions of dollars in recovered revenue for some of the largest mine operators in the world.

This application will demonstrate how speed of failure diagnosis can be improved through an LLM interface with an underlying graph structure providing the external context as opposed to a vector database. The example makes use of three tables: work orders raised against assets, troubleshooting guides for assets that outline steps to follow to diagnose a failure, and an asset lookup table.

To compare vector database RAG with GraphRAG, we devise a series of questions (in partnership with reliability engineers from one of the largest mining operators in the world) that would be critical to answer during equipment downtime. We use two key technologies:

Databricks

  • Auto Loader to demonstrate production readiness via incremental ingestion and transformation.
  • Unity Catalog for managing table metadata, including definitions, governance, lineage, and schemas, and for organising our data in a medallion architecture.
  • Databricks Mosaic AI for LLM experimentation and tracing, and RAG Studio for debugging our prompt engineering process.

PuppyGraph

  • As the graph agent that executes Cypher queries to augment the LLM’s knowledge.

What is GraphRAG?

The term GraphRAG is defined by Microsoft as “a technique for richly understanding text datasets by combining text extraction, network analysis, and LLM prompting and summarization into a single end-to-end system.” In this blog, GraphRAG is yet another tool leveraged in Retrieval Augmented Generation (RAG) systems, with a key difference: the augmentation, or inclusion of new information, is primarily driven by PuppyGraph (our graph retrieval tool), which converts user queries into Cypher statements that filter, count, aggregate, and compute metrics against the underlying graph objects.

The hypothesis we will explore is that graph-based retrieval leads to better results because it can model complex relationships amongst entities rather than standalone “chunks” of information.

Simple representation of how retrieval takes place with Vector Database RAG and GraphRAG

In our experience, GraphRAG addresses a key assumption in traditional RAG approaches: that the answer to a user query is contained locally within regions of text. One example where this assumption breaks down is queries that reason over the entire dataset, such as aggregate queries. Examples in our mining domain include:

  1. How many assets are active in Site A?
  2. How many troubleshooting steps are from the Documentum per failure type?

When using a vector database as the retriever, a key parameter is the number of approximate nearest neighbours to retrieve. This parameter is chosen before any question is asked and is fixed, and the number of tokens added to the query prompt grows directly with it (each additional neighbour contributes another chunk’s worth of tokens). Without prior knowledge of the numerical scale of the answers, this number is chosen arbitrarily, which leads both to a slowdown in LLM generation and to increased cost on pay-per-token models.

Considering the questions above, these can be expressed in the graph domain as simple Cypher queries:

How many assets are active in Site A?
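Expressed in Cypher, this is a sketch reconstructed from the clause-by-clause explanation below (the status and location property names follow our asset schema):

    MATCH (asset:asset {status: "active", location: "Site A"})
    RETURN COUNT(asset) AS activeAssets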

Explanation

  • MATCH (asset:asset {status: “active”, location: “Site A”}): This pattern matches all nodes with the label asset that have the status attribute set to “active” and are located in “Site A”.
  • RETURN COUNT(asset) AS activeAssets: This returns the count of the matched assets as activeAssets.

How many troubleshooting steps are from the Documentum per Failure Type?
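Again expressed in Cypher, reconstructed from the explanation below (to break the count down per failure type you would additionally return a property of the failure_type node, whose exact name depends on the schema):

    MATCH (asset:asset)-[r:can_have_failure]->(failure:failure_type)
    WHERE r.reference_source = "Documentum"
    RETURN COUNT(r) AS troubleshootingStepsFromDocumentum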

Explanation:

  • MATCH (asset:asset)-[r:can_have_failure]->(failure:failure_type): This pattern matches the can_have_failure relationships between asset and failure_type nodes.
  • WHERE r.reference_source = “Documentum”: This condition filters the relationships to only include those where the reference_source attribute is “Documentum”.
  • RETURN COUNT(r) AS troubleshootingStepsFromDocumentum: This returns the count of the filtered relationships as troubleshootingStepsFromDocumentum.

Now, the key questions are: where do we execute these Cypher statements, and how do we automatically generate them from a user query?

PuppyGraph and Databricks

Databricks provides the Data Intelligence Platform, which allows for ingestion, transformation, and analysis of data, and also provides an interface to Large Language Models (LLMs) either through SQL or via the LangChain API. PuppyGraph allows the data stored in Databricks to be queried as a graph with its graph query engine and offers advanced graph analytics capabilities. The combination provides:

  • Integration with the Delta file format. There is no need to maintain separate sources of truth: with PuppyGraph, our source tables stay in Delta format and do not need to be uploaded to an external graph database.
  • Unity Catalog support, mitigating the need to maintain separate access control lists.
  • A table-first format for the underlying data. We do not need to generate triplets, removing an ETL step.

Implementation Architecture with PuppyGraph

Maintenance Dataset

We created tables that represent facts and dimensions for a maintenance operation on a mine site. The datasets are described below:

  • Work Orders (fct_work_orders): Records detailed information about maintenance work orders carried out on various assets.
  • Assets (dim_assets): Provides detailed information about each asset in the mine site.
  • Troubleshooting Guide (dim_troubleshooting_guide): Offers step-by-step troubleshooting procedures for different types of failures.
  • Failure Type (dim_failure_type): Categorizes the types of failures encountered in the maintenance operations.

The database schema is shown below and generated using DBML.

ETL Steps

In order to emulate a production environment, we have the tables undergo a medallion-style set of enrichment steps. Aside from simple casting of types, we leverage ai_query() to classify a work order’s failure type based on the description of work undertaken. This failure type is then leveraged as part of our graph schema to connect work orders with the troubleshooting guide and understand the root cause of failures.
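A minimal sketch of this enrichment step inside a Databricks notebook is shown below. The table names (silver.fct_work_orders, gold.fct_work_orders), the serving endpoint, and the prompt wording are illustrative assumptions, not the exact values used in our pipeline.

    # Sketch: classify each work order's failure type with ai_query() during the
    # silver -> gold enrichment step. Endpoint and table names are assumptions.
    classified = spark.sql("""
        SELECT
          work_order_id,
          description,
          ai_query(
            'databricks-meta-llama-3-1-70b-instruct',
            CONCAT(
              'Classify the failure type for this maintenance work order. ',
              'Respond with a single failure type label. Description: ', description
            )
          ) AS failure_type
        FROM silver.fct_work_orders
    """)
    classified.write.mode("overwrite").saveAsTable("gold.fct_work_orders")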

Converting Delta tables to a Graph (the Ontology)

We used the following schema to create the graph within PuppyGraph. The schema contains three different types of nodes: failure_type, asset, and work_order, as well as three types of relationships: can_have_failure, worked_on, and related_to_failure. Each node and relationship is powered by one or more Delta tables in Databricks. PuppyGraph provides a graphical schema builder for users to define the mapping.

PuppyGraph is able to leverage this schema definition to perform graph queries on top of our Delta tables in Unity Catalog. The following diagram shows the entire graph generated by PuppyGraph based on the schema definition.

Results

In order to compare results, we created 20 test questions with ground truth labels and used MLflow evaluation to compare the two retrieval architectures. MLflow’s evaluate function provides LLM-as-judge capabilities. We leveraged this with GPT-4 as the judge to assign an answer_correctness score (out of 5) to the generated results from both architectures. For every row in our evaluation set, we assess the answer_correctness score; the box plot below shows the final results.
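A sketch of this evaluation harness is shown below, assuming an eval_df pandas DataFrame that holds the test questions, ground truth labels, and generated answers (the column names are illustrative). We run it once per retrieval architecture and compare the per-question answer_correctness scores.

    import mlflow
    from mlflow.metrics.genai import answer_correctness

    # eval_df columns (illustrative): "inputs" (test question), "ground_truth"
    # (expected answer), "predictions" (answer from one of the two architectures).
    judge = answer_correctness(model="openai:/gpt-4")  # LLM-as-judge, scores 1-5

    results = mlflow.evaluate(
        data=eval_df,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[judge],
    )
    print(results.metrics)                               # aggregate scores
    per_question = results.tables["eval_results_table"]  # per-row scores for the box plot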

We further labelled our questions as Direct or Meta queries, where we define:

  • Direct Query: requires only a single table to answer. These questions are specific and enquire about a single chunk or row of information.
  • Meta Query: requires multiple tables to be referenced to answer (see the Cypher sketch below for an example).
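For illustration, a meta query can be expressed as a single traversal over the ontology above. The relationship names follow the schema described earlier, while the specific property names (work_order_id, asset_name, failure_type, troubleshooting_step) and the direction of related_to_failure are assumptions:

    MATCH (wo:work_order)-[:related_to_failure]->(f:failure_type)<-[step:can_have_failure]-(a:asset)
    WHERE wo.work_order_id = "WO-1042"  // illustrative work order id
    RETURN a.asset_name, f.failure_type, step.troubleshooting_step

Answering the same question with a vector retriever would require the relevant rows of the work order, failure type, and troubleshooting guide tables to all land in the retrieved chunks.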

Grouping by these types of queries reveals the reason for the performance gap between the two retrieval architectures.

Next, we will look at each of these questions and showcase how Vector Database RAG would not have been able to provide sufficient information to answer the query.
