A New Synergy: Knowledge Graphs and LLMs in Collaboration

Eveline Schmidt
Sopra Steria NL Data & AI
5 min read · Mar 1, 2024

A couple of months ago, I had the opportunity to attend the 19th edition of the International Conference on Semantic Systems, which, for the first time, joined forces with the Language Intelligence Summit. The fact that these two domains united their strengths for a single event underlines the growing significance of their collaboration. The fusion of semantic systems and language intelligence is now more relevant than ever, and that was precisely the key takeaway from this event. This blog will delve deeper into the synergy between Large Language Models (LLMs) and Knowledge Graphs (KGs), drawing on a use case I presented alongside my colleague Kike Franssen during this event.

Image created by DALL-E via ChatGPT.

The synergy between Knowledge Graphs and Large Language Models

First, let’s break down what Knowledge Graphs (KGs) are and how they can support Large Language Models (LLMs) in advancing information retrieval. KGs, essentially a type of graph database, model real-world entities and their relationships through interconnected nodes and edges. They are recognised for their capacity to empower systems to make contextualised decisions and enhance data-driven insights. This is one way in which KGs can complement the text-generation capabilities of LLMs: a KG helps the LLM not only predict the next word in a sentence, but also learn how the words in the text relate to each other. LLMs then gain a more nuanced understanding of entities, their attributes, and the relationships between them. Yet the synergy goes both ways: LLMs can help KGs grow at a faster pace by generating data for them. This aligns with the functionality of the Natural Language Processing tool we created, which this blog will elaborate on further.

Accelerating the process of concept identification

In our current project, Kike Franssen and I were asked to look for opportunities to optimise the process of identifying concepts to enrich the ontologies and taxonomies. A key responsibility within our team involved the thorough examination of regulations and law articles to identify concepts to be included in the graph database. This process involved verifying whether a given concept already existed in the database; if not, it had to be converted into RDF format for seamless integration into the graph database. RDF, short for Resource Description Framework, is a standardised framework for representing information and expressing relationships through triples: subject — predicate — object statements. An example of such a triple can be found below. This framework serves as the keystone in constructing semantic models and knowledge graphs, laying the foundation for robust information representation.

ont:Activity ont:takesPlaceOn ont:Location .

# subject is ont:Activity
# predicate is ont:takesPlaceOn
# object is ont:Location
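
To give an idea of what this conversion step can look like in practice, below is a minimal sketch using the Python library rdflib. The namespace URI, the ont:BorderCheck concept and its label are placeholders for illustration, not elements of our actual ontology.

from rdflib import Graph, Namespace, Literal, RDF, RDFS

# Placeholder namespace for illustration; the real ontology uses its own URIs.
ONT = Namespace("http://example.org/ontology/")

g = Graph()
g.bind("ont", ONT)

# The triple from the example above: an Activity takes place on a Location.
g.add((ONT.Activity, ONT.takesPlaceOn, ONT.Location))

# A newly identified (hypothetical) concept, typed and given a readable label.
g.add((ONT.BorderCheck, RDF.type, ONT.Activity))
g.add((ONT.BorderCheck, RDFS.label, Literal("border check", lang="en")))

print(g.serialize(format="turtle"))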

Regulations and law articles can be lengthy documents, often surpassing 70 pages. The manual handling of this extensive content proved time-consuming and inherently inefficient. It seemed counterintuitive to have skilled, educated professionals invest their time in such laborious tasks, especially in an era where technology is advancing rapidly. Recognising this inefficiency, we wanted to develop a tool that automates the process while retaining a ‘human in the loop’ approach. In the following sections, a demonstration of the functionality will be given.

An application on how LLMs complement KGs

The tool consists of a frontend Streamlit application and a backend where the LLM predicts new concepts from the text that is provided as input by the user. How these predictions are generated is visualised in the outline below:

Image 1: A rough outline of how the tool works. Inputs and outputs are shown in orange; processes are shown in blue. Source: Semantics presentation Kike Franssen & Eveline Schmidt.
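
As a rough illustration of the frontend side, the sketch below shows what a minimal Streamlit flow could look like. It is not the actual application: extract_text and predict_concepts are dummy placeholders standing in for the PDF extraction and the LLM backend described below.

import streamlit as st

def extract_text(pdf_file) -> str:
    # Placeholder for the PDF-extraction step described in the next section.
    return "..."

def predict_concepts(text: str) -> list[str]:
    # Placeholder for the LLM backend; returns dummy candidates here.
    return ["border check", "travel authorisation"]

st.title("Concept identification")
uploaded_pdf = st.file_uploader("Upload a regulation or law article", type="pdf")

if uploaded_pdf is not None:
    candidates = predict_concepts(extract_text(uploaded_pdf))
    selected = st.multiselect("Candidate concepts for the graph database", candidates)
    if st.button("Approve selection"):
        st.write(f"{len(selected)} concept(s) marked for conversion to RDF.")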

PDF Extraction

As can be seen in image 1, the tool requires two key inputs. Firstly, an established graph database consisting of concepts is needed. The second component is the user’s input text; in this use case, the input is given as PDF files of regulations and law articles, although the tool could also be used for other types of documents. The initial stage of the tool involves extracting the text from the regulations and law articles, which is facilitated by a PDF extractor. Only the concepts, presented as single terms, are required from the graph database.

import PyPDF2

# Read the regulation page by page and collect the raw text.
with open('regulation_example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    raw_text = ""
    for page in pdf_reader.pages:
        raw_text += page.extract_text()
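
For the first input, the list of existing concepts could, for example, be retrieved from the graph database with a SPARQL query. The endpoint URL and the skos:prefLabel property below are assumptions for illustration; the actual store may expose its concepts differently.

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint; the real graph database lives elsewhere.
sparql = SPARQLWrapper("http://localhost:7200/repositories/concepts")
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?label WHERE { ?concept skos:prefLabel ?label }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
concepts = [row["label"]["value"] for row in results["results"]["bindings"]]
print(f"Loaded {len(concepts)} concepts from the graph database.")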

Data Cleaning

Before the data becomes usable, a crucial next step is cleaning it. The concepts required minimal cleaning, as they were relatively straightforward. Nonetheless, a few concepts needed some refinement: for example, ‘HR Strategy / Policy’ had to be split into ‘HR Strategy’ and ‘HR Policy’. Cleaning the textual data extracted from regulations and law articles proved to be a more complicated task. Textual content in these types of documents tends to be disorderly. The image below highlights the presence of substantial irrelevant data, such as titles, headers and other textual and numerical references that are not relevant for retrieving the concepts. The removal of this redundant data was mainly achieved through regular expressions.

Image 2: Example of redundant data in regulations. Screenshot of the ETIAS regulation.
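
The snippet below gives a simplified impression of this regex-based cleaning. The patterns are illustrative examples of the kind of headers, page references and footnote markers that occur in such regulations; they are not the exact expressions used in the project.

import re

def clean_text(raw_text: str) -> str:
    text = raw_text
    # Drop recurring headers such as the journal title (illustrative pattern).
    text = re.sub(r"Official Journal of the European Union", "", text)
    # Drop page references such as 'L 236/4' and footnote markers such as '(12)'.
    text = re.sub(r"\bL\s?\d+/\d+\b", "", text)
    text = re.sub(r"\(\d+\)", "", text)
    # Collapse the whitespace that is left behind.
    return re.sub(r"\s+", " ", text).strip()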

Data Pre-processing

Afterwards, the PDF files were reduced to strings of text. This marks the end of the data cleaning process and the start of the pre-processing phase. To enhance computational efficiency, we opted to segment the textual data into chunks — a common pre-processing step. As part of our pre-processing strategy, we made a conscious decision not to eliminate stop words: retaining them is essential for preserving contextual understanding, a fundamental aspect of LLMs, which are designed to capture nuanced information. After dividing the chunks into train, test and validation datasets, we transformed the data into BIOE format. This format is common practice for Named Entity Recognition (NER) tasks and marks the Beginning, Inside, Outside, and End of a named entity. It serves as a labelling scheme for annotating entities within a text and ensures that each word is tagged with its role in a named entity. This way, the LLM learns to recognise the boundaries of concepts within a given context. Our rationale for adopting this format is that many concepts from the graph database consist of multiple words and should be regarded as a single, cohesive concept. This can be seen in the example of a BIOE-labelled sentence below:

Sentence: ‘The suspect was charged with second degree murder’.
BIOE labels: O B O O O B I E
BIOE-labelled sentence: ‘The (O) suspect (B) was (O) charged (O) with (O) second (B) degree (I) murder (E)’
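
To make the labelling concrete, here is a simplified sketch of how tokens matching a known concept could receive BIOE tags. Real annotation also has to deal with casing, inflections and overlapping matches, which are left out here; the concept list is a made-up example.

def bioe_labels(tokens: list[str], concepts: list[list[str]]) -> list[str]:
    # Start with 'Outside' for every token, then mark each concept match.
    labels = ["O"] * len(tokens)
    for concept in concepts:
        n = len(concept)
        for i in range(len(tokens) - n + 1):
            if [t.lower() for t in tokens[i:i + n]] == [c.lower() for c in concept]:
                labels[i] = "B"                  # Beginning of the entity
                if n > 1:
                    labels[i + n - 1] = "E"      # End of the entity
                    for j in range(i + 1, i + n - 1):
                        labels[j] = "I"          # Inside the entity
    return labels

tokens = "The suspect was charged with second degree murder".split()
concepts = [["suspect"], ["second", "degree", "murder"]]
print(list(zip(tokens, bioe_labels(tokens, concepts))))
# [('The', 'O'), ('suspect', 'B'), ..., ('second', 'B'), ('degree', 'I'), ('murder', 'E')]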

Curious to see what other pre-processing steps were taken? Soon my colleague Kike Franssen will publish the next blog on this topic. She will elaborate on the LLM utilised and explore the endless possibilities that emerge from the collaboration between KGs and LLMs.
