Automated Knowledge Graph Construction using ChatGPT

Milena Trajanoska
9 min read · Nov 29, 2023

Introduction

In this post, we will cover the topic of constructing a knowledge graph from raw text data using OpenAI’s gpt-3.5-turbo. LLMs have shown superior performance in text generation and question-answering tasks. Retrieval-augmented generation (RAG) has further improved their performance, allowing them to access up-to-date and domain-specific knowledge. Our goal in this post is to utilize LLMs as information extraction tools that transform raw texts into facts which can easily be queried for useful insights. But first, we need to define a few key concepts.

What is a knowledge graph?

A knowledge graph is a semantic network which represents and interlinks real-world entities. These entities often correspond to people, organizations, objects, events, and concepts. The knowledge graph consists of triplets having the following structure:

head → relation → tail

or in the terminology of the Semantic Web:

subject → predicate → object

The network representation allows us to extract and analyze the complex relationships that exist between these entities.

A knowledge graph is often accompanied by a definition of the concepts, relations, and their properties, called an ontology. The ontology is a formal specification that defines the concepts and their relations in the target domain, thus providing semantics for the network.

Ontologies are used by search engines and other automated agents on the Web to understand what the content of a specific Web page means in order to index it and display it correctly.

Case description

For this use-case, we are going to create a knowledge graph using OpenAI’s gpt-3.5-turbo from product descriptions in the Amazon Products Dataset.

Many ontologies are used on the Web to describe products, the most popular being the GoodRelations Ontology and the Product Types Ontology. Both of these ontologies extend the Schema.org Ontology.

Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet. Schema.org vocabulary can be used with many different encodings, including RDFa, Microdata and JSON-LD.
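
For illustration, here is a minimal JSON-LD snippet describing a product with the Schema.org vocabulary (the values are borrowed from the example product used later in this post):

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "YUVORA 3D Brick Wall Stickers",
  "brand": {
    "@type": "Brand",
    "name": "YUVORA"
  },
  "color": "White",
  "material": "PE Foam"
}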

For the task at hand, we are going to use the Schema.org definitions for products and related concepts, including their relations, to extract triplets from product descriptions.

Implementation

We are going to implement the solution in Python. First, we need to install and import the required libraries.

Import libraries and read the data

!pip install pandas openai sentence-transformers networkx

import json
import logging

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

Now, we are going to read the Amazon Products Dataset as a pandas dataframe.

data = pd.read_csv("amazon_products.csv")

We can see the contents of the dataset in the figure below. The dataset contains the following columns: ‘PRODUCT_ID’, ‘TITLE’, ‘BULLET_POINTS’, ‘DESCRIPTION’, ‘PRODUCT_TYPE_ID’, and ‘PRODUCT_LENGTH’. We are going to combine the columns ‘TITLE’, ‘BULLET_POINTS’, and ‘DESCRIPTION’ into one column ‘text’, which will represent the specification of the product that we are going to prompt ChatGPT to extract entities and relations from.

Fig 1. Dataset content
data['text'] = data['TITLE'] + data['BULLET_POINTS'] + data['DESCRIPTION']
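
Note that plain + concatenation yields NaN for any product with a missing column; a more defensive variant (a minimal sketch, not part of the original pipeline) fills the gaps first:

data['text'] = (
    data[['TITLE', 'BULLET_POINTS', 'DESCRIPTION']]
    .fillna('')
    .astype(str)
    .agg(' '.join, axis=1)
)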

Information extraction

We are going to instruct ChatGPT to extract entities and relations from a provided product specification and return the result as an array of JSON objects. The JSON objects must contain the following keys: ‘head’, ‘head_type’, ‘relation’, ‘tail’, and ‘tail_type’.

The ‘head’ key must contain the text of the extracted entity, and the ‘head_type’ key must contain its type, which must be one of the types from the list provided in the user prompt. The ‘relation’ key must contain the type of relation between the ‘head’ and the ‘tail’, the ‘tail’ key must contain the text of the extracted entity that is the object of the triplet, and the ‘tail_type’ key must contain the type of the tail entity.

We are going to use the entity types and relation types listed below to prompt ChatGPT for entity-relation extraction. We will map these entities and relations to the corresponding entities and relations from the Schema.org ontology. The keys in the mapping represent the entity and relation types provided to ChatGPT, and the values represent the URLs of the objects and properties from Schema.org.

# ENTITY TYPES:
entity_types = {
    "product": "https://schema.org/Product",
    "rating": "https://schema.org/AggregateRating",
    "price": "https://schema.org/Offer",
    "characteristic": "https://schema.org/PropertyValue",
    "material": "https://schema.org/Text",
    "manufacturer": "https://schema.org/Organization",
    "brand": "https://schema.org/Brand",
    "measurement": "https://schema.org/QuantitativeValue",
    "organization": "https://schema.org/Organization",
    "color": "https://schema.org/Text",
}

# RELATION TYPES:
relation_types = {
    "hasCharacteristic": "https://schema.org/additionalProperty",
    "hasColor": "https://schema.org/color",
    "hasBrand": "https://schema.org/brand",
    "isProducedBy": "https://schema.org/manufacturer",
    "hasMeasurement": "https://schema.org/hasMeasurement",
    "isSimilarTo": "https://schema.org/isSimilarTo",
    "madeOfMaterial": "https://schema.org/material",
    "hasPrice": "https://schema.org/offers",
    "hasRating": "https://schema.org/aggregateRating",
    "relatedTo": "https://schema.org/isRelatedTo"
}
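
As a sketch of how this mapping can later be applied (the helper below is our own illustration, not part of the extraction pipeline), a triplet returned by ChatGPT can be rewritten with its Schema.org URIs through simple dictionary lookups:

def to_schema_org(triplet):
    """Replace entity and relation type names with Schema.org URIs.

    Illustrative helper; falls back to the raw name when no mapping exists.
    """
    return {
        "head": triplet["head"],
        "head_type": entity_types.get(triplet["head_type"], triplet["head_type"]),
        "relation": relation_types.get(triplet["relation"], triplet["relation"]),
        "tail": triplet["tail"],
        "tail_type": entity_types.get(triplet["tail_type"], triplet["tail_type"]),
    }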

To perform information extraction with ChatGPT, we create an OpenAI client and, using the chat completions API, generate the output array of JSON objects for each identified relation in the raw product specification. We default to gpt-3.5-turbo, since its performance is good enough for this simple demonstration.

client = OpenAI(api_key="<YOUR_API_KEY>")

def extract_information(text, model="gpt-3.5-turbo"):
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_prompt.format(
                    entity_types=entity_types,
                    relation_types=relation_types,
                    specification=text
                )
            }
        ]
    )

    return completion.choices[0].message.content

Prompt engineering

The system_prompt variable contains the instructions guiding ChatGPT to extract entities and relations from the raw text, and return the result in the form of arrays of JSON objects, each having the keys: ‘head’, ‘head_type’, ‘relation’, ‘tail’, and ‘tail_type’.

system_prompt = """You are an expert agent specialized in analyzing product specifications in an online retail store.
Your task is to identify the entities and relations requested with the user prompt, from a given product specification.
You must generate the output in a JSON containing a list with JSON objects having the following keys: "head", "head_type", "relation", "tail", and "tail_type".
The "head" key must contain the text of the extracted entity with one of the types from the provided list in the user prompt, the "head_type"
key must contain the type of the extracted head entity which must be one of the types from the provided user list,
the "relation" key must contain the type of relation between the "head" and the "tail", the "tail" key must represent the text of an
extracted entity which is the tail of the relation, and the "tail_type" key must contain the type of the tail entity. Attempt to extract as
many entities and relations as you can.
"""

The user_prompt variable contains a single example of the required output for one specification from the dataset and prompts ChatGPT to extract entities and relations in the same way from the provided specification. This is an example of one-shot prompting with ChatGPT.

user_prompt = """Based on the following example, extract entities and relations from the provided text.
Use the following entity types:

# ENTITY TYPES:
{entity_types}

Use the following relation types:
{relation_types}

--> Beginning of example

# Specification
"YUVORA 3D Brick Wall Stickers | PE Foam Fancy Wallpaper for Walls,
Waterproof & Self Adhesive, White Color 3D Latest Unique Design Wallpaper for Home (70*70 CMT) -40 Tiles
[Made of soft PE foam,Anti Children's Collision,take care of your family.Waterproof, moist-proof and sound insulated. Easy clean and maintenance with wet cloth,economic wall covering material.,Self adhesive peel and stick wallpaper,Easy paste And removement .Easy To cut DIY the shape according to your room area,The embossed 3d wall sticker offers stunning visual impact. the tiles are light, water proof, anti-collision, they can be installed in minutes over a clean and sleek surface without any mess or specialized tools, and never crack with time.,Peel and stick 3d wallpaper is also an economic wall covering material, they will remain on your walls for as long as you wish them to be. The tiles can also be easily installed directly over existing panels or smooth surface.,Usage range: Featured walls,Kitchen,bedroom,living room, dinning room,TV walls,sofa background,office wall decoration,etc. Don't use in shower and rugged wall surface]
Provide high quality foam 3D wall panels self adhesive peel and stick wallpaper, made of soft PE foam,children's collision, waterproof, moist-proof and sound insulated,easy cleaning and maintenance with wet cloth,economic wall covering material, the material of 3D foam wallpaper is SAFE, easy to paste and remove . Easy to cut DIY the shape according to your decor area. Offers best quality products. This wallpaper we are is a real wallpaper with factory done self adhesive backing. You would be glad that you it. Product features High-density foaming technology Total Three production processes Can be use of up to 10 years Surface Treatment: 3D Deep Embossing Damask Pattern."

################

# Output
[
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "isProducedBy",
        "tail": "YUVORA",
        "tail_type": "manufacturer"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasCharacteristic",
        "tail": "Waterproof",
        "tail_type": "characteristic"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasCharacteristic",
        "tail": "Self Adhesive",
        "tail_type": "characteristic"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasColor",
        "tail": "White",
        "tail_type": "color"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasMeasurement",
        "tail": "70*70 CMT",
        "tail_type": "measurement"
    }},
    {{
        "head": "YUVORA 3D Brick Wall Stickers",
        "head_type": "product",
        "relation": "hasMeasurement",
        "tail": "40 tiles",
        "tail_type": "measurement"
    }}
]

--> End of example

For the following specification, extract entities and relations as in the provided example.

# Specification
{specification}
################

# Output

"""

Now, we call the extract_information function for each specification in the dataset and create a list of all extracted triplets which will represent our knowledge graph. For this demonstration, we will generate a knowledge graph using a subset of only 100 product specifications.

kg = []
for content in data['text'].values[:100]:
    try:
        extracted_relations = extract_information(content)
        extracted_relations = json.loads(extracted_relations)
        kg.extend(extracted_relations)
    except Exception as e:
        logging.error(e)

kg_relations = pd.DataFrame(kg)

The results from the information extraction are displayed in the figure below.

Fig 2. Results of the information extraction with ChatGPT

Entity resolution

Entity resolution (ER) is the process of disambiguating entities that correspond to real-world concepts. In this case, we will perform basic entity resolution on the head and tail entities in the dataset, to obtain a more concise representation of the facts present in the texts.

We will perform entity resolution using NLP techniques. More specifically, we are going to create embeddings for each head entity using the sentence-transformers library and calculate the cosine similarity between them.

We will use the ‘all-MiniLM-L6-v2’ sentence transformer to create the embeddings, since it is a fast and relatively accurate model suitable for this use-case. For each pair of head entities, we check whether the similarity is larger than 0.95; if so, we consider these entities to be the same and normalize their text values to be equal. The same reasoning applies to the tail entities.

This process helps us achieve the following result: if one entity has the value ‘Microsoft’ and another has ‘Microsoft Inc.’, the two are merged into one.

We load and use the embedding model in the following way to calculate the similarity between the first and second head entities.

heads = kg_relations['head'].values
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(heads)
similarity = util.cos_sim(embeddings[0], embeddings[1])
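
The snippet above only compares the first two head entities. A minimal sketch of the full normalization step could look as follows; the 0.95 threshold comes from the description above, while picking the first occurrence as the canonical value is our own assumption:

def resolve_entities(df, column, model, threshold=0.95):
    """Merge near-duplicate entities in a column via embedding similarity."""
    values = list(df[column].unique())
    embeddings = model.encode(values)
    # Pairwise cosine similarities between all unique entity strings.
    sim_matrix = util.cos_sim(embeddings, embeddings)
    canonical = {}
    for i in range(len(values)):
        for j in range(i):
            if sim_matrix[i][j] > threshold:
                # Normalize the later value to the earlier (canonical) one.
                canonical[values[i]] = canonical.get(values[j], values[j])
                break
    return df.assign(**{column: df[column].replace(canonical)})

kg_relations = resolve_entities(kg_relations, 'head', embedding_model)
kg_relations = resolve_entities(kg_relations, 'tail', embedding_model)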

In order to visualize the extracted knowledge graph after entity resolution, we use the networkx Python library. First, we create an empty graph, and add each extracted relation to the graph.

G = nx.Graph()
for _, row in kg_relations.iterrows():
    G.add_edge(row['head'], row['tail'], label=row['relation'])

To draw the graph we can use the following code:

pos = nx.spring_layout(G, seed=47, k=0.9)
labels = nx.get_edge_attributes(G, 'label')
plt.figure(figsize=(15, 15))
nx.draw(G, pos, with_labels=True, font_size=10, node_size=700, node_color='lightblue', edge_color='gray', alpha=0.6)
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=8, label_pos=0.3, verticalalignment='baseline')
plt.title('Product Knowledge Graph')
plt.show()

A subgraph from the generated knowledge graph is displayed in the figure below:

Fig 3. Product Knowledge Graph Sample

In this way, we can connect different products through the characteristics they share. This is useful for learning common attributes between products, normalizing product specifications, describing resources on the Web using a common schema such as Schema.org, and even making product recommendations based on the product specifications.
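
As a hint of what such queries look like, here is a minimal sketch (the function is our own illustration) that lists all nodes directly connected to a given attribute node in the graph built above:

def products_sharing_attribute(G, attribute):
    """Return all nodes directly connected to an attribute node."""
    if attribute not in G:
        return []
    return list(G.neighbors(attribute))

print(products_sharing_attribute(G, 'Waterproof'))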

Conclusion

Most corporations have vast amounts of unstructured data sitting unused in data lakes. Building a knowledge graph on top of this data helps surface information trapped in unprocessed text corpora and puts it to work for more informed decision-making.

So far, we have seen that LLMs can be used to extract triplets of entities and relations from raw text data and automatically construct a knowledge graph. In the next post, we will attempt to create a product recommendation system based on the extracted knowledge graph.
