Building An Academic Knowledge Graph with OpenAI & Graph Database

Use GPT-3 in the End-to-End Pipeline of Knowledge Graph — Part 1

Fanghua (Joshua) Yu
8 min read · Jan 29, 2023


The Aurora Light Captured in a Winter Night in Iceland by the Author

In my last article, I showed a simple but powerful example of how to add natural-language Q&A features to a knowledge graph, in just a few easy-to-follow steps.

In fact, what GPT-3 has brought us goes far beyond translating one language into another by learning from a few samples. In this trilogy, I will demonstrate more exciting things GPT-3 can do, using an academic knowledge graph as a sample project. This is the first part.

I. What You Need

  1. An OpenAI API key, which is generated after you sign up for OpenAI.
  2. A running Neo4j Graph Database instance.

I am using Neo4j Desktop for this project, which can be downloaded for FREE from here. It supports both Windows and Mac, and comes bundled with the Neo4j Enterprise Graph DBMS. After installation, please complete the following steps to create an empty graph database:

1) Create a Project

2) Add a Local DBMS

Choose a name for the DBMS, a password, and a version of 4.4.15 or above. This needs to be a local DBMS.

Click the Create button to download the DBMS package of the specified version, then have it installed and running. This may take some minutes.

3) Install APOC Plugin

This project requires some procedures from the APOC plugin of Neo4j, so remember to have it installed. Desktop will pick the APOC version that matches your Graph DBMS version.

APOC (Awesome Procedures On Cypher) is a collection of over 600 procedures and functions developed for advanced users of Neo4j to extend the existing capabilities of the Cypher query language. For more great articles on APOC, you can check them here.

Note: I didn’t choose Neo4j AuraDB, the DBaaS product from Neo4j, because one procedure is not available on AuraDB due to security considerations. I chose to build everything using Cypher + APOC ONLY in this project, but it shouldn’t be too hard to wrap the code into an application client using a language like Python or JavaScript, make small changes to the procedure call, and so make it work on AuraDB too. I will explain this further when we get there.

II. The Data Pipeline

Below is a highly abstracted illustration of the data pipeline of our project:

1. Data Source

For the Academic Knowledge Graph, we will use arXiv for the metadata of published papers. arXiv is a free distribution service and an open-access archive for 2,196,510 scholarly articles (as of late January 2023) in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It also provides a RESTful API for paper search. For example, the following URL requests a search for ‘graph machine learning’, starting from the first result and returning 20 entries per page:

https://export.arxiv.org/api/query?search_query=all:graph%20machine%20learning&start=0&max_results=20

2. Ingest

Because the arXiv API returns results in XML format, we will use the apoc.load.xml procedure from the APOC plugin to parse the metadata of each paper and save it into our knowledge graph.

Details of the ingestion steps are covered later in this article. I am also going to show you how to use GPT-3 to generate Cypher statements to query it!

3. Enrich

The metadata of papers contains useful information such as title, summary and authors, but more knowledge is in fact embedded in the text of the title and summary. Again, this is something GPT-3 is highly capable of extracting. This project ran some experiments using GPT-3 to do entity and relationship extraction over titles and summaries. The results provide knowledge at a more granular level for the loaded papers by linking them to Concepts, as well as linking Concepts to other Concepts. This enables more exciting use cases for connected data analysis.
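To make the target structure concrete, here is a minimal Cypher sketch of how extracted concepts could be merged into the graph. The Concept label comes from the description above, while the HAS_CONCEPT and RELATED_TO relationship types and the parameter names are assumptions for illustration only:

// A minimal sketch, assuming HAS_CONCEPT / RELATED_TO relationship types:
// link a paper to two concepts extracted by GPT-3 from its title and summary,
// and link the two related concepts to each other.
MATCH (p:Paper {id: $paper_id})
MERGE (c1:Concept {name: $concept1})
MERGE (c2:Concept {name: $concept2})
MERGE (p)-[:HAS_CONCEPT]->(c1)
MERGE (p)-[:HAS_CONCEPT]->(c2)
MERGE (c1)-[:RELATED_TO]->(c2);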

Details of the Enrich process are covered in Part 2 of this series.

4. Use

Some common use cases for an Academic Knowledge Graph will be discussed and experimented with, but using a new approach. In Part 3 of this series, I will demonstrate other features of GPT-3, e.g. text embedding, to further enrich the KG, and show which use cases can benefit from those features.

III. Data Ingestion

  1. The Graph Data Model

Below is the graph model for our academic paper knowledge graph. It should be very intuitive and easy to understand for the subject, though it may look different from what you are used to if most of your data modelling experience comes from the relational world (RDBMS). Circles represent nodes of a certain label (e.g. Paper), which are connected by arrows, i.e. the relationships.
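In case the model diagram does not come through here, a quick way to read the model is via a sample query over its patterns. HAS_AUTHOR is the relationship type used later in this article, while PUBLISHED_IN is an assumed name for the Paper-to-Year relationship:

// Model at a glance: Paper nodes connect to Author and Year nodes
// (PUBLISHED_IN is an assumed name for the Paper-to-Year relationship)
MATCH (p:Paper)-[:HAS_AUTHOR]->(a:Author)
OPTIONAL MATCH (p)-[:PUBLISHED_IN]->(y:Year)
RETURN p.title, a.fullname, y.year_number
LIMIT 5;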

There are a lot of useful articles on native graph modelling by my colleagues Michael Hunger, David Allen and many others, which are great content to read if you’d like to know more.

There are many knowledge graphs created as RDF stores, but here we use the labeled property graph because of its flexibility and scalability. My colleague Dr. Jesús Barrasa has run a workshop comparing the major differences between the two, and below is the link to the post.

Before the start of data ingestion, don’t forget to create some Constraints:

// schema

CREATE CONSTRAINT FOR (p:Paper) REQUIRE (p.id) IS UNIQUE;
CREATE CONSTRAINT FOR (a:Author) REQUIRE (a.fullname) IS UNIQUE;
CREATE CONSTRAINT FOR (y:Year) REQUIRE (y.year_number) IS UNIQUE;

With these uniqueness constraints, every paper will be unique in our database. By creating a uniqueness constraint, the DBMS will also create an index on the same property, so as to optimize queries over that property.
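For example, with the constraint on :Paper(id) in place, re-running the MERGE below will not create a duplicate node, and the match is served by the backing index (a minimal illustration using the sample entry shown later in the response):

// Idempotent upsert of a paper, backed by the uniqueness constraint and its index
MERGE (p:Paper {id: 'http://arxiv.org/abs/2201.01288v1'})
  ON CREATE SET p.title = 'Automated Graph Machine Learning: Approaches, Libraries and Directions'
RETURN p;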

2. The Request/Response

arXiv provides a RESTful API which returns results in XML format. We use apoc.load.xml() to call the API and parse the response:

// Define parameters
:param arxiv_endpoint=>'https://export.arxiv.org/api/query?';
:param search_query=>'search_query=' + 'all:' + 'graph%20machine%20learning';
:param start_index=>'start=' + toString(0);
:param max_results=>'max_results=' + toString(50);

// Call arXiv API
CALL apoc.load.xml($arxiv_endpoint + $search_query + '&' + $start_index + '&' + $max_results) YIELD value
RETURN value;

The returned variable value contains the response in the following format:

{
"value": {
"_children": [
{
"_type": "link",
"rel": "self",
"href": "http://arxiv.org/api/query?search_query%3Dall%3Agraph%20machine%20learningstart%3D0%26id_list%3D%26start%3D0%26max_results%3D20",
"type": "application/atom+xml"
},
... ... ... ...
{
"_children": [
{
"_type": "id",
"_text": "http://arxiv.org/abs/2201.01288v1"
},
{
"_type": "published",
"_text": "2022-01-04T18:31:31Z"
},
{
"_type": "title",
"_text": "Automated Graph Machine Learning: Approaches, Libraries and Directions"
},
{
"_type": "summary",
"_text": "Graph machine learning ......."
},
{
"_children": [
{
"_type": "name",
"_text": "Xin Wang"
}
],
"_type": "author"
}
],
"_type": "entry"
},
... ... ... ...
]
}
}

3. The Ingestion Process

The complete code can be found on GitHub. The pseudocode above should explain the process clearly enough.
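As a rough guide, the sketch below shows the general shape of that ingestion in Cypher. It walks the _children structure returned by apoc.load.xml and merges Paper, Author and Year nodes; PUBLISHED_IN and the exact property names are assumptions here, so please refer to the GitHub repository for the exact statements:

// A simplified sketch of the ingestion (see GitHub for the full version);
// PUBLISHED_IN is an assumed relationship type for linking Paper to Year
CALL apoc.load.xml($arxiv_endpoint + $search_query + '&' + $start_index + '&' + $max_results) YIELD value
UNWIND value._children AS entry
WITH entry WHERE entry._type = 'entry'
WITH entry,
     [x IN entry._children WHERE x._type = 'id'][0]._text        AS id,
     [x IN entry._children WHERE x._type = 'title'][0]._text     AS title,
     [x IN entry._children WHERE x._type = 'summary'][0]._text   AS summary,
     [x IN entry._children WHERE x._type = 'published'][0]._text AS published
MERGE (p:Paper {id: id})
  SET p.title = title, p.summary = summary, p.published = published
MERGE (y:Year {year_number: toInteger(substring(published, 0, 4))})
MERGE (p)-[:PUBLISHED_IN]->(y)
WITH p, entry
UNWIND [x IN entry._children WHERE x._type = 'author'] AS author
WITH p, [n IN author._children WHERE n._type = 'name'][0]._text AS fullname
MERGE (a:Author {fullname: fullname})
MERGE (p)-[:HAS_AUTHOR]->(a);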

IV. Test Our KG

Once we have loaded some data into our graph database, it is usually time to write some queries and test the results, but this time I will not start typing MATCH …, because I’d like GPT-3 to do the job for me.

GPT is a generative language model, which is what the letter G stands for. Given some context of the problem to solve, it can figure out the most relevant answer. In this case, I’d like it to:

generate the Cypher statement for a specific question in natural language

by

telling it what the question is, as well as context such as what the graph model looks like

The two parts above, combined, form the prompt. OpenAI built the Codex model on top of GPT-3 specifically for code generation tasks; it supports many popular query languages, including SQL, and it works well for Cypher too. Let’s launch the Codex Playground and have a try:

The prompt:

The generated text / Cypher statement:

Once we provide the metadata of the graph model as part of the prompt, GPT can learn from it the labels of the nodes (Paper, Author), the type of the relationship (HAS_AUTHOR), and the properties of both nodes and relationships, and even the unique identifier of the Paper node (which is id). So amazing!
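In case the screenshots do not reproduce well here, the sketch below illustrates the shape of such a prompt and a plausible completion. The schema lines follow the format produced by the generator statements shown next, and the question and generated Cypher are illustrative rather than a recorded Codex response:

# Paper(id,published,title,summary)
# Author(fullname)
# (:Paper) -[:HAS_AUTHOR]-> (:Author)
# Question: Who are the authors of the paper titled
# "Automated Graph Machine Learning: Approaches, Libraries and Directions"?

MATCH (p:Paper {title: "Automated Graph Machine Learning: Approaches, Libraries and Directions"})-[:HAS_AUTHOR]->(a:Author)
RETURN a.fullname;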

To make this a generic process, I created some Cypher statements to generate prompt text from any existing graph database, covering nodes, relationships and uniqueness constraints:

// a) Generate prompt for nodes

CALL apoc.meta.data() YIELD label, property, type, elementType
WITH label, property, type, elementType
WHERE elementType <> 'relationship' AND type <> "RELATIONSHIP"
WITH label, collect(property) AS props
RETURN '# ' + label + '(' +
    replace(
        reduce(propstr = '', p IN props | propstr + p + ',') + ')',
        ',)',
        ')'
    ) AS statement;
// b) Generate prompt for relationships

CALL apoc.meta.data() YIELD label, property, type, other, elementType
WITH label, property, type, other, elementType
WHERE elementType <> 'relationship' AND type = "RELATIONSHIP"
WITH label, property, other
UNWIND other AS label2
RETURN '# (:' + label + ') -[:' + property + ']-> (:' + label2 + ')' AS statement;
// c) Generate prompt for Uniqueness Constraints

SHOW UNIQUE CONSTRAINT YIELD *
WHERE entityType = 'NODE'
RETURN 'CREATE CONSTRAINT FOR (n:`'
+ labelsOrTypes[0] + '`) REQUIRE ('
+ replace(reduce(propstr = '', p IN properties | propstr + 'n.`' + p + '`,') + ')' + ' IS UNIQUE;', ',)', ')')
AS statement;
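To illustrate what these three statements emit, running them against the model in this article would produce prompt lines roughly like the following (the exact property list depends on what has been loaded, and PUBLISHED_IN is the assumed Paper-to-Year relationship from earlier, so treat this as a sketch):

# Paper(id,published,title,summary)
# Author(fullname)
# Year(year_number)
# (:Paper) -[:HAS_AUTHOR]-> (:Author)
# (:Paper) -[:PUBLISHED_IN]-> (:Year)
CREATE CONSTRAINT FOR (n:`Paper`) REQUIRE (n.`id`) IS UNIQUE;
CREATE CONSTRAINT FOR (n:`Author`) REQUIRE (n.`fullname`) IS UNIQUE;
CREATE CONSTRAINT FOR (n:`Year`) REQUIRE (n.`year_number`) IS UNIQUE;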

And below are some results I got:

In the last example, GPT didn’t get the direction of the HAS_AUTHOR relationship right, but it is still impressive enough, isn’t it?

Summary

In this episode, we created a knowledge graph for academic papers by querying arXiv with the APOC procedure apoc.load.xml, and also tested the GPT-3 / Codex model for Cypher statement generation. Hope you enjoyed it!


Fanghua (Joshua) Yu

I believe our lives become more meaningful when we are connected, so is data. Happy to connect and share: https://www.linkedin.com/in/joshuayu/