Building An Academic Knowledge Graph with OpenAI & Graph Database — Part 3

Use GPT-3 for Word Embeddings & Semantic Search

Fanghua (Joshua) Yu
8 min read · Feb 12, 2023


The sunset at Wollongong Head Lighthouse, Sydney Australia by the author

This is the 3rd part of my trilogy, Building An Academic Knowledge Graph with OpenAI & Graph Database, and I am going to use the OpenAI API for word/text embeddings to power knowledge-based search.

Text embedding is the process of representing text data in a continuous, dense, and meaningful high-dimensional vector space, usually referred to as an embedding space. Embedding text data into a continuous vector representation enables algorithms to capture the semantic relationships between words and documents, because words with similar real-world meanings should be represented by vectors that lie close to each other.

There are articles on Medium that give quite a good overview of common word embedding methods, and I am sure many more are available elsewhere.

In the previous episodes of this series (links below), we focused on the Completion function of OpenAI's GPT-3 model. Today, let's take a look at another important function, i.e. Embedding.

Part 1 — Overall pipeline, data ingestion and Cypher query generation

Part 2 — Entity and Relationship Extraction to enrich knowledge graph

I. OpenAI Word Embedding

Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts. Since the initial launch of the OpenAI /embeddings endpoint, many applications have incorporated embeddings to personalize, recommend, and search content.

There are 2 generations of models currently published by OpenAI, among which text-embedding-ada-002 (the 2nd generation, as its suffix indicates) outperforms all the older embedding models on text search, code search, and sentence similarity tasks, and gets comparable performance on text classification.

Source: https://openai.com/blog/new-and-improved-embedding-model/

Using the same text-embedding-ada-002 model, we can generate embeddings of texts for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

The resulting vector has 1,536 dimensions, and the model supports a context of up to 8,192 tokens.

To get an embedding by calling the OpenAI API, it’s as simple as sending the text string as the parameter to it:

curl https://api.openai.com/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{"input": "Your text string goes here",
"model":"text-embedding-ada-002"}'

The response is in the JSON format:

{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
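The same request can also be issued from inside Neo4j using the APOC library, which is exactly how we will do it in section IV. Below is a minimal sketch, assuming a Browser parameter $openai_api_key that holds your API key (everything else mirrors the curl call above):

// Hypothetical sketch: call the embeddings endpoint via APOC and check the vector size
WITH apoc.convert.toJson({
       model: 'text-embedding-ada-002',
       input: 'Your text string goes here'
     }) AS payload
CALL apoc.load.jsonParams(
  'https://api.openai.com/v1/embeddings',
  {`Content-Type`: 'application/json', Authorization: 'Bearer ' + $openai_api_key},
  payload, null, {}
) YIELD value
RETURN size(value.data[0].embedding) AS dimensions; // 1536 for text-embedding-ada-002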

II. Text Embedding Powered Semantic Search

Searching text content has been a long-standing subject in computer science. Numerous methods have been invented and solutions built, so why can embeddings do a better job?

Word embeddings can perform better semantic search compared to other text representations because they capture the semantic relationships between words in a continuous and dense vector representation. This representation is based on the distributional hypothesis, which states that words that occur in similar contexts have similar meanings.

In a word embedding, each word is represented by a high-dimensional vector, and the distance between two word vectors represents the similarity between the words they represent. For example, the vectors for words like “dog” and “cat” may be close to each other in the embedding space, while the vectors for words like “dog” and “car” may be far apart. This enables algorithms to capture the semantic relationships between words in a way that is more expressive than other text representations, such as one-hot encoding or term frequency-inverse document frequency (TF-IDF).

Embeddings can be used for semantic search in several ways:

  1. Nearest Neighbors search: One common use case for text embeddings is to perform a nearest neighbors search over the embeddings. The nearest neighbors of a query vector represent the documents most similar to the query in the embedding space, and there are well-known nearest neighbors algorithms to find them efficiently.
  2. Document retrieval: Another use case for text embeddings is document retrieval. In this case, the goal is to retrieve a set of documents that match a query. This can be done by computing the cosine similarity between the query embedding and the embeddings of the documents (see the sketch after this list). The documents with the highest similarity scores are returned as the most relevant.
  3. Text classification: Another use case for text embeddings is text classification. In this case, the goal is to classify a document into one of several predefined categories. To perform this task, you first compute the embeddings for the training documents, then use a machine learning algorithm such as support vector machines, random forests, or deep neural networks to train a classifier. Finally, you compute the embedding for a new document and use the trained classifier to predict its category.
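As a rough illustration of use cases 1 and 2, below is a minimal Cypher sketch (not the code we will use later), assuming hypothetical Document nodes whose embedding list property is already populated and a query embedding passed in as the parameter $queryEmb. It computes cosine similarity in plain Cypher and ranks the documents by it:

// Hypothetical sketch: rank Document nodes by cosine similarity to $queryEmb
MATCH (d:Document)
WHERE d.embedding IS NOT NULL
WITH d,
     // cosine similarity = dot(A, B) / (||A|| * ||B||)
     reduce(dot = 0.0, i IN range(0, size(d.embedding) - 1) |
            dot + d.embedding[i] * $queryEmb[i]) /
     ( sqrt(reduce(s = 0.0, x IN d.embedding | s + x * x)) *
       sqrt(reduce(s = 0.0, x IN $queryEmb | s + x * x)) ) AS similarity
RETURN d.title AS title, similarity
ORDER BY similarity DESC LIMIT 10;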

Now, with the OpenAI API, we can easily get text embeddings for words and documents.

III. Similarity Functions

In all of the semantic search use cases based on word embeddings, it is always necessary to find the texts most similar to a certain query by comparing their embeddings / vector representations. There is a group of similarity algorithms to choose from, and we will use Cosine Similarity here.

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is a widely used similarity metric in the field of natural language processing and information retrieval.

Given two vectors A and B, cosine similarity is defined as the cosine of the angle between the two vectors, and is calculated as the dot product of the two vectors divided by the product of the magnitudes of the vectors. Mathematically, it can be expressed as:

cosine similarity = (A . B) / (||A|| * ||B||)
Cosine similarity on a 2-dimensional space

As a result, the more similar two texts are, the smaller the angle between their vectors is. If two texts are identical, their Cosine Similarity score is 1 (an angle of 0 degrees). Cosine Similarity ranges from -1 to 1 (and is rarely negative for natural-language embeddings), so it serves well as a measure of the degree of similarity.
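As a quick worked example, take A = (1, 0) and B = (1, 1): the dot product is 1, the magnitudes are 1 and √2, so the cosine similarity is 1/√2 ≈ 0.7071, corresponding to an angle of 45 degrees. The same value can be checked with the GDS function we will use in the next section (assuming the GDS plugin is installed):

// A = [1, 0], B = [1, 1] => (A . B) = 1, ||A|| = 1, ||B|| = sqrt(2)
// cosine similarity = 1 / sqrt(2) ≈ 0.7071, i.e. an angle of 45 degrees
RETURN gds.similarity.cosine([1.0, 0.0], [1.0, 1.0]) AS similarity;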

IV. Search for Papers Based on Embeddings

In this project, we have used Cypher and the APOC procedure library to perform all required tasks. To calculate similarity, the easiest way is to use another library from Neo4j, i.e. the Graph Data Science (GDS) library.

GDS is a suite of graph algorithms and libraries for graph analytics built on top of the Neo4j graph database. It provides a range of algorithms and libraries for graph analytics, including centrality measures, community detection, pathfinding, similarity, link prediction and graph embeddings, as well as a framework for building and running custom graph algorithms.

GDS is designed to make it easy for users to perform graph analytics on large and complex graph data, by providing an optimized graph processing engine and a library of pre-built algorithms that can be run with minimal configuration.

If you are using Neo4j Desktop, GDS can be installed as a plugin.
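A quick way to confirm the plugin is available, as a small check assuming a recent GDS release, is to ask for its version:

// Should return the installed GDS version number
RETURN gds.version() AS version;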

To run semantic search over the titles of papers in our Academic Knowledge Graph, we simply take the 2 steps below:

1) Call the OpenAI API to get an embedding of every paper title, and store it in the embedding property of the Title node. This is only required once.

2) For each search phrase, call the OpenAI API to get its embedding as well, and use Cosine Similarity to find the titles most similar to it.

Let’s have a look at the code for step 1).

// 1.1)  api parameters
:param openai_api_url=>'https://api.openai.com/v1/embeddings';
:param openai_api_header_content_type=>"application/json";
:param openai_api_header_auth=>"Bearer " + '***OPENAI-API***';
:param openai_embedding_model=>"text-embedding-ada-002";

// 1.2) get embeddings of all titles
:auto MATCH (t:Title)
WHERE t.embedding IS NULL
CALL {
  WITH t
  WITH t,
       apoc.convert.toJson({
         model: $openai_embedding_model,
         input: t.text
       }) AS payload
  CALL apoc.load.jsonParams(
    $openai_api_url,
    {
      `Content-Type`: $openai_api_header_content_type,
      Authorization: $openai_api_header_auth
    },
    payload, null, {}
  ) YIELD value
  SET t.embedding = value.data[0].embedding
} IN TRANSACTIONS OF 20 ROWS
RETURN count(t) AS count;

Because OpenAI restricts free-tier accounts to 20 API calls per minute, you may get HTTP Error 429 and only have some of the embeddings generated. The workaround is to run statement 1.2) a few times until all Title nodes have a non-null embedding property value.
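To see how far you have progressed between runs, a simple count query like the sketch below helps (count(t.embedding) only counts non-null values). Alternatively, an apoc.util.sleep() call inside the subquery can be used to throttle the requests and stay under the rate limit.

// How many Title nodes still need an embedding?
MATCH (t:Title)
RETURN count(t) AS total,
       count(t.embedding) AS embedded,
       count(t) - count(t.embedding) AS remaining;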

Below is the code to do the actual search:

// 2.1) search text
:param search=>'unsupervised learning of graph embedding';

// 2.2) semantic search based on embeddings. Only top 10 most similar titles are returned.

WITH apoc.convert.toJson({
       model: $openai_embedding_model,
       input: $search
     }) AS payload
CALL apoc.load.jsonParams(
  $openai_api_url,
  {
    `Content-Type`: $openai_api_header_content_type,
    Authorization: $openai_api_header_auth
  },
  payload, null, {}
) YIELD value
WITH value.data[0].embedding AS searchEmb
MATCH (t:Title)
WHERE t.embedding IS NOT NULL
RETURN t.text AS title,
       gds.similarity.cosine(t.embedding, searchEmb) AS similarity
ORDER BY similarity DESC LIMIT 10;

Let’s have a look at some examples (the search is case-insensitive):

i. Search for Graph Wavelet Neural Network

The exact match returns 0.9999972… (=1)

There is an exactly matching title, which has a score of nearly 1.0.

ii. Search for graph neural network

What is interesting in this list is the title Graph Learning: A Survey. Apparently there is a high similarity between neural network and learning in this context.

iii. Search for database

In our limited sample data there is no database-related paper, but the results still returned some titles containing keywords like relational, structured data, transactions etc., even though the similarity scores are all below 0.8. Again, it amazed me!

V. Further Discussions

In this episode, we looked at another function offered by the GPT-3 model, i.e. word embeddings. Word embeddings are a powerful tool for semantic search because they truly capture the semantic relationships between words, which enables algorithms to perform more expressive and semantically relevant searches.

In this sample knowledge graph, we loaded metadata of 100 papers from the arXiv query API for the search text graph neural network, generated embeddings of the paper titles, and then used Cosine Similarity to find the titles most similar to a given search phrase. For a knowledge graph of this size, this approach works perfectly. However, if there were millions of papers, it would take a significant amount of time to go through all of the embeddings and find the most similar ones. To make this a more scalable and performant solution for real use over large data volumes, there are other techniques to apply, which require special indexing methods and graph algorithms over vectors in the Neo4j graph DBMS; I will cover them in future articles.

Today’s code can be found in my GitHub repository. Happy coding!


Fanghua (Joshua) Yu

I believe our lives become more meaningful when we are connected, so is data. Happy to connect and share: https://www.linkedin.com/in/joshuayu/