Simple Schemes for Knowledge Graph Embedding

Preston Carlson
Stanford CS224W GraphML Tutorials
22 min read · Jan 26, 2022

By Lucas Tao, Samuel Winson Tanuwidjaja, Preston Carlson as part of the Stanford CS224W course project.

We will assume the following background:

  • Familiarity with the basic principles of Machine Learning and Deep Learning.
  • Familiarity with complex numbers, and how multiplication by imaginary numbers can be used to express rotations.
    If you aren’t comfortable with operations involving complex numbers, the YouTube channel 3Blue1Brown has an excellent introduction to complex numbers.
  • A working knowledge of PyTorch is expected for the final section titled “Coding Tutorial for TransE and RotatE”. No knowledge of PyTorch is needed before then.
  • The coding tutorial is based on this colab, which is also linked in the tutorial section.

We assume no familiarity with Graph Machine Learning.

Knowledge Graphs

In many real-world applications, it is useful to have an understanding of how different concepts relate to one another. A common, expressive way of representing this information is with a knowledge graph [1]. A knowledge graph is a collection of nodes and edges where each node represents a real-world entity, and each edge represents a relation between entities.

For example, let’s consider the information that Willem Dafoe and Tom Holland both star in Spider-Man: No Way Home, which was released in 2021. We can represent this information in the following knowledge graph:

Example Knowledge Graph

Here the nodes “Tom Holland” and “Willem Dafoe” are connected to the “Spider-Man: No Way Home” node with edges representing the “Stars In” relation, “Spider-Man: No Way Home” is connected to “Tom Holland” and “Willem Dafoe” via edges representing the “Features” relation, and “Spider-Man: No Way Home” is connected to “2021” via the “Released in year” relation.

Such a graph can allow us to answer questions about the entities and relations in the graph. (For example, we can use the graph above to answer the query “Who did Tom Holland star in a movie with in 2021?”) Because of this, real-world knowledge graphs are often used for question answering applications, as in search engines and conversation assistants.

And, when knowledge graphs are incomplete, they can be used to infer missing relations/edges between entities. For example, after building a knowledge graph from information on a community wiki, we will likely find that some information is missing from our graph:

Example of an Incomplete Knowledge Graph

Here, we see that “Spider-Man: No Way Home” is missing a genre. We might reasonably assume that similar movies have the same genre, and so predict that, since “Spider-Man: No Way Home” is similar to “The Avengers” and “Spider-Man”, it is an “Action” movie.
Finding such missing edges is a task called knowledge graph completion.

Knowledge Graph Embeddings

But how might we determine the similarity of two movies? Or any two entities in a knowledge graph? For this task, we will use a neural network to learn an embedding [2] of each node in our knowledge graph.

Generally, an embedding is a mapping from a discrete variable to a real-valued vector. Embeddings are often used in NLP to learn low-dimensional representations of documents. Some information about each individual document is lost, but in exchange we gain a representation that allows documents to be readily compared.

In our case, we want to train our neural network to represent each node in our knowledge graph as an n-dimensional vector. (Or, equivalently, a location in n-dimensional space.) Then, when comparing the similarity between entities, we can just calculate the distance between their embeddings! (I.e. entities with embeddings closer together are more similar than entities with embeddings that are further apart.)

We also need to train our neural network to represent knowledge graph relations in this embedding space, or we will lose a lot of the information in our knowledge graph. There are many methods for learning these embeddings, and in this article, we will discuss two embedding schemes that represent relations as movement in the embedding space.

Embedding Scheme One: TransE

One natural embedding scheme would be to represent each relation as a translation in embedding space.

Consider a head entity h related to a tail entity t via a relation r in our knowledge graph. (We will refer to this as a positive (h,r,t) triplet.) Using a translation embedding scheme, we can simply add h and r to move to t. More formally, we want to learn embeddings of h, r, and t such that h + r ≈ t.

This is precisely what our first embedding scheme, TransE [3], tries to achieve.

In honor of the new “Spider-Man: No Way Home” movie that was recently released, let’s continue to use Spider-Man himself (Tom Holland) and the Green Goblin (Willem Dafoe) as part of examples to help us better visualize the mechanics of TransE’s embeddings.

Tom Holland is English, whereas Willem Dafoe is American. Expressing these statements as (head, relation, tail) triplets yields the following:

  1. (head: Tom Holland, relation: Nationality, tail: English)
  2. (head: Willem Dafoe, relation: Nationality, tail: American)

For ease of visualization, let’s imagine TransE representing the above relations in a 2-dimensional vector space. The visualization will be as follows:

Simple TransE Embedding

As we can see from the figure, beginning from the embedding of head entity “Tom Holland”, we add the embedding of the “Nationality” relation, translating the head entity to be near the embedding of the tail entity “English”. Similarly, we see “Willem Dafoe” + “Nationality” ≈ “American”.

Also notice that when we add the embedding of “Nationality”, we apply the same translation to each node. This is because TransE learns a single embedding for each relation, which it then applies to all head entities in existing triplets with that relation. (E.g. all nodes use the same embedding of “Nationality”, but “Nationality” and “Stars in” have different embeddings.)
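To make this concrete, here is a minimal PyTorch sketch of the idea with hypothetical 2-D embeddings (the numbers are made up for illustration, not taken from a trained model):

import torch

# Hypothetical 2-D embeddings (illustrative values only)
tom_holland  = torch.tensor([1.0, 1.0])
willem_dafoe = torch.tensor([1.0, 3.0])
nationality  = torch.tensor([2.0, 0.5])   # one shared embedding for the relation
english      = torch.tensor([3.0, 1.4])
american     = torch.tensor([3.1, 3.5])

# TransE scores a triplet by the distance between (h + r) and t
def transe_distance(h, r, t):
    return torch.linalg.norm(h + r - t, ord=1)

print(transe_distance(tom_holland, nationality, english))    # small distance -> plausible triplet
print(transe_distance(willem_dafoe, nationality, american))  # small distance -> plausible triplet
print(transe_distance(tom_holland, nationality, american))   # large distance -> implausible triplet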

What types of relations can TransE represent?

There are many different types of relations, and different embedding schemes have different types that they can represent well.

Let’s see which types of relations TransE can represent, using some more examples in the same 2-D embedding space:

1. Inverse Relations

Inverse relations are relations r1 and r2 such that the triplet (h,r1,t) exists if and only if the triplet (t,r2,h) also exists.

In TransE, this would mean the embeddings of r1 and r2 would be related as r1 = -r2. (The inverse with respect to addition.)
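To see why, write the (idealized) TransE conditions for the two triplets and substitute one into the other:

h + r1 ≈ t and t + r2 ≈ h  ⟹  h + r1 + r2 ≈ h  ⟹  r2 ≈ -r1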

For example, if Tom Holland stars in “Spider-Man: No Way Home”, then “Spider-Man: No Way Home” must feature Tom Holland. Here, “Stars In” and “Features” are inverse relations.

The following knowledge graph:

Knowledge Graph with Inverse Relations

Will be embedded by TransE as:

TransE Embedding of Inverse Relations

2. Composition Relations

Composition relations are relations r1, r2, r3 such that, if the triplets (x,r1,y) and (y,r2,z) exist, then the triplet (x,r3,z) also exists.

In TransE, this would mean the embeddings of r1, r2, r3 would be related as r1 + r2 = r3.
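Again assuming the translations hold exactly, chaining the two triplets gives this constraint:

x + r1 ≈ y and y + r2 ≈ z  ⟹  x + (r1 + r2) ≈ z  ⟹  r3 ≈ r1 + r2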

For instance, given that Spider-Man protects New York City by fighting against Green Goblin, a supervillain attacking New York City, we see that this knowledge graph:

Knowledge Graph with Composition Relation

Will be embedded as:

TransE Embedding of Composition Relation

3. Anti-Symmetric Relations

An anti-symmetric relation is a relation r1 such that, if the triplet (h,r1,t) exists, the triplet (t,r1,h) does not.

For example, “Willem Dafoe” has a nationality of “American”, but “American” does not have a nationality of “Willem Dafoe”.

As a knowledge graph, we have:

Knowledge Graph with Anti-Symmetric Relation

Which TransE embeds as:

TransE Embedding of Anti-Symmetric Relation

When adding the “Nationality” embedding to the embedding of “Willem Dafoe”, the translation results in a location near the “American” embedding, as expected. And, when adding the “Nationality” embedding to the “American” embedding, the translation results in a location in the embedding space that is nowhere close to the embedding of “Willem Dafoe”.

TransE Loss Function

So, as long as our neural network can effectively learn embeddings, we see that TransE is expressive enough to represent many different types of relations. But, before our neural network can learn such embeddings, we need to provide it with a loss function that it will seek to minimize.

Formulation 1

A naive, intuitive formulation of the loss function might seek to minimize the distance between (h+r) and t for each (h,r,t) triplet in S (the set of edges in our knowledge graph). Such a loss function would look like this:

Naive (incorrect) Loss Function 1

where d(h+r,t) indicates the distance between (h+r) and t.

This would in fact fulfill the h + r ≈ t condition of TransE, but not prevent our model from embedding different entities to similar or equivalent locations in embedding space.

For instance, consider if the embeddings of “Tom Holland” and “Willem Dafoe” are similar:

A useless embedding of Tom Holland and Willem Dafoe

If “Tom Holland” + “Nationality” ≈ “English” and “Willem Dafoe” + “Nationality” ≈ “American” hold, the loss function will be minimized. But, since “Tom Holland” ≈ “Willem Dafoe”, we see that “Tom Holland” + “Nationality” ≈ “American” must also hold. This is false, so this is intuitively an undesirable embedding.

In this case, “Tom Holland” and “Willem Dafoe” are virtually indistinguishable in the embedding space. We call such embeddings useless embeddings.

Formulation 2

So, let’s adjust our loss function. In addition to minimizing d(h+r,t) for positive (h,r,t) triplets, we’ll have our model increase d(h’+r,t’) for corrupted triplets (h’,r,t’). (I.e. triplets that do not correspond to edges in our knowledge graph, formed from a positive triplet by replacing either its head or its tail.) Here’s our new loss function:

Naive (incorrect) Loss Function 2

To keep our loss function nonnegative, we use the positive part function (represented here by the brackets with a + symbol), which maps any negative input to 0.

This will encourage the model to increase d(h’+r,t’)! But this loss reaches zero as soon as d(h+r,t) is no larger than d(h’+r,t’), so the two distances can end up equal. We can then end up with an embedding like this:

Undesirable equidistant embedding

This implies that Tom Holland and Willem Dafoe are equally likely to be American, which we know is false. This is another undesirable embedding.

Formulation 3

So, let’s train our model to generate embeddings such that d(h+r,t) < d(h’+r,t’). To do so, we’ll add a positive scalar 𝜸 to our loss function.

Correct TransE Loss Function

We call 𝜸 the margin. The margin ensures that, in order to minimize the loss, the model learns to increase d(h’+r,t’) such that d(h+r,t) + 𝜸 ≤ d(h’+r,t’). That is, the distance component coming from the corrupted triplet will be at least 𝜸 more than the distance component coming from the positive triplet.

Desirable embedding

Now that “Willem Dafoe” + “Nationality” is closer to “American” than is “Tom Holland” + “Nationality”, we consider Willem Dafoe more likely to be American than we consider Tom Holland to be.
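As a minimal PyTorch sketch of this margin-based loss (the tensor shapes and values are illustrative; the coding tutorial below uses a logsigmoid-based variant instead):

import torch

def transe_margin_loss(d_pos, d_neg, margin=1.0):
    # d_pos: distances d(h+r,t) for positive triplets,   shape (batch,)
    # d_neg: distances d(h'+r,t') for corrupted triplets, shape (batch, num_neg)
    # Loss per pair: [d(h+r,t) + margin - d(h'+r,t')]_+
    return torch.clamp(d_pos.unsqueeze(1) + margin - d_neg, min=0).mean()

d_pos = torch.tensor([0.3, 0.5])
d_neg = torch.tensor([[1.6, 0.4], [2.0, 1.8]])
print(transe_margin_loss(d_pos, d_neg))  # only the corrupted distance 0.4 violates the margin and contributes loss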

Negative Sampling

Now, to effectively calculate this loss function, we need corrupted triplets! Most knowledge graphs are very sparse (most entities are not related to most others), so there are many more corrupted triplets than positive triplets. So, instead of considering all absent edges to be corrupted triplets, we sample a small set of corrupted triplets (~128) for each positive triplet and use these to calculate the loss.

We can sample corrupted triplets by taking a positive triplet (h,r,t) and “corrupting” one of the entities, so that it no longer corresponds to an edge in our graph, yielding a corrupted triplet (h’,r,t) or (h,r,t’). This sampling assumes that the absence of such an edge means that the corresponding relation does not hold.

This assumption, that absent edges between nodes correspond to an absence of relation between entities, is called the closed-world assumption. Under the closed-world assumption, we assume that our knowledge graph is complete, i.e. there are no missing positive triplets.

We can instead make the open-world assumption. Under the open-world assumption, we assume that our knowledge graph is incomplete, i.e. a missing edge does not necessarily correspond to a negative triplet.

Since our knowledge graphs are almost always incomplete, we usually want to operate under the open-world assumption, especially for applications involving knowledge graph completion. However, our method of negative sampling relies on the closed-world assumption. Does this undermine knowledge graph completion?

In practice, it does not. Because knowledge graphs are sparse and very large (>tens of thousands of entities), most corrupted triplets correspond to true negative triplets. This form of negative sampling results in a bias toward the closed-world assumption, but does not consider all missing edges to correspond to negative triplets.
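As a preview of the coding tutorial below, a minimal (non-vectorized) sketch of this corruption procedure might look like the following, where num_entities and the set all_triples are assumed to be available:

import random

def corrupt_triplet(h, r, t, num_entities, all_triples):
    # Replace the head or the tail with a random entity, re-sampling if we
    # accidentally produce a triplet that is actually present in the graph.
    while True:
        e = random.randrange(num_entities)
        corrupted = (e, r, t) if random.random() < 0.5 else (h, r, e)
        if corrupted not in all_triples:
            return corrupted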

Real-world TransE Embedding

So, what kinds of embeddings does TransE learn in practice? Below is a 2-D projection of an embedding learned by TransE on a subset of Google’s Freebase knowledge graph dataset, titled FB15k-237 [4]:

TransE Embedding — Anti-Symmetric “Nominated For” Relation

The arrow here corresponds to the embedding of the anti-symmetric “Nominated For” relation. We see that TransE learned a desirable embedding of movies and actors for this relation, since “Randy Thom” + “Nominated For” ≈ “Forrest Gump”, “Rodrigo Santoro” + “Nominated For” ≈ “300”, and other actors are not mistakenly embedded as being nominated for these movies.

Limitation of TransE — Symmetry

It bears mentioning that TransE cannot effectively model all types of relations. One important type of relation, the symmetric relation, cannot be expressed by TransE.

A symmetric relation is a relation r1 such that the triplet (h,r1,t) exists if and only if the triplet (t,r1,h) also exists.

For example, since “Willem Dafoe” and “Tom Holland” worked together on the same movie, they are colleagues of each other.

As a knowledge graph, we have:

Knowledge Graph with Symmetric Relation

Which TransE embeds as:

TransE Embedding of Symmetric Relation (Useless)

Since “Willem Dafoe” + “Colleague” ≈ “Tom Holland” and “Tom Holland” + “Colleague” ≈ “Willem Dafoe”, adding the two conditions gives “Colleague” ≈ -“Colleague”, i.e. “Colleague” ≈ 0, and therefore “Willem Dafoe” ≈ “Tom Holland”. These are useless embeddings.

This is also what we see in practice. Here is how TransE embedded the “Sibling” relation on the FB15k-237 dataset:

TransE Embedding — Symmetric “Sibling” Relation

The arrow here corresponds to the embedding of the symmetric “Sibling” relation. We see that TransE learned undesirable embeddings of three pairs of siblings, where siblings are embedded extremely close together, and the “Sibling” embedding has very low magnitude.

Embedding Scheme Two: RotatE

Let’s now consider our other embedding scheme, RotatE [5]. Similar to TransE, RotatE treats relations as movement in embedding space. Unlike TransE, RotatE treats relations as rotations in complex space, rather than translations in real space.

RotatE learns embeddings for positive triplets (h,r,t) such that h ∘ r ≈ t, where ∘ is the Hadamard (element-wise) product on the embeddings.

Its loss function is similar to that of TransE, but with the sums h + r and h’ + r replaced by the Hadamard products h ∘ r and h’ ∘ r.

As an example, here is a simple RotatE embedding in complex space:

Simple RotatE Embedding

Here, the “Nationality” relation is embedded as a 90 degree rotation.
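Python's built-in complex numbers are enough to see the mechanics: multiplying by a unit-modulus complex number rotates a point about the origin, and applying a 180 degree rotation twice returns to the start (which is how RotatE will handle symmetric relations below). The embeddings here are illustrative, not from a trained model:

import cmath

tom_holland = 1 + 1j                      # a hypothetical 1-D complex embedding
nationality = cmath.exp(1j * cmath.pi/2)  # 90 degree rotation, |r| = 1
print(tom_holland * nationality)          # approximately (-1+1j): Tom Holland rotated toward "English"

sibling = cmath.exp(1j * cmath.pi)        # 180 degree rotation
print(tom_holland * sibling * sibling)    # approximately (1+1j): applying it twice is the identity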

RotatE is able to represent all types of relations that TransE can!

1. Inverse Relations

RotatE Embedding of Inverse Relation

Here, r1 is the inverse of r2 with respect to the Hadamard product. Since each element of a relation embedding has unit modulus, this makes r1 the complex conjugate of r2, i.e. a rotation by the opposite angle.

2. Composition Relations

RotatE Embedding of Composition Relation

3. Anti-Symmetric Relations

RotatE Embedding of Anti-Symmetric Relation

4. Symmetric Relations

Additionally, RotatE can model symmetric relations that TransE cannot express! It does so as 180 degree rotations:

RotatE Embedding of Symmetric Relation

Comparison of Embedding Schemes

Let’s see how RotatE compares to TransE when embedding the symmetric “Sibling” relation in the FB15k-237 dataset:

TransE Embedding — Symmetric “Sibling” Relation
RotatE Embedding — Symmetric “Sibling” Relation

Unlike in TransE, we see that siblings are not embedded to similar locations! We also see that siblings are approximately 180 degrees rotated from one another, as expected.

Performance Comparison

Now we will compare how these methods perform on the dataset overall.

We use two metrics for measuring the performance of knowledge graph embeddings. The first is called Mean Reciprocal Rank (MRR). For a single positive triplet (or “query”), we measure how close (h+r) or (h ∘ r) is to t, relative to other candidate tails t’: a reciprocal rank of 1 indicates that t is the closest embedding, 1/2 that it is the second closest, and so on, so higher is better. We then average the reciprocal ranks over all positive triplets in the graph, yielding the MRR of the embedding:

Mean Reciprocal Rank Formula
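In symbols, with Q the set of queries and rank_i the rank of the true tail for the i-th query, this is:

MRR = (1/|Q|) · Σ_{i=1..|Q|} (1/rank_i)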

We also use a metric called Hits@K, which looks at the top K examples from a sorted list, and divides the number of positive examples in the top K elements of the list by the total number of positive examples in the list.

In our case, for each positive triplet, we have a list containing that positive triplet and many corrupted triplets. We then sort the list by score according to the loss function, and Hits@K evaluates to 1 if the positive triplet is in the top K elements of the list, and evaluates to 0 otherwise. Our final Hits@K value is the average of these evaluations across all positive triplets.
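Given the rank of each positive triplet within its list, both metrics reduce to a few lines. A sketch with made-up ranks:

import torch

ranks = torch.tensor([1., 3., 2., 12., 1.])   # hypothetical ranks of positive triplets

mrr = (1.0 / ranks).mean()
hits_at_1  = (ranks <= 1).float().mean()
hits_at_3  = (ranks <= 3).float().mean()
hits_at_10 = (ranks <= 10).float().mean()
print(mrr, hits_at_1, hits_at_3, hits_at_10)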

Using K values of 1, 3, and 10, we obtained the following performance after 100 epochs:

Performance of TransE and RotatE on the FB15k-237 dataset

We see that RotatE performs better than TransE by all metrics, which is expected due to its greater expressiveness and the presence of many types of relation in our dataset.

Further Limitations

However, neither performs particularly well on this dataset! At least in part, this is because there is a type of relation that neither TransE nor RotatE can model effectively. Namely, N-to-1 Relations.

An N-to-1 relation is a relation r1 such that triplets (h1,r1,t) and (h2,r1,t) both exist for distinct heads h1 and h2. That is, multiple entities are related to the same entity in the same way.

For example, Tom Holland and Willem Dafoe both star in “Spider-Man: No Way Home”:

Knowledge Graph with N-to-1 Relation

Which TransE will embed as:

TransE Embedding of N-to-1 Relation

Since both h1, h2 are embedded to the same location, this again yields a useless embedding.
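This collapse follows directly from the (idealized) TransE conditions:

h1 + r1 ≈ t and h2 + r1 ≈ t  ⟹  h1 ≈ t - r1 ≈ h2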

RotatE generates a similar embedding:

RotatE Embedding of N-to-1 Relation

Though TransE and RotatE are limited, they can still obtain decent performance. Let’s see how to implement them in PyTorch!

Coding Tutorial for TransE and RotatE

Note for the reader: this section is written to be understandable without having read the above discussion in detail. If you would like more detail on a concept mentioned in this tutorial, please refer to the relevant section above!

This tutorial is based on our Google Colab here:

Code snippets will be excerpted from this Colab, and we will refer to cells in the Colab when they would be cumbersome to show in-article.

Data

We will be using a subset of the Freebase database called FB15k-237. Freebase is a collaborative knowledge base of (h,r,t) triplets that was formerly supported by Google. FB15k is a subset of Freebase compiled by Bordes et al. that contains 592,213 triplets with 14,951 entities and 1,345 relations, and it is often used as a benchmark by knowledge graph embedding papers. FB15k-237 is a variant of FB15k that removes inverse relations, because models could easily infer test triplets by inverting training triplets. This removal results in a dataset of 310,116 triplets with 14,541 entities and 237 relations. This final variant (FB15k-237) is the dataset that we use to train our models.

To get started, pre-generated train, validation, and test sets can be found here: https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding/tree/master/data/FB15k-237 [6], provided courtesy of one of the authors of the RotatE paper. There are also two .dict files that provide unique index-to-entity and index-to-relation mappings, which we will use to map strings to integer tokens that the neural net can consume.

In the .txt files, we find lines that look like the following:

/m/0ydpd, /location/location/time_zones, /m/02hcv8

In this example, the head entity is represented by the MID /m/0ydpd (Asheville), the relation is /location/location/time_zones, and the tail is MID /m/02hcv8 (Eastern Time Zone). Translated to English, this tuple shows that Asheville, North Carolina, USA is located in the Eastern Time Zone.

A mapping between MIDs and WikiData links can be found here, for finding the real-world meaning of each MID: https://developers.google.com/freebase#freebase-wikidata-mappings

If you would like to generate your own dataset from the full knowledge base, the full Freebase data dump can also be downloaded from https://developers.google.com/freebase#freebase-rdf-dumps.

To load the data set, we can use the following two helper functions (data is tab delimited):

# Read the entities and relations dictionary files
def load_dict(file_path):
    loaded_dict = dict()
    with open(file_path, 'r') as f:
        for line in f:
            uid, val = line.strip().split('\t')
            loaded_dict[val] = int(uid)
    return loaded_dict

# Read the KG triples
def load_triples(file_path, entity2id, relation2id):
    triples = list()
    with open(file_path, 'r') as f:
        for line in f:
            head, relation, tail = line.strip().split('\t')
            triples.append((entity2id[head], relation2id[relation], entity2id[tail]))
    return triples

This code can be found in the “Run Model” section of the Colab.
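Assuming the FB15k-237 files from the repository above have been downloaded into a local data/FB15k-237/ directory (the paths here are illustrative), the helpers are used like this:

entity2id   = load_dict('data/FB15k-237/entities.dict')
relation2id = load_dict('data/FB15k-237/relations.dict')

train_triples = load_triples('data/FB15k-237/train.txt', entity2id, relation2id)
valid_triples = load_triples('data/FB15k-237/valid.txt', entity2id, relation2id)
test_triples  = load_triples('data/FB15k-237/test.txt', entity2id, relation2id)

print(len(entity2id), len(relation2id), len(train_triples))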

To help with mapping MIDs to the underlying entity name, the following function can help us retrieve an entity name from https://www.wikidata.org after using the Freebase-Wikidata mapping above to retrieve MID-to-WikiData URL mappings. When running this code, please be considerate of the site and refrain from initiating too many calls at once. (Consider adding a time.sleep(1) call in the loop.)

The code for this task can be found in the “MID to Entity Name Mapping” section of the Colab.

Negative Sampling and Data Loader

When training standard neural nets, we need both positive and negative examples to prevent introducing unwanted bias. In our case, each of the triplets in the dataset represents a positive example (i.e. they represent known facts). We therefore need to use negative sampling to prevent over-generalization of the model.

if self.data_type == "train":
    true_heads = self.true_relation_tail[(relation, tail)]
    true_tails = self.true_head_relation[(head, relation)]

    # A number of negative sampling methods was tried - including
    # taking a set difference before random sampling, randomly sampling
    # each corrupted head, etc but the method implemented in the paper
    # ended up being the fastest
    corrupted_heads = []
    corrupted = np.random.randint(self.num_entities, size=self.num_negative_samples*2)
    while len(corrupted_heads) < self.num_negative_samples:
        mask = np.in1d(corrupted, true_heads, assume_unique=True, invert=True)
        corrupted = corrupted[mask]
        corrupted_heads.extend(corrupted)
    corrupted_heads = corrupted_heads[:self.num_negative_samples]
    corrupted_heads = torch.LongTensor(corrupted_heads)

    corrupted_tails = []
    corrupted = np.random.randint(self.num_entities, size=self.num_negative_samples*2)
    while len(corrupted_tails) < self.num_negative_samples:
        mask = np.in1d(corrupted, true_tails, assume_unique=True, invert=True)
        corrupted = corrupted[mask]
        corrupted_tails.extend(corrupted)
    corrupted_tails = corrupted_tails[:self.num_negative_samples]
    corrupted_tails = torch.LongTensor(corrupted_tails)

    filter_bias = torch.LongTensor([0] * len(positive_sample))

This code can be found in the “Custom DataLoader” section of the Colab in the __getitem__() method of the DataGenerator class.

In this example, for each positive example, we also have a number of negative examples (128) obtained by corrupting either the head or tail of the positive example. We need many negative examples for each positive example because knowledge graphs are inherently sparse. The greater number of negative samples allows us to implicitly show our model this sparseness. (Note that when randomly corrupting positive triplets, we need to make sure that the resulting corrupted triplets aren’t actually positive examples elsewhere in the dataset.)

When evaluating our model, we compile a list of triplets for each positive example consisting of the positive example itself and as many negative examples as there are entities in the data set. The idea here is that to evaluate metrics like MRR and Hits@K, we need to determine the ranking of this true positive triplet against all other triplets with a similar head and relation but with a corrupted tail (or vice versa).

We note that it is possible for a head and relation to have multiple true tails (e.g. a 1-to-N relation “Teacher”, where Bob is a teacher of Jack, Janice, and Jolyne). This is accounted for below by adding a filter bias that pushes additional true tail embeddings down in the ranking so that they are effectively unimportant during MRR and Hits@K calculations.

else:
    # We cannot empirically say that one head is better than another for a valid
    # (head, relation, tail) triplet. Ex. (Bob, friend, Joe), (Jack, friend, Joe).
    # In this case, we replace the alternate triplet with the current true head
    # and add a filter bias of -1 to push it down in the rankings, ideally preventing it
    # from showing up in our Hits@K and MRR metrics
    corrupted_heads = [(0, test_head) if (test_head, relation, tail) not in self.all_triples
                       else (-1, head) for test_head in range(self.num_entities)]
    corrupted_heads[head] = (0, head)
    corrupted_heads = torch.LongTensor(corrupted_heads)

    corrupted_tails = [(0, test_tail) if (head, relation, test_tail) not in self.all_triples
                       else (-1, tail) for test_tail in range(self.num_entities)]
    corrupted_tails[tail] = (0, tail)
    corrupted_tails = torch.LongTensor(corrupted_tails)

    filter_bias = (corrupted_heads[:, 0], corrupted_tails[:, 0])
    corrupted_heads = corrupted_heads[:, 1]
    corrupted_tails = corrupted_tails[:, 1]

return positive_sample, corrupted_heads, corrupted_tails, filter_bias

This code can also be found in the “Custom DataLoader” section of the Colab in the __getitem__() method of the DataGenerator class.

Objective

The objective when training our model is to optimize our embeddings of entities and relations to respect the geometric and spatial properties of the chosen function (TransE, RotatE, etc). Unlike typical deep neural nets where the weight matrices being trained often lack easily identifiable patterns, the optimized embeddings learned by our model preserve relationships like symmetry, inverse, composition, etc.

We will now implement the TransE and RotatE functions that define the learning objectives for the respective models.

TransE Implementation

def transE(head, relation, tail, sample_type, margin):
    dist = head + relation - tail
    score = margin - torch.linalg.norm(dist, ord=1, dim=2)
    return score

This code can be found in the “TransE” section of the Colab.

The TransE function is very straightforward. We want to minimize the L1 norm between (head entity + relation) and the tail entity. We use a margin to reduce the difficulty of the optimization problem by allowing embeddings that are “close enough” to incur minimal loss.
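As a quick shape check, the function expects embeddings batched as (batch, num_samples, hidden_dim); sample_type is unused in the snippet above, so any placeholder works here, and the values below are illustrative:

import torch

batch, num_neg, hidden_dim, margin = 4, 128, 50, 9.0

head     = torch.randn(batch, 1, hidden_dim)        # one head per positive triplet
relation = torch.randn(batch, 1, hidden_dim)
tail     = torch.randn(batch, num_neg, hidden_dim)  # e.g. corrupted tails, broadcast against head + relation

score = transE(head, relation, tail, 'negative-tail', margin)
print(score.shape)  # torch.Size([4, 128])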

RotatE Implementation

def rotatE(head, relation, tail, embedding_range, sample_type, margin):
    # We used 2 x hidden_dim so that we can split them into real and imaginary
    # components here
    head_real, head_imag = torch.chunk(head, 2, dim=2)
    tail_real, tail_imag = torch.chunk(tail, 2, dim=2)

    # This evenly distributes the relation between [-pi, pi]
    norm_relation = relation / (embedding_range / math.pi)
    relation_real = torch.cos(norm_relation)
    relation_imag = torch.sin(norm_relation)

    real_dist = (head_real * relation_real - head_imag * relation_imag) - tail_real
    imag_dist = (head_real * relation_imag + head_imag * relation_real) - tail_imag

    # Each dimension represents its own rotation in imaginary space.
    # Take the Frobenius norm to compute the score and sum across all
    # dimensions
    total_dist = torch.stack([real_dist, imag_dist], dim=0)
    total_dist = torch.linalg.norm(total_dist, dim=0).sum(dim=2)

    # If something is close enough, we don't want to penalize it
    margin_adjusted_dist = margin - total_dist
    return margin_adjusted_dist

This code can be found in the “RotatE” section of the Colab.

For the RotatE function, we need to make sure that our entity embeddings are initialized with twice the hidden dimensionality, so that each dimension has both a real and an imaginary component. We start by splitting the entity embeddings into their real and imaginary parts and normalizing the relation embedding to lie between -pi and pi. We then attempt to minimize the L2 norm between (the head times the relation) and the tail. A margin value is used, similar to TransE.

The following formulas correspond to the distance calculation in the above code:

Head embedding = (a + bi)
Relation embedding = (c + di)
Tail embedding = (e + fi)
Head times relation = (ac - bd) + (bc + ad)i
Head times relation minus tail = (ac - bd - e) + (bc + ad - f)i
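A quick numerical check of this expansion with Python complex numbers (arbitrary values):

a, b, c, d, e, f = 0.3, -1.2, 0.8, 0.6, -0.5, 0.9

direct   = (a + b*1j) * (c + d*1j) - (e + f*1j)
expanded = (a*c - b*d - e) + (b*c + a*d - f)*1j
print(abs(direct - expanded))  # ~0.0, the two forms agree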

Loss Function

To compute our final loss, after getting the margin-adjusted distances above, we apply a LogSigmoid activation function. When evaluating negative examples, we first multiply the distance by -1 before applying the activation function. This ensures that we push the (head + relation) and tail embeddings of corrupted samples apart, while pulling those of positive samples together.

positive_sample_dist = model(positive_sample, 'positive')
positive_score = F.logsigmoid(positive_sample_dist)
positive_sample_loss = -positive_score.mean()

# Here we take the negative of the distance values because
# we want the distances between (head, relation) and (tail)
# entities for corrupted samples to be far apart. In our model,
# we use "margin - distance", so a large distance would give
# a negative value that we then flip to positive. After plugging
# into the logsigmoid, that would yield a value close to 0
corrupted_head_dist = model((positive_sample, corrupted_heads), 'negative-head')
corrupted_head_score = F.logsigmoid(-corrupted_head_dist)
corrupted_head_loss = -corrupted_head_score.mean()

corrupted_tail_dist = model((positive_sample, corrupted_tails), 'negative-tail')
corrupted_tail_score = F.logsigmoid(-corrupted_tail_dist)
corrupted_tail_loss = -corrupted_tail_score.mean()

# In the paper, each corrupted_head and corrupted_tail is treated as a separate example.
# Here, we combine them into one, so we need to weight the positive_sample loss accordingly
# by adding it twice
loss = (positive_sample_loss + corrupted_head_loss + positive_sample_loss + corrupted_tail_loss) / 4
total_loss += loss.item()

This code can be found in the core training loop in the main() method of the Colab.

Training

Training the model works much like training any other neural net: we load the data, initialize the optimizer (we use Adam), and train across multiple epochs. (In our case, we use 400 epochs.)
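A minimal sketch of that loop, assuming a model with the call signatures used above, a train_loader built from the custom DataGenerator, and the loss computation from the previous section wrapped in a compute_loss helper (all of these names are placeholders for the corresponding pieces in the Colab):

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate is illustrative

for epoch in range(400):
    total_loss = 0.0
    for positive_sample, corrupted_heads, corrupted_tails, filter_bias in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(model, positive_sample, corrupted_heads, corrupted_tails)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch}: loss {total_loss / len(train_loader):.4f}")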

Conclusion

As we leave you, we want to mention some extensions to what we’ve discussed here. There are many more embedding schemes, such as TransR, DistMult, and ComplEx, all of which can better represent the N-to-1 relations discussed above!

We can also learn embeddings so that we can answer queries like “Who did Tom Holland star in a movie with in 2021?”, in a task called reasoning over knowledge graphs.

We hope this has served as a helpful introduction to knowledge graph embeddings. Know that you’ve only scratched the surface of an exciting, active field of research!

References

[1] Wang, M., Qiu, L., & Wang, X. (2021). A Survey on Knowledge Graph Embeddings for Link Prediction. Symmetry, 13, 485.

[2] Leskovec, J. (2021). Knowledge Graph Embeddings Lecture. Stanford CS224W.

[3] Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O. (2013). Translating Embeddings for Modeling Multi-relational Data. Advances in Neural Information Processing Systems.

[4] FB15k-237 Dataset. (2015). Retrieved December, 2021, from https://deepai.org/dataset/fb15k-237

[5] Sun, Z., Deng, Z. H., Nie, J. Y., & Tang, J. (2019). RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. arXiv preprint arXiv:1902.10197.

[6] Sun, Z. (2019). Fb15k-237 knowledge base completion dataset. Retrieved November, 2021 from https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding
