Knowledge Graph Embeddings: Simple yet Powerful Representations

Learning powerful knowledge graph embedding representations using TransE and predicting missing triplets from the FB15k-237 dataset.

Mark Endo
Stanford CS224W GraphML Tutorials
8 min read · Feb 9, 2022


By Mark Endo as part of the Stanford CS224W course project.

Link to Google Colab: https://colab.research.google.com/drive/12IRVpTWeXZjKGabeQjrpajgE86LAGUbE?usp=sharing

In this tutorial, we analyze the power of knowledge graph (KG) embedding representations through the task of predicting missing triplets in the Freebase dataset. First, we give an overview of knowledge graphs and discuss the task of predicting missing triplets. Second, we look at a popular dataset used to learn KG models, namely FB15k-237. Third, we implement TransE [1], a popular method for learning KG embeddings, and analyze our results. Fourth, we explore symmetric relations in the FB15k-237 dataset, which TransE is not able to accurately represent.

Background of Knowledge Graphs

A knowledge graph contains different types of entities connected by various relationship types. From a graph perspective, entities are represented by nodes, and relationships are represented by edges. Since there are multiple relationship types, a KG is a type of heterogeneous graph. Knowledge graphs model real-world entities and relations and can be used to represent many kinds of networks. A triplet is formed by a relation connecting two entities and is denoted (h, l, t), where h is the head entity, t is the tail entity, and l is the relation (label) connecting h to t; for example, (Kevin Bacon, acted_in, Apollo 13). One of the most common tasks relating to KGs is predicting missing triplets, which is especially important because KGs are often incomplete. In this tutorial, we will be predicting missing tails.

Analyzing FB15k-237

Freebase, one of the most popular knowledge graphs, is described as “an open shared database of the world’s knowledge.” In Freebase, entities range from actors to cities to objects and many others. For this tutorial, we will be using FB15k-237, a variant of FB15k that removes inverse relations, which would otherwise allow test triplets to be trivially inferred from the train set. There are a total of 310,116 triplets with 14,541 entities and 237 relations. Below is one example of a subgraph in Freebase.

Freebase subgraph of Family Guy. Credit: https://www.researchgate.net/figure/Freebase-subgraph-of-Family-Guy_fig2_301404391

As we can see, there are many different types of entities and relations just in this small subgraph.

Now, we are going to process the raw FB15k-237 text files into a PyTorch Geometric InMemoryDataset and visualize the graph. We use the InMemoryDataset class because it loads all the data at once, so we can easily and quickly access it later. In order to inherit from the InMemoryDataset class, we need to implement the following methods: raw_file_names(), processed_file_names(), download(), and process(). More information about the InMemoryDataset class can be found in the PyTorch Geometric documentation.

Importantly, in the process() method, we load the dataset into train, validation, and test splits. In each split, the graph is represented with an edge_index and edge_type. Later on, we can access the head entities with edge_index[0,:] and the tail entities with edge_index[1,:].
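A minimal sketch of what this dataset class might look like (the raw-file format, one tab-separated head/relation/tail string per line, matches the standard FB15k-237 release; the file names and helper details here are our own choices, not necessarily those of the Colab):

```python
import torch
from torch_geometric.data import Data, InMemoryDataset

class FB15k237(InMemoryDataset):
    """FB15k-237 as an InMemoryDataset, with one processed file per split."""

    def __init__(self, root, split='train'):
        super().__init__(root)
        idx = ['train', 'valid', 'test'].index(split)
        self.data, self.slices = torch.load(self.processed_paths[idx])

    @property
    def raw_file_names(self):
        return ['train.txt', 'valid.txt', 'test.txt']

    @property
    def processed_file_names(self):
        return ['train.pt', 'valid.pt', 'test.pt']

    def download(self):
        pass  # we assume the raw text files were placed in `<root>/raw` by hand

    def process(self):
        # Map every entity and relation string to a contiguous integer id.
        ent2id, rel2id, splits = {}, {}, {}
        for path, name in zip(self.raw_paths, ['train', 'valid', 'test']):
            rows = []
            with open(path) as f:
                for line in f:
                    h, r, t = line.strip().split('\t')
                    ent2id.setdefault(h, len(ent2id))
                    ent2id.setdefault(t, len(ent2id))
                    rel2id.setdefault(r, len(rel2id))
                    rows.append((ent2id[h], rel2id[r], ent2id[t]))
            splits[name] = torch.tensor(rows)

        # Store each split as a graph with an edge_index and an edge_type.
        for i, name in enumerate(['train', 'valid', 'test']):
            rows = splits[name]
            data = Data(edge_index=rows[:, [0, 2]].t().contiguous(),
                        edge_type=rows[:, 1],
                        num_nodes=len(ent2id))
            torch.save(self.collate([data]), self.processed_paths[i])
```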

Once the dataset has been processed, we can visualize its structure. Since it has a large number of entities and relation triplets, it would be hard to visualize using tools such as NetworkX and Matplotlib. Therefore, we export the data to a format that can be used by large-graph visualization tools. Even with these tools, the graph can still be quite large, so we can sample the graph before exporting it.
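A minimal sketch of such an export (sampling triplets and writing a CSV edge list with the Source/Target/Label header that Gephi expects; the sample size and file name are arbitrary choices):

```python
import csv
import random

def export_for_gephi(edge_index, edge_type, num_samples=5000,
                     path='fb15k237_sample.csv'):
    """Sample triplets and write a Gephi-importable edge list."""
    idx = random.sample(range(edge_index.shape[1]), num_samples)
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Source', 'Target', 'Label'])
        for i in idx:
            writer.writerow([edge_index[0, i].item(),   # head entity id
                             edge_index[1, i].item(),   # tail entity id
                             edge_type[i].item()])      # relation type id
```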

Once we have our graph in this format, we can load it into graph visualization software like Gephi.

Visualization of an FB15k-237 subgraph in Gephi. The most common relationship types are shown at the top left.

We see that this sampled subgraph has clusters of nodes, with some nodes acting as cluster centers. This happens because some entities have relations with many other entities; one can imagine that an entity such as Kevin Bacon would sit at the center of a cluster. We also see that clusters are connected by intermediate entities with various relation types, which seems like a natural feature of knowledge graphs. From here, we can discuss embedding these knowledge graphs so that we can predict missing triplets.

Training TransE

In this section, we discuss TransE as a method for embedding knowledge graphs. As stated earlier, connections in KGs are represented as triplets (h, l, t). In TransE, we model both the entities and relations in an embedding space and adjust the embeddings such that h + l ≈ t (where bold letters denote embeddings). We can visualize the mapping of entities and relations onto the vector space as follows:

TransE knowledge graph embeddings in 2D space, such that h + l ≈ t

In this example, the weights are trained such that the sum of the head entity vector and the relation vector approximates the tail vector. We train using a margin-based loss with negative sampling:
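```latex
\mathcal{L} = \sum_{(h,l,t) \in S} \; \sum_{(h',l,t') \in S'_{(h,l,t)}}
    \left[ \gamma + d(\mathbf{h} + \mathbf{l}, \mathbf{t})
           - d(\mathbf{h'} + \mathbf{l}, \mathbf{t'}) \right]_+
```

Here S is the set of training triplets, S′ the corrupted triplets derived from them, and [x]₊ = max(0, x).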

In the loss function, 𝛾 represents the margin and is set to 1, and we calculate the distances d using the L2 norm. (h′, l, t′) denotes a corrupted triplet, created by replacing either the head or the tail with a random entity.

Examples of possible corruptions of the head or tail. The relation type stays the same, and only one entity is replaced.

In order to create negative samples, we corrupt the edge index by replacing either the head or the tail of each triplet with a random entity. We can then access the negative samples with the corrupted edge_index and the original edge_type.
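A minimal sketch of this corruption step, assuming edge_index has shape [2, num_edges]:

```python
import torch

def corrupt_edges(edge_index, num_entities):
    """Replace either the head or the tail of each edge with a random entity."""
    neg_edge_index = edge_index.clone()
    num_edges = edge_index.shape[1]
    # For each edge, flip a coin: row 0 corrupts the head, row 1 corrupts the tail.
    head_or_tail = torch.randint(0, 2, (num_edges,))
    random_entities = torch.randint(0, num_entities, (num_edges,))
    neg_edge_index[head_or_tail, torch.arange(num_edges)] = random_entities
    return neg_edge_index
```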

At the start of each epoch, we normalize the entity vectors to unit length so that entities are not pushed to have arbitrarily large embeddings, which would make it hard for the model to distinguish positive from negative triplets. The body of the training code follows a fairly standard PyTorch implementation. One thing to note is that the training samples should be shuffled between epochs; to do this, we can create a random ordering of triplets for each epoch and shuffle the edge_index and edge_type accordingly. The full code is available in the corresponding Colab.
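As a rough sketch (not the exact Colab code), an epoch of training might look like this, reusing corrupt_edges from above, with entity_emb and rel_emb as torch.nn.Embedding layers:

```python
import torch
import torch.nn.functional as F

def train_epoch(entity_emb, rel_emb, edge_index, edge_type,
                num_entities, optimizer, margin=1.0, batch_size=1024):
    # Normalize entity embeddings to unit length at the start of the epoch.
    with torch.no_grad():
        entity_emb.weight.data = F.normalize(entity_emb.weight.data, p=2, dim=1)

    # Shuffle the triplets between epochs.
    perm = torch.randperm(edge_index.shape[1])
    edge_index, edge_type = edge_index[:, perm], edge_type[perm]

    total_loss = 0.0
    for start in range(0, edge_index.shape[1], batch_size):
        batch = slice(start, start + batch_size)
        pos = edge_index[:, batch]
        rel = rel_emb(edge_type[batch])
        neg = corrupt_edges(pos, num_entities)  # negative sampling from above

        # d(h + l, t) under the L2 norm, for positive and corrupted triplets.
        pos_dist = (entity_emb(pos[0]) + rel - entity_emb(pos[1])).norm(p=2, dim=1)
        neg_dist = (entity_emb(neg[0]) + rel - entity_emb(neg[1])).norm(p=2, dim=1)

        # Margin loss: [gamma + d_pos - d_neg]_+ averaged over the batch.
        loss = F.relu(margin + pos_dist - neg_dist).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss
```

Here, optimizer is assumed to wrap both embedding tables, e.g. torch.optim.Adam(list(entity_emb.parameters()) + list(rel_emb.parameters())).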

Evaluation

Now that we have discussed training TransE to embed entities and relations, we move on to the evaluation task of predicting missing tails. For each triplet in the validation and test sets, the tail is removed and replaced by each entity in the graph. We then calculate the distance between h + l and each candidate entity, and rank the entities in ascending order of distance.

Visualization of tail prediction. In this case, the rank is 2.

Notice that the evaluation is computationally expensive: for each test triplet, we have to calculate distances to every entity in the graph.
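A minimal sketch of this protocol, reusing the entity_emb and rel_emb embedding layers from the training sketch above:

```python
import torch

@torch.no_grad()
def tail_ranks(entity_emb, rel_emb, edge_index, edge_type):
    """For each (h, l, ?) query, rank every entity by its distance to h + l."""
    ranks = []
    all_entities = entity_emb.weight  # (num_entities, dim)
    for i in range(edge_index.shape[1]):
        h, t = edge_index[0, i], edge_index[1, i]
        query = entity_emb(h) + rel_emb(edge_type[i])          # h + l
        dist = (query.unsqueeze(0) - all_entities).norm(p=2, dim=1)
        # Rank of the true tail = 1 + number of entities that score strictly better.
        ranks.append(int((dist < dist[t]).sum()) + 1)
    return torch.tensor(ranks)
```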

In order to evaluate our model’s performance on this task, we will use three popular metrics. The first is Hits@10, which measures the fraction of predictions where the true tail entity appears in the top 10 ranks. It is measured as
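```latex
\text{Hits@10} = \frac{1}{|Q|} \sum_{i \in Q} \mathbb{1}\left[ \text{rank}_i \le 10 \right]
```

where Q denotes the set of evaluation queries and rank_i is the rank of the true tail for query i.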

The advantage of this metric is that it is highly interpretable. However, one drawback is that it does not distinguish between ranks larger than 10: ranks of 11 and 11 + d, where d ≫ 1, contribute equally (that is, nothing) to the final score.

The next evaluation metric is Mean Rank, which measures the average of all individual ranks:
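```latex
\text{MR} = \frac{1}{|Q|} \sum_{i \in Q} \text{rank}_i
```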

This metric does distinguish between ranks above ten, but it is highly dependent on the total number of entities. For example, a Mean Rank of 5 would be very good if there are 15,000 entities, but it would be as good as random if there were only 10. Lastly, we evaluate Mean Reciprocal Rank (MRR):
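```latex
\text{MRR} = \frac{1}{|Q|} \sum_{i \in Q} \frac{1}{\text{rank}_i}
```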

This metric is nice because it is bounded between 0 and 1, although it has been shown to be a flawed metric [2]. More information about these metrics can be found in [3].

Note: Even with accurate metrics, the evaluation procedure can still be flawed, since some corrupted triplets may themselves be valid and may be ranked above the test triplet. To alleviate this issue, you can filter out corrupted triplets that appear in the training, validation, or test set (the so-called filtered setting).

Here is one way to implement each metric in PyTorch, given the tensor of ranks produced by the evaluation sketch above.
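```python
import torch

def hits_at_k(ranks, k=10):
    """Fraction of queries whose true tail is ranked in the top k."""
    return (ranks <= k).float().mean().item()

def mean_rank(ranks):
    """Average rank of the true tail across all queries."""
    return ranks.float().mean().item()

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank; bounded in (0, 1]."""
    return (1.0 / ranks.float()).mean().item()
```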

Results

After training for 50 epochs, we find that the test scores from the best model (selected on the validation set) are as follows:

If you are working with limited resources and train for only 10 epochs, here are the results.

Visualizing Embeddings

Since TransE uses shallow embeddings to represent entities and relations, it is straightforward to visualize the training process: we can use PCA to reduce the dimensionality of our embeddings to 2D.

TransE relational embeddings visualization using PCA.

Below is a sketch of code for visualizing the relational embeddings during training, using scikit-learn’s PCA and Matplotlib. Entities can be visualized in a similar manner.
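```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_relation_embeddings(rel_emb, epoch):
    """Project relation embeddings to 2D with PCA and scatter-plot them."""
    points = PCA(n_components=2).fit_transform(
        rel_emb.weight.detach().cpu().numpy())
    plt.figure(figsize=(6, 6))
    plt.scatter(points[:, 0], points[:, 1], s=5)
    plt.title(f'Relation embeddings (epoch {epoch})')
    plt.savefig(f'rel_emb_epoch_{epoch}.png')  # one frame per epoch
    plt.close()
```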

Weaknesses/Limitations of TransE

Although TransE is a powerful method for representing knowledge graph embeddings, it is not able to model all types of relationships. In particular, it cannot model symmetric relations or 1-to-N relations: if both h + l ≈ t and t + l ≈ h are to hold, then l ≈ 0 and h ≈ t, which collapses the two entities. In the final section of this tutorial, we will explore the FB15k-237 dataset to find symmetric relations. To do so, we sample triplets from the train set and check whether their symmetric counterparts are present in the graph, as sketched below.
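One simple check: put every training triplet into a set and test, for each sampled triplet (h, l, t), whether its mirror (t, l, h) is also present (a sketch, assuming the edge_index and edge_type tensors from before):

```python
import torch

def symmetric_fraction(edge_index, edge_type, num_samples=10000):
    """Fraction of sampled triplets (h, l, t) whose mirror (t, l, h) also exists."""
    triplets = {(int(edge_index[0, i]), int(edge_type[i]), int(edge_index[1, i]))
                for i in range(edge_index.shape[1])}
    sample = torch.randperm(edge_index.shape[1])[:num_samples]
    hits = sum((int(edge_index[1, i]), int(edge_type[i]), int(edge_index[0, i]))
               in triplets for i in sample)
    return hits / len(sample)
```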

In general, we find that most of the triplets (more than four-fifths) do not have a symmetric pair in the dataset. This suggests that TransE is able to accurately represent most relationships in the dataset and learn useful representations.

References

[1] Bordes, A.; Usunier, N.; Garcia-Durán, A.; Weston, J.; and Yakhnenko, O. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems (NeurIPS), 2787–2795.

[2] Fuhr, N. 2017. Some Common Mistakes in IR Evaluation, and How They Can Be Avoided. SIGIR Forum 51(3): 32–41.

[3] Berrendorf, M.; Faerman, E.; Vermue, L.; and Tresp, V. 2020. On the Ambiguity of Rank-Based Evaluation of Entity Alignment or Link Prediction Methods. arXiv preprint.
