Week 2: initial results coming…

Marco Sobrevilla Cabezudo
2 min read · Jun 15, 2020


Previously on my GSoC journey:

I explored RDF2Vec, adapted it to my context, and started generating node embeddings using the random walker and the Weisfeiler-Lehman walker.

What did I do?

I started with the random walk and Weisfeiler-Lehman walk strategies in RDF2Vec. Random walks ran very quickly (it took one day to run and generate all the embeddings). The Weisfeiler-Lehman algorithm, in contrast, is taking several days. I tried the same hyperparameters as the original paper (hoping to reduce the running time), but it is still running.

The random walk-based node embeddings were built considering all nodes in the original and modified triples of the WebNLG dataset, and they can be found in this link. I created a script to generate the embeddings that uses the RDF2Vec code. This script takes parameters such as the knowledge graph files, the vocabulary for which embeddings should be generated, the number of walks per instance, and the depth of each walk, among others.
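To make this concrete, here is a minimal sketch of walk-based embedding generation. It is not the actual script: the triples and hyperparameter values are placeholders, and it trains gensim's Word2Vec (≥ 4.0) directly over the walks instead of calling the RDF2Vec package.

import random
from collections import defaultdict

from gensim.models import Word2Vec  # pip install gensim

# Placeholder triples; the real script reads them from the knowledge graph files.
triples = [
    ("Aarhus_Airport", "cityServed", "Aarhus,_Denmark"),
    ("Aarhus_Airport", "elevation", "25.0"),
]

# Adjacency map: subject -> list of (predicate, object).
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

def random_walk(start, depth):
    """One random walk of up to `depth` hops, recording nodes and predicates."""
    walk = [start]
    node = start
    for _ in range(depth):
        if not graph[node]:
            break
        pred, node = random.choice(graph[node])
        walk.extend([pred, node])
    return walk

# Hypothetical hyperparameters: 500 walks per instance, depth 4.
entities = list(graph.keys())
walks = [random_walk(e, depth=4) for e in entities for _ in range(500)]

# Train skip-gram Word2Vec over the walks; each node and predicate in the
# vocabulary gets a vector.
model = Word2Vec(sentences=walks, vector_size=100, window=5, sg=1, min_count=1)
embedding = model.wv["Aarhus_Airport"]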

In particular, I found two problems:
- In the WebNLG dataset, the modified triples are used instead of the original ones. For example (this is a snippet of the XML):
<originaltripleset>
<otriple>Aarhus_Airport | elevation | "25.0"^^xsd:double</otriple>
</originaltripleset>
<modifiedtripleset>
<mtriple>Aarhus_Airport | elevationAboveTheSeaLevel_(in_metres) | 25.0</mtriple>
</modifiedtripleset>

As a result, some nodes (predicates) like "elevationAboveTheSeaLevel_(in_metres)" are not part of the knowledge graph. Besides, there is another relation called "elevationAboveTheSeaLevel_(in_feets)". In principle, these relations are the same (they could share an embedding), but in terms of verbalization they differ (one is measured in metres and the other in feet).
- There are several literals that do not belong to the knowledge graph. For example, "25.0" in the previous triple is a literal, and it could just as well be "32.0" or any other number. The same happens with other kinds of literals. We are exploring other ways of generating embeddings that take literals into account (for example, [1]; see the sketch below).
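To illustrate how a numeric literal could be folded into a node embedding, here is a rough PyTorch sketch of a gated fusion in the spirit of [1]. It is a deliberate simplification, not the paper's exact formulation and not part of our codebase; all names and dimensions are placeholders.

import torch
import torch.nn as nn

class LiteralGate(nn.Module):
    """Mix a numeric literal into an entity embedding via a learned gate
    (simplified from the LiteralE idea in [1])."""

    def __init__(self, dim: int):
        super().__init__()
        self.lit_proj = nn.Linear(1, dim)      # embed the raw numeric value
        self.gate = nn.Linear(2 * dim, dim)    # how much literal to let through

    def forward(self, entity_emb, literal):
        lit_emb = self.lit_proj(literal.unsqueeze(-1))
        g = torch.sigmoid(self.gate(torch.cat([entity_emb, lit_emb], dim=-1)))
        return g * lit_emb + (1 - g) * entity_emb

# Usage: fuse the literal 25.0 into the embedding of Aarhus_Airport.
fuse = LiteralGate(dim=100)
entity = torch.randn(100)            # stand-in for a pretrained node embedding
fused = fuse(entity, torch.tensor(25.0))

The gate lets the model decide, per dimension, how much of the literal signal to mix into the entity representation.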

Regarding the code base, I modified it to accept pretrained embeddings and to optionally freeze the source and target embeddings (a sketch of the freezing mechanism appears after the results). The hyperparameters can be accessed here. I compared training the embeddings from scratch against using the random walk-based pretrained embeddings and keeping them fixed during training. The initial results are described here:

Training Embeddings from Scratch
================================

Bleu_1: 0.623302
Bleu_2: 0.493453
Bleu_3: 0.396279
Bleu_4: 0.321260
METEOR: 0.352235

Using Pretrained Embeddings
===========================

Bleu_1: 0.571150
Bleu_2: 0.447081
Bleu_3: 0.356029
Bleu_4: 0.286652
METEOR: 0.317034

As can be seen, it seems more convenient not to use the pretrained embeddings (for now). One likely reason for these results is the quality of the embeddings: in a quick count, approximately 800–900 nodes were not found in the knowledge graph, which represents about 22% of the nodes in the WebNLG dataset.
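For reference, the freeze-or-finetune switch mentioned above can be expressed in PyTorch roughly as follows; this is a minimal sketch with placeholder shapes, not the actual codebase.

import torch
import torch.nn as nn

# Stand-in for the RDF2Vec vectors: one row per vocabulary entry.
pretrained = torch.randn(5000, 100)  # (vocab_size, embedding_dim)

# freeze=True keeps the pretrained vectors fixed during training;
# freeze=False lets the model fine-tune them instead.
src_embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Nodes missing from the knowledge graph (the ~22% mentioned above) would
# map to randomly initialised rows, which likely hurts the frozen setting.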

It is worth noting that I am using the evaluation tool provided by Sharma et al. to evaluate both the original model and the variant with pretrained node embeddings.
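Assuming the tool is nlg-eval (Sharma et al.), the metrics above can be computed from plain-text hypothesis and reference files like this; the file names are placeholders.

# Install from https://github.com/Maluuba/nlg-eval
from nlgeval import compute_metrics

# hyp.txt: one generated sentence per line; ref0.txt: aligned references.
metrics = compute_metrics(hypothesis="hyp.txt", references=["ref0.txt"])
print(metrics)  # reports Bleu_1..Bleu_4 and METEOR, among other metrics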

What’s next?

More experiments, plus trying to incorporate literals and the modified relations! Besides that, some updates to the codebase to handle random seeds and other small features.

[1] Agustinus Kristiadi, Mohammad Asif Khan, Denis Lukovnikov, Jens Lehmann, and Asja Fischer. "Incorporating Literals into Knowledge Graph Embeddings." ISWC 2019.
