5th Workshop on Semantic Deep Learning (SemDeep-5)

Artem Revenko
Semantic Tech Hotspot
10 min read · Sep 3, 2019

Link to workshop

This year’s SemDeep was one of the most popular workshops at IJCAI: more than 90 attendees registered. Why did the workshop attract so much attention? Are researchers interested in bringing semantics into opaque deep networks, or in employing the recently successful deep models to understand the semantics of data?

Disclaimer: What follows is my personal, subjective opinion. I mostly consider the works with respect to their potential applicability to knowledge graphs (KGs). KGs are not the only possible application for SemDeep works; however, I hope the presented views are as generic as KGs themselves.

IJCAI 2019 took place in the Venetian hotel in Macao. (Image source: https://de.wikipedia.org/wiki/Datei:The_venetian_macao_outside_night.jpg)

Word in Context Challenge

link to challenge

The workshop started with a linguistic challenge. Given a target word and two contexts for this target word, the task is to identify if the target word is used in the same sense in both contexts. See the table below for examples.

Word in Context: The task

The dataset for the challenge contains more than 7,000 pairs and around 3,000 unique words. The example sentences were taken from WordNet and are, therefore, manually curated. A remarkable feature of this challenge is that, though word senses are very important for solving the task, the task avoids defining or even debating what a word sense actually is. This feature is particularly important for applications: when one searches Google for “Apple smartphones”, one does not fall into philosophical debates about what a word sense is. Non-expert humans reach an accuracy of about 80% on the prepared challenge.

Though the challenge can be solved using different approaches, and not necessarily deep learning, most of the participants did make use of deep network classifiers. This is not surprising given the recent achievements of deep learning in NLU tasks, in particular on the GLUE benchmark (the WiC dataset is included in SuperGLUE, by the way).

What I found surprising is that the best score among the participants was achieved by a clustering-based approach (Daniel Loureiro and Alípio Mário Jorge, “LIAAD at SemDeep-5 Challenge: Word-in-Context (WiC)”), i.e. one not using a deep network classifier. One may argue that the authors clustered embeddings, and the embeddings are produced by a neural network. This is true; however, the decision-making unit, the classifier itself, is not a neural network. Their result is below 70%, leaving a gap of more than 10% to human performance. Hence, the task is challenging for machines and progress is still to be made!
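
Out of curiosity, here is a minimal sketch of such a non-neural decision rule, assuming a pre-trained BERT from the Hugging Face transformers library. This is a simplified illustration, not the LIAAD system (which builds sense embeddings with WordNet): the embeddings come from a neural network, but the final decision is a plain cosine-similarity threshold whose value would be tuned on the development set.

```python
# A hedged sketch: contextual embeddings from a neural encoder, but a
# simple similarity threshold as the decision-making unit. The model
# name and threshold value are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def target_embedding(sentence, target):
    """Contextual embedding of the first sub-token of `target` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(target_ids) + 1):
        if ids[i:i + len(target_ids)] == target_ids:
            return hidden[i]
    raise ValueError(f"'{target}' not found in: {sentence}")

def same_sense(target, context1, context2, threshold=0.6):
    e1 = target_embedding(context1, target)
    e2 = target_embedding(context2, target)
    return torch.cosine_similarity(e1, e2, dim=0).item() >= threshold

# "bank" is used in two different senses here, so we expect False
print(same_sense("bank",
                 "He sat on the bank of the river.",
                 "She deposited the money at the bank."))
```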

Using hyperbolic large-margin classifiers for biological link prediction. Asan Agibetov, Georg Dorffner and Matthias Samwald

link to paper

Is “joint deformation” a consequence of “TNF alpha production”?

The considered task is link prediction, i.e. predicting new links (edges in the graph) between existing entities. Experiments are carried out in the biological domain.

… the main challenge is how to find the best representations for the links. This is the core subject of the recent research trend in learning suitable representations for knowledge graphs, largely dominated by so-called neural embeddings.

In language processing, we normally learn embeddings for tokens, sometimes also for parts of tokens or even for single characters; in link prediction we embed graph nodes, i.e. entities. This direction of research has some years of history. Recently, however, hyperbolic embeddings have demonstrated impressive results for representing hierarchies (see also how to implement Poincaré embeddings). Agibetov et al. employ these hyperbolic embeddings together with hyperbolic SVM classifiers to predict new links.

… (flat) Euclidean classifiers misuse all the learned curved information that lives in hyperbolic embeddings.

Hyperbolic classifier (left) vs linear classifier (right).

Promising results are obtained on the UMLS dataset with a low-dimensional embedding space.

This is important for scalability and interpretability (2- or 3-dimensional embeddings are easier to visualize).

Most likely, when the data contains natural hierarchies, one can expect even better results. How could we use this? It would be interesting to carry out an experiment on a thesaurus (a hierarchy of concepts) with some additional relations; at the very least we can expect the results to be nicely visualizable. A minimal sketch of training such hierarchy embeddings follows below.
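
As a starting point for such an experiment, here is a sketch of training Poincaré embeddings on a toy hierarchy with gensim. The relations below are made up for illustration; the paper works with biological graphs such as UMLS, and the hyperbolic SVM classifier on top is not covered here.

```python
# A minimal sketch of hyperbolic (Poincare) embeddings with gensim.
# The toy (child, parent) relations are illustrative assumptions.
from gensim.models.poincare import PoincareModel

relations = [
    ("beagle", "dog"), ("poodle", "dog"), ("dog", "mammal"),
    ("cat", "mammal"), ("mammal", "animal"), ("bird", "animal"),
]

# Two dimensions are often enough for hierarchies in hyperbolic space,
# which also makes the resulting embeddings easy to visualize.
model = PoincareModel(relations, size=2, negative=2)
model.train(epochs=100)

# Hyperbolic distances: closely related nodes should end up closer
print(model.kv.distance("beagle", "dog"))   # expected: small
print(model.kv.distance("beagle", "bird"))  # expected: larger
```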

How to Use Gazetteers for Entity Recognition with Neural Models. Simone Magnolini, Valerio Piccioni, Vevake Balaraman, Marco Guerini and Bernardo Magnini

link to paper

A gazetteer is a list of domain-specific titles. Each title refers to some entity, but no additional information about the entities is provided. Hence, a gazetteer just contains a list of strings that hopefully appear in similar contexts.

Examples of gazetteers include dishes (“beef stroganoff”, “grilled fish tacos”, “goulash”), items of furniture (“rocking chair”, “canopy bed”, “writing desk”), etc. Notice that an item in a gazetteer is not necessarily a named entity: it is not necessarily capitalized, and it may lack other linguistic features of named entities.

Obviously, being able to automatically create a list of all dishes, and to keep this list up-to-date as new dishes come up, would be very useful; you can surely come up with useful applications in your own area. So, how do we do this? How do we recognize entities using gazetteers?

Structure of the neural gazetteer entity recognition (NNg). The input layer concatenates the features in a single vector.
  1. We may try to employ NER tools. These work well and many are openly available on the web. Fine, but where do we get the training data? This is where Magnolini et al. employ gazetteers. The problem is how to find mentions of gazetteer items in a text. Consider the person name “Tom Cruise”. If you only take full occurrences of the given string, then you miss “Mr. Cruise went to Cannes.” If you take any single token, then you encounter a false positive in “Cruise was fantastic.” (see the sketch after this list). In order to mitigate this difficulty, the authors use a special neural network, NNg, that is pre-trained on gazetteers. The NNg classifier classifies an input sequence of tokens either as an entity of a certain gazetteer or as a non-entity, with a degree of confidence. However, in contrast to NER tools, NNg is not capable of finding new entities of the given types.
  2. As already discussed, the entities in gazetteers are not named entities, so existing NER tools might not be suitable. The authors rely on NeuroNLP2, which shows good results on NER benchmarks but is also useful for finding nominal entities.
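
To make the matching problem from point 1 concrete, here is a tiny sketch of the two naive strategies and how each fails on the examples from above:

```python
# Naive gazetteer matching: both strategies fail on the examples above.
gazetteer = ["Tom Cruise"]

def full_string_match(text):
    """Only exact occurrences of the full gazetteer entry count."""
    return any(entry in text for entry in gazetteer)

def any_token_match(text):
    """Any token shared with a gazetteer entry counts."""
    tokens = set(text.replace(".", "").split())
    return any(set(entry.split()) & tokens for entry in gazetteer)

print(full_string_match("Mr. Cruise went to Cannes."))  # False: a missed mention
print(any_token_match("Cruise was fantastic."))         # True: a false positive
```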

In the experimental results, the authors show that using NNg improves the results compared to the other strategies.

In a similar fashion, one can use a list of entities of some class from an (enterprise) KG to train an entity recognizer. Such entity recognizers would be a perfect tool for extending enterprise KGs with new entities! An additional benefit of working with KGs is that we usually have more information about the entities. We could try to go beyond the NNg strategy and disambiguate entity mentions more reliably using specialized disambiguation tools.

A Sequence Modeling Approach for Structured Data Extraction from Unstructured Text. Jayati Deshmukh, Annervaz KM and Shubhashis Sengupta

link to paper

Deshmukh et al. investigate a seq2seq neural network model that produces structured output from texts. There is no doubt about the importance of the task: a solution would allow us to automatically extract knowledge from texts; we could then expand our KGs with this knowledge and use it for all kinds of downstream tasks.

For training, the model takes pairs of text and table as input; the table contains the structured output the way we would like to get it from the model. Under the hood, the model labels each token in the input text with one of a pre-defined set of labels. In the picture, the set of labels includes “name”, “birth_date”, “death_date”, “nationality”, “branch”, etc.
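
The following toy snippet sketches this token-labeling view and how a table is assembled from the label sequence. The sentence and labels are illustrative; the actual model is a trained seq2seq network, not a hard-coded lookup.

```python
# A hedged sketch of assembling structured output from token labels.
text = "John Smith ( 1 May 1900 - 2 June 1960 ) was a British officer ."
tokens = text.split()

# Labels a trained model would predict, one per token ("O" = no label)
labels = ["name", "name", "O", "birth_date", "birth_date", "birth_date",
          "O", "death_date", "death_date", "death_date", "O", "O", "O",
          "nationality", "O", "O"]

# Group consecutive tokens by their label to build the "table"
record = {}
for token, label in zip(tokens, labels):
    if label != "O":
        record.setdefault(label, []).append(token)
table = {field: " ".join(value) for field, value in record.items()}
print(table)
# {'name': 'John Smith', 'birth_date': '1 May 1900',
#  'death_date': '2 June 1960', 'nationality': 'British'}
```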

The authors are not the first to consider this task, but the results are pretty good: around 80% accuracy on the Wikipedia Infobox dataset. Despite recognizing the importance of the task and the attractiveness of the results, the evaluation dataset is still to be questioned. Wikipedia abstracts and infoboxes represent a specific style of text: normally written in simple English, with sentences that are most likely not compound and with only one subject present. Other types of text, like news articles, research papers, or contracts, do not necessarily have this style. And if, for example, a sentence contains two subjects, both persons, how do we know whose birth date is mentioned? We see this approach working nicely when the style of the text and the subject are fixed, as in Wikipedia; the main concern is whether the approach would generalize to different types of texts.

It is also interesting to compare with an annotated dataset for relation extraction, like SemEval 2010 Task 8. In the SemEval dataset, the input consists of a sentence and two entities mentioned in the sentence, and the task is to decide whether the sentence expresses any of the 9 pre-specified relations between the given entities. We find this task formulation a bit more robust and generalizable, as it makes precise not only the relation itself but also the subject and the object of the relation.
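
For illustration, a SemEval 2010 Task 8 style instance looks roughly as follows (the markup format follows the task description; this particular sentence is made up):

```python
# An illustrative instance in the SemEval 2010 Task 8 format
sentence = "The <e1>fire</e1> was caused by an electrical <e2>fault</e2>."
label = "Cause-Effect(e2,e1)"  # e2 (fault) causes e1 (fire)
```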

Extending Neural Question Answering with Linguistic Input Features. Fabian Hommel, Matthias Orlikowski, Philipp Cimiano and Matthias Hartung

link to paper

The structure of the input embedding layer, enriched with additional linguistic inputs

Typically, Question Answering (QA) systems take a piece of text and a question as input. The piece of text is then processed token by token, or sometimes at the sub-token level: either character by character or with word pieces, as in BERT's default tokenizer. In this work, the authors extend the input to QA by enriching the piece of text with additional linguistic features of the tokens: parts of speech (PoS), dependency labels (DL), and semantic roles (SRL).
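
A minimal sketch of what such enrichment could look like, assuming spaCy for the linguistic annotations: in the paper the features are embedded and concatenated at the QA model's input layer, as in the figure above; here we only build the enriched token vectors, and the tag list is a simplification.

```python
# A hedged sketch: concatenate word vectors with one-hot PoS features.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "AUX",
            "PROPN", "NUM", "PUNCT", "OTHER"]

def one_hot(value, vocabulary):
    vec = np.zeros(len(vocabulary))
    idx = vocabulary.index(value) if value in vocabulary else len(vocabulary) - 1
    vec[idx] = 1.0  # unknown tags fall into the final "OTHER" slot
    return vec

doc = nlp("Who painted the Mona Lisa?")
for token in doc:
    # word embedding + one-hot PoS tag, concatenated into one vector
    enriched = np.concatenate([token.vector, one_hot(token.pos_, POS_TAGS)])
    print(token.text, token.pos_, token.dep_, enriched.shape)
```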

The results show, not surprisingly, that the best performance is achieved by a combination of all features; however, using just PoS information gets us pretty close to the full combination. This might be because, internally, the NN is able to “guess” the approximate results of dependency parsing from the PoS information alone.

Results: usage of linguistic features relative to baseline

In the SQuAD competition, the difference between the best performing systems is usually a fraction of a percent, so an improvement of 2–3% might be seen as very significant. The drawback of such an improvement is the loss of multilinguality: for each new language one not only has to find a training dataset but also a linguistic tool that can provide the needed annotations.

The paper is well-written and very easy to follow, thanks to its clear structure. It won the best paper award!

The paper was prepared within the Prêt-à-LLOD project, where both Semalytix and SWC are use case partners. The project aims at preparing linguistic tools and resources for industrial use and at promoting freely available linguistic (linked) open data.

Conclusions

The workshop left a very nice impression. As workshop schedules are not as strict as large conference schedules, there was some time after each talk, and the organizers nicely moderated a discussion around each presented paper. Since workshop papers usually present ongoing research or early discoveries, there was a chance to see promising new research directions and to exchange opinions with the authors.

  1. In all 4 papers reviewed in this blog post, we see the direction of using DL for the benefit of KGs. Indeed: link prediction with hyperbolic embeddings → expand KG; recognition of new entities and relation extraction → expand KG; better QA → better fact extraction → expand KG. Only in the Word in Context challenge is the opposite direction explored: WordNet (also a KG) actually helps to solve the task and is used intensively in the best-performing approach. We are not sure if this focus on mostly one direction is a coincidence or a general tendency. In any case, it is a nice motivation for researchers to explore the other direction more intensively as well: using KGs to improve DL systems.
  2. The usage of deep learning technologies to get a better understanding of the semantics of data has many applications. Many are connected to natural language understanding, including the works on Part-of-speech Tagging for Chinese Social Media Texts with Foreign Words and Learning Household Task Knowledge from WikiHow Descriptions. However, NLU is not the only application domain; consider at least the link prediction work by Agibetov et al.
  3. The presented works inspire further experiments. At the very least, the applications to (enterprise) KGs clearly promise additional functionalities and other potential improvements.


Artem Revenko
Semantic Tech Hotspot

PhD in applied math and CS. Interested in Semantic Web, NLP, Information Extraction, Machine Learning and friends