How I Used RoBERTa to Learn Common Sense Knowledge for Spatial Reasoning

Ellen Schellekens
Ixor
Sep 13, 2021

Humans have a natural, profound understanding of the physical world we are born into. We know how objects move and where they are located in relation to us. This kind of common sense knowledge is learned implicitly and is difficult to capture in machine systems.

Even as the quality of language models improves, common sense remains a challenging and fascinating component. Language models like BERT have enabled breakthroughs in many applications, such as dialogue systems and generative language modelling. These models are pre-trained on huge datasets in an unsupervised manner. Recently, the literature has shifted towards viewing these models as knowledge bases rather than mere generators of syntactically correct language. After all, they encounter a lot of information during pre-training. Research shows that pre-training is sufficient for a model to learn factual and even some common sense knowledge [1, 2]. But do these models also capture common sense knowledge about the physical structure of our world?

This is the subject of my thesis: exploring whether neural language models are capable of qualitative spatial reasoning. Spatial reasoning is itself a very broad topic, so we restrict it to a specific domain: relative positional reasoning (left, right, above, below, in front of and behind). If the model is told that the cat is to the left of the tree, and the tree is to the left of the cow, can it deduce that the cat is also to the left of the cow?
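To make the inference concrete, here is a minimal hand-coded sketch of the transitive reasoning we want the model to perform (the facts and function names are illustrative, not part of the thesis code):

```python
# A hand-coded transitive reasoner over "left of" facts, for illustration only.
# The thesis asks whether RoBERTa can perform this inference from text alone.
facts = {("cat", "tree"), ("tree", "cow")}  # (a, b) means "a is left of b"

def left_of(a, b, facts):
    """True if 'a is left of b' follows from the facts by transitivity."""
    if (a, b) in facts:
        return True
    return any(x == a and left_of(y, b, facts) for (x, y) in facts)

print(left_of("cat", "cow", facts))  # True: cat < tree < cow
```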

Constructing a Dataset of Relative Positions

The data we use for training and testing is textual and consists of two parts:

  • The context explains the relative locations of a number of objects.
  • The question contains two objects that appeared in the context, with a masked positional predicate. This predicate can only be determined by applying spatial reasoning to the information given in the context. An example instance is shown below.
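For illustration, an instance might look like this (the wording is illustrative, not verbatim from the dataset):

Context: “There is a cat, a tree and a cow. The cat is to the left of the tree. The tree is to the left of the cow.”
Question: “The cat is to the <mask> of the cow.”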

To make the data as realistic as possible, it contains positional relations in three dimensions and is based on real-life scenarios. The two existing datasets on textual relative positions did not satisfy these conditions, so we created a new dataset.

We extracted our scenarios from the COCO image dataset [3]. This ensures that the data is both based on real-life scenarios and contains relative positions in three dimensions. However, extracting 3D spatial information from the images is not straightforward, as they are only 2D representations of a 3D world. To do this, we used the FCRN depth estimator [4], which takes an image as input and outputs a heatmap of estimated depths. An example of such a heatmap can be seen below.

An example output of the FCRN depth estimator [4] on an image of the COCO dataset
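As a rough illustration of this step, the snippet below runs an off-the-shelf monocular depth estimator from torch.hub. Note that it uses MiDaS as a readily available stand-in for FCRN [4], and the image file name is hypothetical:

```python
import cv2
import torch

# Load a small monocular depth model from torch.hub (MiDaS as FCRN stand-in).
model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
model.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("coco_image.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    depth = model(transform(img)).squeeze()  # 2D map of relative depths
```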

With the images and their respective depth estimations, we can extract our scenarios. The COCO dataset provides segmentations for the objects in an image. In the first step, these segmentations and the depth estimations are used to extract the spatial relations between each pair of objects. In the next step, a number of these relations are chosen to form the context. Based on the chosen relations, the target relation is determined; it can only be derived from the relations mentioned in the context using either transitive or reverse reasoning. Lastly, the context and question are formulated in natural language.
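A minimal sketch of how such pairwise relations could be derived from segmentation masks and a depth map; it assumes each object is a dict with a boolean `mask` and an aligned `depth` array (both names hypothetical, and the dominant-axis rule is a simplification, not necessarily the thesis method):

```python
import numpy as np

def centroid(obj):
    """Mean (x, y) over the segmentation mask, plus mean depth z."""
    ys, xs = np.nonzero(obj["mask"])
    return xs.mean(), ys.mean(), obj["depth"][ys, xs].mean()

def relation(obj_a, obj_b):
    """Relative position of A with respect to B along the dominant axis."""
    (xa, ya, za), (xb, yb, zb) = centroid(obj_a), centroid(obj_b)
    dx, dy, dz = xa - xb, ya - yb, za - zb
    axis = int(np.argmax(np.abs([dx, dy, dz])))
    if axis == 0:
        return "right of" if dx > 0 else "left of"
    if axis == 1:
        return "below" if dy > 0 else "above"  # image y grows downward
    return "behind" if dz > 0 else "in front of"  # larger depth = farther
```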

We use the language model RoBERTa [5], which is pre-trained using masked-language modelling, so the question is formulated as a masked sentence: the positional relation is replaced with a mask token, and the model predicts the most probable word to fill in the mask. Masked sentences are a promising probing approach in the zero-shot setting, as the research of Schick and Schütze shows [6, 7]. Since the specific formulation of a masked sentence can influence the performance of the model, we use multiple patterns, each with a different formulation, to get a more robust evaluation.
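The zero-shot probe can be reproduced in spirit with the Hugging Face fill-mask pipeline; the sentence wording below is illustrative, not one of the actual thesis patterns:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")
text = ("The cat is to the left of the tree. "
        "The tree is to the left of the cow. "
        "The cat is to the <mask> of the cow.")
# Restrict scoring to the two plausible answers (note RoBERTa's leading space).
for cand in fill_mask(text, targets=[" left", " right"]):
    print(cand["token_str"], round(cand["score"], 4))
```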

Results

The goal of the research is twofold: evaluate whether the model captures such common sense knowledge during pre-training, and evaluate whether it can learn it during fine-tuning. The first situation is also known as zero-shot learning. We also perform ablation studies on the number of objects mentioned in the context, the complexity of the sentences, and the effect of providing additional information in the context.

Zero-shot Evaluation

The figure below shows the confusion matrix of the results in a zero-shot setting. In this masked sentence pattern, the model had to fill in the mask with either ‘yes’ or ‘no’. The model mostly picks the same answer, regardless of the information given in the context, and cannot outperform random guessing. This result is observed across all masked sentence patterns, and changing factors such as the number of objects or the sentence complexity does not alter this behaviour. This suggests that the model does not learn spatial common sense during pre-training.

Confusion matrix of one of the masked sentence patterns in a zero-shot setting: the model picks the same word to fill in the blank 66.44% of the time.

Fine-tuned Evaluation

After training the model on a training set, its performance on the test set became nearly perfect. This is further confirmed when we look at how many patterns the model gets right for each scenario. As the figure below shows, the model gets all patterns correct for around 80% of the scenarios, which is quite good. When we add extra information (the caption of the original image) to the context, the performance improves even more. Since the data only contains six relative locations, the model has most likely overfitted on this limited set of relations. Future research could study whether the reasoning ability of the fine-tuned model transfers to unseen patterns as well.

Visualisation of the number of patterns the model predicts correctly for each scenario. In nearly 80% of the scenarios, the model predicts all the patterns correctly. This shows the robustness of the fine-tuned model.
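For completeness, a minimal fine-tuning sketch with the Hugging Face Trainer. It assumes a `train_ds` of tokenised context–question pairs where only the masked relation token carries a label; the output directory and hyperparameters are illustrative, not those of the thesis:

```python
from transformers import (AutoModelForMaskedLM, Trainer, TrainingArguments)

model = AutoModelForMaskedLM.from_pretrained("roberta-base")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="spatial-roberta", num_train_epochs=3),
    train_dataset=train_ds,  # assumed: labels are -100 everywhere but the mask
)
trainer.train()
```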

Conclusion

Although research shows that language models can be used as knowledge bases thanks to the information they absorb during pre-training, this research shows that they do not learn the common sense needed to reason about relative locations. This knowledge can, however, be learned, as the fine-tuning experiments clearly show. Future research can extend the number of relative positions and explore other kinds of common sense knowledge.

References

[1] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel. Language models as knowledge bases?, 2019.
[2] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019.
[3] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context, 2015.
[4] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
[5] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach, 2019.
[6] T. Schick and H. Schütze. Exploiting cloze questions for few shot text classification and natural language inference, 2021.
[7] T. Schick and H. Schütze. It’s not just size that matters: Small language models are also few-shot learners, 2021.

At IxorThink, the machine learning practice of Ixor, we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable products from proof-of-concept to deployment. Feel free to contact us for more information.
