Hallo! Hallo! KU Leuven & TU Berlin Introduce ‘RobBERT,’ a SOTA Dutch BERT

Synced | Published in SyncedReview | Jan 23, 2020

The first language of around 24 million people and a second language for nearly 5 million, Dutch is the third most widely spoken Germanic language, after English and German. A group of researchers from Belgium’s Katholieke Universiteit Leuven and the Technische Universität Berlin recently introduced a Dutch RoBERTa-based language model, RobBERT.

First introduced in 2018, Google’s BERT (Bidirectional Encoder Representations from Transformers) is a powerful and popular language representation model designed to pre-train deep bidirectional representations from unlabeled text. Studies show that BERT models trained on a single language notably outperform the multilingual version on that language’s tasks.

Unlike previous approaches that used earlier implementations of BERT to train a Dutch-language BERT, the new research builds on RoBERTa, the improved version of BERT introduced last summer by researchers from Facebook AI and the University of Washington. RobBERT was pre-trained on 6.6 billion words, totaling 39 GB of text, from the Dutch section of the OSCAR corpus.

Researchers evaluated RobBERT in different settings on multiple downstream tasks, comparing its performance on sentiment analysis using the Dutch Book Reviews Dataset (DBRD) and on a task specific to Dutch: choosing between the words “die” and “dat” (both roughly meaning “that” or “which”) in sentences from the Europarl corpus. The results show that RobBERT outperforms existing Dutch BERT-based models such as BERTje on sentiment analysis and achieves state-of-the-art results on the “die”/“dat” disambiguation task.
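The sentiment-analysis setup can be illustrated with a short sketch that loads a RoBERTa-style checkpoint with a two-label classification head, as one would before fine-tuning on DBRD. The Hugging Face model identifier and the example review below are assumptions for demonstration, not details taken from the paper.

```python
# Hedged sketch: a RoBERTa-style Dutch checkpoint with a binary classification
# head, as a starting point for fine-tuning on DBRD sentiment labels.
# "pdelobelle/robbert-v2-dutch-base" is an assumed hub identifier; substitute
# the officially released RobBERT weights.
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

model_name = "pdelobelle/robbert-v2-dutch-base"  # assumption
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical review ("This book was beautifully written."). The classification
# head is freshly initialized here, so the scores are only meaningful after
# fine-tuning on DBRD.
inputs = tokenizer("Dit boek was prachtig geschreven.", return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # e.g. [negative, positive] probabilities after fine-tuning
```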

The paper also identifies possible improvements and future directions for this research, such as training similar models with different pre-training tasks (for example, sentence order prediction), changing the format of the training data, and applying RobBERT to additional Dutch language tasks.

The pretrained RobBERT models can be used with Hugging Face’s transformers and Facebook’s Fairseq toolkit. The RobBERT logo, incidentally, derives from the fact that the word “rob” also means “seal” in Dutch.
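As a minimal sketch of the Hugging Face route, the snippet below points the masked-word pipeline at a “die”/“dat” gap and scores only those two candidates, one way the disambiguation task can be framed without training a separate classifier. The hub identifier is an assumption and may differ from the officially released checkpoint name.

```python
# Hedged example: zero-shot "die"/"dat" scoring via masked-word prediction.
# The model identifier is an assumption; replace it with the released RobBERT checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")

# "There is a tree in the garden that is very old." -> the correct pronoun is "die".
sentence = "Er staat een boom in de tuin <mask> heel oud is."
for candidate in fill_mask(sentence, targets=["die", "dat"]):
    print(candidate["token_str"], round(candidate["score"], 4))
```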

The paper RobBERT: a Dutch RoBERTa-based Language Model is on arXiv. The model and code are available on GitHub.

Author: Yuqing Li | Editor: Michael Sarazen

