Hallo! Hallo! KU Leuven & TU Berlin Introduce ‘RobBERT,’ a SOTA Dutch BERT

Synced
Jan 23 · 3 min read

The first language of around 24 million people and a second language for nearly 5 million, Dutch is the third most widely spoken Germanic language, after English and German. A group of researchers from Belgium’s Katholieke Universiteit Leuven and the Technische Universität Berlin recently introduced a Dutch RoBERTa-based language model, RobBERT.

First introduced in 2018, Google’s BERT (Bidirectional Encoder Representations from Transformers) is a powerful and popular language representation model designed to pre-train deep bidirectional representations from unlabeled text. Several studies have found that BERT models trained on a single language outperform the multilingual version on tasks in that language.

Unlike previous efforts, which trained Dutch-language models on earlier implementations of BERT, the new research uses RoBERTa, a robustly optimized variant of BERT introduced in July 2019 by researchers from Facebook AI and the University of Washington. RobBERT was pre-trained on 6.6 billion words, totaling 39 GB of text, from the Dutch section of the OSCAR corpus.

The researchers evaluated RobBERT in different settings on multiple downstream tasks. They compared its performance on sentiment analysis using the Dutch Book Reviews Dataset (DBRD), and on a task specific to the Dutch language: predicting whether “die” or “dat” (both of which can mean “that”) belongs in a sentence, using the Europarl utterances corpus. The results show that RobBERT outperforms existing Dutch BERT-based models such as BERTje on sentiment analysis and achieves state-of-the-art results on the “die/dat” disambiguation task.
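
To make the “die/dat” task concrete, here is a minimal sketch of how a sentence could be turned into prediction examples: each occurrence of “die” or “dat” is replaced by a mask token, and the removed word becomes the label. This is an illustration under assumptions, not the authors' preprocessing; the function name `make_die_dat_examples` and the `<mask>` token string are hypothetical choices for this sketch.

```python
import re
from typing import List, Tuple

# Assumption: a RoBERTa-style mask token; the article does not specify one.
MASK = "<mask>"

# Match "die" or "dat" as whole words, ignoring case.
_DIE_DAT = re.compile(r"\b(die|dat)\b", re.IGNORECASE)


def make_die_dat_examples(sentence: str) -> List[Tuple[str, str]]:
    """Return one (masked_sentence, label) pair per occurrence of die/dat."""
    examples = []
    for match in _DIE_DAT.finditer(sentence):
        # Mask only this occurrence; other occurrences stay visible as context.
        masked = sentence[:match.start()] + MASK + sentence[match.end():]
        examples.append((masked, match.group(1).lower()))
    return examples
```

For a sentence such as "Het boek dat ik lees en de film die ik zag." this yields two examples, one labeled "dat" and one labeled "die". A model then has to recover the label from the surrounding words alone.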

The paper identifies possible improvements and future directions for this research, including training similar models, changing the training data format, changing the pre-training tasks (for example, to sentence order prediction), and applying RobBERT to additional Dutch language tasks.

The pretrained RobBERT models can be used with Hugging Face’s transformers and Facebook’s Fairseq toolkit. The RobBERT logo, incidentally, derives from the fact that the word “rob” also means “seal” in Dutch.
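
As a rough sketch of what using the model through Hugging Face's transformers might look like, the snippet below wraps the library's fill-mask pipeline in a small helper. The article does not give a model identifier, so the `MODEL_ID` value is an assumption that should be checked against the project's GitHub page; the helper name `fill_mask` is likewise hypothetical.

```python
# Assumption: this Hub identifier is not stated in the article;
# verify it against the RobBERT GitHub README before relying on it.
MODEL_ID = "pdelobelle/robbert-v2-dutch-base"


def fill_mask(sentence: str, model_id: str = MODEL_ID, top_k: int = 3):
    """Return (token, score) pairs for the single <mask> token in sentence."""
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import pipeline

    # Downloads the tokenizer and weights on first use.
    unmasker = pipeline("fill-mask", model=model_id)
    predictions = unmasker(sentence, top_k=top_k)
    return [(p["token_str"].strip(), p["score"]) for p in predictions]
```

Calling `fill_mask("Er staat een <mask> in mijn tuin.")` would download the model on first use and return the highest-scoring candidate words for the masked position. The sentence should contain exactly one `<mask>` token for the return shape shown here.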

The paper RobBERT: a Dutch RoBERTa-based Language Model is on arXiv. The model and code are available on GitHub.

Author: Yuqing Li | Editor: Michael Sarazen
