BERTweet: A Visual Summary For Mortals

satyabrata pal · Published in ML and Automation · Jun 9, 2020 · 6 min read

A BERT Model For Understanding Tweets

This is the first in a series of articles about research papers that I am posting. The aim is to distill each paper down for mere mortals like you and me.

Today I am going to talk about a recent research paper which I read. This particular paper is titled “BERTweet: A pre-trained language model for English Tweets”.

Meet the BERT

BERT is all the rage nowadays, and for good reason. Comb through your Twitter feed and you will get an idea of what I am talking about.

Currently there are lots of libraries which provide easy-to-use implementations of BERT, along with several models adapted from it.

This is yet another research paper about yet another BERT model, but this time the good old BERT is trained on tweets instead of a conventional English corpus like Wikipedia.

But before going any further, let me try to give a layman's view of what BERT is.

A Quickie About BERT

Bidirectional Encoder Representations from Transformers, or BERT for short, is a language model based on neural networks known as “transformers” (not the Optimus Prime kind). BERT was created and published in 2018 by Jacob Devlin and his colleagues at Google. For more details about BERT, refer to the links at the end of this article.
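If you want to see that in action, here is a tiny sketch using the Hugging Face transformers library; the checkpoint name bert-base-uncased and the example sentence are my own illustrative choices, not anything from the paper.

```python
# A minimal sketch of BERT's masked-language-modelling behaviour, using the
# Hugging Face `transformers` library. The checkpoint and the sentence are
# illustrative choices of mine, not taken from the BERTweet paper.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT looks at context on BOTH sides of [MASK] (hence "bidirectional")
# and returns its most likely candidates for the missing word.
for prediction in fill_mask("The weather in London is [MASK] today."):
    print(prediction["token_str"], round(prediction["score"], 3))
```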

In this paper the researchers have trained the BERT model on English tweets. Why is this special?

Well! Regular English sentences, like those on Wikipedia, look like this →

Excerpt from Wikipedia

These are regular English words combined to form sentences. Such sentences don't have anything funny going on in terms of grammar.

Now look at a tweet →

A tweet has an informal way of conveying information. Tweets are laden with images, hashtags and @-mentions. Oh! And most importantly, they are limited to a few characters only, 280 characters to be precise.

This makes it challenging to apply existing language models, pre-trained on large-scale conventional text corpora with formal grammar and regular vocabulary, to text analytics tasks on Tweet data.

To address this issue the researchers pre-trained a BERT model on an 80GB corpus of 850M English Tweets.
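From what I recall, the paper also normalizes the raw tweets before pre-training, mapping things like user mentions and links to placeholder tokens. The snippet below is only my rough, regex-based approximation of that idea, not the authors' exact pipeline (which, if memory serves, also translates emoji and uses a tweet-aware tokenizer).

```python
# A rough approximation of tweet normalization before pre-training:
# @mentions and URLs are replaced with placeholder tokens. This is my own
# sketch of the idea, not the authors' exact normalization pipeline.
import re

def normalize_tweet(text: str) -> str:
    text = re.sub(r"@\w+", "@USER", text)                      # @mentions -> @USER
    text = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", text)   # links -> HTTPURL
    return " ".join(text.split())                              # tidy whitespace

print(normalize_tweet("Reading this paper by @VinAIResearch https://example.com #NLP"))
# -> "Reading this paper by @USER HTTPURL #NLP"
```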

Hello BERT-Tweet: Code Name BERTweet

The researchers took the BERT-base model and trained it on English-language tweets using the same training procedure that was prescribed by the Facebook research group in the paper describing RoBERTa.

This paper doesn't detail the architecture of the RoBERTa model, as that can be found in the original paper from Facebook AI. Here is a short explanation of the RoBERTa approach by Edward Ma.

How Do They Pre-train BERT On A Tweet Corpus?

One interesting point is that the training data they created is a combination of two corpora.

  • The first corpus consists of tweets from January 2012 to August 2019.
  • The second corpus consists of tweets related to COVID-19, from January 2020 to March 2020.

The interesting thing here is that the pre-trained language model resulting from this research contains knowledge of the pre-COVID-19 world as well as of the world during the COVID-19 pandemic.

This opens up a variety of applications for this model. For example, if the model performs as well as claimed, then identifying COVID-19-related tweets among general tweets should prove easier.

Personally I haven’t tried this but I will give it a shot.

The BERTweet model is based on BERT-Base and thus has the same architecture.

BERT-base vs BERT-large from source

The above is an illustration of the comparison between the BERT-base and the BERT-Large model. This illustration and a nice explanation about BERT can be found in this article by Jay Alammar.
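The pre-trained weights have been released, so you can poke at BERTweet yourself. Assuming the checkpoint is on the Hugging Face model hub under the id vinai/bertweet-base (that id is my assumption, so do verify it), loading it looks roughly like this:

```python
# A sketch of loading the released BERTweet checkpoint with `transformers`.
# The model id "vinai/bertweet-base" is my assumption of where the weights
# live; depending on your transformers version you may also need extra
# tokenizer options (e.g. use_fast=False), so check the official repo.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")

inputs = tokenizer("So many new cases reported today :(", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per token, just like BERT-base.
print(outputs.last_hidden_state.shape)
```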

Now coming to the training part: they trained this model as per the RoBERTa training procedure.

What is the RoBERTa training procedure? Well! That deserves a separate post of its own, but here is the short story of its origin.

One fine morning, the brilliant people at Facebook AI, while tinkering around with the original BERT model, found that BERTy was somewhat malnourished (read: under-trained). So they thought: what if we could train it with much better hyperparameters and a much larger supply of nutrition (read: larger batches)? This is what they did.

They tweaked the original hyperparameters and trained BERT with larger mini-batches and learning rates, thus transforming BERT into a higher-level athlete code-named RoBERTa.
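As a rough illustration of what "larger mini-batches and learning rates" means, here is a toy side-by-side config. Every number below is an illustrative placeholder and not the exact value from either the BERT or the RoBERTa paper.

```python
# Toy illustration of the change described above: same model, but fed with
# much bigger mini-batches and a retuned learning rate. All numbers are
# placeholders, NOT the exact values from the BERT/RoBERTa papers.
bert_style = {"batch_size": 256, "peak_lr": 1e-4, "train_steps": 1_000_000}
roberta_style = {"batch_size": 8_192, "peak_lr": 6e-4, "train_steps": 500_000}

for key in bert_style:
    print(f"{key:>12}: {bert_style[key]:>12,} -> {roberta_style[key]:>12,}")
# Fewer update steps, but each step sees ~32x more sequences,
# i.e. a much larger supply of "nutrition" per update.
```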

If you want to read about a less exaggerated version of the RoBERTa origin story then read it here.

Judgement Day

As part of the experimental setup, the BERTweet language model was fine-tuned on two downstream tasks →

  1. POS tagging
  2. Text classification (a minimal fine-tuning sketch follows this list)
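To give a feel for what task 2 looks like in code, here is a minimal fine-tuning sketch using the Hugging Face Trainer. The checkpoint id, the two toy tweets, the label scheme and the hyperparameters are all placeholders I picked; this is not the authors' experimental setup.

```python
# A minimal text-classification fine-tuning sketch with the Hugging Face
# Trainer. The checkpoint id, toy data and hyperparameters are placeholders
# of mine, NOT the setup used in the BERTweet paper.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "vinai/bertweet-base"  # assumed checkpoint id; verify before use

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

texts = ["stuck in traffic again, great...", "new stadium opening downtown tonight!"]
labels = [0, 1]  # toy labels, e.g. 0 = complaint, 1 = announcement

class TweetDataset(Dataset):
    """Wraps tokenized tweets and labels for the Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bertweet-clf",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=TweetDataset(texts, labels),
)
trainer.train()  # fine-tunes the classification head (and the encoder)
```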

The scores obtained on these downstream tasks show that BERTweet outperforms RoBERTa-base and XLM-R-base on all experimental datasets.

However, when compared to RoBERTa-large and XLM-R-large, those two models perform better on POS tagging than BERTweet, which is attributed to their larger size.

Interestingly, BERTweet performs better on text classification than RoBERTa-large and XLM-R-large.

The Report Card

Finally I will conclude this post with some score cards from the original paper.

Keep scrolling for links to the original paper and code.

A Few Announcements

The first announcement is that I have launched my course Deep Learning: Code First Introduction Using Pytorch on Udemy.

  • It’s a short and simple course.
  • The course is just 47 minutes long, so it is suitable for those who are short on time but still want a quick introduction to the field of deep learning.
  • The course is based on a code first approach and is created with coders in mind.

So, use this link to head over to Udemy and buy this course. Currently there is a 95% discount on it. Buying this course will help me create more free content for you and will motivate me to stay on my mission to democratize machine learning.

The second announcement is that I have also launched my podcast, SimpleAI.

  • I created the podcast with the same aim as all my work. To democratize Machine Learning.
  • All my content in the podcast is geared towards making the different concepts of machine learning as simple as possible.
  • Like all my blogs and tutorials this podcast is also designed to be short, under 12 mins long.

Search for SimpleAI on Anchor.fm, Google Podcasts, Spotify or your favourite podcast player, and subscribe to get notified immediately when I post new episodes. Do comment on and share my podcast, as it would make it easier for others to discover it.

Here are some of the links to my podcast →

Some Important Links
