Detecting Fake News — with a BERT Model

Skillcate AI
9 min read · Oct 2, 2022


Fine-tuning Google’s BERT-base model with Transfer Learning

In the last couple of decades, the emergence of social media (Facebook, Instagram, Twitter, etc.) & messaging platforms (WhatsApp, Telegram, etc.) have brought us all closer than ever. Just imagine, how easy it is to voice an opinion today, on topics that matter to us.

But on the contrary, this very ease of information dissemination has also turned social media platforms into tools for spreading falsehood, popularly called fake news. It hardly takes a few hours for propaganda drivers to put something online and get it circulated, leading to conflict and defamation. So, it's quite pertinent to build sophisticated fake news detection algorithms that can flag online content spreading misinformation with adequate reliability.

Well, in this tutorial we shall build a powerful Fake News Detection Model, using a pre-trained BERT model, with the help of Transfer Learning.

Brief on this learning series..

Well, this article is actually the third & final instalment of my three-part learning series, where we are:

  1. Understanding intuition behind Transfer Learning,
  2. Deep-diving into Google's BERT Model — which has achieved state-of-the-art performance on language-understanding benchmarks, and finally
  3. Training (actually, fine-tuning) a Fake News Detection Model by transferring learning from the pre-trained BERT Model

Now, let’s continue further on this third part, where we build a sophisticated fake news detection model.

Watch the video tutorial instead

If you are more of a video person, go ahead and watch it on YouTube, instead. Make sure to subscribe to my channel to get access to all of my latest content.

Getting started..

This is the snapshot of the dataset we are using. Here’s the source for this dataset.

We have two separate .csv files, one having the real news, called true.csv, and another one, having the fake news, called fake.csv. Both files have exactly the same columns.

We have the title of the news, the entire news article text, the subject (which is basically the category of news), and the date on which it was published. For our use case, we shall merge these files into a single large dataset and add a new column 'Target' that will have 'True' against all observations from true.csv and 'Fake' against the observations from fake.csv.

Plan of action

Moving on, this is our step-by-step plan on building this project..

  • First up, we load our dataset, i.e., the true.csv and fake.csv files, merge them into one, and generate True / Fake labels
  • Then, we obtain the pre-trained BERT model to use as the base of our Fake News Detection Model, using Hugging Face's Transformers library. BERT has remarkable language comprehension capability, so it will help our model better understand news context and hence make intelligent predictions on whether news is fake or not
  • Then, we define our Base Model and the overall architecture. We shall be using PyTorch for defining, training, and evaluating our neural network models
  • After that, we freeze the weights of the starting layers from BERT. If we don't do this, we lose all of the previous learning
  • Then we create new Trainable Layers. Generally, the feature extraction layers are the only knowledge we reuse from the base model; to handle our specialized task, we must add additional layers on top of them. We also define a new output layer, as the final output of the pre-trained model will almost certainly differ from the output we want from our model, which is a binary 0 or 1
  • As the last step, we fine-tune our model. I'll talk more about fine-tuning in the next section
  • And once we are done, we move on to make predictions with our Fake News Detection Model on unseen data

Model training approaches

BERT is a big neural network architecture with a huge number of parameters, ranging from roughly 100 million to over 300 million. Training a BERT model from scratch on a small dataset would result in overfitting, so it is better to start from a pre-trained BERT model that was trained on a huge corpus. We can then further train the model on our relatively smaller dataset, a process known as model fine-tuning. There are three approaches for doing this:

  • First one, is to Train the entire architecture, where we can further train the entire pre-trained model on our dataset and feed the output to a softmax layer. In this case, the error is back-propagated through the entire architecture and the pre-trained weights of the model are updated based on the new dataset.
  • Second one, is to Train some layers while freezing others, which is like partially training a pre-trained model. We keep the weights of the initial layers of the model frozen while we retrain only the higher layers. We may do some trial-&-error to understand how many layers to freeze and how many to train.
  • And the third one, is to Freeze the entire architecture, where we basically freeze all the layers of the pre-trained model and attach a few neural network layers of our own and train this new model. Note that the weights of only the attached layers will be updated during model training.

In this tutorial, we will use the third approach. We will freeze all the layers of BERT during fine-tuning and append a dense layer and a softmax layer to the architecture.

Model building

Now, let's get started with building our Fake News Detection Model using Python. This is our project folder on Google Drive, having all the project-related files in one place. I'll share a link to this below. Here, b2_FakeNewsDetection is our Jupyter notebook. Let's fire it up to do a quick code walkthrough.

By the way, to proceed with this tutorial, a Jupyter Notebook environment with a GPU is recommended. One can be accessed through Google Colaboratory, which provides a cloud-based Jupyter Notebook environment with a free GPU. For this tutorial, we shall be working on Colab. Once you are on Colab, activate the GPU runtime by clicking on Runtime -> Change runtime type -> Select GPU.

Alright, now let’s get coding. As first step, let’s set up our working environment.

Here, we install Hugging Face's transformers library, which allows us to import a wide range of transformer-based pre-trained models. Additionally, we install pycaret. We also set up our working directory.
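For reference, here's a minimal sketch of what that setup cell can look like on Colab, assuming the project lives in a Google Drive folder (the folder path below is a placeholder, not the actual project path):

```python
# Install the libraries used in this tutorial (Colab already ships with PyTorch)
!pip install transformers pycaret

# Mount Google Drive and move into the project folder
# (the path below is a placeholder -- point it to your own copy of the project)
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/FakeNewsDetection
```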

Next up, let’s load the dataset.

Here, first up we load the true and fake csv files as pandas dataframes. Then, we create a column 'Target', where we put the labels as True / Fake. Finally, we merge the two dataframes into a single dataframe, data, and shuffle the rows so the two classes are randomly mixed.
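A quick sketch of this loading-and-merging step (the column name Target and the dataframe name data follow the walkthrough; the exact file paths are assumptions):

```python
import pandas as pd

# Load the two CSV files
true_df = pd.read_csv('true.csv')
fake_df = pd.read_csv('fake.csv')

# Tag each dataframe with its class before merging
true_df['Target'] = 'True'
fake_df['Target'] = 'Fake'

# Merge into one dataframe and shuffle so the two classes are randomly mixed
data = pd.concat([true_df, fake_df], ignore_index=True)
data = data.sample(frac=1.0, random_state=42).reset_index(drop=True)
data.head()
```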

Next up, the Target column has string values, which the model can't work with directly. So, we need to transform them into numeric form. To do this, we use Pandas get_dummies to create a new column called label, where all Fake observations get 1 and True observations get 0. Towards the end, to check if our data is balanced across the two labels, we plot a pie chart. As you can see, our data is fairly well balanced.
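Something along these lines, for instance (the Fake → 1 mapping mirrors the description above):

```python
# Encode the string labels: Fake -> 1, True -> 0
# (get_dummies yields one indicator column per class; we keep the 'Fake' column)
data['label'] = pd.get_dummies(data['Target'])['Fake'].astype(int)

# Quick balance check across the two classes
data['Target'].value_counts().plot.pie(autopct='%1.1f%%')
```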

Next up, we split our data into training, validation, and test sets, in a 70:15:15 ratio.
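A sketch of that split, assuming the title column is named title as in the dataset snapshot; a two-step train_test_split is one straightforward way to get 70:15:15:

```python
from sklearn.model_selection import train_test_split

# First carve out 70% for training, then split the remaining 30% in half
train_text, temp_text, train_labels, temp_labels = train_test_split(
    data['title'], data['label'],
    test_size=0.3, stratify=data['label'], random_state=42)

val_text, test_text, val_labels, test_labels = train_test_split(
    temp_text, temp_labels,
    test_size=0.5, stratify=temp_labels, random_state=42)
```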

Now we come to the BERT fine-tuning stage, where we shall perform transfer learning.

This is what we are doing here:

  • First up, we import the BERT-base model, which has 110 million parameters. Along with it, we also import the BERT Tokenizer (a quick sketch follows below, after these points)

There is an even bigger BERT model called BERT-large that has around 340 million parameters. For our use-case, however, BERT-base would do just fine.

  • Then as next step, we preprocess our input data. We shall use the title of the news article to train our model.

For sure, we could also use the news article text, but handling that much data would require more compute resources than Colab offers and would take a long time. So, we limit our scope here to just the title of the news article. And trust me, even with this, we shall get an accuracy score of around 90%.

  • Now, let's figure out how to standardize the length of our news titles, as it varies from one article to another. For this, we plot a word-count histogram to understand what the typical title length looks like. Here, you would see that the majority of titles are under 15 words
  • So, we shall pad (or truncate) all our titles to 15 tokens at the tokenization stage
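Here is a minimal sketch of these two steps, loading BERT-base with its tokenizer and eyeballing the title-length histogram (the variable names and bin count are my own choices):

```python
import pandas as pd
import matplotlib.pyplot as plt
from transformers import AutoModel, BertTokenizerFast

# Load the pre-trained BERT-base model and its tokenizer
bert = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Histogram of words per title, to pick a sensible sequence length
seq_len = [len(title.split()) for title in train_text]
pd.Series(seq_len).hist(bins=30)
plt.xlabel('Words per title')
plt.show()

MAX_LEN = 15  # most titles fall under 15 words, so we pad/truncate to this length
```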

With this understanding, now let's go ahead and tokenize our sequences, that is, the titles in our training, validation, and test sets.

We also convert the integer sequences to tensors. And finally, we define data loaders for both the train and validation sets. These data loaders will pass batches of train data and validation data as input to the model during the training phase.
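Sketched out, the tokenization, tensor conversion, and data-loader setup can look like this (the batch size of 32 is an assumption):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

def encode(texts):
    # Tokenize, then pad/truncate every title to MAX_LEN tokens
    return tokenizer.batch_encode_plus(
        texts.tolist(), max_length=MAX_LEN,
        padding='max_length', truncation=True)

tokens_train, tokens_val, tokens_test = encode(train_text), encode(val_text), encode(test_text)

# Convert the integer sequences, attention masks, and labels to tensors
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

# Data loaders feed batches to the model during training and validation
batch_size = 32
train_data = TensorDataset(train_seq, train_mask, train_y)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)

val_data = TensorDataset(val_seq, val_mask, val_y)
val_dataloader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=batch_size)
```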

Moving on, we freeze the pre-trained model weights. If you recall, I mentioned earlier in this tutorial that we would freeze all the layers of the model before fine-tuning it. So, let's do it now. This will prevent the BERT weights from being updated during fine-tuning.

If you wish to fine-tune even the pre-trained weights of the BERT model, simply skip this step.
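The freezing itself boils down to switching off gradients on the BERT parameters, something like:

```python
# Freeze all BERT parameters so only the layers we add on top get trained
for param in bert.parameters():
    param.requires_grad = False
```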

Moving on, we define our model architecture.

We are using PyTorch for defining, training, & evaluating our deep learning model. On top of the BERT network, we add dense layers 1 & 2, followed by a softmax activation. Then, we define our hyperparameters; we are using AdamW as our optimizer.

Then we define our loss function. And lastly, we keep the number of epochs at 2. With Colab's free GPU, one epoch might take up to 20 minutes. So, I'm keeping this number low to avoid waiting forever. Haha!
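Here is a sketch of that architecture and training setup. The hidden size of 512, dropout of 0.1, and learning rate of 1e-5 are assumptions; I pair a LogSoftmax output with NLLLoss, which is numerically equivalent to a softmax layer trained with cross-entropy:

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

class BERT_Arch(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)      # dense layer 1 on top of BERT's pooled output
        self.fc2 = nn.Linear(512, 2)        # dense layer 2 -> two classes (true / fake)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, sent_id, mask):
        # Use the pooled [CLS] representation from BERT
        _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)
        x = self.dropout(self.relu(self.fc1(cls_hs)))
        return self.softmax(self.fc2(x))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BERT_Arch(bert).to(device)

optimizer = AdamW(model.parameters(), lr=1e-5)  # AdamW optimizer
cross_entropy = nn.NLLLoss()                    # loss function (pairs with LogSoftmax)
epochs = 2                                      # number of training epochs
```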

So, just to summarise:

  • we have defined the model architecture,
  • we have specified the optimizer, and the loss function,
  • and our data loaders are also ready

Now, we need to define functions to train (or fine-tune) and evaluate our fake news detection model. Let’s do it:
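A compact sketch of those two routines, building on the loaders, model, optimizer, and loss defined above (the gradient clipping at 1.0 is an extra safeguard I've added):

```python
import torch

def train_epoch():
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        sent_id, mask, labels = [t.to(device) for t in batch]
        model.zero_grad()
        preds = model(sent_id, mask)
        loss = cross_entropy(preds, labels)
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # guard against exploding gradients
        optimizer.step()
    return total_loss / len(train_dataloader)

def evaluate():
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in val_dataloader:
            sent_id, mask, labels = [t.to(device) for t in batch]
            preds = model(sent_id, mask)
            total_loss += cross_entropy(preds, labels).item()
    return total_loss / len(val_dataloader)
```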

And finally, now we can start fine-tuning our BERT Model to learn fake news detection:
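The training loop can be as simple as the following, keeping a copy of the weights that do best on the validation set (the saved_weights.pt filename is my own choice):

```python
best_val_loss = float('inf')

for epoch in range(epochs):
    train_loss = train_epoch()
    val_loss = evaluate()

    # Keep the weights that perform best on the validation set
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    print(f'Epoch {epoch + 1}: train loss {train_loss:.3f}, val loss {val_loss:.3f}')
```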

Now let’s build a classification report on the test set using our fake news model:
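Roughly like so, reloading the best checkpoint and scoring the held-out test titles:

```python
import torch
from sklearn.metrics import classification_report

# Reload the best weights saved during fine-tuning
model.load_state_dict(torch.load('saved_weights.pt'))

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])

model.eval()
with torch.no_grad():
    preds = model(test_seq.to(device), test_mask.to(device))
    preds = torch.argmax(preds, dim=1).cpu().numpy()

print(classification_report(test_labels, preds))
```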

As you can see, we are getting a strong 88% accuracy.

Both precision & recall for class 1 are quite high, which means that the model predicts this class pretty well. If you look at the recall for class 1, it is 0.85, which means that the model was able to correctly classify 85% of the fake news as fake. Precision is 0.92, which means that 92% of the items the model classified as fake news are actually fake news.

Let's also run predictions on some sample news titles. The first two are fake and the next two are real.
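The inference step looks like this; the four headlines below are made-up placeholders standing in for the samples used in the notebook:

```python
import torch

# Placeholder headlines -- the first two mimic fake news, the last two real news
unseen_news = [
    "Breaking: celebrity secretly replaced by body double, insiders claim",
    "Scientists say drinking coffee lets you live to 200 years",
    "Senate passes annual budget bill after lengthy debate",
    "Central bank holds interest rates steady amid slowing inflation",
]

tokens = tokenizer.batch_encode_plus(
    unseen_news, max_length=MAX_LEN, padding='max_length', truncation=True)

with torch.no_grad():
    preds = model(torch.tensor(tokens['input_ids']).to(device),
                  torch.tensor(tokens['attention_mask']).to(device))
    preds = torch.argmax(preds, dim=1).cpu().numpy()

print(preds)  # 1 -> fake, 0 -> true
```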

Quite rightly, our model classifies all four of these titles correctly.

Guys, congratulations to you for making it to this point. Do give yourself a pat on the back for completing this Transfer Learning Fake News Detection Project all by yourself. ❤️❤️

Conclusion

To summarize, in this tutorial we fine-tuned a pre-trained BERT model to perform text classification on a small dataset. I urge you to fine-tune BERT on a different dataset and see how it performs. For example, you could build a sentiment classification or spam detection model. NLP use cases are endless, really.

You can even perform multiclass or multi-label classification with the help of BERT. In addition, you can train the entire BERT architecture if you have a bigger dataset.

In case you have any doubts or got stuck somewhere, leave a comment below, and I’ll help you out.

Let’s talk Machine Learning 🤗🤗

Guys, if you need further guidance on building a career in data science or any help related to this vast domain, you may go to my website www.skillcate.com and set up a free 1:1 mentoring session with me, by filling out this small form.

You may write to me over email and WhatsApp.

Good luck to you, bye!!
