Fine-Grained Propaganda Detection and Classification with BERT

Getting started on SemEval 2020 Task 11: “Detection of Propaganda Techniques in News Articles”

Henry Kim
9 min read · Apr 29, 2020

Introduction

Although we may not always recognize it right away, propaganda is commonly found in everyday news articles and columns. Being ignorant of propagandized writing is dangerous, because it shapes information to foster predetermined agendas or ideologies and biases our views on worldwide issues. For example, take a look at this statement:

“Take it seriously, but with a large grain of salt.” Which is just Allen’s more nuanced way of saying: “Don’t believe it.” —Viganò. Kill the Messenger, Christopher A. Ferrara

By substituting a person's statement with a similar but distorted one, the author makes the argument easier to refute. This technique is called Straw Man, and there are many other propaganda techniques that may fly under our radar.

In the hope of better understanding how writers use propaganda and what kinds of techniques they employ, Natural Language Processing (NLP) researchers have organized workshops such as NLP4IF (NLP for Internet Freedom) to develop machine learning algorithms that detect and classify propaganda in text.

In this article, we will take a dive into the Propaganda Analysis Project team's recent initiative in SemEval 2020, “Detection of Propaganda Techniques in News Articles”, and set up basic BERT systems using online Python libraries to solve the shared tasks.

Image on Propaganda Analysis Project

Problem Definition

The Propaganda Analysis Project team divides the overall task into two subtasks:

Span Identification (SI) Task

This is a binary sequence classification task: given the text file of a news article, identify the text fragments that contain at least one propaganda technique.

Technique Classification (TC) Task

This is a multi-class sequence classification task: given a text fragment already identified as propaganda, classify it into one of 14 propaganda categories (there are 18 techniques in total, but some techniques with relatively low frequency are merged into a single “class” of techniques). Here is the list of 14 categories:

  • Appeal to Authority
  • Appeal to Fear Prejudice
  • Bandwagon, Reductio ad hitlerum
  • Black and White Fallacy
  • Causal Oversimplification
  • Doubt
  • Exaggeration, Minimisation
  • Flag-Waving
  • Loaded Language
  • Name Calling, Labeling
  • Repetition
  • Slogans
  • Thought-terminating Cliches
  • Whataboutism, Straw Man, Red Herring

For each technique, the organizers of this project have summarized descriptions in their previous work, under the Propaganda and its Techniques section.

Photo by Markus Spiske on Unsplash

Data

The dataset we will use is provided by the organizers of this project. After you register, they will send you an access link to your team page, where you can download the dataset along with some helper functions. The dataset consists of three main parts: folders of news articles, folders of SI labels for the articles, and folders of TC labels for the articles, each divided into three partitions: train, dev, and test. The official submission is on the test article set, but in this article we will generate results for the dev article set for the purpose of getting started with the project.

Now, setting technical aspects aside for a moment, let us look at an example article text file and the gold-label files associated with it:

As stated earlier, each article file is a text file whose lines are either a single sentence or an empty line (“\n”). Label files are simple tab-separated tables, with the first column always indicating the unique id of the article (in this case “123456”) and the last two columns indicating the starting and ending character offsets of a text fragment in the article. For instance, the first row of the SI labels indicates that characters 34 up to 40 of the article contain a propaganda technique, which matches the fragment babies. The TC label file has an additional second column, which denotes what type of propaganda the fragment is. Also note that the SI label file has one fewer row, because the index pairs (607, 653) and (635, 653) overlap: all overlapping indices are merged for the SI task, and the spans in the output file should be sorted.
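
As a quick illustration, here is a minimal sketch (my own helpers, not part of the provided tools) of how a label file with this layout might be read and how overlapping SI spans can be merged; the tab delimiter is an assumption, so adjust it if your copy of the data differs.

import csv

def read_label_file(label_path):
    """Read a label file: first column is the article id, last two columns are character offsets."""
    rows = []
    with open(label_path, "r", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            rows.append((row[0], int(row[-2]), int(row[-1])))
    return rows

def merge_overlaps(spans):
    """Merge overlapping (start, end) spans of a single article and return them sorted."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged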

In total, the provided dataset has 536 articles: 371 for the train set, 75 for the dev set, and 90 for the test set. The table below shows the distribution of each propaganda technique over the article sets (since this data comes from the gold label files, there is no data for the test set).

Table 1: Propaganda technique counts per article set and total / merged total counts
Photo by fabio on Unsplash

Setup & Preprocess

Now that we have looked thoroughly at the problem and the data we are going to use, let's set up the coding environment and preprocess the data to feed into the neural network models.

Setup

We will need various deep learning libraries, so one of the best environments to start in is Google Colab. After creating a new project, install the Python libraries transformers and tensorboardX (note: tensorboardX is not strictly necessary, but it provides a great logging tool). Also clone the repository that contains your dataset and helper functions.

!git clone [your_repository]
!pip install transformers
!pip install tensorboardX

Since generated files and checkpoints are lost when the connection to the runtime drops, it is recommended to mount your Google Drive to save files.

from google.colab import drive
drive.mount('/content/gdrive')

Now, import the built-in and installed libraries. You may add more libraries for your own use.

import os
import glob
import codecs
import csv
import pandas
import logging
import math
import numpy
import pickle
from collections import defaultdict, Counter
from multiprocessing import Pool, cpu_count
from tqdm import tqdm, tqdm_notebook, trange
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from tensorboardX import SummaryWriter
from transformers import (BertConfig, BertTokenizer, BertForSequenceClassification, WEIGHTS_NAME, AdamW, get_linear_schedule_with_warmup)
from sklearn.metrics import f1_score

Finally, define some global variables for folder and file paths, the technique-to-label mappings, and basic configs. (Note that the training arguments are also initialized here; change them manually before training if you wish, but we will use the same config for both the SI and TC BERT models in this article.)

train_articles = "datasets/train-articles"
dev_articles = "datasets/dev-articles"
train_SI_labels = "datasets/train-labels-task1-span-identification"
train_TC_labels = "datasets/train-labels-task2-technique-classification"
dev_SI_labels = "gold_labels/dev-labels-task1-span-identification"
dev_TC_labels = "gold_labels/dev-labels-task2-technique-classification"
dev_TC_labels_file = "gold_labels/dev-task-TC.labels"
dev_TC_template = "datasets/dev-task-TC-template.out"
techniques = "tools/data/propaganda-techniques-names-semeval2020task11.txt"
PROP_TECH_TO_LABEL = {}
LABEL_TO_PROP_TECH = {}
label = 0
with open(techniques, "r") as f:
    for technique in f:
        PROP_TECH_TO_LABEL[technique.replace("\n", "")] = int(label)
        LABEL_TO_PROP_TECH[int(label)] = technique.replace("\n", "")
        label += 1
device = torch.device("cuda")
n_gpu = torch.cuda.device_count()
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("LOG")
MODEL_CLASSES = {"bert": (BertConfig, BertForSequenceClassification, BertTokenizer)}
args = {"data_dir": "datasets/",
"model_type": "bert",
"model_name": "bert-base-uncased",
"output_dir": [your drive storage for SI model]
"max_seq_length": 128,
"train_batch_size": 8,
"eval_batch_size": 8,
"num_train_epochs": 1,
"weight_decay": 0,
"learning_rate": 4e-5,
"adam_epsilon": 1e-8,
"warmup_ratio": 0.06,
"warmup_steps": 0,
"max_grad_norm": 1.0,
"gradient_accumulation_steps": 1,
"logging_steps": 50,
"save_steps": 2000,
"overwrite_output_dir": False}

Preprocess

For our implementation, we will try different preprocessing methods for the input datasets. In total, there will be three different methods:

  • Method 1: For training input for SI task
  • Method 2: For evaluating input for SI task
  • Method 3: For training / evaluating input for TC task

The reason I chose a different preprocessing method for each input dataset is that the tasks operate at different levels of granularity: for the SI task we need to find spans within the text, which makes it a character-level classification task, while for the TC task we classify a given fragment, which is a word- to sentence-level classification task. The entire code needed for preprocessing is embedded below (beware, it is quite long!).

Since a full walkthrough would take a very long explanation, let me summarize how each preprocessing method works:

  • Method 1: For each article, read in the text and the list of span index pairs. Parse the article into sentences (already done if you read it in line by line) and further split sentences into fragments at the span indices whenever a propaganda fragment lies inside. Tokenize each fragment into a word sequence and store it with label 0 (non-propaganda fragment) or label 1 (propaganda fragment).
  • Method 2: For each article, read in the text, parse it into sentences, and tokenize them. Then generate a power set of each token sequence, but keep only the subsets whose elements are in continuous order. For example, the sentence “I like NLP.” is tokenized into [‘I’, ‘like’, ‘NLP’, ‘.’], and I take into account: [‘I’, ‘I like’, ‘I like NLP’, ‘I like NLP.’, ‘like’, ‘like NLP’, ‘like NLP.’, ‘NLP’, ‘NLP.’, ‘.’]. I decided to prune insignificant fragments like ‘.’ and such (see the sketch after this list).
  • Method 3: For each article, read in the text and store only the text fragments indicated by each index pair. Also store the matching labels for the training set, and random labels for the evaluation set.
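
Method 2 is the least conventional of the three, so here is a minimal sketch of the continuous-subsequence enumeration it describes; the helper name and the pruning rule (a minimum span length) are my own choices, and the embedded preprocessing code may differ in detail.

def continuous_subsequences(tokens, min_tokens=1):
    """Enumerate all token spans that keep their original, continuous order.

    Spans shorter than min_tokens are pruned, which drops fragments like a lone '.'.
    """
    spans = []
    for i in range(len(tokens)):
        for j in range(i + min_tokens, len(tokens) + 1):
            spans.append(tokens[i:j])
    return spans

# continuous_subsequences(['I', 'like', 'NLP', '.']) yields exactly the ten candidates
# listed above; with min_tokens=2 the single-token fragments are pruned.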

All the datasets are first stored in pandas.DataFrame format and then turned into TensorDataset format, which can be fed directly into the train() function. The actual preprocessing calls happen in the training and evaluation stages.
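
To give a rough idea of that last step, here is a hedged sketch of converting a DataFrame into a TensorDataset with the BERT tokenizer; the 'text' and 'label' column names are assumptions of mine, and it targets a recent version of the transformers library.

def dataframe_to_dataset(df, tokenizer, max_seq_length=128):
    """Tokenize a DataFrame with 'text' and 'label' columns into a TensorDataset."""
    encoded = tokenizer(df["text"].tolist(),
                        max_length=max_seq_length,
                        padding="max_length",
                        truncation=True,
                        return_tensors="pt")
    labels = torch.tensor(df["label"].tolist())
    return TensorDataset(encoded["input_ids"], encoded["attention_mask"], labels)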

Photo by Caspar Camille Rubin on Unsplash

Train & Evaluate

Task SI

After coming this far, let's not waste any time. As defined in the config above, we will use a pretrained BERT model, specifically the “bert-base-uncased” model with the number of labels equal to 2. The structure of the train() function is a modification of Thilina Rajapakse's implementation of transformers and the examples in Hugging Face's transformers library.
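
The full train() function is too long to reproduce here, but its core is roughly the skeleton below, wired to the args dictionary and the imports defined earlier; treat it as a sketch of the idea rather than the exact implementation (it omits logging, checkpointing, and gradient accumulation).

def train(train_dataset, model, args):
    """Minimal fine-tuning loop: forward/backward per batch, gradient clipping, linear warmup schedule."""
    train_dataloader = DataLoader(train_dataset,
                                  sampler=RandomSampler(train_dataset),
                                  batch_size=args["train_batch_size"])
    t_total = len(train_dataloader) * args["num_train_epochs"]
    optimizer = AdamW(model.parameters(), lr=args["learning_rate"],
                      eps=args["adam_epsilon"], weight_decay=args["weight_decay"])
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=args["warmup_steps"],
                                                num_training_steps=t_total)
    model.train()
    for _ in trange(int(args["num_train_epochs"]), desc="Epoch"):
        for batch in tqdm(train_dataloader, desc="Iteration"):
            input_ids, attention_mask, labels = (t.to(device) for t in batch)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs[0]  # cross-entropy loss returned by BertForSequenceClassification
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args["max_grad_norm"])
            optimizer.step()
            scheduler.step()
            model.zero_grad()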

After training, we evaluate. Due to the exponential increase in the number of sequences produced by Preprocessing Method 2, we preprocess and evaluate one file at a time and overwrite the variable to keep total RAM usage down.
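
A hedged sketch of that per-article loop is below; preprocess_article_for_SI and predict are hypothetical stand-ins for the preprocessing (Method 2) and classification helpers, and merge_overlaps is the span-merging helper sketched in the Data section.

all_predictions = {}
for article_path in sorted(glob.glob(os.path.join(dev_articles, "*.txt"))):
    # Preprocess a single article (Method 2), classify its candidate fragments,
    # then keep only the character spans predicted as propaganda, merged and sorted.
    eval_dataset, candidate_spans = preprocess_article_for_SI(article_path, tokenizer)  # hypothetical helper
    preds = predict(model, eval_dataset)                                                # hypothetical helper, one 0/1 per candidate
    spans = [span for span, p in zip(candidate_spans, preds) if p == 1]
    all_predictions[article_path] = merge_overlaps(spans)
    del eval_dataset  # free the candidate sequences before loading the next article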

Task TC

For the TC task, we use the same pretrained BERT model, but with the number of labels equal to 14.
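
Concretely, the only change from the SI setup is the size of the classification head, which we can take straight from the technique mapping built earlier (a small sketch, assuming the same config dictionary):

config_class, model_class, tokenizer_class = MODEL_CLASSES[args["model_type"]]
config = config_class.from_pretrained(args["model_name"], num_labels=len(PROP_TECH_TO_LABEL))  # 14 labels
tokenizer = tokenizer_class.from_pretrained(args["model_name"])
model = model_class.from_pretrained(args["model_name"], config=config).to(device)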

For evaluation, we can preprocess all the files at once for this task, so we need a separate helper function (which is very similar to classify_per_article() above!).
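
Once predictions are made, the integer labels can be mapped back to technique names with LABEL_TO_PROP_TECH and written against the provided template; the helper below is my own sketch, assuming the template is tab-separated with the article id first and the character offsets in the last two columns.

def write_tc_predictions(template_path, predictions, output_path):
    """Fill the TC template with predicted technique names, keeping ids and offsets untouched."""
    with open(template_path, "r", encoding="utf-8") as f_in, \
         open(output_path, "w", encoding="utf-8", newline="") as f_out:
        writer = csv.writer(f_out, delimiter="\t")
        for row, pred in zip(csv.reader(f_in, delimiter="\t"), predictions):
            writer.writerow([row[0], LABEL_TO_PROP_TECH[int(pred)], row[-2], row[-1]])

# Example: write_tc_predictions(dev_TC_template, predictions, "dev-task-TC-output.txt")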

For both tasks, the generated output files can be tested on your team page for this project. Simply upload and submit a file, and a results window will show your scores.

Photo by Startaê Team on Unsplash

Results and Discussions

To discuss how well our method did, we need some baseline algorithms for each task. For the SI task, we will use one algorithm that randomly generates span ranges (Baseline-Random) and another that simply returns the entire article for every article (Baseline-All). For the TC task, we will use one algorithm that returns the most frequent label (which is Loaded Language) for all fragments (Baseline-MostFreq) and another that performs linear regression on the fragments to generate predictions (Baseline-LinReg). Baseline-Random and Baseline-LinReg are the baselines used by the organizers, so output files should already be provided for you. Baseline-All and Baseline-MostFreq are not at all hard to generate, so I leave them as a little coding exercise.

After submitting my predictions, I got the following results:

Table 2: Results for SI Task
Table 3: Results for TC Task

First of all, the main takeaway is that the models can indeed identify propaganda spans in news articles and categorize given fragments by the propaganda technique used.

Although our model performs slightly worse on the SI task than simply marking all the text as propaganda, this makes sense: our preprocessing duplicates fragments many times and the predicted indices are merged afterwards, so the algorithm ends up doing something quite similar to selecting all the text.

For the TC task, our model beats both baselines, which hints at how powerful deep learning models like BERT can be without any hyperparameter tuning and with only one epoch of training. However, given how imbalanced the labels are in Table 1, it will be a challenge to push performance past a certain level: our output in fact only used the 3 most frequent labels. Even with just those 3 labels, though, the model can probably achieve much better than 35%, which leaves us room for exploration.

Photo by Javier Allegue Barros on Unsplash

What’s Next

Although our models do not yet far surpass the generic baseline methods, they do produce results, and there is plenty of room for further improvement: more advanced preprocessing methods, more robust network designs, or proper fine-tuning. I strongly encourage you to take this as a starting point and carry on with such efforts.

Thank you for reading this far, and good luck with your implementation!
