How Texty detects and makes sense of manipulative news

Here we explain the methodology behind three of Texty’s projects about media: “Hot Disinfo from Russia” (2019–2020), “FakeCrunch” (“Фейкогриз”, 2019), and “We’ve Got Bad News” (2018). All of them rely on ULMFiT, a deep learning classifier architecture from fast.ai, which we trained to detect manipulative news in Russian and, partly, in Ukrainian. The combination of “Hot Disinfo from Russia” and “FakeCrunch” won the Sigma Awards 2020, an international data journalism competition, in the “best news application” category. This article covers data collection, preprocessing, labeling, and training a classifier, as well as the nitty-gritty details of each project.

texty.org.ua · Mar 3, 2020

Inspiration

Fakes, bots, and trolls became buzzwords in Ukraine long before they did in the US. The reason is the Russian invasion of Crimea and Donbas in 2014, which has been accompanied by mass disinfo campaigns, ranging from primitive fakes about Ukrainian “nationalists eating babies” to subtle claims by “patriots” that the authorities are profiteering from the bloodshed. We first encountered this in 2016, when we prepared a visualization of connections between groups and people related to one Russian troll on Facebook. It became clear that a great share of the trolls’ content consisted of links to other websites, and we did not have a clue about the obscure sites they were referring to: neither Ukrainian mainstream media nor the likes of Russia Today.

As data journalists, we went down the rabbit hole: we collected suspicious websites and tried to make sense of what they write about by becoming their devoted readers. Almost immediately we understood that a human being can’t read it all and, most importantly, can’t find any logic or patterns in thousands of news items.

Data collection and preprocessing

The data behind our quantitative disinfo projects comes from around 400 Ukrainian and Russian websites. Every hour our script collects RSS feeds from these sites. We use scrapy for fast, fault-tolerant asynchronous requests and feedparser to process the feeds. From the feeds we get links, publication times, and titles. All the data is stored in a PostgreSQL database; this solution proved more convenient for us than ElasticSearch or specialized corpora formats.
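Below is a minimal sketch of this collection step, assuming a hypothetical feed URL and simplified item fields; in the real pipeline the list of feeds comes from our site registry and the items are written to PostgreSQL.

```python
# Minimal sketch of the hourly RSS collection step (hypothetical feed URL,
# simplified fields); items would normally be stored in PostgreSQL.
import feedparser
import scrapy


class RssSpider(scrapy.Spider):
    name = "rss"
    start_urls = ["https://example-news-site.ua/rss"]  # ~400 feeds in practice

    def parse(self, response):
        feed = feedparser.parse(response.text)
        for entry in feed.entries:
            # Links, publication time and title are all we take from the feed;
            # full texts are fetched later by a separate spider.
            yield {
                "url": entry.get("link"),
                "title": entry.get("title"),
                "published": entry.get("published"),
            }
```

Scrapy handles the asynchronous requests, retries, and politeness settings; writing the yielded items to the database would normally live in an item pipeline.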

Another scrapy parser collects full texts from the links obtained from RSS. Readability (we use the Python implementation) has saved us from writing a parser for each site’s HTML. It removes redundant elements from the page, leaving us the headline and text of the article: no menus, ads, footers, or comment sections. What is left for us is to remove images and embeds with BeautifulSoup, write a few rules for sites where readability regularly fails, and write the full texts to the database.
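A simplified version of that extraction step might look like this (per-site fixes for readability failures are omitted; the function name is ours):

```python
# Sketch of full-text extraction with readability-lxml and BeautifulSoup.
from bs4 import BeautifulSoup
from readability import Document  # pip install readability-lxml


def extract_article(html: str) -> dict:
    doc = Document(html)
    # readability keeps the headline and article body, dropping menus,
    # ads, footers and comment sections.
    soup = BeautifulSoup(doc.summary(), "lxml")
    for tag in soup(["img", "figure", "iframe", "script"]):
        tag.decompose()  # strip images and embeds before storing plain text
    return {"title": doc.title(), "text": soup.get_text(" ", strip=True)}
```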

For each full text we detect the language (with langid), tokenize it (splitting word units and punctuation; we plan to switch to SentencePiece) with the excellent Ukrainian library tokenize_uk, which works for Russian as well, lemmatize it (getting normal forms of words with pymorphy2, used for topic modeling only), and convert the unlemmatized tokens to ids for the ULMFiT classifier; ids are simply the indices of words in the vocabulary.
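A rough sketch of this preprocessing, with a toy vocabulary lookup standing in for the real ULMFiT vocabulary that fastai manages for us:

```python
# Per-article preprocessing sketch; `vocab` is a toy stand-in for the
# ULMFiT vocabulary, which fastai handles in the real pipeline.
import langid
import pymorphy2
import tokenize_uk

morph = pymorphy2.MorphAnalyzer()  # Russian by default; a Ukrainian dictionary also exists


def preprocess(text: str, vocab: dict) -> dict:
    lang, _score = langid.classify(text)               # e.g. 'uk' or 'ru'
    tokens = tokenize_uk.tokenize_words(text)          # word units and punctuation
    lemmas = [morph.parse(t)[0].normal_form for t in tokens]  # for topic modeling only
    ids = [vocab.get(t, 0) for t in tokens]             # 0 = unknown token
    return {"lang": lang, "tokens": tokens, "lemmas": lemmas, "ids": ids}
```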

As a result, we have more than 90,000 pieces of news per week. Some of them are in Ukrainian, some in Russian, and a lot are irrelevant to us: sports, celebrities, weather, etc. We’ve built another ULMFiT classifier to detect irrelevant pieces. It cuts off all the news that is unimportant for us; the rest goes on to classification for manipulativeness.

AI classification

We developed the first version of the manipulative news classifier in 2018 for the “We’ve Got Bad News” project. Later we rewrote the code for new versions of fast.ai and fed it new labels, but the core workings stayed the same. The detailed methodology of classification is here (we plan to update it in case of big changes).

We developed an AI to find “bad news” because it is impossible to write down all the rules that help us distinguish manipulative news from normal news. Machine learning is just the right solution when you can tell whether something is fake or not (just as you can tell cat from dog, rock from pop music, person from vehicle…) but can’t write formal rules for the classification in your favorite programming language. Instead, supervised ML algorithms “learn” to distinguish different types of objects from labels provided by the developer.

So we have 8,000 news pieces with labels: each whole text is marked as containing emotional manipulation, flawed argumentation, both, or neither. The labels were made by trusted news feed editors and our team and were used to train the manipulative news classifier.

The first prototypes of the classifier, from those built on random forests (a simple ML algorithm) to bare LSTM models (more sophisticated ones), weren’t accurate enough. The reason was the lack of training data: a few tens of thousands of labeled news items might have helped. Labeling is tiresome and expensive, and the deadline was approaching. But the revolution in natural language processing saved us.

In computer vision, people had already been using transfer learning for a few years. The idea is to take a trained model, or to train your own on abundant data for a different task, and then use it for the task you actually need. For example, a big tech team builds a powerful classifier that detects various objects in images, and we just slightly adapt it to classify breeds of cats. Some patterns in the data have already been learned and can be reused on a different dataset, just as it’s easier to learn JavaScript if you already code in Python than to learn it as your first programming language. In NLP, however, this approach became popular only in 2018, thanks to the developers of the GPT and ULMFiT models. Both are based on a language model, one that has learned to predict the next item (token, letter, n-gram, or however we split the text) in a sequence. To train it we need just a lot of plain text; it can be Wikipedia, and we had a lot of news as well. Having learned the language itself, the model needs less data to learn a classification task in that language. It worked.
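For the curious, here is a condensed sketch of the ULMFiT workflow in the fastai v1 API that was current at the time. Dataframe and column names are hypothetical, and the training schedule is illustrative rather than the one we actually used:

```python
# Condensed ULMFiT workflow (fastai v1 API). df_lm holds plain news texts,
# df_clas the labeled pieces; all names and hyperparameters are illustrative.
from fastai.text import (AWD_LSTM, TextClasDataBunch, TextLMDataBunch,
                         language_model_learner, text_classifier_learner)

# 1. Train a language model on plain text (next-token prediction, no labels).
data_lm = TextLMDataBunch.from_df(".", train_df=df_lm, valid_df=df_lm_valid,
                                  text_cols="text")
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3,
                            pretrained=False)  # no off-the-shelf weights assumed here
lm.fit_one_cycle(10, 1e-2)
lm.save_encoder("news_encoder")

# 2. Train the classifier on labeled data, reusing the language model encoder.
data_clas = TextClasDataBunch.from_df(".", train_df=df_clas, valid_df=df_clas_valid,
                                      text_cols="text", label_cols="manipulative",
                                      vocab=data_lm.vocab, bs=32)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder("news_encoder")
clf.fit_one_cycle(3, 1e-2)
```

The key point is the two stages: the encoder learns the language from unlabeled text first, so the classifier on top of it needs far fewer labeled examples.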

List of manipulative sites and their audience, as of November 2018

This trained classifier with the language model as a backbone processes each relevant article in our dataset so that we can find manipulative ones. In “We’ve Got Bad News” we used ≈2 million automatically classified pieces to compile the list of manipulative sites.

The Agenda

In addition to knowing who writes manipulatively, we want to know what the manipulations are about. So Texty started to track topics related to Russian disinformation in manipulative news. The challenges were:

  1. We don’t know in advance what they will write about, so we can’t make labels for a classifier (bye-bye, ULMFiT)
  2. We want to track topics over time, so we need to be able to add new topics to the model
  3. Our topics must be descriptive and interesting: it makes no sense to find overly general topics such as “politics” or “international”, yet they have to be broad enough to track over time.

This is why we used unfashionable topic modeling with NMF (almost the same in usage as LDA, but faster). Lemmatized manipulative news is used to fit the model, which splits all the input into N clusters (we now also predict topics for Ukrainian and Russian mainstream news as a reference) with no need for supervision, that is, labels. We found the number of clusters experimentally, which is yet another inconvenience of NMF. We publish monitorings of disinfo topics every week, and every week we fit a new NMF model to cluster the current week’s news. The rest of the work is left to humans.

The NMF output consists of two tables. In the first one, each row is a piece of news and each column is a cluster; the cell value indicates how strongly the text relates to that cluster-topic. The second table has clusters as rows and words as columns, so we can see the keywords for each cluster. Some clusters are pretty good: news about elections with keywords such as “election, vote, candidate, poll”. Others are artifacts or are based on an irrelevant dimension, such as news about arguments in social media. Some clusters are too narrow and should be merged with others, while some are too general.
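A minimal sketch with scikit-learn’s NMF shows where those two tables come from (the vectorizer settings and the number of topics are illustrative):

```python
# Weekly NMF topic model on lemmatized news; settings are illustrative.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

N_TOPICS = 50  # found experimentally

vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(lemmatized_texts)   # documents x words

nmf = NMF(n_components=N_TOPICS, random_state=0)
doc_topic = nmf.fit_transform(X)                 # table 1: documents x clusters
topic_word = nmf.components_                     # table 2: clusters x words

# Keywords for each cluster are its highest-weighted words.
words = vectorizer.get_feature_names_out()
for i, row in enumerate(topic_word):
    top = row.argsort()[-5:][::-1]
    print(i, [words[j] for j in top])
```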

Document-topic matrix of NMF. Max values (pink) indicate an item’s cluster

To tackle the problem of inaccurate topic detection, analysts manually review the raw NMF clusters. Merging topics is simple: we just treat two clusters as one. Overly general or irrelevant clusters are ignored. If an article from a removed cluster has a strong enough relationship with another cluster, it is assigned to that second-best cluster. As a last resort, we write rules that clean or slightly regroup the clusters based on keywords or on the strength of the news-cluster relation. Topics not related to Russian disinfo are ignored, and the corresponding news is removed from the monitoring.
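A toy version of those rules could look like this (the merge map, ignore set, and threshold are made up for illustration; the real rules are written by analysts each week):

```python
# Illustrative post-processing of raw NMF clusters; all values are made up.
import numpy as np

MERGE = {7: 3}        # treat cluster 7 as cluster 3
IGNORE = {12, 25}     # too general or irrelevant clusters
MIN_STRENGTH = 0.05   # minimal news-cluster relation to keep an assignment


def assign_cluster(doc_topic_row: np.ndarray):
    # Walk clusters from strongest to weakest, applying merges and skips.
    for cluster in np.argsort(doc_topic_row)[::-1]:
        cluster = int(cluster)
        if doc_topic_row[cluster] < MIN_STRENGTH:
            return None          # remaining clusters are even weaker
        merged = MERGE.get(cluster, cluster)
        if merged in IGNORE:
            continue             # fall back to the next strongest cluster
        return merged
    return None
```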

Weekly monitorings show how popular disinfo-sensitive topics are in news from different types of sites. To track topics over time we join the weekly results. Each weekly topic is assigned to a meta-topic, because weekly topics tend to focus on distinctive events. For example, weekly topics about the MH17 investigation are grouped into the meta-topic “MH17”, which in turn is part of the broader meta-topic “War”.

Top-level meta-topics on topic-radar.texty.org

We used clustering of the NMF clusters (yes, NMF of the NMF’s topic-word vectors) to find general groups of small topics. It helped us compose a list of narratives, the meta-topics, which we can track. Using NMF clusters from all the weekly models, we select the most likely meta-topic for each article. We do this to assign a meta-topic to as many articles as possible: if an article had no topic detected by its own week’s NMF, a model fitted on another week’s data can assign a topic to this “orphan” piece. Next, we group the number of articles in each meta-topic by day and by type of website. This is almost ready to be displayed on the dashboard.
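Roughly, the meta-topic step looks like this; the sketch assumes the weekly models share one word vocabulary so their topic-word vectors can be stacked, and the number of meta-topics is illustrative:

```python
# Grouping weekly topics into meta-topics: NMF over the stacked topic-word
# vectors of all weekly models. Assumes a shared vocabulary across weeks.
import numpy as np
from sklearn.decomposition import NMF

stacked = np.vstack(weekly_topic_word)             # every weekly topic as a row
meta_model = NMF(n_components=15, random_state=0)  # number of meta-topics is illustrative
topic_meta = meta_model.fit_transform(stacked)
meta_of_topic = topic_meta.argmax(axis=1)          # most likely meta-topic per weekly topic
# An article then inherits the meta-topic of its strongest weekly cluster,
# falling back to other weeks' models for "orphan" pieces.
```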

Why do we use the share of a topic in all news of the day or week for a particular group of websites? We don’t want to show the raw number of news items; instead, we want to show the attention paid to a topic. A simple percentage of a topic in the total amount of news from a given type of website (see the sketch after this list) removes the influence of:

  • The number of news items loaded, in case we add a new source to the database or fail to crawl news from some site
  • Disparities in the number of news items across different groups of sites
  • The weekly news production cycle (fewer items on weekends, more during the week) and holidays.
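The normalization itself is straightforward in pandas; the column names here are hypothetical:

```python
# Share of a meta-topic among all news published that day by a group of sites.
import pandas as pd

# df columns (hypothetical): date, site_group, meta_topic
counts = (df.groupby(["date", "site_group", "meta_topic"])
            .size().rename("n").reset_index())
totals = (df.groupby(["date", "site_group"])
            .size().rename("total").reset_index())
shares = counts.merge(totals, on=["date", "site_group"])
shares["share"] = shares["n"] / shares["total"]   # attention paid to the topic
```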

Фейкогриз (FakeCrunch)

Topic monitoring is more of a nerdish pleasure. When we released the list of manipulative sites, we received questions like “Oh, cool! Can you send people notifications or block those sites?”. Yes, we can now warn about manipulative news, but we aren’t going to block anything, for the sake of freedom of speech. As the manipulative news classifier became more accurate (our target metric is the false positive rate: the lower, the better), we built a browser add-on and a Telegram bot to check news and to let people report manipulative news themselves.

FakeCrunch at work. Warning and reporting interface

There are similar solutions targeting English-speaking audiences, but none for Ukraine yet. So FakeCrunch (the name sounds cute and funny in Ukrainian) has filled the need for an easy and straightforward media literacy tool in Ukraine. The only technical improvement it required was near-real-time classification of manipulative news to fill FakeCrunch’s database of known manipulations. We display a “suspicious” label if a piece of news was marked by the AI classifier with a high probability of manipulativeness. All labels from users are moderated to prevent abuse of the service.

Despite being relatively simple, FakeCrunch is important to our team as a summary of the work we do, aimed at a broad audience. It is neither research, like the list of manipulative news websites, nor analytics, like the topic monitoring; instead, it’s a simple tool that lets anyone protect themselves from manipulation.

What’s next

We are probably going through another revolution in natural language processing right now! Models are appearing that can learn from labels in one language and transfer that knowledge to texts in another language. What if we could use all the fact-check and hate speech labels in English and apply them to Russian and Ukrainian? What if we could use our labels in Russian to train a classifier for Ukrainian, for which we don’t have enough data yet? Should we pre-train classifiers using weak supervision together with transfer learning? These are the questions our team is asking right now.

We are also planning to update the topic dashboard to display more groups of sites and to detect meta-topics more accurately with supervised classification of the (now known) meta-topics. It is also time to feed the labels from FakeCrunch users back into the manipulative news classifier. And many other things.

Thank you for reading it all!
