Natural Language Inference for Fact-checking on Wikipedia

Machine Learning Lab UCU
Nov 11, 2021


A review of the thesis of Mykola Trokhymovych, successfully defended in June 2021 as the MSc degree requirement of the master's program in Data Science at UCU.

Mykola Trokhymovych presented this research at the CIKM’21 Applied Track. The accompanying research paper, “WikiCheck: An End-to-End Open Source Automatic Fact-Checking API Based on Wikipedia,” written by Mykola Trokhymovych and Diego Saez-Trumper, was published in the Proceedings of the 30th ACM International Conference on Information & Knowledge Management.

There is a tremendous amount of data available to everyone on the web — and this makes it such a great place! But every rainbow has its rain. With this heap of information come bias, propaganda, and fakes. Such harmful content is becoming harder to identify, which drives active research in fact-checking. The ultimate objective is to determine whether a given claim is true or false, and the growing demand has spurred rapid progress in developing tools and systems to automate this task.

Fact-checking can leverage existing research areas such as Natural Language Inference (NLI), where models aim to predict an entailment relation label for a given claim-hypothesis pair. The goal is to determine whether the truth of the hypothesis follows from the truth of the premise (claim), as in the following example:
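The example figure from the original post is not reproduced here; the triples below mirror the canonical SNLI samples shown at the credited page, written out as a small Python list purely for illustration:

```python
# Canonical SNLI-style examples: each premise (claim) and hypothesis pair
# carries one of three labels — entailment, neutral, or contradiction.
nli_examples = [
    {"premise": "A soccer game with multiple males playing.",
     "hypothesis": "Some men are playing a sport.",
     "label": "entailment"},
    {"premise": "An older and younger man smiling.",
     "hypothesis": "Two men are smiling and laughing at the cats playing on the floor.",
     "label": "neutral"},
    {"premise": "A man inspects the uniform of a figure in some East Asian country.",
     "hypothesis": "The man is sleeping.",
     "label": "contradiction"},
]
```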

Credit: http://nlpprogress.com/english/natural_language_inference.html

Even though automated fact-checking is currently developing very fast in academia, there is a gap between research achievements and applicability in real life. Possible approaches to fact-checking have been described as solutions to the FEVER shared task (Nie, Chen, and Bansal, 2018; Yoneda et al., 2018; Hanselowski et al., 2018). Although they present end-to-end approaches for fact verification, the efficiency and usability of such systems in real-world applications remain an open question, because one has to deal with the trade-off between speed and accuracy.

End-to-end fact-verification for Wikipedia

The main goal of Mykola Trokhymovych's research was to transform academic work on automated fact-checking into a practical, open-source, end-to-end application for fact verification based on open Wikipedia knowledge (see Figure 1). By "end-to-end" we mean a system that will:

  1. receive a sentence (claim);
  2. query Wikipedia looking for evidence to contrast with (hypothesis);
  3. apply the NLI model to that (claim, hypothesis) pair, returning whether the claim is SUPPORT or REFUTED, or whether there is NOT ENOUGH INFO about it on Wikipedia (a minimal sketch of this flow follows Figure 1).
Figure 1: Fact-Checking system flow
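The sketch below is our own illustration of this flow: the component functions are passed in as parameters and are not the actual WikiCheck implementation.

```python
from typing import Callable, Iterable

def fact_check(
    claim: str,
    search_wikipedia: Callable[[str], Iterable[str]],       # returns candidate article texts
    split_into_sentences: Callable[[str], Iterable[str]],    # splits an article into sentences
    nli_predict: Callable[[str, str], str],                  # -> "SUPPORT" / "REFUTED" / "NOT ENOUGH INFO"
) -> list:
    """Sketch of the end-to-end flow: search, sentence split, then NLI on each pair."""
    results = []
    for article in search_wikipedia(claim):                  # steps 1-2: find candidate evidence
        for hypothesis in split_into_sentences(article):
            label = nli_predict(claim, hypothesis)           # step 3: NLI on the (claim, hypothesis) pair
            results.append({"evidence": hypothesis, "label": label})
    return results
```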

Datasets

Recent research in the Natural Language Processing (NLP) field is firmly bound to data. SOTA results are achieved not only thanks to innovative models but also due to large datasets, careful filtering techniques, and an understanding of the data's nature.

In his work, Trokhymovych considered using multiple datasets for training and validation. They were divided into two main groups:

  1. General domain datasets:
  • Stanford Natural Language Inference (SNLI) — derived from image captions;
  • Multi-Genre Natural Language Inference (MNLI) — drawn from a wide range of styles, degrees of formality, and topics: conversations, reports, speeches, letters, fiction.
  2. Specific domain datasets:
  • Fact Extraction and Verification (FEVER) — manually generated and labeled claims, with related evidence given as links to a Wikipedia dump.

The general domain datasets contain samples consisting of three main parts — claim, hypothesis, and label (entailment, neutral, contradiction) — and are used as a standard benchmark for the NLI task.
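For reference, both corpora can be loaded through the Hugging Face datasets library (this snippet is our own illustration, not part of the thesis):

```python
from datasets import load_dataset

# SNLI and MultiNLI expose premise / hypothesis / label columns,
# with labels 0 = entailment, 1 = neutral, 2 = contradiction (-1 = unlabeled).
snli = load_dataset("snli")
mnli = load_dataset("multi_nli")

print(snli["train"][0])                         # {'premise': ..., 'hypothesis': ..., 'label': ...}
print(snli["train"].features["label"].names)    # ['entailment', 'neutral', 'contradiction']
```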

Having analyzed the length distributions of claims and hypotheses, the author found that claim length is distributed similarly across the three classes, while hypothesis length differs significantly. For instance, in SNLI, hypotheses in the entailment class are usually shorter than in the other classes. This could push the model to learn the length of a sentence instead of its meaning. The situation in MNLI is much better, as the text length distributions are more balanced.

Besides, it was discovered that there is a significant imbalance across labels for samples sharing the same hypothesis: frequently repeated hypotheses usually appear with only the entailment or only the contradiction label. This is not a natural situation, as the model will learn only the sense of the hypothesis instead of the desired relation between the premise and hypothesis. It is therefore essential to analyze how filtering such patterns out of the training data influences the models' validation results (both checks are sketched below).
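Both checks can be reproduced with a few lines of pandas. This is our own sketch, continuing from the loading snippet above; the exact heuristics used in the thesis may differ.

```python
import pandas as pd

df = snli["train"].to_pandas()  # columns: premise, hypothesis, label

# 1) Hypothesis length per class: if lengths differ strongly across labels,
#    a model can learn to predict the label from sentence length alone.
df["hyp_len"] = df["hypothesis"].str.split().str.len()
print(df.groupby("label")["hyp_len"].describe())

# 2) Label balance per repeated hypothesis: hypotheses that always carry the
#    same label let the model ignore the premise entirely.
per_hyp = df.groupby("hypothesis")["label"].agg(["count", "nunique"])
suspicious = per_hyp[(per_hyp["count"] > 1) & (per_hyp["nunique"] == 1)]
print(f"{len(suspicious)} repeated hypotheses appear with a single label only")
```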

The dataset from the specific domain group is used for domain adaptation. FEVER consists of a claim, an evidence link, and a label — SUPPORT (S), REFUTED (R), or NOT ENOUGH INFO (NEI). This dataset represents a different problem formulation: classifying the relation between two pieces of text and linking the given claim to the corresponding evidence in a knowledge base, a Wikipedia dump. This more general setting is very close to the real-life scenario, simulating what humans would do to fact-check a given claim. FEVER is the main validation dataset for current research in fact-checking, as it is of good quality, represents written speech, and is Wikipedia domain-specific.

Proposed approach

Mykola Trokhymovych proposed decomposing the fact-checking system architecture into two major parts — candidate selection (Model-level one) and NLI classification (Model-level two).

Figure 2: Automated fact-checking software architecture

The main idea of this approach is to reproduce the way humans perform fact-checking, where the initial input is a claim — the piece of text that should be checked. We start by searching trusted sources for related facts, i.e., evidence of the correctness or wrongness of the given claim. Having found hypothetical evidence, we compare these two pieces of text to decide whether they describe the same fact.

Finally, we conclude that the found hypothesis either supports the initial claim, refutes it, or does not relate to it.

Candidate selection model

Model-level one handles the information retrieval stage. It consists of query-enhancing logic on top of the Wikimedia API. At this stage, the whole joined FEVER dataset was used to validate the solution. The author used the claim column, which corresponds to the model's input, and the evidence column, which contains the ground-truth Wikipedia page link expected as output (see Figure 3). The Average Recall (AR) metric was used to validate and compare results.

Figure 3: Model one validation process and example
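As an illustration of this retrieval step, a basic call to the MediaWiki search API looks as follows. This is our own sketch, not the exact query-enhancement logic of the thesis.

```python
import requests

def search_wikipedia_titles(query: str, n: int = 3) -> list:
    """Return up to n English Wikipedia page titles for a search query,
    using the public MediaWiki search API."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": query,
            "srlimit": n,
            "format": "json",
        },
        timeout=10,
    )
    response.raise_for_status()
    return [hit["title"] for hit in response.json()["query"]["search"]]

print(search_wikipedia_titles("Selena began recording professionally in 1982."))
```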

NLI classification model

Model-level two is the Natural Language Inference model that performs ternary classification. Its aim is to identify the exact sentences in the preselected candidate articles that serve as evidence of the correctness or wrongness of a given claim.

Figure 4: Sentence-based Siamese classifier with BERT encoder

The general concept of the presented NLI model is a Siamese network that uses a BERT-like model as a trainable sentence encoder. The idea comes from Conneau et al., 2017, with the difference that Mykola Trokhymovych does not use the element-wise product of sentence vectors in the concatenation layer but only the original vectors and their absolute difference. In this way, an approach previously used by Reimers and Gurevych, 2019 for training sentence embeddings is presented as an efficient solution to the NLI problem. Compared to a word-based method, the sentence-based one allows the claim embedding to be computed only once and then reused for every hypothesis. Also, embeddings for hypotheses can be precalculated and batch-processed in advance. The presented architecture therefore makes it possible to cache intermediate embeddings and reuse them for online prediction.
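A minimal PyTorch sketch of such a Siamese classifier is given below. It is our own reconstruction: the pooling strategy (mean pooling) and all hyperparameters are assumptions, not the exact configuration from the thesis.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SiameseNLI(nn.Module):
    """Siamese NLI classifier: a shared BERT-like encoder produces sentence
    embeddings u (claim) and v (hypothesis); the classifier sees [u, v, |u - v|]."""

    def __init__(self, model_name: str = "bert-base-uncased", n_classes: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(3 * hidden, n_classes)

    def embed(self, enc):
        # Mean-pool token embeddings, ignoring padding (pooling choice is an assumption).
        out = self.encoder(**enc).last_hidden_state             # (batch, seq_len, hidden)
        mask = enc["attention_mask"].unsqueeze(-1).float()
        return (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    def forward(self, claim_enc, hyp_enc):
        u, v = self.embed(claim_enc), self.embed(hyp_enc)        # embeddings can be cached and reused
        return self.classifier(torch.cat([u, v, torch.abs(u - v)], dim=-1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SiameseNLI()
claim = tokenizer(["Selena began recording professionally in 1982."],
                  return_tensors="pt", padding=True, truncation=True)
hypothesis = tokenizer(["Selena released her first album in 1982."],
                       return_tensors="pt", padding=True, truncation=True)
logits = model(claim, hypothesis)   # shape (1, 3): entailment / neutral / contradiction scores
```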

Experiments

Improving the performance of the search

The goal of Model-level one is to select candidates for further analysis. It is a crucial building block of the final solution: without well-picked candidates, we will not get the desired results. Therefore, to improve search performance, the author made the following modifications:

  • perform named entity recognition (NER) on a given claim;
  • use two strategies for the obtained named entities: NER_merged — one extra query with all named entities joined, and NER_separate — an additional query for each entity found in the claim;
  • increase N — the number of candidates to extract for each query.

For the experiments, the out-of-the-box models "en_core_web_sm" and "en_core_web_trf" from the spaCy framework and the "ner-fast" model from Flair were used.
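For example, extracting entities with these tools and turning them into the two query strategies could look like this (our own sketch):

```python
import spacy
from flair.data import Sentence
from flair.models import SequenceTagger

claim = "Selena began recording professionally in 1982."

# spaCy out-of-the-box model
nlp = spacy.load("en_core_web_sm")
spacy_entities = [ent.text for ent in nlp(claim).ents]

# Flair "ner-fast" model
tagger = SequenceTagger.load("ner-fast")
sentence = Sentence(claim)
tagger.predict(sentence)
flair_entities = [span.text for span in sentence.get_spans("ner")]

# NER_separate: one additional query per entity; NER_merged: one query joining all entities.
separate_queries = [claim] + flair_entities
merged_query = " ".join(flair_entities)
```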

The metrics used to optimize search performance were: time — seconds per 1,000 queries; the primary AR metric; and N_returned — the average number of candidates returned by Model-level one.

It was found that using NER models dramatically improved candidate selection, while increasing N did not have a significant impact. As for the strategy, NER_separate gives better results than NER_merged. Also, a large N at the first stage can be harmful to the second stage, where the system must compare each sentence of the picked articles with the given claim. Therefore, for further experiments and API development, the "Flair ner-fast NER_separate N=3" configuration was used. The main reason for this choice is a relatively high AR (0.879) with the lowest number of candidates returned (6.27): the most time-consuming part of the system is the sentence-embedding calculation used in the NLI model, and the number of articles returned by Model-level one directly influences the time spent on embedding calculation.

Building sentence-based NLI model

The study under review compared the effect of three different pretrained language models:

  • BERT base (uncased): uncased means the model does not distinguish between "nli" and "NLI";
  • BART base: uses a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT);
  • ALBERT base: one of the top-performing models based on the GLUE score.

The initial experiment was to reproduce the results of the SOTA models and measure their efficiency. For a general overview, the author picked both a word-based and a sentence-based model — SemBERT and HBMP, respectively — since they have top-performing results and official repositories with code. The models were evaluated on 9,824 samples from the SNLI test set. The results are presented in Table 1:

Table 1: Comparison with SOTA

As we can see, the word-based model SemBERT has the best accuracy. However, it is noticeably slower at inference than sentence-based models, as it does not allow caching and reusing intermediate results. Besides, it was found that the custom architecture with the BART language model gives comparable and even better accuracy than the existing sentence-based solution HBMP.

For the next part of the experiment, the author used unsupervised fine-tuning of the language models, which can potentially improve the domain-specific model and does not require annotated data. For that, the BERT base (uncased) and BART base models were fine-tuned on the WikiText dataset, following the experimental setup of Devlin et al., 2018 (the masking recipe is sketched after the list):

  • select 15% of tokens at random;
  • replace 80% of the selected tokens with the [MASK] special token;
  • switch 10% to another random token, and leave the remaining 10% unchanged;
  • train with the masked language modeling objective only, for one epoch.
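This is the standard BERT masking recipe; with the Hugging Face transformers library it can be reproduced roughly as follows (a sketch; the standard data collator implements exactly this 15% / 80-10-10 scheme):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# mlm_probability=0.15 selects 15% of tokens; of those, 80% become [MASK],
# 10% are replaced by a random token, and 10% are left unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer(["Selena began recording professionally in 1982."], truncation=True)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
print(batch["input_ids"])   # some positions replaced with [MASK] or random ids
print(batch["labels"])      # -100 everywhere except the selected positions
```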

Domain adaptation

This experiment aims to train the FEVER-specific NLI model that is the primary building block of the fact-checking system for Wikipedia. To do that, a set of data filtering techniques was applied (sketched after the list):

  • cleaned tags at the end of a sentence, separated by a tabulation symbol (e.g., "Selena began recording professionally in 1982.\tSelena\tSelena(film)" includes the tags Selena and Selena(film));
  • balanced the distribution of S/R classes among identical hypothesis sentences;
  • filtered out absolute duplicates by the claim and hypothesis fields;
  • undersampled NEI-class samples down to the number of samples in the primary classes.
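A sketch of these filtering steps is given below. It is our own illustration, assuming a dataframe with claim, hypothesis, and label columns; the per-hypothesis S/R balancing step is omitted for brevity.

```python
import pandas as pd

def clean_fever_hypothesis(text: str) -> str:
    # Drop page tags appended after the tabulation symbol, e.g.
    # "... in 1982.\tSelena\tSelena(film)" -> "... in 1982."
    return text.split("\t")[0]

def filter_fever(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["hypothesis"] = df["hypothesis"].map(clean_fever_hypothesis)
    # Remove absolute duplicates over (claim, hypothesis).
    df = df.drop_duplicates(subset=["claim", "hypothesis"])
    # Undersample NEI down to the size of the smaller primary class (S or R).
    n_primary = df[df["label"].isin(["S", "R"])]["label"].value_counts().min()
    nei = df[df["label"] == "NEI"]
    nei = nei.sample(n=min(n_primary, len(nei)), random_state=0)
    return pd.concat([df[df["label"] != "NEI"], nei], ignore_index=True)
```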

Experiments showed that the best performing model is the fine-tuned BART base trained on clean text, which achieved 74.82% accuracy.

Complete fact-checking system evaluation

To measure application accuracy, the author used the official FEVER validation tool, which allows the provided solution to be compared with FEVER competitors.

It provides several validation metrics: the FEVER score, Evidence F1, and Accuracy. The FEVER score counts a claim as resolved only if the predicted label is correct and, for SUPPORT/REFUTED claims, the returned evidence covers a complete gold evidence set (a simplified sketch follows).
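The function below is our own simplified re-implementation of that logic for a single claim, shown only to make the metric concrete; it is not the official scorer.

```python
def fever_correct(pred_label, pred_evidence, gold_label, gold_evidence_sets) -> bool:
    """A claim counts toward the FEVER score only if the label is right and, for
    SUPPORT/REFUTED claims, the predicted evidence covers a full gold evidence set."""
    if pred_label != gold_label:
        return False
    if gold_label == "NOT ENOUGH INFO":
        return True
    predicted = set(pred_evidence)  # e.g. {(page_title, sentence_id), ...}
    return any(set(gold_set) <= predicted for gold_set in gold_evidence_sets)
```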

The pipeline for FEVER validation differs from the presented fact-checking system flow. The changes are shown in the figure below and marked in red:

Figure 5: Fact-Checking system flow for FEVER validation

With these modifications, the system does not query Wikipedia texts through the MediaWiki API but takes them from the 2017 Wikipedia dump provided along with the FEVER dataset. As for the aggregation block, it produces the final label and a set of evidence that can be contrasted with the solutions submitted to the FEVER competition. Two stacked CatBoost models trained on a sample of the NLI model's probability outputs were used for that block (an illustrative sketch follows).
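The sketch below shows only the general idea of training a CatBoost classifier on features derived from NLI probability outputs; the actual feature set and the two-model stacking of the thesis are not reproduced, and all values are made-up placeholders.

```python
from catboost import CatBoostClassifier

# Each row summarises the NLI probabilities over a claim's candidate sentences, e.g.
# [max_p_support, mean_p_support, max_p_refute, mean_p_refute, max_p_nei, n_candidates].
X = [
    [0.91, 0.40, 0.05, 0.03, 0.30, 12],
    [0.10, 0.05, 0.88, 0.35, 0.25, 9],
    [0.20, 0.10, 0.15, 0.08, 0.85, 7],
]
y = ["SUPPORT", "REFUTED", "NOT ENOUGH INFO"]

clf = CatBoostClassifier(iterations=200, depth=4, verbose=False)
clf.fit(X, y)
print(clf.predict([[0.80, 0.35, 0.10, 0.05, 0.40, 10]]))
```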

Table 2: Complete fact-checking system FEVER accuracy

As we can see, the presented fact verification system, WikiCheck, reaches the top-10 results of the FEVER competition. For this comparison, the final model with a BART base NLI classifier was used.

Speaking of efficiency, the approximate time needed for the fact-checking process is about six seconds. The most time-consuming parts are text extraction through the Wikimedia API and hypothesis embedding calculation: loading the text takes about 40% of the total application time, and calculating embeddings for the hypotheses takes about 50% (see more details in the Trokhymovych paper). Therefore, future work on system efficiency should focus on improving text extraction and embedding calculation.

Results

As a result of this research, Mykola Trokhymovych introduced a new fact-checking system, WikiCheck, that receives a sentence (claim), queries Wikipedia looking for evidence, and then applies the NLI model to each (claim, hypothesis) pair, returning the relation of the pair: SUPPORT, REFUTED, or NOT ENOUGH INFO. The presented system shows results comparable to SOTA, ranking among the top-10 solutions of the FEVER competition. Notably, it can run on CPU-only, low-memory devices, which makes the system more widely applicable.

Conclusions

To sum up, the main conclusions that can be drawn from the research described in this blog post are as follows:

  • the author found that the FEVER dataset has limitations and annotation artifacts that can influence the NLI model's performance, and proposed a heuristic filtering technique to increase the model's accuracy;
  • using NER models for search increases the quality of results when there are named entities in the queries, as in the FEVER dataset;
  • unsupervised fine-tuning of masked language models on domain-specific texts further increases accuracy on the downstream NLI task;
  • the author provided reasoning for why sentence-based NLI models are faster than word-based ones in real-world applications;
  • he proposed a possible architecture for a sentence-based NLI model that shows results comparable to SOTA.

In addition, Mykola Trokhymovych presented a new open-source fact-checking system, WikiCheck, that ranks among the top-10 solutions of the FEVER competition.

References

  • Mykola Trokhymovych (2021). “Automated Fact-checking for Wikipedia.” Ukrainian Catholic University, Department of Computer Science, 52 pages.
  • Mykola Trokhymovych and Diego Saez-Trumper (2021). “WikiCheck: An End-to-end Open Source Automatic Fact-Checking API based on Wikipedia.” In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM ‘21), pp. 4155–4164. Association for Computing Machinery.
  • Yixin Nie, Haonan Chen, and Mohit Bansal (2018). “Combining Fact Extraction and Verification with Neural Semantic Matching Networks.” arXiv: 1811.07039 [cs.CL].
  • Yoneda et al. (2018). “UCL Machine Reading Group: Four Factor Framework For Fact Finding (HexaF).” In Proceedings of the First Workshop on Fact Extraction and Verification (FEVER), pp. 97–102. Association for Computational Linguistics.
  • Hanselowski et al. (2018). “UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification.” In Proceedings of the First Workshop on Fact Extraction and Verification (FEVER), pp. 103–108. Association for Computational Linguistics.
  • McDowell, Zachary and Matthew Vetter (2020). “It Takes a Village to Combat a Fake News Army: Wikipedia’s Community and Policies for Information Literacy.” In Social Media + Society.
  • Conneau et al. (2017). “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680. Association for Computational Linguistics.
  • Nils Reimers and Iryna Gurevych (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computational Linguistics.
  • Devlin et al. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics.
