An Overview of Textual and Visual Content to Detect Fake News

Eva Almansa
Published in The Startup · 16 min read · Nov 26, 2020
[Image: Designed by Freepik]

Abstract

This work presents a review of the state of the art in fake news detection using textual and visual content in the context of online news media such as social media, news blogs, and online newspapers. Most studies in the literature propose solutions that use either the textual content or the visual content, but not both. In addition, authors tend to use only one kind of news media, as opposed to a combination of them. Most of the research focuses on supervised techniques, especially deep learning models, which do not perform well outside of the scenarios where they were trained. A possible solution lies in the use of semi-supervised and unsupervised models, which are relatively scarce in the literature.

Table of contents

  1. Introduction
  2. Critical revision of the state-of-the-art
  3. Description of a methodology
  4. Future Project
  5. References

1. Introduction

Fake news is not a recent problem, but it is changing. Nowadays, communication between people has expanded significantly, and as a result we have seen the appearance of novel ways to spread news, such as social media, news blogs, and online newspapers. These media facilitate the distribution of real-time information, but they also make it easy for anyone to publish fake news on the internet. Several concepts are close to fake news: rumors, satire news, fake reviews, misinformation, fake advertisements, conspiracy theories, false statements by politicians, etc., which affect every aspect of people’s lives. (Zhang and Ghorbani 2020) summarizes the basic characteristics of fake news as velocity, volume, and variety, and proposes a definition of fake news: “fake news refers to all kinds of false stories or news that are mainly published and distributed on the Internet, in order to purposely mislead, befool or lure readers for financial, political or other gains.”

Considering that the dissemination of digital information grows exponentially, artificial intelligence plays a fundamental role, and more specifically natural language processing (NLP) and machine learning approaches. Fake news detection is the task of evaluating the truthfulness of a certain piece of news. There are other similar domains, like botnet detection, malicious or fake account detection, unknown news creator detection, sentiment analysis (Chatterjee et al. 2019), deception detection, stance detection, controversy, news similarity analysis, and credibility. All these problems are related to data mining, social media, natural language processing, and false information detection (Zhang and Ghorbani 2020).

In the context of fake news detection and NLP, the survey (Saquete et al. 2020) proposes different subtasks to address the problem: deception detection, stance detection, controversy, polarity, automated fact checking, clickbait detection, and credibility. More specifically, the credibility of online information depends on the reliability of the media, the information source, and the message. In addition, in social sciences, health (Ma and Atkin 2017), psychology, and marketing (Chen and Chen 2019), media credibility is studied in three types of media: online news, news blogs, and social networks. For these kinds of media, the research faces challenges with evaluation metrics and datasets: the existing metrics were developed to evaluate the reliability of traditional news media, and studies tend to use different datasets, making the comparison between models more difficult (Saquete et al. 2020). Another problem to consider is that, in the literature, researchers usually focus on a single language, as explained by (Faustini and Covões 2020). In that work, the authors detect fake news using five datasets in languages from different groups, such as Latin, Germanic, and Slavic, using supervised learning models. However, labeling fake news datasets is a hard task which normally requires a large human effort. Besides, the quality of that labeling affects the performance of a supervised learning model. Thus, an unsupervised learning model is deemed more practical and feasible for solving real-world problems (Zhang and Ghorbani 2020).

Fake news may not only be textual; in fact, deceptive visual communication has been a problem for a long time. (Bannatyne et al. 2019) present a brief historical review of both textual and visual information, and conclude that, over time, technology will allow us to modify reality very easily. Actually, that is already happening, since we have sophisticated software at our disposal that easily modifies images, videos, and even audio with high quality. Lately, we have seen the appearance of deepfakes, the result of using powerful machine learning and artificial intelligence techniques to manipulate or generate visual and audio content with a high potential for deception (Kietzmann et al. 2020). The term deepfakes started on the Reddit platform, where a user calling himself deepfakes (deep learning + fake) published videos of adult content in which the original actors had been replaced by celebrities. These sophisticated techniques and the wide availability of data have facilitated the manipulation of images and videos. As a result, open software and mobile applications such as ZAO and FaceApp allow users to create fake images and videos in an easy way (Tolosana et al. 2020). This problem is especially worrisome in the news context, since there are enough tools to spread fake content easily and quickly. For instance, (Kietzmann et al. 2020) mention, among others, the viral video called You Won’t Believe What Obama Says In This Video!, whose audiovisual content was manipulated very realistically. (Vaccari and Chadwick 2020) use this video to understand the social impact on concepts like deception, uncertainty, and trust.

“You Won’t Believe What Obama Says In This Video!” by Youtube

In the image and forensic fields, fakes are traditionally detected by examining intrinsic changes (optical lens, colour filter array or interpolation, and compression) and external changes (copy-paste or copy-move of different elements of the images, and reducing the frame rate in a video). This is of special importance in the era we live in, as most fake media content is shared on social networks, whose platforms automatically modify the original image/video (Tolosana et al. 2020).

2. Critical revision of the state-of-the-art

To begin with, the first question that arises is how fake news can be identified, and what types of indicators should be analyzed. There are different indicators related to veracity in the big data context, such as truth, trust, uncertainty, credibility, reliability, noise, anomalies, imprecision, and quality (Lozano et al. 2020). According to (Zhang and Ghorbani 2020), a possible indicator for detecting fake news is credibility. The credibility concept is defined in (Asr and Taboada 2019) and (Du et al. 2019), and (Medina-Rodríguez et al. 2020) explain the extraction of different credibility features. The proposals to tackle this type of problem are diverse. (Yaqub et al. 2020) consider three interrelated concepts pertaining to user behavior: confirmation bias, perceived source credibility, and desire for socializing. Their objective is to find out whether users would continue to share messages with false content when they know that the information is false. To assess whether this may be a possible way to mitigate the spread of fake news, they label texts with different colors, red for false content and green otherwise, and conclude that the effectiveness depends on the characteristics of the user. Another way to address the credibility issue is presented in (Shu et al. 2019). They propose a tri-relationship model that simultaneously captures the interrelationships among publishers, news pieces, and users to detect fake news in the social media context. To prove their hypothesis, they use two platforms (BuzzFeed, PolitiFact) to obtain labeled datasets, and thus they use supervised learning models. (Aker et al. 2019) present a study of the correlation between features. (Saquete et al. 2020) presents an extensive survey of credibility in social media. They conclude that some credibility features, such as the attractiveness, intelligence, and transparency of a source of information, are difficult to obtain in an automatic way.
Nevertheless, there are credibility features that can be included in an automatic system, such as number of mentions, existence of hypertext references, use of multimedia content, and identification of message length. Also, other features could be obtained using NLP tools, such as textual coherence and objectivity of the information.
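As a rough illustration, the automatically extractable credibility features mentioned above (number of mentions, hypertext references, multimedia content, message length) can be computed with a few regular expressions. This is a minimal sketch; the feature names, patterns, and example message are my own illustrative choices, not those of any cited work.

```python
import re

def credibility_features(message: str) -> dict:
    """Compute simple, automatically extractable credibility features.
    Patterns are illustrative assumptions, not a fixed standard."""
    return {
        # @-mentions of other accounts
        "num_mentions": len(re.findall(r"@\w+", message)),
        # presence of hypertext references
        "has_hyperlinks": bool(re.search(r"https?://\S+", message)),
        # presence of references to multimedia content
        "has_multimedia": bool(re.search(r"\.(jpg|png|gif|mp4)\b", message, re.I)),
        # message length in words
        "message_length": len(message.split()),
    }

# Hypothetical social media post
features = credibility_features(
    "Breaking! @newsbot shares photo.jpg here: https://example.com/story"
)
```

Features like these would form part of the input vector of a detection model; NLP-derived features such as textual coherence and objectivity require heavier tooling.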

2.1. Textual information

There are different proposals using NLP techniques. (Bauskar et al. 2019) use NLP and labeled datasets (BuzzFeed, PolitiFact) to extract content-based and social features, and they also analyse different steps for data preprocessing. (Elhadad et al. 2019) present a method to extract features that avoids redundancy in the textual content. (Hamdi et al. 2020) discuss several approaches: user-based features, graph-based features, NLP features, and influence-based features. The survey (Oshikawa et al. 2018) mentions the datasets that have been used so far and summarizes the methods that are usually applied. The datasets are divided into three categories: claims (one or a few sentences to validate), entire articles (many sentences related to each other), and Social Networking Services (similar to claims, but also including non-text data). The majority of methods used in the literature are supervised, while semi-supervised and unsupervised ones are less common. The authors present the methods in two steps: first preprocessing, and second machine learning models. NLP is usually used in preprocessing, but also in rhetorical structure theory (RST) and recognizing textual entailment (RTE). RST analyses the coherence of a story, but it can only be used if the dataset contains entire articles. RTE, the task of recognizing relationships between sentences, is frequently used to gather and utilize evidence. In the second step, the machine learning models, the most popular methods are Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM), and Convolutional Neural Networks (CNNs).
The authors provide several recommendations about fake news detection: news articles or claims might be a mixture of true and false statements; there are studies showing that satire can be distinguished well from both real and fake news, so we have to be careful about the types of fake news that are collected; it is easier to obtain claims datasets than entire-article datasets; and they suggest as future work that hand-crafted features, also using non-text data, combined with an RNN would be the best type of model, since it presents better accuracy. (Deepak et al. 2020) go one step further and add secondary features related to who published the news, such as the domain name, author details, etc. To do this, they use search engines and online data mining with the news’ keywords, and add the result to the news information to be evaluated. They use a labeled dataset (George McIntire) and an LSTM in their experiments. This approach improves on the one without online data mining.
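The preprocessing step described above typically starts with lowercasing, tokenization, and stopword removal before any RNN or CNN sees the text. A minimal pure-Python sketch of such a step follows; the tiny stopword list and toy corpus are illustrative assumptions, and real pipelines use larger lists and proper tokenizers.

```python
import re
from collections import Counter

# A minimal stopword list for illustration; real pipelines use larger lists.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "this"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize, and remove stopwords:
    a typical first step before feeding text to a classifier."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(texts: list[str]) -> Counter:
    """Aggregate token counts across a corpus, a simple feature basis
    that a downstream model could consume."""
    counts = Counter()
    for text in texts:
        counts.update(preprocess(text))
    return counts

# Toy two-document corpus
corpus = ["This claim is false.", "The claim in this article is true."]
counts = bag_of_words(corpus)
```

The resulting counts (or a vocabulary index built from them) would then be mapped to the embeddings that an LSTM or CNN consumes.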

Most labeled fake news datasets are meant for classification (Oshikawa et al. 2018). One of the conditions for fake news classifiers to achieve good performance is to have sufficient labeled data. However, obtaining reliable labels requires a lot of time and human labor. Therefore, semi-supervised (Chen and Freire 2020) and unsupervised methods (Medina-Rodríguez et al. 2020), (Yang et al. 2019) have been proposed. This is a big challenge, because there is still little research in this direction. (Zhang and Ghorbani 2020) propose as future work the use of four types of unsupervised learning models for fake news detection: cluster analysis, outlier analysis, semantic similarity analysis, and unsupervised news embeddings. They also mention a real-time visualization technique for fake news detection, and early prediction and intervention for online fake news (Yang et al. 2020). The review (Minaee et al. 2020) covers different supervised and unsupervised learning models used on various text classification tasks, including sentiment analysis, news categorization, topic classification, question answering, and natural language inference. It also provides an overview of popular text classification datasets.
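Of the four unsupervised directions listed by (Zhang and Ghorbani 2020), outlier analysis is perhaps the easiest to illustrate: articles whose feature values deviate strongly from the rest of the corpus are flagged for review, with no labels required. The toy z-score sketch below is my own illustration, not the cited authors' method; the per-article feature and the threshold are assumptions.

```python
from statistics import mean, stdev

def zscore_outliers(values: list[float], threshold: float = 2.0) -> list[int]:
    """Return the indices of values whose z-score exceeds the threshold.
    A toy stand-in for unsupervised outlier analysis: articles whose
    feature values deviate strongly from the corpus get flagged."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical per-article feature, e.g. exclamation marks per 100 words.
exclamation_rates = [0.5, 0.7, 0.4, 0.6, 9.0, 0.5]
flagged = zscore_outliers(exclamation_rates)
```

A single feature and a hard threshold are far too crude for real detection, but the same idea scales to multivariate feature vectors with methods like isolation forests or one-class SVMs.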

2.2. Visual information

Detection of deepfakes has also been investigated. The survey (Tolosana et al. 2020) presents the different concepts related to facial manipulation using large-scale public data, along with deep learning techniques such as Autoencoders (AE) and Generative Adversarial Networks (GAN) that eliminate manual editing. They cover four types of manipulation: whole face synthesis, identity swap, attribute manipulation, and expression swap. Another kind of face manipulation is face morphing, but it is less popular in the research community than the other four. In summary, they present a complete survey of the techniques recently used in deepfakes: the generation techniques, where the most used methods are AE and GAN; the detection techniques, especially LSTM and CNN; and the datasets. They conclude that it is easy to obtain a good fake detector under controlled scenarios, i.e. when the fake detector is evaluated under the same conditions it was trained for. For that reason, they suggest further research on the generalizability of fake detectors under unseen conditions, and mention other researchers who have studied those aspects. Besides, they suggest that it would be valuable to improve the detector using fusion techniques, that is, combining other sources such as the text and the audio that accompany the video (Hossain et al. 2019), (Verdoliva 2020), (Cao et al. 2020). The survey (Verdoliva 2020) explains different kinds of unseen conditions for images that have already been studied in the forensic context. The Defense Advanced Research Projects Agency (DARPA) of the U.S. Department of Defense launched the large-scale Media Forensics initiative (MediFor), with references in terms of methods and datasets for digital media verification. In short, digital media verification should seek physical integrity, digital integrity, and semantic integrity.
They show an exhaustive survey of the recent techniques used to detect these kinds of features, which use deep learning techniques such as LSTM, CNN, and GAN. Some studies suggest using GAN techniques to invert the generation process. In addition, they mention that one-class methods and searching for local anomalies/manipulations seem to perform reasonably well in challenging real-world conditions. (Agarwal et al. 2020) show an example of forensic techniques for detecting a specific class of deepfake videos by analysing the mouth movements associated with different phonemes. The review (Cao et al. 2020) presents visual features from the visual content of fake news, categorized into four types: forensic features, semantic features, statistical features, and context features. Besides, they mention a critical challenge, the explainability of fake news detection, i.e., why a model determines a particular piece of news to be fake. They suggest that fact-checking approaches could offer a solution to this challenge. In this line, (Shu et al. 2019) present an explainable fake news detection method. The authors suggest that the explanation can provide new insights and knowledge originally hidden, and that extracting features from noisy information can help to detect fake news. Their proposal uses both the intrinsic explainability of news sentences and user comments to improve fake news detection performance.
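The phoneme-based idea of (Agarwal et al. 2020) can be sketched at a very high level: certain phonemes (such as M, B, and P) require a closed mouth, so moments where the audio track demands a closed mouth while the lips are visibly open are suspicious. The data structure below is hypothetical and hand-labeled for illustration; the real system relies on automatic audio alignment and lip tracking.

```python
# Phonemes whose articulation requires a closed mouth (the premise
# behind phoneme-viseme mismatch detection).
CLOSED_MOUTH_PHONEMES = {"M", "B", "P"}

def phoneme_viseme_mismatches(frames: list[tuple[str, bool]]) -> int:
    """frames: list of (phoneme, mouth_open) pairs, assumed to come from
    audio alignment and lip tracking. Count the frames where the audio
    says the mouth should be closed but the video shows it open."""
    return sum(
        1 for phoneme, mouth_open in frames
        if phoneme in CLOSED_MOUTH_PHONEMES and mouth_open
    )

# Hypothetical aligned frames: ("M", True) means the audio contains an
# M sound while the lips are open in the video -- a mismatch.
frames = [("M", True), ("A", True), ("P", False), ("B", True)]
mismatch_count = phoneme_viseme_mismatches(frames)
```

A video accumulating many such mismatches would be flagged as a likely lip-sync deepfake.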

2.3. Both textual and visual information

The use of the relationship between text and image is presented in (Zhou et al. 2020), (Otto et al. 2019), and (Singhal et al. 2019). (Zhou et al. 2020) propose a fake news detection method which extracts both textual and visual information and evaluates their relationship using similarity. As future work, they propose to also consider network information, video, textual and visual similarity, and the relationship between pairwise news articles. In (Otto et al. 2019), the study builds on previous work in the fields of visual communication and information retrieval. The authors use a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation. They also provide a method to generate datasets automatically.
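The similarity idea in (Zhou et al. 2020) can be illustrated with a toy cosine-similarity check between a text embedding and an image embedding of the same article: low similarity suggests the image does not match the story, which is treated as a fake news signal. The embedding vectors and the 0.5 threshold below are illustrative assumptions; SAFE learns these representations jointly rather than comparing fixed vectors.

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors in a shared embedding space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def text_image_mismatch(text_vec: list[float], image_vec: list[float],
                        threshold: float = 0.5) -> bool:
    """Flag an article when its text and image embeddings disagree.
    The threshold is an illustrative assumption."""
    return cosine_similarity(text_vec, image_vec) < threshold

# Hypothetical pre-computed embeddings for one article.
text_embedding = [0.9, 0.1, 0.2]
image_embedding = [0.1, 0.9, 0.3]
suspicious = text_image_mismatch(text_embedding, image_embedding)
```

In a real system, both vectors would come from learned text and image encoders projected into a common space, and the decision threshold would be tuned on validation data.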

3. Description of a methodology

The first step is to obtain different public datasets that cover the three types of online news media. This kind of data has been used extensively, and it is presented and analysed in the surveys discussed above.

The second step consists in designing a method that extracts both textual and visual features related to credibility, such as those reviewed in the state-of-the-art section.

As for the third step, I will test different learning techniques, for both visual and textual content, to select the most promising ones.

4. Future Project

In summary, my proposal aims to obtain a method capable of detecting fake news using textual and visual content, and capable of analysing all kinds of online news media. I will focus on three objectives. First, I will select different public datasets that cover social media, news blogs, and online newspapers. Second, I will analyse the selected data to extract features, and evaluate options to extract more features from other information sources. Besides, I will analyse the possible relationships between the selected features. Third, once the datasets and features have been selected, I will choose different semi-supervised and unsupervised models to detect fake news from the data. To reach the three objectives, I will perform a review and select the most promising public datasets, features, and models. After that, I will statistically study the results obtained. After evaluating the results, I will focus on the best performing features and methods, and propose further development aimed at improving their usefulness.

References

  1. Zhang, Xichen, and Ali A. Ghorbani. 2020. “An Overview of Online Fake News: Characterization, Detection, and Discussion.” Information Processing & Management. https://doi.org/10.1016/j.ipm.2019.03.004.
  2. Chatterjee, A., Gupta, U., Chinnakotla, M. K., Srikanth, R., Galley, M., & Agrawal, P. (2019). Understanding Emotions in Text Using Deep Learning and Big Data. In Computers in Human Behavior (Vol. 93, pp. 309–317). https://doi.org/10.1016/j.chb.2018.12.029
  3. Saquete, Estela, David Tomás, Paloma Moreda, Patricio Martínez-Barco, and Manuel Palomar. 2020. “Fighting Post-Truth Using Natural Language Processing: A Review and Open Challenges.” Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2019.112943.
  4. Ma, Tao (jennifer), and David Atkin. 2017. “User Generated Content and Credibility Evaluation of Online Health Information: A Meta Analytic Study.” Telematics and Informatics. https://doi.org/10.1016/j.tele.2016.09.009.
  5. Chen, Mu-Yen, and Ting-Hsuan Chen. 2019. “Modeling Public Mood and Emotion: Blog and News Sentiment and Socio-Economic Phenomena.” Future Generation Computer Systems. https://doi.org/10.1016/j.future.2017.10.028.
  6. Bannatyne, Mark William Mckenzie, Agnieszka Katarzyna Piekarzewska, and Clinton Theodore Koch. 2019. “If You Could Believe Your Eyes: Images and Fake News.” 2019 23rd International Conference in Information Visualization — Part II. https://doi.org/10.1109/iv-2.2019.00034.
  7. Faustini, Pedro Henrique Arruda, and Thiago Ferreira Covões. 2020. “Fake News Detection in Multiple Platforms and Languages.” Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2020.113503.
  8. Kietzmann, Jan, Linda W. Lee, Ian P. McCarthy, and Tim C. Kietzmann. 2020. “Deepfakes: Trick or Treat?” Business Horizons. https://doi.org/10.1016/j.bushor.2019.11.006.
  9. Vaccari, Cristian, and Andrew Chadwick. 2020. “Deepfakes and Disinformation: Exploring the Impact of Synthetic Political Video on Deception, Uncertainty, and Trust in News.” Social Media Society. https://doi.org/10.1177/2056305120903408.
  10. Lozano, Marianela García, Joel Brynielsson, Ulrik Franke, Magnus Rosell, Edward Tjörnhammar, Stefan Varga, and Vladimir Vlassov. 2020. “Veracity Assessment of Online Data.” Decision Support Systems. https://doi.org/10.1016/j.dss.2019.113132.
  11. Asr, F. T., & Taboada, M. 2019. “Big Data and quality data for fake news and misinformation detection.” In Big Data & Society (Vol. 6, Issue 1, p. 205395171984331). https://doi.org/10.1177/2053951719843310
  12. Du, Y. Roselyn, Lingzi Zhu, and Benjamin KL Cheng. 2019. “Are Numbers Not Trusted in a “Post-Truth” Era? An Experiment on the Impact of Data on News Credibility.” Electronic News: 1931243119883839.
  13. Medina-Rodríguez, R., Talavera, A., Hernani-Merino, M., Lazo-Lazo, J., & Mazzon, J. A. 2020. “Global Brand Perception Based on Social Prestige, Credibility and Social Responsibility: A Clustering Approach.” In Information Management and Big Data (pp. 267–281). https://doi.org/10.1007/978-3-030-46140-9_25
  14. Yaqub, Waheeb, Otari Kakhidze, Morgan L. Brockman, Nasir Memon, and Sameer Patil. 2020. “Effects of Credibility Indicators on Social Media News Sharing Intent.” Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3313831.3376213.
  15. Shu, Kai, Suhang Wang, and Huan Liu. 2019. “Beyond News Contents.” Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. https://doi.org/10.1145/3289600.3290994.
  16. Bauskar, Shubham, Vijay Badole, Prajal Jain, and Meenu Chawla. 2019. “Natural Language Processing Based Hybrid Model for Detecting Fake News Using Content-Based Features and Social Features.” International Journal of Information Engineering and Electronic Business. https://doi.org/10.5815/ijieeb.2019.04.01.
  17. Elhadad, Mohamed K., Kin Fun Li, and Fayez Gebali. 2019. “A Novel Approach for Selecting Hybrid Features from Online News Textual Metadata for Fake News Detection.” In International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 914–925. Springer, Cham.
  18. Hamdi, Tarek, Hamda Slimi, Ibrahim Bounhas, and Yahya Slimani. 2020. “A Hybrid Approach for Fake News Detection in Twitter Based on User Features and Graph Embedding.” Distributed Computing and Internet Technology. https://doi.org/10.1007/978-3-030-36987-3_17.
  19. Oshikawa, R., J. Qian, and W. Y. Wang. 2018. “A survey on natural language processing for fake news detection”. arXiv preprint arXiv:1811.00770.
  20. Deepak, S., and Bhadrachalam Chitturi. 2020. “Deep Neural Approach to Fake-News Identification.” Procedia Computer Science. https://doi.org/10.1016/j.procs.2020.03.276.
  21. Yang Liu and Yi-Fang Brook Wu. 2020. “FNED: A Deep Network for Fake News Early Detection on Social Media”. ACM Trans. Inf. Syst. 38, 3, Article 25, 33 pages. https://doi.org/10.1145/3386253.
  22. Minaee, S., N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao. 2020. Deep learning based text classification: A comprehensive review. arXiv preprint arXiv:2004.03705.
  23. Chen, Zhouhan, and Juliana Freire. 2020. “Proactive Discovery of Fake News Domains from Real-Time Social Media Feeds.” Companion Proceedings of the Web Conference 2020. https://doi.org/10.1145/3366424.3385772.
  24. Yang, Shuo, Kai Shu, Suhang Wang, Renjie Gu, Fan Wu, and Huan Liu. 2019. “Unsupervised Fake News Detection on Social Media: A Generative Approach.” Proceedings of the AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v33i01.33015644.
  25. Tolosana, R., R. Vera-Rodriguez, J. Fierrez, A. Morales, and J. Ortega-Garcia. 2020. “DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection”. arXiv preprint arXiv:2001.00179.
  26. Aker, Ahmet, Vincentius Kevin, and Kalina Bontcheva. 2019. “Credibility and Transparency of News Sources: Data Collection and Feature Analysis.”
  27. Hossain, M. Shamim, and Ghulam Muhammad. 2019. “Emotion Recognition Using Deep Learning Approach from Audio–visual Emotional Big Data.” Information Fusion. https://doi.org/10.1016/j.inffus.2018.09.008.
  28. Verdoliva, L. 2020. “Media forensics and deepfakes: an overview”. arXiv preprint arXiv:2001.06564.
  29. Agarwal, S., H. Farid, O. Fried, and M. Agrawala. 2020. “Detecting Deep-Fake Videos from Phoneme-Viseme Mismatches”. In Proc. Conference on Computer Vision and Pattern Recognition Workshops.
  30. Cao, Juan, Peng Qi, Qiang Sheng, Tianyun Yang, Junbo Guo, and Jintao Li. 2020. “Exploring the Role of Visual Content in Fake News Detection.” arXiv preprint arXiv:2003.05096.
  31. Shu, Kai, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019. “defend: Explainable fake news detection.” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 395–405.
  32. Zhou, Xinyi, Jindi Wu, and Reza Zafarani. 2020. “SAFE: Similarity-Aware Multi-Modal Fake News Detection.” Advances in Knowledge Discovery and Data Mining. https://doi.org/10.1007/978-3-030-47436-2_27.
  33. Otto, Christian, Matthias Springstein, Avishek Anand, and Ralph Ewerth. 2019. “Understanding, Categorizing and Predicting Semantic Image-Text Relations.” Proceedings of the 2019 on International Conference on Multimedia Retrieval. https://doi.org/10.1145/3323873.3325049.
  34. Singhal, Shivangi, Rajiv Ratn Shah, Tanmoy Chakraborty, Ponnurangam Kumaraguru, and Shin ’ichi Satoh. 2019. “SpotFake: A Multi-Modal Framework for Fake News Detection.” 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). https://doi.org/10.1109/bigmm.2019.00-44.
  35. Zhang, Xiang, Junbo Zhao, and Yann LeCun. “Character-level convolutional networks for text classification.” In Advances in neural information processing systems, pp. 649–657. 2015.


Eva Almansa

Data Scientist and Researcher in Machine Learning and Artificial Intelligence.