AI/ML for Fake News & Misinformation Detection
The concepts of the past influence the ideas of today. The internet plays an integral role in this process because of its remarkable capacity to store information: it is home to billions of ideas and concepts, and much of our initial perception is formed by what we see and hear online. Ideas build on what we already know, and the internet is the main source of the information many of us acquire. Valid information can spark sound and reasonable ideas, whereas fake information can often spark wrong and extreme ones. For that reason, it is vital to be cognizant of the information we consume online. That is no easy task, since fake information is easily fabricated. Artificial intelligence and machine learning offer a way to limit the spread of fake news and misinformation through natural language processing, a set of algorithms optimized to process, analyse, and represent human language. Challenges remain in this field, however.
Misinformation and fake news have become widespread and influential on people’s opinions. A study by Ipsos Public Affairs for Canada’s Centre for International Governance Innovation (CIGI) found that 90% of Canadians have fallen for fake news, with 33% of them experiencing it “frequently.” Similarly, 52% of US citizens stated that they encounter fake news online “regularly” [Thompson; Watson]. These figures illustrate how common the problem is and how urgently companies and organizations need to act on it. Misinformation can be trusted just as readily as valid information, since it is presented matter-of-factly. When users trust a source, its information may influence the choices they make. With many fragile ongoing social, economic, and scientific issues, fake news and misinformation could shape how future discussions and opinions are formed, since the world of tomorrow is determined by our decisions in the present. Online platforms are aware that they need to moderate and monitor content for accuracy more effectively, but for years they were unable to do so because the right technology was not available.
For years, many online platforms were unable to flag fake news and misinformation, and they are not necessarily the ones to blame: detecting falsified content was only possible if thousands of researchers and editorial experts reviewed the content posted on the platform.
Hiring thousands of employees would pose a substantial cost. Naturally, one might argue that online platforms have to pay that price, since the content is posted on their platforms. Put plainly, however, it is not their fault, but it is their responsibility. Responsibility is being accountable to others for events that are within our control, whereas fault is being the main cause of a failure. Online platforms are not the main cause of the failure, but they are well aware that it is their responsibility to provide a pleasant and safe experience to their users. This is not to say that they are not unintentional contributors to the problem.
That said, online platforms are now working alongside researchers and other organizations to develop machine learning algorithms based on natural language processing that moderate and remove misinformation.
Solution: Natural Language Processing (NLP)
NLP is a subfield of artificial intelligence that draws heavily on machine learning and is concerned with understanding human language. Elizabeth D. Liddy, Emeritus Professor of Information Science and former Dean of the Syracuse University School of Information Studies, defines it as follows [Liddy]:
A theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.
There is no single agreed-upon definition of natural language processing, but Liddy’s definition captures most aspects of it. NLP is a “range of computational techniques” because there is no single method or algorithm but many, and “naturally occurring texts” are any texts, spoken or written, in a language that humans use to communicate. Each NLP method suits different scenarios, but all of them share the main objective of “achieving human-like language processing,” because that would provide an effective solution to problems such as the creation and spread of fake news and misinformation.
Training and deploying an NLP model for fake news and misinformation detection is similar to working with other kinds of ML models, whether for object detection or audio classification. There are three main components or stages: data, algorithm, and predictions.
- Data: A number of datasets have been created specifically for fake news and misinformation detection, including LIAR and FakeNewsNet [Tolios]. The datasets contain carefully fact-checked statements that are labelled either true and valid or false and misleading. The developer can then process the statements with NLP methods such as tokenization and the removal of stop words (e.g. the, of, for), so that the statements retain only the significant features of the overall text.
- Algorithm: The data is fed into an algorithm. Different models fit different needs, and for NLP, the most popular and effective supervised ML algorithms (those trained on labelled examples) are support vector machines, Bayesian networks, maximum entropy, conditional random fields, and neural networks [Barba].
- Predictions: Any supervised classification algorithm can, in principle, be applied to fake news and misinformation detection, because each one predicts an outcome from a set of given options and their probabilities. Fake news and misinformation detection does not require a special ML algorithm; from the perspective of computer scientists, it is at its core a classification problem.
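The three stages above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the four training statements and their labels are invented for the example (they are not taken from LIAR or FakeNewsNet), the stop-word list is deliberately tiny, and a naive Bayes classifier is used simply because it is a small member of the Bayesian family and fits in a short listing. The names `preprocess` and `NaiveBayes` are hypothetical helpers, not part of any library.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "of", "for", "a", "an", "is", "in", "to", "and", "that"}


def preprocess(text):
    """Data stage: lowercase, tokenize on letter runs, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Algorithm stage: multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of statements
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = preprocess(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        """Prediction stage: return the label with the highest log score."""
        tokens = preprocess(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy statements, for illustration only.
train_texts = [
    "Scientists confirm the vaccine passed clinical trials",
    "The city council approved the budget for new schools",
    "Miracle cure doctors hate erases all disease overnight",
    "Secret miracle pill melts fat overnight, doctors shocked",
]
train_labels = ["real", "real", "fake", "fake"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("Miracle pill cures disease overnight")
print(prediction)
```

With only four statements the model has learned little more than word overlap, so this shows the shape of the pipeline rather than a usable detector. A realistic system would train on thousands of fact-checked statements and would be evaluated on held-out data.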
One prominent challenge with fake news and misinformation detection is that topics change frequently. Compared to other kinds of tasks, such as object detection, a dataset built from past texts becomes less effective as time passes. New topics emerge, and people begin writing about them. A model that was optimized to detect fake news and misinformation in past text can no longer achieve the same accuracy. For that reason, the dataset should be updated frequently with texts that cover new topics and ideas.
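One rough way to see this effect is to measure how many words in new texts never appeared in the training data. The sketch below is an assumption-laden illustration, not an established drift metric: `out_of_vocabulary_rate` is a hypothetical helper, and the example sentences are invented. A rising rate suggests that the training set no longer covers what people are writing about.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z']+", text.lower())


def out_of_vocabulary_rate(training_texts, new_texts):
    """Fraction of tokens in new_texts that never occur in training_texts."""
    vocabulary = {tok for text in training_texts for tok in tokenize(text)}
    new_tokens = [tok for text in new_texts for tok in tokenize(text)]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocabulary)
    return unseen / len(new_tokens)


# Invented examples: the "old" data covers one topic only.
old_texts = [
    "the election results were disputed",
    "voters question the election count",
]
same_topic = ["the election count was disputed"]
new_topic = ["new virus variant spreads quickly"]

rate_same_topic = out_of_vocabulary_rate(old_texts, same_topic)
rate_new_topic = out_of_vocabulary_rate(old_texts, new_topic)
print(rate_same_topic, rate_new_topic)
```

In this toy case the text on the familiar topic shares most of its words with the old data, while every word in the text on the new topic is unseen. Vocabulary overlap is only a proxy, since a model can also degrade when familiar words are used in new ways, but it gives a simple signal for when a dataset may need refreshing.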
Fake news and misinformation detection can be approached with machine learning algorithms, but the challenge is that the data must be updated constantly. From our perspective, fake news and misinformation detection is going to be similar to the cybersecurity industry: a cat-and-mouse game in which the mouse, the individuals creating fake news and misinformation, is caught only as long as the cat, the ML algorithm, stays ahead by using newer data. For that reason, even if an ML algorithm is effective today, it might be obsolete a year from now. The mouse may escape because the cat has aged. Fake news and misinformation detection algorithms will keep improving and advancing and, as with other technologies, new challenges will arise that have to be tackled.
Barba, Paul. “Machine Learning (ML) for Natural Language Processing (NLP).” Lexalytics, 29 Sept. 2020, www.lexalytics.com/lexablog/machine-learning-natural-language-processing.
Liddy, Elizabeth D. Natural Language Processing, School of Information Studies — Syracuse University, 2001, surface.syr.edu/cgi/viewcontent.cgi?article=1043&context=istpub.
Thompson, Elizabeth. “Poll Finds 90% of Canadians Have Fallen for Fake News | CBC News.” CBCnews, CBC/Radio Canada, 11 June 2019, www.cbc.ca/news/politics/fake-news-facebook-twitter-poll-1.5169916.
Tolios, Giannis. “How I Created a Fake News Detector with Python.” Medium, Towards Data Science, 8 Oct. 2021, towardsdatascience.com/how-i-created-a-fake-news-detector-with-python-65b1234123c4.
Watson, Amy. “Frequency of Fake News on Online News Websites U.S. 2018.” Statista, 23 Oct. 2019, www.statista.com/statistics/649234/fake-news-exposure-usa/.
Thank you for the great feedback and suggestions from Omar Soliman.