AI and Fake News in Social Media
Recently, I was tasked with writing a speech for English class and was allowed to choose any topic; naturally, I chose AI. More specifically, I wrote about AI and fake news in social media and the use of supervised versus unsupervised training. The speech was part of a debate-like format in which my partner and I would each write a three-to-five-minute speech arguing for or against a resolution. After doing some background research, we were randomly assigned the pro-supervised and con-supervised positions and independently wrote our speeches. I have attached them here for your reading pleasure, along with some context I presented so that readers without expertise can better understand the topic. There are three parts: the context, the con-supervised speech, and the pro-supervised speech. Each section lists its author, and each author's separate works-cited page is included at the end. Only the sources actually used in the text appear in the works cited; background research not used in the speeches is omitted. These are a bit of a long read, so get comfortable and enjoy! And a big thanks to my partner, Dhanin Wongpanich, for letting me post his speech here.
Contents
- Context
- For Unsupervised Learning
- For Supervised Learning
- Works Cited
Context
Author: Ani Aggarwal
Trump's recent use of Twitter as his main form of communication with the American people marks a turning point for social media. It's becoming increasingly popular, allowing users to interact with each other at unprecedented speeds. However, its ability to exchange pictures, videos, and stories has allowed fake news and disinformation to thrive on these platforms. Disinformation is false information whose goal is to mislead or change public opinion (Merriam-Webster). Today, the most apparent form of disinformation is fake news, and I'm sure you have all interacted with it in the form of Instagram posts with questionable statistics, such as global-warming denial, or other posts that are now flagged with 'false information' tags. Unfortunately, "[s]tudies have shown that fake news spreads six times faster and lasts longer than true news" (Gale In Context). To combat this rapid spread, there is a growing need for moderation on social media platforms. However, platforms such as Instagram and Twitter are so massive that teams of human moderators cannot filter nearly enough disinformation to be practical. This is where these companies turn to artificial intelligence, or AI: the idea that machines can perform specific tasks that humans can, given some guidance.

Two types of AI are relevant to moderating these platforms: supervised and unsupervised AI. These are two different ways of 'teaching' AI and are often called supervised and unsupervised learning. To better understand these techniques, consider the following analogy: an AI system is a baby who is tasked with telling dogs apart from cats. In a supervised model, you stand next to the baby and tell it, "that thing over there is a dog, and that one is a cat," and repeat the process for thousands of different cats and dogs. By hearing the correct name for each animal while seeing it, the baby learns to associate things such as wagging tails with dogs and various other attributes with cats. An unsupervised model, on the other hand, doesn't involve you telling the baby anything at all; you just show it various cats and dogs. The baby, by itself, learns to group the animals by similar features. This baby would not know what a dog or a cat is, because you never told it those labels. Instead, it would merely sort out the animals, much like you might have sorted your Lego bricks when you were young without knowing the technical names for them. As you can tell, the techniques differ, and so does the data used to train them: supervised learning requires humans to label the animals, while unsupervised learning requires only the animals themselves.

Each of these AI types also has unique biases. Why is it important to consider these biases? Well, AI is already being used in a plethora of fields, and that number will continue to grow. As such, cases of AI being used improperly due to bias will rise, with devastating effects on a great number of people. One such example is Amazon's AI recruitment tool, which filtered out applicants and was found to favor men over women. Luckily, this was caught and Amazon scrapped it (Lo Piano), but not before many women had lost chances at their dream jobs or their livelihoods because of biased AI. But AI's bias is not limited to sexism: an algorithm used in US court systems predicted the likelihood that defendants would become recidivists. This algorithm was found to wrongly accuse twice as many black offenders as white offenders (Shin), preventing scores of people from returning home and seeing their families merely because of their skin color. Racist, sexist, and generally biased AI is a massive problem, and each of the following speeches will examine how its learning technique controls bias, as well as other key factors needed to create fair and effective AI.
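For readers who prefer code to babies, here is a minimal sketch of the same distinction. Everything in it is illustrative rather than from the speeches: the two-number 'animal' features, the tiny dataset, and the choice of scikit-learn's k-nearest-neighbors classifier and k-means clustering are all stand-ins.

```python
# Minimal sketch: the same toy animal data, taught two ways.
# Features: [tail_wag_rate, meow_frequency] -- invented for illustration.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

animals = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

# Supervised: we tell the "baby" which animal is which.
labels = ["dog", "dog", "cat", "cat"]
classifier = KNeighborsClassifier(n_neighbors=1).fit(animals, labels)
print(classifier.predict([[0.85, 0.15]]))  # -> ['dog']

# Unsupervised: no labels; the "baby" groups similar animals itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(animals)
print(clusters)  # e.g. [0 0 1 1] -- coherent groups, but no names for them
```

Note how the unsupervised model ends up with numbered clusters rather than the words 'dog' and 'cat', exactly like the baby who was never told the labels.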
For Unsupervised Learning
Author: Dhanin Wongpanich
Disinformation is quickly becoming one of the largest problems we face this decade, playing a critical role in everything from political elections to the spread of the Covid-19 pandemic. In our fight against this issue, artificial intelligence (AI) has emerged as a powerful weapon for detecting disinformation. Currently, researchers are pursuing supervised learning algorithms that rely on human-annotated datasets. However, the more unconventional approach of unsupervised learning promises to draw from the strengths of both AI and humans, ultimately resulting in less bias, better performance, and new insights.
One of the primary advantages of unsupervised models is how they are created, and data is one of the most important parts of that process. In supervised learning, the AI learns by looking at human-annotated data, which often consists of chunks of text or news articles accompanied by human commentary. After this process, the AI can learn to sort data it has never seen before. Unsupervised learning presents a different approach: instead of using human-annotated data, it simply takes in the raw data, the news articles and chunks of text alone, and uncovers natural patterns within it. This has huge advantages. It avoids the expensive and time-consuming process of human annotation and also helps to reduce human error. Human-annotated datasets often contain noisy and incorrect labels, which greatly reduce the effectiveness of supervised learning and of the resulting AI (Soni). Furthermore, by forgoing the expensive process of annotating data, unsupervised datasets can be much larger. The effect of this is twofold: more data allows models to generalize better, and each potential error in the dataset has a smaller effect, says Apoorv Agarwal, cofounder of Text IQ, a startup focused on privacy and security in AI systems (Kahn).

Alternatively, the lower cost of producing an unsupervised dataset can be spent another way: creating more up-to-date datasets. Because of the high cost and time commitment that creating an annotated dataset requires, most annotated datasets are many years old and may not be representative of what current disinformation looks like. Combined with bad actors who actively attempt to defeat these automated systems, it is almost impossible for outdated annotated datasets to accurately represent the current situation. This disconnect between the training data and the real world is one of the main reasons why promising AI-based systems fail to deliver in practical applications (Heaven). Bridging this gap can be done in a variety of ways, but one of the simplest and most effective is to create a new dataset. Should we really spend our effort creating expensive supervised datasets that will soon be obsolete, when cheaper and arguably better unsupervised datasets could be used instead?
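To make "uncovering natural patterns" concrete, here is a minimal sketch of the unsupervised setup described above. The toy posts and the choice of TF-IDF features with k-means clustering are illustrative assumptions; real disinformation pipelines are far larger and more sophisticated.

```python
# Minimal sketch: unsupervised grouping of raw article text.
# No human labels are used; the model only sees the documents themselves.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "Scientists confirm vaccine passed phase 3 trials.",
    "BREAKING: miracle cure suppressed by world governments!!!",
    "Central bank raises interest rates by 25 basis points.",
    "Secret memo PROVES the election was rigged, insiders say!!!",
]

# Turn raw text into numeric features (word-frequency vectors).
features = TfidfVectorizer(stop_words="english").fit_transform(articles)

# Group the documents into two clusters based only on their content.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

for text, cluster in zip(articles, clusters):
    print(cluster, text[:50])
```

Because no annotation step exists, adding a million fresh articles to `articles` costs only compute, which is exactly the up-to-date-dataset advantage argued above.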
Furthermore, one of the most important issues plaguing AI is systematic bias. Although it is hard to imagine an AI caring about race or gender, AI has already been shown to be tainted by our systematic biases. In one case, tweets by self-identified African Americans were 1.5 times more likely to be flagged as offensive by automated hate-speech detection algorithms. Thomas Davidson, a researcher at Cornell University, suggests that the problem is the data. He remarks, "You can have the most sophisticated neural network model, but the data is biased because humans are deciding what's hate speech and what's not" (Ghaffary). This highlights an inherent flaw in supervised-learning-based AI: it entrenches historical biases, further hurting people of color and other marginalized groups. Although it is impossible to completely eliminate all bias, unsupervised learning is a step in the right direction. Because it forgoes human annotation, human biases are less likely to transfer into the AI. Yann LeCun, a recipient of the prestigious Turing Award, often called the Nobel Prize of computing, agrees. He remarks that "self-supervised systems were… less likely to be biased than some AI software that learns from labeled examples… because labels are often applied by biased humans and the data sets were smaller, so each biased example would have a bigger impact" (Kahn). That said, unsupervised models are not without their issues. They still suffer from biases in data composition, and because of the massive size of unsupervised datasets, they are often impractical to audit. Ultimately, unsupervised systems still suffer from human biases, but cutting out human annotation is a valuable step toward reducing them.
Another key advantage of unsupervised models is their ability to form their own intuitions. Without human annotations to rely on, unsupervised models are forced to create their own ways of sorting the data, which often differ from how humans would sort it. Although this initially seems like a disadvantage, it is actually beneficial: used in conjunction with human moderators, these models can detect fake news that would slip past humans. This is considered a common strength of unsupervised learning, which is often used to detect patterns and trends humans cannot see (Soni). Furthermore, because of unsupervised learning's lower cost, unsupervised models tend to be more advanced and can introduce new ideas without requiring the creation of new, expensive datasets. In one study, an unsupervised model was able to consider "text content, images, propagation information, and user information of publishing news" (Li), something that would be extremely difficult with the limited supervised datasets that currently exist. This more holistic approach improves accuracy, and the new insights that arise from developing unsupervised models could pave the way for the future of AI. Unsupervised learning's potential for rapid development and its separation from human-introduced biases make it a powerful tool for defeating disinformation.
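As an illustration of the kind of unsupervised technique this speech alludes to, here is a minimal sketch that flags unusual posts by autoencoder reconstruction error. The tiny network, toy data, and the idea of simply printing errors instead of applying a threshold are illustrative assumptions, not the method of the cited study.

```python
# Minimal sketch: flagging outlier posts by autoencoder reconstruction error.
# The model learns to compress and reconstruct typical posts; posts it
# reconstructs poorly don't fit the learned patterns and can be surfaced
# for human review.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "City council approves new budget for road repairs.",
    "Local team wins regional championship after overtime.",
    "Weather service forecasts rain for the weekend.",
    "SHOCKING: doctors HATE this one weird trick, share now!!!",
]

X = torch.tensor(
    TfidfVectorizer().fit_transform(posts).toarray(), dtype=torch.float32
)

# A tiny autoencoder: compress each post vector, then reconstruct it.
dim = X.shape[1]
model = nn.Sequential(nn.Linear(dim, 4), nn.ReLU(), nn.Linear(4, dim))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for _ in range(200):  # train the model to reconstruct the posts
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    optimizer.step()

# High reconstruction error = the post doesn't match learned patterns.
with torch.no_grad():
    errors = ((model(X) - X) ** 2).mean(dim=1)
for post, err in zip(posts, errors):
    print(f"{err.item():.4f}  {post[:45]}")
```

Notice that no one ever tells the model which posts are fake; the 'intuition' it forms about what typical content looks like is entirely its own.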
When we consider the benefits of unsupervised learning, it is undeniable that it introduces less bias and allows us to work with larger, more robust datasets. It is not a matter of if we should switch to unsupervised learning, but when. Supervised learning may be the AI of today, but unsupervised learning will be the AI of tomorrow.
For Supervised Learning
Author: Ani Aggarwal
With the sharp rise in fake news, social media platforms should continue to invest in supervised AI to combat this disinformation, due to its controllable bias, low resource demands, and superior performance over unsupervised AI.
One of the biggest concerns about using AI for moderation on large platforms is the bias it may introduce. Data used to train AI models is full of biases, and those biases are replicated in the algorithms themselves. This was shown in a study that trained a model to associate words with each other on both the Chinese-language Wikipedia and Baidu Baike, the Wikipedia 'equivalent' in China (Knight). The model trained on Wikipedia associated 'democracy' with positive words such as 'stability', while the same model trained on Baidu Baike associated 'democracy' more closely with 'chaos'. This result can be attributed to China's censorship and pro-communist restrictions on its internet, which introduce anti-democratic bias into Baidu Baike (Knight). As one researcher summarizes, "[p]re-existing biases are embedded in the data on which we choose to train [these] algorithms" (Gall). Thus, carefully controlling the data used to create these algorithms is necessary to produce fair AI.

But how does this process differ between supervised and unsupervised learning? Given a particular dataset, say a collection of Instagram posts, an unsupervised version contains just the posts, while a supervised version contains the posts as well as labels. Though the underlying posts are the same, the dataset for supervised learning must be annotated by humans. This may initially seem like a disadvantage of supervised learning: humans have their own biases, which would be mirrored in the data they annotate, while unsupervised datasets need no human annotation. However, these human biases are in fact already present in both datasets, because both were created by humans and are thus susceptible to sampling and exclusion biases (Gall). As Samuele Lo Piano states in his peer-reviewed paper on ethical principles in AI, "the data are […] not an objective truth[; instead, they are] dependent upon the context in which they have been produced." And so, when unsupervised models try to learn from these inherently biased datasets, they replicate that bias, due to the very nature of unsupervised learning: unsupervised AI models must infer naturally occurring patterns within a dataset (Soni), and when those patterns are biased, the algorithm will be too. Supervised learning, on the other hand, allows bias to be much more closely regulated, because it merely learns the relationship between given data and desired outputs (Soni). Biased datasets can therefore be counteracted by balancing the human annotations, resulting in an overall fairer AI.
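To see how labels give humans a lever over what a model learns, here is a minimal sketch of the supervised setup: the same kind of post data as in the earlier sketches, but now paired with human annotations that can be audited and rebalanced. The toy posts, labels, and model choice are illustrative assumptions.

```python
# Minimal sketch: supervised fake-news classification on labeled posts.
# The human-supplied labels (1 = disinformation, 0 = legitimate) are what
# let annotators steer -- and later audit -- what the model learns.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "Peer-reviewed study finds vaccine safe and effective.",
    "5G towers PROVEN to spread the virus, wake up!!!",
    "Election officials certify results after full recount.",
    "They don't want you to know the moon landing was staged!",
]
labels = [0, 1, 0, 1]  # human annotations: 0 = legitimate, 1 = disinformation

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["BREAKING: secret cure hidden by big pharma!!!"]))
```

If an audit finds the labels skewed against some group or topic, the `labels` list can be corrected and the model retrained, which is precisely the 'closely regulated bias' the speech argues for.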
Supervised learning also opens up a wide variety of annotation methods, each with its own merits, that can increase the fairness of AI models. One such method is currently being used on Instagram by Facebook. Facebook's current AI systems use supervised models that learn "from manually annotated examples" but still rely on "human moderators […] to flag content that is controversial or subjective" (Demartini). This lets AI that is not yet nuanced enough filter out clearly undesirable content while simultaneously learning from Facebook's human moderators in real time. Another technique available to supervised learning, but not to unsupervised learning, is using user interactions on social media posts to create training data. For example, a post calling the coronavirus a hoax might draw some user comments pointing out that the information is fake, while other users support the conspiracy. Recently, researchers from Arizona State University and Microsoft published a paper in which they used such posts, together with users' interactions, to identify fake news (Shu). Their model drew on the opinions of many people rather than just a few, reducing bias, while also taking into account the credibility of individual users. The result was reliable, community-based AI with low bias. However, this still leaves the question of cost: is supervised learning's superior control over bias really worth the expense?
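Here is a toy illustration of that weak-supervision idea: deriving noisy labels from user comments instead of expert annotation. The heuristic and signal words below are invented for illustration; the cited paper's actual approach models multiple weak sources and user credibility jointly.

```python
# Toy sketch of weak social supervision: derive noisy labels from user
# comments instead of expert annotation. The signal words are invented
# for illustration; real systems also weigh each commenter's credibility.
def weak_label(comments):
    """Return 1 (likely fake) if debunking comments outnumber supportive
    ones, 0 otherwise. A crude stand-in for multi-source weak supervision."""
    debunk_signals = ("fake", "hoax", "debunked", "false")
    support_signals = ("true", "exactly", "agree")
    debunk = sum(any(w in c.lower() for w in debunk_signals) for c in comments)
    support = sum(any(w in c.lower() for w in support_signals) for c in comments)
    return 1 if debunk > support else 0

post_comments = [
    "This was debunked weeks ago.",
    "Total hoax, check the fact-checkers.",
    "Exactly what I've been saying!",
]
print(weak_label(post_comments))  # -> 1: weakly labeled as likely fake
```

Labels produced this way are noisy, but they come almost for free and from many people at once, which is what makes the community-based approach both cheap and less dependent on any single annotator's bias.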
Yes! It's clear that impartial AI is crucial for avoiding discrimination, and resource cost is no longer the obstacle it once was. In the past, supervised learning's need for human-annotated data made it resource-intensive compared to unsupervised learning. However, new studies show that small amounts of data can create incredibly effective supervised AI (Shu), lowering the resource cost for social media companies. Additionally, these supervised models greatly outperform state-of-the-art unsupervised models (Shu), averaging 4% higher scores in real-world tests (Shu, Gangireddy). It is no surprise, then, that platforms like Facebook use supervised learning, with up to 99.5% of removals performed by supervised AI (Kertysova).
With supervised learning's superior performance, lower resource cost, and fairness, it seems the obvious choice for AI algorithms going forward. So, as you scroll through your 'suggested' feed on Instagram and see no anti-vaxxers or QAnon conspiracies, thank supervised learning for its hard work.
Works Cited
Author: Ani Aggarwal
“Definition of Disinformation.” Merriam-Webster, Merriam-Webster, www.merriam-webster.com/dictionary/disinformation.
Demartini, Gianluca. “Users (and Their Bias) Are Key to Fighting Fake News on Facebook — AI Isn’t Smart Enough Yet.” The Conversation, 29 Jan. 2021, theconversation.com/users-and-their-bias-are-key-to-fighting-fake-news-on-facebook-ai-isnt-smart-enough-yet-123767.
“Fake News in Social Media.” Gale In Context Online Collection, Gale, 2019. Gale In Context: High School, link.gale.com/apps/doc/MHTAOB948972284/SUIC?u=los42754&sid=SUIC&xid=c6803676. Accessed 12 Mar. 2021.
Gall, Richard. “Machine Learning Ethics: What You Need to Know and What You Can Do.” Packt Hub, 23 Sept. 2019, hub.packtpub.com/machine-learning-ethics-what-you-need-to-know-and-what-you-can-do/.
Gangireddy, S. C., et al. "Unsupervised Fake News Detection: A Graph-Based Approach." Proceedings of the 31st ACM Conference on Hypertext and Social Media, ACM, 2020, pp. 75–83, doi.org/10.1145/3372923.3404783.
Kertysova, Katarina. "Artificial Intelligence and Disinformation." Security and Human Rights, vol. 29, no. 1-4, Brill Nijhoff, 12 Dec. 2018, brill.com/view/journals/shrs/29/1-4/article-p55_55.xml?language=en.
Knight, Will. “How Censorship Can Influence Artificial Intelligence.” Wired, Wired, 4 Feb. 2021, www.wired.com/story/how-censorship-can-influence-artificial-intelligence/.
Lo Piano, Samuele. "Ethical Principles in Machine Learning and Artificial Intelligence: Cases from the Field and Possible Ways Forward." Humanities and Social Sciences Communications, vol. 7, no. 9, 2020, doi.org/10.1057/s41599-020-0501-9.
Shin, Terence. “Real-Life Examples of Discriminating Artificial Intelligence.” Medium, Towards Data Science, 4 June 2020, towardsdatascience.com/real-life-examples-of-discriminating-artificial-intelligence-cae395a90070.
Shu, Kai, et al. "Leveraging Multi-Source Weak Social Supervision for Early Detection of Fake News." arXiv, 3 Apr. 2020, arxiv.org/abs/2004.01732.
Soni, Devin. “Supervised vs. Unsupervised Learning.” Medium, Towards Data Science, 21 July 2020, towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d.
Works Cited
Author: Dhanin Wongpanich
Ghaffary, Shirin. “The Algorithms That Detect Hate Speech Online Are Biased against Black People.” Vox, Vox, 15 Aug. 2019, www.vox.com/recode/2019/8/15/20806384/social-media-hate-speech-bias-black-african-american-facebook-twitter.
Heaven, Will Douglas. “The Way We Train AI Is Fundamentally Flawed.” MIT Technology Review, MIT Technology Review, 18 Nov. 2020, www.technologyreview.com/2020/11/18/1012234/training-machine-learning-broken-real-world-heath-nlp-computer-vision/.
Kahn, Jeremy. “Can A.I., so Often Blamed for Perpetuating Hidden Bias, Help Uncover It Too?” Fortune, Fortune, 30 Mar. 2021, fortune.com/2021/03/30/humans-are-plagued-by-hidden-biases-a-i-can-help/.
Kahn, Jeremy. “Facebook Claims Computer Vision Breakthrough with Instagram-Trained A.I.” Fortune, Fortune, 4 Mar. 2021, fortune.com/2021/03/04/facebook-says-its-new-a-i-that-learns-without-labelled-data-represents-a-big-leap-forward-for-computer-vision/.
Li, Dun, et al. “Unsupervised Fake News Detection Based on Autoencoder.” IEEE Xplore, IEEE Xplore, 11 Feb. 2021, ieeexplore.ieee.org/abstract/document/9352726.
Soni, Devin. “Supervised vs. Unsupervised Learning.” Medium, Towards Data Science, 21 July 2020, towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d.

