How AI will help us get rid of fake news by 2030
The Internet was supposed to connect the world and make it a better place. Lately, however, I have a feeling it does everything but that. Google anything that has been widely covered by the media (“is global warming true?”, “is 5G safe?”, “brexit”), and you’ll find tons of articles saying different things.
Most of these stories contradict each other, which means at least half of them are false (or only “partially true”). How do you tell which is which?
The Internet has turned into such a mess of false and meaningless information that it’s now nearly impossible to browse it effectively.
This state of affairs could have been predicted earlier, but it was around 2016 (the election of D. Trump, the Brexit referendum) that the issue became globally apparent. It was also then that the term “fake news” got popularized.
Since then, people have blamed everybody and everything:
- Third parties. (“Somebody interfered in our elections!”)
- Facebook / Capitalism / Ads industry. (“They care only about clicks and $$$!”)
- Journalists. (Even the most righteous news agencies had to resort to clickbaity titles and controversial articles, or they would lose their audiences and go bankrupt.)
- Other people. (Some tried to debunk fake news by battling its spreaders on social media. As a result, however, many of my friends went apolitical, caring and reading less and less about politics. Why would you care, if you can’t get reliable information about it?)
- Democracy. (“Democracy is ruined because people don’t care what they read, share, and support.”)
Well, we can keep blaming everybody, or we can sit down to the problem and do something about it.
Actually, some people have already sat down.
Did you know… over 1 billion USD has been pledged for the years 2015–2024 with the goal of preventing disinformation, improving discussion on the Web, and increasing the credibility of the Web’s content? Many projects have already been created in this area and have started producing promising results.
In my opinion, it’s a matter of a few years until we see their impact on our daily lives.
Let me elaborate on this.
Revolt against disinformation
The events of 2016 caused some governments and NGOs to take action. They started by issuing reports about the causes and effects of disinformation on society, and about possible ways of tackling it:
- First Draft: “The Impact of CrossCheck on Journalists & the Audience” (2017)
- EU: “A multi-dimensional approach to disinformation” (March 2018); “Tackling online disinformation” (2019)
- Canada: “Safeguarding Elections” (2019)
In such reports, researchers agree on many things:
- Terminology, like what it means “to mislead”, “to misinform”, “to disinform”. Also observations: for example, fake news isn’t a problem when everybody knows it’s fake. On the other hand, if news is true but intentionally misleads people into believing a false claim, that’s a problem. (This is called “disinformation”.)
- Disinformation is a huge problem. “It may have far-reaching consequences, cause public harm, be a threat to democratic political and policy-making processes, and may even put the protection of citizens’ health, security and their environment at risk.” (European Commission)
- Actions against disinformation need to be taken immediately.
- There are good and bad ways of acting against disinformation. For example, censorship will cause the opposite effect. Thus, instead of blocking access to fake news or showing “trustability” scores without any explanation behind them, we should explain why something is fake, and why it has the score it has.
With independent reports proving that disinformation is a danger to the world as we know it, many countries, companies and non-governmental foundations have set up vast funds for projects that might fix it. Here are a few of them:
- European Commission: 100 mln EUR in 2015–2027
- UK: 100 mln GBP in 2019–2024
- Google DNI: 115 mln EUR in 2015–2019
- Facebook Journalism Project: 300 mln USD in 2019–2021
- Open Society Foundations: 24 mln USD in 2019
- Knight Foundation: 300 mln USD in 2019–2024
- AI Ethics Initiative: 750k USD in 2018–2019
Just the ones listed above will donate more than 1 billion USD over the upcoming five years to battle the disinformation problem. More than a hundred projects have already been bootstrapped thanks to these funds. (I’ll describe these projects in a bit.)
In the meantime, some people didn’t want to just wait. They wanted to act. So they volunteered for (or created their own) fact-checking organizations, e.g.:
- FullFact (UK, from 2010)
- Chequeado (Argentina, from 2010)
- Demagog (Poland, from 2014)
- FactCheck.org (US, from 2003)
There are already over 100 independent fact-checking organizations all over the world. IFCN, the biggest global fact-checking network, has set up neutrality and clarity standards for fact-checking, and now spans 67 organizations from 42 countries.
What are they and what do they do? The vast majority of them are non-profit and volunteer-based. They look for influential articles and politicians’ claims and check whether they are true or misleading. Then, they share the fact-check reports through social media. They also accept requests from the public (via e-mail, Facebook, Instagram, WhatsApp) for claims that should be reviewed.
Collaboration of competing newspapers
Some of these NGOs reached out to news agencies with a call to collaborate on fact-checking just before upcoming national elections:
- Mexico, 2018 presidential elections — 60 publishers, universities and organizations
- Brazil, 2018 presidential election — 24 publishers
- France, 2017 presidential election — 37 publishers
- Nigeria, 2019 general election — 45 journalists, 16 publishers
For example, in France, most of the major (and usually competing!) news agencies joined up to fight their common enemy: disinformation. They shared notes on fact reports among themselves and publicized fact-checks together.
This shows that the journalism industry is in such a bad position nowadays that journalists will try anything to win back people’s trust in the media. They will even collaborate with competitors to succeed.
Unfortunately, for each published fact-check, a hundred other fake articles are created in the meantime. We will never be able to make a long-lasting impact if we don’t set up an IT infrastructure for battling this problem automatically.
Google reacted quite quickly. It collaborated with Schema.org to create a global standard for machine-readable fact-checks (“ClaimReview”). It now uses the standard to show fact-checks in Google Search results, and has also published a Fact Check Explorer that searches just the facts.
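For illustration, here is roughly what one such machine-readable fact-check looks like. The field names come from the Schema.org ClaimReview type; the reviewer, claim, rating scale and URLs below are invented for the example.

```python
import json

# A minimal ClaimReview record in the shape Schema.org describes.
# All concrete values here are made up for illustration.
claim_review = {
    "@context": "https://schema.org",
    "@type": "ClaimReview",
    "url": "https://example-factchecker.org/reviews/123",  # where the fact-check lives
    "claimReviewed": "Sea levels have not risen in the last decade.",
    "itemReviewed": {
        "@type": "Claim",
        "author": {"@type": "Person", "name": "Example Politician"},
        "datePublished": "2019-06-01",
    },
    "author": {"@type": "Organization", "name": "Example Fact-Checker"},
    "reviewRating": {
        "@type": "Rating",
        "ratingValue": 1,  # 1 = false ... 5 = true (scale chosen by the publisher)
        "bestRating": 5,
        "worstRating": 1,
        "alternateName": "False",
    },
}

# Embedded in a page as JSON-LD, this is what crawlers
# such as Google's Fact Check Explorer can pick up.
markup = json.dumps(claim_review, indent=2)
print(markup)
```

Because the record is plain structured data, any crawler that understands the schema can aggregate fact-checks from thousands of independent publishers.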
Meanwhile, W3C, a globally recognized organization that sets up Web standards for the whole world, issued a huge report called “Technological Approaches to Improving Credibility Assessment on the Web”. Now it is trying to settle on definitions of Credibility Signals, to help create a global ecosystem of interoperable credibility tools. (I’ll describe those Signals in a later section.)
Automating the battle
NGOs, governments, private companies, individuals like us - we all agree that we need to prevent disinformation. We need to stop people, bots and 3rd parties from spreading information that intentionally misleads people into believing a false claim.
So if it’s all about:
- detecting disinformation,
- verifying the truthfulness and credibility of claims and articles,
- informing the press and the public about the credibility of the given claim or article,
Then the question is, can we automate it?
If we gave enough data to a machine learning model (data consisting of articles with highlighted fact-checkable sentences), it would become smart enough to automatically highlight claims-to-be-verified in any article. Whenever it made a mistake, editors would give it feedback, improving the model even further.
With such a tool, if we had a set of articles (or links to them), we could automatically extract claims from them that “mean something” and should be verified.
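A minimal sketch of such an extractor, with hand-written keyword rules standing in as a placeholder for the trained model (the rules and the example article are my own invention):

```python
import re

# A crude stand-in for the ML highlighter described above: flag sentences
# that look fact-checkable (numbers, percentages, absolutes, superlatives).
# A trained classifier would replace these hand-written rules.
CHECKWORTHY = re.compile(
    r"\d|percent|%|\bmost\b|\bfirst\b|\bnever\b|\balways\b", re.IGNORECASE
)

def extract_claims(article_text):
    """Return the sentences an editor should probably verify."""
    sentences = re.split(r"(?<=[.!?])\s+", article_text.strip())
    return [s for s in sentences if CHECKWORTHY.search(s)]

article = (
    "The minister spoke for an hour. "
    "Unemployment fell by 40% since 2015. "
    "People seemed pleased."
)
print(extract_claims(article))  # only the middle sentence is flagged
```

Even this naive version shows the shape of the tool: text in, a short list of check-worthy sentences out, ready for a human or a downstream verifier.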
Finding a related claim and its claim review
Now, imagine, that all fact-checking organizations are following the ClaimReview schema. Then, we could have a database with all the fact-checks from the Internet. Actually, Google probably already has it.
Additionally, imagine that we have a bot that can reduce claims to their simplest form. Then, if we had two sentences that differed a bit in wording but meant the same thing, we could reduce them to the same claim. Having that, we would be able to automatically find a related claim review (if one exists, of course) for any given claim.
That would mean we could automatically fact-check claims on-the-fly using the existing database of claim reviews.
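A minimal sketch of that matching step, using a toy review database and simple word overlap. A real system would use language models that also handle inflections (“cause” vs “causing”) and paraphrases; everything below is invented for illustration:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "in", "has", "have"}

def normalize(claim):
    """Reduce a claim to a bag of content words - its 'simplest form'."""
    words = re.findall(r"[a-z0-9]+", claim.lower())
    return frozenset(w for w in words if w not in STOPWORDS)

def similarity(a, b):
    """Jaccard overlap between two normalized claims (0.0 - 1.0)."""
    a, b = normalize(a), normalize(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_review(claim, review_db, threshold=0.5):
    """Return the stored fact-check whose claim best matches, if any."""
    best = max(review_db, key=lambda r: similarity(claim, r["claim"]), default=None)
    if best and similarity(claim, best["claim"]) >= threshold:
        return best
    return None

db = [{"claim": "Vaccines cause autism", "verdict": "False"}]
hit = find_review("vaccines are causing autism", db)
```

With a database of ClaimReview records behind `find_review`, this is exactly the “fact-check on the fly” loop: extract a claim, normalize it, look up the closest reviewed claim, and show its verdict.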
Even more, maybe our web scraper would be smart enough to find claim reviews in articles just by parsing their content? Then we wouldn’t need to rely on website owners implementing the ClaimReview markup at all.
Google already does a bit of such text interpretation by giving us a straight answer to our search queries right away. So it seems like all of the above ideas are possible.
Detecting and understanding claims is one thing, but verifying them is another story. The truth is, it might be difficult to train an AI to verify claims on its own. After all, if it could understand and verify anything we’re saying, it would be a damn smart AI. (For that smart AI, we’ll probably have to wait until ~2050–2100.)
However, if we have already made an AI beat the best Jeopardy players, surely we could at least train it to automatically verify some subset of the claims. Claims involving statistics, numbers, or publicly accessible data should be easy to check automatically.
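For instance, a checker for numeric claims could look roughly like this. The statistics table and its values are invented; a real system would query official public databases:

```python
import re

# Hypothetical table of official statistics the checker trusts.
# A real checker would pull these from government or statistical-office APIs.
OFFICIAL_STATS = {"unemployment_rate_2019": 3.7}

def check_numeric_claim(claim, stat_key, tolerance=0.1):
    """Compare the percentage quoted in a claim against the official figure.
    Returns 'supported', 'contradicted', or 'no number found'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", claim)
    if not match:
        return "no number found"
    quoted = float(match.group(1))
    official = OFFICIAL_STATS[stat_key]
    return "supported" if abs(quoted - official) <= tolerance else "contradicted"

print(check_numeric_claim("Unemployment is at 3.7% right now", "unemployment_rate_2019"))
# -> supported
print(check_numeric_claim("Unemployment is at 12% right now", "unemployment_rate_2019"))
# -> contradicted
```

The hard part in practice is not the comparison but linking the claim to the right statistic; that linking step is exactly what the claim-understanding machinery above would provide.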
Furthermore, it could automatically highlight logical fallacies and cognitive biases. For example, imagine a tool that is able to automatically flag any claim based on survivorship bias. (“They did it, so we can do it as well.” — the tool would call bullshit — because without additional arguments, such a sentence is false.)
Live fact-checking of TV/videos
Automated voice-to-text (and video-to-text) transcription tools are already very mature. Just go to a random YouTube video and enable subtitles. Many of them have been generated by AI, not written by a human.
Having that, and also having a tool that can automatically detect and verify claims in a given text, we could have live TV debates in which the truthfulness of politicians’ claims is checked in real time and shown underneath the video.
Some fact-checking organizations have already done it. It doesn’t work perfectly yet, sure, but they’re just non-profits that only started doing it a year ago. What if we give it enough funding, data, and ~5 years?
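Conceptually, the live pipeline is simple: transcribed segments stream in with timestamps, and each one is run through the claim machinery described above. A sketch, with a stub verdict lookup (and invented transcript data) standing in for that machinery:

```python
# `lookup_verdict` is a stub: in a real pipeline this would be the
# claim-extraction and claim-matching machinery described earlier.
def lookup_verdict(sentence):
    known = {"unemployment fell by 40% since 2015": "False"}
    return known.get(sentence.lower())  # None when nothing matches

def live_captions(transcript_stream):
    """Yield (timestamp, sentence, verdict) tuples as the debate runs,
    so the broadcaster can overlay verdicts under the video."""
    for timestamp, sentence in transcript_stream:
        yield timestamp, sentence, lookup_verdict(sentence)

stream = [
    ("00:01:12", "Good evening everyone"),
    ("00:01:30", "Unemployment fell by 40% since 2015"),
]
overlay = list(live_captions(stream))
```

The generator shape matters: verdicts appear segment by segment, so nothing waits for the debate to finish.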
Calculating the trustability score of an article/publisher
Nevertheless, let’s assume we can’t verify all the claims, either automatically or manually. Not automatically, because of technical barriers; and not manually, because verifying them by hand takes time and often requires specific domain expertise (e.g. in medicine or science).
In such a case, when we’re not able to completely understand a given article or fact-check all its sources (or the facts it relies on), is it possible to tell whether the article is telling the truth? Or at least, to say with what probability it is?
Of course it’s possible. We humans have been living with tremendous amounts of information for centuries, and we have our own ways of judging the truthfulness of a given information source. For example, we ask ourselves:
- What style of language does the article use? Is it full of emotions? Full of weirdly implied sentences? Or is it closer to scientific language? Does it have a clear logical order?
- Is the article title or content clickbaity and pretentious? Is it full of personal opinions? Does it contain logically incorrect sentences (e.g. arguments based on survivorship bias)?
- Is the article based on any sources? Where are those sources? Are they trustworthy? Do they actually support the claims made in the article?
- Are there any false claims in this article?
- Does the website on which the article is published usually tell the truth? When it doesn’t, does it correct its mistakes?
Each answer to the above questions gives us a small hint on whether the given article is trustworthy or not. W3C calls them Credibility Signals.
If you sum them all up, you receive a Credibility Score: the probability that the article is telling the truth.
Many signals might influence that score, and nowadays each fact-checking organization has a different set of them. Many of the signals, however, recur across organizations. They just might have different names, or the organizations might calculate their values with different methodologies.
It’ll probably stay like that for some time, until these tools and organizations mature. Eventually, however, I hope we’ll have something like a browser extension to which we could connect several “credibility score sources”, and which would give us live feedback about whatever we’re reading:
- What’s the credibility score of this claim/article/website?
- Why exactly is the score what it is? (Give me reasons, links, proofs. Highlight the sentences that influence it.)
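One simple way such an extension could combine the signals into a single score is a weighted average. All the signal names, values and weights below are made up for illustration; real sources would derive them from data and publish their methodology:

```python
# Hypothetical signal values (0.0 = bad, 1.0 = good) and their weights.
signals = {
    "neutral_language":  0.9,
    "cites_sources":     1.0,
    "no_false_claims":   0.8,
    "site_track_record": 0.6,
}
weights = {
    "neutral_language":  0.2,
    "cites_sources":     0.3,
    "no_false_claims":   0.3,
    "site_track_record": 0.2,
}

def credibility_score(signals, weights):
    """Weighted average of the signals - one simple way to combine them.
    Crucially, the per-signal values stay available, so the extension can
    explain *why* the score is what it is, not just show a number."""
    total_weight = sum(weights.values())
    return sum(signals[k] * weights[k] for k in signals) / total_weight

score = credibility_score(signals, weights)  # a value between 0.0 and 1.0
```

Keeping the individual signals inspectable is what separates this from an opaque “trustability” badge, which, as the reports above warn, would backfire.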
Detecting the language style of the text
While calculating the credibility score for some sources, we might observe that if a text is full of insults, there’s a higher chance it is misleading. Based on this assumption, is there a way to automatically detect whether a text contains insults (or other elements that might influence its credibility)?
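A trivial sketch of such a signal, using a tiny hand-made word list where a real tool would use a trained classifier or a published abuse lexicon (the word list and example sentences are invented):

```python
import re

# A tiny hand-made lexicon - purely illustrative. A real signal would come
# from a trained model or an established abuse/sentiment lexicon.
CHARGED_WORDS = {"idiot", "stupid", "traitor", "disgrace", "liar", "scum"}

def charged_language_ratio(text):
    """Fraction of words that are insults or emotionally charged."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in CHARGED_WORDS for w in words) / len(words)

calm = "The committee reviewed the budget and published its findings."
angry = "Only an idiot or a traitor would sign this disgrace of a deal."
# calm scores 0.0; angry scores well above zero.
```

The ratio itself would then feed into the credibility score as one signal among many, never as a verdict on its own.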
Tracking down the source
Unfortunately, reading just the article’s text is often not enough to tell whether the article is telling the truth. We need to track down the source of the information the article relies on. For example, if we were able to prove that a shown image or video is fake, or that the cited source of the information doesn’t exist, then based on that alone we could tell with high probability that the whole article is not telling the truth. (If we can’t prove it, the article remains “unverified” at best.)
My favorite technique for finding the source is extensively googling the given matter: googling until I have gone through the entire Internet and found all the relevant information on the subject. Then I try to connect all the dots. I merge the articles that are copies of each other or say the same thing. I try to establish the chronological order: who was the first to say “X”?
Recently, however, it has become more and more challenging to do this manually. There are so many copies of articles everywhere, and some of them are not instantly apparent (because they’ve been slightly modified so as not to resemble the original). Many articles do not cite their sources, so you have to go through tens of links, forums, and social media discussions to find them. Not to mention that you have to continuously measure their credibility and reject the bullshit sources (or at least not treat them as seriously as the trustworthy ones).
What if we had a tool that could do all of the above automatically? I mean something better than “Google Search”, like a smarter search assistant. Instead of just giving me the links for the given search query, it would show me the whole “network” of articles related to it. It could be shown as a graph. It could be sorted by date. Filtered by Credibility Score. And so on.
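A sketch of two core steps such an assistant would need: grouping near-duplicate articles and finding the earliest copy. The records are invented toy data, and plain text overlap stands in for real document similarity:

```python
import difflib
from datetime import date

# Toy records standing in for scraped articles (all values invented).
articles = [
    {"url": "a.example/1", "date": date(2019, 3, 2),
     "text": "Scientists discover water on planet X."},
    {"url": "b.example/2", "date": date(2019, 3, 1),
     "text": "Scientists have discovered water on planet X!"},
    {"url": "c.example/3", "date": date(2019, 3, 5),
     "text": "Local team wins the regional cup."},
]

def near_duplicates(a, b, threshold=0.8):
    """Treat two articles as copies if their texts overlap heavily."""
    ratio = difflib.SequenceMatcher(None, a["text"].lower(), b["text"].lower()).ratio()
    return ratio >= threshold

def earliest_source(target, corpus):
    """Among all copies of `target`, return the one published first -
    the 'who said X first?' step of source tracking."""
    copies = [a for a in corpus if near_duplicates(a, target)]
    return min(copies, key=lambda a: a["date"])

origin = earliest_source(articles[0], articles)  # the earlier b.example copy
```

From here it is a short step to the graph view described above: nodes are article clusters, edges are citation or similarity links, and the earliest node in each cluster is the candidate source.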
Some organizations/companies are already doing it. TrustServista, for example. Are there any more? Please comment. If not, maybe it’s about time to create one?
Image/Video source detection
Reverse image search is already possible (e.g. through Google Images). You upload an image, and the search tool finds you all the occurrences of that image on the Internet.
However, it doesn’t tell you which of the occurrences is the actual source of the image. Also, what if the image has been edited (e.g. somebody changed or added subtitles to it)? Will it still be possible to find the related images, including the original ones?
The same question applies to videos. If I have a snapshot of a video, can I find the video source from which it comes?
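One common building block for this kind of matching is perceptual hashing: shrink the image to a tiny grayscale thumbnail, turn it into bits, and compare bit distances. Similar images (even with small edits like added subtitles) land close together; unrelated ones land far apart. A sketch of the “average hash” variant, assuming the downscaling step has already happened (the pixel matrices are toy data):

```python
def average_hash(pixels):
    """Perceptual 'average hash': 1 bit per pixel - brighter than the mean or not.
    `pixels` is a small grayscale matrix (e.g. an image downscaled to 8x8)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of differing bits - small distance means 'probably the same image'."""
    return sum(a != b for a, b in zip(h1, h2))

original = [[10, 200], [220, 30]]        # toy 2x2 "image"
with_subtitles = [[10, 200], [220, 90]]  # same image, one region altered
unrelated = [[255, 5], [0, 250]]

d_same = hamming(average_hash(original), average_hash(with_subtitles))
d_diff = hamming(average_hash(original), average_hash(unrelated))
# d_same stays small; d_diff is large.
```

Indexing the Web by such hashes, plus publication dates, is one plausible route to answering “where did this image first appear?”; video frames can be hashed the same way, which is why a snapshot can, in principle, lead back to its source video.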
Image/Video fakeness detection
Related to the above, but even more needed: can we detect that a given image or video is fake? (Deep fakes are already floating around, but it’s probably 2020 when things will start to get very serious.)
As this topic has been getting a lot of attention recently, tools are slowly being created to help detect whether an image or video is fake. Some of them try to measure it just by looking at the image or video itself (e.g. by looking for malformed pixels). This approach might work for a while, but sooner or later deep fakes will become so good that they won’t differ from real images and videos at all.
Then, probably the only way to know will be to have some enormous central sources of “information credibility”, which would tell us with what probability a given piece of information, image, or video is real. For example, if many trustworthy sources confirm that a given video is real, we can assume it is real. In any other case, we will need to assume that it might be fake.
Some existing approaches for this problem:
Improving discussion on the Web
All the above tools will improve our ability to research and read the Internet. They will tell us whether a given article is reliable, and what other articles are related to it.
However, I don’t always have time to read all the related articles. Nor do I want to depend on an AI tool telling me what is reliable or not. (At least until it gets very mature, which might not happen soon enough.) So I’d like this process of content verification to be a bit more community-based. I’d like to discuss the claims in an article right there, in the article, without having to run off to some 3rd-party platform (Facebook, Twitter) or hoping that the given website lets me add a comment (and will not filter it out).
I’d like to see what the whole world is saying about a given article. I’d like to see fact-checkers’ annotations while I’m reading it. Can we do it?
Highlighting and annotating the Web
Imagine we have a browser extension that lets us highlight anything on the web. (Just like we highlight quotes on a Kindle, for example.) We could highlight any quote, text, or even an image on the website we’re visiting and save it for ourselves.
Even more, what if we could comment on it? Tell the others that “it’s true because X”, “false because Y”, or that “it’s meaningless because Z”. What if we could share such highlights and annotations with the others and let them discuss them as well?
Such a thing could be done through a browser extension, but not only. If people entered through a shared link, or if the website publisher installed a small script on their website, the annotating experience could be available to everybody, without the need to install anything.
Well, enter Hypothes.is. This tool has been in development since 2011 and is already used by thousands of people. (5 million annotations have already been created with it.) Although for now it seems to have focused on universities and educational purposes, I hope it’s a matter of time until it gets popular among the masses as well.
“Bridging” the Web
The founder of Hypothes.is, Dan Whaley, didn’t just create the browser extension and then pitch it to people to install. He also convinced W3C to establish a global standard for Web Annotations.
The whole Web Annotations thing is already well prepared to hit the masses. What is more, it doesn’t have to be accomplished by Hypothes.is. (Although I hope it will be. It’s non-profit and open source.) The standard provides that annotations can be written by anyone, stored anywhere, and published by any annotations platform (with Hypothes.is being just one of them).
Another interesting fact is that an annotation can link to a highlight/annotation on another website. In such a way, you can “bridge” two documents with each other.
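For illustration, an annotation under the W3C Web Annotation Data Model is just a small JSON-LD document. Roughly like this; the quote, comment and URLs are invented:

```python
import json

# An annotation that comments on an exact quote in another page, shaped
# after the W3C Web Annotation Data Model. All concrete values invented.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "value": "This figure contradicts the official 2018 statistics.",
        "format": "text/plain",
    },
    "target": {
        "source": "https://example-news.org/article-42",
        "selector": {
            # The selector anchors the annotation to the exact quoted text,
            # so it survives page re-layouts better than a character offset.
            "type": "TextQuoteSelector",
            "exact": "unemployment fell by 40%",
        },
    },
}

# Because it is plain JSON-LD, any annotation server can store and serve it -
# that interoperability is what keeps Hypothes.is from being just another silo.
print(json.dumps(annotation, indent=2))
```

A “bridge” between two documents is then simply an annotation whose body points at a highlight in one page while its target points at another.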
Moderating the comments
Unfortunately, I don’t see the whole Web Annotations thing succeeding unless we find a better way to keep a good level of discussion on the Web. How will we defend Hypothes.is from millions of spam and troll annotations? Otherwise, it would be more disinforming than informing.
The first answer is: the same way we defend Twitter, Facebook, Reddit, Wikipedia. Through moderation and reputation systems. However, that also often doesn’t work well, right? Unless the moderation team is very strong, a discussion forum can quickly be gamed by bad actors, or just become a mess of spam.
What if we could somehow help the moderators? For example, by automatically measuring whether a given comment or annotation is spam, meaningless, or harmful to the discussion?
We already have a huge history of comments and their moderation across the Web. Maybe we could pass it down to an AI, so it can learn to predict whether a given comment is likely to be meaningful or not? The ones predicted to be harmful to the discussion would be blocked, or just passed down to moderators.
Even more, maybe we could have a Software as a Service that does just that, so it can be used by any comment platform in the world? We already have Akismet for filtering spam comments out of any WordPress or other blog website, but as far as I know, it predicts only spam. How about predicting the usefulness, harmful language, and “claim reliability” of comments?
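As a sketch of how such a service could start, here is a tiny naive Bayes classifier trained on an invented six-comment moderation history. A real service would need millions of labeled comments and a much richer model; this only shows the learn-from-moderation-history loop:

```python
import math
import re
from collections import Counter

# Toy moderation history (all comments invented).
history = [
    ("Buy cheap pills now click here", "harmful"),
    ("You are all idiots and sheep", "harmful"),
    ("Total scam click this link now", "harmful"),
    ("I think the article overlooks the 2017 data", "ok"),
    ("Interesting point, here is a counter-example", "ok"),
    ("The source in paragraph three seems solid", "ok"),
]

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(history):
    """Count word frequencies per label - the core of naive Bayes."""
    counts = {"harmful": Counter(), "ok": Counter()}
    for text, label in history:
        counts[label].update(tokens(text))
    return counts

def classify(text, counts):
    """Pick the label whose words best explain the comment (add-one smoothing)."""
    vocab = set(counts["harmful"]) | set(counts["ok"])
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(
            math.log((c[w] + 1) / (total + len(vocab))) for w in tokens(text)
        )
    return max(scores, key=scores.get)

model = train(history)
verdict = classify("click here for cheap pills", model)  # flagged as harmful
```

Wrapped behind an HTTP API, this is the shape of the “Akismet for discussion quality” idea: platforms send comments in, verdicts (or scores for moderators) come back.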
Well, finally, some things have started to be done in here. See:
AI as a rescue to social media and journalists
All the above tools will help journalists create better content. I suspect that over time, all press agencies around the world will use them on a daily basis. (I also hope that the public will have access to them too, not just the press.)
Newsrooms that don’t have access to such tools won’t be able to create meaningful content. Thus, they’ll go bankrupt. (Or write just the tabloid stories.)
The same goes for social media platforms. If the currently popular platforms like Facebook or Twitter don’t do anything about the information mess they host, users will sooner or later leave them. (They’re already losing user engagement. It already seems like they don’t fulfill their customers’ needs.)
The only option for web content platforms is to improve their content, even if it’s the users’ content. It has to be moderated and filtered in a better way (and one transparent to the end user!).
The same goes for internet browsers and search engines. A good browser helps you browse the Web. Users will sooner or later switch to the browser that helps them find what they need. That’s its core feature, isn’t it?
The above reasons will push the journalism and IT industries to merge (at least a bit) and hugely invest in AI.
And before you yell “But people are stupid! They don’t care about the truth. They’ll always vote for X, no matter what he says.” That’s simply not true. I think psychologists will agree with me that nobody wants to be intentionally stupid. Nor does anybody want to appear silly by supporting a dumb person (i.e., a person who repeatedly makes claims that everybody around knows, and can prove, are false).
If only we had reliable ways to immediately tell whether a given claim is true, politicians would be forced to strive for the truth more often. Even when “bad”, “populist” politicians were in power, this would force them to be more level-headed anyway. The whole information market would then be more transparent, and closer to the optimal result for all its democratic participants.
Ten years from now, many of the AI tools already in development will mature and will help us detect and fact-check claims, images, and videos on a daily basis. They will be used by most of the press and also some of the public. Social media platforms, TV, Google Search, your web browser: all of them will be influenced by this. Politicians will be more level-headed.
AI will moderate comments live and de-emphasize the ones that have a bad impact on the discussion. This will finally improve the level of comments on most blogs, forums and social media.
Hopefully, this will lead us to a future where most claims and public data are digitized and publicly verifiable. Any attempt at corruption or at falsifying reality will be immediately recognized by AI and the media.
In effect, our politics will be more stable. Public decision making will look more like a science paper based on logical reasoning than an “I’m the coolest!” election contest based on fake news, tabloids and emotions.