Open knowledge, false knowledge, and the dangers of machine generated text

Alan Roseman
Sep 19, 2023 · 11 min read


Introduction

There seems to be a problem with open knowledge. More “knowledge” is available and accessible than ever, but its quality is highly variable, and there is a lot of false knowledge about too. This makes it more difficult to research and extract the current true knowledge on any subject.

Image by Bing Image Creator (2023)

It is new(ish) technologies and new media formats, including social media, that have made knowledge, information and mis-information all so much more accessible. The main advantage is speed: fast conversations instead of monthly periodicals or a letter to the newspaper editor. Blogs appear daily, and Twitter conversations go on in real time, responding to live events. People can share their thoughts without editorial interference or the inconvenience of fact checking; fact checking may be applied later, by offended or other parties. This creates a great tapestry of information and perspectives: a global conversation, almost. There is much repackaging, analysis and opinion about knowledge, which usefully makes it more accessible. Reading experts’ opinions on some work will give you new insights.

What is going wrong? A lack of quality control, due to dispensing with the traditional gatekeepers, has led to a media body laced with false claims and untruths. There can be repetition of bad information, unchecked bias, propaganda, and commercial interests. Sometimes misinterpretations or mistranslations occur, and these can be propagated and mutated through the Chinese whispers effect.

The open knowledge movements are actually integral to the solution to these problems. Open publishing, open data and open methods all contribute to a more accessible, deeper level of knowledge, which makes it possible for the knowledge consumer to discern and distinguish different levels of quality and reliability.

The open knowledge movement has brought down barriers and paywalls. It replaces an unfair and outdated system of academic researchers paying to publish, and then paying to access. The paywall separates academics from the wider world, and is a barrier to reaching their ultimate audience — the public, who are often their funders via the public purse.

In higher education we are consumers and producers of knowledge, and knowledge is integral to our research and teaching practices, which are generally evidence based. The explosion in the amount of knowledge and sources has added value. The new media moves at a faster pace and helps keep us up-to-date. Here I present some analysis and some recommendations for navigating this turbulent sea of knowledge and mis-information.

Mis-information and untruths

Information, knowledge, mis-information, dis-information, fake-news, lies, misrepresentation, selective truth, unsupported claims, … : how are all these related? Can we simply divide these into truth and lies? The truth can change with time. There can be different truths, different perspectives. Is there one whole truth even?

The Impact of ChatGPT and generative AI

ChatGPT has made a big impact recently. It is being used and mis-used to generate a lot of articles and text on the internet (and elsewhere) of varying quality, even fake research papers. ChatGPT is based on a machine learning paradigm called the Large Language Model (LLM). LLMs can produce very relevant and useful text in response to user prompts, in perfectly correct language, and the text produced can appear to be written by a human writer.

OpenAI is the company that developed GPT (their LLM) and ChatGPT. They started with the intention of developing AI in a positive and open way; however, the codebase is no longer open, so OpenAI is not really open. There are many applications and services based on GPT (e.g. Microsoft’s Bing) and related technologies (e.g. Google’s Bard). These are often referred to as generative AI (artificial intelligence) methods, but they are not intelligent in the actual meaning of the word. They are really large language regurgitators, with algorithms based on which words or groups of words probabilistically come together in the large bodies of text they have been trained on. Their reliability therefore depends on the quality of their training data. By design, these LLMs have no way of conceiving something new, though a cunning algorithm can remix previous ideas, giving the illusion of creativity and intelligence.
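To make the “regurgitator” point concrete, here is a deliberately tiny sketch in Python. Real LLMs use neural networks over billions of tokens rather than word-pair counts, and the corpus and function names below are invented for illustration; this shows the principle only. Note that the program has no concept of truth: it emits whatever tended to follow what came before in its training text.

```python
# A minimal "language regurgitator": predict the next word purely from
# counts of which words followed which in the training text.
import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count, for each word, how often each other word follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start, length=8):
    word, out = start, [start]
    for _ in range(length):
        options = follows.get(word)
        if not options:  # no known continuation; stop
            break
        # Sample the next word in proportion to how often it followed in training.
        candidates, weights = zip(*options.items())
        word = random.choices(candidates, weights=weights)[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))  # e.g. "the cat sat on the rug"
```

Scale this idea up enormously and you get fluent, plausible text produced with exactly the same indifference to truth.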

We don’t always know how the text we read was produced, or what technology might have been used. Where ChatGPT-like writing has been passed off as the work of a human author, the exact technology is not revealed. OpenAI released a tool to help recognize machine generated text in January 2023, but withdrew it in July 2023 due to poor performance.

If we can spot it, can we trust machine generated text? The problem is, I believe, we now have to mistrust all text (and pictures too, but that is another topic).

What is bullshit?

Bullshit is a term freely thrown around and colloquially used to discredit and label bad information, or sometimes just to express disapproval. Bullshit, however, is a special kind of mis-information. Surprisingly, though so much bullshit has been written, not so much has been written on the subject of bullshit.

I have found the essay, “On Bullshit”, written by American philosopher Harry G. Frankfurt, very illuminating. After considerable analysis he concludes that bullshit is a particular kind of untruth. He says (p33): “Her statement is grounded neither in the belief that it is true nor, as a lie must be, in a belief that it is not true. It is just this lack of connection to a concern with truth — this indifference to how things really are — that I regard as the essence of bullshit.”

He goes on to consider why there is so much bullshit (p63): it is produced when people feel compelled to express opinions on topics they are ignorant of. There are other reasons for bullshitting too, e.g. boasting, or entertainment. Perhaps the simplest reason, as put by Ian Hislop speaking on fake news, is: “It’s cheaper to make things up rather than find things out.” (~8m in).

With so much bullshit polluting our knowledge sources, it is important to filter it out. It is not a new phenomenon, just worse now. “In the early 1960s, an interviewer was trying to get Ernest Hemingway to identify the characteristics required for a person to be a ‘great writer’. As the interviewer offered a list of various possibilities, Hemingway disparaged each in sequence. Finally, frustrated, the interviewer asked, ‘Isn’t there any one essential ingredient that you can identify?’ Hemingway replied, ‘Yes, there is. In order to be a great writer a person must have a built-in, shockproof crap detector.’” This has never been truer.

Divide and conquer

In the worlds of intelligence, academia, philosophy and logic, there is a concept of what one knows, what one can know, and what one cannot know. Before we read something new, we don’t know it. Afterwards, we must decide: do we believe it or not?

It is important to be aware of what we know and don’t know, especially in high stakes situations. Important decisions should be based on correct information, and incorrect information should be identified. Donald Rumsfeld, US Secretary of Defense in 2002, dramatically brought this concept to the public’s attention in a famous press briefing (at 37m43s), where he explained the concepts of known unknowns and unknown unknowns.

Mis-information is an insidious and often invisible enemy; understanding its nature better will help. To confront the deluge of terms relating to the numerous variations of mis-information, I have attempted to categorise them in a simple 2x2 matrix, classifying between truth and untruth, and between knowingly and unknowingly communicated, modelled after Rumsfeld’s matrix.

I was partially successful, but needed to add an extra column and row for grey areas: there are some cases that are not clearly truth or untruth, nor clearly knowingly or unknowingly communicated. Using the matrix, I analyse and categorise the untruths you might come across. First, there is the known untruth: a lie. This is simple; the author or speaker knows they are telling a lie (or a partial truth, or variations of this). Then there is the unknown untruth. Here an untruth is unwittingly repeated, or there is a genuine mistake in logic or reasoning. If all the steps are given, this is an honest mistake; if the untruth is repeated and the source not given, it is academically ill-considered (claiming another’s poor work as your own), or sloppy journalism. To complete the matrix we have the known truth, simply the truth; and the unknown truth (the truth unknowingly transmitted), a peculiar case where the communicator must think they are telling a lie, but aren’t. See Figure 1 below.

Figure 1. Matrix of truths and untruths (incomplete, and with some overlap between categories), modelled after Rumsfeld’s “known unknowns and unknown unknowns”. Definitions in different dictionaries may vary. Please take this distribution in the general spirit of enquiry.
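For readers without the figure, the four core quadrants described above can be sketched in text; the full figure also carries the extra grey row and column for the borderline cases:

```
           Knowingly communicated   Unknowingly communicated
Truth      the truth                the unknown truth (you think
                                    you lie, but don't)
Untruth    a lie                    an honest mistake, or an
                                    unwittingly repeated untruth
```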

Examination of the matrix reveals a few things. The main insight may be that the truth appears simple, but there are so many forms of untruth. Everything not true is a form of false knowledge. These can be grouped into a few pertinent categories, which can guide how you identify them: deliberate or not; originator or propagator; intention or purpose, honest or dishonest; by omission (of a relevant fact, or of a measure of uncertainty). At another level we get a class of knowing unknowingness (deliberate ignorance), e.g. failing to check and acknowledge obvious and accessible information.

How to use Figure 1: examining this range of options may trigger some insight. Allocating suspect information to a quadrant will help inform how to uncover or check it by applying suitable methods.

How does open knowledge help?

Deception can be complicated: facts can be hidden between lies, and lies among facts. You could restrict your reading to the more reliable world of peer-reviewed research and edited books, but these are slower to keep up with the changing world. Current events and recent scientific breakthroughs are topical and exciting, and newspapers, magazines, blogs, social media and unreviewed archives are all valuable sources of up-to-date information. If you are an active researcher yourself, you might want to explore or start research based on the latest developments and ideas, and not wait for others to do this. A good researcher will examine the evidence, and repeat key experiments themselves, before committing further time and resources in a new direction. This is another reason why open research and open knowledge are so important.

Navigating the literature at a deeper level

Just reading an abstract, or reported headlines, can be misleading. Below is a kind of protocol I use myself and recommend to my students when researching knowledge in a new area.

If an application could help with the steps below, reliably, it would make navigating the research literature more accessible. Most of the steps are “mechanical” and could be automated. A tool that finds, sorts and organises information, and presents it uncorrupted and linked to the original sources, would be much more useful to a knowledge researcher than current generative AI applications.

1. Check your sources are reliable. General articles and blogs are fine for getting an introduction to a field or subject area, but if a piece has few or no citations, the information is unsupported and lacks credibility, and the authors have not credited their sources. If an idea is raised that you want to follow up, find further sources related to it. Academic reviews are good, as they give a comprehensive list of references.

2. Check the references, and the references therein. Do you agree with their data, logic and arguments? If you are making an argument or explaining the origin of a fact or theory, cite the original paper or article that presented the data and formed the theory. Look at the supplementary methods: many published manuscripts are condensed to the absolute minimum, and important experimental details and results will be in the supplementary files.

Read the peer review file if available. There are various forms of open peer review, with a move to publishing the decision letter with the reviewer comments (reviewers can be named or unnamed), and authors’ responses. These will give an insight as to what peer reviewers found needed more evidence, and which aspects could be more controversial. Scientific papers can be boring and flat to read. In the peer review file there is a glimpse of hot-blooded humanity, through tense conversations between the authors and reviewers.

3. Does the paper have many citations (by other publications at later dates)? What are others saying about it? Has the work been built upon, or discredited? (A small script after this list sketches one way to automate this check.)

4. Who are the authors? What are their credentials: not just academic qualifications, but life experience and other writings? Who sponsored the research? Are they reliable? Do they have any commercial or other vested interests? What else have they published?
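As an example of the “mechanical” automation imagined above, here is a minimal sketch of step 3 using the free Semantic Scholar Graph API. The endpoint and field names reflect that API as I understand it; check the current documentation at api.semanticscholar.org before relying on it, and note that the DOI used is only an illustrative example.

```python
# A minimal sketch: look up a paper's citation count from its DOI
# using the Semantic Scholar Graph API (free for light use).
import requests

def citation_summary(doi: str) -> str:
    url = f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}"
    resp = requests.get(url, params={"fields": "title,year,citationCount"}, timeout=10)
    resp.raise_for_status()
    paper = resp.json()
    return f"{paper['title']} ({paper['year']}): {paper['citationCount']} citations"

# Illustrative example only; substitute the DOI of the paper you are checking.
print(citation_summary("10.1038/nature14539"))
```

A high citation count is not proof of quality, of course; it is only a pointer to where the conversation about the work is happening.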

Many blogs and articles may be interesting but poor at providing sources. You don’t have to discount these entirely: you can follow up any ideas raised independently, finding good quality information and sources, as directed above.

This brings us back to ChatGPT. You will find that ChatGPT is bad at providing citations. In my personal experience, when pressed to produce sources, it fabricated false but plausible references, and on further questioning it was aggressively adamant about the existence of the fictitious sources. (Bing and Bard provide some genuine sources at the outset.) A university course essay without support from citations will fail. A safer way to use ChatGPT is to generate ideas to explore independently. The problem with ChatGPT is that it has no awareness of truth, untruth or otherwise, yet writes with convincingly perfect grammar. It can be very dangerous when mis-used, and these dangers go beyond misleading students writing essays.
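One cheap, partial defence against fabricated references is to check whether a cited DOI resolves at all. The sketch below assumes the doi.org proxy answers HEAD requests (fall back to GET if it does not), and both example DOIs are chosen for illustration; a resolving DOI proves only that a paper exists, not that it says what the citation claims.

```python
# A minimal sketch: does a cited DOI actually resolve at doi.org?
import requests

def doi_resolves(doi: str) -> bool:
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
    # doi.org redirects (3xx) to the publisher if the DOI exists, 404 otherwise.
    return 300 <= resp.status_code < 400

print(doi_resolves("10.1038/nature14539"))      # a real paper: True
print(doi_resolves("10.9999/fabricated.2023"))  # a made-up reference: False
```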

Bad books

It has now been uncovered that there are published books written using generative AI. A poor quality travel guide may get you lost, or waste your time looking for non-existent attractions. An AI generated foraging guide that suggests you may freely gather and dine on the mushrooms found at the bottom of your garden, however, is far more dangerous: it is very difficult, even for experts, to differentiate some of the deadly poisonous specimens from the edible.

This makes clear how important it can be to examine the credentials of the author. If you are basing an important decision on any information, find support for it from additional sources (as above, and where possible look for and examine the original data, as enabled by open knowledge), be this related to your university grade, professional reputation, or personal health or safety.

The knowing or unknowing disregard for the truth puts machine generated text in the class of bullshit. The blame for these problems does not lie with AI, though: AI is just a technology, and it is humans who are using the tool badly and maliciously. Humans can and should know the dangers of the limits of their knowledge; currently, AI has no such awareness. No AI generated work should be presented as that of a human writer.

Unreliable literature

This is not to say that peer reviewed publications are never wrong. In the natural course of knowledge advancement, well supported and accepted ideas can be disproven, improved or replaced (it is important to consult recent literature on your topic). Mistakes can happen; in large groups there can be mis-communication (NASA lost a Mars probe due to a mis-communication of measurement units). Reputable journals allow corrections and corrigenda, and will publish retractions. Retraction Watch makes it easy to check for and identify retracted and discredited works. Known retractions in the fields of biology and medicine are at the level of 0.14 per 1,000 papers. This may be comforting, but what about the unknowns? You will need to use open knowledge resources and your own intelligence.

Conclusion

The tools and resources provided by open knowledge initiatives enable us to access and examine the evidence presented, and make it possible to identify reliable and reputable research and knowledge. This is essential for higher education to function. A lack of evidence is not good evidence for anything, and can be seen as suspicious.

People are finding innovative ways to use generative AI in their work. That is fine, but they should take full responsibility and accountability for their work, signing it with a statement such as: “I have used machine generated text in producing this article. I have checked it and take full responsibility for the accuracy of the content.” The great danger is a lack of openness, and machine generated text being taken as the work of a human expert.

Author declaration: This work is my own opinion. I declare no financial interests. No machine generated text has been used in producing this work.
