CAI Misinformation

Thomas Packer, Ph.D.
Published in TP on CAI
9 min read · Dec 17, 2019

This story will continue to expand as I find more confusion out there on the Web to highlight. Last edited: 2023-01-28.

They say the road to you-know-where is paved with good intentions. I wonder where the road leads that is ever more commonly paved with a desire to jump on the bandwagon of “newest shiny technology” without paying the price of admittance in the currency of true understanding.

Here I collect examples of writing and reporting gone awry in the land of conversational artificial intelligence (CAI) and data science. Sorry for being so negative in this story. I just can’t help myself. It helps me feel superior to everybody else — superior in terms of pedantry, I admit. :-)

Photo by Sarah Kilian on Unsplash

Contents

ChatGPT
NLP
Machine Learning, Learning on its Own
Dialog Generation
White Papers vs. Research Papers
Those Magical Oracles called Pre-trained Generative Language Models
Accuracy
Media Scrutinizing Media

ChatGPT

There’s so much to say about ChatGPT in January 2023, less than two months after its public demo release. Where do I start? ChatGPT itself is the source of plenty of misinformation. But that is a topic for a different story. Let’s talk about real, live people who are talking about ChatGPT.

After just a month or two, I can’t count the number of times I have read titles of stories or YouTube videos that claim they will tell you something about ChatGPT, only to find out the author has confused ChatGPT with some other deep learning model. In January of 2023, we call this clickbait. Claiming ChatGPT can generate images is one broad category. I know ChatGPT can generate prompts that you can enter into an image-generating model like DALL-E 2. It can probably also generate code in some graphics programming language that will in turn generate images. But saying that it can generate images directly right now is not true, and it muddies the water too much: it will make it hard for the public to recognize the significance of the moment when some future version of ChatGPT really can generate images as well as text.

Here’s another kind of error, from Kyle O’Sullivan, Assistant Features Editor at The Mirror:

The Guardian got ChatGPT to write an entire article back in 2020 — with mixed results.

Source: https://www.mirror.co.uk/news/us-news/jobs-new-ai-technology-chatgpt-29066279

Sorry, Mirror. ChatGPT did not exist in 2020. The model you refer to is actually called GPT-3. There is a difference.

Article linked inside above story: https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3

NLP

Be wary of googling “NLP” if you don’t already have a rough idea of what it means.

AI or Artificial Intelligence backed bots are more dynamic and advanced in nature. They use NLP or Neuro-Linguistic Programming to enable a machine to be capable of having a varied and emotionally intelligent conversation with humans. Alexa is an example of the same.

https://www.entrepreneur.com/article/343582

Well, at least we can rest assured that chatbots won’t suffer from “phobias, depression, tic disorders, psychosomatic illnesses, near-sightedness, allergy, the common cold, or learning disorders.”

https://en.wikipedia.org/wiki/Neuro-linguistic_programming

Machine Learning, Learning on its Own

As with most of the quotes I grab for this story, I am not trying to pick on the writers of these quotes. They are usually just one writer among thousands who are saying the same thing. But I can’t quote them all. I pick one or two to illustrate the misconception. In the following, I happen to quote from Capital One. But they are not the only one.

Of course, since Eno uses machine learning, it’s always improving and becoming more intelligent over time.

https://www.capitalone.com/tech/machine-learning/capital-ones-intelligent-assistant-why-we-built-enos-nlp-tech-in-house/

Eno uses a natural language processing solution, meaning that the more people text and talk to the bot, the more it will understand and learn how to respond to the questions that customers ask.

https://www.capitalone.com/tech/machine-learning/becoming-a-bot-how-capital-ones-ai-design-team-created-the-character-eno/

Both of these statements strongly imply that any time someone applies machine learning, or NLP based on machine learning, the application will automatically learn, continually, without further human effort. There has certainly been research into developing such things as Tom Mitchell’s Never Ending Language Learning (NELL), which I think is extremely interesting because it is extremely rare. There are also some ML applications based on clear user behavior data like e-commerce conversions that do actually update ML models automatically using training data derived from those user behaviors. But for the most part, using machine learning in real applications does not necessarily imply that the application can continually improve itself. Most likely, even in Eno’s case, there is still a very manual component to the feedback loop. Someone must manually curate if not also manually annotate whatever data is being generated by the application before later versions of the bot may learn from that data.
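
To make that gap concrete, here is a minimal sketch of the kind of feedback loop most deployed bots actually have. This is my own illustration, not Capital One’s architecture; the scikit-learn classifier and placeholder names like curate_and_annotate are assumptions. The point is that serving predictions never changes the model, and the retraining step still runs on human-curated data.

```python
# Minimal sketch of the feedback loop described above.
# The deployed model is frozen: serving predictions does not change it.
# Names like curate_and_annotate and load_chat_logs are hypothetical
# placeholders, not any vendor's actual API.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_intent_classifier(texts, labels):
    """Train a simple intent classifier offline."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model

def serve(model, user_message):
    """Serving a prediction reads the model's weights; it never updates them."""
    return model.predict([user_message])[0]

# Version 1 is trained on a manually annotated seed set.
seed_texts = ["what is my balance", "lock my card", "lock my account"]
seed_labels = ["check_balance", "lock_card", "lock_card"]
model_v1 = train_intent_classifier(seed_texts, seed_labels)

print(serve(model_v1, "please lock my card"))  # fixed weights, no learning here

# To get a better version 2, a human still has to curate and annotate the new
# conversation logs before anyone can retrain:
#   new_texts, new_labels = curate_and_annotate(load_chat_logs())
#   model_v2 = train_intent_classifier(seed_texts + new_texts,
#                                       seed_labels + new_labels)
```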

Dialog Generation

This is where my pedantry really shines.

It is popular, and actually now standard, in the CAI literature to describe a model as performing “dialog generation”. I imagine some researcher a few years ago, with a few paper citations, in search of a pithy way to describe his or her model that generated natural language as part of a dialog system, deciding that he or she could never just say that the model generated natural language as part of a dialog system. So they grabbed one of their discipline’s favorite and generally productive words and stretched the analogy a bit too far. And thus a new phrase was coined and caught on immediately, because no one could think of anything better to call it.

Your model can generate natural language. Your model can even generate natural language as part of a dialog. But, I would argue, your model cannot “generate” a dialog in the original sense unless it generates both sides of the dialog by itself. Instead, I would say your model participates in a dialog.

White Papers vs. Research Papers

I haven’t seen this one in print yet, but I occasionally hear people conflate the term white paper with research paper. I cringe when I hear this confusion because, in common practice, a white paper is practically the opposite of a peer-reviewed research paper.

I understand that traditionally a white paper is

… a report or guide that informs readers concisely about a complex issue and presents the issuing body’s philosophy on the matter. It is meant to help readers understand an issue, solve a problem, or make a decision.

https://en.wikipedia.org/wiki/White_paper

Marketing departments have kind of taken over white papers since this tradition was started.

Since the early 1990s, the terms “white paper” or “whitepaper” have been applied to documents used as marketing or sales tools in business. These white papers are long-form content designed to promote the products or services from a specific company. As a marketing tool, these papers use selected facts and logical arguments to build a case favorable to the company sponsoring the document.

https://en.wikipedia.org/wiki/White_paper#In_business-to-business_marketing

White papers tend to be marketing material from a single company designed to persuade you to buy their products and services. They are not peer-reviewed and are often not very objective, although they can sometimes contain useful information beyond their sales pitch.

Research papers, on the other hand, are short academic publications intended to communicate objective research findings with enough technical detail that the results can be reproduced. I’m not saying research papers always fulfill their obligation of being unbiased, but I will say they try, whereas white papers almost never try and must always be taken with a grain of salt.

Those Magical Oracles called Pre-trained Generative Language Models

Originally I wrote this section on Large Language Models (LLMs) around 2020. A few things have changed since then. See my update at the bottom of this section. I am interested in clearing up misinformation enough that I am happy to clear up my own misinformation written on the topic of misinformation.

For decades, (probabilistic) language models were designed to provide grammaticality to the rendering of semantic information generated by some other module in the natural language generation pipeline. Now that language models are neural and perform better than previous generations, suddenly we expect them to do more than serve as a grammaticality filter at the end of that NLG pipeline.
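
To picture that older, humbler role: the language model only scores candidate realizations that some upstream module has already produced, and picks the most fluent one. The tiny corpus, candidates, and bigram model below are an invented stand-in for illustration, not any particular system.

```python
# Toy illustration of the traditional role of a language model in an NLG
# pipeline: an upstream module produces candidate realizations of the same
# meaning, and the LM only reranks them by fluency. It adds no new facts.

from collections import Counter
import math

corpus = "the flight departs at noon . the flight arrives at night .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_logprob(sentence):
    """Add-one-smoothed bigram log-probability: a crude fluency score."""
    tokens = sentence.lower().split()
    score = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(prev, cur)] + 1) /
                          (unigrams[prev] + vocab_size))
    return score

# Candidate surface forms handed to the LM by an upstream content planner:
candidates = [
    "the flight departs at noon",
    "flight the departs noon at",   # same words, scrambled
]

best = max(candidates, key=bigram_logprob)
print(best)  # the LM prefers the grammatical ordering
```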

Good example of someone assuming that GPT-3 is supposed to be able to answer questions: https://mindmatters.ai/2022/01/will-chatbots-replace-the-art-of-human-conversation/

Conversational AI, chatbots need to provide accurate and reliable information. So that’s one of the things with generative models [like GPT-3] is you can’t guarantee the accuracy of the [output] information.

https://www.youtube.com/watch?v=-Nfg8sLQ9wo&feature=youtu.be

https://blog.rasa.com/gpt-3-careful-first-impressions/

The better an AI technique or tool becomes at doing one task, the more non-experts assume it can perform tasks for which it was never designed. Take pre-trained generative neural language models like GPT-3. They are designed and trained to produce natural, grammatical output language given some input language. Naturally, in order to do this well, these models must learn the most frequent and salient associations between the input and output text on which they were trained. I am continually amazed how many people jump to the conclusion that, because some examples of their output look like accurate facts (facts which happened to be salient in their training data), these models might possibly be trusted to always produce accurate facts.
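
To spell out why that conclusion does not follow from the training setup, here is the objective in miniature. The prefix and the probabilities below are hypothetical; the point is that the per-token loss contains no term for factual correctness, only for matching what the training text tends to say next.

```python
# The objective a generative LM is trained on, in miniature: negative
# log-likelihood of the next token. Nothing in this quantity measures whether
# the text is true; frequent continuations simply get higher probability.
# The next-token distribution below is invented for illustration.

import math

# A hypothetical model's next-token distribution after the prefix
# "the capital of australia is":
p_next = {"sydney": 0.60, "canberra": 0.30, "melbourne": 0.10}

def nll(distribution, observed_token):
    """Per-token training loss: -log p(observed next token)."""
    return -math.log(distribution[observed_token])

# If the training text (like much web text) says "sydney" more often than
# "canberra", the loss rewards predicting the frequent answer, not the true one.
print(nll(p_next, "sydney"))    # ~0.51  (low loss: frequent continuation)
print(nll(p_next, "canberra"))  # ~1.20  (higher loss: correct but rarer)
```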

It’s nice that the Rasa people and others warn us not to trust GPT-3 output for factual accuracy. I’m just surprised that it was ever a question in anyone’s mind that you might be able to trust a language model in that way, when there was no formal allowance for semantics or reasoning in its design and no reason to believe that its neural architecture might accidentally develop such a capability, no matter how much text it was trained on.

In case you want more perspectives on this point than my own, François Chollet has provided a nice quote to support this view:

Deep learning models are brittle, extremely data-hungry, and do not generalize beyond their training data distribution. This is an inescapable consequence of what they are and how we train them. They can at best encode the abstractions we explicitly train them to encode, they cannot autonomously produce new abstraction.

https://www.zdnet.com/article/keras-creator-chollets-new-direction-for-ai-a-q-a/

And in the Soloist paper by a group from Microsoft Research, they say this about GPT-2, and it applies to other SoTA auto-regressive language models as well:

Although after being fine-tuned using conversational data, GPT-2 can respond to users with realistic and coherent continuations about any topic of their choosing, the generated responses are not useful for completing any specific task due to the lack of grounding.

https://arxiv.org/abs/2005.05298

I think much of the recent concern with bias (e.g. gender bias) being found in ML models is a consequence of this misunderstanding of what ML models can do. It’s not a model’s job to give you unbiased facts, just as it’s not the job of a language model trained on last year’s data to tell you who the president of the United States is this year.

Go read about the No Free Lunch Theorem. Machine learning does its job because of bias (inductive bias, in this sense), not despite it.

Update in January, 2023

I have to admit, LLMs are now able to learn world models in order to perform better at language modeling. Some of ChatGPT’s impressive behaviors were my first clue. The following post was what convinced me.

https://thegradient.pub/othello/

Accuracy

This one is more about science, data science, and ML in general. phys.org published an article about a machine learning algorithm from Venhound that is claimed to predict which startups will succeed. The article reports 90% accuracy.

It is well known that around 90% of startups are unsuccessful: Between 10% and 22% fail within their first year, and this presents a significant risk to venture capitalists and other investors in early-stage companies. In a bid to identify which companies are more likely to succeed, researchers have developed machine-learning models trained on the historical performance of over 1 million companies. Their results, published in KeAi’s The Journal of Finance and Data Science, show that these models can predict the outcome of a company with up to 90% accuracy. This means that potentially 9 out of 10 companies are correctly assessed.

https://phys.org/news/2021-09-scientists-ai-success-startup-companies.html

The reporting problem should be obvious to any first-year machine learning student. Maybe the algorithm is really good and useful. The problem is this article, which gives no evidence of success. If 90% of startups fail, I can give you an algorithm that is 90% accurate without any effort at all: predict “fail” every time.
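
Here is that trivial baseline in a few lines, with synthetic labels standing in for the real data. It also shows why per-class precision and recall are the more honest metrics here: for the minority “success” class, the constant predictor scores zero.

```python
# The trivial baseline described above: on data where ~90% of startups fail,
# always predicting "fail" already achieves ~90% accuracy while being useless.
# The labels are synthetic; the point is about the metric, not the model.

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["fail"] * 90 + ["success"] * 10   # 90% failure rate
y_pred = ["fail"] * 100                     # "predict fail every time"

print(accuracy_score(y_true, y_pred))                     # 0.9
print(recall_score(y_true, y_pred, pos_label="success"))  # 0.0
print(precision_score(y_true, y_pred, pos_label="success",
                      zero_division=0))                   # 0.0
```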

The news appears on other sites with almost the same wording, so I’m not sure who is copying whom.

Luckily, the original paper is more careful: it reports precision and recall, which are less prone to accidental inflation than the simpler accuracy metric when the class sizes are imbalanced.

https://www.sciencedirect.com/science/article/pii/S2405918821000040?via%3Dihub

Media Scrutinizing Media

Finally, here are some links to other articles about problems with AI reporting:

Why is AI reporting so bad? by Tiernan Ray on ZDNET

Join the CAI Dialog on Slack at cai-dialog.slack.com

About TP on CAI

Other stories in TP on CAI you may like:


Thomas Packer, Ph.D.
TP on CAI

I do data science (QU, NLP, conversational AI). I write applicable-allegorical fiction. I draw pictures. I have a PhD in computer science and I love my family.