This story will continue to expand as I find more confusion out there on the Web to highlight.
They say the road to you-know-where is paved with good intentions. I wonder where the road leads that is paved, ever more commonly, with a desire to jump on the bandwagon of the newest shiny technology without paying the price of admission: truly understanding it.
Here I collect my favorite examples of writing and reporting gone awry in the land of conversational artificial intelligence (CAI) and data science. Sorry for being so negative in this story. I just can’t help myself. It helps me feel superior to everybody else — superior in terms of pedantry I guess.
Be wary of googling “NLP” if you don’t already have a rough idea of what it means.
AI or Artificial Intelligence backed bots are more dynamic and advanced in nature. They use NLP or Neuro-Linguistic Programming to enable a machine to be capable of having a varied and emotionally intelligent conversation with humans. Alexa is an example of the same.
Well, at least we can rest assured that chatbots won’t suffer from “phobias, depression, tic disorders, psychosomatic illnesses, near-sightedness, allergy, the common cold, or learning disorders.”
This is where my pedantry shines through.
It is popular, and by now standard, in the CAI literature to describe a model as performing “dialog generation”. I imagine some researcher a few years ago, with a few paper citations, searching for a pithy way to describe a model that generated natural language as part of a dialog system, and deciding that the plain description would never do. So they grabbed one of their discipline’s favorite and generally productive words and stretched the analogy a bit too far. And thus a new phrase was coined, and it caught on before anyone could think of anything better to call it.
Your model can generate natural language. Your model can even generate natural language as part of a dialog. But, I would argue, your model cannot “generate” a dialog in the original sense unless it generates both sides of the dialog by itself. Instead, I would say your model participates in a dialog.
White Papers vs. Research Papers
I haven’t seen this one in print yet, but I occasionally hear people conflate the term “white paper” with “research paper”. I cringe when I hear this confusion because, in common practice, white papers are practically the opposite of peer-reviewed research papers.
I understand that traditionally a white paper is
… a report or guide that informs readers concisely about a complex issue and presents the issuing body’s philosophy on the matter. It is meant to help readers understand an issue, solve a problem, or make a decision.
Marketing departments have kind of taken over white papers since this tradition was started.
Since the early 1990s, the terms “white paper” or “whitepaper” have been applied to documents used as marketing or sales tools in business. These white papers are long-form content designed to promote the products or services from a specific company. As a marketing tool, these papers use selected facts and logical arguments to build a case favorable to the company sponsoring the document.
White papers tend to be marketing material from a single company designed to persuade you to buy their products and services. They are not peer-reviewed and are often not very objective, although they can sometimes contain useful information beyond their sales pitch.
Research papers, on the other hand, are short academic publications intended to communicate objective research findings with enough technical detail that the results can be reproduced. I’m not saying research papers always fulfill their obligation of being unbiased, but I will say they try, whereas white papers almost never try and must always be taken with a grain of salt.
Those Magical Oracles called Pre-trained Generative Language Models
For decades, (probabilistic) language models were designed to provide grammaticality to the rendering of semantic information generated by some other module in the natural language generation pipeline. Now that language models are neural and perform better than previous generations, suddenly we expect them to do more than serve as a grammaticality filter on the end of that NLG pipeline.
Good example of someone assuming that GPT-3 is supposed to be able to answer questions: https://mindmatters.ai/2022/01/will-chatbots-replace-the-art-of-human-conversation/
Conversational AI, chatbots need to provide accurate and reliable information. So that’s one of the things with generative models [like GPT-3] is you can’t guarantee the accuracy of the [output] information.
The better an AI technique or tool becomes at doing one task, the more non-experts assume it can perform tasks for which it was never designed. Take pre-trained generative neural language models like GPT-3. They are designed and trained to produce natural, grammatical output language given some input language. Naturally, in order to do this well, these models must learn the most frequent and salient associations between the input and output text on which they were trained. I am continually amazed how many people jump to the conclusion that, since some examples of their output look like accurate facts (which happened to be salient in their training data), the models might be trusted to always produce accurate facts.
It’s nice that the Rasa people and others warn us not to trust GPT-3 output for factual accuracy. I’m just surprised that it was ever a question in anyone’s mind that you might be able to trust a language model in that way, when there was no formal allowance for semantics or reasoning in its design and no reason to believe that its neural architecture might accidentally develop such a capability, no matter how much text it was trained on.
In case you want more perspectives on this point than my own, François Chollet has provided a nice quote to support this view:
Deep learning models are brittle, extremely data-hungry, and do not generalize beyond their training data distribution. This is an inescapable consequence of what they are and how we train them. They can at best encode the abstractions we explicitly train them to encode, they cannot autonomously produce new abstraction.
And in the Soloist paper by a group from Microsoft Research, they say this about GPT-2 which applies to other SoTA auto-regressive language models:
Although after being fine-tuned using conversational data, GPT-2 can respond to users with realistic and coherent continuations about any topic of their choosing, the generated responses are not useful for completing any specific task due to the lack of grounding.
I think much of the recent concern with bias (e.g. gender bias) being found in ML models is a consequence of this misunderstanding of what ML models can do. It’s not a model’s job to give you unbiased facts, just as it’s not the job of a language model trained on last year’s data to tell you who the president of the United States is this year.
Go read about the No Free Lunch Theorem. Machine Learning does its job because of bias not despite it.
This one is more about science, data science, and ML in general. phys.org published an article about a machine learning algorithm from Venhound that can predict which startups will succeed. The article reports 90% accuracy.
It is well known that around 90% of startups are unsuccessful: Between 10% and 22% fail within their first year, and this presents a significant risk to venture capitalists and other investors in early-stage companies. In a bid to identify which companies are more likely to succeed, researchers have developed machine-learning models trained on the historical performance of over 1 million companies. Their results, published in KeAi’s The Journal of Finance and Data Science, show that these models can predict the outcome of a company with up to 90% accuracy. This means that potentially 9 out of 10 companies are correctly assessed.
The reporting problem should be obvious to any first-year machine learning student. Maybe the algorithm is really good and useful. The problem is the article, which offers a 90% accuracy figure as if it were evidence of success. If 90% of startups fail, I can give you an algorithm that is 90% accurate with no effort at all: predict “fail” every time.
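The trivial baseline is easy to demonstrate. Here is a quick sketch, using a hypothetical sample of 1,000 startups with the 90% failure rate quoted above:

```python
# A "90% accurate" startup predictor that has zero predictive power:
# it ignores its input and always predicts failure.
def predict(startup):
    """Always predict failure, regardless of the startup."""
    return "fail"

# Hypothetical outcomes for 1,000 startups: 900 failures, 100 successes.
outcomes = ["fail"] * 900 + ["success"] * 100

# Accuracy = fraction of predictions that match the actual outcome.
accuracy = sum(predict(o) == o for o in outcomes) / len(outcomes)
print(accuracy)  # 0.9
```

With a 90% base rate of failure, plain accuracy rewards a predictor that never identifies a single successful startup.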
The news appears on other sites with almost the same wording, so I’m not sure who is copying whom.
Luckily, the original paper is more careful, reporting precision and recall instead of the simpler “accuracy” metric; precision and recall are less prone to accidental inflation when the class sizes are imbalanced.
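To see why precision and recall are the right lens here, consider the same hypothetical always-predict-fail baseline, scored on the “success” class. (The counts below are illustrative, not from the paper.)

```python
# Score the trivial always-fail baseline with precision and recall,
# treating "success" as the positive class.
true_labels = ["fail"] * 900 + ["success"] * 100
predictions = ["fail"] * 1000  # the baseline never predicts "success"

tp = sum(1 for t, p in zip(true_labels, predictions) if t == p == "success")
fp = sum(1 for t, p in zip(true_labels, predictions)
         if t == "fail" and p == "success")
fn = sum(1 for t, p in zip(true_labels, predictions)
         if t == "success" and p == "fail")

# Guard against division by zero when the positive class is never predicted.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(precision, recall)  # 0.0 0.0
```

The baseline’s 90% accuracy collapses to zero precision and zero recall: it never finds a successful startup, which is exactly what an investor would care about.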