ChatGPT vs Bing … and the urgent need for Responsible AI

Dattaraj Rao
5 min read · Feb 27, 2023


DISCLAIMER: Opinions in this article are my own and not those of my company.

Large Language Models (LLMs) like GPT-3, ChatGPT and BARD are all the rage today. Everyone has an opinion about whether these tools are good or bad for society and what they mean for the future of AI. Large language models generate natural language text from a given input, such as a word, a sentence or a paragraph. They are trained on huge amounts of text data to learn the patterns and probabilities of language. Some examples of large language models are GPT-3, Codex, BARD and the recent LLaMA.

Google received a lot of flak for its new model BARD getting a complex question slightly wrong. When asked “What new discoveries from the James Webb Space Telescope can I tell my 9-year-old about?”, the chatbot provided three answers, of which two were right and one was wrong: it claimed that the first picture of an “exoplanet” was taken by JWST, which is incorrect. So basically, the model had an incorrect fact stored in its knowledge base. For large language models to be effective, we need a way to keep these facts updated or augment them with new knowledge.
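
One common pattern for such augmentation is retrieval: look up relevant facts from a trusted, regularly updated source and prepend them to the prompt, so the model answers from that context instead of from stale training data. Here is a minimal sketch of the idea; the tiny fact list and keyword retriever are purely illustrative stand-ins for a real knowledge base and retrieval step, not any specific product API.

```python
# Minimal sketch of retrieval-augmented prompting (names and facts are illustrative).
FACTS = [
    "The first image of an exoplanet was taken in 2004 by the VLT, not by JWST.",
    "Only 12 humans have walked on the Moon, all during the Apollo program.",
]

def retrieve(question, facts=FACTS):
    # Toy retriever: keep facts that share words with the question.
    words = set(question.lower().split())
    return [f for f in facts if words & set(f.lower().split())]

def build_prompt(question):
    # Prepend retrieved facts so the LLM answers from current, verifiable context.
    context = "\n".join("- " + f for f in retrieve(question))
    return (
        "Answer using only the facts below; say 'I don't know' otherwise.\n"
        f"Facts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("Who has walked on the moon?"))
```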

I tried asking 2 publicly available LLMs the following question:

Who was the 14th person to walk the moon?

This was a test designed to fail, since as per NASA only 12 humans have walked on the moon. Source: https://solarsystem.nasa.gov/news/890/who-has-walked-on-the-moon/

Surprisingly, the two LLMs, ChatGPT and Bing Chat, gave different answers, pointing me to the 11th and 6th person to walk on the moon respectively. Below are the responses from ChatGPT and the newly launched Bing Chat.

Response from ChatGPT
Response from Bing Chat

Both answers captured the context (moon walking) but got the exact answer wrong. The good part was that Bing gave a list of its references where I could verify the facts. This reflects a key Responsible AI principle that needs to be advocated: Transparency and Lineage.

On digging deeper, I see that the britannica.com article that was probably used as a reference for both answers says that 24 people have reached the moon, but only 12 actually walked on its surface; the rest waited in the spacecraft, so they went to the moon but never walked on it. So technically the answer provided is wrong, but the Bing model did give me evidence and helped me reach the correct conclusion on my own. This transparency is highly valuable.

How Many People Have Been to the Moon? | Britannica

Testing LLMs will need a new strategy and serious consideration of Responsible AI. Tenets like Accountability and Transparency become more important than ever in this context. Below is my article on the internals of LLMs and how facts are encoded inside them.

3 Ways to Keep Stale Facts Fresh in Large Language Models | Unite.ai

Specific to Responsible AI, we also need to consider tenets like Accountability, Transparency, Reproducibility, Security and Privacy. For more details, check out our point of view.

Building a Responsible AI System | Persistent.com

Testing large language models is the process of evaluating their performance, capabilities, limitations and risks on various tasks or domains. For example, one can test how well a large language model can generate code, answer questions, write essays or handle compositionality and inference.

Let’s take the example of generating poetry using GPT-3. There are different ways to test the effectiveness of GPT-3 for generating poems. One way is to use automated metrics that measure aspects such as syntactical correctness, lexical diversity, subject continuity and rhyme scheme. Another way is to use human evaluations that rate the poems based on criteria such as creativity, coherence, fluency and emotion. However, both methods have limitations and challenges. For example, automated metrics may not capture the aesthetic or semantic qualities of poems, while human evaluations may be subjective or inconsistent.
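
As a rough illustration, here is what a couple of those automated metrics could look like in code. The type-token ratio and the suffix-based rhyme check below are naive approximations of lexical diversity and rhyme scheme, not production-grade metrics.

```python
# Sketch of two simple automated poem metrics: lexical diversity (type-token
# ratio) and a crude end-rhyme check based on shared word endings.
import re

def lexical_diversity(poem):
    # Ratio of unique words to total words; higher means more varied vocabulary.
    tokens = re.findall(r"[a-z']+", poem.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def end_rhyme_pairs(poem, suffix_len=3):
    # Count adjacent line pairs whose final words share a suffix,
    # a rough stand-in for rhyme-scheme detection (no phonetics involved).
    finals = [re.findall(r"[a-z']+", line.lower())[-1]
              for line in poem.splitlines()
              if re.findall(r"[a-z']+", line.lower())]
    return sum(1 for a, b in zip(finals, finals[1:])
               if a[-suffix_len:] == b[-suffix_len:])

poem = "The moon is bright tonight\nIt fills my heart with light\n"
print(lexical_diversity(poem), end_rhyme_pairs(poem))
```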

To test the bias of essays generated by ChatGPT, one possible approach is to use a bias detection tool that can analyze the language and content of the essays for any signs of prejudice, discrimination or unfairness towards certain groups or individuals. For example, you could use a tool like Perspective API, which scores texts on attributes such as toxicity, identity attack and profanity. Test cases then need to be designed that prompt the model to intentionally generate biased content by providing leading phrases as context. LLMs usually give priority to context and tune their output accordingly, so being able to filter out bias even when it is explicitly present in the prompt is a major test that needs to be conducted.
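
As a sketch of how such a check might be wired up, the snippet below scores a piece of generated text with the Perspective API (endpoint and attribute names as documented publicly; you need your own API key, and the thresholds and leading prompts are left to the test designer).

```python
# Hedged sketch: scoring generated text with the Perspective API.
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def bias_scores(text, api_key):
    # Request toxicity, identity-attack and profanity scores for the text.
    body = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}, "IDENTITY_ATTACK": {}, "PROFANITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
    resp.raise_for_status()
    scores = resp.json()["attributeScores"]
    # Each attribute comes back with a probability-style summary score in [0, 1].
    return {attr: s["summaryScore"]["value"] for attr, s in scores.items()}

# Example usage (essay would be the model's output from a leading prompt):
# print(bias_scores(essay, "YOUR_API_KEY"))
```

A simple pass/fail criterion could then be that no attribute score for the generated essay exceeds a chosen threshold, even when the prompt deliberately nudges the model towards biased content.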

A recent approach to mitigating bias in LLMs is to use constitutional rules. Constitutional rules are a set of principles or guidelines that govern the use and development of LLMs. They can help mitigate bias by ensuring that the data, methods and applications of LLMs are aligned with ethical values and social norms. For example, constitutional rules can specify how to collect and curate diverse and representative data for training LLMs, how to monitor and evaluate the performance and impact of LLMs, how to protect the privacy and security of users and data sources, and how to promote transparency and accountability of LLM developers and users.

Applying constitutional principles to LLMs
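
A minimal sketch of how such rules might be applied at generation time is a critique-and-revise loop: draft a response, ask the model whether the draft violates each rule, and rewrite it if so. The generate function and the single rule below are placeholders for illustration, not a real production constitution.

```python
# Illustrative critique-and-revise loop driven by constitutional rules.
CONSTITUTION = [
    "The response must not stereotype or demean any group of people.",
]

def constitutional_rewrite(prompt, generate):
    # `generate` stands in for any LLM completion function (prompt -> text).
    draft = generate(prompt)
    for rule in CONSTITUTION:
        critique = generate(
            f"Rule: {rule}\nResponse: {draft}\n"
            "Does the response violate the rule? Answer YES or NO, then explain."
        )
        if critique.strip().upper().startswith("YES"):
            # Ask the model to revise its own output so it satisfies the rule.
            draft = generate(
                f"Rewrite the response so that it follows this rule: {rule}\n"
                f"Original response: {draft}\nRewritten response:"
            )
    return draft
```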

References:

(1) Bias in Large Language Models: GPT-2 as a Case Study. https://blogs.ischool.berkeley.edu/w231/2021/02/24/bias-in-large-language-models-gpt-2-as-a-case-study/

(2) Adaptive Test Generation Using a Large Language Model. https://arxiv.org/abs/2302.06527v2

(3) Meta unveils a new large language model that can run on a single GPU …. https://arstechnica.com/information-technology/2023/02/chatgpt-on-your-pc-meta-unveils-new-ai-model-that-can-run-on-a-single-gpu/

(4) COS 597G: Understanding Large Language Models. https://www.cs.princeton.edu/courses/archive/fall22/cos597G/

(5) Getting started with LangChain – A powerful tool for working with Large …. https://medium.com/@avra42/getting-started-with-langchain-a-powerful-tool-for-working-with-large-language-models-286419ba0842

(6) Testing Large Language Models on Compositionality and Inference with …. https://aclanthology.org/2022.coling-1.359/


Dattaraj Rao

Engineer, Inventor, Author — “Keras to Kubernetes: Journey of a Machine Learning model to Production”