ChatGPT is one of the most advanced and currently most discussed public AI systems. For an investigation, we tried to pass the Bavarian Abitur with the chatbot. Spoiler: It did not do too badly. In the process, we gained important insights into working with the chatbot.
How we learned to use the chatbot
In collaboration with secondary school teachers, Bavarian Broadcasting’s AI & Automation Lab took a close look at the input options to determine which prompts produce good answers and which ones are better discarded.
We tested different subjects: history, German, computer science, mathematics and ethics. Along the way, we got some answers on how to use ChatGPT, including beyond Abitur tasks.
This is what we learned by dealing with the chatbot:
- Hallucination is real — provide context to diminish invented facts
- Better know the answer
- Break down complex tasks
- ‘Text only’ can be annoying — how to translate illustrations and formulas into text
- ChatGPT needs a human in the loop
What is ChatGPT?
ChatGPT is currently one of the most advanced public chatbots and is developed by the company OpenAI. It is based on a model from the GPT-3 family (Generative Pre-trained Transformer 3), one of the largest language models, which allows ChatGPT to generate human-like text answers to questions.
Through an interface on OpenAI’s website, users can enter into a dialogue with the system by typing text and have their questions answered.
Hallucination is real — provide context to diminish invented facts
When asked to interpret the poem “Wiederfinden” by the German poet Johann Wolfgang von Goethe, ChatGPT recited what it presented as the first stanza and went on to interpret it.
The answer caught our attention because it seems to begin with an excerpt from the fairy tale “Rumpelstilzchen” by the Brothers Grimm. We quickly discovered that the verses are an invention of ChatGPT.
This phenomenon is called “hallucination”: the AI gives answers that are presented as fact but contradict what the model learned from its training data.
We found that you can reduce hallucination by giving more context: the chatbot handled the text more confidently when we provided sources, for example in the history test. It even incorporated rarely used terms from the sources.
The approach reaches its limits, though, when the sheer amount of text exceeds the conversation context limit of ~3000 words (~4000 tokens). And even when we fed the source text to the chatbot as context, ChatGPT’s responses had problems with follow-up tasks that went beyond the text itself.
For example, one of the tasks in the German Abitur was to contextualize the work of a contemporary German author, but ChatGPT lacked the contextual knowledge. Nevertheless, it is noteworthy that ChatGPT itself pointed out its ignorance in some subtasks.
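Before pasting a source into the chat, it can help to estimate whether it fits within the context limit at all. The following is a minimal sketch, assuming the approximate figures quoted above (~3000 words, ~4000 tokens) and a simple whitespace word count; the function names and the `reserve_for_answer` value are our own illustrative choices, not part of any ChatGPT interface.

```python
# Rough check of whether a source text fits into the conversation context.
# The limit is the approximate figure from the article (~4000 tokens for
# ~3000 words); the words-to-tokens ratio derived from it is an assumption.

CONTEXT_LIMIT_TOKENS = 4000
TOKENS_PER_WORD = 4000 / 3000  # about 1.33, derived from the figures above


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def fits_in_context(source: str, question: str, reserve_for_answer: int = 400) -> bool:
    """Return True if source and question still leave room for an answer.

    `reserve_for_answer` is a hypothetical safety margin in tokens.
    """
    used = estimate_tokens(source) + estimate_tokens(question)
    return used + reserve_for_answer <= CONTEXT_LIMIT_TOKENS
```

If the check fails, the source has to be shortened or split before it can serve as context. The estimate is deliberately crude, since real token counts depend on the language and the tokenizer.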
Better know the answer
In addition to hallucinating, ChatGPT might give you false or irrelevant answers. Spotting wrong answers is one of the most important skills when using ChatGPT: it takes subject expertise to distinguish good answers from wrong ones.
So how do you distinguish a correct answer from a wrong one, especially given the apparent self-confidence with which ChatGPT asserts its answers? Cases in which ChatGPT admitted its lack of knowledge in the Abitur tasks remained an exception in our experiment.
Wrong answers had in common that they often remained vague, poor in detail and one-sided. In some cases, ChatGPT simply gave an answer that had nothing to do with the question.
Unfortunately, there is no simple rule for recognising wrong answers, because this rule would have to cover everything from wrong historical contexts to calculation errors. However, the following rules of thumb were useful for our work and hinted at correct answers:
- The answer should not digress but directly answer the question.
- Conditions on the answer (e.g., enumeration, method, perspective) must be fulfilled.
- If the answer contains sources, extracts from the sources should be part of the answer. This way hallucinations can be detected.
- A response that is repetitive may indicate insufficient context.
The bottom line is that human expertise is the best way to identify wrong answers. A person who can answer the question in a qualified way can also judge whether ChatGPT gives a correct answer.
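Two of the rules of thumb above, the one on source extracts and the one on repetition, can be partly automated as a first filter before a human looks at the answer. The sketch below is only an illustration under simple assumptions: quotes are marked with straight double quotes, sentences end in `.`, `!` or `?`, and both helper names are hypothetical.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def unsupported_quotes(answer: str, source: str) -> list[str]:
    """Return quoted passages in the answer that do not occur in the source."""
    normalized_source = _normalize(source)
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes if _normalize(q) not in normalized_source]


def repeated_sentences(answer: str) -> list[str]:
    """Return sentences that appear more than once in the answer."""
    sentences = [_normalize(s) for s in re.split(r"[.!?]+", answer) if s.strip()]
    seen: set[str] = set()
    repeated: list[str] = []
    for sentence in sentences:
        if sentence in seen and sentence not in repeated:
            repeated.append(sentence)
        seen.add(sentence)
    return repeated
```

A non-empty result from either function is a reason to look more closely, not proof of an error. An empty result says nothing about correctness, which still takes someone who knows the subject.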
Break down complex tasks
Some complex questions require long, detailed answers that go beyond the standard length of ChatGPT answers. The longest answer we could elicit was 271 words; the mean answer length was ~130 words. To get longer answers to complex questions, we experimented with the following two approaches.
Our first approach was to have ChatGPT answer complex questions as a whole. If ChatGPT broke off in mid-sentence when reaching its answer length limit, we asked the chatbot in our next prompt to continue its answer.
This worked well for one or two continuations but led to repetitions and incoherent answers with little detail when continued further. Our second approach, for complex tasks that needed even more length, was to decompose the task into sub-questions and let ChatGPT answer these separately.
But how to reassemble the different parts? We found that ChatGPT mostly failed at this task. When we instructed the model to write transitions, they were repetitive and not fluent.
And when we had ChatGPT combine successive subtasks in a summary, the logical structure was lost. Our attempts showed that composing the sub-answers is currently better done by a human.
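The two approaches can be sketched in a few lines. We worked by hand in the chat window, so this is an illustration rather than the code we used: `ask` is a hypothetical stand-in for whatever sends a prompt to the chatbot and returns its reply, and the cut-off check is a simple heuristic of our own.

```python
from typing import Callable

# `ask` is a hypothetical placeholder: a function that sends one prompt to
# the chatbot and returns the reply as text. It is not a real API.
Ask = Callable[[str], str]


def looks_cut_off(text: str) -> bool:
    """Heuristic: a reply that does not end in sentence punctuation was cut off."""
    return not text.rstrip().endswith((".", "!", "?", '"'))


def answer_with_continuations(ask: Ask, question: str, max_continuations: int = 2) -> str:
    """Ask once, then request at most `max_continuations` follow-ups.

    The default of 2 reflects our experience that quality dropped after
    one or two continuations.
    """
    parts = [ask(question)]
    for _ in range(max_continuations):
        if not looks_cut_off(parts[-1]):
            break
        parts.append(ask("Please continue your answer."))
    return " ".join(part.strip() for part in parts)


def answer_by_subquestions(ask: Ask, subquestions: list[str]) -> list[str]:
    """Answer each sub-question separately; reassembly is left to a human."""
    return [answer_with_continuations(ask, q) for q in subquestions]
```

`answer_by_subquestions` deliberately returns a list instead of one merged text, because joining the parts into a coherent whole was the step the chatbot handled poorly.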
‘Text only’ can be annoying — how to translate illustrations and formulas into text
The Abitur questions contain many illustrations such as cartoons, diagrams, formulas, charts and pictures. Oftentimes those are indispensable for answering the questions. As a chatbot, however, ChatGPT only allows queries as text.
To soften this restriction, we translated various illustrations into text. We focused on what are probably the most common text representations of image types on the internet, since these are likely to have occurred in ChatGPT’s training data.
We started by working on diagrams in computer science. We noticed that ChatGPT used ASCII art to represent illustrations in its answers. Subsequently, we also used ASCII art to illustrate our questions when appropriate.
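To give an idea of what such a text representation can look like, here is a minimal sketch that renders a small tree diagram as ASCII art for use in a prompt. The tree, the nested-dict layout and the function names are made up for illustration and are not taken from the exam tasks.

```python
def render_children(children: dict, prefix: str = "") -> list[str]:
    """Render the child nodes of a tree as indented ASCII-art lines."""
    lines = []
    items = list(children.items())
    for index, (label, grandchildren) in enumerate(items):
        is_last = index == len(items) - 1
        lines.append(prefix + "+-- " + label)
        # Keep the vertical bar only while more siblings follow on this level.
        child_prefix = prefix + ("    " if is_last else "|   ")
        lines.extend(render_children(grandchildren, child_prefix))
    return lines


def render_tree(root: str, children: dict) -> str:
    """Return the whole tree as one multi-line string."""
    return "\n".join([root] + render_children(children))


# Hypothetical example: a small binary search tree as a nested dict.
tree_text = render_tree("8", {"3": {"1": {}, "6": {}}, "10": {}})
prompt = "Given the following tree:\n" + tree_text + "\nList the leaf nodes."
```

For this example, `tree_text` consists of the root `8` followed by the lines `+-- 3`, `|   +-- 1`, `|   +-- 6` and `+-- 10`, which can be pasted into the chat as plain text.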
When the diagrams became too complex for ASCII art, and for the formulas that are essential in the mathematics Abitur tasks, we translated them into text using the typesetting language LaTeX, which is widely used in academia.
ChatGPT was able to interpret this format and to use it in its answers as well. With this approach, we eventually managed to pass the mathematics Abitur with the generated answers.
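As an illustration of the notation, a task and its solution might be written like this. The function is a made-up example and not one of the actual exam tasks.

```latex
% Task as typed into the chat (hypothetical example):
Given $f(x) = \frac{x^2 - 4}{x + 1}$ for $x \neq -1$, find $f'(x)$.

% Solution via the quotient rule:
$f'(x) = \frac{2x(x + 1) - (x^2 - 4)}{(x + 1)^2} = \frac{x^2 + 2x + 4}{(x + 1)^2}$
```

Fractions, exponents and inequalities that have no plain-text equivalent stay unambiguous in this form, which is why the format suits formula-heavy tasks.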
However, we did not succeed in finding a suitable text representation for pictures or caricatures and had to skip these tasks. A future integration of image-to-text systems with ChatGPT could make it possible to answer those questions.
ChatGPT needs a human in the loop
We started our experiment with the idea that ChatGPT would simply generate the answers to the Abitur questions. In practice, we had to do tedious and time-consuming translation work to get a suitable answer to each question.
The wording of the questions often had to be adapted several times to nudge ChatGPT in the direction of the correct answer. It takes a lot of diligence and creativity to find text equivalents for diagrams and formulas or to divide complex tasks into digestible chunks for the chatbot.
In addition, it takes human expertise to distinguish correct answers from incorrect ones and to separate hallucinations from correct answers.
Preparing the questions as well as selecting and checking the answers turned out to be an elaborate task that is not visible in the result, much like the submerged part of the proverbial iceberg. In the end, only the solution generated by ChatGPT is visible.
This is not the first time we have encountered this insight in dealing with AI systems. We have had similar experiences with GPT-3, DALL·E 2 and the like.
With DALL·E 2, the translation work involved illustrators not only finding the right prompt for an illustration, but also developing the idea itself and embedding it in the context of the task, while ensuring that toxic stereotypes are not reproduced. Again, generating the illustration is only the tip of the iceberg.
ChatGPT passed the subjects history, mathematics and ethics in our experiment, making it clear that the chatbot is another big step in the development of AI systems. Nevertheless, the system cannot do without human supervision and is, like its predecessors, an assistance tool.
If you want to find out more about this topic in German, follow this link: https://www.br.de/nachrichten/netzwelt/chatgpt-schafft-die-ki-das-bayerische-abitur,TVBjrXE