Digital Socrates: Evaluating LLMs through Explanation Critiques

Blog written by Yuling Gu

Ai2 Blog · 5 min read · Aug 12, 2024

Looking for an interpretable explanation evaluation tool that can automatically characterize the explanation capabilities of modern LLMs? Meet Digital Socrates at ACL 2024!

A better way of evaluating explanations

While large language models (LLMs) can provide explanations along with their answers, the nature and quality of those explanations are still poorly understood.

A summary of the benefits of Digital Socrates, including its focus on quality and localization of interpretable feedback.

In response, our goal is to (i) define a detailed way of characterizing the explanation capabilities of modern models, (ii) create a sizeable, human-verified dataset for this task, and (iii) train a high-performing, open-source critique model, Digital Socrates (DS), that can generate such characterizations automatically.

Explanation critiquing: task design

To achieve this, we first formalize the explanation critiquing task.

Given a multiple-choice question (together with the answer options and correct answer), as well as a model-generated reasoning chain and answer, our system Digital Socrates gives a critique of the model-generated explanation. In its critiques, Digital Socrates provides localized feedback on where and why reasoning chains are flawed (focusing on the main flaw, if any), accompanied by general and fine-grained suggestions to address the identified flaw.

Given a question, along with a model-generated explanation and answer, the task involves giving a critique of the model-generated explanation. The first component of the critique is to identify the most significant flaw (if any) in the explanation. Informed by the systematic and disciplined method of Socratic questioning, our critique focuses on flaws along dimensions chosen to cover the different types of Socratic questions.

The critique also contains general and specific suggestions, so that each identified flaw comes with a direction for improvement rather than being merely critical. The general suggestion is a statement that addresses a likely misconception underlying the flaw, without giving away the answer to the question. The specific suggestion is a more targeted guide toward the right reasoning chain for this particular question. Finally, the task involves providing a quantitative rating of the explanation quality on a scale from 0 to 5, where 0 indicates the explanation is very wrong and 5 that it is completely correct.
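To make the critique format concrete, here is a minimal sketch of how one critique could be represented in code. The field names are illustrative, not the exact schema used in the paper or dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExplanationCritique:
    """One critique of a model-generated explanation (illustrative schema)."""
    flaw_dimension: Optional[str]       # e.g. "incorrect_information"; None if no flaw found
    flaw_location: Optional[str]        # the step in the reasoning chain where the flaw occurs
    general_suggestion: Optional[str]   # addresses the underlying misconception, no answer leak
    specific_suggestion: Optional[str]  # targeted guidance toward the right reasoning chain
    explanation_score: int              # 0 (very wrong) to 5 (completely correct)

    def __post_init__(self) -> None:
        if not 0 <= self.explanation_score <= 5:
            raise ValueError("explanation_score must be in [0, 5]")
```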

Digital Socrates’ Critique Bank (DSCB)

We then introduce Digital Socrates’ Critique Bank, a sizeable, human-verified dataset for this task. Each instance comprises a question, a model-generated explanation and answer, a critique of the model-generated explanation, and any human annotations collected for that instance.

A summary of the content of Critique Bank.

DS Critique Bank focuses on questions requiring reasoning, in particular science and commonsense reasoning. The explanations come from different models and are written in popular explanation styles. We elicit seed explanation-critique data from GPT-4, then collect expert and crowdsourced annotations.
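To browse the data yourself, something along the following lines should work with the Hugging Face datasets library (the dataset identifier comes from the release link below; the split and field names here are assumptions, so check the dataset card):

```python
from datasets import load_dataset

# Load DS Critique Bank from the Hugging Face Hub.
dscb = load_dataset("allenai/DS_Critique_Bank")

# Inspect one instance: a question, a model-generated explanation and answer,
# a critique, and any human annotations (exact field names may differ).
example = dscb["train"][0]
print(example.keys())
```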

Summary of data composition in DS Critique Bank.

To the best of our knowledge, this is the first dataset of its kind on explanation critiquing, covering nuanced and interpretable (user comprehensible) critiques on different models’ explanations and in different explanation styles.

Our model: Digital Socrates (DS)

Using Digital Socrates’ Critique Bank, we train open-source, automatic critique models (called Digital Socrates). We fine-tune two critique models, DS-7B and DS-13B, starting from Llama2-7B-chat and Llama2-13B-chat respectively.
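As a rough sketch of how one might query a DS model with the transformers library (the checkpoint name and prompt format below are assumptions; consult the released model card for the exact prompt template the model was trained with):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/digital-socrates-7b"  # assumed Hub name; verify on the model card
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in prompt; the real template is defined by the model card.
prompt = (
    "Question: Which gas do plants absorb during photosynthesis?\n"
    "Options: (A) oxygen (B) carbon dioxide (C) nitrogen\n"
    "Student answer: (A) oxygen\n"
    "Student explanation: Plants breathe in oxygen to make their food.\n"
    "Critique the student's explanation:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```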

The pipeline of content into Digital Socrates.

First, we pre-fine-tune on a set of about 50k training questions from ARC and RAINBOW (αNLI, CosmosQA, HellaSwag, Physical IQa, Social IQa and WinoGrande), doing a simple zero-shot question-answering task. Then we further fine-tune on a curriculum of increasing critique quality, starting with silver data from GPT-4, followed by crowdsourced data, and finally expert annotated data.
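In outline, this curriculum is just sequential fine-tuning over datasets of increasing quality. A schematic sketch (the stage names and hyperparameters are placeholders, not the paper's exact recipe):

```python
from transformers import Trainer, TrainingArguments

def curriculum_finetune(model, stages):
    """Sequentially fine-tune `model` on an ordered list of (name, dataset) pairs,
    e.g. [("qa_pretrain", ...), ("gpt4_silver", ...), ("crowdsourced", ...), ("expert", ...)],
    mirroring the order of increasing critique quality described above."""
    for name, dataset in stages:
        args = TrainingArguments(output_dir=f"ds-{name}", num_train_epochs=1,
                                 per_device_train_batch_size=4)
        Trainer(model=model, args=args, train_dataset=dataset).train()
    return model
```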

The impact

Our explanation critiquing task allows for insightful analysis of student models, looking beyond accuracy.

Across student models, human-annotated explanation scores vary greatly both among cases where the model gets the answer right (accuracy = 1) and among cases where it gets it wrong (accuracy = 0). Even when a model gets the answer correct, its reasoning chain can contain varying degrees of flaws. Conversely, when a model's answer is incorrect, its explanation can still make some valid points.

Further, explanation critiquing allows us to obtain interpretable insights into the different categories of errors in models’ reasoning chains.
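As a small illustration of this kind of analysis, given critique records carrying an accuracy bit, an explanation score, and a flaw dimension (the field layout and values below are toy examples, not the dataset's schema), one can disaggregate explanation quality by answer correctness and tally flaw categories:

```python
from collections import Counter, defaultdict
from statistics import mean

# Toy records: (answer_correct, explanation_score 0-5, flaw_dimension or None)
records = [
    (True, 5, None),
    (True, 2, "incorrect_information"),
    (False, 3, "inconsistent_answer"),
    (False, 0, "incorrect_information"),
]

# Explanation quality varies within both correct and incorrect answers.
scores_by_acc = defaultdict(list)
for correct, score, _ in records:
    scores_by_acc[correct].append(score)
for correct, scores in sorted(scores_by_acc.items()):
    print(f"accuracy={int(correct)}: mean score {mean(scores):.1f}, "
          f"range {min(scores)}-{max(scores)}")

# Flaw dimensions give interpretable error categories beyond a single number.
print(Counter(dim for _, _, dim in records if dim is not None).most_common())
```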

GPT-3.5 and Llama2-70B student models achieve comparable accuracy on the Science datasets, with the latter having slightly lower explanation scores. They also show different patterns in their explanation flaws, e.g., in the amount of incorrect information versus inconsistent answers.

For instance, in this case study comparing models on the Science datasets, looking at the dimensions of explanation flaws tells us that incorrect information is a frequent flaw in Llama2-70B's explanations. Localizing the flaws further reveals the topics on which the model holds incorrect information. The critique also provides suggestions to correct the flaw, such as supplying the correct information. The general feedback could then offer directions for streamlining model improvement or serve as a useful retrieval corpus, while the specific feedback helps correct the reasoning for each individual instance.

The next steps and outlook

In this work, we show how Digital Socrates can, for the first time, provide rich analyses and insights across a range of student models and datasets, without relying on unfaithful proxies, expensive API calls, or human annotation.

A diagram of DS models’ performance.

Our paper provides further analysis showing how applying DS to all datasets in DS Critique Bank, across student models, reveals a rich diversity of behavior.

Analyzing critiques rated as good by crowdworkers shows that the types of errors in reasoning chains vary across models and also depend on the task dataset (see Figure 3 in the paper for the dimensions of flaws). Explanation scores, as a summary metric, do not capture such nuances and characteristics across models and datasets.

This fills an important gap in evaluation tools for understanding and improving the explanation behavior of models. We encourage other researchers to build on our work: by using DS to evaluate model explanations (in addition to accuracy) on leaderboards and in evaluation codebases, by applying existing DS models to other reasoning types, and by developing future generations of DS models.

To read more, see our paper “Digital Socrates: Evaluating LLMs through Explanation Critiques”.

We make our dataset and models publicly available at https://huggingface.co/datasets/allenai/DS_Critique_Bank.

You can also watch a presentation of the paper at https://youtu.be/kiF7reGYVrQ.

Follow @allen_ai on Twitter/X, and subscribe to the Ai2 Newsletter to stay current on news and research coming out of Ai2.
