Coffee Time Papers: Can LLMs Learn by Teaching?

Dagang Wei
6 min read · Jun 30, 2024


This blog post is part of the series Coffee Time Papers.

Paper

https://arxiv.org/abs/2406.14629

Overview

This paper explores the concept of “Learning by Teaching” (LbT) for Large Language Models (LLMs). It investigates whether LLMs, like humans, can improve their knowledge and reasoning abilities by teaching other models. The paper proposes three methods for implementing LbT in LLMs, each mimicking a level of human LbT:

  1. M1: Observing Students’ Feedback: This method improves the answer quality of LLMs without additional training. The teacher generates multiple answers and rationales for a given problem, then evaluates each rationale by how well it teaches student models to answer similar problems correctly. The rationale with the highest score is selected as the best answer (a minimal sketch of this selection loop follows the list).
  2. M2: Learning from the Feedback: This method aims to improve the inherent ability of LLMs by learning from student feedback. It uses the approach in M1 to score teacher-generated rationales and then fine-tunes the teacher model using the rationale scores. This process helps the model learn from its mistakes and improve its reasoning capabilities.
  3. M3: Learning from the Feedback Iteratively: This method focuses on improving the answer quality of LLMs by iteratively refining prompts based on feedback from multiple students. The LLM reflects on the failure cases of students and generates new positive and negative examples to improve its teaching material.
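
To make M1 concrete, here is a minimal sketch of the selection loop, assuming the teacher and student are simple callables. `teacher_generate`, `student_answer`, and the dictionary fields are hypothetical stand-ins, not the paper’s API:

```python
# A minimal sketch of M1's selection loop, not the paper's exact code.
# `teacher_generate(problem)` is assumed to return one (rationale, answer)
# pair per call; `student_answer` is assumed to answer a question given
# that pair as an in-context demonstration.

def lbt_score(rationale, answer, exam_problems, student_answer):
    """Score a rationale-answer pair by how well a student that sees it
    as a demonstration solves similar exam problems."""
    correct = 0
    for prob in exam_problems:
        pred = student_answer(demo=(rationale, answer),
                              question=prob["question"])
        correct += int(pred == prob["answer"])
    return correct / len(exam_problems)

def m1_select(teacher_generate, student_answer, problem, exam_problems,
              n_samples=8):
    # 1. Sample several rationale-answer pairs from the teacher.
    candidates = [teacher_generate(problem) for _ in range(n_samples)]
    # 2. Keep the pair whose demonstration teaches students best.
    return max(candidates,
               key=lambda ra: lbt_score(ra[0], ra[1],
                                        exam_problems, student_answer))
```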

The paper evaluates these methods on mathematical reasoning and code synthesis tasks. The results show that LbT can improve the performance of LLMs in both tasks. For example, M1 can identify infrequent but correct answers, and M2 can improve the model’s ability to generate correct and concise rationales. Additionally, M3 demonstrates that LLMs can learn from diverse feedback from multiple students to improve their teaching materials and reasoning abilities.

The paper concludes that LbT is a promising approach for improving LLMs. It suggests that LLMs can continuously evolve by teaching other models, potentially reducing the reliance on human-produced data or stronger models. The authors also highlight the potential of using advanced techniques in education to improve LLMs further.

Q&A

Q: What is the main idea of Learning by Teaching (LbT) in LLMs?

A: Learning by Teaching (LbT) is a novel approach where LLMs improve their knowledge and reasoning abilities by teaching other models. This contrasts with the traditional Learning from Teachers (LfT) approach, where LLMs learn from more knowledgeable models.

Q: What are the three levels of LbT in humans, and how are they implemented in LLMs?

A: The paper mirrors the three human LbT levels with three methods, M1–M3. Here’s how each works on the paper’s example tasks:

Method 1 (M1): Observing Students’ Feedback

  • Mathematical Reasoning: The teacher LLM is given a math problem (e.g., “Compute (-49) ÷ 7”) and generates multiple rationales and answers. Each rationale-answer pair is then used as an in-context demonstration to teach a student LLM to solve similar “exam” problems. The student LLM’s accuracy on these exam problems serves as feedback on the quality of the teacher’s rationale, and the rationale that leads to the highest student performance is selected as the best.
  • Code Synthesis: The teacher LLM is tasked with generating Python code for a given problem (e.g., Leetcode problem 877). It produces multiple code solutions with accompanying rationales. Each code-rationale pair is used to teach a student LLM to solve similar coding problems, and the student’s performance (the fraction of test cases it passes) is used as feedback to select the best of the teacher’s solutions (see the pass-rate sketch below).
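
For the code-synthesis variant, the student’s exam score can be the fraction of test cases its generated code passes. The toy runner below assumes the generated code defines a function named `solution` (our convention, not the paper’s) and skips the sandboxing a real harness would need:

```python
# Hedged sketch: grade generated code by its test-case pass rate.
# In practice, never exec() untrusted model output outside an isolated
# sandbox with timeouts; this is a toy stand-in for such a runner.

def pass_rate(code, tests):
    """tests: list of (args_tuple, expected_output) pairs."""
    namespace = {}
    exec(code, namespace)              # load the generated function
    solution = namespace["solution"]   # assumed entry-point name
    passed = sum(int(solution(*args) == expected)
                 for args, expected in tests)
    return passed / len(tests)
```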

Method 2 (M2): Learning from the Feedback

  • Mathematical Reasoning: The teacher LLM generates multiple rationales and answers for math problems. The quality of each rationale is assessed using a combination of its correctness and its ability to teach a student LLM (as in M1). This feedback is then used to fine-tune the teacher LLM, improving its ability to generate correct and concise rationales.
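
One plausible minimal realization of this fine-tuning step, assuming a DPO-style preference setup, is to turn scored rationales into (chosen, rejected) pairs. The pairing rule and `margin` threshold below are illustrative assumptions, not the paper’s exact recipe:

```python
# Hedged sketch of M2's data construction: rationales scored as in M1
# become preference pairs for fine-tuning the teacher.

def build_preference_pairs(problem, scored_rationales, margin=0.2):
    """scored_rationales: list of (rationale, lbt_score) for one problem."""
    ranked = sorted(scored_rationales, key=lambda rs: rs[1], reverse=True)
    pairs = []
    for chosen, chosen_score in ranked:
        for rejected, rejected_score in ranked:
            # Keep only pairs with a meaningful score gap, so the
            # fine-tuning signal reflects real quality differences.
            if chosen_score - rejected_score >= margin:
                pairs.append({"prompt": problem,
                              "chosen": chosen,
                              "rejected": rejected})
    return pairs
```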

Method 3 (M3): Learning from the Feedback Iteratively

  • Logical Fallacy Classification: The teacher LLM is given a set of positive and negative examples of logical fallacies. It uses these examples to teach a student LLM to classify new text instances. The teacher LLM then reflects on the student’s incorrect classifications, identifying potential weaknesses in the teaching examples. It revises the examples based on this reflection, aiming to improve the student’s performance in the next iteration. This process is repeated iteratively, with the teacher LLM continuously refining its teaching material based on the student’s feedback.
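
A minimal sketch of this iterative loop, where `teach_and_grade` and `reflect_and_revise` are hypothetical wrappers around student and teacher LLM calls:

```python
# Hedged sketch of M3's refinement loop, not the paper's exact pipeline.
# `teach_and_grade(student, examples)` is assumed to return the cases
# the student still gets wrong; `reflect_and_revise` asks the teacher
# LLM to rewrite its examples given those failures.

def m3_refine(examples, students, teach_and_grade, reflect_and_revise,
              n_rounds=3):
    for _ in range(n_rounds):
        failures = []
        for student in students:
            # Pool the instances each student still misclassifies.
            failures.extend(teach_and_grade(student, examples))
        if not failures:
            break  # every student passes; the material is good enough
        # The teacher reflects on failures from diverse students and
        # rewrites its positive/negative teaching examples.
        examples = reflect_and_revise(examples, failures)
    return examples
```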

In all these examples, the LbT process involves a back-and-forth interaction between the teacher and student LLMs. The teacher provides instruction, the student provides feedback through its performance, and the teacher learns and improves based on this feedback. This iterative process allows the teacher LLM to refine its knowledge, reasoning abilities, and teaching materials.

In the context of this paper, a rationale is a step-by-step explanation of the problem-solving process. It is a way for the teacher model (an LLM) to explain its reasoning to the student model (another LLM). The rationale is crucial for the LbT process as it allows the student model to understand the underlying logic and apply it to similar problems.

Q: What are the examples of rationales in the paper?

A: The paper provides several examples of rationales across different tasks:

Mathematical Reasoning:

  • Teaching Problem (TP): Compute (-49) ÷ 7.
  • Teaching Rationale (TR): “The division operation is the inverse operation of multiplication, so we need to find the number that 7 multiplied by gives -49. This number is -7.”

In this example, the rationale explains the concept of division as the inverse of multiplication and guides the student model to find the correct answer.

Code Synthesis:

  • Teaching Problem (TP): (Leetcode Question ID: 877) Write a python code to solve the following problem: Alice and Bob play a game with piles of stones…
  • Teaching Rationale (TR): “To determine the winner of the game, Alice or Bob, you can use a dynamic programming approach. Let’s think step by step: 1. Create a 2D array…” (The rationale continues with a detailed explanation of the dynamic programming approach)

In this example, the rationale outlines the steps involved in solving the problem using dynamic programming, a common algorithmic technique.
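
For reference, here is a compact version of the dynamic-programming approach the rationale describes for Leetcode 877 (Stone Game); the variable names and structure are ours, not the paper’s:

```python
def stone_game(piles):
    """Return True if Alice (who moves first) wins with optimal play."""
    n = len(piles)
    # dp[i][j]: best score margin (current player minus opponent) the
    # player to move can secure on the subarray piles[i..j].
    dp = [[0] * n for _ in range(n)]
    for i in range(n):
        dp[i][i] = piles[i]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Take the left or right pile, whichever leaves the
            # opponent with the smaller achievable margin.
            dp[i][j] = max(piles[i] - dp[i + 1][j],
                           piles[j] - dp[i][j - 1])
    return dp[0][n - 1] > 0
```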

Logical Fallacy Classification:

  • Task: Is this text an instance of the logical fallacy of faulty generalization?
  • Example 1: My father told me that the sky is green. So it must be the truth. [No]
  • Rationale: This is an appeal to authority fallacy, where a claim is considered true because an authority figure said it, rather than based on evidence or logic.

In this example, the rationale explains why the statement is an appeal to authority rather than a faulty generalization (hence the [No] label), helping the student model learn to distinguish faulty generalization from related fallacies.
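
As a rough illustration, labeled examples like these could be packed into a few-shot teaching prompt for the student; the template below is an assumption, not the paper’s exact wording:

```python
# Illustrative only: one way the teacher's positive/negative examples
# might be formatted into a few-shot classification prompt.

def build_prompt(examples, query, fallacy="faulty generalization"):
    """examples: list of (text, is_fallacy) pairs; query: text to label."""
    lines = [f"Is this text an instance of the logical fallacy "
             f"of {fallacy}?"]
    for text, is_fallacy in examples:
        lines.append(f"Text: {text}\nAnswer: "
                     f"{'Yes' if is_fallacy else 'No'}")
    lines.append(f"Text: {query}\nAnswer:")
    return "\n\n".join(lines)
```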

These examples demonstrate how rationales are used to explain the reasoning behind the solutions or classifications, enabling the student model to learn from the teacher model’s thought process. The quality of the rationale is crucial for effective LbT, as a clear and accurate explanation will lead to better learning outcomes for the student model.

Q: What are the main findings of the paper regarding LbT in LLMs?

A: The paper finds that LbT can improve both the answer quality and the inherent capabilities of LLMs. It also reveals that LbT in LLMs shares similarities with human learning, such as:

  • Weak-to-strong generalization: Stronger models can improve by teaching weaker models.
  • Diversity in students helps: Teaching multiple students is more beneficial than teaching one student or the teacher itself.

Q: What are the potential applications of LbT in LLMs?

A: LbT can potentially enable LLMs to continuously evolve by teaching each other, reducing the reliance on human-produced data or stronger models. It also opens up possibilities for using advanced educational techniques to improve LLMs further.

Q: What are the limitations of the current LbT methods and potential future directions?

A: The current LbT methods rely on the availability of similar problems for teaching and evaluation. Future work could explore automatic identification or synthesis of similar problems. Additionally, the computational cost of LbT-based scoring could be addressed by developing more efficient methods or using efficient inference systems.
