CoT in Large Language Models: Fine-Tuning Based CoT

Michael X
8 min read · May 10, 2023


Fine-Tuning Based CoT

We have presented several in-context learning methods for Chain-of-Thought (CoT) that augment prompts with reasoning steps so the language model can imitate similar reasoning processes. However, the reasoning ability of language models on complex tasks, such as math and physics problems, can also be strengthened by fine-tuning the model on reasoning data. These methods use CoT data to update the parameters of the language model, which can lead to better reasoning about complex problems and more accurate solutions.

ScienceQA

ScienceQA [10] proposes a new benchmark that addresses the limitations of existing science question-answering datasets. It consists of approximately 21k multiple-choice questions spanning diverse science topics, with answers annotated with corresponding lectures and explanations. This multimodal dataset is collected from elementary and high school science curricula and covers three subjects: natural science, social science, and language science. Unlike existing datasets, ScienceQA annotates each answer with a lecture and an explanation, which supply general external knowledge and the specific reasons, respectively, for arriving at the correct answer.

Figure 11 ScienceQA benchmark

To study how language models can mimic the multi-hop reasoning process required to answer ScienceQA questions, the authors design models that learn to generate the lecture and explanation as a chain of thought (CoT) alongside the answer. They experiment with two language models: few-shot GPT-3 and fine-tuned UnifiedQA.

To train the language models, they use the ScienceQA dataset described above. Each example includes a question, multiple answer choices, multimodal contexts, the correct answer, and the accompanying lecture and explanation.

The authors prompt the language models with a CoT formulation: the task is to output a natural-language explanation alongside the predicted answer. During training, the model is given the question, the available contexts (text, images, and formulas), and the correct answer, and it learns to generate an explanation that justifies that answer. Trained on a large amount of such data, the models learn to generalize to new questions and contexts.
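As a rough sketch, a single ScienceQA example might be serialized into an input-target pair for fine-tuning a seq2seq model such as UnifiedQA along the following lines. The field names and the exact template below are illustrative assumptions, not the paper's verbatim format:

```python
# Sketch: turning a ScienceQA-style record into an (input, target) pair.
# The serialization format here is an assumption for illustration only.

def format_example(example: dict) -> tuple[str, str]:
    """Build input (question + context + choices) and target (answer + lecture + explanation)."""
    choices = " ".join(
        f"({chr(ord('a') + i)}) {c}" for i, c in enumerate(example["choices"])
    )
    prompt = (
        f"Question: {example['question']}\n"
        f"Context: {example['context']}\n"
        f"Options: {choices}\n"
        "Answer:"
    )
    # The model is trained to produce the answer followed by the chain of
    # thought (lecture + explanation) that justifies it.
    target = (
        f"The answer is ({example['answer']}). "
        f"BECAUSE: {example['lecture']} {example['explanation']}"
    )
    return prompt, target

example = {
    "question": "Which property do these objects have in common?",
    "context": "Select the best answer.",
    "choices": ["hard", "soft", "stretchy"],
    "answer": "b",
    "lecture": "A property is something you can observe about an object.",
    "explanation": "All three objects feel soft when you touch them.",
}
prompt, target = format_example(example)
```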

They find that CoT improves question-answering accuracy by 1.20% for few-shot GPT-3 and 3.99% for fine-tuned UnifiedQA. They also probe the upper bound of how much models can leverage explanations by feeding gold explanations in the input, which improves GPT-3's few-shot performance by 18.96%. The authors conclude that, like humans, language models benefit from explanations and can learn from less data, matching full-data performance with just 40% of the training set. Overall, ScienceQA demonstrates the utility of CoT in language models and helps bridge a gap in existing datasets in the scientific domain.

Teaching Small Language Models to Reason

While chain-of-thought prompting has been shown to significantly improve the reasoning capabilities of large language models, the same improvement is not observed in smaller models. This limits the practical applicability of the technique, since not every application can afford a very large model, and it motivates transferring the reasoning capabilities of large language models to smaller ones. To this end, [11] proposes a method for transferring the reasoning capabilities of large language models to smaller models through knowledge distillation.

Specifically, the authors first annotate an existing supervised dataset with chain-of-thought (CoT) reasoning generated by a larger teacher model. To generate high-quality CoT data, they use LLMs such as PaLM 540B or GPT-3 175B as teachers, based on the finding that CoT reasoning improves with model scale. The few-shot prompts are modified so that the solution to each task appears before the example CoT, based on the observation that providing the answer as guidance allows the LLM to correct small mistakes in its chain of thought. All CoT that disagrees with the target answer is removed, preventing the student model from learning from bad examples.

The second step of the pipeline fine-tunes a smaller student model via teacher forcing: the student is given the question as input, with the CoT and answer as the target. Because the model is trained to produce a CoT during fine-tuning, no prompting is required at inference time.
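A minimal sketch of the data-generation step might look as follows, where teacher_generate and extract_answer are hypothetical callables standing in for a PaLM/GPT-3 API call and an answer parser, not any specific library interface:

```python
# Sketch of the distillation pipeline from [11]: the teacher annotates each
# training question with a CoT rationale, incorrect rationales are filtered
# out against the gold answer, and the survivors become fine-tuning targets.

def build_student_data(dataset, few_shot_prompt, teacher_generate, extract_answer):
    """dataset: iterable of (question, gold_answer) pairs.
    teacher_generate: callable prompt -> CoT text (e.g. a PaLM/GPT-3 call).
    extract_answer: callable CoT text -> final answer string."""
    student_examples = []
    for question, gold_answer in dataset:
        # Give the teacher the gold answer as guidance so it can correct
        # small mistakes while writing its rationale.
        prompt = (
            f"{few_shot_prompt}\n"
            f"Q: {question}\nA: The answer is {gold_answer}.\nRationale:"
        )
        cot = teacher_generate(prompt)
        # Keep only rationales whose conclusion matches the target answer,
        # so the student never trains on a faulty chain of thought.
        if extract_answer(cot) == gold_answer:
            # Teacher forcing: question as input, CoT + answer as target.
            student_examples.append(
                {"input": question, "target": f"{cot} The answer is {gold_answer}."}
            )
    return student_examples
```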

Figure 12 Effect of student model (T5) size on accuracy on GSM8K.

As shown in Figure 12, the proposed method improves task performance across arithmetic, commonsense, and symbolic reasoning datasets; for example, T5 XXL on the GSM8K dataset improves from 8.11% to 21.99% accuracy when fine-tuned on PaLM-540B-generated CoT.

Large Language Models Are Reasoning Teachers

Large language models (LLMs) have shown impressive performance on various downstream tasks by pre-training on a large corpus and then fine-tuning on specific tasks. However, the efficacy of prompt-based chain-of-thought (CoT) methods is limited to extremely large LLMs, whose computational requirements and inference costs make them impractical for many real-world applications. [12] therefore revisits fine-tuning as a way to enable complex reasoning in smaller LMs optimized for specific tasks. By leveraging very large LLMs to generate reasoning samples and teach smaller models via fine-tuning, the proposed approach, Fine-tune-CoT, enables smaller models to perform complex reasoning tasks while reducing the reliance on prohibitively large models.

Fine-tune-CoT enables smaller language models (LMs) to perform complex reasoning tasks by leveraging the capabilities of larger LMs. The method involves generating chain-of-thought (CoT) reasoning samples using a large teacher model, filtering and curating them into prompt-completion pairs, and then fine-tuning smaller student models using these samples. The most recent Zero-shot-CoT prompting method is used to generate the CoT reasoning samples from the teacher models, which eliminates the need for hand-annotated reasoning explanations. The filtering process involves comparing the final prediction of the teacher model with the ground-truth answer, and only retaining instances where they match. The curated reasoning samples are then used to fine-tune the smaller student models using the autoregressive language modeling objective.
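The generate-filter-curate step might be sketched as follows; teacher_complete and parse_final_answer are assumed helpers (a call to the large teacher model and an answer extractor), and the prompt-completion delimiters are illustrative rather than the paper's exact format:

```python
# Sketch of Fine-tune-CoT data generation [12]: the teacher produces
# rationales via Zero-shot-CoT, samples with wrong final answers are
# dropped, and the rest are curated into prompt-completion pairs.

REASONING_TRIGGER = "Let's think step by step."

def generate_finetune_cot_pairs(dataset, teacher_complete, parse_final_answer):
    """dataset: iterable of (question, gold_answer) pairs.
    teacher_complete: callable prompt -> generated text from the teacher.
    parse_final_answer: callable generated text -> the teacher's final answer."""
    pairs = []
    for question, gold_answer in dataset:
        # Zero-shot-CoT: elicit a reasoning chain without hand-annotated examples.
        rationale = teacher_complete(f"Q: {question}\nA: {REASONING_TRIGGER}")
        # Filtering: retain only samples whose prediction matches the ground truth.
        if parse_final_answer(rationale) == gold_answer:
            pairs.append({
                "prompt": f"{question}\n\n###\n\n",
                "completion": f" {rationale} The answer is {gold_answer}.",
            })
    return pairs
```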

Figure 13 Detailed overview of the proposed Fine-tune-CoT method.

To improve the sample efficiency of Fine-tune-CoT, the method can be extended with diverse reasoning, which generates multiple reasoning paths for each training sample. This is achieved by using a stochastic sampling strategy to obtain D distinct generations for a given sample. The diverse reasoning samples are then curated and fed into the student model for fine-tuning. The authors note that multiple reasoning paths can be used to solve complex tasks, and the diversity in reasoning paths and linguistic templates can substantially aid in fine-tuning for complex reasoning.
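In code, diverse reasoning amounts to replacing the single greedy generation above with D stochastic samples per question, keeping every correct one. Here teacher_sample and parse_final_answer are again assumed helpers, and the temperature value is an illustrative choice:

```python
# Sketch of the diverse-reasoning extension: D sampled rationales per
# question instead of one, with the same correctness filter as before.

def diverse_reasoning_pairs(question, gold_answer, teacher_sample,
                            parse_final_answer, D=8, temperature=0.7):
    """teacher_sample: callable (prompt, temperature) -> one sampled rationale."""
    pairs = []
    for _ in range(D):
        # Stochastic decoding yields distinct reasoning paths and phrasings
        # for the same problem, which is what makes the samples "diverse".
        rationale = teacher_sample(
            f"Q: {question}\nA: Let's think step by step.", temperature
        )
        if parse_final_answer(rationale) == gold_answer:
            pairs.append({
                "prompt": question,
                "completion": f"{rationale} The answer is {gold_answer}.",
            })
    return pairs
```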

The authors evaluate Fine-tune-CoT across various tasks and model sizes using publicly available GPT-3 models. The fine-tuning approach elicits notable reasoning performance in small models on complex tasks where previous prompt-based methods achieve near-random performance, and small models under Fine-tune-CoT can even outperform their very large teachers on some tasks. With diverse reasoning, Fine-tune-CoT proves highly scalable, reaching high sample efficiency and notable reasoning performance even with few training examples. Through thorough sample studies and ablations across a multitude of datasets, the authors also shed light on important nuances of fine-tuning on CoT reasoning that previous work had not considered.

Multi-modal CoT

Recent advances in language models have produced remarkable performance on complex reasoning tasks by using chain-of-thought (CoT) prompting to generate intermediate reasoning steps. However, existing CoT studies have focused on the language modality alone, neglecting the potential benefits of incorporating multimodal information such as vision and audio. To address this gap, MM-CoT [13] proposes a multimodal reasoning approach that integrates the language and vision modalities in a two-stage framework: leveraging CoT, the model generates intermediate reasoning steps as rationales and then uses them to infer the final answer.

Figure 14 MM-CoT

Specifically, in the first stage MM-CoT takes both language and vision inputs and generates intermediate reasoning steps, or rationales, using CoT reasoning. In the second stage, the generated rationale is appended to the original language input to infer the final answer. To enhance the interaction between modalities, smaller language models are fine-tuned with fused multimodal features, allowing the model to incorporate both language and vision information and improving accuracy on complex reasoning tasks.

MM-CoT achieves state-of-the-art performance on the ScienceQA benchmark, surpassing the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points and even surpassing human performance. Its contributions include being the first study of CoT reasoning across different modalities, a two-stage framework for fine-tuning language models to fuse vision and language representations for multimodal reasoning, and state-of-the-art performance on ScienceQA.
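The two-stage inference loop can be sketched as below. The rationale_model and answer_model stand in for the two fine-tuned multimodal seq2seq models, and their generate(text=..., vision=...) interface is an assumption for illustration, not the paper's actual API:

```python
# Sketch of MM-CoT's two-stage inference [13].

def mm_cot_infer(rationale_model, answer_model, text_input, image_features):
    """Stage 1 generates a rationale from fused language + vision features;
    stage 2 appends it to the language input and infers the answer."""
    # Stage 1: rationale generation conditioned on both modalities.
    rationale = rationale_model.generate(text=text_input, vision=image_features)
    # Stage 2: answer inference, with the generated rationale appended to
    # the language input and the vision features fused in again.
    augmented = f"{text_input}\nRationale: {rationale}"
    answer = answer_model.generate(text=augmented, vision=image_features)
    return rationale, answer
```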

Reference List:

[1] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

[2] Large Language Models are Zero-Shot Reasoners

[3] Guiding Pretraining in Reinforcement Learning with Large Language Models

[4] Visual Classification via Description from Large Language Models

[5] Self-Consistency Improves Chain of Thought Reasoning in Language Models

[6] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

[7] Automatic Chain of Thought Prompting in Large Language Models

[8] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

[9] Complexity-Based Prompting for Multi-Step Reasoning

[10] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

[11] Teaching Small Language Models to Reason

[12] Large Language Models Are Reasoning Teachers

[13] Multimodal Chain-of-Thought Reasoning in Language Models
