E7 : Reasoning in Small Language Models

Praveen Thenraj
Research Papers Summarized
3 min read · Jul 22, 2023


Transferring the acquired reasoning capabilities of large language models into small language models

Paper Name : Distilling Reasoning Capabilities into Smaller Language Models

Paper URL : https://arxiv.org/abs/2212.00193

Authors : Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan (Department of Computer Science, ETH Zurich)

Conference : Findings of ACL 2023

Please find the annotated paper here

Problem Statement :

  1. Though reasoning in LLMs is feasible these days with techniques like chain-of-thought (CoT) prompting, reasoning quality improves at the cost of model size: the bigger the LLM, the better its reasoning capabilities.
  2. CoT generates a rationale for arriving at an answer rather than breaking a complex problem into simpler subproblems and generating solutions to them, as illustrated below.
  3. Small language models (SLMs), on the other hand, are not yet good reasoners without fine-tuning.
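
To make the distinction in point 2 concrete, here is a hypothetical GSM8K-style problem with the two kinds of supervision side by side (the problem and both continuations are illustrative, not taken from the paper):

```python
# Hypothetical GSM8K-style problem (illustrative, not from the paper).
problem = "A baker made 24 muffins and sold 3 boxes of 4 muffins each. How many are left?"

# CoT: a single free-form rationale that runs straight to the answer.
cot_rationale = (
    "The baker sold 3 * 4 = 12 muffins. "
    "24 - 12 = 12 muffins are left. The answer is 12."
)

# Decomposition (least-to-most style): explicit subquestions, each answered in turn.
subquestions = [
    ("How many muffins were sold?", "3 * 4 = 12."),
    ("How many muffins are left?", "24 - 12 = 12."),
]
```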

Solution :

  • The approach is essentially an amalgamation of least-to-most prompting and the distilling step-by-step technique.
  • A complex problem is broken into a list of subquestions, and answers to those subquestions are either generated by an LLM using the CoT technique or taken from available human-annotated subquestions and solutions.
  • Two strategies for distillation are proposed - unified and iterative.
  • Unified - a single small language model generates both the subquestions and the solutions to those subquestions.
  • Iterative - two separate small language models, one (problem decomposer) generating subquestions and another (subproblem solver) generating solutions to those subquestions iteratively (a minimal sketch follows this list).
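
A minimal sketch of the iterative strategy at inference time, assuming a decomposer/solver pair. The checkpoints, prompt formats, and stopping convention here are assumptions for illustration; in the paper both roles are played by fine-tuned GPT-2 family models:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
decomposer = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the fine-tuned problem decomposer
solver = AutoModelForCausalLM.from_pretrained("gpt2")      # stand-in for the fine-tuned subproblem solver

def generate(model, prompt, max_new_tokens=40):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True).strip()

def solve_iteratively(problem, max_steps=5):
    context = problem
    for _ in range(max_steps):
        subq = generate(decomposer, context + "\nNext subquestion:")
        if "final answer" in subq.lower():  # assumed stopping convention
            break
        suba = generate(solver, context + "\n" + subq + "\nAnswer:")
        context += f"\n{subq}\n{suba}"      # condition the next step on all prior Q/A pairs
    return context
```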

Experimentation :

  • The approach was tested on three reasoning benchmark datasets - GSM8K, StrategyQA, and SVAMP.
  • GPT-3 (175B) was used as the teacher model to generate subquestions or solutions, depending on the scenario.
  • The family of GPT-2 models (small - 124M, medium - 355M, large - 774M, XL - 1.5B) was fine-tuned using this approach and tested on these benchmark datasets (a fine-tuning sketch follows this list).
  • A100 GPUs with 40 GB memory were used for fine-tuning.
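
A minimal sketch of how the student fine-tuning could look with the Hugging Face Trainer, training a GPT-2 student on sequences that concatenate the problem, teacher-generated subquestions, and solutions. The data format, field names, and hyperparameters are assumptions for illustration, not the paper's exact setup:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tok = AutoTokenizer.from_pretrained("gpt2-large")  # the 774M student
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

# Each record concatenates problem, subquestions, and solutions into one sequence.
records = [{"text": "Problem: ... SQ1: ... A1: ... SQ2: ... A2: ... Final answer: ..."}]
ds = Dataset.from_list(records).map(
    lambda r: tok(r["text"], truncation=True, max_length=512), remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="soc-student", per_device_train_batch_size=4,
                           num_train_epochs=3, fp16=True),  # fp16 assumes a GPU, e.g. an A100
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # standard causal LM loss
)
trainer.train()
```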

Observations :

  • SLMs (GPT-2 family models) were fine-tuned under different training data settings:
    - SOC (CoT) - subquestions and solutions both generated by the LLM
    - SOC (GT) - subquestions generated by the LLM, paired with ground truth solutions to those subquestions (GSM8K) / ground truth facts (StrategyQA) / ground truth final answers (SVAMP)
  • Unified and iterative fine-tuning strategies were used to fine-tune the small language models.
  • The loss function of the unified approach conditions the final solution on the problem itself, the previously generated subquestions, and their solutions (a plausible formalization is sketched after this list).
  • The loss function of the iterative approach conditions the final solution on the problem itself, the previous subquestion generated by the question model, and its solution generated by the answering model.
  • GPT-2 family models fine-tuned with unified SOC (CoT) performed roughly on par with the same models fine-tuned on ground truth (human-annotated) step-by-step data/facts/final answers.
  • GPT-2 family models fine-tuned with iterative SOC (CoT) outperformed the same models fine-tuned on ground truth (human-annotated) data on the GSM8K and SVAMP datasets.
  • GPT-2 family models fine-tuned with iterative SOC (GT) outperformed the same models fine-tuned on ground truth (human-annotated) data on the GSM8K and StrategyQA datasets.
  • SOC (GT) performs better than SOC (CoT) because the supervision in SOC (CoT) is entirely LLM-generated, which is noisier than the human-annotated ground truth used in SOC (GT).
  • GPT-2 large (774M) emerged as the winner under most conditions compared to the other models.
  • The subquestions generated during fine-tuning of the small language models were not very accurate. Hence, a guidance mechanism was added: a seq-to-seq model trained to generate equations, whose output is used to condition the subquestion generation.
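
One plausible formalization of the two training objectives described above (the notation is assumed here, not copied from the paper): let p be the problem and (q_i, s_i) the i-th subquestion-solution pair.

```latex
% A plausible formalization; notation is assumed, not copied from the paper.
% Unified: one student with parameters \theta models subquestions and solutions
% jointly, each conditioned on the problem and everything generated so far.
\mathcal{L}_{\text{unified}} = -\sum_{i=1}^{k} \log P_{\theta}\left(q_i, s_i \mid p, q_{<i}, s_{<i}\right)

% Iterative: a question model (\phi) and an answer model (\psi) are trained
% separately, each conditioned on the problem and the previous pairs.
\mathcal{L}_{\text{iterative}} = -\sum_{i=1}^{k}\left[\log P_{\phi}\left(q_i \mid p, q_{<i}, s_{<i}\right) + \log P_{\psi}\left(s_i \mid p, q_{\leq i}, s_{<i}\right)\right]
```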

Limitations :

  • The solution is somewhat complex, which raises the question of whether the fine-tuning approach taken is effective enough to transfer reasoning capabilities.
  • Given the complexity of the approach, the results are still well behind LLMs; they could be improved further with better techniques for selecting the examples used to prompt LLMs to generate subquestions.

Conclusion :

  • The approach tries to distil the reasoning capabilities of an LLM into an SLM, and it succeeds up to a certain extent.
  • With evolving techniques like CoT, Active CoT, and ToT, the approach can be tweaked further to achieve better results.
