E9 : Specialising Smaller Language Models
3 min read · Aug 27, 2023
Targeting the capability of small language models towards a specific task rather than generic tasks yields promising results.
Paper Name : Specializing Smaller Language Models towards Multi-Step Reasoning
Paper URL : https://arxiv.org/abs/2301.12726
Authors : Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot
Please find the annotated paper here
Problem Statement :
- LLMs have high modelling abilities, but that capacity is spread over a wide variety of tasks.
- Small language models have comparatively weaker modelling abilities.
- Reasoning is hypothesised to be an emergent ability of LLMs, since small language models show almost flat performance on reasoning tasks.
Solution :
- Enhance the modelling ability of small language models by concentrating them on a specific task rather than a generic task.
- For example, rather than tuning small language models for reasoning in general, tuning them for math-specific reasoning improves their performance to levels matching LLM performance.
- The teacher model was used to generate the CoT and answers for the GSM8K dataset.
- Each question was sampled 40 times, and only the CoTs that led to the correct final answer were kept.
- These selected CoT-and-answer pairs were used to fine-tune the student model.
- During fine-tuning, four input formats were considered: in-context answer-only, in-context CoT and answers, zero-shot answer-only, and zero-shot CoT.
- Distribution matching was used as the training objective for the distillation process.
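The sample-and-filter step above can be sketched as follows. This is a minimal sketch, not the paper's code: the helper names, the `sample_fn` callback, and the "The answer is …" completion format are illustrative assumptions.

```python
import re

def extract_answer(cot):
    """Pull the final numeric answer out of a generated chain of thought.
    Assumes (illustratively) that the CoT ends with 'The answer is 42'."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", cot, re.IGNORECASE)
    return match.group(1) if match else None

def filter_correct_cots(question, gold_answer, sample_fn, n_samples=40):
    """Sample the teacher n_samples times per question (40 in the paper)
    and keep only the chains of thought whose final answer matches the
    gold label; the survivors become student fine-tuning data."""
    kept = []
    for _ in range(n_samples):
        cot = sample_fn(question)  # one teacher completion
        if extract_answer(cot) == gold_answer:
            kept.append({"question": question, "cot": cot, "answer": gold_answer})
    return kept
```

Filtering on the gold label is what lets a sampled teacher produce clean supervision: incorrect reasoning chains are simply discarded rather than corrected.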
Experimentation :
- code-davinci-002 was used as the teacher model to generate the training data for model fine-tuning.
- A family of pre-trained T5 models and instruction fine-tuned T5 models (FlanT5) was fine-tuned.
- Math reasoning datasets considered: GSM8K, MultiArith, ASDiv, SVAMP.
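The distribution-matching objective used for this distillation can be sketched as a per-token KL divergence between the teacher's output distribution and the student's. The sketch below is an assumption-laden illustration in numpy (not the paper's implementation): it presumes the teacher's per-token probabilities were saved during generation, and the shapes and names are invented for clarity.

```python
import numpy as np

def distribution_matching_loss(student_logits, teacher_probs, mask):
    """Per-token KL(teacher || student), averaged over non-padding tokens.

    student_logits: (batch, seq_len, vocab) raw logits from the student
    teacher_probs:  (batch, seq_len, vocab) per-token distributions
                    assumed saved from the teacher's sampling run
    mask:           (batch, seq_len) 1.0 for real tokens, 0.0 for padding
    """
    # Numerically stable log-softmax of the student logits.
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_q = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # KL(p || q) = sum_v p_v * (log p_v - log q_v); eps guards log(0).
    kl = (teacher_probs * (np.log(teacher_probs + 1e-12) - log_q)).sum(axis=-1)
    return (kl * mask).sum() / mask.sum()
```

When the student's distribution matches the teacher's exactly, the loss is zero; any divergence on any token pushes it above zero, which is what makes it a distribution-level (rather than hard-label) distillation signal.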
Observations :
- Specific task vs generic task
  - Specialisation improved the performance of small language models on the target task, but gradually reduced their performance on generic tasks.
  - Specialised FlanT5-L (760M) was able to outperform UL2 (20B) and even LaMDA (137B).
  - Specialised FlanT5-XXL (11B) was able to outperform LaMDA (137B) and almost matched PaLM (60B).
  - Specialised FlanT5 models showed log-linear scaling of math reasoning performance, rather than reasoning appearing only as an emergent ability.
  - With target-specific fine-tuning, reasoning can thus be achieved even in small language models.
- Raw pre-trained models vs instruction fine-tuned models
  - Specialised FlanT5 models showed better performance than specialised raw pre-trained T5 models.
- In-distribution vs out-of-distribution (OOD) held-out sets
  - Good performance on the in-distribution dataset (GSM8K) does not guarantee good performance on the OOD datasets (MultiArith, SVAMP, ASDiv), and vice versa.
- In-context vs zero-shot examples as input format
  - Fine-tuning small language models with in-context examples helped them generalise to zero-shot inputs as well during validation.
  - On the other hand, fine-tuning small language models with zero-shot examples caused them to lose their in-context learning ability.
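The in-context vs zero-shot input formats compared above can be illustrated with a small prompt builder. The exact templates, the `fmt` names, and the demo structure here are assumptions for illustration, not the paper's actual prompts.

```python
def build_prompt(question, fmt, demos=None):
    """Assemble a training input under one of the four formats studied:
    'in_context_cot', 'in_context_answer_only', 'zero_shot_cot',
    'zero_shot_answer_only'. `demos` holds (question, cot, answer)
    exemplar tuples; templates here are illustrative, not the paper's.
    (For the *_cot formats the training target would also include the
    chain of thought; for *_answer_only, just the final answer.)"""
    parts = []
    if fmt.startswith("in_context"):
        for q, cot, a in demos:
            body = f"{cot} The answer is {a}." if "cot" in fmt else f"The answer is {a}."
            parts.append(f"Question: {q}\nAnswer: {body}")
    # Zero-shot formats prepend no exemplars at all.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)
```

Mixing the in-context formats into training is what preserved the students' in-context learning ability; training only on the zero-shot forms is exactly the setting in which that ability was lost.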
Conclusion :
- Distillation of knowledge from LLMs to SLMs is a major area of future research.
- Infusing specific target abilities into SLMs can thus help them perform better than tuning them on generic abilities.
- Given the size of LLMs and their compute-intensive nature, SLMs would definitely be a go-to solution for practical applications in organisations.