E9 : Specialising Smaller Language Models
3 min read · Aug 27, 2023
Targeting the capability of small language models towards a specific task rather than generic tasks yields promising results.
Paper Name : Specializing Smaller Language Models towards Multi-Step Reasoning
Paper URL : https://arxiv.org/abs/2301.12726
Authors : Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot
Please find the annotated paper here
Problem Statement :
- LLMs have high modelling abilities, but that capacity is spread over a wide variety of tasks.
- Small language models have comparatively weaker modelling abilities.
- Reasoning is hypothesised to be an emergent ability of LLMs, since small language models show almost flat performance on reasoning tasks.
Solution :
- Enhance the modelling ability of small language models by concentrating them on a specific task rather than a generic task.
- For example, rather than tuning small language models for reasoning in general, tuning them for math-specific reasoning improves their performance to levels matching LLM performance.
- The teacher model was used to generate the CoT and answers for the GSM8K dataset.
- Each question was sampled 40 times, and only the CoTs that led to the correct final answer were kept.
- These selected CoT-and-answer pairs were used to fine-tune the student model.
- During fine-tuning, four input formats were considered: in-context answer-only, in-context CoT and answers, zero-shot answer-only, and zero-shot CoT.
- Distribution matching was used as the training objective for the distillation process.
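The sample-and-filter step above can be sketched as follows. This is a minimal sketch, not the paper's code: the helper names, the `sample_fn` callback, and the "The answer is …" completion format are illustrative assumptions.

```python
import re

def extract_answer(cot):
    """Pull the final numeric answer out of a generated chain of thought.
    Assumes (illustratively) that the CoT ends with 'The answer is 42'."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", cot, re.IGNORECASE)
    return match.group(1) if match else None

def filter_correct_cots(question, gold_answer, sample_fn, n_samples=40):
    """Sample the teacher n_samples times per question (40 in the paper)
    and keep only the chains of thought whose final answer matches the
    gold label; the survivors become student fine-tuning data."""
    kept = []
    for _ in range(n_samples):
        cot = sample_fn(question)  # one teacher completion
        if extract_answer(cot) == gold_answer:
            kept.append({"question": question, "cot": cot, "answer": gold_answer})
    return kept
```

Filtering on the gold label is what lets a sampled teacher produce clean supervision: incorrect reasoning chains are simply discarded rather than corrected.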
Experimentation :
- code-davinci-002 was used as the teacher model to generate the training data for model fine-tuning.
- A family of pre-trained T5 models and instruction fine-tuned T5 models (FlanT5) was fine-tuned.
- Math reasoning datasets considered: GSM8K, MultiArith, ASDiv, SVAMP.
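The distribution-matching objective used for this distillation can be sketched as a per-token KL divergence between the teacher's output distribution and the student's. The sketch below is an assumption-laden illustration in numpy (not the paper's implementation): it presumes the teacher's per-token probabilities were saved during generation, and the shapes and names are invented for clarity.

```python
import numpy as np

def distribution_matching_loss(student_logits, teacher_probs, mask):
    """Per-token KL(teacher || student), averaged over non-padding tokens.

    student_logits: (batch, seq_len, vocab) raw logits from the student
    teacher_probs:  (batch, seq_len, vocab) per-token distributions
                    assumed saved from the teacher's sampling run
    mask:           (batch, seq_len) 1.0 for real tokens, 0.0 for padding
    """
    # Numerically stable log-softmax of the student logits.
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_q = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # KL(p || q) = sum_v p_v * (log p_v - log q_v); eps guards log(0).
    kl = (teacher_probs * (np.log(teacher_probs + 1e-12) - log_q)).sum(axis=-1)
    return (kl * mask).sum() / mask.sum()
```

When the student's distribution matches the teacher's exactly, the loss is zero; any divergence on any token pushes it above zero, which is what makes it a distribution-level (rather than hard-label) distillation signal.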
Observations :
- Specific task vs generic task
  - Specialisation improved the performance of small language models on the target task, but gradually reduced their performance on generic tasks.
  - Specialised FlanT5-L (760M) was able to outperform UL2 (20B) and even LaMDA (137B).
  - Specialised FlanT5-XXL (11B) was able to outperform LaMDA (137B) and almost matched PaLM (60B).
  - Specialised FlanT5 models showed log-linear scaling of math reasoning performance, rather than reasoning appearing only as an emergent ability.
  - With target-specific fine-tuning, reasoning can thus be achieved even in small language models.
- Raw pre-trained models vs instruction fine-tuned models
  - Specialised FlanT5 models showed better performance than specialised raw pre-trained T5 models.
- In-distribution vs out-of-distribution (OOD) held-out sets
  - Good performance on the in-distribution dataset (GSM8K) does not guarantee good performance on the OOD datasets (MultiArith, SVAMP, ASDiv), and vice versa.
- In-context vs zero-shot examples as input format
  - Fine-tuning small language models with in-context examples helped them generalise to zero-shot inputs as well during validation.
  - On the other hand, fine-tuning small language models with zero-shot examples caused them to lose their in-context learning ability.
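The in-context vs zero-shot input formats compared above can be illustrated with a small prompt builder. The exact templates, the `fmt` names, and the demo structure here are assumptions for illustration, not the paper's actual prompts.

```python
def build_prompt(question, fmt, demos=None):
    """Assemble a training input under one of the four formats studied:
    'in_context_cot', 'in_context_answer_only', 'zero_shot_cot',
    'zero_shot_answer_only'. `demos` holds (question, cot, answer)
    exemplar tuples; templates here are illustrative, not the paper's.
    (For the *_cot formats the training target would also include the
    chain of thought; for *_answer_only, just the final answer.)"""
    parts = []
    if fmt.startswith("in_context"):
        for q, cot, a in demos:
            body = f"{cot} The answer is {a}." if "cot" in fmt else f"The answer is {a}."
            parts.append(f"Question: {q}\nAnswer: {body}")
    # Zero-shot formats prepend no exemplars at all.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)
```

Mixing the in-context formats into training is what preserved the students' in-context learning ability; training only on the zero-shot forms is exactly the setting in which that ability was lost.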
Conclusion :
- Distillation of knowledge from LLMs to SLMs is a major area of future research.
- Infusing specific target abilities into SLMs can thus help them perform better than tuning them on generic abilities.
- Given the size of LLMs and their compute-intensive nature, SLMs would definitely be a go-to solution for practical applications in organisations.