
E9 : Specialising Smaller Language Models

Praveen Thenraj
Research Papers Summarized
3 min read · Aug 27, 2023


Targeting the capability of small language models towards a specific task rather than a generic task yields promising results

Paper Name : Specializing Smaller Language Models towards Multi-Step Reasoning

Paper URL : https://arxiv.org/abs/2301.12726

Authors : Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot

Please find the annotated paper here

Problem Statement :

  • LLMs have strong modelling abilities, but that capacity is spread over a wide variety of tasks.
  • Small language models have comparatively weaker modelling abilities.
  • Reasoning is hypothesised to be an emergent ability of LLMs, since small language models show almost flat performance on reasoning tasks.

Solution :

  • Enhancing the modelling ability of small language models by concentrating their capacity on a specific task rather than on generic tasks.
  • For example, rather than tuning small language models for reasoning in general, tuning them for math-specific reasoning lifts their performance towards LLM levels.
  • The teacher model was used to generate chain-of-thought (CoT) rationales and answers for the GSM8K dataset.
  • Each question was sampled 40 times, and only the CoT samples whose final answer was correct were kept (see the data-generation sketch after this list).
  • The selected CoT-answer pairs were used to fine-tune the student model.
  • During fine-tuning, four input formats were considered: in-context answer-only, in-context CoT and answer, zero-shot answer-only, and zero-shot CoT.
B - types of data formats used for fine-tuning small language models
  • Distribution matching was used as the training objective for the distillation process (a minimal loss sketch follows this list).
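
The CoT data-generation and filtering step above amounts to simple rejection sampling. The sketch below is illustrative, with a hypothetical sample_from_teacher callable standing in for the teacher model's decoding API and a simple last-number heuristic for reading off final answers:

    import re
    from typing import Callable

    def extract_final_answer(text: str) -> str | None:
        """Heuristic: take the last number in a rationale or gold answer."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return numbers[-1] if numbers else None

    def build_distillation_set(
        questions: list[dict],                      # each: {"question": str, "answer": str}
        sample_from_teacher: Callable[[str], str],  # hypothetical teacher sampler
        samples_per_question: int = 40,             # 40 samples per question, as in the paper
    ) -> list[dict]:
        kept = []
        for item in questions:
            for _ in range(samples_per_question):
                cot = sample_from_teacher(item["question"])
                # Keep a rationale only if its final answer matches the gold label.
                if extract_final_answer(cot) == extract_final_answer(item["answer"]):
                    kept.append({"question": item["question"], "cot": cot})
        return kept

For the training objective, distribution matching trains the student to match the teacher's per-token output distribution rather than only the sampled tokens. The sketch below assumes the teacher's (re-normalised) token probabilities have already been aligned to the student's tokenisation, something the full pipeline has to handle but this snippet does not:

    import torch
    import torch.nn.functional as F

    def distribution_matching_loss(
        student_logits: torch.Tensor,   # (batch, seq_len, vocab)
        teacher_probs: torch.Tensor,    # (batch, seq_len, vocab), rows sum to 1
        mask: torch.Tensor,             # (batch, seq_len), 1.0 for real target tokens
    ) -> torch.Tensor:
        """KL(teacher || student), averaged over non-padding target tokens."""
        log_q = F.log_softmax(student_logits, dim=-1)
        # Per-token KL divergence; the clamp avoids log(0) on zero-probability entries.
        kl = (teacher_probs * (teacher_probs.clamp_min(1e-12).log() - log_q)).sum(-1)
        return (kl * mask).sum() / mask.sum()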

Experimentation :

  • code-davinci-002 was used as the teacher model to generate the training data for model fine-tuning.
  • A family of raw pre-trained T5 models and instruction-tuned T5 models (FlanT5) were fine-tuned as student models (a minimal fine-tuning sketch follows this list).
  • Math reasoning datasets considered: GSM8K, MultiArith, ASDiv, SVAMP.
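
For reference, the simplest supervised variant of the student fine-tuning (plain cross-entropy on the distilled CoT-answer pairs, not the distribution-matching objective sketched earlier) could look roughly like this with Hugging Face Transformers; the checkpoint name and hyperparameters are illustrative:

    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def training_step(question: str, target: str) -> float:
        """One gradient step on a (question, CoT + answer) pair."""
        inputs = tokenizer(question, return_tensors="pt", truncation=True)
        labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
        outputs = model(**inputs, labels=labels)   # cross-entropy on the target tokens
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return outputs.loss.item()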

Observations :

  • Specific task vs. generic tasks
    - Specialisation improved the performance of small language models on the target task, but gradually reduced their performance on generic tasks.
    - Specialised FlanT5-Large (760M) was able to outperform UL2 (20B) and even LaMDA (137B).
    - Specialised FlanT5-XXL (11B) was able to outperform LaMDA (137B) and came close to PaLM (62B).
    - Specialised FlanT5 models showed log-linear scaling of math-reasoning performance with model size, rather than reasoning appearing only as an emergent ability at scale.
    - With task-specific fine-tuning, reasoning can therefore be achieved even in small language models.
  • Raw pre-trained models vs. instruction fine-tuned models
    - Specialised FlanT5 models performed better than specialised raw pre-trained T5 models.
    - Both specialised T5 and specialised FlanT5 show log-linear performance improvement on math reasoning, even at small model scales.
  • In-distribution vs. out-of-distribution (OOD) held-out sets
    - Good performance on the in-distribution dataset (GSM8K) does not guarantee good performance on the OOD datasets (MultiArith, SVAMP, ASDiv), and vice versa.
  • In-context vs. zero-shot input formats
    - Fine-tuning small language models with in-context examples helped them generalise to zero-shot inputs as well during validation (see the prompt-format sketch after this list).
    - On the other hand, fine-tuning with zero-shot examples caused small language models to lose their in-context learning ability.
A - SLM fine-tuned with in-context examples also displays zero-shot capability during validation. B - SLM fine-tuned with zero-shot examples loses in-context capability during validation.
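
The in-context vs. zero-shot distinction above comes down to how inputs are assembled at fine-tuning and validation time. A small sketch of the two families of formats; the template wording is illustrative, not the paper's exact prompt format:

    def build_prompt(
        question: str,
        demos: list[dict] | None = None,   # each: {"question": str, "cot": str, "answer": str}
        with_cot: bool = True,
    ) -> str:
        """In-context prompt when demos are given, zero-shot prompt otherwise."""
        parts = []
        for d in demos or []:
            answer = f"{d['cot']} The answer is {d['answer']}." if with_cot else f"The answer is {d['answer']}."
            parts.append(f"Question: {d['question']}\nAnswer: {answer}")
        parts.append(f"Question: {question}\nAnswer:")
        return "\n\n".join(parts)

    # In-context CoT format vs. zero-shot format for the same question:
    # build_prompt(q, demos=few_shot_examples, with_cot=True)
    # build_prompt(q, demos=None)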

Conclusion :

  • Distillation of knowledge from LLMs to SLMs will be a major area of future research.
  • Infusing specific target abilities into SLMs can thus help them perform better than tuning them on generic abilities.
  • SLMs could well become the go-to solution for practical applications in organisations, given the size of LLMs and their compute-intensive nature.
