E5 : Active Prompting with CoT for LLM

Praveen Thenraj
Research Papers Summarized
4 min read · Jun 24, 2023

--

Using an uncertainty measure to identify the examples to be annotated with CoT rationales enhances the reasoning capabilities of LLMs.

Paper Name : Active Prompting with Chain-of-Thought for Large Language Models

Paper URL : https://arxiv.org/abs/2302.12246

Authors : Shizhe Diao, Pengcheng Wang, Yong Lin, Tong Zhang

Please find the annotated paper here

Analogy: during exam preparation, identifying the questions we find most confusing and learning those first, rather than picking questions to study at random, helps us perform better.

Problem Statement :

  • Eliciting reasoning in LLMs has been a major challenge in spite of their growing size.
  • CoT prompting, a form of few-shot prompting, addresses this reasoning challenge with a simple yet powerful idea: break the complex problem into simple steps, much like how humans solve complex problems (a minimal prompt sketch follows this list).
  • But do the examples selected for CoT rationale generation actually represent the questions the model struggles to answer?
  • Previous methods do not follow any specific methodology to select the examples for CoT prompting.
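
For context, here is a minimal sketch of what a few-shot CoT prompt looks like: one or more exemplars annotated with step-by-step rationales, followed by the test question. The first exemplar is the classic one from the CoT paper; the test question and variable names are purely illustrative.

```python
# A minimal illustration (not from the paper) of a few-shot CoT prompt:
# each exemplar pairs a question with a step-by-step rationale, and the
# test question is appended at the end.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

# Hypothetical test question for illustration only.
test_question = "Q: A baker made 24 cookies and sold 15. How many are left?\nA:"

prompt = cot_exemplar + test_question
print(prompt)
```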

Uncertainty :

  • An uncertainty metric measures how uncertain an ML model is about its prediction.
  • Active learning is an approach that uses this idea of uncertainty to train ML models effectively even when labelled data is scarce.

Solution :

  • A pool of 1,000 questions is randomly sampled from the training set of each evaluation dataset.
  • The LLM is prompted to generate ‘k’ responses for each of these questions.
  • An uncertainty metric is then computed over the ‘k’ answers of each question.
  • The authors use four uncertainty metrics — disagreement, entropy, variance, self-confidence.
  • Disagreement - (number of unique answers among the ‘k’ answers of a question) / k. A higher number of unique answers implies a higher disagreement value, which in turn means higher model uncertainty (see the sketch after this list).
  • Self-confidence - asking the LLM itself to rate the confidence (very confident, confident, not confident, wrong answer) of the response it generated for each training question.
  • The LLM was consistently over-confident (biased) about its own responses, so self-confidence was not used.
  • The authors use disagreement and entropy for the main experiments.
  • The top ‘n’ most uncertain questions, ranked by the uncertainty metric, are then selected for human annotation.
  • These ‘n’ uncertain questions are annotated by humans with CoT rationales and then used as few-shot exemplars alongside each test question to generate the response.
Figure: disagreement-based uncertainty ranking identifies the questions to be annotated for CoT prompting, eliciting reasoning abilities in LLMs.
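
Below is a minimal sketch, not the authors' code, of how disagreement and entropy can be computed over the ‘k’ sampled answers and used to rank the most uncertain questions. The function names and the toy question pool are illustrative.

```python
import math
from collections import Counter

def disagreement(answers):
    """Disagreement u = h / k, where h is the number of unique answers
    among the k sampled answers for a question."""
    return len(set(answers)) / len(answers)

def entropy(answers):
    """Entropy of the empirical distribution over the k sampled answers."""
    counts = Counter(answers)
    k = len(answers)
    return -sum((c / k) * math.log(c / k) for c in counts.values())

def rank_uncertain_questions(question_to_answers, n, metric=disagreement):
    """Return the n questions whose sampled answers are most uncertain."""
    scored = [(metric(ans), q) for q, ans in question_to_answers.items()]
    scored.sort(reverse=True)
    return [q for _, q in scored[:n]]

# Toy usage: k = 5 sampled answers per question (hypothetical data).
pool = {
    "Q1: 12 + 7 * 3 ?": ["33", "33", "33", "33", "33"],    # consistent -> low uncertainty
    "Q2: ages puzzle ...": ["14", "16", "14", "12", "18"],  # scattered  -> high uncertainty
}
print(rank_uncertain_questions(pool, n=1))  # -> ['Q2: ages puzzle ...']
```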

Experimental Setup :

  • The reasoning ability of Active Prompting was evaluated on 8 benchmark datasets across three task types, namely arithmetic reasoning, commonsense reasoning, and symbolic reasoning: GSM8K, ASDiv, SVAMP, AQuA, SingleEq, CSQA, StrategyQA, and Letter.
  • code-davinci-002, text-davinci-002, and text-davinci-003 were used to verify the effectiveness of this approach.
  • The authors used the OpenAI APIs directly to measure the performance of this approach (a minimal sampling sketch follows this list).
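
The models queried in the paper (code-davinci-002, text-davinci-002/003) are legacy completion models that are no longer served, so the sketch below uses the current OpenAI chat completions client with a placeholder model name, purely to illustrate sampling ‘k’ answers per question at a non-zero temperature so that uncertainty can be measured.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample_k_answers(question: str, k: int = 10, model: str = "gpt-4o-mini") -> list[str]:
    """Sample k responses for one question at a non-zero temperature,
    so that the answers can disagree and uncertainty can be computed."""
    response = client.chat.completions.create(
        model=model,      # placeholder; the paper used code/text-davinci models
        messages=[{"role": "user",
                   "content": question + "\nA: Let's think step by step."}],
        n=k,              # k completions in a single call
        temperature=0.7,  # sampling, not greedy decoding
        max_tokens=256,
    )
    return [choice.message.content for choice in response.choices]
```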

Observations :

  • The results of Active Prompting were compared against other techniques for selecting the questions to be annotated: automatic selection (Auto-CoT), CoT with self-consistency, and random selection (Random CoT).
  • In most cases, Active Prompting outperformed all the other techniques.
  • Active Prompting achieved an average improvement of 7% and 1.8% over the self-consistency method with text-davinci-002 and code-davinci-002 respectively.
  • Active selection (entropy) combined with CoT using the code-davinci-002 (175B) model outperformed self-consistency-based CoT using PaLM (540B) almost every time, and by a larger margin.
  • This raises the question of whether eliciting reasoning really depends on the size of the LLM, as stated in the CoT paper, or on the examples chosen for prompting.
  • The performance improvements on the ASDiv, SVAMP, and SingleEq benchmarks were high, but still not as high as on the GSM8K and AQuA benchmarks.
  • This is believed to be because ASDiv, SVAMP, and SingleEq do not have training data, so the ‘n’ uncertain questions identified from the GSM8K training data were used to create the CoT rationales for these three datasets.
  • Hence, transferring chain-of-thought prompts from one task to another needs to be handled more effectively.
  • An ablation study was conducted to examine the effects of few-shot prompting, active selection, pool size, uncertainty metrics, and annotators.
  • The study shows a strong negative correlation between the uncertainty and the accuracy of the LLM, supporting the hypothesis that reducing model uncertainty improves performance (a toy check is sketched after this list).
  • Performance starts converging once the pool size (the number of ‘k’ responses generated by the LLM for computing the uncertainty metric) grows above 10, since a larger pool in turn reduces the number of ‘n’ uncertain questions identified.
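
A toy sketch of the kind of check behind that correlation claim; the per-question uncertainty and correctness values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical per-question scores: disagreement-based uncertainty and
# whether the model's answer to that question was correct (1) or not (0).
uncertainty = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
correct     = np.array([1,   1,   1,   0,   0,   0])

# Pearson correlation; a strongly negative value mirrors the paper's
# observation that higher uncertainty goes with lower accuracy.
r = np.corrcoef(uncertainty, correct)[0, 1]
print(f"correlation(uncertainty, accuracy) = {r:.2f}")
```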

Conclusion :

  • The paper provides a principled way to identify the most important questions to be annotated by humans for CoT, rather than selecting them randomly (Random CoT), automatically (Auto-CoT), or manually (Manual CoT).
  • code-davinci-002 (175B) with active-selection CoT prompting outperforming PaLM (540B) with self-consistency CoT prompting by a large margin points to a road ahead for adapting smaller language models to task-specific activities.
