Evaluating LLMs for Code Generation

Aakash Goel
Published in NLP Experiment
Dec 29, 2023


Codex Quick takeaway

Codex Data Collection for Pretraining

Codex Data Collection for Fine-Tuning

Why didn’t Codex use the GPT-3 model checkpoint?

Evaluation through Unit tests
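
To make the idea concrete, here is a minimal sketch of unit-test-based evaluation, assuming each generated sample is a Python function and the tests are plain `assert` statements. The helper names (`passes_unit_tests`, `pass_at_k`) and the toy `add` problem are hypothetical and chosen only for illustration; `pass_at_k` follows the estimator 1 - C(n-c, k) / C(n, k) described in the Codex paper listed in the references. The `exec` calls here are not sandboxed, so this should only be run on trusted code.

```python
from math import comb


def passes_unit_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate defines code that passes every assert.

    Illustration only: exec() is NOT a sandbox; real harnesses isolate
    untrusted model output.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the generated function
        exec(test_src, namespace)       # run the asserts against it
    except Exception:
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k samples contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical model samples for the prompt "add two numbers".
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",  # buggy sample
    "def add(a, b):\n    return b + a\n",
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

num_correct = sum(passes_unit_tests(src, tests) for src in samples)
print(num_correct, pass_at_k(len(samples), num_correct, 1))
```

A sample counts as correct only if every assertion passes, so the score reflects behaviour rather than textual similarity to a reference solution.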

Decoding Model Output



Note: All these models are only pre-trained and not fine-tuned on competitive coding problem datasets.

Codex Loss Analysis

Higher temperatures are better when the number of samples is large.
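
One way to see why is to look at what temperature does to the next-token distribution. The sketch below is a plain-Python softmax with a temperature parameter; the logits are made-up values, and `softmax_with_temperature` is a hypothetical helper, not part of any Codex API. A higher temperature flattens the distribution, so repeated sampling yields more diverse candidates, which plausibly helps when many samples are drawn and only one needs to pass.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]

low_temp_probs = softmax_with_temperature(logits, 0.2)
high_temp_probs = softmax_with_temperature(logits, 1.0)

# At the lower temperature almost all mass sits on the top token;
# at the higher temperature the alternatives get a real chance.
print([round(p, 3) for p in low_temp_probs])
print([round(p, 3) for p in high_temp_probs])
```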

Evaluation on the APPS dataset

Analysis of Model Size vs. Pass Rate

Selecting one solution to evaluate from the different solutions produced by temperature sampling

Higher BLEU score != Functional correctness
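
A small, hedged illustration of this point: the sketch below uses clipped bigram precision over whitespace tokens as a simplified stand-in for BLEU (it is not the full BLEU metric), and the `is_even` programs are invented for the example. The candidate that overlaps most with the reference is the one that is functionally wrong.

```python
from collections import Counter


def bigram_precision(candidate: str, reference: str) -> float:
    """Clipped bigram precision; a simplified proxy for BLEU, not BLEU itself."""
    cand = candidate.split()
    ref = reference.split()
    cand_bigrams = Counter(zip(cand, cand[1:]))
    ref_bigrams = Counter(zip(ref, ref[1:]))
    total = sum(cand_bigrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(n, ref_bigrams[bg]) for bg, n in cand_bigrams.items())
    return matched / total


def is_functionally_correct(source: str) -> bool:
    """Run the candidate against two hand-written checks (trusted code only)."""
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["is_even"](4) is True and namespace["is_even"](3) is False
    except Exception:
        return False


reference = "def is_even(n):\n    return n % 2 == 0\n"
wrong_but_similar = "def is_even(n):\n    return n % 2 == 1\n"
right_but_different = "def is_even(n):\n    return not n & 1\n"

overlap_wrong = bigram_precision(wrong_but_similar, reference)
overlap_right = bigram_precision(right_but_different, reference)
wrong_ok = is_functionally_correct(wrong_but_similar)
right_ok = is_functionally_correct(right_but_different)

print(overlap_wrong, wrong_ok)
print(overlap_right, right_ok)
```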

Failure with Synthetic Prompts

Docstring Generation


  1. Paper (Evaluating Large Language Models Trained on Code) — https://arxiv.org/pdf/2107.03374.pdf
  2. Evaluating Large Language Models Trained on Code — https://www.youtube.com/watch?v=1hJdBNYTNmQ&t=1071s
  3. https://youtu.be/QJq9RTp_OVE
  4. https://dev.to/jamescalam/three-decoding-methods-in-nlp-5f99
  5. https://neal-lathia.medium.com/evaluating-llms-trained-on-code-bb2bdab3cb37