Evaluating LLMs for Code Generation
Published Dec 29, 2023
Codex Quick takeaway
Codex Data Collection for Pretraining
Codex Data Collection for FineTuning
Why didn’t Codex use a GPT-3 model checkpoint?
Evaluation through Unit tests
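The Codex paper scores functional correctness with the pass@k metric: generate n ≥ k samples per problem, count the c samples that pass all unit tests, and compute an unbiased estimate of the probability that at least one of k samples is correct. A minimal sketch of that estimator, using the closed form 1 − C(n−c, k)/C(n, k) from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated per problem
    c: samples that pass all unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Too few failures to fill a k-sample draw with all-incorrect
        # completions, so at least one correct sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 1 correct out of 2 samples, budget of 1 → 0.5
print(pass_at_k(2, 1, 1))
```

Python's exact integer `comb` keeps this numerically safe here; the paper itself recommends evaluating the equivalent product form when combinatorial terms overflow floating point.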
Decoding Model Output
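Decoding here means turning the model's next-token logits into tokens, and the paper sweeps the sampling temperature to trade off diversity against per-sample quality. A toy sketch of temperature sampling (the logits are made-up values, not real model output):

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8):
    """Sample a token index from a softmax over scaled logits.

    temperature < 1 sharpens the distribution (closer to greedy argmax);
    temperature > 1 flattens it (more diverse samples).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Very low temperature → effectively greedy: picks the largest logit.
print(sample_with_temperature([10.0, 0.0, 0.0], temperature=0.05))
```

This diversity knob is what makes the "many samples, pick the best" evaluation strategy in the later sections possible.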
Evaluation
Results
Note: All these models are only pre-trained and not fine-tuned on competitive coding problem datasets.
Codex Loss Analysis
Higher temperatures are better when the number of samples is large.
Evaluation on the APPS dataset
Analysis of model size vs. pass rate
Selecting one solution to evaluate from multiple temperature-sampled solutions
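When only one submission is allowed, the Codex paper ranks the sampled completions heuristically and found mean token log-probability to work best among the rankings it tried. A sketch of that selection step; the `(text, token_logprobs)` pair structure is a hypothetical stand-in for whatever your sampling API returns:

```python
def select_best(samples):
    """Pick the completion with the highest mean token log-probability.

    samples: list of (completion_text, token_logprobs) pairs, where
    token_logprobs is a non-empty list of per-token log-probs
    (hypothetical structure, not a specific API).
    """
    def mean_logprob(pair):
        _, logprobs = pair
        return sum(logprobs) / len(logprobs)

    return max(samples, key=mean_logprob)[0]

candidates = [
    ("return n % 2 == 1", [-1.0, -1.2, -0.9]),  # mean ≈ -1.03
    ("return n % 2 == 0", [-0.1, -0.2, -0.1]),  # mean ≈ -0.13 → wins
]
print(select_best(candidates))
```

Using the mean rather than the sum avoids biasing the ranking toward shorter completions.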
Higher BLEU score != Functional correctness
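The failure mode behind this heading is easy to demonstrate: two completions can share almost all their tokens yet behave oppositely. The snippet below uses a crude unigram-precision stand-in for BLEU (real BLEU clips counts and combines higher-order n-grams, but the point survives the simplification):

```python
def unigram_precision(reference: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that appear in the reference —
    a crude stand-in for BLEU-1, for illustration only."""
    ref_toks = reference.split()
    hyp_toks = hypothesis.split()
    return sum(t in ref_toks for t in hyp_toks) / len(hyp_toks)

ref = "def is_even(n): return n % 2 == 0"
hyp = "def is_even(n): return n % 2 == 1"  # one token changed, logic inverted

# High surface similarity (7 of 8 tokens match) ...
print(unigram_precision(ref, hyp))  # → 0.875
# ... yet the hypothesis fails every unit test the reference passes,
# which is why the paper evaluates with pass@k instead of BLEU.
```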
Failure with Synthetic Prompts
Docstring Generation
References
- Paper (Evaluating Large Language Models Trained on Code) — https://arxiv.org/pdf/2107.03374.pdf
- Evaluating Large Language Models Trained on Code — https://www.youtube.com/watch?v=1hJdBNYTNmQ&t=1071s
- https://youtu.be/QJq9RTp_OVE
- Three Decoding Methods in NLP — https://dev.to/jamescalam/three-decoding-methods-in-nlp-5f99
- Evaluating LLMs Trained on Code — https://neal-lathia.medium.com/evaluating-llms-trained-on-code-bb2bdab3cb37