Evaluating LLMs for Code Generation

Aakash Goel · Published in NLP Experiment · Dec 29, 2023

Codex Quick takeaway

Codex Data Collection for Pretraining

Codex Data Collection for FineTuning

Why didn't Codex use a GPT-3 model checkpoint?

Evaluation through Unit tests
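
The paper measures functional correctness by executing each sampled completion against hand-written unit tests (the HumanEval benchmark), inside a sandbox with timeouts and resource limits. Below is a minimal sketch of the idea with an illustrative problem and tests of my own; the real harness isolates untrusted code far more carefully than this.

```python
# Minimal sketch of unit-test-based evaluation (illustrative, not the paper's sandbox).
prompt = 'def incr_list(l):\n    """Return the list with every element incremented by 1."""\n'
completion = "    return [x + 1 for x in l]\n"

tests = """
def check(candidate):
    assert candidate([1, 2, 3]) == [2, 3, 4]
    assert candidate([]) == []
"""

def passes_unit_tests(prompt, completion, tests, entry_point="incr_list"):
    namespace = {}
    try:
        exec(prompt + completion, namespace)   # define the candidate function
        exec(tests, namespace)                 # define check()
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False                           # any failed assert or crash counts as a failure

print(passes_unit_tests(prompt, completion, tests))  # True for this completion
```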

Decoding Model Output
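
Rather than decoding greedily, Codex samples completions using temperature scaling together with nucleus (top-p = 0.95) sampling. Here is a small numpy sketch of both steps over a single vector of next-token logits; the logits are invented for illustration.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.95, rng=np.random.default_rng(0)):
    """Temperature + nucleus (top-p) sampling over one next-token logits vector."""
    # 1. Temperature scaling: higher T flattens the distribution -> more diverse samples.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # 2. Nucleus filtering: keep the smallest set of tokens whose total mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    nucleus = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=nucleus)

logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])  # toy next-token logits
print(sample_next_token(logits))               # index of the sampled token
```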

Evaluation
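
Correctness is reported as pass@k: draw n >= k samples per problem, count the c samples that pass the unit tests, and use the unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k) instead of naively taking the best of k. A sketch of the numerically stable form described in the paper:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    computed as a running product to avoid huge binomial coefficients."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples per problem, 13 of them pass the unit tests
print(pass_at_k(n=200, c=13, k=1))    # 0.065
print(pass_at_k(n=200, c=13, k=100))  # essentially 1
```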

Results

Note: all of these models are only pre-trained; they are not fine-tuned on competitive coding problem datasets.

Codex Loss Analysis

Higher temperatures are better when the number of samples is large.
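
The intuition: a low temperature concentrates probability on the model's single best guess (good for pass@1), while a higher temperature yields more diverse samples that occasionally crack problems the "safe" answer misses (good for pass@100). A toy illustration reusing the pass@k estimator above; the pass counts are invented purely to show the shape of the effect, they are not from the paper.

```python
import numpy as np

def pass_at_k(n, c, k):
    # Same unbiased pass@k estimator as in the earlier sketch.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Invented pass counts out of n = 200 samples for two problems (illustration only):
# at T=0.2 the model is confidently wrong on the harder problem (0 passes),
# at T=0.8 its more diverse samples occasionally solve it (5 passes).
n = 200
counts = {"T=0.2": [80, 0], "T=0.8": [40, 5]}

for temp, per_problem in counts.items():
    p1 = np.mean([pass_at_k(n, c, 1) for c in per_problem])
    p100 = np.mean([pass_at_k(n, c, 100) for c in per_problem])
    print(f"{temp}: pass@1 ~ {p1:.3f}, pass@100 ~ {p100:.3f}")
# Low temperature wins at pass@1, high temperature wins at pass@100.
```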

Evaluation on the APPS dataset

Analysis of Model Size vs. Pass Rate

Selection of a solution to evaluate from the multiple samples produced by temperature sampling
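
Without access to the unit tests at inference time, the paper ranks the temperature-sampled candidates with heuristics and finds that mean token log-probability is a strong choice. A minimal sketch of that ranking, assuming each candidate comes with its per-token log-probabilities from the model; the numbers below are made up.

```python
def best_by_mean_logprob(candidates):
    """Pick the candidate with the highest mean per-token log-probability.
    candidates: list of (completion_text, per_token_logprobs) pairs."""
    return max(candidates, key=lambda cand: sum(cand[1]) / len(cand[1]))[0]

# Toy example: a short confident completion vs. a longer, less confident one.
candidates = [
    ("return a + b",                 [-0.1, -0.2, -0.1, -0.1]),
    ("total = a\nreturn total + b",  [-0.3, -0.4, -0.5, -0.2, -0.3, -0.4]),
]
print(best_by_mean_logprob(candidates))  # "return a + b"
```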

Higher BLEU score != Functional correctness
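
A toy illustration of this point using NLTK's sentence-level BLEU: a functionally correct solution that merely renames variables can score lower than a near-copy with a single broken operator. The snippets and whitespace tokenization are my own, chosen for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "def add(a, b): return a + b".split()
correct   = "def add(x, y): return x + y".split()   # works, but renames variables
wrong     = "def add(a, b): return a - b".split()    # one token off, functionally broken

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], correct, smoothing_function=smooth))  # low BLEU despite being correct
print(sentence_bleu([reference], wrong,   smoothing_function=smooth))  # high BLEU despite being wrong
```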

Failure with Synthetic Prompts

Docstring Generation

References

  1. Paper (Evaluating Large Language Models Trained on Code) — https://arxiv.org/pdf/2107.03374.pdf
  2. Evaluating Large Language Models Trained on Code — https://www.youtube.com/watch?v=1hJdBNYTNmQ&t=1071s
  3. https://youtu.be/QJq9RTp_OVE
  4. https://dev.to/jamescalam/three-decoding-methods-in-nlp-5f99
  5. https://neal-lathia.medium.com/evaluating-llms-trained-on-code-bb2bdab3cb37
