NLP Experiment

We write articles on recent NLP work, summaries of research papers, talks, etc.

Evaluating LLMs for code generation

--

Codex: Quick Takeaways

Codex Data Collection for Pretraining

Codex Data Collection for Fine-Tuning

Why didn't Codex use a GPT-3 model checkpoint?

Evaluation through Unit tests
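
The paper measures functional correctness with the pass@k metric: generate n ≥ k samples per problem, count the number c that pass the unit tests, and estimate the probability that at least one of k samples is correct. A minimal sketch of the numerically stable unbiased estimator given in the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated for a problem
    c: samples that passed the unit tests
    k: sample budget being estimated
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # 1 - C(n-c, k) / C(n, k), expanded as a stable running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 12 of 200 samples correct: pass@1 = 0.06, pass@100 ≈ 1.0
print(pass_at_k(200, 12, 1), pass_at_k(200, 12, 100))
```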

Decoding Model Output
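
When many candidates are needed, Codex decodes by sampling rather than taking the greedy argmax; the paper uses nucleus (top-p) sampling with p = 0.95 at various temperatures. A hedged sketch of that decoding setup using the Hugging Face transformers API (the checkpoint name is a stand-in, since the Codex weights are not public):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint, not the model from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

# draw several candidate completions instead of one greedy decode
outputs = model.generate(
    **inputs,
    do_sample=True,          # enable sampling
    temperature=0.8,         # flattens the next-token distribution
    top_p=0.95,              # nucleus sampling, as in the paper
    max_new_tokens=64,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print("-" * 40)
```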

Evaluation

Results

Note: all these models are only pre-trained; none are fine-tuned on competitive programming datasets.

Codex Loss Analysis

Higher temperatures are better when the number of samples is large.
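
One way to see why: low-temperature samples are near-duplicates, so extra samples add little, while high-temperature samples are diverse attempts, so the chance that at least one passes grows with the budget. A toy simulation to illustrate (the probabilities here are invented for illustration, not taken from the paper):

```python
import random

random.seed(0)
TRIALS, K = 10_000, 100

# Low temperature: all K samples collapse to (roughly) one attempt,
# which passes with probability 0.35 -> pass@K stays near 0.35.
low_t = sum(random.random() < 0.35 for _ in range(TRIALS)) / TRIALS

# High temperature: K diverse attempts, each passing with only
# probability 0.05 -> pass@K approaches 1 - 0.95**K ≈ 0.99.
high_t = sum(
    any(random.random() < 0.05 for _ in range(K)) for _ in range(TRIALS)
) / TRIALS

print(f"pass@{K}, low temperature:  {low_t:.2f}")
print(f"pass@{K}, high temperature: {high_t:.2f}")
```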

Evaluation on the APPS Dataset

Analysis of Model Size vs. Pass Rate

Selecting a Solution to Evaluate Among Temperature-Sampled Candidates
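
Without access to the unit tests at selection time, the paper ranks candidates by mean per-token log probability (averaging works better than summing, which penalizes longer solutions). A sketch of that ranking heuristic; the candidate dictionary shape here is hypothetical:

```python
from typing import Dict, List

def rank_by_mean_logprob(candidates: List[Dict]) -> List[Dict]:
    """Order candidates by mean token log probability, best first.

    Each candidate is assumed (hypothetically) to look like
    {"code": str, "token_logprobs": [float, ...]}.
    """
    return sorted(
        candidates,
        key=lambda c: sum(c["token_logprobs"]) / len(c["token_logprobs"]),
        reverse=True,
    )

samples = [
    {"code": "return a + b",       "token_logprobs": [-0.2, -0.1, -0.3]},
    {"code": "return sum([a, b])", "token_logprobs": [-0.5, -0.9, -0.4, -0.6]},
]
best = rank_by_mean_logprob(samples)[0]
print(best["code"])  # the higher-confidence candidate gets evaluated first
```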

Higher BLEU score != Functional correctness
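
BLEU rewards n-gram overlap with a reference solution, not behavior, so a lexically different but correct program can score far below a near-copy that fails the tests. A small sketch with NLTK's sentence-level BLEU:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# whitespace tokenization is crude but enough to make the point
reference = "def add(a, b): return a + b".split()
correct   = "def add(x, y): return x + y".split()  # renamed vars, passes tests
broken    = "def add(a, b): return a - b".split()  # one token off, fails tests

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], correct, smoothing_function=smooth))  # low
print(sentence_bleu([reference], broken,  smoothing_function=smooth))  # high

# the unit test tells the opposite story
add_ok  = lambda x, y: x + y
add_bad = lambda a, b: a - b
assert add_ok(2, 3) == 5
assert add_bad(2, 3) != 5  # high BLEU, wrong behavior
```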

Failure with Synthetic Prompts

Docstring Generation

References

  1. Evaluating Large Language Models Trained on Code (paper) — https://arxiv.org/pdf/2107.03374.pdf
  2. Evaluating Large Language Models Trained on Code (talk) — https://www.youtube.com/watch?v=1hJdBNYTNmQ&t=1071s
  3. https://youtu.be/QJq9RTp_OVE
  4. Three Decoding Methods in NLP — https://dev.to/jamescalam/three-decoding-methods-in-nlp-5f99
  5. Evaluating LLMs Trained on Code — https://neal-lathia.medium.com/evaluating-llms-trained-on-code-bb2bdab3cb37
