A Gentle Introduction to Code Generation Evaluation
An overview and comparison of the available code generation evaluation metrics
With the release of the OpenAI Codex model, code generation using deep learning has become a hot topic, primarily due to the impressive results of some models, and I think it will make us re-evaluate and re-imagine how we create software in the future.
As with any deep learning model, the training procedure requires a metric to evaluate performance. Currently, there are a handful of ways to evaluate such models, but it is not yet clear (in my view) which is best, which motivated me to write this post.
This post gives an overview of the metrics used to evaluate code generation models and compares them on a real example to highlight their strengths and weaknesses.
Running Example
To help evaluate each metric, we will use the example described below, where three models are tasked with predicting a code snippet for a given prompt, and we want to use the metric's score to select the best model. In a real-world situation, one would compare results over an entire test or validation set, but for the sake of this post, we will use only one sample.
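To make the setup concrete, here is a minimal sketch of what such a comparison might look like in Python. The prompt, reference solution, and candidate snippets below are hypothetical placeholders (the actual example used in the post is introduced in the next section), and `score_fn` stands in for whichever metric is being evaluated.

```python
# Hypothetical running-example setup: one prompt, one reference solution,
# and one candidate prediction per model. All snippets are placeholders.
prompt = '"""Return the sum of two numbers."""\ndef add(a, b):'

reference = "def add(a, b):\n    return a + b"

candidates = {
    "model_a": "def add(a, b):\n    return a + b",
    "model_b": "def add(a, b):\n    return sum([a, b])",
    "model_c": "def add(a, b):\n    return a - b",
}


def select_best_model(candidates, reference, score_fn):
    """Score each candidate against the reference and return the best model.

    `score_fn(candidate, reference)` is a stand-in for any of the metrics
    compared in this post (higher is assumed to be better).
    """
    scores = {name: score_fn(code, reference) for name, code in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Example usage with a trivial exact-match "metric":
best, scores = select_best_model(
    candidates, reference, score_fn=lambda c, r: float(c == r)
)
print(best, scores)
```

In practice, `score_fn` would be replaced by each of the metrics discussed below, and the comparison would run over a full evaluation set rather than a single sample.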