Evaluating code generation agents—LangChain and CodeChain

James Murdza
4 min read · Jul 27, 2023


A few months ago while working on GitWit, an AI code generation agent, I ran into a challenge: how can the quality of code generated by LLM agents be accurately assessed? Such agents are complex and non-deterministic, and they often require special evaluation setups, such as running unit tests. Below is a walkthrough of my own system for dealing with this, including the code and links to the tools you need to run it yourself.

A code generation evaluation pipeline.

In this walkthrough, I’ll first show how LLMs can easily be used to generate code. Then I’ll show how I’m using LangSmith as a platform to batch-evaluate thousands of generations, which is essential if improvements in these models are to be accurately measured.

If you can’t wait to get started and want to run GPT-4 on an entire dataset of coding problems, you can get started right away with this Colab notebook!

An automated test run of HumanEval on LangSmith with 16,000 code generations.

Technologies used

Before getting started, here are the most important components of the evaluation workflow:

  • LangChain 🦜🔗 is a standard framework for building LLM apps and agents. Once a program is written with LangChain, it’s normally quite easy to swap out individual components and LLMs.
  • LangSmith 🦜🛠️ is a data platform for applications built with LangChain. It comes with a client for batch-running LLM chains and evaluators, which you define in Python code (a short setup sketch follows this list).
  • Additionally, I’m using HumanEval, a test harness by OpenAI, and CodeChain, a library I created myself to abstract common tasks in code generation.
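
Before running any of the code below, the OpenAI model and the LangSmith client both need credentials. Here is a minimal setup sketch, assuming the standard environment variables are used (the placeholder values are not real keys):

import os

# Credentials for the model provider and for LangSmith (placeholder values).
os.environ["OPENAI_API_KEY"] = "<your OpenAI API key>"
os.environ["LANGCHAIN_API_KEY"] = "<your LangSmith API key>"
# Send traces and runs to LangSmith.
os.environ["LANGCHAIN_TRACING_V2"] = "true"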

A simple code generator

Let’s start with a simple case similar to the ones we want to evaluate. By composing a CompleteCodeChain with a ChatOpenAI model, we have a working code generator:

from codechain.generation import CompleteCodeChain
from langchain.chat_models import ChatOpenAI

generator = CompleteCodeChain.from_llm(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)
)

result = await generator.arun("""
def is_palindrome(string: str):
    # Check if string is a palindrome.
""")

In this example, we prompt the generator with the beginning of a function that should check whether a string is a palindrome. The result is great:

def is_palindrome(string: str):
    # Check if string is a palindrome.
    return string == string[::-1]

Because of the composable nature of LangChain, it’s easy to swap the OpenAI model for one of many other chat model integrations.
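
For example, swapping in a different provider is a small change. The sketch below assumes an Anthropic API key is configured and uses ChatAnthropic, but any other LangChain chat model would slot in the same way:

from codechain.generation import CompleteCodeChain
from langchain.chat_models import ChatAnthropic

# Same chain as before, with a different chat model dropped in.
generator = CompleteCodeChain.from_llm(
    ChatAnthropic(model="claude-2", temperature=0.2)
)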

Creating a problem dataset

The HumanEval dataset is a collection of Python problems, each in the same format as the example above. Each one has an ID, a prompt, and unit tests to automatically verify any attempted solution. Each also includes a canonical solution, which is purely for reference.
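
To get a concrete sense of the data, here is a quick way to inspect one problem record (using the field names exposed by the human_eval package):

from human_eval.data import read_problems

problems = read_problems()
example = problems["HumanEval/0"]

print(example["prompt"])              # function signature and docstring to complete
print(example["test"])                # unit tests with assert statements
print(example["canonical_solution"])  # reference solution, not used for grading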

Creating a dataset with LangSmith is fairly straightforward. In the code below, we copy the fields we need into a dataset tied to our account on the cloud platform:

import langsmith
from human_eval.data import read_problems

# Name and description for the new dataset (example values).
dataset_name = "HumanEval"
description = "Problems from the OpenAI HumanEval dataset"

# Connect to LangSmith and create a dataset.
client = langsmith.Client()
dataset = client.create_dataset(dataset_name, description=description)

# Upload each item from HumanEval to the LangSmith dataset.
for key, value in read_problems().items():
    client.create_example(
        inputs={"prompt": value["prompt"], "task_id": key},
        outputs={"canonical_solution": value["canonical_solution"]},
        dataset_id=dataset.id
    )

Defining an evaluator

As mentioned, each item in the problem set includes a set of test cases to check whether the generated code is correct. In OpenAI’s HumanEval harness, these assert statements are appended to the end of the generated code, and the combined program is run to check for success. To use this custom evaluation logic with LangSmith, we subclass RunEvaluator:

from typing import Optional

from langsmith.evaluation import RunEvaluator, EvaluationResult
from langsmith.schemas import Run, Example

from human_eval.data import read_problems
from human_eval.execution import check_correctness

# Look up problems by task ID so the evaluator can find the matching tests.
problems = read_problems()

class HumanEvalEvaluator(RunEvaluator):
    def evaluate_run(self, run: Run, example: Optional[Example] = None) -> EvaluationResult:
        problem = problems[run.inputs["task_id"]]
        solution = run.outputs["output"]
        result = check_correctness(problem, solution, timeout=5)
        return EvaluationResult(
            key="Correctness",
            score=bool(result["passed"])
        )
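
Under the hood, check_correctness assembles a complete program from the problem and the generated code, then executes it in a sandboxed subprocess with a timeout. A simplified sketch of the idea (not the actual implementation) looks like this:

def naive_check(problem: dict, completion: str) -> bool:
    # Append the unit tests to the completed function and run everything.
    program = (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + "\n"
        + f"check({problem['entry_point']})\n"
    )
    try:
        exec(program, {})
        return True
    except Exception:
        return False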

Running a batch evaluation

As in the simple example above, we define a code generator to run on the dataset. HumanEvalChain is similar to CompleteCodeChain, but adapted to the input fields defined by the HumanEval dataset.

from langchain.chat_models import ChatOpenAI
from codechain.generation import HumanEvalChain

# Model and sampling settings for this evaluation run (example values).
model_name = "gpt-3.5-turbo"
temperature = 0.2

# Factory for the generation chain
def chain_factory():
    return HumanEvalChain.from_llm(
        ChatOpenAI(model_name=model_name, temperature=temperature)
    )

Now that the dataset has been created and the generator and evaluator have been defined, all that is left is to run the batch evaluation. LangChain’s arun_on_dataset() runs a chain across a dataset in parallel, adding evaluation results in real time as each run finishes.

from langchain.smith import arun_on_dataset, RunEvalConfig

# Evaluator configuration
evaluation = RunEvalConfig(
    custom_evaluators=[HumanEvalEvaluator()],
    input_key="task_id"
)

# Number of generations per problem (ten, as used below).
repetitions_per_problem = 10

# Run all generations and evaluations
chain_results = await arun_on_dataset(
    client=client,
    dataset_name=dataset_name,
    num_repetitions=repetitions_per_problem,
    concurrency_level=5,
    llm_or_chain_factory=chain_factory,
    evaluation=evaluation,
    tags=["HumanEval"],
    verbose=True
)

Running this with ten repetitions per problem produces 1,640 generations and evaluations (164 problems × 10 runs each). To try it yourself, run this notebook.

Source code
