Can LLMs help SWEs write test cases?

Taiyuan Z
7 min read · Jul 17, 2023


Recently, our organization experienced several incidents caused by insufficient automated testing. Since I happened to be researching LLMs at the time, I took the opportunity to explore whether an LLM could automatically generate test cases for our code and improve test coverage. Here are the results of my investigation.

Research Approach

  • Since our codebase primarily consists of Java, my research focused on Java code.
  • I primarily focused on unit testing and did not delve into more complex service-level or end-to-end tests.
  • I chose GPT as the research subject because it is the most convenient and accessible option.

In the future, I may explore open-source alternatives (since we cannot use GPT directly in our organization at the moment). I tested three scenarios, ranging from easy to difficult in complexity:

  1. Simple Logic with a Single Class
  2. Medium Complexity with External Class References
  3. High Complexity with Extensive External References

Disclaimer: I recognize this is a somewhat naive and oversimplified categorization, but I think it is sufficient to get a high-level understanding of what the technology can do and how practical the idea is.

Scenario 1 — Simple Logic with a Single Class

Let’s start with the results for scenario #1. Code Example:

import java.util.List;

public class MyClass {
    public int sum(List<Integer> a) {
        int s = 0;
        for (int v : a) {
            s += v;
        }
        return s;
    }
}

Test Result:

import org.junit.jupiter.api.Test;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import static org.junit.jupiter.api.Assertions.assertEquals;

public class MyClassTest {
    @Test
    public void testSum() {
        MyClass myClass = new MyClass();

        // Test case 1: Empty list
        List<Integer> emptyList = new ArrayList<>();
        int sum1 = myClass.sum(emptyList);
        assertEquals(0, sum1, "Sum of an empty list should be 0");

        // Test case 2: List with positive numbers
        List<Integer> positiveNumbers = Arrays.asList(1, 2, 3, 4, 5);
        int sum2 = myClass.sum(positiveNumbers);
        assertEquals(15, sum2, "Sum of positive numbers should be 15");

        // Test case 3: List with negative numbers
        List<Integer> negativeNumbers = Arrays.asList(-1, -2, -3, -4, -5);
        int sum3 = myClass.sum(negativeNumbers);
        assertEquals(-15, sum3, "Sum of negative numbers should be -15");

        // Test case 4: List with both positive and negative numbers
        List<Integer> mixedNumbers = Arrays.asList(-10, 5, 3, -7, 12);
        int sum4 = myClass.sum(mixedNumbers);
        assertEquals(3, sum4, "Sum of mixed numbers should be 3");
    }
}

This is a simple function that takes a List<Integer> as input and returns the sum of all elements. The generated test code appears to be error-free and covers various scenarios. I ran it locally, and the tests passed successfully.

In addition to the above example, I tested several others, including some simple LeetCode problems, and the results were satisfactory. Due to space constraints, I won’t include the additional code examples here.

Scenario 2 — Medium Complexity with External Class References

Let’s now look at another example:

package test_generation;

import java.util.Arrays;

public class Case2RelevanceScorer {
    private final Query query;

    public Case2RelevanceScorer(Query query) {
        this.query = query;
    }

    public float scoreDoc(Document doc) {
        String[] queryTokens = query.getNormQuery().split(" ");
        String[] docTokens = doc.getNormKeyword().split(" ");
        int overlap = 0;
        for (String t : queryTokens) {
            boolean found = Arrays.stream(docTokens).anyMatch(x -> x.equals(t));
            if (found) {
                overlap += 1;
            }
        }
        // Calculate Jaccard similarity (cast to float to avoid integer division).
        return (float) overlap / (queryTokens.length + docTokens.length - overlap);
    }
}

Test Result:

public class Case2RelevanceScorerTest {
    private Case2RelevanceScorer relevanceScorer;

    @BeforeEach
    public void setup() {
        Query query = new Query("java programming language");
        relevanceScorer = new Case2RelevanceScorer(query);
    }

    @Test
    public void testScoreDoc() {
        // Test case 1: No overlap between query and document keywords
        Document doc1 = new Document("python programming");
        float score1 = relevanceScorer.scoreDoc(doc1);
        assertEquals(0.0f, score1, "Score should be 0.0 when there is no overlap");

        // Test case 2: Partial overlap between query and document keywords
        Document doc2 = new Document("java programming");
        float score2 = relevanceScorer.scoreDoc(doc2);
        assertEquals(0.5f, score2, "Score should be 0.5 when there is partial overlap");

        // Test case 3: Complete overlap between query and document keywords
        Document doc3 = new Document("java programming language");
        float score3 = relevanceScorer.scoreDoc(doc3);
        assertEquals(1.0f, score3, "Score should be 1.0 when there is complete overlap");

        // Test case 4: Empty query and document keywords
        ...
        // Test case 5: Empty query with non-empty document keywords
        ...
        // Test case 6: Non-empty query with empty document keywords
        ...
        // Test case 7: Null query and document keywords
        ...
    }
}

In this example, the code calculates the Jaccard Similarity between a query and a document, which is a common relevance metric used in search engines. The added complexity here is:

  • The definitions of the Query and Document classes are not provided in the code example; they are defined elsewhere.
  • Some mathematical calculations are involved. They are relatively simple, but the GPT series has been criticized for being weak at arithmetic, so this might still be a challenge.

In the test result, we can clearly see some issues:

  • Since the definitions of the Query and Document classes were missing, the LLM made incorrect assumptions about their constructors.
  • Test cases 1 and 2 are incorrect. Test case 1 was meant to have no overlap, but "python programming" shares the token "programming" with the query, so the score should be non-zero. Test case 2 should score two-thirds, not 0.5 (see the worked check below).
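
For reference, here is a quick worked check using the standard Jaccard definition |A ∩ B| / (|A| + |B| - |A ∩ B|) over the token lists, assuming the float-division version of scoreDoc shown above. The corrected assertions would be:

// Query: "java programming language" -> 3 tokens
// doc1:  "python programming"        -> 2 tokens, 1 shared token ("programming")
//        expected score = 1 / (3 + 2 - 1) = 0.25, not 0.0
// doc2:  "java programming"          -> 2 tokens, 2 shared tokens
//        expected score = 2 / (3 + 2 - 2) = 2/3 ≈ 0.667, not 0.5
assertEquals(0.25f, relevanceScorer.scoreDoc(doc1), 1e-6f);
assertEquals(2.0f / 3.0f, relevanceScorer.scoreDoc(doc2), 1e-6f);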

To address the first issue, I added the definitions of the two classes to the prompt.


You're working on a Java project.
In this project, there are some existing codes like the following.
File 1:
```
...
@Data
@Builder
public class Document {
private final String keyword;
private final double historicalCTR;
private final int searchTraffic;

public String getNormKeyword() {
return keyword.toLowerCase();
}
}
```
File 2:
```
...
@Data
@Builder
public class Query {
private String query;
private Map<String, String> params;

public String getNormQuery() {
return query.toLowerCase();
}
}
```

This addition fixes the first issue. The model even recognizes the correct usage of @Data/@Builder, presumably because its training data included other code that uses Lombok. However, the second issue remains, and a third, new issue appears:

// Newly generated test case; this makes absolutely no sense
Document zeroDoc = Document.builder()
        .keyword("java programming language")
        .historicalCTR(1.0)
        .searchTraffic(0)
        .build();
float score9 = relevanceScorer.scoreDoc(zeroDoc);
assertEquals(0.0f, score9, "Score should be 0.0 when search traffic is zero");

The Document class contains the fields historicalCTR and searchTraffic, but neither field is used in the logic of RelevanceScorer. This raises serious doubt about whether ChatGPT actually understands the logic of RelevanceScorer; it seems to rely on intuition and guess based on variable and class names.
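
For contrast, here is roughly what a corrected partial-overlap test could look like once the Lombok builders and the expected score are used properly. This is just a sketch based on the class definitions above (the test name is mine, and it assumes the usual JUnit 5 and java.util.Collections imports); the builder method names follow Lombok's @Builder convention:

@Test
public void testPartialOverlap() {
    Query query = Query.builder()
            .query("Java Programming Language")
            .params(Collections.emptyMap())
            .build();
    Case2RelevanceScorer scorer = new Case2RelevanceScorer(query);

    Document doc = Document.builder()
            .keyword("Java Programming")
            .historicalCTR(0.1)  // required by the builder but ignored by scoreDoc()
            .searchTraffic(100)  // likewise ignored by scoreDoc()
            .build();

    // 2 shared tokens out of 3 + 2 - 2 = 3 distinct tokens -> 2/3
    assertEquals(2.0f / 3.0f, scorer.scoreDoc(doc), 1e-6f);
}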

After several iterations, my assessment for Scenario 2 is that the approach has some usefulness (e.g., generating boilerplate test code and suggesting test scenarios to consider), but it clearly requires manual correction afterwards.

Scenario 3 — High Complexity with Extensive External References

For this scenario, I tested code from Elasticsearch, an open-source search engine (due to intellectual-property concerns, I cannot feed our company's internal code to ChatGPT directly). Since this is open-source code, I expected it to perform better than Scenario 2: the code may well have appeared in ChatGPT's training data (though not the latest version of it). Let's see the results. Again, to save space, I removed irrelevant code and kept only the parts that I believe illustrate the issues.

// This example involves testing an arbitrary class, BoolQueryBuilder,
// from the Elasticsearch codebase
public class BoolQueryBuilderTest extends ESTestCase {

    public void testBoolQueryBuilder() throws IOException {
        BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
        ...

        // Test toQuery() method
        SearchExecutionContext context = createSearchExecutionContext();
        Query query = boolQuery.toQuery(context);
        assertTrue(query instanceof BooleanQuery);
        BooleanQuery booleanQuery = (BooleanQuery) query;
        assertEquals(1, booleanQuery.clauses().size());
    }
}

  • As anticipated, the generated code includes ESTestCase and createSearchExecutionContext, which are from the Elasticsearch code. These pieces of information were not provided in the prompt, so the only reasonable explanation is that ChatGPT was trained on Elasticsearch code.
  • However, the generated code calls createSearchExecutionContext without its required parameters, possibly because the method took no parameters at the time of ChatGPT's pre-training.

New Issues Identified:

  • Not all dependencies are imported, so the code does not compile. I had to add the missing imports by hand, one by one (see the sketch after this list).
  • Some code produces errors when run, and it is uncertain whether it’s due to outdated pre-trained code or other reasons.
  • Furthermore, even after removing invalid/error-prone code, the remaining code only covers some straightforward and basic logic, lacking comprehensive testing of the core logic.
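
For reference, the imports I had to add by hand looked roughly like this (a sketch; the exact package paths depend on the Elasticsearch and Lucene versions in use):

import java.io.IOException;

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.SearchExecutionContext;
import org.elasticsearch.test.ESTestCase;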

For complex scenarios like this, my current evaluation is that while LLM can provide a rough code structure and help consider potential test cases, it cannot fully comprehend the logic of the code or complete thorough testing on its own.

Conclusion

My current conclusion (at least for ChatGPT) is that LLMs can handle simple code whose behavior can be deduced with common sense, but they cannot truly understand the meaning of complex code.

The generated code can usually be a good starting point, but in most cases, it still requires manual correction, refinement, or even rewriting.

And to apply it to large-scale, complex, proprietary (i.e., non-open-source) projects, it is essential that the model has actually learned from the codebase (through pre-training or fine-tuning); otherwise it likely won't be usable at all.
