Check AI-Generated Code Perfectly and Automatically

My Experience Applying Kani’s Formal Verification to ChatGPT-Suggested Rust Code

11 min readJun 9, 2023

A robot checking another robot — DALL·E (openai.com)

Do you love AI coding tools such as ChatGPT and GitHub Copilot? Do you hate that you can’t trust the code they produce?

In some (for now, limited) cases, you can trust their code! Using formal verification tools such as Kani for Rust we can sometimes automatically prove the correctness of AI-generated code with mathematical certainty. My crate RangeSetBlaze now includes some such code.

Rust already provides mathematical certainty for memory management. In this article, you’ll also learn how you can use Kani to expand that certainty to the topic of computer arithmetic and overflows. You’ll find Kani useful for both human- and AI-generated code.

Aside: You may object that, in general, proving a program correct is undecidable. Also, in general, writing a correct specification is as hard as writing a correct program. While these statements are true in general, in many specific domains writing a specification is easy— for example, we just write an obviously correct but inefficient function as our spec. Also, in some useful cases, proving the equivalence of two functions is well within the capabilities of systems such as Kani. Read on for examples.

To understand the promise and limitations of automatic verification of automatically-generated code, I applied ChatGPT and Kani to parts of RangeSetBlaze. RangeSetBlaze is a Rust crate for efficiently manipulating sets of “clumpy” integers. Specifically, I looked at three programming problems with these properties:

Problem 1: A one-line problem verifiable by Kani but too hard for ChatGPT
Problem 2: A one-line problem solvable by ChatGPT and verifiable by Kani
Problem 3: A more interesting problem, too hard for Kani and for ChatGPT

Excitingly, when I benchmarked the ChatGPT/Kani solution to Problem 2, I found it to be 3% faster than my original code. RangeSetBlaze now includes this computer generated and verified code.

Three disclaimers: First, making this one line of code faster has no noticeable impact on the crate’s overall speed. Second, in the course of this work, I found yet another one-line solution to Problem 2— 11% faster than my original code. I prefer the ChatGPT solution for reasons we’ll discuss when we look at Problem 2. Third, see Related Work, near the end, for references to others who have combined formal verification with AI generation.

Let’s look at each of the three programming problems, one at a time.

Problem 1: A one-line problem verifiable by Kani but too hard for ChatGPT

Inside the RangeSetBlaze crate, I often need to find the length of an inclusive range such as 3..=10. Notice that if the type is i8 (8-bit signed integer) this length could be any value from 0 to 256. These 257 distinct length values do not fit into 8-bits, so a length must be something like a u16 or usize (on a 64-bit computer).

So, how should we compute the length of an inclusive range? Naïvely, we try this code:

let len = end - start + 1;

Here is how we ask Kani to verify the correctness of the naïve code:

#[cfg(kani)]
#[kani::proof]
fn verify_len_i8() {
    let start: i8 = kani::any();
    let end: i8 = kani::any(); // end is inclusive
    kani::assume(start <= end);
    let len = end - start + 1;
    assert!((len as i128) == ((end as i128) - (start as i128) + 1));
}

This challenges Kani to mathematically prove the following logical proposition:

If:

start is an i8 with any value, and
end is any i8 at least as large as start, and
len (an i8 by Rust’s type rules) is computed as above, and
((end as i128) — (start as i128) + 1)) is the right answer

Then: len as i128 will equal the right answer.

Kani responds that no proof is possible. It these reports two issues:

“Attempt to subtract with overflow”— counter example: start = -118and end = 35.
“Attempt to add with overflow” — counter example: start = -7 and end = 120.

My explanation: In the first issue, the subtraction produces 153, too large to fit in an i8. In the second issue, the subtraction produces 127, which when we add 1 is 128, too large for an i8.

In contrast with the naïve code, RangeSetBlaze’s actual length-of-inclusive-range code uses overflowing_sub and two casts:

let len = end.overflowing_sub(start).0 as u8 as usize + 1;

Is this correct? Yes. When we ask Kani, it reports: “Success!”. (We’ll discuss how overflowing_sub works in “Aside 3”, below.)

Of course, with i8’s we could have just tested all possible value combinations. Kani, however, also works for larger types. For example, i32 has 4,294,967,296 possible values and i64 has 18,446,744,073,709,551,616 possible values. Kani verifies RangeSetBlaze’s code with mathematical perfection. For example, for i32, we write:

#[cfg(kani)]
#[kani::proof]
fn verify_len_i32_other_sub() {
    let start: i32 = kani::any();
    let end: i32 = kani::any(); // end is inclusive
    kani::assume(start <= end);
    let len = end.overflowing_sub(start).0 as u32 as usize + 1;
    assert!((len as i128) == ((end as i128) - (start as i128) + 1));
}

and it replies “Success!”.

Aside 1: How can we assume that ((end as i128) — (start as i128) + 1) is correct? Well, mathematical proofs require assumptions. In this case, I’m willing to assume that casting to a bigger-than-needed type will give the answer I want. In general, we can often write obviously correct but slow code. We hope to leverage “obviously correct but slow” code to create “obscurely correct and fast” code. Measuring “slow” and “fast” requires benchmarking, a topic discussed in Problem 2.
Aside 2: Even if testing all possible i32 inputs is impractical, can’t we just test many random and/or interesting inputs? Well, I love and recommend testing with random and interesting inputs, for example, with QuickCheck. These methods, however, never give us the perfect mathematical certainty of a formal verification system such as Kani.
Aside 3: How does the one line of code even work? The overflowing_sub subtracts the two numbers but if there is underflow, it wraps the result (and also returns an underflow indicator). The .0 extracts just the result. Because start ≤ end, casting to u32 gives us end-start. Casting to usize (on a 64-bit machine) and adding one gives us the desired length without under or overflow.

Generating with ChatGPT

Now that we can check some code for correctness, could we have used ChatGPT or GitHub Copilot to create the code? I ask GitHub Copilot:

In Rust, I have a range inclusive of i32. Assume start<=end. How can I compute the length of the range?

It replies with the naïve solution (re-formatted for Medium):

// You can compute the length of the range by subtracting the
// start from the end and adding 1. Here's an example code snippet:

let start: i32 = 10;
let end: i32 = 20;
let len = end - start + 1;
println!("Length of range: {}", len);

// This will output Length of range: 11 for the given values
// of start and end.

As we saw above, Kani finds that this code fails with under- and overflows. I ask GitHub Copilot for a solution that doesn’t overflow:

In Rust, I have a range inclusive of i32. Assume start<=end. How can I compute the length of the range without overflow?

It suggests this, which doesn’t always give an answer:

let start: i32 = 10;
let end: i32 = 20;
let len = end.checked_sub(start).and_then(|diff| diff.checked_add(1));
match len {
    Some(len) => println!("Length of range: {}", len),
    None => println!("Subtraction overflowed!"),
}

Switching to ChatGPT 3.5, I ask:

In Rust, I have a range inclusive of i32. Assume start<=end. How can I compute the length of the range without overflow? I don’t want to detect the overflow, I want to avoid it.

It suggests:

let length = end.saturating_sub(start).saturating_add(1);

When I have Kani formally test this, it reports:

“Assertion failed: (len as i128) == ((end as i128) — (start as i128) + 1)” — Counter example: start = -2013200384, end = 402653181.

My explanation: The subtraction produces 2,415,853,565 which saturating_sub truncates to i32::MAX, 2,147,483,647. When we add 1, we get 2,147,483,648 instead of the correct 2,415,853,566.

Attempts with ChatGPT 4 produce similar incorrect or incomplete answers.

Let’s turn next to …

Problem 2: A one-line problem solvable by ChatGPT and verifiable by Kani

At one place in the RangeSetBlaze code, it must check if a + 1 < b where a and b can be any two integers of the same type. We can give this code to Kani:

#[cfg(kani)]
#[kani::proof]
fn verify_less_than_i32_naive() {
    let a: i32 = kani::any();
    let b: i32 = kani::any();
    let condition = a + 1 < b;
    assert!(condition == (a as i128 + 1 < b as i128));
}

It reports that this code is incorrect, specifically:

“Attempt to add with overflow” — Counter example: a = 2147483647 and b = 553648128.

My explanation: The counter example value for a is i32::MAX, so when we add one to it, it overflows.

Here RangeSetBlaze’s original code:

let condition = a < b && a + 1 < b;

which Kani confirms is correct.

Aside: The code works because a can only overflow if it is i32::MAX, but it is i32::MAX then a+1 can never be less than b. The && insures that if the first condition is true, we “short circut” and skip evaluation of the second condition.

The question now is if ChatGPT can find this or something similar.

Generating with ChatGPT

I ask ChatGPT 4 (May 24 Version) this question:

In Rust, a and b are i32s. How can I test that a+1 < b without overflow?

It replies with this expression:

match a.checked_add(1) {
    Some(sum) => sum < b,
    None => false,
}

When I test this with Kani, it confirms that this answer is correct. This is very exciting to me. It is my first code that is both automatically generated and automatically validated.

Benchmarking

To see if ChatGPT’s code is worth using, let’s use Rust’s Criterion package for micro-benchmarking. Each run will generate a and b, the two inputs, randomly and uniformly from the full range of i32. Here are the benchmark results:

ChatGPT 4.0’s code is faster than my original code.

The results show these contestants:

A: cast i128 — the “obviously correct but slow” solution that just casts everything to i128. Benchmarking shows this to be slowest, but slower here is relative. It still runs in less than a nanosecond.
B: short circuit — my original solution that uses Rust’s && operator. It is 18% faster than cast i128.
C: ChatGPT 4 — using checked_add and match. It is 20% faster than cast i128 and 3% faster than short circuit.
D: cast i64 — While creating this benchmark, I got the (in retrospect obvious) idea of the following solution: let condition = (a as i64) + 1 < (b as i64);. It is 27% faster than cast i128 and 8% faster than ChatGPT 4.

Why is short circuit slower than ChatGPT 4? We can use the Rust Playground to look at assembly code (I’ve added annotations):

playground::less_than_i32_human:     
        cmpl    %esi, %edi          # compare a to b
        setl    %cl                 # t1 = a < b
        incl    %edi                # a = a+1
        cmpl    %esi, %edi          # compare a to b
        setl    %al                 # t2 = a < b
        andb    %cl, %al            # t3 = t1 & t2
        retq                        # return t3
     
playground::less_than_i32_gpt4:     
        incl    %edi                # a = a+1
        setno   %cl                 # t1 = "no overflow"
        cmpl    %esi, %edi          # compare a to b
        setl    %al                 # t2 = a < b
        andb    %cl, %al            # t3 = t1 & t2
        retq                        # return t3

We see that both versions of the code are branchless and avoid any overflow error. The short circuit code, however, does two comparisons while the ChatGPT 4 code does only one.

Discussion of Problem 2

In one sense, none of these solutions matter — speeding up this one line of code in RangeSetBlaze causes no noticeable effect on its overall speed. In the end, I decided to use the ChatGPT 4 code — it’s a bit faster. Also, having AI-generated and auto-validated code is fun. I don’t like the cast i64 code because RangeSetBlaze needs generic code that works with all sizes of integers, not just i32. I don’t think making the cast i64 solution generic is worth the added code.

To understand the current limits of automatically validating automatically-generated code, let’s turn next to …

Problem 3: A more interesting problem, too hard for Kani and for ChatGPT

Problem 1 and Problem 2 require just one line of code. How do ChatGPT and Kani do on a more interesting problem? The problem that came to mind involves efficiently inserting a range into a sorted, disjoint list of ranges. For example,

Inserting 3..=10
into 1..=1, 2..=4, 6..=7, 11..=20, 100..=200,
should result in 1..=1, 2..=20, 100..=200.

Here is the informal specification of the problem that I gave to ChatGPT 4:

RangeSetBlaze is a Rust crate represents a set of integers. It represents
the set as a sorted, disjoint list of inclusive integer ranges in a BTreeMap.
The start value is the map's key and the (inclusive) end value is the map's
value. Write a function that inserts a new inclusive integer range into the
BTreeMap. The new BTreeMap must represent the union of the original BTreeMap
and the input range. Its ranges must be sorted and disjoint. It should be
O(m * ln n) where m is the number of ranges in the BTreeMap that overlaps
with the input range and n is the number of ranges in the BTreeMap.

To which ChatGPT proposes:

pub fn insert(&mut self, range: RangeInclusive<i32>) {
    let (start, end) = range.into_inner();

    // Find the first overlapping range or the nearest one before it
    let mut next = self.map.range(..=start).next_back();

    // Find all overlapping ranges
    while let Some((&start_key, &end_value)) = next {
        // If this range doesn't overlap, we're done
        if end_value < start {
            break;
        }

        // If this range overlaps or is adjacent, merge it
        if end_value >= start - 1 {
            let new_end = end.max(end_value);
            let new_start = start.min(start_key);

            self.map.remove(&start_key);
            self.map.insert(new_start, new_end);

            // Restart from the beginning
            next = self.map.range(..=new_start).next_back();
        } else {
            next = self.map.range(..start_key).next_back();
        }
    }
}

This looks impressive, but is it correct? I see an underflow issue (end_value >= start — 1). Also, when I put this code into RangeSetBlaze and apply the existing tests, many of the tests timeout. This code seems to cause an endless loop.

Can Kani check this code perhaps finding counter examples? Sadly, no.

[Kani is] not able to handle code with “unbounded” loops, and what “bounded” means can be quite subtle. It has to have a constant number of iterations that’s “obviously constant” enough for the verifier to actually figure this out. In practice, very few loops are like this. — Kani Documentation

So, this interesting problem seems too hard for both ChatGPT and Kani.

Summary

There you have it: a tiny step toward the dream of automatically verifying automatically-generated code.

I found the Kani verification tool easy to use. Also, Kani did a great job of verifying one-line bits of code related to computer arithmetic. Code involving general loops, however, is beyond its current capabilities.

ChatGPT produced mixed results on the one-line computer arithmetic problems, getting one wrong and one right. It also apparently failed on the larger problem requiring loops.

Working together the two tools solved one problem, producing an answer a tiny bit faster than what I coded by hand. Their solution is now part of RangeSetBlaze.

In the future, I’d love to see progress in proof assistants such as Lean and Coq. These systems don’t (necessarily) automatically prove a math theorem or a program’s correctness. Instead, they automatically verify the correctness of a given proof. Compared to Kani, these systems seem hard to use but can potentially verify more complex code. Progress here would allow programmers, both human and AI, to augment programs with reasons why the program is correct. I believe this would make it easier to automatically verify automatically-generated code.

Aside: If you’re interested in future articles, please follow me on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.

Related work

Writing Code with ChatGPT? Improve it with Kani. | Kani Rust Verifier Blog, May 2023
Can ChatGPT write infallible programs? | by Rex | May, 2023
(Please send me any other related work that you know about. I will add it here.)

Check AI-Generated Code Perfectly and Automatically

My Experience Applying Kani’s Formal Verification to ChatGPT-Suggested Rust Code

Problem 1: A one-line problem verifiable by Kani but too hard for ChatGPT

Generating with ChatGPT

Problem 2: A one-line problem solvable by ChatGPT and verifiable by Kani

Generating with ChatGPT

Benchmarking

Discussion of Problem 2

Problem 3: A more interesting problem, too hard for Kani and for ChatGPT

Summary

Related work

Resources

Written by Carl M. Kadie