AI prompt for making sense of others’ code

Published in Bioinformatics with Rust · 4 min read · Dec 6, 2023

This post could just as easily be titled “How to calculate GC content in a DNA sequence.” Rust-Bio is a bioinformatics library for Rust. Let’s say we want to understand the following function from its source code.

use std::borrow::Borrow;

/// Base gc content counter
fn gcn_content<C: Borrow<u8>, T: IntoIterator<Item = C>>(sequence: T, step: usize) -> f32 {
    let (l, count) = sequence
        .into_iter()
        .step_by(step)
        .fold((0usize, 0usize), |(l, count), n| match *n.borrow() {
            b'c' | b'g' | b'G' | b'C' => (l + 1, count + 1),
            _ => (l + 1, count),
        });
    count as f32 / l as f32
}
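To get a feel for what the generic signature accepts, here is a small sketch that copies the function above and calls it with a few input types. The sequence GCATGCAT and the step values are made up for illustration; they are not taken from Rust-Bio.

```rust
use std::borrow::Borrow;

// Copied from the snippet above so the example compiles on its own.
fn gcn_content<C: Borrow<u8>, T: IntoIterator<Item = C>>(sequence: T, step: usize) -> f32 {
    let (l, count) = sequence
        .into_iter()
        .step_by(step)
        .fold((0usize, 0usize), |(l, count), n| match *n.borrow() {
            b'c' | b'g' | b'G' | b'C' => (l + 1, count + 1),
            _ => (l + 1, count),
        });
    count as f32 / l as f32
}

fn main() {
    // Illustrative sequence (an assumption, not from Rust-Bio).
    let borrowed: &[u8] = b"GCATGCAT";
    let owned: Vec<u8> = borrowed.to_vec();

    // A byte slice yields `&u8` items; a Vec<u8> yields `u8` items.
    // Both satisfy `Borrow<u8>`, so both calls compile.
    println!("slice, step 1: {}", gcn_content(borrowed, 1));
    println!("vec,   step 1: {}", gcn_content(owned, 1));

    // Any iterator over bytes works too. With step 3 only the bases at
    // positions 0, 3 and 6 (G, T, A) are examined.
    println!("bytes, step 3: {}", gcn_content("GCATGCAT".bytes(), 3));

    // An empty input divides 0 by 0, which is NaN for f32.
    println!("empty: {}", gcn_content(Vec::<u8>::new(), 1));
}
```

Note that a step of 0 would panic, because the standard library's step_by does not accept a zero step.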

Prompt Template

My strategy is to politely ask GPT-4 to simplify the code, then ask it to compare the simplified version with the original and list which features were lost. Fill in the parts in square brackets to adapt the template to your own use case.

Step 1 (Prompt): Generate the initial simple version

Assume you are a bioinformatics expert with extensive Rust programming knowledge. You are especially good at writing simple code. I will give you a code snippet and your task is to rewrite it in the simplest way possible. It’s okay if the simple version does not handle all the edge cases that the original code does. The simple code should be easy to understand for beginner Rust programmers who are still new to concepts like generic types, structs, and traits. This is the code:

[paste code here]

Step 2 (Questions): Iteratively generate better versions of the simple code

The first output may not be what you are looking for. You will have to keep the conversation going and ask the model to rewrite the code again. For example, ask:

Can you remove the step argument? It doesn’t seem essential.
Can you make the code use characters when checking equality instead of byte strings?
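The second question comes down to the difference between byte literals and characters in Rust. Here is a minimal sketch of the two styles; the helper names count_gc_bytes and count_gc_chars are hypothetical and do not appear in Rust-Bio.

```rust
// Count G/C bases by comparing bytes (u8), the style the original code uses.
fn count_gc_bytes(sequence: &[u8]) -> usize {
    sequence
        .iter()
        .filter(|&&b| matches!(b, b'g' | b'c' | b'G' | b'C'))
        .count()
}

// Count G/C bases by comparing characters (char), the style requested in the question.
fn count_gc_chars(sequence: &str) -> usize {
    sequence
        .chars()
        .filter(|&c| matches!(c, 'g' | 'c' | 'G' | 'C'))
        .count()
}

fn main() {
    // b'G' is a u8 holding the ASCII code of G; 'G' is a char.
    let as_byte: u8 = b'G';
    let as_char: char = 'G';
    println!("byte: {}, char: {}", as_byte, as_char);

    // Illustrative sequence (an assumption, not from the original post).
    let dna = "GCATGCAT";
    println!("counted as bytes: {}", count_gc_bytes(dna.as_bytes()));
    println!("counted as chars: {}", count_gc_chars(dna));
}
```

For plain ASCII DNA strings both approaches count the same bases; the character version is arguably easier to read for beginners.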

Step 3 (Prompt): Compare the simple version and the original code

Thank you. This is great! Now we have 2 versions of the code: simple code and the original code. For your reference, here is the simple code:

[paste final version of the simple code here]

And here is the original code we started with:

[paste the original code we couldn't make sense of here]

Make a list of what the original code can do but the simple code cannot do. Usually the simple code is more specific and the complicated code is more general.

Sample Output

I used the template above on the Rust-Bio code and got the following.

Output from steps 1 & 2

// This function calculates the GC content of a DNA sequence
// GC content is the percentage of bases that are either G or C
// The sequence can be a string slice
fn gc_content(sequence: &str) -> f32 {
    // Initialize the length and count variables to zero
    let mut length = 0;
    let mut count = 0;
    // Loop through the sequence
    for base in sequence.chars() {
        // Increment the length by one
        length += 1;
        // Check if the base is G or C (ignoring case)
        if base == 'g' || base == 'c' || base == 'G' || base == 'C' {
            // Increment the count by one
            count += 1;
        }
    }
    // Return the ratio of count to length as a floating point number
    count as f32 / length as f32
}

You can play with this in the Rust Playground. There’s a problem on Rosalind you can solve too.
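If you would rather try the simple version locally, a sketch like the following should work. It repeats the function in compact form and then calls it; the input sequence is made up for illustration. Two things are worth noticing: the function returns a fraction between 0 and 1, so you need to multiply by 100 to get a percentage, and an empty string produces NaN because it divides zero by zero.

```rust
// Compact copy of the simple version above, so the example compiles on its own.
fn gc_content(sequence: &str) -> f32 {
    let mut length = 0;
    let mut count = 0;
    for base in sequence.chars() {
        length += 1;
        if base == 'g' || base == 'c' || base == 'G' || base == 'C' {
            count += 1;
        }
    }
    count as f32 / length as f32
}

fn main() {
    // Illustrative sequence (an assumption): 3 of its 8 bases are G or C.
    let dna = "AGCTATAG";
    let fraction = gc_content(dna);
    println!("GC fraction:   {}", fraction);
    println!("GC percentage: {}", fraction * 100.0);

    // Empty input: 0 / 0 is NaN for f32, so callers may want to check for it.
    println!("empty input is NaN: {}", gc_content("").is_nan());
}
```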

Output from step 3

Here is a list of what the original code can do but the simple code cannot do:

The original code can take any type that implements the IntoIterator<Item = C> trait as the sequence where C represents any type that implements Borrow<u8> trait. The simple code can only take &str as the sequence.
The original code can take a step parameter that allows skipping some bases between each calculation. The simple code does not have this parameter and always calculates the GC content based on every base in the sequence.
The original code uses the fold method to accumulate the length and count variables in a single pass. The simple code uses a for loop and mutates the variables in each iteration.
The original code uses the match expression to check the base against the binary values of G and C. The simple code uses the if expression to check the base against the characters of G and C.
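As a rough illustration of the second point in the list, here is one way the step parameter could be added back to the simple version. The name gc_content_step is hypothetical, and this is only a sketch, not code from Rust-Bio or from the model's output.

```rust
// Hypothetical variant of the simple version that looks at every `step`-th base.
// `step` must be at least 1, because `step_by(0)` panics.
fn gc_content_step(sequence: &str, step: usize) -> f32 {
    let mut length = 0;
    let mut count = 0;
    for base in sequence.chars().step_by(step) {
        length += 1;
        if base == 'g' || base == 'c' || base == 'G' || base == 'C' {
            count += 1;
        }
    }
    count as f32 / length as f32
}

fn main() {
    // Illustrative sequence (an assumption, not from the original post).
    let dna = "GCATGCAT";

    // Step 1 looks at every base, like the simple version.
    println!("step 1: {}", gc_content_step(dna, 1));

    // Step 3 looks only at positions 0, 3 and 6 (G, T, A).
    println!("step 3: {}", gc_content_step(dna, 3));
}
```

The function still only accepts &str, so the first difference in the list (generic input types) remains.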

Writer’s Note: I am not a prompt engineering expert, but after a lot of trial and error I know enough to generate useful output. You can learn more about how to write good prompts here.