How GPT-4 writes tests for me

Trying out GPT-4

Gilad Hoshmand
6 min read · Mar 22, 2023

ChatGPT — everybody is talking about it, everyone has an opinion, and it’s a lot of fun to play around with. But we are not playing here — I want to show a practical use that can help you save time and be more efficient.

I’m using GPT-4 for this article

Motivation

We’ve all seen it before — that project with an ever-falling code coverage percentage — increasingly scary to deploy and about as predictable as North Korea.

Meme: “My boss congratulating me on the latest deployment” / “Me, who has no idea what the code I changed was for”
TFW

This often happens because of a lack of procedure and bad time allocation — a.k.a. not allocating time to write tests.
What if I told you writing unit tests is now a matter of seconds?

Steps:

  1. Paste the code you want tests for into ChatGPT
  2. Copy-paste the answer — it doesn’t compile
  3. Remember the C++ vet from work who said ChatGPT is a millennial pile of garbage that will never replace him
  4. Come back to this blog post — easy, right?

Best practices

Generally speaking, any time you give a Large Language Model (LLM) some code with no specifics, context, or requirements, anything can happen. And I mean anything…

Me: Write tests for this code: <code>
Chat: <code output>
Me: This code does not compile
Chat: You are right, sorry for that, let me fix it for you… <Produces worse code>
Me: This code has more issues than before you “fixed” it
Chat: I’m sorry for that <Produces exact same code>

TL;DR — Start here

What ChatGPT is really good at:

  • Languages & technologies that have been around for a while
  • Languages & technologies that are widely used
  • Standard code with standard usage
  • Utility code that does something specific

Less effective with (results vary):

  • New technologies & libraries — it is only trained on data up to a specific point in time — this is critical to know and pay attention to.
  • Code that has been heavily customized or abused for some hard-to-predict, magic reason — things can get messy
  • Complex code that has complex dependencies

Examples:

The easiest time-saver is generating tests for any utility method that takes some input, does some processing, and produces some output. I’ll take the example from the top-rated JavaScript question on Stack Overflow:

My input: Write tests in Typescript & jest for this code:

export const removeItemOnce = (arr: any[], value: any) => {
  const index = arr.indexOf(value);
  if (index > -1) {
    arr.splice(index, 1);
  }
  return arr;
};

export const removeItemAll = (arr: any[], value: any) => {
  let i = 0;
  while (i < arr.length) {
    if (arr[i] === value) {
      arr.splice(i, 1);
    } else {
      ++i;
    }
  }
  return arr;
};

The answer, delivered in seconds, tests all the “basic functionality”, as ChatGPT calls it.

import { removeItemOnce, removeItemAll } from './util';

describe('removeItemOnce', () => {
  it('should remove the first occurrence of the value from the array', () => {
    const arr = [1, 2, 3, 4];
    const value = 3;
    const expected = [1, 2, 4];
    const result = removeItemOnce(arr, value);
    expect(result).toEqual(expected);
  });

  it('should return the original array if the value is not found', () => {
    const arr = [1, 2, 3, 4];
    const value = 5;
    const result = removeItemOnce(arr, value);
    expect(result).toEqual(arr);
  });
});

describe('removeItemAll', () => {
  it('should remove all occurrences of the value from the array', () => {
    const arr = [1, 2, 3, 4, 3, 5];
    const value = 3;
    const expected = [1, 2, 4, 5];
    const result = removeItemAll(arr, value);
    expect(result).toEqual(expected);
  });

  it('should return the original array if the value is not found', () => {
    const arr = [1, 2, 3, 4];
    const value = 5;
    const result = removeItemAll(arr, value);
    expect(result).toEqual(arr);
  });
});
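To sanity-check the coverage yourself, Jest’s built-in coverage reporter is enough (assuming a standard Jest + ts-jest setup):

npx jest --coverage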

We have 100% line coverage, but we want a little more, so I should have been more descriptive, more specific, and generally less lazy — I should specify that I want edge cases checked too (yes, that’s not obvious to an LLM).

Me: Add some edge case checks

it('should return an empty array if the input array is empty', () => {
  const arr: number[] = [];
  const value = 5;
  const result = removeItemOnce(arr, value);
  expect(result).toEqual([]);
});

it('should return an empty array if the input array contains only the value to be removed', () => {
  const arr = [5];
  const value = 5;
  const result = removeItemOnce(arr, value);
  expect(result).toEqual([]);
});

// more cases that are not needed to prove the point
Berry Nice

OK, we’ve proved a point about specifying what you need rather than just throwing code in and hoping that whatever you’re expecting will actually be tested.
Moving on to more complex scenarios — how does ChatGPT handle dependencies? Well… it depends… and this is where it gets really interesting.

This is the sample ES Module code I’m using:

import { SendMessageCommand, SQSClient } from '@aws-sdk/client-sqs';

const client = new SQSClient({
  region: process.env.AWS_REGION,
  credentials: {
    accessKeyId: process.env.AWS_KEY as string,
    secretAccessKey: process.env.AWS_SECRET as string,
  }
});

export const sendMessage = (msg: string) => client.send(new SendMessageCommand({
  MessageBody: msg,
  QueueUrl: process.env.AWS_QUEUE
}));

Trying to get tests for a Node.js ES Module can break your brain.
ChatGPT will BS its way to an answer which, 99% of the time, won’t even compile.
The reason? The technique and standard are fairly new. At the time of writing, the official Jest docs state that “Jest ships with experimental support for ECMAScript Modules (ESM)”, while OpenAI’s site states that “ChatGPT’s training data cuts off in 2021” — voilà!

The weak spot of any LLM is that it is trained on a fixed set of data. If that data has some kind of issue — it is no longer up to date, or it has become invalid or incorrect over time — then the model becomes unusable for that case, and this is exactly what is happening with Node 19, ES Modules and Jest.
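For reference, here is a rough sketch of the kind of test you end up writing by hand, assuming the module above is saved as sqs.ts and compiled through ts-jest rather than run in Jest’s experimental native ESM mode (which needs node --experimental-vm-modules):

import { SendMessageCommand } from '@aws-sdk/client-sqs';
import { sendMessage } from './sqs';

// Replace the real SQS client with a fake so no AWS call is made.
// jest.mock is hoisted above the imports by ts-jest.
jest.mock('@aws-sdk/client-sqs', () => ({
  SQSClient: jest.fn(() => ({
    send: jest.fn().mockResolvedValue({ MessageId: 'fake-id' }),
  })),
  SendMessageCommand: jest.fn((input: unknown) => ({ input })),
}));

describe('sendMessage', () => {
  it('wraps the message body in a SendMessageCommand', async () => {
    process.env.AWS_QUEUE = 'https://sqs.example.com/my-queue';

    await sendMessage('hello');

    expect(SendMessageCommand as unknown as jest.Mock).toHaveBeenCalledWith(
      expect.objectContaining({ MessageBody: 'hello' }),
    );
  });
});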

We’ll go with Python, then…
Me: Write tests for the following Python code

from external_dependency import some_util
from another_external_dependency import method_that_has_a_side_effect

def do_some_action(obj: object) -> object:
    return some_util(obj)

def do_some_other_action(obj: object):
    ans = do_some_action(obj)
    result = method_that_has_a_side_effect(ans)
    return result

ChatGPT:

import unittest
from unittest.mock import MagicMock, patch
from your_module import do_some_action, do_some_other_action

class TestYourModule(unittest.TestCase):

    def test_do_some_action(self):
        mock_obj = MagicMock()
        with patch("your_module.some_util") as mock_some_util:
            mock_some_util.return_value = "expected_value"
            result = do_some_action(mock_obj)
            self.assertEqual(result, "expected_value")
            mock_some_util.assert_called_once_with(mock_obj)

    def test_do_some_other_action(self):
        mock_obj = MagicMock()
        mock_ans = MagicMock()
        mock_result = MagicMock()

        with patch("your_module.do_some_action") as mock_do_some_action, \
             patch("your_module.method_that_has_a_side_effect") as mock_method_that_has_a_side_effect:

            mock_do_some_action.return_value = mock_ans
            mock_method_that_has_a_side_effect.return_value = mock_result

            result = do_some_other_action(mock_obj)

            self.assertEqual(result, mock_result)
            mock_do_some_action.assert_called_once_with(mock_obj)
            mock_method_that_has_a_side_effect.assert_called_once_with(mock_ans)

if __name__ == "__main__":
    unittest.main()

Easy, right? Mocks and all!
That is mainly because this has been done so many times before — standard tests for standard use cases. The data the LLM was trained on contains many matching examples (tokens), giving it the “confidence” for a correct prediction, and that made the job easy — 10 seconds.
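Assuming the code under test is saved as your_module.py (the name ChatGPT guessed) and the tests as test_your_module.py, the suite runs with Python’s standard test runner:

python -m unittest test_your_module -v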

More time to spoil your loved ones

Main takeaways

  • Before you start, you should ideally have an understanding of what you want to test, in order to provide clear instructions and be able to evaluate the response
  • Use a modular approach — break the code down into modules that can be handled separately by ChatGPT
  • The input you supply should have all the information needed for the tests you want the LLM to output — nothing less, nothing more
  • Be precise — excess “garbage” input means you are more likely to get a “noisy” output
  • The source of the answer is unknown — always make sure you understand the code

Tips

  • Don’t fight it! ChatGPT is just not great at some things
  • Its data may not be up to date — the latest topics will be missing
  • The implementation may not follow best practice, or even the latest recommended practice — quality is up to you
  • AI is your aid; it’s not replacing you just yet (:
