Dirty Coding: a Brief Overview of Coding with LLMs

Byron Allen
Contino Engineering
10 min read · Jul 12, 2023


Much has been said about the creative embers kindled with Generative AI (GenAI) — and more specifically the latest Large Language Models (LLMs). One such ember is:

When an unexpected glitch occurs in CodeWhisperer’s system, ChatGPT steps in to help, offering to support the AI in a complex coding challenge at an international coding symposium, despite CodeWhisperer’s lack of experience in this “dirty coding” realm.

DALL·E 2 prompt: two people on hands and knees facing each other like in ‘dirty dancing’ the movie

Embers like this, however bright they glow, represent a fictional response. Here we see creative variation that can be overwhelmingly convincing and captivating — sometimes even terrifying. This is one side of the coin that is an LLM response.

What’s on the other side? A factual, actionable output.

Bear in mind, though, that this assumes facts are being produced. And in case you didn’t know, LLMs have no concept (or rather, I should say construct!) with which to classify truth. That is why you should take a healthy, sceptical view of their output.

In other words, you can use LLMs to generate fiction for sure, but not necessarily fact for sure. That said, there is one domain where GenAI is increasingly becoming less fiction and more fact — coding.

I’ll walk through a brief experiment I ran to compare three LLMs that generate code and cover the pros and cons I observed. However, something that becomes quite apparent is that even if the focus is on creating factual and actionable outputs, they can still be subject to preference, which can be a bit messy, requiring experience to navigate.

A Brief Experiment of My Own

I took some inspiration from Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT, which is worth a read.

The highlight for me is that ChatGPT gets code correct out of the box more frequently than its peers (in this study). However, the time taken to correct the generated code (i.e. the ‘technical debt’ in this scenario) is higher with ChatGPT. It’s great to see academic research like this, as it complements what companies such as Duolingo have said about the benefits of using Copilot.

That said, I like running my own experiments — even if they are quick and dirty.

Doing your own reviews of these tools is critically important given the high-speed evolution of GenAI from one day to the next (e.g. the announcement of OpenAI’s Code Interpreter going to beta), and the highly contextual nature of the value they deliver to you. I would encourage you, dear reader, to do the same.

In this post, I highlight my experiments with OpenAI ChatGPT, AWS CodeWhisperer and Google Cloud Codey. Notably, I think there is an opportunity to use these tools together, leveraging each one’s strengths to offset its counterparts’ weaknesses, and using multiple models for validation.

Sound a touch messy? Welcome to dirty coding…

Comparing Three LLMs

If you’re going to compare machine learning (ML) models, you need consistent tests across all cases, i.e. the same set of prompts used across all LLMs. I used the following zero-shot prompts to get the best sense of how these models work out of the box:

Create a function that moves a file to folder.

Create a function that reads two data frames and returns a new data frame with the two data frames merged.

Create a function that uploads a file to a S3 bucket.

Let’s get into the output.

OpenAI ChatGPT (GPT-4)

# create a function that moves file to folder

import os
import shutil

def move_file_to_folder(file_path, destination_folder):
    try:
        if not os.path.isfile(file_path):
            print(f"{file_path} does not exist.")
            return
        if not os.path.isdir(destination_folder):
            os.makedirs(destination_folder)
        shutil.move(file_path, destination_folder)
        print(f"Successfully moved {file_path} to {destination_folder}.")
    except Exception as e:
        print(f"An error occurred while moving {file_path} to {destination_folder}. Details: {str(e)}")

# Example usage:
# move_file_to_folder('path_to_your_file', 'path_to_destination_folder')




# create a function that reads two data frames and returns a new data frame with the two data frames merged
import pandas as pd

def combine_dataframes(df1_path, df2_path):
    try:
        # Read dataframes from CSV files
        df1 = pd.read_csv(df1_path)
        df2 = pd.read_csv(df2_path)

        # Concatenate dataframes vertically
        combined_df = pd.concat([df1, df2])

        # Reset index for the new combined dataframe
        combined_df.reset_index(drop=True, inplace=True)

        return combined_df
    except Exception as e:
        print(f"An error occurred: {str(e)}")

# Example usage:
# combined_df = combine_dataframes('path_to_your_first_dataframe.csv', 'path_to_your_second_dataframe.csv')




# create a function that uploads a file to a S3 bucket
import boto3

def upload_to_s3(bucket_name, file_name, object_name=None):
    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Instantiate the S3 client
    s3_client = boto3.client('s3')

    try:
        response = s3_client.upload_file(file_name, bucket_name, object_name)
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return False
    return True

# Example usage:
# upload_to_s3('your_bucket_name', 'path_to_your_file')

It’s hard to fault what you get out of the box with GPT-4. Overall, I felt this code was good. For example, in the first prompt, it caters for a non-existent file path and attempts to implement exception handling. It even gives an example usage comment.

For the second prompt, the output chooses to combine the two dataframes through a union (to use SQL lingo), and then very handily follows that up by resetting the index. Very nice, but what if I wanted to join my dataframes instead? You could use the pd.concat() function, but how you would do so isn’t really clear.

This highlights the need to be more specific with word choices in the prompt, which I was not.
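
To make the distinction concrete, here is a minimal sketch of the two readings of ‘merged’ (my own illustration, not model output; the dataframes and column names are made up): stacking rows as a union, which is what the generated code does, versus matching rows on a shared key, which is what I would call a join.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'show': ['A', 'B']})
df2 = pd.DataFrame({'id': [2, 3], 'network': ['X', 'Y']})

# 'Merged' read as a union: stack the rows and reset the index (what GPT-4 produced)
stacked = pd.concat([df1, df2]).reset_index(drop=True)

# 'Merged' read as a join: match rows on a shared key
joined = pd.merge(df1, df2, on='id', how='inner')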

Verdict? Very nice, but no chef kiss.

AWS CodeWhisperer

# create a function that moves file to folder
def move_file(file_name, folder_name):

    # 2.) check if folder exists
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)

    # 3.) move file to folder
    os.replace(file_name, folder_name)

############################
# CODE WHISPERER FAILED HERE
# why? it may have interpreted the word 'to' literally, where 'into' was intended
############################




# create a function that reads two data frames and returns a new data frame with the two data frames merged
def merge_data_frames(df1, df2):
    return pd.merge(df1, df2)




# create a function that uploads a file to a S3 bucket
def upload_file_to_s3(file_name, bucket):
    s3_client = boto3.client('s3')
    response = s3_client.upload_file(file_name, bucket, file_name)
    return response

One of the first things you may notice is the brevity of the code. Like Copilot, CodeWhisperer integrates into VSCode, effectively providing an auto-complete function while you type code, hence the less robust response. Its intended use doesn’t appear to promote the kind of extensive output that GPT-4 produces.

In prompt 2, I like that the response provides an alternative to pd.concat(). It shows a fundamental reality of coding: there is often more than one solution to a problem.

The selected pd.merge() serves as a join but cannot serve as a union, so to some extent it’s less versatile than pd.concat(). I say ‘versatile’ rather than ‘appropriate’ because pd.concat() can be considered an unsafe operation (particularly for a join).

Verdict? I don’t think CodeWhisperer is a good starting point for a project. As the complexity of a prompt goes up, CodeWhisperer becomes a less viable solution (for now). By design, its functional mode is that of a high-end code autocomplete, which I can see being useful while in the thick of coding, facing a mental block or drafting code comments.

Worth noting, it is great to see integration into VSCode (even if a touch clunky at times).

Google Cloud Codey (PaLM 2 Bison)

# create a function that moves file to folder
def move_file(file_name, new_folder):
    """Moves a file to a new folder.

    Args:
        file_name: The name of the file to move.
        new_folder: The path to the new folder.
    """

    # Get the current directory.
    current_directory = os.getcwd()

    # Move the file to the new folder.
    shutil.move(os.path.join(current_directory, file_name), new_folder)




# create a function that reads two data frames and returns a new data frame with the two data frames merged
def merge_data_frames(df1, df2):
    """Merges two data frames into a new data frame.

    Args:
        df1: The first data frame to merge.
        df2: The second data frame to merge.

    Returns:
        The merged data frame.
    """

    # Merge the data frames on the common column.
    merged_df = pd.merge(df1, df2, on='common_column')

    # Return the merged data frame.
    return merged_df




# create a function that uploads a file to a S3 bucket
def upload_file_to_s3(file_name, bucket_name, key_name):

    # Create a boto3 client
    s3 = boto3.client('s3')

    # Open the file to be uploaded
    with open(file_name, 'rb') as f:
        data = f.read()

    # Upload the file to S3
    s3.put_object(
        Bucket=bucket_name,
        Key=key_name,
        Body=data,
    )

Doc strings! I didn’t expect to see them in each function, but they are there and I feel this is a surprisingly good result. I hope they iteratively improve on this by expanding the variety of best practices, like this one, that could (and should) be present by default.

For prompt 2, it’s nice to see that more parameters were used in the function pd.merge(), which I think needs to be expanded on if they want this kind of technology to be educational. To me this point highlights the sentiment that this technology class is great if you’re currently a skilled developer, but if you’re not, I’m not convinced this type of technology is a good educational tool… which it could be.
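
As a rough sketch of what I mean (my own code, not Codey’s output), a version of merge_data_frames with type hints, explicit join parameters and a slightly fuller docstring might look something like this:

import pandas as pd

def merge_data_frames(df1: pd.DataFrame, df2: pd.DataFrame, on: str, how: str = 'inner') -> pd.DataFrame:
    """Merges two data frames on a shared key column.

    Args:
        df1: The left data frame.
        df2: The right data frame.
        on: The column to join on; it must exist in both frames.
        how: The join type ('inner', 'left', 'right' or 'outer').

    Returns:
        The merged data frame.

    Raises:
        KeyError: If the key column is missing from either frame.
    """
    if on not in df1.columns or on not in df2.columns:
        raise KeyError(f"Column '{on}' must exist in both data frames.")
    return pd.merge(df1, df2, on=on, how=how)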

Prompt 3 was written quite differently to the other two. I assume that might be due to the data the model was trained on, which may be skewed towards Google Cloud dependencies (an assumption I have not validated).
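
For comparison, here is a small side-by-side of the two upload styles (my own sketch; the bucket and file names are placeholders): Codey’s put_object reads the whole file into memory first, while the upload_file call that GPT-4 and CodeWhisperer chose hands the transfer to boto3’s managed uploader, which can handle multipart uploads for large files.

import boto3

s3 = boto3.client('s3')

# Codey's style: read the file into memory, then put_object
with open('report.csv', 'rb') as f:
    s3.put_object(Bucket='my-bucket', Key='report.csv', Body=f.read())

# GPT-4 / CodeWhisperer style: a managed transfer via upload_file
s3.upload_file('report.csv', 'my-bucket', 'report.csv')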

Chef kiss for doc strings!

What is the Benefit by the Numbers?

The previous section represents a qualitative analysis. For those more interested in a quantitative analysis, again, the research paper I mentioned is good, but I have also added this section.

Before I get into my own brief quantitative analysis, I want to reference some metrics from another entity, Duolingo, which has some nice figures that I believe are worth pointing out:

25% increase in developer speed with GitHub Copilot

1min setup time for largest repo with Codespaces

67% decrease in median code review turnaround time

70% increase in pull requests

In my personal experience, I spent some time developing a mobile application using the Flutter framework (beyond the simple 3-prompt test above).

I’m not a frontend or mobile developer, and I’ve only used Flutter once (more than 2 years ago). With that in mind, I knew how long it would take me to make this application (roughly 8 hours). However, in about an hour or two, I was able to create it using few-shot learning in ChatGPT. The app worked with minimal adjustments required to the code.

From that perspective, I have it in my mind that this technology cuts development time by somewhere around 4x to 8x. Truly a substantial gain, though probably only achievable because I am an experienced developer and was not working in a dev-test-prod environment, which I think helps contextualise the metrics that Copilot highlights.

Conclusion

GenAI is having a great moment, and LLMs for code development are certainly part of that magic of 2023.

It is not without hype though. The above outputs, although wonderful, are also temperamental, only partially include best practices and realistically require some prior experience in the development and implementation of such code.

I think we will enter an era of ‘Dirty Coding’ (if we’re not already there). In particular, I think there is a need to leverage what we see in ChatGPT and Codey — large veins of ‘well’ structured code alongside AI-enabled code-complete like what we see in CodeWhisperer. I believe this will help us work towards a more robust form of development, particularly if we start to encourage those who are new to development to pick up these AI tools.

On that note, these technologies are not far enough down the road to justify rash hiring decisions before they have been thoroughly vetted in existing teams. It would be wise to test and learn as a form of evolving with these new tools, as they will undoubtedly evolve quickly over the next several years. There will be extension hacks like Code GPT, which embeds OpenAI’s API more closely into VSCode, albeit without auto-rendering code. Moreover, there will be developments in models, like the jump from GPT-3 to GPT-4, and there may be legal action that changes what can be used.

So, let the dirty coding commence!

OpenAI’s Code Interpreter

At the time of drafting this, OpenAI announced the beta release of Code Interpreter. How does it impact what I’ve mentioned above? First, let me take you through the quick experiment I ran. I used the prompt ‘Create a faux dataset about tv shows. Add it to a pandas dataframe’, which generated:

fig, axs = plt.subplots(2, 2, figsize=(20, 20))

# Distribution of shows by Genre
sns.countplot(data=df, y='Genre', ax=axs[0, 0])
axs[0, 0].set_title('Distribution of shows by Genre')

# A timeline of the number of shows that first aired each year
sns.histplot(data=df, x='First_Aired', bins=30, ax=axs[0, 1])
axs[0, 1].set_title('Timeline of the number of shows that first aired each year')

# A bar chart showing the number of shows each network has
sns.countplot(data=df, y='Network', ax=axs[1, 0])
axs[1, 0].set_title('Number of shows each network has')

# The average number of seasons by Genre
df.groupby('Genre')['Num_Seasons'].mean().plot(kind='barh', ax=axs[1, 1])
axs[1, 1].set_title('Average number of seasons by Genre')

plt.tight_layout()
plt.show()
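
For context, the plotting code above assumes a dataframe df with Genre, First_Aired, Network and Num_Seasons columns. A minimal faux version of my own (Code Interpreter’s actual dataset isn’t reproduced here) would look something like this:

import pandas as pd

df = pd.DataFrame({
    'Title': ['Show A', 'Show B', 'Show C', 'Show D'],
    'Genre': ['Drama', 'Comedy', 'Drama', 'Sci-Fi'],
    'Network': ['Net 1', 'Net 2', 'Net 1', 'Net 3'],
    'First_Aired': [2001, 2010, 2015, 2019],
    'Num_Seasons': [5, 3, 2, 4],
})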

I expected Code Interpreter to render the visualisations per the marketing that OpenAI has promoted, but it didn’t until I input the above as plaintext like so:

please plot the following

import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(2, 2, figsize=(20, 20))

# Distribution of shows by Genre
sns.countplot(data=df, y='Genre', ax=axs[0, 0])
axs[0, 0].set_title('Distribution of shows by Genre')

# A timeline of the number of shows that first aired each year
sns.histplot(data=df, x='First_Aired', bins=30, ax=axs[0, 1])
axs[0, 1].set_title('Timeline of the number of shows that first aired each year')

# A bar chart showing the number of shows each network has
sns.countplot(data=df, y='Network', ax=axs[1, 0])
axs[1, 0].set_title('Number of shows each network has')

# The average number of seasons by Genre
df.groupby('Genre')['Num_Seasons'].mean().plot(kind='barh', ax=axs[1, 1])
axs[1, 1].set_title('Average number of seasons by Genre')

plt.tight_layout()
plt.show()

Resulting in…

So, how does it impact what I’ve mentioned above? It doesn’t… not until OpenAI comes out with their own IDE 👀

Like what I write? Follow me here on Medium.

Want to learn how I can help your business with GenAI? Reach out to me on LinkedIn.
