AI Rubber Ducking

You've got a new partner in coding

Giancarlo Niccolai
The Elegant Code
13 min read · Nov 20, 2023


Developers are starting to integrate AI tools into their workflows, but they’re often skeptical about how much AI can improve their daily experience as a whole. In this article, I’ll introduce Generative AI as an on-demand addition to the team.

One of the critical points in a software developer’s work is the so-called “white page syndrome”. It strikes at the start of a project, or even whenever a new source file is created: the editor window is empty, and that makes the effort of starting to write code even harder. I guess professional writers face a similar problem.

Other times, when the code gets tangled and difficult, “writing the last straw”, that is, writing the final lines of code, can seem like an overwhelming task: will they work in harmony with the 999 lines preceding them, or will the new line I am about to write break the delicate equilibrium I have achieved so far? That question sits constantly in the back of any developer’s mind, and as the work progresses it becomes more and more pressing.

Finally, even when everything goes smoothly, the “perfection bias” sets in. Even when we know we are working towards a solution, the doubt creeps in that it might not be the best way to solve the problem. What if down the line we need to add parameters to an already complex function? What if the scope of the problem changes and we are no longer dealing with operations per second, but per microsecond? Isn’t there a better way to get this code done?

To address these and other setbacks that can stifle the productivity of even a seasoned professional, the literature, industry practice and companies have developed several methods, such as planning meetings, daily scrums and pair programming; but one technique is taught to every developer as a fallback for when everything else fails or, worse, is not available: rubber ducking.

Briefly, it consists of voicing out loud your ideas, doubts and thoughts about possible alternatives and problems down the line. It’s most effective if you imagine addressing someone who is really eager to listen to you… like a physical rubber duck on your desk.

It is a surprisingly effective technique, and it already helps in moving past the white page, last straw and perfection bias problems (among others). But… what if the rubber duck could answer back?

We can (almost) programmatically use Generative AI to help us in our daily job in the following domains:

  • Exploration: helping us to analyse the problem domain and devise a solid plan.
  • Advice: offering guidance during implementation and potentially reducing the code we write manually.
  • Review: checking our code to identify potential areas for improvement (as well as plain old bugs).

To show how Generative AI can help in these tasks, we’ll follow Franco, a junior developer in charge of writing a web page fetcher (spider).

The project

Franco’s task is to write a web spider. Its purpose is to download updated official documents, such as parliamentary acts, from web pages whose URLs are stored in a configuration file. The PDF documents should be saved in a way that allows their identification (i.e. which governmental entity they come from), and any update should be notified via e-mail for review by other employees.

Franco is rather confused: he is a junior developer and, as usual, has been given far less information than a junior developer would need to figure out the best way to bring this small project home. Luckily, though, Franco has a super rubber duck as an additional team member.

Exploration

So, Franco asks a Generative AI how best to address the problem, using this prompt:

I need to write a spider program fetching web pages; the program should download PDF documents linked in the pages, and alert a given mail address when some of the documents change. Can you suggest which commonly used programming language would be the most recommended — use the size and complexity of a program with this requirement as the criterion.

As you can see, the description of the problem is very high level, which allows the AI to explore a vast solution space. A possible answer is…


For a web scraping program that downloads PDF documents and
sends email alerts, Python is a highly recommended programming
language due to its several advantages:

Ease of Use and Readability: Python's syntax is simple and
straightforward, making it easy to learn and use, even for
beginners. This can significantly reduce the development time
and effort required.

... and other fluff...

Franco is very happy, as he knows Python very well, but he has never used it to scrape web pages or send e-mails. To get a better idea of how to organise the work, he asks:

Once I download the PDFs, I am required to store them in a way that references the site they come from and the update date. Consider that the sites I have to scrape come from a configuration file. How would you go about it?

Our friendly rubber duck gives Franco a range of suggestions (cutting fluff text with “…” for brevity):

To store the downloaded PDFs in a way that references the site they 
come from and the update date, you can utilize a combination of file
naming conventions and a database or metadata file. Here's a
step-by-step approach:

1. File Naming Conventions:
Site Identification: Incorporate the website's domain ...
Update Date: Include the update date in the filename ...
For instance, if the PDF was updated on 2023-11-20,
the filename could be "example.com_document_2023-11-20.pdf."

2. Metadata Storage:
Database: Utilize a database like SQLite or PostgreSQL ...
Metadata File: Alternatively, you can create a separate JSON or YAML file...

3. Configuration File:
Site List: Maintain a configuration file that lists all the websites
you need to scrape...
Update Frequency: Specify the update frequency for each website...
...

Franco infers from the product owner’s body language that they would prefer the directory-structure approach, which would allow them to find the desired documents at a glance, copy them from a network server and store them locally on their messy desktop without losing track of where each document comes from and since when; the AI has also provided detailed suggestions on how to achieve that.
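In concrete terms, this decision boils down to a per-site directory plus a date-stamped filename. Expressed in Python, the language recommended above, a minimal sketch of a path-building helper could look like the following; build_pdf_path is a name invented here for illustration, and the full script shown later in this article builds its paths inline in the same way:

import os
from datetime import datetime

SAVE_DIRECTORY = "pdfs"

def build_pdf_path(site_name, document_name):
    """Return pdfs/<site>/<site>_<YYYYMMDD>_<document> for today's download."""
    date_str = datetime.now().strftime("%Y%m%d")
    directory = os.path.join(SAVE_DIRECTORY, site_name)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, f"{site_name}_{date_str}_{document_name}")

# e.g. build_pdf_path("senate", "act_123.pdf")
#      -> "pdfs/senate/senate_20231120_act_123.pdf" on 2023-11-20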

Despite Franco’s expertise with Python, he has never used it to scrape a site or send an e-mail, and thus he seeks help to minimise the time spent researching:

Ok, let’s do it with Python. What modules would you suggest for the job?

To create a spider program in Python that can fetch web pages, 
download PDF documents, monitor for changes, and alert a given
email address, you can use the following Python modules:

1. Requests: For fetching web pages over HTTP.
2. Beautiful Soup: For parsing HTML and extracting links.
3. PyPDF2 or PDFMiner: For handling PDF files and possibly checking for content changes.
4. hashlib: For creating hash signatures of the PDFs to easily detect changes.
5. smtplib: For sending email notifications.

...

Here's a brief overview of how these modules could fit into the workflow
of your spider program:

...

You might also want to consider using a more comprehensive
framework like Scrapy, which is specifically designed for web
scraping and crawling.


...
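To make the division of labour among these modules concrete before moving on, here is a minimal sketch of my own (not part of the reply above) of how they might fit together for a single page; the URL is a placeholder and there is no error handling:

import hashlib
import requests
from bs4 import BeautifulSoup

page_url = "https://www.example.gov/acts"  # placeholder URL

# Fetch the page and collect every link that ends in ".pdf".
html = requests.get(page_url).text
soup = BeautifulSoup(html, "html.parser")
pdf_links = [requests.compat.urljoin(page_url, a["href"])
             for a in soup.find_all("a", href=True)
             if a["href"].endswith(".pdf")]

# Download each PDF and compute a hash that can later be compared
# against the previously stored copy to detect changes.
for pdf_url in pdf_links:
    content = requests.get(pdf_url).content
    print(pdf_url, hashlib.sha256(content).hexdigest())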

Franco takes note of the closing suggestion about Scrapy. Learning one module looks like less work than learning five and, always in search of ways to do less work, he asks:

Suppose I’d want to try out this Scrapy. How would that work?

Scrapy is an open-source and collaborative framework for extracting
the data you need from websites. It is built on top of Twisted,
an asynchronous networking library, which allows it to handle a
large amount of data and requests simultaneously. Here's a basic
rundown on how you would set up a Scrapy project to scrape PDF files,
store them with references to the source site, and maintain update dates

... follows a powerful but complex setup ...
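Just for orientation, a bare-bones Scrapy spider for this kind of job could look roughly like the sketch below; the class name, start URL and output directory are invented for the example, and a real setup would also involve a generated project, settings and pipelines, as the reply describes:

import os
import scrapy

class PdfSpider(scrapy.Spider):
    name = "pdf_spider"
    # Placeholder start URL; in practice these would come from the configuration file.
    start_urls = ["https://www.example.gov/acts"]

    def parse(self, response):
        # Follow every link that points to a PDF document.
        for href in response.css("a::attr(href)").getall():
            if href.endswith(".pdf"):
                yield response.follow(href, callback=self.save_pdf)

    def save_pdf(self, response):
        # Store the document under a directory named after the source host.
        directory = os.path.join("pdfs", response.url.split("/")[2])
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, response.url.split("/")[-1])
        with open(path, "wb") as f:
            f.write(response.body)

A self-contained file like this can be run with the scrapy runspider command, without setting up a full project.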

After reading through all the steps needed to set up this framework, Franco has one last question:

My requirements are very simple, and this spider would just be run manually when necessary, downloading a couple of PDFs (at most, if they are changed) from 10–20 sites. Would you recommend using scrapy nevertheless, or wouldn’t it be faster to just use Requests to get the sites manually?

To which, the AI rubber duck replies:

For a simple task of manually running a script to download a few PDFs 
from 10-20 sites, using Scrapy might be overkill. Scrapy is more
suited to larger-scale scraping operations where you need to ...

... do a lot more stuff ...

So, after all, that wouldn’t be less work.

Armed with this information about the best way to get his project done, Franco now has a plan and knows what specific knowledge he needs to acquire.

With this, Franco overcomes his white page syndrome and begins coding.

Advice

Over the last four hours, Franco has happily coded away at a solution he thinks is valid, and everything looks like it should work once tried out. But deep in the code, he discovers a flaw in his logic.

He has decided to save the PDFs with a prefix made of the site name, taken from the configuration, and the date on which the document is downloaded.

If there isn’t a document with that name, or if there is one but it has a different hash, the downloaded PDF is saved. But then he realises that, since the date is part of the name, this way he will save a new file every day, whether the previous one is different or not!

How to fix this problem? Let’s rubber duck it with Franco:

I wrote a save_pdf function that saves the downloaded pdf if there isn’t a file with that name, or if there is already a file with that name but with a different hash. But… I want to avoid saving the file if the latest one already has the same hash!

To which, the AI replies:

To prevent saving a PDF if the latest one in the directory
has the same hash, you need to:

1. Determine the most recent PDF file in the directory.
2. Compute its hash and compare it to the hash of the new PDF content.
3. Save the new PDF only if the hashes are different.
Here's a modified save_pdf function that implements this logic:

With code that goes more or less like this:

def get_latest_pdf_file(directory):
    """Get the most recent PDF file in the given directory."""
    pdf_files = [f for f in os.listdir(directory) if f.endswith('.pdf')]
    if not pdf_files:
        return None
    latest_file = max(pdf_files, key=lambda x: os.path.getmtime(os.path.join(directory, x)))
    return os.path.join(directory, latest_file)

def get_pdf_hash(pdf_content):
    """Calculate the hash of the given PDF content."""
    return hashlib.sha256(pdf_content).hexdigest()

def save_pdf(pdf_content, source_url, directory='pdfs'):
    """Save the PDF only if there is no file with the same hash."""

    # ... this goes more or less as Franco wrote it already ...

With this, Franco is able to overcome the “last straw” block: knowing what he has to do and how to do it, he can simply fix his little mistake without the additional burden of also having to think about how to find the most recent file, which would require a push-pop sequence in his mind that might disrupt his coding flow.

However, another conundrum strikes Franco: where to perform this check?

Should the save_pdf() function perform the check? Or would it be more elegant to avoid calling it altogether when there’s nothing new to save?

It’s a trivial matter, and for such a small project it makes no real difference, but Franco wants to make a good impression on the code reviewers, and the doubt keeps nagging at him as he tries to focus on the rest of the code. His “perfection bias” becomes a distraction, compelling him to seek clarity through further rubber ducking:

I was thinking… wouldn’t it be more elegant to simply skip calling save_pdf at all when the last pdf didn’t change? On the other hand, save_pdf is already in the process computing the name, opening the directory, checking for the file to be there… what do you think?

Indeed, it would be more elegant and efficient to determine
whether to call save_pdf based on the hash comparison
beforehand. This way, you avoid unnecessary calls to save_pdf
if the latest PDF has not changed, which keeps the logic for
saving separate from the logic for change detection.

Here's how you could refactor the code:

1. Create a function to compute the hash of the latest PDF file in
the directory.
2. Before calling save_pdf, compare the hash of the new PDF
content with the hash of the latest PDF file.
3. Call save_pdf only if the hashes are different or
if there is no latest file.

Here's an example refactoring based on this approach:

And the code generated looks like this:

def get_pdf_hash(pdf_content):
    """Calculate the hash of the given PDF content."""
    return hashlib.sha256(pdf_content).hexdigest()

def get_latest_pdf_hash(directory):
    """Get the hash of the latest PDF file in the given directory."""
    pdf_files = [f for f in os.listdir(directory) if f.endswith('.pdf')]
    if not pdf_files:
        return None  # No PDF files in the directory
    latest_file = max(pdf_files, key=lambda x: os.path.getmtime(os.path.join(directory, x)))
    latest_file_path = os.path.join(directory, latest_file)
    with open(latest_file_path, 'rb') as f:
        latest_pdf_content = f.read()
    return get_pdf_hash(latest_pdf_content)

# ... and more glue code to make it work ...

This resolves the “perfection bias” afflicting Franco. He now knows that it would really be better to avoid calling save_pdf() unless strictly necessary, and the AI rubber duck provided him with a simple means to do it.

Franco is able to code away a working and, for a simple script written in a couple of hours, reasonably elegant solution that he’s proud to present to the reviewers… or is he?

Review

After several hours of intensive coding, Franco has a working prototype, something that looks like this:

import requests
from bs4 import BeautifulSoup
import hashlib
import os
import sys
from datetime import datetime

SAVE_DIRECTORY = "pdfs"


def download_pdf(url, session):
    response = session.get(url)
    pdf_content = response.content
    return pdf_content


def save_pdf(pdf_content, source_url, site_name):
    # should include ".pdf" by construction, as we get only "*.pdf" links
    target_pdf = source_url.split("//")[1].split("/")[-1]
    date_str = datetime.now().strftime("%Y%m%d")
    directory = os.path.join(SAVE_DIRECTORY, site_name)
    os.makedirs(directory, exist_ok=True)
    filename = f"{site_name}_{date_str}_{target_pdf}"
    filepath = os.path.join(directory, filename)
    with open(filepath, 'wb') as f:
        f.write(pdf_content)
    print(f'Saved PDF from {source_url} as {filepath}')
    return target_pdf


def load_sites(file_path):
    config_list = []
    with open(file_path, 'r') as file:
        for line in file:
            # Split each line into logical name and URL.
            parts = line.strip().split(' ')
            if len(parts) == 2:
                config_list.append((parts[0], parts[1]))
            else:
                print(f"Skipping invalid line: {line}")
    return config_list


def get_pdf_hash(pdf_content):
    """Calculate the hash of the given PDF content."""
    return hashlib.sha256(pdf_content).hexdigest()


def get_latest_pdf_hash(directory):
    """Get the hash of the latest PDF file in the given directory."""
    if not os.path.exists(directory):
        return None

    pdf_files = [f for f in os.listdir(directory) if f.endswith('.pdf')]
    if not pdf_files:
        return None  # No PDF files in the directory
    latest_file = max(pdf_files, key=lambda x: os.path.getmtime(os.path.join(directory, x)))
    latest_file_path = os.path.join(directory, latest_file)
    with open(latest_file_path, 'rb') as f:
        latest_pdf_content = f.read()
    return get_pdf_hash(latest_pdf_content)


def send_alerts(changed_files):
    print("Sending an EMAIL is left as an exercise to the reader.")
    for f in changed_files:
        print("Changed file: " + f)


def main():
    changed_files = []
    sites = load_sites(sys.argv[1])
    session = requests.Session()
    for site_name, site in sites:
        response = session.get(site)
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href.endswith('.pdf'):
                pdf_url = requests.compat.urljoin(site, href)
                pdf_content = download_pdf(pdf_url, session)
                pdf_hash = get_pdf_hash(pdf_content)
                if pdf_hash != get_latest_pdf_hash(os.path.join(SAVE_DIRECTORY, site_name)):
                    changed_files.append(save_pdf(pdf_content, pdf_url, site_name))

    if changed_files:
        send_alerts(changed_files)


if __name__ == '__main__':
    main()
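As a side note, load_sites expects a plain-text configuration file with a logical site name and a URL separated by a space on each line, for instance senate https://www.example.gov/acts (a made-up entry). And since send_alerts deliberately leaves e-mail delivery as an exercise, here is one possible sketch of how it could be done with smtplib, the module suggested earlier; the SMTP server, credentials and addresses are placeholders:

import smtplib
from email.message import EmailMessage

def send_alerts(changed_files):
    # Build a simple text message listing the changed documents.
    msg = EmailMessage()
    msg["Subject"] = "Updated documents detected"
    msg["From"] = "spider@example.org"     # placeholder sender
    msg["To"] = "reviewers@example.org"    # placeholder recipient
    msg.set_content("Changed files:\n" + "\n".join(changed_files))

    # Placeholder SMTP server; TLS and credential handling depend on the mail provider.
    with smtplib.SMTP("smtp.example.org", 587) as server:
        server.starttls()
        server.login("spider@example.org", "app-password")  # placeholder credentials
        server.send_message(msg)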

Franco thinks the code looks good enough for the task at hand, but before submitting it for review to his senior developers, he wants to know whether he has missed anything important.

The rubber ducking process for the review phase might go as follows:

Ok, this is my complete work. Would you mind checking if this file has any evident errors, or could be improved in any way?

Followed by the whole of the above program cut and pasted into the prompt.

Our rubber duck partner has some ideas on what would make this better (again, I am just showing the most relevant part of the reply):

Your script appears well-structured and covers the 
functional requirements you've mentioned. However,
there are a few points that could be improved or require attention:

1. Error Handling: There's no error handling for network requests ...
2. Session Timeout: ...
3. PDF File Naming: The save_pdf function assumes that the last part of the
URL (after the final '/') is a valid filename for the PDF. ...
4. Content-Disposition Header: Some PDFs might be served with a
Content-Disposition header that suggests a filename. ...
5. Redundant Comments: Comments like # should include ".pdf" by construction,
as we get only "*.pdf" links could be removed if the code's
behavior is clear enough.
6. Logging: Instead of printing messages to stdout, ...
7. File Hashing Efficiency: For large PDFs, consider streaming ...
8. Script Configuration: Consider using a command-line argument parser ...

Here is a slightly improved version with some of these points addressed:

... an improved version follows ...
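To make the first two points concrete, a sketch of download_pdf with basic error handling and a timeout might look like the following; this is my illustration, not the AI’s actual improved version:

import requests

def download_pdf(url, session):
    try:
        # A timeout avoids hanging indefinitely on an unresponsive site.
        response = session.get(url, timeout=30)
        response.raise_for_status()
        return response.content
    except requests.exceptions.RequestException as exc:
        # Log the failure and skip this document instead of crashing the whole run.
        print(f"Failed to download {url}: {exc}")
        return None

The caller would then simply skip any document for which None is returned.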

And with this, Franco is able to turn his effort into a professional-looking small program.

Luckily, on one hand, Generative AI models have become sophisticated enough to accept prompts as large as small programs, and on the other, Franco’s code wasn’t that long to begin with; however, it’s entirely possible that a whole source file won’t fit in a prompt.

In that case, we can still ask our rubber duck partner to review the code one function, method or slice at a time. A complete code unit should fit comfortably into a modern GenAI prompt; if it doesn’t, I would consider that a “code smell” in itself. As such, it should be possible to feed any piece of code into a prompt, especially code we want to double-check with the help of a super rubber duck before sending it for our colleagues’ peer review.

Conclusions

Rubber ducking was already a powerful technique any developer should know, in order to solve three of the main blockers they normally face in their daily jobs:

  • white page syndrome
  • writing the last straw
  • perfection bias

But with Generative AI in the mix, the rubber ducking process becomes interactive and provides the developer with fresh input that translates into substantial help.

This new tool also allows for automated code reviews that can improve the code before it is submitted for colleagues’ peer review, reducing the amount of review work they are called to perform and thus increasing the chance that they spot actual logical errors in the submission.
