The new era of AI software development
Software ate the world and now AI will eat software.
The AI revolution will bring profound changes, especially to how we interface with computers and develop new programs.
Example from the trenches
A customer needed to load data into their database faster. The data arrived as a single large CSV file, which could be split up into smaller chunks for faster parallel ingestion.
This is a trivial problem and several solutions are possible.
After some research, I decided to let ChatGPT have a go at it.
First attempt
The first attempt was made with the following prompt:
A data file is in CSV format and can be very large, multiple Gb, and doesn’t fit in memory. The data file must be split in smaller files of 250 Mb. Each file must have a header from the original file. The lines of the CSV must be preserved. The encoding is in UTF-8. The program is started from the command line. The parameters like file names and paths, out file size, and other are placed in the top of the script as constants. A command line argument is passed with the CSV file to read from and with a prefix for the output file. The memory footprint must be as low as possible and the program should run as fast as possible. Write this program in Python.
I had downloaded sample CSV files from Kaggle for testing. The data came in various encodings, CSV formats, and sizes.
I ran the generated program against the test data and the results were decent. There were some errors recognizing the file encoding; I gave ChatGPT the error message and it fixed the bug promptly, and even apologised! No developer ever apologises when you point out their bugs, so I was pleasantly surprised :-)
Each 250 MB chunk was split off in about 16 seconds.
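For illustration, here is a minimal sketch of a line-by-line splitter in this spirit. The names, constants, and structure are illustrative, not the generated program, and encoding detection is omitted:

```python
import sys

# Parameters at the top of the script, as the prompt required.
# The names and values here are illustrative, not the generated code.
CHUNK_SIZE = 250 * 1024 * 1024  # target output file size in bytes
ENCODING = "utf-8"

def split_csv(in_path: str, out_prefix: str) -> None:
    """Split in_path into ~250 MB files, repeating the header in each."""
    with open(in_path, "r", encoding=ENCODING) as f:
        header = f.readline()
        part, out, written = 0, None, 0
        for line in f:
            # Start a new output file when the current one is full
            if out is None or written >= CHUNK_SIZE:
                if out:
                    out.close()
                part += 1
                out = open(f"{out_prefix}_{part:03d}.csv", "w", encoding=ENCODING)
                out.write(header)
                written = 0
            out.write(line)
            written += len(line.encode(ENCODING))
        if out:
            out.close()

if __name__ == "__main__":
    split_csv(sys.argv[1], sys.argv[2])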
Second attempt, a performance improvement
The program needed to run faster, so I had the idea of adding a buffer and asked ChatGPT the following question:
Can this program run faster when an internal buffer is used so the amount of IO is reduced?
The program now processed each 250 MB chunk in about 6 seconds.
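One plausible shape for such an internal buffer is to batch the writes; a minimal sketch of the idea (a hypothetical helper, not the code ChatGPT produced):

```python
def copy_buffered(src, dst, batch_lines=100_000):
    """Collect lines in memory and write them out in large batches.

    Fewer write() calls means less I/O overhead. batch_lines is a
    hypothetical value; the right size depends on line length and RAM.
    """
    buf = []
    for line in src:
        buf.append(line)
        if len(buf) >= batch_lines:
            dst.writelines(buf)  # one big write instead of many small ones
            buf.clear()
    if buf:
        dst.writelines(buf)  # flush the remainder
```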
This was good, but we can do better.
Third attempt, let the AI do the work
I asked ChatGPT to give me ideas:
Can you come up with ideas to make this run faster?
As far as prompt engineering goes, this is far from the most sophisticated prompt you can engineer. The results, however, were more than satisfactory; it gave me ten good ideas:
Certainly! Here are some ideas to optimize the program and make it run faster:
1. **Parallel Processing**: Use Python’s `multiprocessing` module to process chunks of the CSV file in parallel. This can be especially beneficial if you have a multi-core CPU.
2. **Memory-Mapped Files**: Use the `mmap` module to map the CSV file into memory. This can allow for faster access and manipulation of the file, especially for very large files.
3. **Efficient File Splitting**: Instead of reading the file line by line, you can read large chunks of the file (e.g., 250 MB chunks) and then find the last newline character to split the file without breaking any lines.
4. **Avoid Flushing Too Often**: The `flush` method forces the system to write the buffered data to disk. If you’re calling it too often, it can slow down the program. Consider flushing less frequently or relying on the system’s automatic flushing.
5. **Use Faster CSV Parsers**: Consider using faster CSV parsing libraries like `pandas` (with its `read_csv` function) or `Dask`, which can handle large datasets efficiently. However, keep in mind that these libraries might increase memory usage.
6. **Optimize Disk I/O**:
   - **Use SSD**: If possible, run the program on a machine with a Solid State Drive (SSD) rather than a Hard Disk Drive (HDD). SSDs have much faster read/write speeds.
   - **Avoid Disk Fragmentation**: Ensure that the disk where you’re reading from and writing to is not heavily fragmented.
   - **Write to a Different Disk**: If possible, write the output files to a different physical disk than the one you’re reading from. This can reduce contention for disk I/O.
7. **Optimize Encoding Detection**: The `chardet` library can be slow, especially for large files. If you know the encoding in advance, you can skip the detection step. Alternatively, you can sample a smaller portion of the file for detection, but this might reduce accuracy.
8. **Use Built-in CSV Dialects**: If you know the specific format of your CSV (e.g., Excel, Unix), you can specify it using the `dialect` parameter in the `csv.reader` and `csv.writer` functions, which might speed up parsing and writing.
9. **Batch Writes**: Instead of writing each row immediately to the output file, consider collecting a batch of rows in memory and writing them all at once.
10. **Profile the Code**: Use Python’s built-in `cProfile` module or other profiling tools to identify specific bottlenecks in the code. This can give you insights into which parts of the code are consuming the most time.
Remember that the effectiveness of each optimization can vary based on the specific dataset and the hardware you’re using. It’s a good idea to test the performance after implementing each optimization to ensure it’s having the desired effect.
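Idea 10 is the cheapest one to act on before anything else. A quick way to profile a run looks something like this, where `split_csv` and the file names are placeholders for your own setup:

```python
import cProfile
import pstats

# Profile one run of the splitter and print the ten most expensive calls.
# split_csv and the file names are placeholders, not part of the article's code.
cProfile.run("split_csv('big.csv', 'parts/out')", "split.prof")
pstats.Stats("split.prof").sort_stats("cumulative").print_stats(10)
```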
Not all of them were equally promising, so I selected options 3, 4, and 5:
3. Efficient File Splitting
4. Avoid Flushing Too Often
5. Use Faster CSV Parsers
I started with option 4 and asked ChatGPT to improve the previous implementation.
Reducing the flushing didn’t improve the run time in my tests, so I discarded this variant.
Next, I asked ChatGPT to use the suggested, supposedly faster CSV parser (option 5); the run time went up from 6 to 11 seconds, so this variant was discarded as well.
I then asked it to implement option 3, efficient file splitting, and the run time went down from 6 seconds to 0.2 seconds!
Final version
The last program variant was clearly the most promising, and I made several small improvements so that it could be delivered to the customer as a command line program.
The program involved some file pointer and I/O arithmetic:

```python
# Find the last newline character and split the chunk there
last_newline = chunk.rfind('\n')
out_file.write(chunk[:last_newline])

# Move the file pointer back to the position after the last newline
f.seek(f.tell() - len(chunk) + last_newline + 1)
```
The idea is simple and straightforward, and it had crossed my mind, but I’m not very familiar with Python I/O and it’s easy to mess up this kind of arithmetic.
It would have taken me several hours of research, programming and testing to get this working correctly, but ChatGPT produced this in seconds.
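For context, here is a sketch of how that fragment fits into the complete splitting loop. It is a reconstruction, not the delivered program: the file is opened in binary mode here, where seek()/tell() arithmetic on byte offsets is well defined, and all names are illustrative:

```python
import sys

CHUNK_SIZE = 250 * 1024 * 1024  # read the input in ~250 MB blocks

def split_csv_fast(in_path: str, out_prefix: str) -> None:
    """Split in_path into chunk-sized files without breaking lines."""
    with open(in_path, "rb") as f:
        header = f.readline()  # keep the header for every output file
        part = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # Cut at the last complete line in this chunk
            last_newline = chunk.rfind(b"\n")
            if last_newline == -1:
                last_newline = len(chunk) - 1  # tail without a newline
            part += 1
            with open(f"{out_prefix}_{part:03d}.csv", "wb") as out_file:
                out_file.write(header)
                out_file.write(chunk[:last_newline + 1])
            # Move back so the next read starts right after that newline
            f.seek(f.tell() - len(chunk) + last_newline + 1)

if __name__ == "__main__":
    split_csv_fast(sys.argv[1], sys.argv[2])
```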
I added a file integrity check that counts the number of input and output lines and compares them. After a few initial mistakes, like double-counting headers in the output files, ChatGPT produced a working integrity check. Now the program not only splits super fast, but can also tell you that the output files contain the same number of lines as the input file.
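The core of such a check can be quite small. A sketch with hypothetical names; the header bookkeeping is the part that is easy to get wrong:

```python
def count_lines(path: str) -> int:
    """Count newline-terminated lines by streaming the file in blocks."""
    count = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            count += block.count(b"\n")
    return count

def check_integrity(in_path: str, out_paths: list) -> bool:
    """Compare data lines in the input with data lines across all outputs.

    Every output file repeats the header, so one header line is
    subtracted per file on both sides before comparing.
    """
    data_in = count_lines(in_path) - 1
    data_out = sum(count_lines(p) - 1 for p in out_paths)
    return data_in == data_out
```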
The new programming approach: selecting variants
This is a trivial example, and ChatGPT exceeded my expectations. The ideas were mine, well, most of them. Where ChatGPT really outperformed my expectations was in variant creation and selection.
Step 1, test driven development and problem description
Describe the problem as precisely as possible. Get test data and prepare a test bed on your computer.
Step 2, ask for variants
Let ChatGPT come up with ideas. State the objective and select promising variants.
Step 3, verify results
Run each program variant against your test bed and compare results. Select promising variants.
Step 4, select, test, assert, and improve
Select variants and improve them. Add assertions to verify results. Test the results against your predefined tests. Repeat steps 2 to 4 until your objectives are met.
Step 5, select the final version and tidy up
Select the final version and tidy up the code, logic, user interaction, documentation, testing, and all those things, so that you have a production-quality program. ChatGPT can help you with every step here as well.
The new era of programming: variant selection and optimisation
Software development will change fundamentally.
- You will state the objectives in a requirements document.
- The AI will produce variants.
- You provide tests, test samples, and data. In the future, the AI might help you with this as well.
- The AI will produce program variants and select the most promising by itself.
- The final result will be verified by you and made ready for delivery.
AIs will generate and select program variants according to predefined objectives, create the optimal program for you, and give you the test results to verify that it met its objectives.
This will fundamentally change how we develop software.
You can follow this process today, but one day in the near future the machines will do all of this fully automatically.