I recently had a “Python is slow” problem which I finally managed to track down and solve. The application parsed some C line coverage data and made it accessible in various ways. The data set was large: 200k tests that might hit any or all of 3M lines of code. It would consume 100GB+ while running for hours to consolidate the data.
The application seemed to get slower and slower as it progressed, with runtime proportional to the size of the in-memory data. (Some of you may already be able to guess what the problem was.) I was using Python 2 at the time (this was a few years ago) and the runtime was really too long in some cases: days. I decided to move to Python 3 (an interesting exercise on its own); it had the same problem, but the runtime was better and at least acceptable.
Fast forward two years: I needed to update the script to handle a new data input format. By some fluke I read up on Python garbage collection. There are two kinds of "garbage collection" in Python: reference counting and generational garbage collection (GGC). You can't turn the reference-counting GC off, but generational garbage collection is optional. GGC can recover deleted objects even when they reference themselves, which is not possible with a naive reference-counting algorithm. It's only useful when you have data with reference cycles that are no longer reachable from any object outside the cycle. It's on by default because you may not realize you have cycles, but its runtime is proportional to the amount of memory in use. That was exactly what I was seeing: the script slowed down as memory grew.
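To make the distinction concrete, here is a minimal sketch of the kind of cycle that reference counting alone can never free. The Node class is purely illustrative; the point is that after del, the object's refcount is still nonzero (it references itself), so only a manual or automatic GGC pass can reclaim it:

```python
import gc

class Node:
    # Each Node references itself, so its refcount never reaches zero,
    # even after every outside name pointing at it is deleted.
    def __init__(self):
        self.ref = self

gc.disable()               # stop automatic generational collection
node = Node()
del node                   # refcounting alone cannot free this cycle

# A manual pass still works while automatic collection is disabled;
# gc.collect() returns the number of unreachable objects it found.
unreachable = gc.collect()
gc.enable()
```

If your data has no such cycles, every object dies promptly via refcounting and the generational passes are pure overhead, which is what makes disabling them safe in that case.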
So, in the end I turned off GGC (import gc; gc.disable()) and my run went from half done after five hours (I killed it) to completely done in two hours. And no noticeable increase in memory use. Amazing!
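If you only want to skip the generational passes during a known-hot phase rather than for the whole program, you can wrap just that section. This is a sketch; consolidate and its data are hypothetical stand-ins for whatever bulk-build step dominates your runtime:

```python
import gc

def consolidate(records):
    # Hypothetical stand-in for a memory-heavy build step that
    # allocates many long-lived, cycle-free objects.
    merged = {}
    for key, value in records:
        merged.setdefault(key, []).append(value)
    return merged

gc.disable()   # skip generational passes during the bulk build
try:
    result = consolidate([("a", 1), ("a", 2), ("b", 3)])
finally:
    gc.enable()   # restore normal collection afterwards
```

On Python 3.7+ there is also gc.freeze(), which moves everything currently alive into a "permanent generation" the collector ignores; that can help when you must keep GGC enabled but have a large, stable base of objects.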