After years of using Python, I formed a more critical opinion of its ‘multiprocessing’ module when I ran into a particularly funny, yet painful, case: Python told me a process was alive even though it had already died. Let me tell you about this particular “adventure”:
Along the way I also found an explanation of how this works at the operating-system level. Based on what I learned, I added the following chunk of code to ‘pebble’:
The code looked correct, so I was dismayed to find my tests failing, but I started to laugh once I found the root cause. After I added this patch to ‘pebble’, Python began to think that the worker process was still alive even though it was already dead.
Let’s look at a simple example:
$ python script.py
is process alive? False
is process alive? True
WHAT? How is this possible?
Okay, okay, let me explain what the problem was: this was happening because of unreliable code in
Pay attention to line #166 here. On this line you can see that if there’s no returncode, the process is considered to be still running and alive; otherwise it is considered finished. Popen.poll returns the returncode if it was captured after the process finished. It returns None if the process is still running or hasn’t been launched yet, because the PID can’t be found (look at lines #28–31). Popen.poll also returns None if the process has already finished but Popen.poll didn’t capture the status, because os.waitpid had already been called elsewhere and had reaped the PID! This is what happened in my code example above. Now let’s debug my code line by line to find a solution:
When the process finishes, Python calls the signal handler:
def handle_chld(signum, frame):
The handler waits for the PID to finish and then reaps it, which erases the operating system’s record of it.
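A handler of this kind usually needs only a few lines. The sketch below is a minimal version, assuming the handler reaps every exited child with a non-blocking os.waitpid(-1, ...) loop. The reaped dictionary and the install_handler helper are hypothetical names used for illustration; this is not the actual ‘pebble’ code.

```python
import os
import signal

# Hypothetical bookkeeping: pid -> raw wait status of every reaped child.
reaped = {}


def handle_chld(signum, frame):
    """Reap all children that have already exited, without blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # this process has no children left to wait for
        if pid == 0:
            return  # children exist, but none of them has exited yet
        reaped[pid] = status  # the OS forgets the child after this call


def install_handler():
    """Register the handler; signal.signal must run in the main thread."""
    signal.signal(signal.SIGCHLD, handle_chld)
```

Once such a handler has run, any later os.waitpid call for the same PID fails, because the operating system no longer has a record of that child.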
Next I call:
returncode = self._popen.poll()
And this calls:
pid, sts = os.waitpid(self.pid, flag)
The process has already finished! The PID no longer exists and there’s no information about it. This is why proc.is_alive keeps telling me that the process is alive. Damn!
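The whole sequence can be modelled in a few lines. The sketch below is a simplified stand-in for the poll logic described above, not the real multiprocessing class; MiniPopen, is_alive, spawn_exiting_child, and demo are hypothetical names, and the code assumes a POSIX system where os.fork is available.

```python
import os


class MiniPopen:
    """Simplified, hypothetical model of the poll() logic described above."""

    def __init__(self, pid):
        self.pid = pid
        self.returncode = None  # acts as a cache for the exit status

    def poll(self, flag=os.WNOHANG):
        if self.returncode is None:
            try:
                pid, sts = os.waitpid(self.pid, flag)
            except OSError:
                # ECHILD: the OS has no record of this child (never
                # started, or already reaped by someone else).
                return None
            if pid == self.pid:
                if os.WIFSIGNALED(sts):
                    self.returncode = -os.WTERMSIG(sts)
                else:
                    self.returncode = os.WEXITSTATUS(sts)
        return self.returncode


def is_alive(popen):
    """Mirror the check from the text: no returncode means 'alive'."""
    return popen.poll() is None


def spawn_exiting_child(code=0):
    """Fork a child that exits immediately and return its PID."""
    pid = os.fork()
    if pid == 0:
        os._exit(code)
    return pid


def demo():
    popen = MiniPopen(spawn_exiting_child())
    # Something else (for example a SIGCHLD handler) reaps the child first.
    os.waitpid(popen.pid, 0)
    # poll() now gets ECHILD, returns None, and the dead child looks alive.
    return is_alive(popen)
```

Calling demo() returns True even though the child has exited, which is the same contradiction as in the walk-through above.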
This entire situation is why I sometimes think of Python as sly and mischievous. It is unaware that os.waitpid may be called several times with the same PID, which leads to broken application behaviour, all because the next os.waitpid call doesn’t know that the requested process has already finished.
Once a process has finished, the operating system keeps its PID and exit status until the parent reaps it (holding the process as a zombie in the meantime). After the reaping, this data is erased and the zombie disappears.
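This life cycle is easy to observe directly. The following sketch assumes a POSIX system; reap_twice is a hypothetical helper name introduced only for this illustration.

```python
import os


def reap_twice(exit_code=7):
    """Fork a child, reap it, then try to reap it a second time."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit at once and become a zombie

    # First call: blocks until the child exits, then collects the status
    # the OS was holding for the zombie and removes the zombie.
    reaped_pid, status = os.waitpid(pid, 0)
    first = (reaped_pid == pid, os.WEXITSTATUS(status))

    # Second call: the record is gone, so the OS reports "no such child".
    try:
        os.waitpid(pid, os.WNOHANG)
        second = "still known"
    except ChildProcessError:
        second = "unknown to the OS"
    return first, second
```

The first call succeeds and yields the exit code, while the second raises ChildProcessError, so the status of a finished child can be collected exactly once.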
But wait… if os.waitpid is called in proc.is_alive(), why does one need to register a handler at all and create the os.waitpid double-call problem from scratch? Maybe this seems reasonable to some, but for me it’s more a question of philosophy and code quality. I shouldn’t need implicit knowledge of which additional method to call in order to reap a process correctly; if Python can take care of this instead of me, that’s fine by me.
To help self._popen.poll() return trustworthy results for proc.is_alive(), I decided to cache information about every process that finishes while the application is running. For this I just need to monkey-patch os.waitpid a bit:
The idea itself is simple: put all captured data into a dictionary. But the devil is in the details. Perhaps you are wondering why I need to call the original function for a PID that is already cached. It’s because operating systems can recycle PIDs, so the patch first asks the operating system about the PID and only falls back to the cache if the operating system doesn’t know it. I only cache positive PIDs because they correspond to single processes (rather than to any child or to a process group).
After this, multiple calls of os.waitpid will return the same result for the same PID. Of course, this change in logic can be dangerous for code compatibility, so use it at your own risk, but I think this logic is more correct. Let me show you a small example of what I mean:
$ python script.py
Exception: BOOM!
Is process alive? False
Is process alive? True
As you can see, after resetting the object’s returncode property (which is also a kind of cache), the second call to is_alive() shows an incorrect result. I don’t believe this is normal: it shouldn’t matter how many times I ask os.waitpid (once or even ten times), the result should be consistent, at least at the application level, and patching os.waitpid solves this.
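The unpatched behaviour can be reproduced with a short script. The version below is a sketch, not necessarily the script that produced the output above: it assumes the fork start method and touches the private _popen attribute purely for illustration, and the names boom and demo are hypothetical.

```python
import multiprocessing


def boom():
    raise Exception("BOOM!")


def demo():
    # Assumption: the fork start method, whose poll() relies on os.waitpid.
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=boom)
    proc.start()
    proc.join()  # reaps the child and caches its exit code
    first = proc.is_alive()
    # Drop the cached exit code (private attribute, for illustration only).
    proc._popen.returncode = None
    second = proc.is_alive()  # os.waitpid now fails with ECHILD
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print("Is process alive?", first)
    print("Is process alive?", second)
```

Without the os.waitpid patch, the second check reports the finished process as alive, because the only remaining record of its exit was the returncode that has just been cleared.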
After this patch was applied, zombie processes were successfully defeated in
Many kudos to David Lorbiecke for reviewing the text.