Encountering Some Python Trickery

Sergei · Published in Pipedrive R&D Blog · Apr 29, 2020 · 4 min read

[Header image from Readwrite.com]

After using Python for years, I began to form a more critical opinion of its ‘multiprocessing’ module when I ran into a particularly humorous, yet painful case: Python kept telling me a process was alive even though that process had already died. Let me tell you about this little “adventure”:

One day, while looking around, I found a code example in the gunicorn web server showing how to prevent zombie processes, so I figured I could add the same approach to my fork of the pebble library:
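The pattern itself is simple (what follows is a simplified sketch of the idea rather than gunicorn’s exact code): reap every finished child with a non-blocking os.waitpid loop.

import errno
import os

def reap_workers():
    # Collect the exit status of every child that has already finished,
    # so that none of them is left hanging around as a zombie.
    try:
        while True:
            pid, status = os.waitpid(-1, os.WNOHANG)  # -1 = any child, don't block
            if not pid:
                break  # children exist, but none of them has finished yet
    except OSError as e:
        if e.errno != errno.ECHILD:  # ECHILD = there are no children at all
            raise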

While doing this, I also found an explanation of how this works at the operating system level. Based on what I learned, I added the following chunk of code to ‘pebble’:
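In essence it boiled down to registering a SIGCHLD handler that immediately reaps a finished worker (this is a simplified sketch of what I added; the handler itself shows up again in the walkthrough below):

import os
import signal

def handle_chld(signum, frame):
    # SIGCHLD means a child has finished: reap it right away so it
    # does not stay behind as a zombie.
    os.waitpid(-1, os.WNOHANG)

signal.signal(signal.SIGCHLD, handle_chld)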

Despite the code appearing to be correct, I was dismayed to find my tests were failing, but I eventually started to laugh when I found out the root cause of the failure. The truth is, after adding this patch to ‘pebble’, Python began to think that the worker process was still alive, despite it already being deceased.

Let’s take a look at a primitive piece of code as an example:
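Something along these lines (a minimal sketch: the worker does nothing and exits right away, the handle_chld handler from above reaps it, and the one-second sleep just gives that time to happen):

import multiprocessing
import os
import signal
import time

def handle_chld(signum, frame):
    os.waitpid(-1, os.WNOHANG)  # reap whichever child has just finished

def worker():
    pass  # do nothing and exit immediately

if __name__ == '__main__':
    signal.signal(signal.SIGCHLD, handle_chld)

    proc = multiprocessing.Process(target=worker)
    print('is process alive?', proc.is_alive())  # not started yet

    proc.start()
    time.sleep(1)  # the worker exits and the handler reaps it
    print('is process alive?', proc.is_alive())  # the worker is long dead...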

$ python script.py
is process alive? False
is process alive? True

WHAT? How is this possible?

Okay okay, so let me explain what the problem was: it was caused by unreliable code in multiprocessing.Process.

Pay attention to Process.is_alive and the Popen.poll method it relies on. is_alive asks self._popen.poll() for a returncode: if there is none, the process is assumed to still be working and is reported as alive; otherwise, poll returns the returncode it captured after the process finished. poll itself returns nothing while the process is still running, and also when the process hasn’t been launched yet, since its PID can’t be found.
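The relevant logic, slightly simplified and paraphrased from the CPython sources, looks roughly like this:

# multiprocessing/process.py (paraphrased)
def is_alive(self):
    if self._popen is None:
        return False
    returncode = self._popen.poll()
    if returncode is None:
        return True   # no captured returncode -> "still alive"
    return False

# multiprocessing/popen_fork.py (paraphrased)
def poll(self, flag=os.WNOHANG):
    if self.returncode is None:
        try:
            pid, sts = os.waitpid(self.pid, flag)
        except OSError:
            # The PID can't be found (the child is not created yet,
            # or someone has already reaped it).
            return None
        if pid == self.pid:
            if os.WIFSIGNALED(sts):
                self.returncode = -os.WTERMSIG(sts)
            else:
                self.returncode = os.WEXITSTATUS(sts)
    return self.returncode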

Popen.poll will also return None if the process has already finished but poll never captured its status, because os.waitpid had already been called and grabbed the PID! This is exactly what happened in my code example above. Now let’s debug my code line by line to find a solution:

When the process finishes, Python calls the signal handler:

def handle_chld(signum, frame):
    os.waitpid(-1, os.WNOHANG)

The handler collects the finished child’s PID and exit status, and the operating system then erases that information.

Next I call:

proc.is_alive()

Which calls:

returncode = self._popen.poll()

And this calls:

pid, sts = os.waitpid(self.pid, flag)

The process is finished already! The PID doesn’t exist and there’s no info about it. This is why os.waitpid raises OSError, self._popen.poll returns None and proc.is_alive keeps telling me that the process is alive — damn!

This entire situation is why I sometimes think of Python as sly and mischievous: it is unaware that calling os.waitpid more than once for the same PID leads to broken application behaviour, because the second os.waitpid call has no way of knowing that the requested process has already finished.

Once a process has finished, the operating system keeps its PID and exit status around, waiting for someone to reap it (holding the process as a zombie in the meantime). After the first successful waitpid call, that data is erased and the zombie disappears.

But wait… if os.waitpid is already called inside proc.is_alive(), why register a handler at all and create the double os.waitpid call problem in the first place? Maybe that seems reasonable to some, but for me it’s more a question of philosophy and code quality. I shouldn’t need implicit knowledge of which additional method has to be called in order to reap a process correctly; if Python can take care of this instead of me, that’s fine by me.

In order to help self._popen.poll() return trustworthy results for proc.is_alive(), I decided to cache information about every process that finished while the application was running. For this I just needed to monkey-patch os.waitpid a bit:
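Sketched out, the patch looks roughly like this (the helper names here are illustrative, not the exact code from my fork):

import os

_waitpid_cache = {}            # pid -> (pid, exit status) captured earlier
_original_waitpid = os.waitpid

def _cached_waitpid(pid, options):
    try:
        result = _original_waitpid(pid, options)
    except ChildProcessError:
        # The child has already been reaped; if its status was captured
        # earlier, replay it instead of failing.
        if pid > 0 and pid in _waitpid_cache:
            return _waitpid_cache[pid]
        raise
    reaped_pid, status = result
    if reaped_pid > 0:
        # Cache only results for a single concrete process; with WNOHANG
        # a pid of 0 just means "nothing has finished yet".
        _waitpid_cache[reaped_pid] = result
    return result

os.waitpid = _cached_waitpid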

The idea itself is simple: put all captured data into a dictionary. But the Devil is in the details. Perhaps you are wondering why the original function still has to be called for a PID that is already cached. It’s because operating systems can recycle PIDs, so the patch first checks whether the PID belongs to a running child and only then falls back to what was captured earlier. I cache only positive PIDs, because they correspond to single processes rather than to “any child” or process groups.

After this, multiple calls to os.waitpid return the same result for the same PID. Of course, this change of behaviour can be dangerous for code compatibility, so use it at your own risk, but I think this logic is more correct. Let me show you a small example of what I mean:
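Something like this (a sketch; it deliberately pokes at the private _popen.returncode attribute to drop the captured exit code):

import multiprocessing

def worker():
    raise Exception('BOOM!')

if __name__ == '__main__':
    proc = multiprocessing.Process(target=worker)
    proc.start()
    proc.join()                    # reaps the worker and stores its returncode

    print('Is process alive?', proc.is_alive())

    proc._popen.returncode = None  # forget the captured exit code
    print('Is process alive?', proc.is_alive())  # True on stock Python, False with the patch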

and launch it:

$ python script.py
Exception: BOOM!
Is process alive? False
Is process alive? True

As you can see, after resetting the object property returncode (which is also a kind of cache), is_alive() starts to show an incorrect result on the second call. I don’t believe this is normal, since it shouldn’t matter how many times I ask os.waitpid (once or even ten times): the result should be consistent, at least at the application level, and the os.waitpid patching solves this.

After this patch was applied, zombie processes were successfully defeated in pebble.ProcessPool 😎

Many kudos to David Lorbiecke for reviewing the text.

Sergei
Pipedrive R&D Blog

Software Engineer. Senior Backend Developer at Pipedrive. PhD in Engineering. My interests are IT, High-Tech, coding, debugging, sport, active lifestyle.