Why can’t I fill this queue?

A brief foray into the hidden complexity of cross-platform programming that rarely surfaces in scripting languages

Evan Paul
Engineering @ BuildZoom
Sep 16, 2020

--

While refactoring one of our data pipelines, I ran into code hanging in the following block of our multiprocessing code. This block is responsible for distributing work to be processed (usually represented as file names) to subprocesses via a multiprocessing.Queue.
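The real code is specific to our pipeline, but it boils down to something like the sketch below (worker, paths_to_process, and the file names are illustrative stand-ins, not our actual code):

    from multiprocessing import Process, Queue

    def worker(work_queue):
        # Stand-in for our real worker: pull file names until a None sentinel.
        for path in iter(work_queue.get, None):
            pass  # the real code processes the file here

    if __name__ == "__main__":
        paths_to_process = [f"file_{i}.csv" for i in range(100_000)]  # illustrative

        work_queue = Queue()  # no maxsize given, so "infinite"... right?

        # Fill the queue up front, then hand it off to the subprocesses.
        for path in paths_to_process:
            work_queue.put(path)  # <-- this is where things hung

        workers = [Process(target=worker, args=(work_queue,)) for _ in range(4)]
        for w in workers:
            w.start()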

That’s strange; I was under the impression that there wasn’t an upper limit on queue size (or at least not one of an order I should ever encounter). Taking a look through Python’s multiprocessing Queue docs:

maxsize is an integer that sets the upper bound limit on the number of items that can be placed in the queue. Insertion will block once this size has been reached, until queue items are consumed. If maxsize is less than or equal to zero, the queue size is infinite.

put(obj[, block[, timeout]]) Put obj into the queue… If timeout is a positive number, it blocks at most timeout seconds and raises the queue.Full exception if no free slot was available within that time

An unspecified queue size is supposed to give us an ‘infinite’ queue, but the hang implies we’ve somehow reached infinity. So… what’s infinity, then?

To infinity, and hopefully beyond

Obviously infinity is a useful mathematical concept, not a concrete number. There is a finite amount of storage on any computer and thus limits need to be set to avoid errors. This raises a few questions:

1) What is the implicit/undocumented queue size limit?

2) What’s determining this limit?

Let’s try to initialize some Queues with different maxsize parameters:
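The experiment boils down to something like this, run on my MacBook:

    from multiprocessing import Queue

    for maxsize in (10_000, 2**15 - 1, 2**15):
        try:
            Queue(maxsize=maxsize)
            print(f"maxsize={maxsize}: ok")
        except OSError as exc:
            print(f"maxsize={maxsize}: failed with {exc!r}")

    # Everything up to 32767 (2**15 - 1) constructs fine on macOS;
    # 32768 fails with an OSError.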

Interestingly, it appears the maximum value representable by a 16-bit signed integer (n-bit signed integers span [-2^(n - 1), 2^(n - 1) - 1]) is what's limiting our queue size. Checking on one of our workhorses confirmed my suspicion that this was platform-specific. So now we've answered the first question: 32,767 items is our maximum queue size on macOS. But why?

As an aside, a new set of related questions arose here that I never fully answered:

Why is a maximum queue size being represented as a signed integer? What would a negative value mean?

The good news is that macOS is the development platform and Ubuntu is the target platform, so there isn’t much of an issue here (since we won’t enqueue anywhere near Ubuntu’s limit of 2³¹-1 items).
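A check along these lines on one of our Ubuntu instances bears that out:

    from multiprocessing import Queue

    Queue(maxsize=2**31 - 1)  # fine on Ubuntu
    Queue(maxsize=2**15)      # also fine there, unlike on macOS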

At this point I realized that the reason I hadn’t encountered this before was that I had never tested putting 32,768 items into the queue on macOS. Any time I wanted to test at that scale, I switched over to one of our more powerful Ubuntu instances. I decided to ignore this bug and keep working, but eventually curiosity came calling.

Down the rabbit hole

Let’s start with the traceback of the OSError we get when we try to set a maxsize greater than or equal to 2¹⁵ on macOS:
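The offending call is trivial; where its traceback bottoms out is the interesting part (noted in the comment):

    from multiprocessing import Queue

    Queue(maxsize=2**15)
    # OSError: the last frames of the traceback are in
    # multiprocessing/synchronize.py, while creating the queue's internal
    # BoundedSemaphore via the C-level _multiprocessing.SemLock.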

After some brief reading through the documentation, I expected the issue would probably center around pipes or semaphores, since those are the main primitives that power Python’s multiprocessing Queues. This traceback points to semaphores being the issue. I poked around some of the files to get a sense of what was going on, then dove into synchronize.py, as I assumed that was where the answer would be:
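The lines that matter (trimmed, and quoted roughly from the CPython source in Lib/multiprocessing/queues.py and Lib/multiprocessing/synchronize.py) are:

    # Lib/multiprocessing/queues.py (trimmed)
    class Queue(object):
        def __init__(self, maxsize=0, *, ctx):
            if maxsize <= 0:
                # Can raise ImportError (see issues #3770 and #23400)
                from .synchronize import SEM_VALUE_MAX as maxsize
            ...
            self._sem = ctx.BoundedSemaphore(maxsize)

    # Lib/multiprocessing/synchronize.py (trimmed)
    SEM_VALUE_MAX = _multiprocessing.SemLock.SEM_VALUE_MAX

So an ‘infinite’ queue isn’t infinite at all: passing no maxsize (or zero) just swaps in SEM_VALUE_MAX, and the queue’s capacity is enforced by a bounded semaphore.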

Okay, so we’ve found that a semaphore’s maximum value is the likely culprit, and we’ve sort of answered the “what’s determining this limit” question. But why is this the value? What’s going on here?

As an aside, all I really remember about counting semaphores from the OS class I took ~5 years ago is that they are a way of tracking available resources, so a process/thread knows when to block and wait for a resource (in this case, the resource being managed is free space in the queue). I think that’s enough context to see why semaphores factor in here and why they would even have a maximum value associated with them, but I could use a refresher lesson in this domain regardless.
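As a toy refresher (mine, not anything from our pipeline), a counting semaphore modelling three free slots looks like this:

    import threading

    # Three "free slots", modeled by a counting semaphore.
    free_slots = threading.BoundedSemaphore(3)

    for _ in range(3):
        free_slots.acquire()                   # a producer claims a slot

    print(free_slots.acquire(blocking=False))  # False: all slots taken, so a
                                               # real producer would block here
    free_slots.release()                       # a consumer frees a slot
    print(free_slots.acquire(blocking=False))  # True again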

In Python, a leading underscore on a name conventionally signals that it’s private. The same convention applies at the module level, and in CPython’s case a leading underscore typically marks a C/C++ extension module.
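The _multiprocessing module is exactly that kind of extension. Both where it lives on disk and the constant that synchronize.py pulls from it are easy to inspect (the path and values below are from my machine and will vary):

    import _multiprocessing

    print(_multiprocessing.__file__)
    # e.g. .../lib-dynload/_multiprocessing.cpython-38-darwin.so

    print(_multiprocessing.SemLock.SEM_VALUE_MAX)
    # 32767 on my Mac; 2147483647 (2**31 - 1) on our Ubuntu instances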

For easier navigation, I decided to download the CPython source and see what I could find.

Modules/_multiprocessing/multiprocessing.c

This partially answers why a queue size can be negative, but I’m still unsure why it’s treated as a signed integer here rather than an unsigned one.

Eureka, maybe?

Modules/_multiprocessing/multiprocessing.h

There aren’t any specific references to macOS / OS X / Darwin as far as I can tell, but the header’s fallback definitions for SEM_VALUE_MAX, one of which is _POSIX_SEM_VALUE_MAX, seem to be the likely suspect.

Eventually I searched for _POSIX_SEM_VALUE_MAX and found its definition in the POSIX 2004 standard’s limits.h:

_POSIX_SEM_VALUE_MAX: The maximum value a semaphore may have. Value: 32767 (source: https://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html)
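That same limit can also be queried at runtime; on a POSIX-conforming system, sysconf should report the ceiling the semaphore implementation enforces (the guard is there because the name isn’t exposed on every platform):

    import os
    import _multiprocessing

    print(_multiprocessing.SemLock.SEM_VALUE_MAX)   # 32767 on my Mac

    if "SC_SEM_VALUE_MAX" in os.sysconf_names:
        print(os.sysconf("SC_SEM_VALUE_MAX"))       # should agree with the above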

Another, more carefully worded Google search yielded an even stronger piece of evidence: an open-source Apple XNU implementation of semaphore.h has SEM_VALUE_MAX defined in conformance with the POSIX standard:

https://opensource.apple.com/source/xnu/xnu-201/bsd/sys/semaphore.h.auto.html

This makes sense, as macOS has been POSIX-compliant for some time now. More exhaustive testing could be done to see whether this is actually what’s setting the limit, but I’m convinced, and my curiosity at this point was satiated. There are still some unanswered questions, but I think I found enough to explain what’s going on.

Addendum

Our usage of the Queue may be slightly off. We’re running into this issue because we aren’t consuming from the queue while we’re producing into it, which makes sense given that the contents of the queue are known from the start and we currently don’t add to it once the subprocesses have spawned (though that could change). Using a multiprocessing.Pool, a more abstract multiprocessing class, might sidestep this issue and simplify our implementation. But for now, the Queue works just fine on Ubuntu, and I can rest easy knowing this problem only affects large samples on macOS.
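A rough sketch of what that could look like (process_file and the file names here are hypothetical stand-ins for our real per-file work):

    from multiprocessing import Pool

    def process_file(path):
        # Stand-in for the real per-file processing.
        return path

    if __name__ == "__main__":
        paths_to_process = [f"file_{i}.csv" for i in range(100_000)]  # illustrative

        # Pool manages the work distribution itself, so there's no queue for us
        # to fill (or overfill) by hand.
        with Pool(processes=8) as pool:
            for _ in pool.imap_unordered(process_file, paths_to_process, chunksize=64):
                pass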
