I’ve decided to try something new: Rubber Duck Stories.
The idea is to invest time into doing a write-up of a problem I’m facing before I’ve solved it. Why not put it on Stack Overflow? Because before posting there, I believe one should have put in the work to pinpoint the problem to a small piece of easily digestable demo-code that demonstrates the actual problem, not the rough space where the problem is expected to be.
That’s where Rubber Duck Stories come in, they are the stories of the road to the discovery behind the problem, the internal monologue of the developer unraveling the layers of the application, following, searching and prodding the issue until the actual problem reveals itself.
So here we go:
I have a weird issue with Guzzle / Curl, possibly introduced through side-effects with the aws-sdk / s3 stream wrapper or php-fpm:
We’re currently migrating away from a very old hosting setup that still runs on mod_php towards an aws ec2 installation with php-fpm running.
So far everything was running smooth, we’ve had to do some adaption regarding the filesystem, in the course of which I’ve introduced the aws-php-sdk and the s3 stream wrapper so we can use our existing fopen/is_file calls with s3 as the storage backend directly, as going via an dedicated AWS Storage Gateway introduced even more problems.
Having done all the necessary changes and testing all functionality so far is throwing no more errors, except one:
We run into a 504 Gateway Timeout when a user switches between different accounts to which he as access to (it’s a multi account feature we support).
Here’s how it looks when it works:
Network wise what happens is this:
Our VueJS SPA sends a request to the /switch route of our Account Management Microservice, requesting a new session cookie for our PHP Application and our SPA. This is returned successfully so the front-end triggers a location change to the PHP Application which, after taking a second to ponder the deeper philosophical questions that lurk below the surface of an account switch, returns a redirect to the Dashboard of our application for the correct user/account combination.
Here’s a recording of the broken scenario in the new AWS environment with php-fpm and the aws-php-sdk in use.
Network wise we see what the error message is telling us:
Digging into it this is triggered when our php backend uses guzzle to send an event to our internal user-tracking microservice signalling that the user is visiting the dashboard, at least that is what I thought up until now.
Making these recordings for this RDS(* Rubber Duck Story) just proved valuable since I realize now that the initial call already fails, not the call to the actual dashboard.
Still commenting out the whole event-send code, which is the same place for all page-visits, stops this error from occurring.
More specifically the fatal error shows the code being stuck at this line at the call of curl_multi_exec(): https://github.com/guzzle/guzzle/blob/master/src/Handler/CurlMultiHandler.php#L114
On the app server top shows php-fpm running at 100% cpu when in this state until the timeout is reached:
There are multiple things that are strange with this:
1. The same event tracking code works perfectly when running on our old environment (mod_php, without the aws-sdk loaded), allowing the account-switch to occur without issue.
2. The Guzzle Client is initialized with a 1 second timeout, why would it keep running for longer than 1 second, going for the full 180 seconds?
3. The issue disappears if we simply don’t call ->wait() on the promise that is returned by requestAsync().
4. The very same event tracking code works flawlessly in all other cases, just the account switch triggers this behavior, events arrive perfectly fine at the micro-service and get logged. There is not really anything special(*) about that account switch, the event tracking isn’t session aware, its just taking simple ids/strings not resolving anything or doing any other wizardry. (*during writing this I discovered that the first call that is not a call to a specific page but is expected to return a redirect is the offending one, so that IS special!)
5. From my understanding of the guzzle code here, this behavior shouldn’t happen in any case, but I’m probably wrong here.
6. In all the cachegrind files written when profiling this action, nothing shows up that looks anything like something is stuck, so either the offending request doesn’t write the file at all or xdebug is “overlooking” this for some mysterious reason.
7. Commenting out the initialization and even the use statement of the aws-php-sdk doesn’t prevent the issue from occurring.
8. The only disabled functions on the new environments are the pcntl functions, so it is (probably) not related to this issue: https://github.com/guzzle/guzzle/issues/2114
9. This error did only occur after adding the aws-php-sdk and disappears if I deploy older versions of the code. The biggest change alongside this is additional error handling for the promise that requestAsync() returns, but commenting that error-handling code out and returning it to the status it had before the change still throws that error, additionally this version of the code is already running in production flawlessly in the non-aws environment.
When talking to my great friend Stefan Adolf about this issue, he pointed out that our call to $promise->wait(); is happening in the destructor of the event tracking sdk we’re using.
The idea behind this was that since it’s just user tracking and nothing mission critical for the customer, we’re collecting the return values of those events at the last possible time to report any errors to Sentry.
This is a bad anti-pattern as it is very hard to reliably predict when and in what environment the destructor is going to get called. Doing anything besides freeing memory, deleting objects or closing connections should be avoided and is prone to create issues.
Even throwing errors can get unexpected behavior, as in this case shown by Swiftmailer: https://github.com/swiftmailer/swiftmailer/issues/819
So what is happening in our case?
Remember that I found out that it is the call to / that should return a 301 to our Dashboard that is failing? That proved crucial to explain why this bug was happening only in the account-switch case.
When we construct a page like the dashboard, a lot of calls to the database get made, a lot of objects of all shapes and sizes get instantiated, most of which don’t get explicitly freed before the request ends and PHP starts to deconstruct them. When the destructor of our event tracking sdk was called, it dutifully made called the $promise->wait() functions of the promises that it kept registered.
Sadly, because at this point in the time the cURL instances were already destructed, the Guzzle Promises were dutifully marching towards their doom, as the aforementioned Line 114 in the CurlMultiHandler is waiting for the handle to return anything but CURLM_CALL_MULTI_PERFORM.
This doesn’t happen, although one would assume at some point in time to receive CURLM_BAD_HANDLE this is not the case, as the handles weren’t yet destructed and this endless loop stopped this from happening.
So, it might dawn on you why this is only happening in the account-switch case: It’s the only place where we expect a redirect from / to anything else. By instantiating so much stuff in all other cases we’re giving PHP so much stuff to destruct that it simply didn’t get around to destroy the cURL instances before the destructor of the event tracking sdk is called.
Creating a lot of stuff to destruct simply masked this issue, the simple redirect doesn’t do much, so it laid bare the issue.
The shortest way to fix this was simply to rename the destructor in our application and turn it into a normal public function which I called “waitForResponses()” and stitched into the application after the page / api-response was sent.
That squatted that bug.