A two week search for the missing body of a Lambda function response

Matt Lavin
Life and Tech @ LifeOmic
8 min read · Jul 10, 2018

At LifeOmic, a handful of engineers and I recently spent about two weeks debugging some unexpected behavior from the services that support our LIFE Fasting Tracker mobile application and our Precision Health Cloud. Specifically, the responses being returned from some of our AWS Lambda functions were causing errors for our users, and I thought I’d share our experience to spread the word about what we learned in the process. Hopefully our approach to debugging will help you in your own search for understanding unexpected behavior.

Day 0: Null

It all began when we started to receive support requests from users of our LIFE Fasting Tracker mobile app saying that they were not able to view the progress of other people's fasts. When the first report came in it was tempting to dismiss such a strange message as a fluke, but over the next couple of days we saw a steady stream of reports about error messages that looked like this:

Cannot read property 'body' of null

That’s not the kind of message you want a user to see, and it was happening often enough that it quickly got our attention. A search for that message in SumoLogic was the first step. That led right to the source of the error. The relevant code looked something like this:
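(What follows is a reconstruction rather than the exact source: lambda is an instantiated AWS SDK Lambda client, the function name is made up, and the lines live inside an async function.)

const result = await lambda.invoke({ FunctionName: 'fast-progress' }).promise();
const payload = JSON.parse(result.Payload); // Payload arrived as the string "null"
const body = JSON.parse(payload.body);      // line 3: "Cannot read property 'body' of null"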

The error was being thrown from line 3 because it assumed that Payload was always a stringified JSON object with a body attribute. That might seem like a big assumption to make, but the function being invoked was used as an HTTP Proxy Integration with an API Gateway, and the documentation for the output format of Proxy Integration functions requires those assumptions to hold. The problem appeared to be in the invoked Lambda function, not in the client code. Now we had to find out why it would return a null Payload.

The Lambda function in question was built using the serverless-http library, which allows you to write your services as typical Koa or Express applications and deploy them as Lambda functions behind an API Gateway. This ensures that the required Proxy Integration output and input formats are translated correctly. But no matter what my Koa routes were returning, the serverless-http library should have been returning a valid output format, so we started to dig into the library code. Here is a pseudocode version of the implementation:
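(A paraphrase from memory, not the library's exact source.)

module.exports.handler = (event, context, callback) => {
  framework.handleRequest(event)
    .then((response) => {
      // Success: always an object literal with the fields Proxy Integration requires
      callback(null, {
        statusCode: response.statusCode,
        headers: response.headers,
        body: response.body
      });
    })
    .catch((e) => callback(e));
};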

The library appeared to be doing just what it should. If there was no error, it would invoke the callback with an object literal containing the required fields, and if there was an error it would pass the error to the callback. There appeared to be no way for the response to be null. This led to many hours of re-reading the code in search of a case where null could be returned, but the object literal meant that no matter what came back from Koa, some object would always be returned.

Day 3: A false sense of progress

After a day or two of being stuck, I woke up one morning with an insight! What would happen if a falsy value was thrown as an error? That would be an incredibly bad practice for any code to follow, but it was a theory, it was testable and it was better than nothing.

A small test case confirmed that falsy errors would cause the behavior we had been seeing. An error was thrown, so the callback(e) block was executed, but because e was falsy the Lambda runtime treated the invocation as a success. The second argument to the callback was undefined, and that seemed to be translated to a null automatically. It felt like success: a possible cause was found, a patched version of serverless-http was coded to log if that situation ever happened, and the Lambda function was re-deployed with the patch. We waited.

Here is pseudocode for the patched version of serverless-http; only the catch block changed (a sketch of our patch, not the exact code):
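.catch((e) => {
  if (!e) {
    // Log loudly so we can find this case in SumoLogic later
    console.error('A falsy value was thrown as an error:', e);
  }
  callback(e);
});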

The error wasn't happening every day, so we had to wait for it to strike again. In the meantime, we thoroughly searched our codebase for any code that would throw null, undefined, false or any other falsy value, but no such code was found. Finally, after a day or two, the error happened again, but there was no log message about a falsy error. I banged my head against my desk and yelled profanities. We were back to having no ideas. By now, the group of interested developers had grown, and we had a dedicated Slack channel full of crazy ideas about what might be happening.

Developer Advice #1: While this wasn’t the cause of our troubles, it could still cause trouble for others. Throwing anything other than an Error instance is a very bad practice and it’s almost guaranteed to cause behavior that you do not intend.

Day 7: Starting fresh with week two

If serverless-http was always returning an object, and the application was not throwing bogus errors, maybe the bug was inside the Lambda callback implementation. Seeing the implementation of callback would give us a new place to explore for bugs, but it wasn't immediately obvious how to look into AWS's implementation details. It turns out that you can, thanks to the extremely useful docker-lambda Docker images published by the LambCI project. The images are built by extracting the files from an actual AWS Lambda runtime, so the real callback implementation is available inside the docker-lambda images. Here is how you can view it:

docker run \
--entrypoint /bin/cat \
lambci/lambda:nodejs8.10 \
/var/runtime/node_modules/awslambda/index.js

The relevant method from the index file turns out to be InvokeManager.finish. Here is a slightly simplified version:
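(This is a reconstruction rather than verbatim runtime source; makeError and postResult are stand-ins for runtime internals.)

// Reconstructed sketch of InvokeManager.finish
function finish(err, data) {
  var error = null, response = null;
  if (err) { error = makeError(err); }                   // falsy errors skip this branch
  else { response = data === undefined ? null : data; }  // undefined is coerced to null
  postResult(error, JSON.stringify(response));
}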

You can see how falsy errors are treated as success on line 4 and how undefined becomes null on line 5, but nothing jumps out as the cause of a null response. The only way the response could be null is if data itself was null, and we'd convinced ourselves that serverless-http would never send a null response.

More searching in the Lambda implementation code revealed this interesting line, way down at the bottom of the file, which did send a null data argument:
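(Again a reconstruction, not the verbatim file; the null arguments are the important part.)

process.on('beforeExit', () => {
  // Complete the in-flight invoke with no error and a null response,
  // then wait for the next invocation
  invokeManager.finish(null, null);
});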

That was the light at the end of the tunnel. We needed to explore the details around when the beforeExit event would be emitted, but it meant there was hope for understanding the strange behavior we had been seeing.

Day 8: Confirming the early exit theory

The beforeExit event is emitted when the Node.js event loop is empty, just before the process exits; it's a chance for the process to keep itself alive by scheduling more work (see the minimal demo after this paragraph). The Lambda runtime uses beforeExit to complete the last request and start processing the next one. If beforeExit was emitted while a request was still being processed, a null response would be returned. The new mystery was how a beforeExit event could be emitted before our request had finished executing. With many people now trying to debug the problem, we split the work into two parts that could be tackled in parallel.
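As a quick illustration, this tiny script prints its final log line and then the beforeExit message, because beforeExit fires once the event loop has nothing left to do:

process.on('beforeExit', (code) => {
  console.log('event loop drained, about to exit with code', code);
});

console.log('main script finished');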

The first step was to verify that the beforeExit handler was really the source of our null responses. We wrapped our Lambda function handler and the runtime's event registration with some clever tricks to detect whether this case was happening in our deployment. After deploying the monitoring and waiting a day or two, we confirmed that this was the source of our problem. We've since deployed it across other functions in our system and have already caught a couple of other early exits that would have gone undetected. We published the detection code in our lambda-tools project, and you can use it to check whether your functions ever hit this situation. If you have a Node.js Lambda function, I'd highly recommend including monitoring like this in your project.
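The real implementation lives in lambda-tools, but the core idea looks roughly like this sketch (simplified, with illustrative names):

let inFlight = false;

process.on('beforeExit', () => {
  if (inFlight) {
    // The event loop drained while a request was still being processed
    console.error('beforeExit fired during an active invocation');
  }
});

const wrapHandler = (handler) => async (event, context) => {
  inFlight = true;
  try {
    return await handler(event, context);
  } finally {
    inFlight = false;
  }
};

// usage: exports.handler = wrapHandler(actualHandler);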

Developer Advice #2: When debugging a problem, it can be very helpful to implement just enough logic to verify that your theory is correct before spending too much time on a solution. In our case, our initial guess was a potential source of trouble, but not the real one. It took two different theories before we found the true source and had the confidence to invest in more exploration.

Day 10: Finding the source of early exits

The second step was to track down why the event loop could be emptied while a request was still being processed. A very valuable clue was that, out of the many functions we have deployed, only this one demonstrated the problem. It also happened to be the one service using the memcached-plus client, and the memcached-plus client contains code that bridges between callback-style APIs and Promise-style APIs. That felt like a likely source of strange behavior, since it's so easy to make a subtle mistake in that kind of code. It turns out that when creating new Promise instances, it's possible to write a program that exits sooner than a person reading the code would expect. For example:
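(A self-contained version of that proof of concept:)

new Promise((resolve, reject) => {
  // Neither resolve nor reject is ever called, and no timers or I/O
  // are scheduled, so the event loop empties immediately
})
  .then(() => console.log('Done'))
  .catch(() => console.log('Error'));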

At first glance, it would appear that the program will always output either Done or Error. However, the program exits without any output and without an error code. Because the Promise neither schedules any new work nor calls resolve or reject, the JavaScript engine decides there is nothing left to execute and exits cleanly. This proof-of-concept code, plus the confirmation that the beforeExit handler was running before it should, gave even more weight to the theory that something was wrong with the Promise bindings in the memcached-plus library.

Developer Advice #3: When constructing Promise instances manually, be very careful to always call either resolve or reject in every code path. A Promise with a code path that calls neither will produce unexpected behavior for the consumer of the Promise.

Day 12: Living with a broken Promise

We continued with the hunch that there was a poorly behaved Promise in the memcached-plus client code. We wanted our code to detect the problem and behave correctly if our guess was right. With the help of the p-finally and p-timeout modules, a small helper was created to address both goals:
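(A reconstructed sketch using the 2018-era APIs of those modules; the 30 second default timeout and the log format are illustrative.)

const pFinally = require('p-finally');
const pTimeout = require('p-timeout');

function safePromise(promise, label, ms = 30000) {
  // pTimeout keeps a timer pending, so the event loop can never be empty
  // while the wrapped promise is unsettled, and a promise that never
  // settles becomes a rejection that names the misbehaving call
  const guarded = pTimeout(promise, ms, `${label} did not settle within ${ms}ms`);
  return pFinally(guarded, () => console.log(`${label} settled`));
}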

That safePromise helper can be used on any Promise to make sure that the event loop will never be empty earlier than expected, and to log enough information to find the source of the bad Promise. All Promises from the memcached-plus client were wrapped with that helper function. We redeployed and we waited.

Developer Advice #4: If you are using a third-party library, especially one that makes network calls, it's a good idea to wrap its calls in some sort of timeout to protect against unexpected behavior. Adding timeouts also has the advantage of forcing you to consider what execution times would be acceptable to your users.

It turned out that our guess about memcached-plus being the source of our problems was correct. After wrapping the Promises from that client, the beforeExit handler no longer executed early, and no more null payloads were returned to users. With the logging from the wrapper, the source of the bad Promise was tracked down and a PR was created against the memcached-plus client.

Summary

Hopefully sharing this story will save others some time and maybe teach somebody something new. My takeaways from this experience are:

  1. Only throw actual Error instances. It’s almost guaranteed that other code will not be written to handle non-Error values.
  2. Always try to confirm your theories before working on a fix, because you might be wrong.
  3. Be very careful when writing code that constructs new Promises to make sure that either resolve or reject is always called.
  4. When using a third-party library, adding timeouts can lead to better experiences for your users while protecting you from bugs in the library.

Good luck debugging your own hard-to-understand problems — don’t give up!

Matt Lavin is the lead developer on the mobile backend services team at LifeOmic. We love solving difficult problems while building services to support our mobile health tracking LIFE apps. Feel free to contact me on LinkedIn or Twitter to learn more.
