I mentioned in a previous article that, since being indoctrinated into the Serverless world, I have never had such visibility into everything going wrong, and that it was a scary prospect. The truth is, now that I know I can have that level of visibility, not having it would scare the bejesus out of me.
So, how do you debug and link a multitude of events and their failures across a disparate wealth of functions, and trace these back to user behaviours? We use three tools to great effect to get the job done:
- Sentry provides us with error reporting, stack traces, and quantification/qualification of error volumes.
- IOPipe provides us with insight into invocations, metrics on function activity, and dependency timing.
- CloudWatch provides us with a raw source of truth on what is happening within our Lambdas; this is pretty much a given with Serverless.
We wrap all of our functions using our Lambda wrapper, which provides standard tooling across all of our services with no extra code.
We generally get alerted to all errors via Slack, and these come in several forms. The first is an IOPipe error, as seen below.
When we see an error like this, our immediate first action is to click on the alert, which takes us to IOPipe. We then look at the number of invocations and the type of alert to get some understanding of the magnitude of the error.
We tag all of our known errors, which allows us to understand the type of error much more quickly. Across all of our functions, we generally try to catch both known and unknown errors and provide decent responses from our APIs to our frontends and other services. In the case below, the request properties provided by the user were invalid; this should have been caught by our frontend validation, so there is definitely some dodginess happening here.
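The known/unknown split can be sketched as follows. The class name, tags, and response shapes here are illustrative assumptions, not our production code; the point is that tagged errors produce a useful API response, while anything untagged is treated as unknown and reported.

```javascript
// Illustrative known-error type: the tag is what groups alerts.
class KnownError extends Error {
  constructor(message, tag, statusCode = 400) {
    super(message);
    this.tag = tag; // e.g. "invalid-request-properties"
    this.statusCode = statusCode;
  }
}

// Turn a caught error into a decent API response.
const toApiResponse = (err) => {
  if (err instanceof KnownError) {
    // Known errors: helpful, specific response for the frontend.
    return {
      statusCode: err.statusCode,
      body: JSON.stringify({ error: err.tag, message: err.message }),
    };
  }
  // Unknown errors: this is where Sentry reporting would happen,
  // and the caller gets a generic response.
  return {
    statusCode: 500,
    body: JSON.stringify({ error: "internal" }),
  };
};
```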
At this point, we look at the decoded data from the user agent and try to replicate the error on the browser and device where we saw it happen. As part of our Lambda wrapper, we use the user-agent NPM module to translate the user agent into device data, which is then reported as custom metrics to IOPipe and into CloudWatch.
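As a rough illustration of that step, the sketch below decodes a user-agent string into searchable device fields. It is a simplified regex stand-in for what the user-agent module actually does, and the metric-reporting comment shows the general shape rather than an exact API.

```javascript
// Simplified stand-in for user-agent decoding; a real module does
// far more thorough detection than these regexes.
const deviceMetrics = (ua) => ({
  browser: /Firefox\//.test(ua) ? "Firefox"
    : /Edg\//.test(ua) ? "Edge"
    : /Chrome\//.test(ua) ? "Chrome"
    : /Safari\//.test(ua) ? "Safari"
    : "unknown",
  mobile: /Mobile|Android|iPhone/.test(ua),
});

// Each field would then be reported as a custom metric from the wrapper,
// so device data becomes searchable alongside the invocation.
```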
We can also dig into the functions that a user has hit across all of our services, based on their IP address during a given time range, allowing us to quickly replicate the user behaviour that led up to the error event.
Another way we understand our functions using IOPipe is by adding custom timing code around all of our external dependencies. This allows us to track long-running or failing third-party components quickly and efficiently, on a per-invocation basis.
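The timing code amounts to a small helper like the one below. This is a hedged sketch, not our exact implementation: the helper name and the reporting callback are assumptions, standing in for wherever the custom metric is actually emitted.

```javascript
// Wrap any external call so its duration is reported per invocation,
// whether it succeeds or throws.
const timeDependency = async (name, fn, report = console.log) => {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // In practice this would be reported as a custom metric on the
    // invocation, keyed by dependency name.
    report(`${name}-duration-ms`, Date.now() - start);
  }
};

// Usage: const rows = await timeDependency("dynamo-query", () => query(params));
```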
We will also hop from IOPipe to CloudWatch to get a raw log of what has happened within the function, to gain a better understanding of the event.
The other main way we receive errors in Slack is via Sentry; we get Sentry errors from both our frontend applications and our serverless backends.
Our first action is to look at the error in Sentry and get a handle on the number of occurrences and the nature of the error. On the night of TV we get a large number of errors flooding through at once, so this is a very necessary tool for focusing our efforts quickly and efficiently, separating crappy browser errors from things we really need to be worried about.
We also run Sentry (and IOPipe) across all of our environments (staging, pull request, production and sandbox) for every service, and so catch quite a few errors before they even hit production, via automated tests and developer pushes to pull requests, run across devices using BrowserStack Automate.
Depending on the severity of the error, we will then either create a ticket in GitHub from the error report or, if the error is severe, assign a user to the report and get all hands on deck to look into it immediately.
The other route we take from here is to grab the IP address from Sentry and search IOPipe with it, to understand the user journey further by looking at the Lambda invocation flow.
On top of Sentry and IOPipe, we have quite a few CloudWatch alarms, a load of Grafana dashboards, and InfluxDB monitoring of how deltas are processed through our systems. For our donation system, we very actively load balance between providers on the night of TV; it is super important for us to spot emerging issues with payment service providers and to see dropping basket conversions at speed, allowing us to alter provider weightings on the fly and ensure the best monetary return.
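Weighted provider selection can be sketched in a few lines. The weights object and function name are hypothetical; the point is simply that changing the weights at runtime immediately shifts traffic between providers.

```javascript
// Pick a provider in proportion to its weight. Weights can be updated
// on the fly to steer traffic away from a struggling provider.
const pickProvider = (weights, rand = Math.random()) => {
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  let point = rand * total;
  for (const [provider, weight] of Object.entries(weights)) {
    point -= weight;
    if (point < 0) return provider;
  }
  // Fallback for floating-point edge cases at the upper boundary.
  return Object.keys(weights)[0];
};

// e.g. pickProvider({ stripe: 70, worldpay: 30 })
```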
That's really a summation of how we tackle bugs and errors at Comic Relief: the workflow is ultra simple and ultra powerful. What Lambda and the third-party tools we implement provide us with is a more obvious path to the user behaviour that resulted in an error event. Being able to spot these quickly means we can feel confident that we understand the context in which the error happened, fix bugs at speed, and then move on to the next pressing issue.
Also, be sure to watch this presentation by our Engineering Lead Peter Vanhee talking through our current architecture at Serverless Computing London, and check out our technology blog for more stories of how we do what we do.