When apps go wrong

How to monitor your app for real-time incident response

Amadeusz Zolnowski
Engineering at Runa
9 min read · Jul 2, 2021


If you’re an engineer looking to join a mission-driven company with an amazing engineering culture — check out our careers page for openings!

When a web application gains a widespread user base, it is expected to work 3600 seconds an hour, 24 hours a day, 365 days a year.

Software is complicated, and as you gain users and features, the chance of failure grows. While testing is critical, it’s even more important that you can react quickly to outages as and when they happen.

Did I just delete that table in prod?

In this post we talk about our product, the kinds of failures you might see in a production app, what to measure (and how) for good performance, and finally how to get notified when something goes wrong so you can react quickly. If you want to build a product your customers can trust, read on…

WeGift Connect

For context, let’s understand what we build at WeGift. Our core product is WeGift Connect. It offers payout solutions through a web portal and JSON API. The two main product journeys are split across our customers (the business) and consumers (their customers):

  • the order journey — where a customer sends value to end-users,
  • value collection — where a consumer uses the value they’ve been sent.

WeGift Connect also deals with accounting, invoicing, reporting, etc., and connects to a plethora of third parties to provide payout through things like subscriptions and cryptocurrencies.

Connect is deployed to an AWS Kubernetes cluster, taking advantage of AWS services like Aurora, Lambda Functions, S3, et al. It’s implemented in Python 3 / Flask / Celery / TypeScript & Vue.js.

What’s service availability?

It is not enough to check if an application is up and running. Everything may work perfectly for 99% of your users, but may not for the remaining 1%. At WeGift, that 1% matters. Moreover, a problem impacting just a single user may grow to impact everyone.

What can go wrong?

Everything. Even hypothetical flawless software with high availability and load balancing can be disrupted. Where’s the risk in your application? Let’s look at some possibilities.

Software bugs

Bugs that abort a user’s journey (e.g. an order which results in a 5xx error) or otherwise impede the product’s usage all need to be tracked. We classify bug severity to help us respond appropriately; the classification depends on the use case and usage. Bugs related to order and value redemption processing, for example, would be Sev-1 or Sev-0 depending on how many users are impacted.
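The exact thresholds are a matter of policy rather than code, but as a rough illustration of the idea (the level names and numbers below are made up, not our actual classification):

```python
from enum import IntEnum


class Severity(IntEnum):
    """Illustrative severity levels; the real thresholds live in policy, not code."""
    SEV_0 = 0  # core journey broken for most or all customers
    SEV_1 = 1  # core journey broken for a subset of customers
    SEV_2 = 2  # degraded but usable; a workaround exists
    SEV_3 = 3  # cosmetic or low-impact defect


def classify(core_journey_affected: bool, impacted_user_ratio: float) -> Severity:
    """Hypothetical mapping from impact to severity level."""
    if core_journey_affected:
        return Severity.SEV_0 if impacted_user_ratio > 0.5 else Severity.SEV_1
    return Severity.SEV_2 if impacted_user_ratio > 0.1 else Severity.SEV_3
```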

Networking issues

Internal connectivity problems, whether between the gateway and the application, between the application and supporting services like the database, or between application microservices, are likely to impact your customers. For us, these are usually handled by our cloud provider, with only a handful of issues affecting our end users in the last four years.

External networking poses a greater challenge. WeGift Connect integrates with tens of APIs around the world from which we source value in real time. And to keep the focus on the core value WeGift delivers, we integrate with other services, like email providers. That means we send a lot of outgoing requests to third-party servers, and networking errors may impact the order and redemption flows.
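Outgoing calls are where the network usually bites, so it helps for every third-party request to carry an explicit timeout and a bounded retry policy. A minimal sketch of the idea using `requests` (the supplier URL and retry budget are illustrative, not our production settings):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry idempotent calls a few times on transient network or server errors.
retry = Retry(
    total=3,
    backoff_factor=0.5,               # 0.5s, 1s, 2s between attempts
    status_forcelist=(502, 503, 504),
    allowed_methods=frozenset({"GET"}),
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

# Hypothetical supplier endpoint; always set an explicit timeout.
response = session.get("https://api.example-supplier.com/v1/stock", timeout=(3.05, 10))
response.raise_for_status()
```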

External service failures

In addition to any networking issues that could happen en route, third-party services may simply fail. With tens of integrations, there’s a high chance that at least one is failing at any given time. Some providers are more reliable, while others frequently face technical problems. Their failure becomes our failure. While we put measures in place to mitigate this, we need to monitor them at all times.
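One common mitigation for a flaky provider, sketched here only as a general idea rather than our exact setup, is a circuit breaker that stops hammering an integration once it starts failing and lets it recover:

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: stop calling a failing provider for a cool-off period."""

    def __init__(self, max_failures: int = 5, reset_after: float = 60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Should we attempt the next call to this provider?"""
        if self.failures < self.max_failures:
            return True
        if time.monotonic() - self.opened_at > self.reset_after:
            self.failures = 0  # half-open: let one attempt through
            return True
        return False

    def record(self, success: bool) -> None:
        """Feed back the result of each call."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```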

High load

While order volumes usually match our forecasts, some events trigger higher-than-typical order rates. While we are prepared to handle larger volumes than we currently do, we still need to monitor the impact of a higher request rate on the rest of the system and also verify that the traffic is legitimate.

Scaling problems

As we grow, there are parts of the system that might not be ready for that growth. A SQL query that worked perfectly a year ago may no longer return data in a timely manner. An algorithm analysing in-memory data may have plenty of headroom today, but by the end of the year it may end up terminating a pod, or it may start burning CPU and affecting other processes in the container.

Alerting

Knowing what can potentially go wrong is a good start. The next step is knowing when, and understanding how to fix it.

General availability

This is one of the easiest things to set up and monitor in our current system. We have a lightweight endpoint that handles a request every minute. If the endpoint doesn’t respond, then we could be down. It’s that simple.

To do this we use Pingdom, which we’ve set up to make an HTTPS request to the /health endpoint on our application.

As well as availability, it provides some insight into application response times. If the application doesn’t respond in 30 seconds, it notifies us via PagerDuty and Slack. Pingdom also performs checks from multiple territories, giving us greater confidence in its reliability.
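On the application side, the endpoint behind a check like this can be a very small Flask route. A sketch of the general shape (the dependency check and connection string are illustrative, not our exact implementation):

```python
from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
# Hypothetical connection string; in practice this comes from configuration.
engine = create_engine("postgresql://user:pass@db:5432/app")


@app.route("/health")
def health():
    """Lightweight health check: respond 200 only if core dependencies answer."""
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except Exception:
        return jsonify(status="unhealthy"), 503
    return jsonify(status="ok"), 200
```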

Monitoring application exceptions

An application with low usage could intercept an exception and send the traceback by email. This doesn’t scale well, though: with multiple users hitting the same bug, we would be flooded with emails in no time, impossible to follow. A central logging system can help here by aggregating exceptions. But there is more to an exception than just the traceback; to effectively triage and fix bugs, a developer needs more context. Although this can be achieved with central logging systems and incident management dashboards, we decided not to reinvent the wheel and adopted Sentry.

There’s a lot that Sentry captures out of the box, but Sentry’s SDK allows you to provide additional context, like an order ID. This allows Sentry to aggregate instances of the same exception and keep track of the events and users that have been affected.
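As a rough illustration of what that looks like in a Flask service (the DSN, release string and tag names are placeholders rather than our real configuration):

```python
import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    integrations=[FlaskIntegration()],
    release="connect@1.2.3",  # lets Sentry tie an error spike to a deployment
)


def tag_order_scope(order_id: str, customer_id: str) -> None:
    """Attach domain context so any exception captured later is easy to triage."""
    sentry_sdk.set_tag("order_id", order_id)
    sentry_sdk.set_context("order", {"customer": customer_id})
```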

There’s one particular feature of Sentry we rely on: alerts. Sentry is able to detect a sudden spike of errors, which could mean:

  • there’s a new critical bug (likely in a recent deployment),
  • a bad actor is trying to exploit the application,
  • an external service is having an outage and we’re not handling it well.

We use Sentry to track both platform and front-end metrics, reducing the tooling we use as well as providing a more holistic view during an incident.

Container utilisation

When the system is used continuously, some long-term and short-term patterns will emerge across your metrics. One of the most generic metrics is resource usage, in our case across the Kubernetes cluster. Resource usage data needs to be combined with an understanding of what the system is doing at the time.

Many of the problems laid out in “What can go wrong” could manifest themselves in resource usage. So this is a useful metric to track, but how do you alert on it when high CPU usage can be valid in certain situations?

First of all, we aggregate stats from all of our nodes. At WeGift we evaluated many solutions and went with Datadog for its ease of integration and versatility. Datadog provides an agent which gathers metrics for each node and sends them in a structured fashion to the central Datadog instance. The same is possible with logs, but that requires extra processing, whereas the Datadog agent does it without configuration. Once the data has been collected, it’s very easy to build a dashboard from these metrics.
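The same agent also accepts custom metrics over DogStatsD, which is useful when a node-level number needs application context. A small sketch with the `datadog` Python client (the metric name and tags are made up):

```python
from datadog import initialize, statsd

# The agent's DogStatsD server listens locally; host/port are deployment-specific.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Hypothetical gauge reported alongside the agent's own node-level metrics.
statsd.gauge("connect.celery.queue_depth", 42, tags=["queue:orders"])
```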

This can easily be achieved with other solutions, but as we’ve discussed, these graphs alone are not enough. We don’t want a human staring at them 24/7; we have better things to do, like writing this blog post!

Datadog has a Monitors feature that lets you create an alert from a metric in various ways. For memory issues, a simple threshold-based monitor works, but for CPU usage or network I/O both low and high values can be valid. Instead, you want to be alerted when the pattern becomes unusual, which is where Datadog’s anomaly detection comes in. The tolerance of anomaly detection needs to be tuned over time as patterns change while you scale, but this feature greatly improves your setup by reducing false alarms without filtering out valid ones.

Domain-specific alerts

So far we’ve only covered generic monitoring that could be applied to any web application. But often an issue is specific to the application’s business logic. When one of our orders fails to process, it rarely manifests itself as an exception in Sentry. Often the order is failing because of a supply issue or a third party, e.g. a provider has gone down or there are issues with how quickly we can fund them. We need to know about these issues before a customer writes to us.

Having our logs in Datadog allows us to build monitors on top of them. In many cases these are very simple, like stock-level checks, which are plain log-line matches. Other issues are identified when a particular log line hits a high failure rate; for these, we use events grouped by certain fields and set thresholds based on historical data. We also structure events and attach relevant context whenever possible to improve how we respond to an incident.
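A sketch of what such a structured event might look like from the application side, using only the standard library (the field names are illustrative, not our schema):

```python
import json
import logging

logger = logging.getLogger("connect.orders")
logging.basicConfig(level=logging.INFO)


def log_order_failed(order_id: str, provider: str, reason: str) -> None:
    """Emit one structured line that a Datadog log monitor can match and group on."""
    logger.warning(json.dumps({
        "event": "order.processing_failed",  # the string the monitor matches
        "order_id": order_id,
        "provider": provider,                # lets the monitor group failures per provider
        "reason": reason,
    }))
```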

Positive metrics

Application observability takes time. It’s a bit like a game of Whac-A-Mole: something fails, we didn’t know about it, we add monitoring, something else fails, and so on. To help spot issues we’re not specifically looking for, we monitor our normal usage patterns. For example, we expect a certain number of orders or a certain amount of customer activity. If this drops significantly, we need to investigate. Here, again, Datadog’s anomaly detection comes in handy.
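One way to feed such a baseline, assuming the DogStatsD setup sketched earlier, is to emit a counter on every completed order and point an anomaly monitor at its rate (the metric name is illustrative):

```python
from datadog import statsd

# Incremented once per successful order; an anomaly monitor can alert when the
# rate falls well outside its usual daily pattern.
statsd.increment("connect.orders.completed", tags=["channel:api"])
```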

From alert to incident

Just having an alert is not enough. We don’t want the whole company to be woken up and running in circles. We established a process to handle alerts swiftly while causing minimal disruption to our daily activities.

Rather than having every engineer look at the issue, we appoint an engineer from each domain team to be on standby; together these engineers form the Operations Team. On-call engineers rotate within each domain team on a weekly basis.

The on-call rotation is managed by PagerDuty. PagerDuty notifies only the engineers who are on call and, in addition, posts events to a dedicated Slack channel where the related discussion happens. The Slack channel provides visibility to the whole company, so anyone can follow the incident and respond accordingly.

We also use Allma to help us huddle around any suspected high impact incidents. It’s quick to declare a new incident from the Slack channel and easy to keep up to date.

Depending on the impact, we post incidents to WeGift’s status page, so subscribed customers get notifications about any operational issues we’re having.

Even if the issue is resolved for the customer, there’s still more work to do, but that’s the subject of another blog post because this one is already too long…

Summary

Incidents will always happen. At WeGift we’ve built a strong process so we can handle them swiftly, learn from our mistakes, and improve the product and our processes. This allows us to get stronger, and continue to serve our customers as we scale.

Come work at WeGift!

We’re building an engineering team where people can do their best work. We want to hire talented engineers who see WeGift as an opportunity to grow and to make a huge impact! We have structures in place, from hiring and levelling to how we make technical decisions, that really allow people to thrive. If you’re interested in finding out more about life at WeGift and how you can join us on our mission to revolutionise the $20 trillion B2C payout market, take a look at our careers page or get in touch with the Talent Team at talent@wegift.io.
