A Post Mortem on August 2022 incidents

Darwin D Wu
oVice
Sep 17, 2022

First and foremost, I would like to apologize for the inconvenience with the recent incidents. We would like to share more details with our customers and users regarding the outages that occurred in late July and throughout August and what we’re doing to prevent these from happening again.

July 29 incident

At 11:49 AM JST, we received an alert stating that our API pods were becoming unavailable and the service went down for roughly 5 minutes.

Pod unavailability monitor

Pods were restarting very frequently until eventually all of them went down, causing a full API outage. We were initially a little confused because the incident happened at a time of day when the number of requests typically winds down rather than peaks.

Typical traffic pattern for a weekday (red marks when the incident occurred during the day)

When looking at the API latency, we also observed severely slow response times for certain endpoints.

Latency for a specific API endpoint increasing over time

The endpoint shown in the screenshot above is one we added for the calendar integration feature we released just a few weeks prior. Its response time had been slowly increasing since the release, so there was clearly a correlation between the feature's release and the incident itself.

Investigation

The endpoint in question was actually doing two things.

  1. Accept the code from the OAuth provider and exchange it for an access_token
  2. Attempt to create the resources required for the calendar integration to work (typically some sort of subscription so that we can be notified of changes in a user's status (Teams) or calendar events (Google Calendar))

While the former is generally a quick operation, the latter can occasionally be extremely slow. The Microsoft Teams Graph API can take up to ten seconds to respond, causing a time-out when attempting to create the necessary subscription. Google Calendar is typically not that bad but can still take a couple of seconds to complete the requested operation. On top of that, the subscription process can fail randomly and require the user to re-attempt the request, meaning the response from our API to our front end could take an extremely long time.

Mitigation

We rolled back one of our integration releases to reduce the load on the API, bumped up the API's resources, and moved the subscription logic into background jobs. This way it no longer blocks the API, and the endpoint is now only responsible for the token exchange and for enqueuing the background job.
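For illustration, here is a minimal sketch of that background-job shape, assuming an Oban-style job scheduler. The module names, queue name, and the Integrations helpers are hypothetical placeholders, not our actual code.

```elixir
defmodule MyApp.Workers.CalendarSubscriptionWorker do
  # Hypothetical worker: the slow, occasionally flaky subscription call to
  # Teams / Google Calendar runs here with automatic retries, instead of
  # blocking the HTTP request.
  use Oban.Worker, queue: :integrations, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"user_id" => user_id, "provider" => provider}}) do
    MyApp.Integrations.create_subscription(user_id, provider)
  end
end

# In the OAuth callback, the endpoint now only exchanges the code and enqueues the job:
#
#   {:ok, _token} = MyApp.Integrations.exchange_code_for_token(code)
#
#   %{"user_id" => user_id, "provider" => provider}
#   |> MyApp.Workers.CalendarSubscriptionWorker.new()
#   |> Oban.insert()
```

With the third-party call off the request path, a slow or failing provider only affects the job queue's retry backlog, not the API's response time.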

We thought this was a fairly easy incident to resolve, but it turned out the issue we addressed was not the root cause. The cause described above merely made the real issue worse, to the point that it finally caused an outage.

After further research, the actual cause turned out to be a default setting in our email library, Swoosh, that causes trouble when run in production. You can read about the adventures we went through to reproduce it in this blog post.

As of this writing, this issue is resolved and we do not expect to see similar issues again. Furthermore, when developing new endpoints or adjusting old ones (which we're still doing, both as part of our ongoing migration from PHP to Elixir and whenever we identify endpoints that could or should be improved), we now set rather strict limits on how fast an endpoint has to respond (an Apdex target of 100ms). If a long-running operation is required for the endpoint to do its job, it has to be implemented as a background job with our job scheduler.

August 3 incident

At 8:51 AM JST, Datadog alerted us that our API was down. Five minutes later we started getting customer reports that people couldn’t access any workspaces. So at this point, it was clear that we were dealing with a full site outage.

Pod unavailability alert

Attempted fixes

Learning from previous incidents, our first attempt was to roll back the most recent release of the API. There was a recent change that fixed a performance issue, and even though we didn't think it could be related, we proceeded with the rollback regardless, because our first goal is always to return to a known good state.

Unfortunately, rolling back the API didn't resolve the issue and actually made things worse, because the performance fix was now gone. During this time we continued to observe high CPU load, high memory usage (up to 3 GiB), and latency issues with our API.

The latency issue was actually what caused the API pods to keep restarting. Due to the health checks we have set up in Kubernetes, if Kubernetes doesn't get a response within the configured time frame (in our case, 10s), it considers the pod unhealthy and forces a restart in an attempt to fix the situation.

This created a vicious cycle that kept the API pods from starting up cleanly:

  1. Pod latency increases
  2. Kubernetes thinks the pod is dead because it doesn't respond in time, even though it's still alive
  3. The pod gets restarted, gets hammered with requests again, and the cycle repeats from #1
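As a side note on the health-check mechanics: one common way to keep liveness probes from amplifying this kind of overload is to answer them with a plug placed at the very top of the endpoint's pipeline, so the probe stays cheap even when the rest of the request path is saturated. A minimal sketch (the module name and /healthz path are hypothetical, not our actual configuration):

```elixir
defmodule MyAppWeb.HealthPlug do
  # Answers liveness probes before any heavy application work happens.
  # Plug this near the top of MyAppWeb.Endpoint, e.g. `plug MyAppWeb.HealthPlug`.
  import Plug.Conn

  def init(opts), do: opts

  def call(%Plug.Conn{request_path: "/healthz"} = conn, _opts) do
    conn
    |> send_resp(200, "ok")
    |> halt()
  end

  def call(conn, _opts), do: conn
end
```

This doesn't fix the underlying overload, of course; it only prevents a saturated request path from being mistaken for a dead pod.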

Sentry

During all the scrambling, we noticed one interesting notification from Sentry.

Error quota depleted for the month? Only two weeks into the Sentry subscription month?

Something looked extremely odd, and after digging through the logs of crashed pods, we saw the following. And we saw it often, very, very often.

00:04:31.923 [info] JOINED user_call:3694632 in 10ms
  Parameters: %{}
00:04:31.923 request_id=a48af8b8029476faa05538f691f299d4 trace_id=708587351750908735 span_id=874386591673519646 [info] Sent 200 in 3086ms
00:04:31.924 [error] an exception was raised:
** (FunctionClauseError) no function clause matching in OviceWeb.UserChannel.join/3
    (ovice 0.1.0) lib/ovice_web/channels/user_channel.ex:10: OviceWeb.UserChannel.join("user:xxxxxx", %{}, %Phoenix.Socket{assigns: %{user_id: xxxxxx}, channel: OviceWeb.UserChannel, channel_pid: #PID<0.1117.1>, endpoint: OviceWeb.Endpoint, handler: OviceWeb.UserSocket, id: "user_socket:xxxxxxx", join_ref: "xxxx", joined: false, private: %{log_handle_in: :debug, log_join: :info}, pubsub_server: Ovice.PubSub, ref: nil, serializer: Phoenix.Socket.V2.JSONSerializer, topic: "user:xxxxxx", transport: :websocket, transport_pid: #PID<0.4437.0>})
    (phoenix 1.6.6) lib/phoenix/channel/server.ex:376: Phoenix.Channel.Server.channel_join/4
    (phoenix 1.6.6) lib/phoenix/channel/server.ex:298: Phoenix.Channel.Server.handle_info/2
    (stdlib 3.17) gen_server.erl:695: :gen_server.try_dispatch/4
    (stdlib 3.17) gen_server.erl:771: :gen_server.handle_msg/6
    (stdlib 3.17) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
00:04:31.926 [info] JOINED user_call:3583827 in 8ms
  Parameters: %{}
00:04:31.930 [info] JOINED user_call:3563518 in 8ms
  Parameters: %{}

Sure enough, we were generating tons of similar errors in a very short amount of time.

This helped us pinpoint the issue to a couple of things.

  1. A recent front-end refactoring accidentally removed a check that was supposed to act as a guardrail, preventing calls to our API when the conditions were not met (more precisely: if a certain parameter is missing due to how the space is configured, this API should not be called at all, since the request makes no sense without that parameter)
  2. The API accepting the request relied on Elixir's function pattern matching and had no fallback clause to handle requests missing the required parameter

Once the former was identified, we quickly released a front-end fix to add the missing check back. At that point the site was able to recover, since the UI once again limited the requests it was making. To give you a more concrete idea: well over 99% of the spaces we maintain are configured in such a way that they should not be calling this API, yet without the check in place, all of them were calling it unexpectedly.

The latter was quickly fixed in the backend as a follow-up, to make sure we don't simply error out when an invalid request is sent (a minimal sketch of such a fallback clause is below).
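For illustration, the backend fix amounts to adding a catch-all join clause to the channel, so that a request missing the required parameter is rejected cleanly instead of raising a FunctionClauseError. This is a sketch mirroring the shape of the channel in the stack trace; the parameter name is a placeholder, not our actual code:

```elixir
defmodule MyAppWeb.UserChannel do
  use Phoenix.Channel

  # Happy path: the expected parameter is present, so the join proceeds.
  def join("user:" <> _user_id, %{"required_param" => _value}, socket) do
    {:ok, socket}
  end

  # Fallback clause: without it, a join missing the parameter raises a
  # FunctionClauseError and crashes the channel process.
  def join(_topic, _params, _socket) do
    {:error, %{reason: "missing required parameter"}}
  end
end
```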

Learnings

This incident reconfirmed how well Elixir can scale. The API did eventually go down, but it held itself together even while generating 5M+ errors in the span of 30 minutes, which gives us even more confidence in the language choice we made.

August 8 incident

At 1:07 PM JST, an alert came in telling us the API was down again, but this time with a slightly different twist: the database (MySQL) CPU was at 100% usage.

Database CPU usage climbing, eventually triggering an alert after exceeding 80%

This can only mean a couple of things.

  1. Someone (or we ourselves) is DDoSing us
  2. Some very inefficient query is running and is locking up the whole database (e.g. large DB migration)

Cause

Resolving a DB issue is usually pretty straightforward: you just log in to the database, check which running queries are causing the problem, and kill them, right?

Not so easy this time, especially when the database is using 100% of the instance's CPU time. Getting into the database itself took a couple of attempts, because CPU usage was so high that it barely had any resources left to accept new connections.

Once the database let us in, it was a matter of checking its running processes (via SHOW PROCESSLIST).

| 98293642 | ovice_admin | 10.xx.x.xxx:60756  | ovice | Execute | 439 | removing tmp table | <QUERY> |
| 98293835 | ovice_admin | 10.xx.xx.xxx:51358 | ovice | Execute | 438 | Sending data       | <QUERY> |

Removing a temporary table...? That meant a scan was going on somewhere that was so large it couldn't fit in memory and had to be written to disk first.

Another query (omitted from the listing above) was being called thousands of times, with each call just queued up on the database side, waiting to be processed.

Upon further inspection, the query was related to fetching a user's personal history of their most recent away memos. I'll explain how that's possible in just a few paragraphs.

Remediation

As an urgent and temporary measure, we immediately disabled the API endpoint that was issuing the inefficient query by having it return an empty response instead, and the CPU usage of the database instance started dropping significantly.
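The kill switch itself can be as simple as a configuration flag checked in the controller, so the endpoint returns an empty payload without touching the database. A minimal sketch with hypothetical module, flag, and context names (not our actual code):

```elixir
defmodule MyAppWeb.AwayMemoController do
  use MyAppWeb, :controller

  def index(conn, _params) do
    if Application.get_env(:my_app, :disable_away_memo_history, false) do
      # Emergency kill switch: skip the expensive history query entirely.
      json(conn, %{data: []})
    else
      memos = MyApp.AwayMemos.recent_for_user(conn.assigns.current_user)
      json(conn, %{data: memos})
    end
  end
end
```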

The table that caused this incident was already somewhat of a known issue internally. Since it was quiet most of the time and wasn't causing any trouble, we had decided to leave it alone until we had some bandwidth to address it.

This is a typical example of technical debt getting out of hand if left around for too long. We’ve identified the following steps that we either have taken already or are in the process of implementing:

  1. We confirmed that the query causing the problem was itself necessary, but that the underlying database schema could be altered so the query can be written in a much more efficient manner. Since we're slowly migrating not only from PHP to Elixir but also from MySQL to PostgreSQL, and since access to the service had been restored, we decided that this endpoint would get a much better schema to operate on: one efficient table instead of two inefficient ones, residing in our shiny new PostgreSQL instance. In the days that followed we tested the new implementation thoroughly and, last week, re-enabled the feature, this time backed by a schema that will hopefully serve us for a long time (a sketch of what such a consolidated table might look like follows this list).
  2. Unfortunately, the table that caused this incident is not the only known problem in our legacy database. There are two more tables whose only purpose is to accumulate historical data, so we have started identifying which of that data is truly important and must be kept, and moving it from MySQL to a more appropriate data store.
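As a rough illustration of the schema consolidation described in item 1: a single append-only table with a composite index lets a "most recent away memos for a user" query become a simple index range scan, instead of a join that spills to a temporary table. The table and column names below are hypothetical, not our actual schema:

```elixir
defmodule MyApp.Repo.Migrations.CreateAwayMemoHistories do
  use Ecto.Migration

  def change do
    create table(:away_memo_histories) do
      add :user_id, :bigint, null: false
      add :memo, :text, null: false

      timestamps()
    end

    # (user_id, inserted_at) supports "latest N memos for this user" directly,
    # without a filesort or an on-disk temporary table.
    create index(:away_memo_histories, [:user_id, :inserted_at])
  end
end
```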

August 23 incident

At 9 AM JST, the alert for workspaces not being accessible was triggered once again. We immediately observed our API pods running out of memory.

There were a couple of endpoints we had been working on recently, so we started rolling some changes back. The issue seemed to disappear momentarily but resurfaced a little while later.

Cause

While the rollbacks seemed to have resolved the issue at the time, there was still some doubt as to whether we had actually found the root cause, given the amount of resources the endpoint in question generally uses.

After digging around, we realized that an API endpoint that was only supposed to be used internally was being called fairly frequently during the incident.

What the problematic endpoint does is query all the users associated with a given workspace. It was initially created for our own internal needs, so while its performance is not great, that was considered acceptable since it was never intended to be used by UI clients or customers.

Unfortunately, it turned out that was no longer the case. As part of some recent UI improvements, we had been looking for existing endpoints we could reuse in order to expedite the updates.

This internal API was found and deemed something that would meet our needs for the time being. And while we were aware of its performance issues, we didn't expect them to be bad enough to cause a site outage.

So what happened? It turns out we had a workspace with over 400,000 users associated with it. The problematic API's job was to retrieve all users associated with a space, so it ended up querying all 400,000 of them, attempting to hold them in memory and, of course, utterly failing to do so. The Erlang VM crashed because it couldn't allocate the necessary memory, causing the incident.

Mitigation

We immediately limited the maximum number of users retrievable for any space being accessed. The fix itself was simple enough, but I have to admit we didn't expect to see 400,000 users associated with a single space. In the next round of fixes, we'll implement proper pagination so that all records remain accessible as before, but without affecting performance.
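A minimal sketch of what the cap plus keyset pagination could look like with Ecto; the table, columns, and module names are hypothetical, not our actual schema:

```elixir
defmodule MyApp.Workspaces do
  import Ecto.Query

  # Returns at most `limit` members per call. Pass the last user_id from the
  # previous page to fetch the next one, so no single request ever tries to
  # load 400,000 rows into memory at once.
  def list_members(workspace_id, after_user_id \\ 0, limit \\ 500) do
    from(m in "workspace_members",
      where: m.workspace_id == ^workspace_id and m.user_id > ^after_user_id,
      order_by: [asc: m.user_id],
      limit: ^limit,
      select: %{user_id: m.user_id, role: m.role}
    )
    |> MyApp.Repo.all()
  end
end
```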

Learnings

Not bugs

While a couple of the causes were careless mistakes, none of them were bugs in the traditional sense: when tested individually in our Staging and Release Candidate environments, everything worked as expected.

Only happens at scale

Our current environments are set up to test and verify features and bug fixes, but they're not built in a way that helps us verify scaling issues.

To address this, we're looking into setting up a canary environment where we can run load tests before releasing to production, and then roll changes out to our paid customers gradually via a rolling release to a subset of users.

Furthermore, we're looking into ways to limit the area of impact should similar issues happen in the future. One misbehaving endpoint, or one endpoint that suddenly becomes painfully slow due to the environment or data it relies on, should not bring down the entire system.

Technical debt

A couple of the causes were somewhat-known issues that had been deprioritized in favor of higher priorities. They all happened to come back and bite us in the same month, which was unfortunate timing.

Moving forward, any known issues will be addressed either before release or as an immediate follow-up. If it can’t be immediately resolved, we’ll at least mitigate the impact to make sure it doesn’t blow things up.

Summary

Thank you, our customers, for using our service. These problems only start occurring with high usage, so in our eyes they are good problems to have.

However, we understand how critical our service is for our customers and we will continue to improve on the process and automation of delivering changes. Our goal is to minimize disruptions and outages for our customers regardless of the origin of the issue.

Darwin D Wu · Engineering reliability and scalability / VP of Engineering @oVice / SWE & SRE