The oVice incident report on Dec 1, 2021

showwin (Shogo Ito)
Published in oVice
Dec 17, 2021 · 8 min read

On December 1, 2021, from 7:38 to 12:41 UTC, there was an outage that made some functions of oVice unusable. I’d like to describe why the recovery took about five hours, far longer than we initially expected, and the remediation steps we took during that time, in chronological order. The Japanese version of this report is here.

A brief failure report was posted on the status page at https://status.ovice.io/incident/60976 immediately after the incident began.

What happened in oVice

The Echo server (see below), which is responsible for real-time processing, could not keep up with the high volume of requests and stopped, making the following functions unavailable in the oVice space:

・Cannot enter or leave meeting rooms
・Cannot update the emoji on your avatar
・Cannot change the layout of a workspace

The overall architecture of the infrastructure is shown in the figure below. The API server, which handles functions other than those listed above, and the Core server, which handles meeting functions, were operating normally, so login, calls, and other essential functions of oVice remained available.

As described below, the API server was also temporarily overloaded during the recovery process, making it difficult to access oVice.

The overall architecture of oVice

When the anomaly was detected

The first sign came at 7:34, when we received an alert that the API server’s response time had become abnormally slow. The API server, which usually returns responses in about 150 ms at p50, was taking 5–6 seconds at p50.

When a user logs in to oVice, about 20 requests are sent to the Echo server, and the Echo server in turn queries the API server. At that time, the number of requests from the Echo server to the API server was abnormally high, and the API server was under heavy load. The API server’s response time appeared to return to near normal around 7:37, but that was only because the Echo server process had gone down and was no longer sending any requests to the API server, which reduced the load.

When one Echo server goes down, its clients request to reconnect every 30 seconds (the maximum retry interval), which increases the load on the remaining Echo servers. As a result, every Echo server became overloaded and its process went down; at that point they were all detached from the load balancer and could not accept requests. (See Future Prevention 1.)

At 7:44, we received a notification from our external health-check monitoring that the Echo servers had failed.

Remediation Timeline

From here on, since our remediation log does not record timestamps, we will describe only the sequence of actions we took.

We wanted to restart the Echo server, but as mentioned above, we expected it to come under heavy load again right after startup, because the clients keep sending reconnect requests every 30 seconds. Therefore, we decided to increase the server’s specs before starting it up. We chose vertical scaling over horizontal scaling because we were not in a position to quickly create an equivalent environment through containerization or an AMI. (See Future Prevention 2.)

However, contrary to our expectations, the process was killed again and the same phenomenon recurred. We assume this is because the Node.js process was started without the --max-old-space-size option, so its memory limit stayed at the default 1456 MB regardless of the server’s specs. This is only a guess, though: from the OS’s point of view memory usage was not excessive, so there is no OOM message in /var/log/messages.
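As a side note, a quick way to test this kind of guess is to read the effective heap ceiling directly from V8. The short TypeScript sketch below is only an illustration (the file name is hypothetical), not part of the Echo server’s code:

// check-heap-limit.ts — print V8's effective old-space ceiling so you can
// confirm whether --max-old-space-size (or NODE_OPTIONS) is actually applied.
import { getHeapStatistics } from "v8";

const limitMiB = getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`heap_size_limit: ${limitMiB.toFixed(0)} MiB`);

// Compile and run with and without the flag to compare, e.g.:
//   node --max-old-space-size=4096 check-heap-limit.js

Running this under the same conditions as the production startup script would have confirmed whether the limit really was stuck at the default.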

We then tried to start it with the --max-old-space-size option enabled, but the Echo server’s startup script uses the shebang

#! /usr/bin/env node

so it was not easy to pass the option. After a bit of research, we found some articles saying that it is possible to pass options by rewriting the script as follows:

#!/bin/sh
":" //# ; exec /usr/bin/env node --<option> "$0" "$@"

We were aware of this solution while remediating, but it seemed a bit too hacky, so we decided not to adopt it. In hindsight, we should have prioritized recovery and applied it as soon as possible.

In other words, at that point we believed the Echo server could be scaled neither vertically nor horizontally.

The number of active users had decreased because of the incident and the time of day (7 PM in Japan, where many of our users are located), so we decided to restart all the Echo servers again, even though we had not made any specific improvements. It was about 10:10 at that point.

This time the Echo servers held up, but the number of queries to the API server skyrocketed, pushing the API server’s p50 response time to around 20 seconds. As a result, the Echo server could not respond to a client’s reconnect request within 30 seconds, so clients kept sending reconnect requests every 30 seconds, forever.
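For context, this is roughly what happens on the client side: when a connection attempt fails, socket.io-client reports it and automatically schedules the next attempt, so an overloaded server keeps doing work that no client is still waiting for. The host name below is hypothetical and the snippet is only a sketch of the default client behavior as we understand it:

// Illustrative socket.io-client usage; not our production code.
import { io } from "socket.io-client";

const socket = io("https://echo.example.com");

socket.on("connect_error", (err) => {
  // Fired when a connection attempt fails; socket.io-client then schedules
  // the next reconnection attempt automatically according to its settings.
  console.warn("handshake failed, will retry:", err.message);
});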

The following graphs show the API server’s metrics from the start of the outage to the end. You can see the number of requests jump suddenly at around 10:10. (The graphs use JST, so 10:10 UTC appears as 19:10.)

The number of requests to API servers
The CPU usage of API servers
The latency of API servers

We needed to resolve this situation as soon as possible, because the API server’s slow response time slows down almost every function, including login to oVice and chat.

As a solution, we had two options:

1. Vertically scale the API server
2. Release a patch that extends the client’s retry interval for reconnecting

Since the API server runs on PHP and has nothing preventing vertical scaling like the Node.js heap limit described above, we chose option 1.

We scaled up the API servers one by one starting around 10:30 and finished all of them by 11:00. During the process, all the Echo servers died again at 10:41, so we restarted them at 11:00, at the same moment the API server scale-up completed.

As a result, as you can see from the metrics above, the number of requests to the API server spiked again, but this time the API server handled it, and most clients appeared to escape the 30-second reconnect loop.

When we accessed the service ourselves, the Echo server responded within 30 seconds, but some users still could not get responses from it. The graphs above show the relationship between the number of requests to the API server and its latency: between 11:30 (20:30 JST) and 12:30, the number of API requests does not decrease even though the API server is returning responses with low latency, and requests keep coming from the Echo server. In other words, the Echo server could not respond to a specific group of clients within 30 seconds, and their reconnection loop continued.

You can also see this situation in the Echo servers’ CPU usage. Because we enable IP-address-based sticky sessions on the load balancer, each client always talks to a specific Echo server. The metrics below show the Echo servers’ CPU usage: some servers hold a constant CPU usage for a long time, and the number of network connections per server is quite uneven.

The CPU usage of Echo servers

We could not figure out why this particular group of clients was affected, and since the situation did not improve over time, we decided to restart the Echo servers (around 12:35).

As a result, the Echo servers returned normal responses to all users, and at 12:41 we determined that the service had recovered.

Future Prevention

Although the incident started with nothing more than the Echo server failing under a sudden burst of logins in a short period, it revealed a variety of issues and improvements.

1. Restart the process automatically

The Echo server was not configured to be restarted automatically by a process manager such as systemd. We will change the configuration so that the process restarts automatically even if it dies.

At the same time, we will modify the startup script to pass the --max-old-space-size option so that the Echo server is ready for vertical scaling.

2. Change to a horizontally scalable architecture

In the current server configuration, we modify the code directly on the VMs during deployment, so horizontal scaling is not easy. We are in the process of containerizing the service and running it on Kubernetes.

3. Improve the logic for querying the API server

We found that we can reduce the number of queries from the Echo server to the API server to about 1/20 by improving the request logic used when a client reconnects. This change will reduce the impact on the API server even when the Echo server is overloaded.
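The exact change is internal to the Echo server, but here is a rough TypeScript sketch of the idea, under the assumption that one client reconnect currently triggers roughly 20 near-identical lookups; every name in it (fetchUserFromApi, getUser) is hypothetical:

// Minimal sketch (hypothetical names, not our actual code): memoize the
// per-user API lookup for a short TTL so that the ~20 requests triggered by
// one client reconnect result in a single call to the API server.
type User = { id: string; name: string };

const TTL_MS = 10_000; // long enough to cover one reconnect burst
const cache = new Map<string, { value: Promise<User>; expiresAt: number }>();

async function fetchUserFromApi(userId: string): Promise<User> {
  // Placeholder for the real HTTP call from the Echo server to the API server.
  // (Global fetch requires Node 18+; otherwise substitute your HTTP client.)
  const res = await fetch(`https://api.example.com/users/${userId}`);
  return (await res.json()) as User;
}

export function getUser(userId: string): Promise<User> {
  const now = Date.now();
  const hit = cache.get(userId);
  if (hit && hit.expiresAt > now) return hit.value; // reuse in-flight or recent result
  const value = fetchUserFromApi(userId);
  cache.set(userId, { value, expiresAt: now + TTL_MS });
  return value;
}

Caching the promise rather than the resolved value also deduplicates requests that arrive while the first lookup is still in flight.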

4. Extend the reconnection interval

The root cause of this failure is the well-known thundering herd problem. One of the standard solutions is to add randomness (jitter) to the retry interval and to back off exponentially. Socket.IO, which we use, implements this by default, but the maximum interval was as short as 30 seconds, which made this recovery more difficult. We will extend it to a much longer interval.
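As a sketch of what that client-side change could look like, socket.io-client exposes its backoff behavior through the options below. The option names come from the Socket.IO documentation; the host name and the concrete values are illustrative, not our production settings:

// Illustrative socket.io-client configuration for a longer, jittered backoff.
import { io } from "socket.io-client";

const socket = io("https://echo.example.com", {
  reconnection: true,
  reconnectionDelay: 1_000,      // first retry after roughly 1 second
  reconnectionDelayMax: 120_000, // cap the exponential backoff far above 30 seconds
  randomizationFactor: 0.5,      // add jitter so clients do not retry in lockstep
});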

Conclusion

Our infrastructure architecture still has a lot of room to grow, and we were hesitant to disclose it publicly. Still, we decided to write this incident report to be transparent and to let our users know the current situation.

In a physical office, you don’t suddenly lose the ability to talk to each other. The same should be true for a virtual office. We want to make our service reliable enough that people can count on oVice as surely as night follows day.

Once again, we would like to apologize to all the users affected by this outage.
