If you’re unaware, we had a bit of an issue at what is probably one of the most popular times to hang out in VRChat, and traditionally the time when we set concurrent user records: New Year’s Eve, right around when midnight hits in EST. VRChat went down. Great timing, yeah? Let’s talk about what happened.
First off, before the issues started, we were chugging along with no problem. In fact, we shattered our previous concurrent player record with over a mind-boggling 40,000 players online at the same time. Our DevOps team had made sure that our servers were buffed up and ready for a huge boost in players over the holidays. Both our API servers and our real-time networking servers were reporting green across the board. Everything was looking good!
About twenty minutes to midnight (EST), people suddenly were unable to get any file assets from our servers (no avatars, worlds, icons, nothin’). Our API stopped talking to people. This meant that your menu wouldn’t work, if you left the world you were in you were stuck in limbo, and the client thought that your internet was dead because it couldn’t see the configuration file it needs to get every time it starts up. Several team members noticed something was wrong in-app very quickly. Players who had been partying seconds before started alt-tabbing over to Discord. Our chat mods began sweating, excessively and immediately.
Our first thought was that our services were having problems keeping up with the numbers we had. However, every single metric, statistic, alarm, siren, alert, and klaxon we had set on our own services was saying that things were fine, and had been fine, and our servers were wondering where all those cool people had gone. That wasn’t it.
So, VRChat uses a bunch of services to ensure that our servers are safe and protected from things like DDoS attacks and various other security concerns.
One of these services had, unknown to us, set a hard limit on the number of requests that we could receive per second. From what we have found out, there was no way we could have known what that limit was, and no way we could have set it higher without their help.
When we passed that number (around 10 minutes before problems started being widely reported), our security partner assumed that we were being hit by a denial of service attack, and started taking automated measures. Of course, we weren’t being attacked — we were just quite popular.
As users attempted to log in, get their social lists, join worlds, switch avatars, and do all the things that VRChat does, the automated system began to decide that because all these requests were happening over that hard limit, they were now all bad requests. Legitimate users turned into attackers in the eyes of our security partner’s automated systems.
The automated system’s response was to immediately shut down all traffic to our systems. Obviously, this is not the correct response to things being totally fine, but very busy.
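To make that failure mode concrete, here is a toy sketch of a hard per-second cap that flips into a full block the moment traffic crosses it. This is only an illustration of the behavior described above, not our security partner’s actual system: the class name, the 1,000 requests-per-second limit, and the traffic numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class HardLimitGuard:
    """Toy model of a hard per-second cap that trips a full block."""

    limit_per_second: int  # hypothetical value; the real limit was never shared
    blocked: bool = False

    def observe_second(self, requests_this_second: int) -> int:
        """Return how many requests get through during one one-second window."""
        if self.blocked:
            # Once tripped, nothing gets through, legitimate or not.
            return 0
        if requests_this_second > self.limit_per_second:
            # Going over the cap is treated as an attack: block everything.
            self.blocked = True
            return 0
        return requests_this_second


guard = HardLimitGuard(limit_per_second=1000)
traffic = [800, 950, 1001, 900, 700]  # made-up requests per second
served = [guard.observe_second(n) for n in traffic]
print(served)  # [800, 950, 0, 0, 0]
```

Note that traffic dropping back under the cap does not help in this model: once the block trips, it stays tripped until someone steps in.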
We were able to implement a short-term solution to get services back online within an hour and a half.
On any other day, 90 minutes of downtime wouldn’t be too awful, but 90 minutes on this particular day meant that a ton of people hanging out with their friends, waiting on the EST and CST New Year, missed their chance to see the ball drop. Considering the state of things, missing out on your New Year’s countdown with your friends made a good number of people understandably frustrated.
Going forward, we have established a new, much higher set of limits with our security partner. We’ll be aware if we begin to approach that limit again, and can warn them in advance. There are some things the VRChat client can do to help improve this behavior as well, and those changes will be going out with the next few releases.
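As a rough idea of what "being aware" can look like, here is a minimal sketch of a headroom check that compares recent peak traffic against a known limit and speaks up before the limit is reached. The function name, the 80% warning threshold, and every number below are assumptions for the example, not our real tooling or our real limits.

```python
def limit_headroom_alert(recent_rps, limit_rps, warn_fraction=0.8):
    """Return a warning string if peak load is near the limit, else None.

    recent_rps: recent requests-per-second samples (hypothetical data).
    limit_rps: the limit agreed with the provider (hypothetical value).
    warn_fraction: fraction of the limit at which to start warning.
    """
    if limit_rps <= 0:
        raise ValueError("limit_rps must be positive")
    if not recent_rps:
        return None  # no samples yet, nothing to report
    peak = max(recent_rps)
    usage = peak / limit_rps
    if usage >= warn_fraction:
        return f"peak {peak} rps is {usage:.0%} of the {limit_rps} rps limit"
    return None


print(limit_headroom_alert([4200, 5100, 8300], limit_rps=10000))
# peak 8300 rps is 83% of the 10000 rps limit
print(limit_headroom_alert([4200, 5100, 6000], limit_rps=10000))
# None
```

Warning at a fraction of the limit, rather than at the limit itself, leaves time to contact the provider before anything gets blocked.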
Finally, we are looking into ways we can improve our security detection and response systems. Our traffic patterns can look quite unusual next to other applications, so we’ll work hard to ensure our relationship with our security partner takes that into account.
Needless to say, we want to apologize profusely for this. We know how important the New Year’s celebration is to everyone. Of all the times of the year to have an issue like this, this is quite literally the worst possible one. Thankfully, we were able to move quickly (yoinking some team members out of VRChat parties in the process) and get it handled.
We saw a lot of people log back on after the problem was solved — just a few thousand short of the peak before we got knocked offline. This means that y’all are extremely patient. Thank you for that. Trust us, we know stuff like this is frustrating.
We’re already working with this particular service partner and all of our other service providers across the board to ensure this class of issue does not occur again.
Thank you for your patience, and we are working hard to ensure this particular issue never happens again.