Incident Analysis: 24/11/2017 Game Server Crash

Published in Simplay | Blog, Nov 26, 2017

On Friday, November 24th, 2017, we experienced a severe system issue and lost one of our game servers.

Simplay’s cloud infrastructure is characterised by superb availability and uptime, so our dev team takes such matters to heart. We worked throughout the weekend to resolve this issue as quickly as possible, and we’re happy to report that the system is fully operational again.

Analysing the crash and our attempts to fix the system led to some interesting insights, which we’ll share in this brief post.

Cloud Gaming 101

Simplay’s infrastructure consists of game servers, storage servers and additional infrastructure components (gateways, load balancers and others). We use several game servers (as opposed to stuffing all our GPUs into a single server) to achieve redundancy, a practice that proved useful in this recent crash.

When one of our game servers crashes, our GPU slot allocation is significantly hampered. That being said, the system SHOULD have been able to recover quickly and keep using the still-operational game servers in the cluster normally. Unfortunately, it did not.

Incident Analysis — Ghost Server

Our crashed game server could obviously no longer serve users. It SHOULD have been dropped from the cluster, but instead it remained in it, as a “ghost server” of sorts. Believing the ghost server’s GPU slots were still available, our virtualization agent allocated them to users who requested service. These allocations of nonexistent GPU slots resulted in connection errors, as there was no Cloud PC to connect to. As a result, some of our users experienced connection errors (error 4 and error 6) or endless connection attempts (“Warming up your virtual PC”) throughout the weekend.
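To make the failure mode concrete, here is a minimal sketch of a slot allocator that ignores any server whose last heartbeat is too old. It is not Simplay's actual virtualization agent: the `GameServer` type, the `allocate_slot` function and the 30-second timeout are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

# Assumed value for illustration; a real system would tune this.
HEARTBEAT_TIMEOUT = 30.0


@dataclass
class GameServer:
    """Hypothetical view of a game server as seen by the allocator."""
    name: str
    free_slots: int
    last_heartbeat: float  # seconds, on the same clock as `now`


def is_alive(server, now, timeout=HEARTBEAT_TIMEOUT):
    """A server counts as alive only if it reported in recently."""
    return now - server.last_heartbeat <= timeout


def allocate_slot(servers, now):
    """Reserve one GPU slot on a live server.

    Returns the name of the chosen server, or None when no live
    server has a free slot. Stale ("ghost") servers are skipped
    even if they still advertise free slots.
    """
    candidates = [s for s in servers if s.free_slots > 0 and is_alive(s, now)]
    if not candidates:
        return None
    # Prefer the live server with the most free slots.
    chosen = max(candidates, key=lambda s: s.free_slots)
    chosen.free_slots -= 1
    return chosen.name
```

With a check like this, a crashed server that stops sending heartbeats simply stops receiving allocations, and a user either gets a slot on a live server or a clear "no capacity" answer instead of an endless connection attempt.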

In addition, restoring the crashed game server and adding it back to the cluster turned out to be surprisingly complicated for the team. Some design choices made along the way in pursuit of development speed proved costly in terms of recovery time. These types of crashes and errors are to be expected during beta, and we used this opportunity to learn and re-examine the system inside and out.

Moving Onward

Moving on from this incident, we want to make sure we make the most of it. We learned a great deal and defined several tasks that will improve our system availability. Among others:

Add a standalone server-health monitor as a redundant method of removing failed game servers from clusters.
Revamp the game server installation/recovery process to allow faster and simpler setup and recovery.
Revisit previously made design decisions on GPU driver installation in new game servers.
Enforce a limit on new trial signups during periods of reduced availability, to ensure paying users retain great quality of service.
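As a rough illustration of the first and last tasks above, the sketch below evicts a server after a number of consecutive failed health probes and gates trial signups on the share of servers that remain healthy. It is only a sketch under stated assumptions: `HealthMonitor`, `trial_signups_allowed`, the three-failure threshold and the 75% ratio are hypothetical and do not describe Simplay's implementation.

```python
class HealthMonitor:
    """Hypothetical standalone monitor that evicts unresponsive servers."""

    def __init__(self, servers, max_failures=3):
        # Consecutive failed probes per server still in the cluster.
        self.failures = {name: 0 for name in servers}
        self.max_failures = max_failures  # assumed threshold
        self.total = len(self.failures)   # original cluster size

    def record_probe(self, name, ok):
        """Record one probe result; return True if the server was evicted."""
        if name not in self.failures:
            return False  # unknown or already evicted
        if ok:
            self.failures[name] = 0
            return False
        self.failures[name] += 1
        if self.failures[name] >= self.max_failures:
            del self.failures[name]  # drop the server from the cluster
            return True
        return False

    def healthy_ratio(self):
        """Fraction of the original cluster that is still in service."""
        return len(self.failures) / self.total if self.total else 0.0


def trial_signups_allowed(monitor, min_ratio=0.75):
    """Accept new trial signups only while enough capacity is healthy."""
    return monitor.healthy_ratio() >= min_ratio
```

Requiring several consecutive failures before eviction is one way to avoid dropping a server over a single lost probe; the right threshold depends on how often probes run and how quickly a dead server must be removed.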

We want to thank our amazing community for displaying patience and camaraderie over the weekend. We view these types of events as opportunities to learn and improve, and we’ll use these insights to deliver a better Simplay experience.

Written by Simplay