Shifting a highly loaded game project from Photon to custom solutions

MY.GAMES
Published in War Robots Universe
8 min read · Apr 20, 2023

The Photon engine provides a ton of solutions for creating multiplayer games, allowing us to spend less time developing routine features like matchmaking and balancing, and more time focusing on gameplay itself.

But, as is often the case in product development, universal solutions still require some tweaking to get things just right. Our game, War Robots, has been around for almost nine years and now has over 200 million installations, and the server infrastructure has changed several times as the project has scaled up. In our case, “polishing” meant implementing certain components ourselves: matchmaking and social features were migrated to separate services, and new game mechanics were implemented on the server for the sake of consistency. As a result, the only things left from Photon were the transport layer and a PUN layer on the client side, along with the associated costs: a license, a dependency on Windows and the .NET Framework, and excessive memory allocations on the client.

It became clear that, for us, the cost of the Photon framework exceeded its value, and something needed to be changed.

My name is Andrey Makhorin, I’m a Server Developer at Pixonic, MY.GAMES, and today I’ll tell you how we solved this problem.

Taking a square peg out of a round hole

Let’s lay out what we had in terms of the project’s server structure: a Master Server, which handled the balancing, and a Game Server, which, apart from processing battles, also allowed us to make API requests to microservices through a special Photon “hangar” room. This room had since been moved to a separate service, but it still looked out of place.

Our workflow looked something like this:

Pay attention to the two points of balancing: the Master Server distributes clients to API servers, and the Matchmaking Server chooses the Game Server. (This diagram deliberately omits the process of connecting to special rooms in the Master and API servers, otherwise it would’ve been even larger.)

Let’s digress a little here: no matter how well the business logic is isolated, transitioning a project like War Robots to a new transport layer is a lengthy process. In addition to refactoring itself, this requires:

  • Carrying out testing
  • Changing the deployment process
  • Preparing test environments
  • Buying and readjusting new servers
  • Updating documentation
  • Adjusting dashboards in the monitoring system, etc.

This amount of work could not conceivably fit within a single sprint, and, frankly speaking, the value of such refactoring to the business isn’t obvious at all. But if you break the transition into stages, so that each sprint ends with a tangible result, the idea is easier to promote. Between stages, the team can work on monetized features, and the refactoring doesn’t sit for long in a separate VCS branch that has to be kept up to date.

Starting off the process

Taking into account all the above, our first step was getting rid of Photon in the API Server.

The API Server was moved to TCP and protobuf; previously, like the game servers, it had used RUDP and Photon serialization. This meant that we did not need to rewrite the Master Server: we could simply delete it and switch to balancing with HAProxy.
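One practical detail of such a move: TCP delivers a byte stream rather than discrete messages, and protobuf messages are not self-delimiting, so the receiver needs some way to find message boundaries. The article does not say which framing we are talking about here, so the following is only a minimal sketch assuming a 4-byte length prefix; `frame` and `FrameDecoder` are hypothetical names, and the payloads stand in for serialized protobuf messages.

```python
import struct

# Assumed framing: a 4-byte big-endian length prefix before every payload.
HEADER = struct.Struct(">I")


def frame(payload: bytes) -> bytes:
    """Prepend the payload length so the receiver can find message boundaries."""
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Accumulates chunks read from a TCP socket and returns complete payloads."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list[bytes]:
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]
        return messages
```

The decoder keeps partial data between calls, so it does not matter how the network splits the stream into reads.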

Of course, you can’t just replace the client transport and give it to the players because errors can appear in completely unexpected places, and the release of a new application version in the stores can take hours or even days. Therefore, it was necessary to keep the ability to quickly switch to the old, proven transport.

For that reason, abstractions were added on the server, and it was divided into two parts: the one working with Photon, and the new one. Thus, after the complete transition, we only needed to remove the unnecessary project from the solution, with no need to spot clean the code. Ultimately, the code responsible for interacting with Photon changed neither on the server nor on the client, so we weren’t afraid that something would suddenly go wrong.
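The idea can be illustrated with a small sketch. It is not our actual code: the class names, the `transport` settings key, and the string return values are all hypothetical, and the two implementations are stand-ins for the real Photon and TCP code paths. What it shows is the shape of the approach: business logic depends only on an abstraction, and a setting decides which implementation is used.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """What the business logic sees; it never touches a concrete library."""

    @abstractmethod
    def send(self, payload: bytes) -> str:
        """Send one payload and return a description of what was done."""


class PhotonTransport(Transport):
    # Stand-in for the untouched, proven Photon code path.
    def send(self, payload: bytes) -> str:
        return f"photon/rudp: {len(payload)} bytes"


class TcpTransport(Transport):
    # Stand-in for the new TCP + protobuf code path.
    def send(self, payload: bytes) -> str:
        return f"tcp/protobuf: {len(payload)} bytes"


# Hypothetical registry; removing the old transport later means deleting one entry.
TRANSPORTS = {"photon": PhotonTransport, "tcp": TcpTransport}


def select_transport(settings: dict) -> Transport:
    """Pick the transport from settings, defaulting to the old, proven one."""
    name = settings.get("transport", "photon")
    return TRANSPORTS[name]()
```

Because the default is the old transport, a missing or reverted setting falls back to the proven path without a new client release.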

During the transition period, when the list of Master Servers was requested, the Profile Server returned the addresses of either the Photon masters or HAProxy, depending on the settings. And after the complete transition, the scheme looked like this:

Routine

The next step was getting rid of Photon in the game. The actions were the same as with the API Server:

  • We created a benchmark to compare the two libraries implementing RUDP
  • We chose the best-performing library
  • We carried out refactoring
  • We divided the server into two services working with different transports
  • We added a switch
  • We passed things over to the QA department

Finally, after several iterations of testing and polishing, we were ready to update the production servers.

The War Robots project routinely conducts “external testing” — this is when players download a separate version of the game and try out the content that is still in development, and we’re able to collect feedback and metrics without worrying that something may break in production. Similarly, before release, we decided to test the new transport performance.

Trouble on the horizon

We were in for a surprise during external testing. Indeed, everything was going quite badly: robots teleported, no damage was being done, players were thrown out of the room. It was clear that the problem was the load, because during our internal tests, we had no similar experiences. In fact, the benchmark indicated that everything was fine — so what was the issue?

The root of the problem was within the benchmark itself. In addition to changing the transport, we planned to remake the protocol. Without going into the details, let’s just say that there should have been fewer messages, and they should have been larger, without exceeding the MTU (Maximum Transmission Unit). The transition to the new protocol was supposed to take place in the next iteration, but the benchmark was designed around the new protocol, not the current one. The current one had many small messages, and the selected library, which aims to send data as quickly as possible, didn’t support merging them into one UDP (User Datagram Protocol) packet. As a result, send calls were made too frequently, throughput suffered, and this led to the problems we saw.
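To make the difference concrete, here is a minimal sketch of what merging small messages into one packet means. It is an illustration only, not the code of either library: `coalesce`, the 2-byte length prefix, and the 1200-byte budget (chosen to stay safely below a typical 1500-byte Ethernet MTU) are all assumptions.

```python
import struct

MAX_DATAGRAM = 1200           # assumed payload budget per UDP datagram
LEN = struct.Struct(">H")     # assumed 2-byte length prefix per message


def coalesce(messages: list[bytes], limit: int = MAX_DATAGRAM) -> list[bytes]:
    """Pack small messages into as few datagrams as possible, in order."""
    datagrams = []
    current = bytearray()
    for msg in messages:
        if LEN.size + len(msg) > limit:
            raise ValueError("message does not fit into a single datagram")
        record = LEN.pack(len(msg)) + msg
        # Flush the datagram being built if this record would overflow it.
        if current and len(current) + len(record) > limit:
            datagrams.append(bytes(current))
            current = bytearray()
        current += record
    if current:
        datagrams.append(bytes(current))
    return datagrams
```

With these assumed numbers, 100 messages of 20 bytes each fit into 2 datagrams, so the transport makes 2 send calls instead of 100. A library that sends every message immediately pays the per-call cost 100 times, which is the kind of overhead a benchmark built around fewer, larger messages would not reveal.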

Fortunately, the library interfaces were very similar, and shifting to the one more suitable for our protocol took less than an hour. After this, we conducted another testing session and saw that everything worked as it should.

The only thing left was to pack everything into Docker images. Another catch was waiting for us there: the services used ServerGC, but a bug in .NET completely prevented garbage collection from being triggered in containers, which led to a reboot. We first encountered this when we wrapped one of the helper services in an image and ran it in K8s. Of course, we were not the first to encounter this problem: this article contains some details on the topic.

So, we just switched to WorkstationGC for our helper service: it runs on a separate machine, consumes fewer resources, and doesn’t affect the players in any way. But everything is more complicated with the Game Server: there, extra collector activations could affect the user experience.

We were lucky: by the time we were ready to deploy new services to production, a stable version of .NET 6 had been released, with this bug fixed. So, we just reworked the base images, did a few playtests, and waited for release.

Release and relief

As for the release itself, in a nutshell, everything went smoothly. At first, we turned on the new transport only during the day, when we could react quickly to any emergency, and switched the game back to the old configuration at night. After just a couple of weeks, we were able to analyze the results and compare the updated charts with the previous ones. Let’s take a look.

The number of failed connections to the API server decreased:

Before:

After:

The main indicator we use to track the server’s impact on UX quality has also improved. To be clear, the transport doesn’t directly affect it, because it shows the time it takes for the command containing the robot’s position to pass through the server. Nevertheless, this was a nice bonus for us:

.999 quantile before:

.999 quantile after:

.9999 quantile before:

.9999 quantile after:

CCU before:

CCU after:
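For readers less familiar with the metric: the .999 quantile is the value that 99.9% of the measured pass-through times do not exceed, so it describes the rare slow cases rather than the average. Below is a minimal sketch using the nearest-rank definition; this is an assumption for illustration, since a monitoring system may estimate quantiles differently (for example, from histogram buckets).

```python
import math


def quantile(samples: list[float], q: float) -> float:
    """Nearest-rank quantile: the smallest sample with at least a fraction q
    of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based rank
    return ordered[rank - 1]
```

The further out the quantile, the more samples are needed for it to be meaningful: with 10,000 measurements, the .9999 quantile is determined by the single slowest or second-slowest command.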

I must admit that the comparison isn’t totally fair, because the machine configuration has changed: the CPU frequency has increased from 3.6 GHz to 4 GHz. On the other hand, the number of features (meaning the number of RPCs) has also increased:

Number of RPCs before:

Number of RPCs after:

Also, the time for connecting to a battle has noticeably decreased:

Wrapping it all up nicely

All these performance improvements weren’t actually our goal, so it would be strange to expect drastic changes — anything achieved was just the icing on the cake. Instead, our main goals were moving away from Windows, gaining the ability to use new versions of .NET, and getting rid of licenses.

Photon as a network library is an excellent solution even for highly loaded games, and if you use all the features provided by the developers, then you could even consider it to be irreplaceable. But such complex frameworks always entail overheads in one form or another, and at a certain stage of project development, you can gain by abandoning them in favor of your own, or simply more tailored, solutions.

Behind the scenes of gamedev. Creators of the War Robots franchise from the Pixonic team at MY.GAMES share their secrets and experience.