Redesigning the network architecture of a popular mobile PvP shooter

War Robots Universe · MY.GAMES · Jan 19, 2024

War Robots, a popular mobile PvP shooter, grew out of a Photon prototype that hadn't been fully rewritten in nine years. We couldn't change things too much without impacting the player experience, so we needed a good plan. This is that story.

Hello, I’m Sergey Kamin, senior programmer with Pixonic studio at MY.GAMES. In this capacity, I primarily solve network-related problems between the client and the server. In particular, for the last few years, I’ve been improving the network stack of our flagship project: War Robots.

Since War Robots is a multiplayer shooter, everything related to network interaction naturally plays a big role. Players should be able to play the game comfortably, and it falls on our shoulders to provide the conditions that make that happen. To that end, in this article I'll share how we reworked the outdated network-interaction architecture at the game's core, which we did after getting rid of Photon (a process you can read about in this article). Let's begin by discussing how things used to be.

The network stack before refactoring

For historical reasons, the War Robots project grew out of a Photon prototype and had not been fully rewritten in the 9 years since. Network interaction initially used a P2P model in which one main client, chosen from among the connected clients, managed the game room; this architecture made it possible to release the project quickly, and infrastructure costs were minimal.

In the game, each player controls their own mech, simulates it (that is, handles its position, physics, abilities, and so on), and then sends the result to the other players via the main client. Over time, this model ceased to be adequate in terms of connection reliability and protection from cheaters. So, without any major architectural alterations, the project was moved to interaction through a server, which took on the functions of the room leader as well as some mechanics, such as dealing damage and applying status effects. That said, most of the game simulation remained on the client, and the server knew nothing about the physics of the world.

Previously, object synchronization occurred via two Photon mechanisms: remote procedure calls (RPCs) and regular transfers of an object's serialized state. Each object was independent and consisted of many components, each of which included both gameplay logic and network interaction.

As a result, the state of the world was divided between a large number of entities and synchronized through unrelated RPCs. Because of this, these events had to be sent and received in a strict order: if they arrived out of order, the state could fall out of sync.

So, almost all synchronization took place over a reliable channel with a guarantee of delivery and consistency. If, at some point, a client lost connection with the server, the server had to send all the latest events in order to restore the state of the world on the client.

The problems of the old system

This need to send all messages through a reliable channel placed a heavy load on the network and led to unnecessary delays in the game. The delay was especially noticeable when receiving voluminous messages with large amounts of data, for example, a message about a mech spawn (which contains all the information about the mech, its abilities, balance, and so on). This was all exacerbated by the fact that we were working over mobile networks, which are characterized by frequent packet loss and high latency due to real-world obstacles, user movement, and long distances.

Since each message stored some kind of atomic change, it was very important to apply them in a strict order so that the state didn't go out of sync between players. Accordingly, we had to stop parsing network messages until a mech had spawned (that is, for quite a long time); this is because the logic for processing those messages lived on the mech instance itself, which could only receive them after spawning was complete. This approach didn't allow world changes to be processed in parallel, since the order of message parsing had to be maintained, and for the player this looked like network lag.

Also, with such a system, each new network mechanic required us to think about how to restore it when the connection breaks; this is where errors usually occurred, because testing the different disconnect scenarios in complex mechanics can be difficult.

For example, abilities in War Robots are state machines with several states. To ensure that connection restoration is reliable, you need to somehow break the connection in each state and at each transition between states; this is almost impossible to do manually, and it would have been very difficult to automate in our architecture. The approach is also generally hard to scale: as the amount of content grows, it becomes increasingly difficult to test and fix so many state combinations and transitions. Nevertheless, we wanted to simplify the support of such content, because the game is constantly growing.

So, here’s what we did to solve the problem.

Coming up with a rough plan

We identified the main goals of our refactoring:

  • Reducing costs for scaling content in the game
  • Freeing ourselves from the need to maintain packet order
  • Gaining the ability to easily restore the states of the world (for example, in the case of reconnection)
  • Providing the ability to apply changes to the game world when it's more convenient for us, rather than at the exact moment the data arrives

First, we decided that all parts of the world state needed to be collected into one entity so that the client always had a full picture of what was currently happening in the game. This would solve the problem of restoring the world: we simply display everything that's in this state. (Accordingly, in order for the game to update its state, we needed to regularly send updates to this entity every tick.)

Next, in order to reduce the size of each state and put less load on the user's network, we opted to move heavy, immutable data into a separate entity, the RuleBook, and to send this data much less frequently, and only when necessary. For example, this data includes all player IDs and nicknames, the configs for the mechs that players can spawn, and so on. We also decided to move all string values there, so that the main state stores only numeric values and indices.

To solve the problem of the network queue stalling during spawns, we came up with the idea of making synchronization of the main state and of the big data independent of each other. In other words, the main state is updated in its own cycle regardless of whether large data is still pending; in theory, this would allow us to update already-spawned mechs without noticeable delays, while a new mech spawns once its big data has finally arrived.

Once we had a complete world state, the old network messages would no longer be relevant, and we could afford to send updates through an unreliable channel, which should reduce in-game delays.

On top of that, we also had the idea of adding bit packing and delta compression to further reduce the size of regularly sent data.

Ultimately, we wanted to have an independent storage of the complete state of the game world that would be optimally updated regardless of the mechanics in the game, and that could be easily expanded without explicitly changing the network synchronization logic.

Difficulties we had to mind

The advanced age of the project brought with it a certain number of difficulties.

First, even though the game was released a long time ago, a lot of people are constantly playing it. So, we couldn't afford to change things too much, because doing so might have impacted the player experience and, in turn, hurt our product metrics.

For the same reason, we needed a plan to return to the old logic if critical errors were discovered with the new implementation — downtime is unacceptable and updates must be unnoticed by the players.

The game is large, so the transition to the new stack had to happen iteratively, with a gradual rollout of the new logic. To make this possible, the new protocol had to support multiple versions, so that the server could communicate with both old and new clients during the transition period of an update.

A big problem was that the simulation of the world was distributed between all the clients and the server. Switching to a traditional shooter architecture with full server authority and server-side physics simulation wasn't an option: it would greatly change the feel of the game and require rewriting the entire game at once.

Every change to the network interaction mechanics posed huge risks that we couldn't afford to cover with regular testing alone, but something still had to change, because the network logic was deeply embedded in the game mechanics logic. We had to refactor these places and create new abstractions, without changing external behavior where possible.

Moving on from plans and difficulties, let’s discuss implementation.

State entities

All network data was divided into three large entities: RuleBook, ServerState, and ClientState.

The RuleBook contains large data that rarely changes during a match; this includes player data, spawnable-robot parameters, AI parameters, and immutable match data (such as start and end times). All entities inside the RuleBook have their own unique index which can be referenced from the states below.

ServerState contains the complete shared state of the world that needs to be displayed to the client; this includes the mechs, their durability, abilities, position and movement, gun condition, and so on. All data is stored as numbers, Boolean values, enums, and references to entities in the RuleBook (in the form of indices).

ClientState is similar to ServerState; the difference is that it contains the data the client itself is responsible for: the position and movement of the player's mech and the state of its abilities. Note also that fields enabling interaction in a request/response format are distributed between ClientState and ServerState. For example, the client state contains a collection of requests, which the client adds to, and the server state contains a collection of responses, from which the client reads the results. This mechanism is used, for example, to request the spawn of a mech with certain parameters.

Remote clients don’t receive the client state directly; instead, the server transfers data from the received ClientState to its ServerState, which is sent to all players.
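
To make this split concrete, here's a heavily simplified sketch of how these entities might be laid out. The field names and helper types below are illustrative assumptions, not our production classes (the real ClientState, with its generated serializer, appears later in this article). The key points are that the frequently sent states hold only numbers, enums, and RuleBook indices, and that requests and responses live in opposite states:

using System.Collections.Generic;

// Simplified illustration: field names and helper types are assumptions, not the production classes.
public class RuleBook
{
    public List<string> PlayerNicknames = new List<string>();     // heavy, rarely changing data
    public List<MechConfig> MechConfigs = new List<MechConfig>(); // spawnable robot parameters
}

public class MechConfig { public string Name; public int MaxDurability; }

public class ServerState
{
    public List<MechState> Mechs = new List<MechState>();                  // the full shared world state
    public List<SpawnResponse> SpawnResponses = new List<SpawnResponse>(); // read back by the requesting client
}

public class ClientState
{
    public MechState OwnMech;                                            // data the client is authoritative for
    public List<SpawnRequest> SpawnRequests = new List<SpawnRequest>();  // requests the client adds
}

public class MechState
{
    public int ConfigIndex;   // index into RuleBook.MechConfigs; no strings in the frequently sent state
    public int Durability;
    public float X, Y, Z;
}

public class SpawnRequest  { public int RequestId; public int ConfigIndex; }
public class SpawnResponse { public int RequestId; public int MechIndex; }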

RuleBook synchronization

Along with each server state update, the client receives the index of the latest RuleBook. If this index differs from the one the client holds, the client needs to download a more current version from the server. This works in a request/response format, transmitted through a reliable channel and serialized using MessagePack. If for some reason the update fails, the client will see the discrepancy in the indices again and simply repeat the request.
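
As a rough sketch of this check (the class and method names here are assumptions made for illustration, not our actual API), the client-side logic might look something like this:

// Hedged sketch: class, method, and field names here are illustrative assumptions,
// not the actual production API.
public class RuleBookSync
{
    private int _localIndex = -1;   // index of the RuleBook version we currently hold

    // Called for every regular server state update, which carries the index of the
    // latest RuleBook available on the server.
    public void OnServerStateUpdate(int ruleBookIndexInUpdate)
    {
        if (ruleBookIndexInUpdate == _localIndex)
            return; // our copy is current

        // Request the newer RuleBook over the reliable request/response channel.
        // If the download fails, the next state update will show the mismatch again
        // and the request is simply repeated.
        RequestRuleBookFromServer(ruleBookIndexInUpdate, (index, payload) =>
        {
            ApplyRuleBook(payload);   // e.g. deserialize with MessagePack and swap in
            _localIndex = index;
        });
    }

    // Hypothetical helpers, stubbed out for the sketch.
    private void RequestRuleBookFromServer(int index, System.Action<int, byte[]> onReceived) { }
    private void ApplyRuleBook(byte[] payload) { }
}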

Synchronization of changeable state

Synchronization of ServerState and ClientState works similarly, using the same mechanism pointed in opposite directions. Synchronization occurs regularly on a tick whose frequency is inherited from the old logic for updating mech positions: 10 times per second. This frequency is sufficient to ensure visual smoothness using interpolation, and it doesn't overload the server or client in terms of traffic or calculations.

There are two kinds of state update: full and delta. A delta is sent when the sender knows which update the recipient has recently acknowledged, and the delta is calculated against that acknowledged state's index. Otherwise, the sender keeps sending the full state until it receives confirmation from the recipient. Confirmation works by sending the last received index inside the state going in the opposite direction: the server confirms receipt of the client state inside the server state and vice versa, so no additional confirmation messages are needed.

Both the sender and the recipient keep a limited state history in a circular buffer in order to calculate and apply deltas. The sender computes the delta from the acknowledged state in its buffer to the actual state and sends it. After sending, it copies the actual state into the next cell of the buffer and switches the simulation to it. The recipient takes the baseline state (the one the delta was computed from) from its own buffer, applies the delta to it, and thus obtains the actual state, which also remains in the buffer for further delta calculations.

Let's look at an example: say the confirmed state number is 10 and the sender's current state is number 14. The sender takes state number 10 and state number 14 from its buffer, calculates the delta between them, and sends it. The recipient knows that it has already received state number 10 and that the delta leads to state number 14. It takes state 10 from its buffer, applies the delta to it, and puts the result into the buffer at index 14.
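
To illustrate, here's a minimal sketch of the sending side of this scheme; the class and method names and the buffer size are assumptions for the example, and acknowledgment handling is simplified:

// Hedged sketch of the sending side; class, method, and field names are illustrative
// assumptions, the buffer size is arbitrary, and acknowledgment handling is simplified.
public class StateSender<TState> where TState : class
{
    private const int BufferSize = 32;                            // limited history in a circular buffer
    private readonly TState[] _history = new TState[BufferSize];
    private int _lastAckedTick = -1;                              // confirmed through the opposite state

    public byte[] BuildUpdate(TState current, int currentTick)
    {
        // Remember what we are sending so later deltas can be computed from it.
        _history[currentTick % BufferSize] = CloneIntoBuffer(current);

        if (_lastAckedTick >= 0 && currentTick - _lastAckedTick < BufferSize)
        {
            // The recipient has confirmed a tick that is still in the buffer (say, 10),
            // so we send only the delta from that tick to the current one (say, 14).
            TState baseline = _history[_lastAckedTick % BufferSize];
            return SerializeDelta(baseline, current, currentTick);
        }

        // No usable confirmation yet: keep sending the full state.
        return SerializeFull(current, currentTick);
    }

    public void OnAckReceived(int ackedTick) => _lastAckedTick = ackedTick;

    // Hypothetical helpers, stubbed out for the sketch; the real project uses the
    // generated serializers and pooled copies described in the next section.
    private TState CloneIntoBuffer(TState state) => state;
    private byte[] SerializeDelta(TState from, TState to, int tick) => System.Array.Empty<byte>();
    private byte[] SerializeFull(TState to, int tick) => System.Array.Empty<byte>();
}

The receiving side mirrors this: it looks up the baseline state for the acknowledged tick in its own buffer, applies the received delta, and stores the result under the new tick's index.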

Serializer

To implement the protocol described above, we needed to develop our own serializer with support for delta compression. As a result, the serializer supports the following operations:

  • State cloning
  • Serialization/Deserialization of the full state
  • Serialization/Deserialization of state delta

This architecture is similar to MessagePack for C#: there is a set of serializer classes, each of which works with one specific type and recursively calls other serializers for all the fields of its class. Each serializer writes its data to a bit stream and reads it back in the same order. (Cutting fields off under certain conditions, such as versioning, is also supported.) To calculate deltas, we decided that in the first iteration it would be enough not to serialize read-only fields and fields that had not changed relative to the previous state.
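
The core idea for a single field can be sketched like this; IBitStream and its members are hypothetical stand-ins for the BitWriter/BitReader used in the real code, invented only for this example:

// Hedged sketch of the per-field delta idea; IBitStream and its members are hypothetical
// stand-ins for the real bit stream types, invented only for this example.
public interface IBitStream
{
    void WriteBool(bool value);
    void WriteUInt(uint value, int bits);
    bool ReadBool();
    uint ReadUInt(int bits);
}

public static class IntFieldDeltaSerializer
{
    // Write a 1-bit "changed" flag, then the bit-packed value only if it actually changed.
    public static void SerializeDelta(IBitStream stream, int fromValue, int toValue, int bits)
    {
        bool changed = toValue != fromValue;
        stream.WriteBool(changed);
        if (changed)
        {
            stream.WriteUInt((uint)toValue, bits);
        }
    }

    // If the flag says "unchanged", keep the value from the baseline state.
    public static int DeserializeDelta(IBitStream stream, int fromValue, int bits)
    {
        return stream.ReadBool() ? (int)stream.ReadUInt(bits) : fromValue;
    }
}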

To automate the creation of serializers, we wrote a generator that accepts a type with attribute-marked fields. The attributes specify:

  • Interval of protocol versions in which the field is serialized
  • Bit size for numeric values
  • Size limit for collections
  • Variability of the field within the protocol (written only in the full state)
  • Predicates for cutting off data (for example, in order not to send information about the position of the mech to the client who is simulating it)
  • Ability to store null value
  • Serialization order (needed for version compatibility)

Our serializer also takes all class instances and memory for collections from a pool, to minimize allocations on the client given how frequently states are copied and received. The pool is reset when a slot in the state buffer is reused.

Here is an example of serializer code for one complex type produced by a generator:

[SerializeType]
public class ClientState
{
    [SerializeField]
    public StateTime ServerTime;

    [SerializeField(CollectionSize = CollectionSize.Bits32)]
    public StateList<ClientMechState> Mechs;

    [SerializeField(CollectionSize = CollectionSize.Bits8, IndexSize = CollectionSize.Bits128)]
    public StateEvents<PlayerSpawnRequest> SpawnRequests;

    [SerializeField(CollectionSize = CollectionSize.Bits512, IndexSize = CollectionSize.Bits524K, FirstVersion = Versions.Version94)]
    public StateEvents<HitEvent> HitEvents;

    public StateEvents<OrbitalRequest> OrbitalRequests;
}


public class ClientStateSerializer
    : IDeltaSerializer<ClientState>
{
    private readonly StateListSerializer<ClientMechState> _sMechsFT = new StateListSerializer<ClientMechState>(CollectionSize.Bits32);
    private readonly StateEventsSerializer<PlayerSpawnRequest> _sSpawnRequestsFT = new StateEventsSerializer<PlayerSpawnRequest>(CollectionSize.Bits8, CollectionSize.Bits128);
    private readonly StateEventsSerializer<HitEvent> _sHitEventsFVersion94T = new StateEventsSerializer<HitEvent>(CollectionSize.Bits512, CollectionSize.Bits524K);

    public Type SerializedType => typeof(ClientState);

    // ... Constructor, Clone

    public void SerializeFull(SerializerContext ctx, ref BitWriter writer, in ClientState toState)
    {
        ctx.Logger.BeginType(typeof(ClientState), false, writer.Position);
        try
        {
            if (toState == null)
            {
                throw new ArgumentNullException(nameof(toState), "Not nullable ClientState");
            }
            SerializeFullFields(ctx, ref writer, in toState);
        }
        finally
        {
            ctx.Logger.EndType(typeof(ClientState), false, writer.Position);
        }
    }

    private void SerializeFullFields(SerializerContext ctx, ref BitWriter writer, in ClientState toState)
    {
        ctx.SerializeFull(ref writer, in toState.ServerTime);
        _sMechsFT.SerializeFull(ctx, ref writer, in toState.Mechs);
        _sSpawnRequestsFT.SerializeFull(ctx, ref writer, in toState.SpawnRequests);
        if (ctx.TargetVersion >= (uint)Versions.Version94)
        {
            _sHitEventsFVersion94T.SerializeFull(ctx, ref writer, in toState.HitEvents);
        }
    }

    // ... DeserializeFull, SerializeDelta, DeserializeDelta
}

Useful tools

At first, we wrote type serializers by hand, made a lot of errors, and came to the conclusion that we needed to automatically check the correctness of serialization/deserialization. That said, it would be very difficult to cover our entire data structure with regular unit tests. So we decided to use an equivalence-class testing approach: we define the value intervals of interest for all types/fields, and the test iterates through all their combinations. This way, you get more complete coverage of the logic without writing test cases manually.
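
As an illustration, a test method in this style might look roughly like the sketch below; the MechState type (from the earlier sketch), the value sets, and the RoundTripFull helper are assumptions, and the NUnit-style assertions are used only for the example:

// Hedged sketch of an equivalence-class test; the state type, the value sets,
// and the RoundTripFull helper are assumptions made for the example.
[Test]
public void FullSerialization_RoundTripsAllValueCombinations()
{
    int[]   durabilities  = { 0, 1, 255, 100_000 };   // interesting intervals for each field
    float[] positions     = { 0f, -1.5f, 9999f };
    int[]   configIndices = { 0, 7, 127 };

    foreach (int durability in durabilities)
    foreach (float x in positions)
    foreach (int configIndex in configIndices)
    {
        var original = new MechState { Durability = durability, X = x, ConfigIndex = configIndex };

        // Serialize the state in full, deserialize it back, and compare field by field;
        // the same loop structure is reused for delta serialization against a second state.
        MechState restored = RoundTripFull(original);

        Assert.AreEqual(original.Durability,  restored.Durability);
        Assert.AreEqual(original.X,           restored.X);
        Assert.AreEqual(original.ConfigIndex, restored.ConfigIndex);
    }
}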

We then faced the problem of reproducing errors. To solve this, we implemented a simple solution: recording all outgoing and incoming updates on both the client and the server to a dump file. If QA finds some kind of state-related error, they can easily send the corresponding, complete dump to the developers. This greatly simplifies debugging the protocol and mechanics because it makes it possible to easily and accurately replay what happened in the game, correct the error, and immediately check how the corrected logic works on the old data.
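
A minimal sketch of such a recorder might look like this (the file format and all names here are assumptions for the example):

// Hedged sketch of update dumping; the file format and all names here are assumptions.
public class NetworkDumpRecorder
{
    private readonly System.IO.BinaryWriter _writer;

    public NetworkDumpRecorder(string path)
    {
        _writer = new System.IO.BinaryWriter(System.IO.File.Create(path));
    }

    // Called for every outgoing and incoming state update, on both the client and the server.
    public void Record(bool outgoing, int tick, byte[] payload)
    {
        _writer.Write(outgoing);
        _writer.Write(tick);
        _writer.Write(payload.Length);
        _writer.Write(payload);
    }

    public void Close() => _writer.Dispose();
}

A matching reader can then replay the recorded updates through the same deserialization code, reproducing exactly the sequence that QA observed.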

Also, while working on pooling and performance, we needed a benchmark that could be fed a pre-recorded dump of a real player battle and, under controlled and repeatable conditions similar to real ones, be used to optimize the serialization and deserialization process.
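
A sketch of such a benchmark, assuming the dump has already been read into a list of update payloads (the names and the stubbed deserialization call are placeholders, not our actual benchmarking code):

using System.Collections.Generic;

// Hedged sketch of a replay benchmark; the names and the stubbed deserialization call
// are placeholders, not our actual benchmarking code.
public static class SerializationBenchmark
{
    public static void Run(IReadOnlyList<byte[]> recordedUpdates, int iterations)
    {
        var stopwatch = System.Diagnostics.Stopwatch.StartNew();

        for (int i = 0; i < iterations; i++)
        {
            foreach (byte[] update in recordedUpdates)
            {
                DeserializeUpdate(update);   // the code path being measured and optimized
            }
        }

        stopwatch.Stop();
        System.Console.WriteLine($"Replayed the dump {iterations} times in {stopwatch.ElapsedMilliseconds} ms");
    }

    // Hypothetical stub standing in for the real deserializer.
    private static void DeserializeUpdate(byte[] payload) { }
}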

Results and conclusions

  • First, we rolled out the system in the new PvE mode released in March 2023. This carried less risk of getting us into trouble, since a match is played by one player against bots. We then added temporary events and gradually brought in the other modes.
  • We caught one bug with non-unique mech IDs, switched them to indices unique within a match, and everything started working without noticeable problems. After that fix, all matches in production began to run on the new protocol, and this continues to this day. We occasionally receive complaints about freezes, but these are isolated cases, and we're gradually collecting information and fixing everything.
  • The full state with an incomplete list of mechanics amounted to 2–2.5 KB. This was a lot, and we needed to shrink it. We tested deltas, and the average size of a server state update dropped to about 1 KB, which almost fits into the MTU; fitting within it is now a priority for us in production.
  • We stopped receiving complaints from players about ending up in “another world” (a noticeable desynchronization of state due to the loss of events). This allowed us to lay the foundation for new modes and features, for example, replays and observer mode.
  • Because of the new system, we have convenient debugging tools and can now easily and accurately reconstruct what is happening in a network match.
  • There are still difficulties at the intersection of the old RPCs and the new state, where desynchronization occasionally causes visible artifacts. But all this will go away once we move the mechanics completely to the new protocol.

In general, the transition process took quite a long time, but it was carried out without critical problems. Next, we plan to move the remaining mechanics to the new stack so we can completely abandon the legacy code. At the same time, we're continuing to develop the network architecture itself, optimizing its computation time, memory use, and traffic. We're also working on reliability and on improving the analytics we collect so we can correct errors quickly.
