Pocket Gems Infrastructure: Building a Highly Concurrent Transactional Virtual World
At the end of 2017, War Dragons released Atlas Expansion. It’s a massive virtual world created for highly engaged players where they battle against each other in a persistent world. You can raid your enemy’s castle and take what was theirs as yours. The expansion is highly social, and you need to collaborate with 49 other teammates to decide who will be the main attacker, and who will be the defender.
To model the virtual world, we store each castle as a single model, and each player has a “physical” state in the castle. Any player in a team can attack any other player from another team as long as they are physically nearby. These attacks can occur sporadically, but during a war, it occurs in a high volume.
In summary, the problem is simplified into the following statements:
- We have a persistent world
- Imagine each castle as a single row/entity in the database
- In theory, every player not in your team can attack the castle
- Each attack from the player can affect the same row/entity
Pocket Gems servers are hosted on Google App Engine Standard using Python 2.7 Runtime. It allows us to build a highly scalable system and focus on delivering the best product to our players. We use Google Datastore, which is a non-relational database built on top of Google BigTable, which has limitations for write rate to an entity is 1/s.
Gameplay & Technology Progression
In the alpha version, traveling from one area to another area takes time (it can take hours to move from one area to another area). We assumed that the travel times would hide the contention from the player’s perspective. Whenever there is a contention, we retry the “actions” ( — termed WorkItems) until it eventually succeeds. For that, we used the task queue to store the WorkItems being executed by the player and keep the WorkItems there until it is completed.
The approach worked okay on a smaller scale, but it’s become much more noticeable when we add more players. The contention would cause a holdback in the queue, snowballing the effect to the whole handlers.
We realized we couldn’t keep ignoring the problem, especially when we need to change the gameplay to change the traveling time to almost none (from hours to less than 5 seconds of animation).
These precious 5 seconds give us the chance to finish the database transaction.
From the problem statement, we can roughly translate it to number_of_guilds * number_of_player transactional update per second or almost 60K writes/second to a single entity in a database. The high frequency means that there are high possibilities for database contention.
60K attacks at the same time are bound to have a slim chance to happen. To understand the actual scope of the problems, we would need to look at the real-world situation.
Guild wars happen between 2 different alliances, which usually occur in a cluster of areas. To optimally win the battle, you would need to spread the team attack. A single attack will finish between 3–4 minutes and ends with 5 seconds of animation. In these situations, WorkItems will spread out between 3–4 entities, of 600–1000 updates/minutes.
Based on this fact, we can reduce the scope to allow 50 updates in a single entity to complete transactionally in 5 seconds. One approach is to batch those 50 WorkItems into a single transaction.
The problem changes into:
- Retrieving all update actions for a single entity
- Batching the WorkItems in a scalable way
- Delivering the transaction results to all affected client
For every update action, we put the WorkItems into a queue (Kafka/Google PubSub/Kinesis). In a second window, we will group the WorkItems based on the entity we want to update. Then we do the process in batch, make it into a single update, and apply it in a transaction.
The easiest way to ensure it doesn’t have contention is by having a single worker that gets all the WorkItems from the queue and then update the entity in batch.
We thought that just having a single queue was not scalable because we can’t fetch WorkItems for only a specific area. The WorkItems also compute-intensive where we need to check the existing state and do battle logics to calculate the battle reward. Having a single worker might cause a delay in processing because of a CPU bottleneck.
Multi-Queue Worker is an extended version of the Single-Queue Worker approach. Rather than having a single queue for all WorkItems and entities, we shared it, so that a subset of WorkItems will be handled by a single queue. An individual worker will process each queue. Google AppEngine Standard doesn’t allow you to spawn a new process, so we used thread as the worker.
The thread runs on top of Google AppEngine manual scaling, and it pulled the data from Google Pub/Sub. We used manual scaling as the workers because Google Datastore can only be accessed using the AppEngine.
Each thread will pull multiple WorkItems from a single queue, execute numerous WorkItems, and persist the update into datastore at the end of the transaction.
Using a single-queue worker that batches actions for each entity alleviates the contention issue, and allows us to better the player experience in the high load area.
Unfortunately, it created a new problem. From our monitoring, we found that some action in areas had a delay in being processed. The workers were still active, and the WorkItems were in the queue. This issue happened sporadically to one or two areas, and returned to normal variably after 15–45 minutes, or if we restarted the workers.
After researching into the root cause, we hypothesized the root cause is python GIL.
“… This, with the opcode counting scheme, is the reason why some people have been complaining about latency problems when an I/O thread competes with a computational thread (the I/O thread wouldn’t be scheduled right away when e.g. a packet arrives; or rather, it would be scheduled by the OS, but unscheduled immediately when trying to acquire the GIL, and it would be scheduled again only much later). …”
We had IO threads (pulling from Google Pub/Sub subscription) that competed with CPU thread (which executed the actions) and caused the sporadic delay.
Multi-Queue with External Worker
Using process instead of thread can fix the GIL issue. Unfortunately, we can’t use the multiprocessing module because of the limitation of Google AppEngine Standard.
To fix the issue, we migrated a subset of the system. The execution of actions still needs to be in the Google AppEngine, as it requires interaction with existing game logic. Router/Forwarder and Batcher can be extracted outside of the system to be a standalone microservice.
The Batcher will send a list of actions to be executed in a single transaction to the Batch Handler webserver.
This approach allows us to separate the IO thread and CPU thread, with a small added latency. Its latency is low enough that it is no more an IO-bound application, but CPU-bound.
On an additional note, the battle doesn’t happen all the time in all the areas. We used auto-scaling to optimize the resource. When a dormant area becomes active, we start the Batcher for that area. We used containerized worker as the Batcher. We also used Kubernetes to manage the container and the autoscale policy.
When the area is back to dormant, the container will stop, and the autoscale policy is updated.
Making the Forwarder and the Batcher a standalone microservice enables us to use the same system on another part of the game.
We were iterating the system for six months, implementing what works, and tackling what didn’t work one by one. In doing so, we were able to learn a substantial amount throughout the development process.
- Using a single-queue worker and batched multiple actions in a single transaction can address the contention issue and allow us to run 6 actions/s.
- The way Python GIL in Python 2 work can cause IO thread blocked by CPU thread
With this architecture, we were able to decrease the battle processing 99th percentile latency of 1 hour to 20s, and the average latency down to 2s.
Join our team! Pocket Gems is hiring!