“Encode Ala Mode” is a company that sold game developers a middleware service to convert their JPG/PNG textures to ETC2, using the cloud to speed up the conversion.
However, once they landed their fifth big contract, artists at each client started complaining about how long their texture exports took to come back. In some cases they waited minutes just to get into the queue for processing.
For each build, their script would create a new bucket, upload all the textures, and then hand the work to a load balancer fronting a group of GCE instances. They used a separate bucket per build so they could check how many textures were finished with a single bucket listing, which was faster than polling each texture individually.
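A rough sketch of that per-build flow, with an in-memory dict standing in for Cloud Storage so the control flow is runnable without credentials (all names here are hypothetical, not from their actual tooling):

```python
import uuid

storage = {}  # bucket name -> {object name: data}; fake Cloud Storage

def start_build(textures):
    # One fresh bucket per build, so progress can be checked with
    # a single bucket listing instead of per-texture polls.
    bucket = f"build-{uuid.uuid4().hex}"
    storage[bucket] = {}
    for name, data in textures.items():
        storage[bucket][name] = data  # upload source textures
    return bucket

def progress(bucket, expected):
    # Count finished .etc2 outputs with one listing of the bucket.
    done = sum(1 for name in storage[bucket] if name.endswith(".etc2"))
    return done / expected
```

Note that the bucket creation in `start_build` sits on the hot path of every single export, which is what matters later.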
Looking at the logs, their system was handling about 23.1k exports a day. Spread over a 12-hour work day, that's roughly 1,925 exports an hour, or about one export every two seconds across their customer base!
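A quick back-of-the-envelope check on that rate (the 23.1k and 12-hour figures come from their logs; the rest is arithmetic):

```python
exports_per_day = 23_100
work_hours = 12

per_hour = exports_per_day / work_hours  # 1925.0
per_second = per_hour / 3600             # ~0.53, i.e. one export every ~2s

print(f"{per_hour:.0f} exports/hour, {per_second:.2f} exports/second")
```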
Checking the scaling
My first hunch was to check the load balancing and scaling setup: if everything is set up right, new instances should spin up as load increases.
We checked the CPU utilization for the instance group: the values were low, around 12%, and only 3–4 instances were running. So the problem wasn't that the instances were getting overloaded.
Next, we checked whether the CPUs were stalling on work. I went and looked at the logs, which emitted a timestamp each time a new batch of work came in, and they showed plenty of idle time between one batch finishing and the next one arriving.
All of this pointed to the instance-group scaling working properly and as expected, which gave me a hunch that the problem wasn't the scaling: content simply wasn't reaching the load balancer in time.
Finding the issue
There is a per-project rate limit on bucket creation and deletion of roughly one operation every two seconds. For Encode Ala Mode, this meant that if 30 artists across multiple companies all tried to export at the same time, random requests would start failing with HTTP 429 errors, signaling that the bucket-creation rate had been exceeded.
Where this turned into a performance problem was that the Python API their tool was using implemented exponential backoff and retry (which is nice, and common practice), but since the queue kept backing up, folks ended up waiting a long time for a simple export.
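To see how those waits compound, here is a minimal sketch of exponential backoff on a rate-limited call. `RateLimitError` and `with_backoff` are stand-ins I made up for illustration; the real Python Cloud Storage client handles 429 retries internally in much the same shape:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response."""

def with_backoff(fn, max_attempts=6, base=1.0, cap=32.0, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Exponential delay with jitter: ~1s, ~2s, ~4s, ...
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)
            sleep(delay)
```

With 30 artists contending for a budget of one bucket creation every two seconds, most of them hit several retry rounds, and those 1s/2s/4s sleeps stack up into the minutes-long waits the artists were reporting.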
It turns out they didn't need dynamic bucket creation at all. Instead, they created one bucket per artist (named with the artist's user ID) when the artist registered with the system.
This removed the bucket-creation bottleneck entirely, made exports much faster, and made it easier to track down which assets were causing problems when an error occurred.
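The fix can be sketched as an idempotent per-artist bucket lookup (the `eam-artist-` naming scheme is hypothetical; GCS bucket names are globally unique, so a real deployment would need some such prefix). A set stands in for the project's existing buckets:

```python
existing_buckets = set()  # stand-in for the project's buckets

def bucket_for_artist(artist_id: str) -> str:
    # One bucket per artist, created at most once at registration,
    # so the ~1-op-per-2-seconds limit is off the export hot path.
    name = f"eam-artist-{artist_id.lower()}"
    if name not in existing_buckets:
        existing_buckets.add(name)  # create_bucket() in real life
    return name

def export_textures(artist_id, textures):
    bucket = bucket_for_artist(artist_id)  # no per-build bucket creation
    return bucket, list(textures)
```

As a side benefit, errors now land in a bucket named after the artist, which is why tracking down problem assets got easier too.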
I like it when the solutions are easy ;)