The Worst Bug Ever—Randomly Losing Your Best Players
By Ron Little
Imagine discovering a serious bug in production immediately after releasing your game. Imagine this bug hurts only your paying customers. Imagine it freezes the game immediately after players complete an in-app purchase. Imagine that when the player restarts, the game freezes during start-up. Imagine the player can never get unstuck and has to uninstall the game. Imagine your app is currently featured on the App Store. This is a story of such a bug, the worst bug I have ever dealt with in 30 years of programming. This is a story of how we tracked it down and worked with Unity to fix it.
In the 24 hours after going live with Adventure Chef: Merge Explorer for iOS, we started seeing a large number of our players encountering freezes during start-up of our game. We use Bugsnag’s excellent app stability monitoring library and dashboard. There was a set of callstacks pointing to Unity’s In App Purchasing (IAP) package. Apparently, this popular Unity library was causing our app to be unresponsive for over two seconds, which triggered the operating system to force quit our app. It appeared that this Unity code was simply parsing an iOS receipt, a bit of text in memory, to determine what the player had bought.
What could be causing Unity IAP 4.1.1 to take more than two seconds just to parse a chunk of text in memory?
There was an extreme sense of urgency. I won’t call it a panic, but urgent messages were being sent on Slack. My manager texted my phone for the first time in six years.
As the engineer most responsible for our low-level IAP support within the game, I was most knowledgeable about what was going on and I felt the weight of this responsibility. I won’t call it a panic, but yeah, this was stressful for sure.
At first, I just tried to get a handle on what was happening. Initially, we thought it was affecting 10% of our iOS players. After more careful examination, we realized it was about 1.4%. The callstacks (what the program was doing at the time of the error) varied widely. Many of them indicated that memory was being allocated. Was there some kind of C# memory problem?
But all of these callstacks were within Unity IAP’s code that, judging by the method names, seemed to be parsing Apple receipts. So, my theory was that we were seeing an infinite loop where some kind of tree structure was being parsed but wasn’t finishing, and the wide variety of callstacks just happened to show where the operating system killed our app. I gathered relevant information and filed the highest-priority bug report possible with Unity.
The Workaround — Reverting to a Better Bug
When I talked with teammates, someone pointed out that this bug didn’t occur in the previous beta release of our game, which used Unity IAP 3.2.3. Well, the reason we upgraded was to improve a “No Products Available” error where the player would be randomly prevented from doing in-app purchases on Android, but they could at least restart the game and likely succeed. But this “iOS app hang” bug was unimaginably worse, so it was an easy decision to revert to an earlier version of Unity IAP. So, at that point we at least had a workaround. We reverted from Unity IAP 4.1.1 to 3.2.3, did overnight offshore testing, and then submitted the fixed version to Apple.
Since Pocket Gems has a contract with Unity, we get first-rate customer service. Customer support got back to us within an hour and the IAP team was alerted shortly after that. Then, they said they could reproduce a freeze as well. Wow, that was fast!
With the crisis temporarily resolved and with the Unity IAP team working on the problem and possibly having a fix already, I was able to finish other urgent tasks as we headed into our winter break.
ASN.1 and the Deep Dive
I was still troubled that we didn’t know how to reproduce this freeze. The Unity IAP team said they could reproduce a freeze, but was it the same freeze? How would I verify their upcoming fix? We could be stuck on IAP 3.2.3 forever.
Going into our winter break, I had time on my hands due to the pandemic and not wanting to travel. I was really curious about whether I could reproduce this problem. I tried to understand what the Unity IAP code was doing. It looked like it was building a tree structure from the Apple receipt, which arrives as a Base64-encoded text string. Decoded, that string is a cross-platform binary structure in the ASN.1 format.
ASN.1 is a hierarchical structure of container-like elements and leaf nodes of simple attributes.
Crucially, if you don’t have a schema or some external description of the data layout, then a blob of bytes, such as the contents of an OCTET STRING, might either be a child ASN.1 structure or it might just be a leaf node containing some string. The parent structure can’t tell you one way or another! So, to decode an arbitrary ASN.1 object, you just have to try to parse every element and see what happens. From an architecture and safety point of view, trying to decode a random binary blob to see if it happens to be a well-defined structure is risky and reminds me of Little Bobby Tables.
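To make the ambiguity concrete, here is a minimal Python sketch. It is not Unity's code: `read_tlv` and `looks_like_nested_asn1` are hypothetical helpers, and the byte strings are made up for illustration. It only handles single-byte tags. It shows that "try to parse it and see" gives the right answer for a wrapped structure and for plain text, but also accepts an opaque blob that merely happens to look like an element.

```python
def read_tlv(data, offset=0):
    """Read one tag-length-value element; return (tag, value, next_offset)."""
    if offset + 2 > len(data):
        raise ValueError("truncated header")
    tag = data[offset]
    length = data[offset + 1]
    offset += 2
    if length & 0x80:  # long form: low 7 bits give the number of length bytes
        count = length & 0x7F
        if count == 0 or offset + count > len(data):
            raise ValueError("bad length")
        length = int.from_bytes(data[offset:offset + count], "big")
        offset += count
    end = offset + length
    if end > len(data):
        raise ValueError("value runs past end of buffer")
    return tag, data[offset:end], end


def looks_like_nested_asn1(blob):
    """Speculatively parse blob as a run of elements that exactly fills it."""
    offset = 0
    try:
        while offset < len(blob):
            _, _, offset = read_tlv(blob, offset)
    except ValueError:
        return False
    return len(blob) > 0


# An OCTET STRING (tag 0x04) whose 5 content bytes are SEQUENCE { INTEGER 42 }.
wrapped = bytes([0x04, 0x05, 0x30, 0x03, 0x02, 0x01, 0x2A])
_, wrapped_content, _ = read_tlv(wrapped)

nested = looks_like_nested_asn1(wrapped_content)               # True: really is a child structure
text = looks_like_nested_asn1(b"hello")                        # False: 'e' reads as length 101
accident = looks_like_nested_asn1(bytes.fromhex("0203a1b2c3")) # True: opaque bytes pass as an INTEGER
```

The last case is the dangerous one: nothing in the bytes themselves distinguishes a five-byte identifier from a well-formed element, so a speculative parser has to survive being wrong.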
Is this how Unity IAP was getting into trouble, by trying to parse random bytes? Spoiler alert: yes, yes it was.
I decided to build an automated unit test that could run just the Apple receipt parsing code within the Unity IAP package, without all of the overhead of debugging our game on an iOS device. Unfortunately, their code is not included when building for Unity Editor’s Player, but I was able to copy the C# files out of this directory, in the root of my local Unity project directory:
For data, we save our players’ anonymized receipts to a Google BigQuery table, so I ran a query looking at a small subset of receipts from in-app purchases by our actual iOS players. I downloaded the data as a CSV file and wrote a quick bit of parsing code. I had just over 300 real-world receipts.
Could I reproduce the freeze? I attached my C# debugger (in the excellent Rider IDE from JetBrains) to Unity Editor, and ran my new unit test. I stepped across the call to Unity’s ASN1 receipt parsing code. The next line of code wasn’t reached. The unit test was running. Frozen. The very first receipt reproduced the freeze! Restarted. Skipped the first receipt. The second receipt parsed fine. And the third and fourth. I let the unit test run freely. Another freeze.
I copied the same code files from Unity IAP 3.2.3 to double-check that it would parse these 300 receipts. Yes, no problem.
I was so happy! I could reproduce the problem! Now for more receipts!
I faced an interesting problem with my automated test — how do I test thousands of receipts knowing that some of them are going to cause an infinite loop yet I want my test to finish and output the results?
One way was to assign a Task to each receipt and then use the .NET Task.WaitAll method along with a timeout parameter. Tasks run on background threads in the .NET thread pool, so the main thread isn’t stuck behind a hung parse: once the timeout expires, it can move on and report the results.
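The same idea can be sketched in Python. This is a stand-in for the .NET approach, not the test I ran: `parse_all_with_timeout` and `fake_parse` are hypothetical names, and `fake_parse` simulates the three outcomes rather than calling any real receipt parser. Daemon threads play the role of thread-pool tasks, and a single shared deadline plays the role of the WaitAll timeout.

```python
import threading
import time


def parse_all_with_timeout(receipts, parse, timeout_seconds):
    """Run parse(receipt) on one daemon thread per receipt and classify each outcome."""
    outcomes = ["freeze"] * len(receipts)  # stays "freeze" if the thread never finishes

    def worker(index, receipt):
        try:
            parse(receipt)
            outcomes[index] = "ok"
        except Exception:
            outcomes[index] = "crash"

    threads = [
        threading.Thread(target=worker, args=(i, r), daemon=True)
        for i, r in enumerate(receipts)
    ]
    for t in threads:
        t.start()

    # One deadline shared by all threads, similar in spirit to a WaitAll timeout.
    deadline = time.monotonic() + timeout_seconds
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
    return list(outcomes)  # snapshot taken when the deadline has passed


def fake_parse(receipt):
    """Hypothetical parser that simulates a hang, a crash, or a success."""
    if receipt == "hang":
        threading.Event().wait()  # never set: blocks forever, like the infinite loop
    if receipt == "bad":
        raise ValueError("malformed receipt")


results = parse_all_with_timeout(["good", "hang", "bad", "good"], fake_parse, 0.5)
# results == ["ok", "freeze", "crash", "ok"]
```

Hung threads are simply abandoned; because they are daemon threads, they do not keep the process alive once the report is written. For thousands of receipts, one thread each is wasteful, so a real run would more likely work through them in batches.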
Out of 9,163 receipts, 2 caused a crash, 180 caused a freeze, and 8,981 parsed correctly. Error rate: 2.0% (= 182 / 9163).
While waiting for a fix from Unity, we realized we needed to use one version of Unity IAP for Android and a different version for iOS. We were navigating around two bugs!
- Unity IAP 3.2.3 — use this for iOS. It has the “No Products Available” bug which almost exclusively affects Android but, crucially, it does not have the “iOS app hang” bug.
- Unity IAP 4.1.1 — use this for Android. It improves the “No Products Available” error for Android but it introduced the “iOS app hang” bug (which only affects iOS, not Android).
But how do you select the package version based on platform? My coworker knew of an elegant solution — you can select it programmatically when loading the project in Unity Editor! The default in the Package Manager would be Unity IAP 3.2.3, and our builder would choose Unity IAP 4.1.1 for Android.
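For readers who want a feel for what per-platform selection can look like, here is a minimal sketch of one alternative: a pre-build step that rewrites the package manifest. This is not the editor-time mechanism my coworker set up. The function name `pin_iap_version` is hypothetical, and the sketch assumes the IAP package is listed as `com.unity.purchasing` under `dependencies` in the project's `Packages/manifest.json`.

```python
import json

IAP_PACKAGE = "com.unity.purchasing"  # assumed package name for Unity IAP
IAP_VERSION_BY_PLATFORM = {"iOS": "3.2.3", "Android": "4.1.1"}
DEFAULT_IAP_VERSION = "3.2.3"  # what the Package Manager shows by default


def pin_iap_version(manifest_text, platform):
    """Return manifest JSON text with the IAP dependency pinned for the platform."""
    manifest = json.loads(manifest_text)
    version = IAP_VERSION_BY_PLATFORM.get(platform, DEFAULT_IAP_VERSION)
    manifest.setdefault("dependencies", {})[IAP_PACKAGE] = version
    return json.dumps(manifest, indent=2)


# Made-up manifest content for illustration.
sample_manifest = '{"dependencies": {"com.unity.purchasing": "3.2.3"}}'
android_manifest = json.loads(pin_iap_version(sample_manifest, "Android"))
ios_manifest = json.loads(pin_iap_version(sample_manifest, "iOS"))
# android_manifest["dependencies"]["com.unity.purchasing"] == "4.1.1"
# ios_manifest["dependencies"]["com.unity.purchasing"] == "3.2.3"
```

Keeping the version table in one place makes the workaround easy to delete later, once a single version works on both platforms.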
Back-and-Forth For the Fix
I asked the Unity IAP team for a preview of their fix, to double-check that it passed my automated test. They agreed, but when I tried out their fix it did not fix the freeze, unfortunately.
It was hard for me to be sure that my way of reproducing the freeze was valid. What if I was misusing their code? I was sensing confusion in our back-and-forth on the support ticket. So, on a Thursday afternoon, I asked for a meeting with their programmers. They agreed and set up a meeting for 10 a.m. the next morning. Wow, talk about customer support!
They explained what they thought was going on with the freeze — that it was being caused by Unity IAP 4.x having improved support of the fake store in Unity Editor’s Player, by doing a deeper parsing of these ASN.1 structures. They confirmed that my automated test was valid.
The Root Cause
What does that mean, to do a “deeper parsing”? Imagine you come across an octet string in your ASN.1 object.
The ASN.1 structure doesn’t say how this octet string should be interpreted; it’s just a text representation of some binary data. By convention, some octet strings in an Apple receipt represent a standalone ASN.1 encoded object. Which ones? Well, Apple defines that, but I guess that’s kind of a pain to figure out and it’s just easier to try to decode and see what happens! YOLO!
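The ASN.1 format has short tags that indicate the type of the structure. For example, tag number 0x3 identifies a bit string, and 0x4 identifies an octet string. There are tags for containers like sets (0x11) and sequences (0x10), etc. In the encoded byte, containers also carry a “constructed” flag bit, so a sequence shows up as 0x30 and a set as 0x31. With so few bits involved, it’s not hard to have a random byte be misidentified as a tag.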
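A small Python sketch shows how little it takes for a byte to pass as a tag. The function `describe_identifier` and the lookup table are hypothetical and cover only a handful of universal tag numbers. The bit layout follows the standard encoding: two class bits, one constructed bit, and five tag-number bits.

```python
UNIVERSAL_TAGS = {
    0x02: "INTEGER",
    0x03: "BIT STRING",
    0x04: "OCTET STRING",
    0x0C: "UTF8String",
    0x10: "SEQUENCE",
    0x11: "SET",
}
TAG_CLASSES = ("universal", "application", "context-specific", "private")


def describe_identifier(byte):
    """Split one identifier byte into (class, constructed flag, tag name)."""
    tag_class = TAG_CLASSES[byte >> 6]  # top two bits
    constructed = bool(byte & 0x20)     # set for containers
    number = byte & 0x1F                # low five bits
    if number == 0x1F:
        name = "high-tag-number form"   # tag number continues in later bytes
    elif tag_class == "universal":
        name = UNIVERSAL_TAGS.get(number, f"universal tag {number}")
    else:
        name = f"tag {number}"
    return tag_class, constructed, name


sequence = describe_identifier(0x30)        # ('universal', True, 'SEQUENCE')
octets = describe_identifier(0x04)          # ('universal', False, 'OCTET STRING')
ascii_zero = describe_identifier(ord("0"))  # the text character "0" is also 0x30
ascii_one = describe_identifier(ord("1"))   # ('universal', True, 'SET')
```

No byte value is rejected by this step: all 256 values decode to some class and tag, which is why the length and bounds checks that follow the tag have to do the real validation.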
The Unity IAP team found themselves having to harden their code against random data that might as well be from an untrusted adversarial source, just like in the Little Bobby Tables cartoon. It’s a hard problem to correctly parse complex structures that are incorrectly formatted. This is what led to unhandled exceptions and infinite loops.
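I have not seen how Unity's fix is structured, so the following Python sketch only illustrates the general kind of hardening a speculative parser needs; it is not their implementation. The names `read_header`, `parse_elements`, and `MAX_DEPTH` are hypothetical. Three guards matter: every length is checked against the enclosing container, the offset always moves forward, and nesting depth is capped. A failed speculative parse of an octet string falls back to treating it as a leaf.

```python
MAX_DEPTH = 8  # arbitrary illustrative cap on nesting


def read_header(data, offset, end):
    """Return (tag, value_start, value_end); never read at or beyond end."""
    if end - offset < 2:
        raise ValueError("truncated header")
    tag = data[offset]
    if tag & 0x1F == 0x1F:
        raise ValueError("multi-byte tags not supported in this sketch")
    first = data[offset + 1]
    pos = offset + 2
    if first < 0x80:
        length = first
    else:
        count = first & 0x7F
        if count == 0 or count > 4 or pos + count > end:
            raise ValueError("unsupported or truncated length")
        length = int.from_bytes(data[pos:pos + count], "big")
        pos += count
    if length > end - pos:
        raise ValueError("value overruns its container")
    return tag, pos, pos + length


def parse_elements(data, start, end, depth=0):
    """Parse consecutive elements in data[start:end] into (tag, payload) pairs."""
    if depth > MAX_DEPTH:
        raise ValueError("nesting too deep")
    nodes = []
    offset = start
    while offset < end:
        tag, value_start, value_end = read_header(data, offset, end)
        children = None
        if tag & 0x20:  # constructed: contents must themselves be valid
            children = parse_elements(data, value_start, value_end, depth + 1)
        elif tag == 0x04 and value_end > value_start:
            try:  # OCTET STRING: try a nested parse, fall back to a leaf
                children = parse_elements(data, value_start, value_end, depth + 1)
            except ValueError:
                children = None
        payload = data[value_start:value_end] if children is None else children
        nodes.append((tag, payload))
        offset = value_end  # at least offset + 2, so the loop always terminates
    return nodes


# SEQUENCE { OCTET STRING wrapping SEQUENCE { INTEGER 42 }, OCTET STRING "hi" }
doc = bytes([0x30, 0x0B, 0x04, 0x05, 0x30, 0x03, 0x02, 0x01, 0x2A,
             0x04, 0x02, 0x68, 0x69])
tree = parse_elements(doc, 0, len(doc))
# tree == [(0x30, [(0x04, [(0x30, [(0x02, b"*")])]), (0x04, b"hi")])]
```

Malformed input surfaces as a `ValueError` that the caller can handle, rather than as a hang or an out-of-range read.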
Anyway, I soon got a second preview of their fix. It fixed the freeze! But, it crashed on a different receipt. I gave them that receipt, too.
Third attempt: looks good! All 9,163 receipts were cleanly parsed!
Soon, Unity IAP 4.1.3 was released with the fixes. Whew! I could then get rid of our hacky workaround.
I was relieved that our offshore testing and automated testing looked clean with Unity’s IAP 4.1.3 release. We released a new version of Adventure Chef using this fix, and everything looked good.
I was happy to have helped other Unity-based game developers who may not even have known that some of their customers had run into this issue. And I was happy to help our partner, Unity, too.
Here is the summary in the Unity IAP 4.1.3 changelog; a lot of work and stress were behind this innocent-sounding sentence!
“Fixed edge case where Apple StoreKit receipt parsing would fail, preventing validation.”