Application Failure — Retrospective

Any application engineer will tell you that the worst thing that can happen is for an application to go down; worse still if you can’t resolve the issue remotely and have to push an update to the store to fix it.

This happened to us this week, and I think it’s important to address the issue, its causes and what we can learn from it.

Ground Zero

We rely on social APIs from Facebook and Twitter for social sharing counters, comments and sharing tools.

Yesterday, an untouched and hidden code bomb exploded. One of the Facebook APIs we relied on for displaying an article’s social count and retrieving its graph ID was deprecated and removed by Facebook. That left our application unable to retrieve the count. Not much of a problem in itself, except that we failed to handle the missing data gracefully, so the application crashed, and the crash was unrecoverable.
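To illustrate the failure mode, here’s a minimal Swift sketch of the kind of fragile parsing that bites in this situation (the function and field names are hypothetical, not our actual code): the count is force unwrapped, so as soon as the field disappears from the response the whole app goes down.

    // Hypothetical sketch: fragile parsing of a share-count response.
    // If the API stops returning these keys, the force unwraps crash the app.
    func shareCount(from json: [String: Any]) -> Int {
        let engagement = json["engagement"] as! [String: Any] // crashes when the key is missing
        return engagement["share_count"] as! Int              // crashes when the value is missing
    }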

In December 2015 we released V1.0 of our application. As a growing company we didn’t quite have the resources or the capacity within our engineering team to build our iOS application, so we outsourced it.

Unfortunately, when we brought the application in-house the code was in poor condition: no unit tests and little documentation. We’ve fought over the last several months to clean up the code smells within the application; we just hadn’t got to the code which would ultimately explode and affect our users in a fatal way.

Fallout

Every version of our application became unusable, as the code which triggered the bomb had been lying dormant since V1.0. Users would open our application and attempt to read an article, which would load momentarily, then the application would crash.

Over the 24-hour period in which our current version was in the store, 89% of our users were affected, and we were receiving poor reviews and emails.

Recovery

The immediate resolution was to handle the empty response from Facebook, meaning we could get a fix into the store relatively quickly but would be at the mercy of the Apple review gods. After filling in the expedited review form it was a matter of ‘wait and see’.
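The shape of that fix, roughly (again a hedged sketch with hypothetical names rather than the code we shipped), is to treat the count as optional and degrade gracefully instead of crashing:

    // Hypothetical sketch: handle a missing or malformed response gracefully.
    // A missing count now means "no counter shown" rather than a crash.
    func shareCount(from json: [String: Any]?) -> Int? {
        guard let engagement = json?["engagement"] as? [String: Any],
              let count = engagement["share_count"] as? Int else {
            return nil // hide the counter rather than crash
        }
        return count
    }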

Whilst we were confident in the fix we submitted to the store, it was important to review the code and determine how to re-introduce the share count and comments that would otherwise be unavailable. Once that functionality was restored, we submitted it as a separate build; it’s better to plug the hole fast than rebuild the plank and risk sinking the ship.

Eight hours after submitting our plug to Apple, it was in the store; we had restored access to our content and stopped the fatal crashes.

Twenty-four hours after submitting our rebuilt share and comment functionality, it was ready for release.

All in, we managed to find, fix and recover functionality within 36 hours, 36 long hours.

Takeaways

It’s important to learn from these kinds of events: shit will happen, and learning from it will help prevent the same shit happening twice.

We’re reviewing all of the underlying third-party APIs and, where possible, making sure we’re notified of changes to them. Where we can remove, merge or abstract API functionality so it can be controlled remotely, we will; this will ensure we can handle changes remotely and avoid similar events.
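One way to do that, sketched below with a hypothetical config endpoint and flag names (not our production setup), is to gate social features behind a remotely fetched set of flags, so a misbehaving third-party API can be switched off server-side without a store release:

    // Hypothetical sketch: a remotely controlled kill switch for social features.
    // If a third-party API breaks, we flip the flag server-side and the app stops calling it.
    import Foundation

    struct FeatureFlags: Decodable {
        let shareCountsEnabled: Bool
        let commentsEnabled: Bool
    }

    func fetchFeatureFlags(completion: @escaping (FeatureFlags) -> Void) {
        // The config URL is illustrative only.
        let url = URL(string: "https://example.com/config/feature-flags.json")!
        URLSession.shared.dataTask(with: url) { data, _, _ in
            // If the config can't be fetched or parsed, fail safe with social features off.
            let fallback = FeatureFlags(shareCountsEnabled: false, commentsEnabled: false)
            let flags = data.flatMap { try? JSONDecoder().decode(FeatureFlags.self, from: $0) } ?? fallback
            completion(flags)
        }.resume()
    }

With something like this in place, the worst case for a broken social API becomes a hidden counter rather than a crash.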

It was already on our radar, but we’ll be adding a hot-fix solution that will allow us to monkey patch future issues and then release the relevant fixes properly.

The biggest lesson we can learn is to carry out regular full code reviews; we review iteratively, but over time the entire codebase needs a checkup.

Have you had a similar issue happen with your application? How did you resolve it and what did you learn? I’d love to hear from you.