Build Reliable Products with Resilient Software

Dion Almaer
Published in Ben and Dion · May 4, 2016

Providing a reliable experience to a user requires resilient software, and this is something that we don’t often discuss, even though I think it warrants the same attention as security and performance.

It is hard to develop truly resilient software, as it requires thinking through and iterating on the edge cases, similar to the extra mile you have to go to make sure that 60 fps is standard, or that attack vectors are mostly covered, across a large range of devices.

We have all seen examples of when the user experience falters. One of the reasons this topic popped back into my mind was watching one of my sons deal with a poor experience (that is actually a feature in my book, as you will see!). Let me share the story…

Y U NOT TAKE MONAI!

I have seen many applications get into a bad state when it comes to in-app purchasing. My son wanted to take real money and convert it into “gems” in Clash Royale, but once the payment went through the gems never showed up. Even worse, the gems became greyed out and, when tapped, said “Transaction is in progress”. The game has been stuck in this state for weeks, which is a bummer for Sam. After searching online I can see that SuperCell is losing out on a lot of money, as this doesn’t seem to be a unique case at all. If the game were resilient it would be aware of such edge cases and would be able to revert to a state where it could take money again.
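To make that concrete, here is a minimal sketch of what a more recoverable purchase flow could look like: a pending purchase gets reconciled against the store, and if confirmation never arrives within a window it reverts to a retryable state instead of staying greyed out forever. The state names, timeout, and verifyWithStore function are hypothetical; this is not how Clash Royale or any real store actually works.

```ts
// Hypothetical sketch of a purchase flow that can always recover to a retryable state.
// The state names, timeout, and verifyWithStore() are illustrative assumptions.

type PurchaseState = "idle" | "pending" | "fulfilled" | "failed";

interface Purchase {
  id: string;
  state: PurchaseState;
  startedAt: number;
}

const PENDING_TIMEOUT_MS = 5 * 60 * 1000; // give the store five minutes to confirm

export async function reconcile(
  purchase: Purchase,
  verifyWithStore: (id: string) => Promise<boolean>,
): Promise<Purchase> {
  if (purchase.state !== "pending") return purchase;

  try {
    // Ask the store for the truth instead of trusting stale local state forever.
    const confirmed = await verifyWithStore(purchase.id);
    return { ...purchase, state: confirmed ? "fulfilled" : "failed" };
  } catch {
    // Verification itself failed (e.g. the network). Once we have waited long enough,
    // revert to "failed" so the UI can offer the purchase again instead of staying
    // greyed out with "Transaction is in progress".
    if (Date.now() - purchase.startedAt > PENDING_TIMEOUT_MS) {
      return { ...purchase, state: "failed" };
    }
    return purchase; // still within the window: stay pending and try again later
  }
}
```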

Visibility

The first step in building resilient software is being able to see what is going on in the system. You need a resilient mechanism for getting errors back to you, and a way to be alerted to the velocity of errors as well as to critical ones. It is common to get close to a new release and think “oh right, I guess we need to get some analytics tags in there quick!” instead of having that thinking occur at the beginning. You really want to be asking “what outcomes am I looking for with this release?” very early on indeed, as a tool to help you decide what to even build, as well as to flesh out the various scenarios.
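As a rough illustration, here is one way a client could batch errors back to a collection endpoint while preserving enough information (a count per window, a fatal flag) to alert on error velocity and on critical errors. The endpoint, flush interval, and payload shape are assumptions for illustration, not any particular analytics product’s API.

```ts
// A minimal sketch of getting errors back to you with some notion of "velocity".
// The endpoint, flush interval, and payload shape are assumptions for illustration.

interface ErrorReport {
  message: string;
  fatal: boolean;
  at: number;
}

const REPORT_ENDPOINT = "/errors"; // hypothetical collection endpoint
const FLUSH_INTERVAL_MS = 30_000;
const buffer: ErrorReport[] = [];

export function reportError(message: string, fatal = false): void {
  buffer.push({ message, fatal, at: Date.now() });
  if (fatal) void flush(); // don't wait for the batch when it's a critical error
}

async function flush(): Promise<void> {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length);
  try {
    await fetch(REPORT_ENDPOINT, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Send the count and window size so the backend can alert on error velocity,
      // not just on the presence of errors.
      body: JSON.stringify({ count: batch.length, windowMs: FLUSH_INTERVAL_MS, errors: batch }),
    });
  } catch {
    buffer.unshift(...batch); // reporting must itself be resilient: keep the batch for next time
  }
}

setInterval(() => void flush(), FLUSH_INTERVAL_MS);
window.addEventListener("error", (e) => reportError(e.message));
window.addEventListener("unhandledrejection", (e) => reportError(String(e.reason)));
```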

It is also easy to get flooded and to conflate true errors in the system with “valid” logging. I remember joining one company that had millions of exceptions flooding into its system, and a large percentage were SocketExceptions, which were waved away as “just networking issues”.

It just so happened that we put a new orchestration tier in front of the existing backend, and one side effect was that this new tier acted like a client that we controlled. Suddenly we could see the systemic problems in the backend that were causing real issues and costing millions of dollars. On the orchestration tier we were able to play with timeouts and retries, and work around the backend (while that team worked to fix those issues). These hacks are always tricky: if you aren’t careful, the retries can add more stress to the system and you end up causing more problems! You have probably run into this type of issue when dealing with account login systems, where you have to make sure you slowly add latency to the responses to slow the system down.
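Here is a sketch of the kind of timeout-and-retry shim an orchestration tier (or any client you control) can apply, assuming a generic fetch-based call; the attempt count, timeout, and delays are made up. The backoff with jitter is the part that keeps retries from piling extra load onto a backend that is already struggling.

```ts
// A sketch of a timeout-and-retry shim for calls to a flaky backend.
// The attempt count, timeout, and delay values are illustrative assumptions.

async function fetchWithRetry(
  url: string,
  attempts = 3,
  timeoutMs = 2_000,
  baseDelayMs = 250,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      // Per-attempt timeout so one slow backend call can't hold a request forever.
      const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
      if (response.ok || response.status < 500) return response; // only retry server errors
      lastError = new Error(`HTTP ${response.status}`);
    } catch (err) {
      lastError = err;
    }
    if (attempt < attempts - 1) {
      // Exponential backoff with jitter: this is what keeps retries from piling
      // extra load onto a backend that is already struggling.
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```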

Client Control and Service Workers

One of the reasons I was so excited about Service Workers was being able to take the orchestration-tier approach directly to the client, where it can actually do the most good. Once you track what is going on there and see how many errors happen due to flaky networks, you will be shocked. This isn’t just about making your app work for Wendy on a plane. This is about working around systemic networking issues (especially on mobile, but very much beyond that too, with cruddy WiFi and networking sitting between you and your endpoints).
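For example, a service worker fetch handler can act as that client-side orchestration tier: try the network, fall back to the cache when the network flakes out, and count the failures so you can actually see them. This is a sketch rather than a full caching strategy; the cache name is an assumption, and the counter stands in for whatever error reporting you actually use.

```ts
// sw.ts — a sketch of the service worker as a client-side "orchestration tier":
// serve from the network, fall back to the cache when the network fails, and
// count the failures so they are visible. Cache name and counter are assumptions.

/// <reference lib="webworker" />
declare const self: ServiceWorkerGlobalScope;
export {};

const CACHE_NAME = "app-shell-v1";
let networkFailures = 0;

self.addEventListener("fetch", (event) => {
  event.respondWith(
    (async () => {
      const cache = await caches.open(CACHE_NAME);
      try {
        const response = await fetch(event.request);
        if (event.request.method === "GET" && response.ok) {
          await cache.put(event.request, response.clone()); // refresh the cache on success
        }
        return response;
      } catch {
        networkFailures++; // these are the flaky-network errors that will shock you
        const cached = await cache.match(event.request);
        return cached ?? new Response("Offline", { status: 503 });
      }
    })(),
  );
});
```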

When Alex Russell first shared what he thinks makes up a Progressive Web Application, “connectivity independent” was the term he used to convey one of the key features of service workers. He purposefully didn’t use the word offline here, yet too many people assume that service workers are just for offline, which sells them short.

In fact, as I was about to publish this article Alex wrote a new piece arguing this very point: It’s About Reliable Performance, Not “Offline”.

With the low-level control that service workers give you, you are able to race the network against your caches and validate your state along the way. You also have other tools beyond service workers, such as the Page Visibility API, which you can use to throttle and batch work so you aren’t using resources when the user isn’t there.
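One way to set up that race is to give the network a short window and fall back to the cache when it is slow or fails, refreshing the cache whenever the network does answer; on the page side, the Page Visibility API can pause polling while the tab is hidden. The cache name, timeout, and /api/updates endpoint below are hypothetical.

```ts
// A sketch of racing the network against the cache (for GET requests), plus using
// the Page Visibility API to pause polling when the user isn't there.
// The cache name, timeout, and /api/updates endpoint are hypothetical.

async function raceNetworkAgainstCache(request: Request, networkTimeoutMs = 1_500): Promise<Response> {
  const cache = await caches.open("content-v1");
  try {
    // Give the network a short window; a slow response is treated like a failure.
    const fresh = await fetch(request, { signal: AbortSignal.timeout(networkTimeoutMs) });
    if (!fresh.ok) throw new Error(`HTTP ${fresh.status}`);
    await cache.put(request, fresh.clone()); // validate/refresh the cached copy along the way
    return fresh;
  } catch {
    const cached = await cache.match(request);
    return cached ?? new Response("Offline and nothing cached", { status: 503 });
  }
}

// On the page: throttle background work when the tab is hidden.
let pollTimer: number | undefined;

function startPolling(): void {
  window.clearInterval(pollTimer);
  pollTimer = window.setInterval(() => void fetch("/api/updates"), 30_000);
}

document.addEventListener("visibilitychange", () => {
  if (document.hidden) {
    window.clearInterval(pollTimer); // don't burn network and battery while nobody is looking
  } else {
    startPolling();
  }
});
startPolling();
```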

Service Workers aren’t “a new AppCache for offline”. They are building blocks for a new, resilient Web, one that can deliver game-changing features such as push notifications.
