Improving Quality by Tracking Unexpected Events on iOS

Unexpected Event Tracking has given us a useful tool for improving our app’s quality via negative feedback loops. Our team regularly adds new cases where we log an unexpected event, and our peer code reviews often identify new places to log one. We’ve already improved the app’s quality using the information captured this way, and we think other teams can get a similar gain by logging their unexpected events to an analytics dashboard.

Feedback Loops

Toilets contain a critical negative feedback loop. After each flush, the tank must refill to prepare for the next toilet event. Flush, fill. Flush, fill. The float measures the water level. If the water drops below the target level, the valve opens and the tank fills. When the water returns to the target level, the valve closes.

The Toilet Feedback Loop

Engineers may recall feedback loops from their control systems class. We know that control systems make life easier and safer almost everywhere. Amplifiers use negative feedback to make music audible. Our bodies use negative feedback to regulate our blood oxygen and heart rate. The official feedback loop of Texas should be the thermostat. Millions of those control systems keep our buildings comfortable in August.

Engineers may also remember that an improperly designed control system can be unstable. If the feedback loop of a control system contains too much delay, terrible things start to happen. Imagine the toilet. If the water level measurement is late by even a minute, the feedback loop breaks. Let’s say the water reaches the proper level, but, like the fuse in the opening credits of Mission: Impossible, the signal that the tank is full takes a circuitous path to the valve. For a whole minute, water continues rising past the desired level. It reaches the top of the tank, crests, and finally pours onto the floor.

So usually a feedback loop with a short delay makes for better results.

What About Systems More Advanced than Plumbing?

The quality of software depends on many feedback loops: tests, QA, reviews, customer service calls, etc. Like the toilet example, if our feedback arrives too late we have a mess. On the HomeAway Owner app for iOS, we added a new feedback loop that helps us detect potential problems before many customers notice. We stumbled on this source of feedback on our code quality during our adoption of Swift.

The Swift programming language differs in many ways from Objective-C, the original language used to develop apps for the App Store. One difference is in how references to objects are handled. In Objective-C, a pointer can point to an object, or it can point to nil (null, for those who speak Java). If you send a message to (call a method on) a valid object, the object handles the message (executes the method). If you send a message to a nil pointer, nothing happens. There is no exception. No crash. No error.

In Swift, force-unwrapping a nil reference to call a method on it crashes your app. Fortunately, Swift 3 makes it easy to avoid those crashes. A variable can be optional (it can be nil or it can contain a valid reference), or it can be non-optional (it always references a valid object). From this type information, the compiler can prevent you from abusing nil and causing a crash.* Using Swift safely means your code has many places where it tests for nil:
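A minimal sketch of that kind of nil test. Only the name itemizedFees comes from this article; the Fee and QuoteViewModel types and the feeTotal method are hypothetical, chosen for illustration:

```swift
// Hypothetical types for illustration; only `itemizedFees` is named in the article.
struct Fee {
    let name: String
    let amount: Double
}

final class QuoteViewModel {
    // Optional: may hold an array of fees, or may be nil.
    var itemizedFees: [Fee]?

    func feeTotal() -> Double {
        // `if let` runs the body only when the optional holds a value.
        if let fees = self.itemizedFees {
            return fees.reduce(0) { $0 + $1.amount }
        }
        // When itemizedFees is nil we silently fall through to a default.
        return 0
    }
}
```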

When this code runs, we always expect itemizedFees to be populated. But what should you do in the case where self.itemizedFees is nil? We could just carry on without specifically addressing that case. The if let protects us from a crash. Perhaps that is good enough. We’re pretty sure the itemizedFees won’t be nil. However, what if our assumptions about the way the system works are wrong? What if the rules change and someone forgets to ask us to update the code? What if something breaks the system and it no longer follows the rules?

Do you see the potential feedback loop? We did:
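A sketch of what closing that loop might look like: the else branch reports the case we assumed would never happen. The Log class here is a small stand-in that only records and prints the message, and Fee, QuoteViewModel, and feeTotal are hypothetical names:

```swift
// Stand-in for the app's logging class; this version only records and prints.
final class Log {
    static var recorded: [String] = []

    static func unexpectedEvent(_ message: String) {
        recorded.append(message)
        print("UNEXPECTED EVENT: \(message)")
    }
}

// Hypothetical types for illustration.
struct Fee {
    let name: String
    let amount: Double
}

final class QuoteViewModel {
    var itemizedFees: [Fee]?

    func feeTotal() -> Double {
        if let fees = self.itemizedFees {
            return fees.reduce(0) { $0 + $1.amount }
        } else {
            // We still avoid the crash, but now we also hear about it.
            Log.unexpectedEvent("itemizedFees was nil when computing the fee total")
            return 0
        }
    }
}
```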

Log is our logging class. It has class methods like warn, debug, info, just as you might expect. It also has an unexpectedEvent class method which we use to monitor this sort of corner case in our code. Instead of merely sending a message to the log, the unexpectedEvent method also sends a “UNEXPECTED EVENT” event to Fabric Answers (our analytics service). The event dictionary contains the log string under the key “WARNING_MESSAGE”. Now we have a dashboard that alerts us when we might not be doing what our customers expect.
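A rough sketch of how such a method could be structured, not our actual implementation. The AnalyticsSink protocol and InMemorySink class are hypothetical stand-ins for the analytics SDK so the example runs on its own; the event name and the dictionary key are the ones described above:

```swift
// Hypothetical abstraction over the analytics SDK, so the sketch is self-contained.
protocol AnalyticsSink {
    func logEvent(name: String, attributes: [String: String])
}

// Test double that keeps events in memory instead of sending them anywhere.
final class InMemorySink: AnalyticsSink {
    private(set) var events: [(name: String, attributes: [String: String])] = []

    func logEvent(name: String, attributes: [String: String]) {
        events.append((name: name, attributes: attributes))
    }
}

final class Log {
    // In an app this would be set once at launch to a sink backed by the analytics service.
    static var analytics: AnalyticsSink?

    static func warn(_ message: String) {
        print("WARN: \(message)")
    }

    static func unexpectedEvent(_ message: String) {
        // Write to the ordinary log first...
        warn(message)
        // ...then report the same message to analytics under a fixed event name.
        analytics?.logEvent(name: "UNEXPECTED EVENT",
                            attributes: ["WARNING_MESSAGE": message])
    }
}
```

Keeping the event name constant and putting the message in an attribute means every unexpected event lands in one dashboard series that can then be broken down by message.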

The Unexpected Event Feedback Loop

The Unexpected Event Dashboard

Sending our unexpected events to Answers gives us a central place to monitor them and to see how many customers are affected.

Our Unexpected Event Dashboard

This lets us shine a light into the dark corners of our code, closing the quality feedback loop earlier — hopefully intercepting any negative reviews in the App Store.

This unexpected event dashboard is similar to our crash dashboard: another automatically gathered source of information about our code quality. And since none of these events lead to crashes, it is information we otherwise wouldn’t have had access to so quickly.

A Return on Investment

We recently added the ability for owners to view file attachments like JPEGs and PDF files. The code had conditionals to check for the type of file. PDFs were displayed one way. Images were displayed another way. But what if we didn’t know what to do? That seemed like a job for unexpectedEvent:
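A sketch of the shape such a check might take. The AttachmentViewer enum, the viewer(forFileNamed:) function, and the exact list of extensions are assumptions for illustration, and Log is a stand-in that only records and prints:

```swift
import Foundation

// Stand-in for the app's logging class.
final class Log {
    static var recorded: [String] = []

    static func unexpectedEvent(_ message: String) {
        recorded.append(message)
        print("UNEXPECTED EVENT: \(message)")
    }
}

// Hypothetical: the two display paths described in the text.
enum AttachmentViewer {
    case pdf
    case image
}

func viewer(forFileNamed fileName: String) -> AttachmentViewer? {
    // Compares the extension exactly as it appears in the file name.
    let ext = URL(fileURLWithPath: fileName).pathExtension
    if ext == "pdf" {
        return .pdf
    } else if ext == "jpg" || ext == "jpeg" || ext == "png" {
        return .image
    } else {
        // We don't know how to display this file, so report it.
        Log.unexpectedEvent("Could not view attachment file with extension '\(ext)'")
        return nil
    }
}
```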

After the feature released, we noticed an interesting new unexpected event:

“Could not view attachment file” events start arriving

We hadn’t taken into account that the file extensions might be capitalized. Our new feedback loop identified a bug (and exactly where it was) before we received a single complaint. We released a fix shortly after discovery. As you can see above, the frequency of these events is trending down since we released the update.
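One way to make this kind of check case-insensitive is to normalize the extension before comparing it. This is a sketch rather than the fix we shipped; AttachmentKind, attachmentKind(forFileNamed:), and the list of extensions are hypothetical:

```swift
import Foundation

// Hypothetical classification of attachment types.
enum AttachmentKind {
    case pdf
    case image
    case unknown
}

func attachmentKind(forFileNamed fileName: String) -> AttachmentKind {
    // Lowercase the extension so "PDF", "Pdf", and "pdf" are treated alike.
    let ext = URL(fileURLWithPath: fileName).pathExtension.lowercased()
    switch ext {
    case "pdf":
        return .pdf
    case "jpg", "jpeg", "png":
        return .image
    default:
        return .unknown
    }
}
```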

More Uses for Unexpected Event Tracking

  • When the server gives you a new enum string that you don’t know how to handle
  • When an important value is missing from the data from the server or from a persistent store
  • Monitoring the behavior of complex or crufty code as you change it (or try to understand it)
  • When values fall outside of expected bounds
  • When dependencies between distant UI elements break (this code doesn’t smell good, but it sometimes happens)
  • When you get bad data (e.g. corrupt images, invalid URLs)
  • When a library method fails
  • For the default case in a switch statement that you don’t expect to hit
  • When a method’s contract is violated
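As an illustration of the first item in the list, here is a sketch that reports an enum string the app doesn’t recognize. BookingStatus, its raw values, and parseStatus are hypothetical, and Log is a stand-in that only records and prints:

```swift
// Stand-in for the app's logging class.
final class Log {
    static var recorded: [String] = []

    static func unexpectedEvent(_ message: String) {
        recorded.append(message)
        print("UNEXPECTED EVENT: \(message)")
    }
}

// Hypothetical enum mirroring status strings sent by a server.
enum BookingStatus: String {
    case reserved = "RESERVED"
    case cancelled = "CANCELLED"
}

func parseStatus(_ raw: String) -> BookingStatus? {
    // init(rawValue:) returns nil for any string the enum doesn't declare.
    guard let status = BookingStatus(rawValue: raw) else {
        Log.unexpectedEvent("Unknown booking status from server: \(raw)")
        return nil
    }
    return status
}
```

Logging the raw string is reasonable here because it is a server-defined constant rather than user data.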

It’s important to remember that unexpected events are a tool for monitoring the wild world of production code. In our case, this code is running around the world on thousands of devices we don’t control. There are better tools than analytics for monitoring strange things happening on a test device in the lab. Out in production, we can’t monitor our users’ logs, so we have this.

Some of the above uses might mean really bad things in production, so you may wish to use a more robust tool than analytics to track them. You have to use your own best judgement. Speaking of judgement, some caution is advised for what content you log. Never log sensitive user information to analytics.

Technical Considerations

You might wonder why we don’t just use our crash reporting tool instead of using Answers analytics for these events. Our app reports crashes using Crashlytics. Crashlytics automatically reports all crashes with a stack trace. Crashlytics also lets you log your own error events using the same service. We do take advantage of Crashlytics where it makes sense.

Crashlytics isn’t a great solution for our unexpected events, mainly because its logging is expensive. Its methods use blocking IO: the calling thread is blocked until the error information is persisted to storage, and blocking threads leads to choppy user interfaces. By definition we don’t know how frequently an unexpected event will happen, nor how severe the event is. It would be a poor trade-off to use such an expensive method to learn that sometimes a string isn’t populated, or to learn about some other edge case that a user might not notice. If we know that the app is about to crash, then the cost of logging it matters less. Users hardly notice a little UI stuttering before a crash.

Analytics, on the other hand, don’t block the thread with IO. Analytics events are recorded in RAM and another process dispatches them to the server. There is a minimal impact on app performance, so it is a much better choice for an unexpected event. However, such analytics aren’t useful for crashes. Analytics events may be lost if the app crashes soon after recording them. That seems like a fair trade-off.
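To illustrate the general pattern (not how Answers is actually implemented), here is a sketch of an in-memory event buffer. BufferedEventRecorder and its upload closure are hypothetical. Recording returns immediately, and the buffered events exist only in RAM until they are flushed, which is why a crash can lose them:

```swift
import Dispatch

// Hypothetical recorder: buffers events in memory and hands them off in batches.
final class BufferedEventRecorder {
    private let queue = DispatchQueue(label: "example.event-buffer") // serial queue
    private var pending: [String] = []
    private let upload: ([String]) -> Void

    init(upload: @escaping ([String]) -> Void) {
        self.upload = upload
    }

    // Returns right away; the append happens later on the serial queue.
    func record(_ event: String) {
        queue.async {
            self.pending.append(event)
        }
    }

    // Drains the buffer and passes the batch to `upload`.
    // Uses `sync` here only so the example is easy to observe and test.
    func flush() {
        queue.sync {
            let batch = self.pending
            self.pending.removeAll()
            if !batch.isEmpty {
                self.upload(batch)
            }
        }
    }
}
```

Because the queue is serial, every record call made before flush is applied before the batch is drained.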

*Of course, it is possible to override Swift’s safety mechanisms. The places where Objective-C and Swift meet also can get complicated and prevent the compiler from spotting potential issues with nil values. Peer code review is one of our primary tools for avoiding those bugs.