If I’ve Told You Once…

I’m probably going to tell you a hundred more times

Ian Ni-Lewis
Google Developers

--

You know what computers are good at? Repetition. Almost every CPU optimization in the last fifty years — caches, branch prediction, SIMD — has been based on the assumption that if you did something once, you’re going to do it again and again.

So it can be really frustrating when API design doesn’t take repetition into account. Let me give you a real-world example.

StrictMode falls off a cliff

A few days ago my buddy Colt McAnlis pulled me in to help answer a question about performance in Android’s StrictMode class. He was looking at an app that ran perfectly on KitKat, but janked ferociously on Lollipop. Every thirty seconds or so, the app would freeze. I’m not talking about a dropped frame here — it would stop updating the UI for hundreds of milliseconds. It was maddening.

The app developer had already narrowed down the problem to a method in StrictMode called conditionallyCheckInstanceCounts. This method takes a list of classes and counts the number of live instances of each class. It’s useful for debugging, for instance to detect redundant instances of singleton classes. StrictMode runs this method approximately every thirty seconds.

Now, nobody expected StrictMode.conditionallyCheckInstanceCounts to be fast. The developer had only enabled it in debug builds. After all, in order to get an accurate count, it first had to run a GC pass with finalizers. But nobody expected it to take hundreds of milliseconds. It made it impossible to run the app in debug mode.

Hot on the trail of unnecessary work

Our first insight into the problem came from the logcat. We expected to see one expensive garbage collection pass every thirty seconds or so. But instead of one garbage collection, we saw something like sixty collections one right after the other.

Using the open-source code exploration tool OpenGrok, we traced through the code to find out where the extra GC calls were coming from. It turned out that we didn’t have to go too far. The problem was in the implementation of VMDebug.countInstancesOfClass(). Note the singular — “Class,” not “Classes.” This function only counts instances of one class, which means that StrictMode.conditionallyCheckInstanceCounts has to call it once for each class in the instance limit list. That’s problematic in two different ways.

The first problem is that clearly VMDebug and StrictMode don’t trust each other. Both conditionallyCheckInstanceCounts() and countInstancesOfClass() force an explicit garbage collection pass, so that work is done once for every class in the list plus one. There’s no point in forcing a GC more than once. It’s not like the counts are going to change, especially since the runtime holds a global mutex while it performs the count. And forced GCs aren’t cheap.

The second problem is a little deeper down. VMDebug.countInstancesOfClass() calls Heap::CountInstances(), and Heap::CountInstances() visits every live object in the heap. The visitor checks each object’s class, and if the class matches the class that countInstancesOfClass() passed it, it increments an instance count. This heap visit is repeated for every class.

You could argue that there’s a better way to enforce instance limits than by doing a heap walk every thirty seconds, but that’s beside the point. The real problem is that once again, it’s wasted work. If you’re going to visit every object in the list, do it once.

Design Better APIs

The lesson here is simple: if you do something once, there’s a good chance you’ll do it a hundred times — especially if it’s an operation on a common data type. When you design an API, you can either work with that fact or against it.

Typical OOP abstractions work against it. They put operations into class instance methods where, by definition, they only operate on one object at a time. Worse, because they don’t have a broader context, they often end up doing lots of redundant work. The extraneous GC passes in VMDebug.countInstancesOfClass() is one example. Another example is graphics routines that make redundant calls to GL state functions. It’s ironic that so many abstractions force code to make decisions about global state based on local context.

But you don’t have to break OOP encapsulation to write a good API. A static method is just as OOP as an instance method. I propose a simple rule of thumb that isn’t linked to or at odds with any one particular programming paradigm:

Don’t operate on one object if you can operate on many instead.

This especially applies if your API is doing any work that could potentially be shared across multiple iterations.

In my opinion, that’s where VMDebug goes off the rails — it knows it’s doing a full GC, and it knows that the cost of that GC could be amortized over multiple iterations of the instance counting routine, yet it still only operates on one object at a time.

“But,” you might object, “what if there really is only one object to operate on? Should we force callers to create a collection in that case?” Fair enough. While low-level languages like C don’t differentiate between passing an array and passing a single reference, higher-level languages like Java do. Fortunately, higher-level languages tend to support function overloads. You can write a version of your API that takes a collection, and another one that takes a single object. Let the calling function decide which version to use.

It didn’t have to be this way

The real tragedy in the StrictMode example is that Heap::CountInstances() was clearly written by someone who was aware of this rule, because it takes a list of classes as a formal parameter. It’s perfectly capable of counting instances for any number of classes all at once. But because VMDebug didn’t take advantage of that, we end up with this sad sequence of events:

Written out like that it seems silly, but that’s what you get. One poorly designed interface can turn into a mountain of wasted work.

Final Thoughts

The VMDebug example, where the wasted work totaled billions of cycles, is an extreme case. But the advice applies at all levels.

For instance, a high-performance math library might provide a simple operation, such as vector addition. That’s not an expensive operation at all. The SIMD vector units on many CPUs can pump out one vector addition every clock cycle. But when you add in the cost of the function call, the number of cycles it takes to move data to and from the vector unit, and the length of the vector unit’s pipeline, then the total cost of the operation is closer to a hundred cycles. Which means that if your vector addition API can take an array of vectors, it can potentially operate two orders of magnitude faster than if it only takes one.

So remember, computing is all about repetition. If I tell you something once, I’m going to tell you a hundred more times. Or a thousand, or a million, or a billion. Design your APIs with that in mind.

--

--