Kotlin @ 60FPS

Is Kotlin slow?

Collin Flynn
Livefront

--

With a language this fun to write, is there some cost that must be paid? Will I only realize it when it’s too late?

It’s good to have heuristics for the relatively costly operations in any given language, and to back them up with measurements. Below, we’ll take a look at some samples and the contexts of execution that make them interesting. If you’re writing Kotlin with performance concerns in some regime, you may find your expectations inverted.

[GIF: three fish below the surface of the water; a pointer drags across the screen, creating a ripple effect.]
Now that’s just silly. See the gist for an example of height buffer image refraction.

I acknowledge that Kotlin/Native is a thing, but this article will focus on the “home turf” for Kotlin: the JVM.

Firstly, it is worth noting that the vast majority of code is not on a hot code path where performance concerns override the maintainability, readability, and semantic joy in everyday Kotlin. Performance often isn’t a concern, right up until it’s the only concern.

So what’s a “hot” code path?

Hot code paths get executed near continuously, or represent a computationally expensive task that requires oodles of iterations. Even minuscule differences in efficiency inside a hot code path can quickly add up to performance gains or losses.

Examples of hot code paths include:

  • UI applications with a draw() loop. Keeping a UI thread unblocked and responsive is important, and updating visible state is typically a 60fps job.
  • Simulation State: When simulating, often states are computed from previous states. Games are sometimes CPU-bound for this reason.
  • Light post-processing on images or video. A custom GPU shader can speed this up, but it’s nice to avoid supporting the OpenGL ecosystem if you don’t have to.

Or countless other domain specific cases. Perhaps you have a scheduled job on a back-end service and you’d like to keep it lean, or a background task on a mobile device that should complete as soon as possible, to avoid being killed by the user or OS.

Hold up. The Garbage Collector Is A Deal-Breaker, Right?

There’s some dated conventional wisdom around garbage collected languages, so first we’re going to write a few small tests to check assumptions.

Allocations.

So the story goes, creating objects is the most expensive thing you can do. It’s not costless, and we’ll see that in a second, but first a quick test:
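A minimal sketch of such a test, assuming a small data class (Alpha and sink are hypothetical names of mine; writing each instance to a top-level var forces it to escape, so the allocation really happens):

import kotlin.system.measureNanoTime

// A small, hypothetical data class to allocate in a tight loop.
data class Alpha(val a: Int, val b: Int)

// Storing each instance in a top-level var makes it observable
// outside the loop, so the JIT can't elide the allocation.
var sink: Alpha? = null

fun main() {
    val iterations = 10_000_000
    val elapsed = measureNanoTime {
        for (i in 0 until iterations) {
            sink = Alpha(i, i + 1)
        }
    }
    println("${elapsed / 1_000_000} ms total, ~${elapsed / iterations} ns per iteration")
}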

Here we’re making a cool ten million allocations of an (admittedly small) data class. Before you keep reading, make a guess: does this function complete in over or under, say… 500 milliseconds?

For our purposes it’s most important to look at the relative cost between examples, but this will be useful to establish a baseline. Just for comparison, a 4K image has about 8.3 million pixels, so light post-processing of a single image is around the same order of magnitude.

Result? This function takes around 9 nanoseconds per loop iteration, or 90ms for the whole thing. Was that about what you expected?

What if I said we could cut that down to ~1.5 nanoseconds? This leads into a topic that affects any program running on the JVM.

Heap vs. Stack Allocations

It’s common knowledge that object allocations go on the Heap.
Just as a refresher, the Heap is where the JVM puts newly created objects, and frees memory when garbage collecting.

The Stack is the locus of program execution, where each stack frame has its own slots for local values. Stack access enjoys a high rate of cache hits. Cache misses are expensive, so allocating on the stack is a nice performance boost if you can constrain your variables to a small enough scope. There’s also no deallocation cost, as stack frames are freed with a pop. By comparison, Heap allocations have to pay coming and going: once to allocate space, perhaps with a cache miss, and again to zero out garbage data from previous allocations.

So, can you allocate to the Stack in Kotlin?

Not directly, but the compiler has some tricks. Let’s look at another example:
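A sketch with the same hypothetical Alpha, except this time the instance never leaves the loop body (the checksum just keeps the loop from being eliminated entirely):

import kotlin.system.measureNanoTime

data class Alpha(val a: Int, val b: Int)

fun main() {
    val iterations = 10_000_000
    var checksum = 0L
    val elapsed = measureNanoTime {
        for (i in 0 until iterations) {
            // alpha never escapes this block, so escape analysis can
            // replace the heap allocation with plain stack values.
            val alpha = Alpha(i, i + 1)
            checksum += alpha.a
        }
    }
    println("~${elapsed / iterations} ns per iteration (checksum $checksum)")
}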

Here we see the same line instantiating an instance of Alpha, ten million times. In the previous example it took ~90ms. Is this loop faster or slower than that? By how much?

Given what I just said about stack allocation, what do you think a compiler would see when it reads this loop?

The answer is in Escape Analysis: the reasoning through which the JIT compiler proves to itself that the heap allocation of Alpha is unnecessary in this example. It can run functionally the same logic without the heap allocation, and the resulting compiled code takes about ~1.5 nanoseconds per loop iteration instead of the previous ~9ns.

But wait. At some point we have to make heap allocations, and they all have to be paid for in a garbage collection. Isn’t de-allocation the real problem?

Deallocation

To collect garbage, the GC has to compact the heap — that is, put the free bytes in a contiguous block by moving the allocated ones next to each other. That’s a bunch of copy operations.

Interestingly, a reliably high percentage of objects (92–98%) are very short-lived on the heap, which allows for a clever separation of long-lived and short-lived objects — making that compaction phase cheaper.

The “old” and “young” generational separation is why your short-lived objects don’t always cost much to deallocate. In the common case for short-lived objects (like in our first code snippet), deallocation is just moving a heap pointer.

Allocating millions of short-lived objects isn’t nearly as expensive as you’d think.

But there are limits, right?
Yes, obviously.

You can observe this with workloads that fragment the heap, maximizing that copy/compact phase while constantly demanding to use the newly freed space. For example, imagine you have a large, nested data structure:

val objects: HashMap<String, List<List<String>>>

If you continuously add, remove, and replace entries in that collection, expect the garbage collector to be working overtime.
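As a hypothetical illustration of that kind of churn:

val objects = HashMap<String, List<List<String>>>()

// Every pass throws away one nested structure and allocates a fresh one of a
// different size, leaving irregular holes for the collector to compact.
fun churn(keys: List<String>, pass: Int) {
    for (key in keys) {
        objects.remove(key)
        objects[key] = List(pass % 100 + 1) { row ->
            List(pass % 50 + 1) { col -> "cell-$row-$col" }
        }
    }
}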

With that, I want to walk through some very simple techniques for diagnosing performance issues in a hot code path.

Simple Techniques

Three simple ideas.

1) Measure

It’s hard to overstate this.
Sometimes when you make an “optimization”, the compiler spits out exactly the same bytecode with no performance change whatsoever. Other times the “optimization” actually results in a slowdown — this usually indicates that the compiler was previously able to prove to itself some shortcut condition, and you’ve now violated that proof and forced it to go the long way around.

As far as making the measurements, use hardware and an environment as close to your target as possible. Try to at least establish the comparative cost between alternatives, if not the absolute value.

If you have multiple operations inside a hot code path, test each one in isolation (perhaps with a loop of 10B iterations) and get a sense of proportional cost. Focus your effort on the slowest routines.

It might be the case that each routine is relatively efficient in isolation, but together they combine to thrash the heap. As noted above, the garbage collector can cruise along just fine with short-lived allocations, as long as their contiguous space in the heap is uninterrupted. Benchmark each routine, and then benchmark them in integration to confirm a performance gain/loss.
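For quick comparisons, a naive helper like this hypothetical one can establish proportional costs; for anything rigorous, a harness such as JMH handles warmup and dead-code pitfalls for you:

import kotlin.system.measureNanoTime

// Naive micro-benchmark: warm the JIT up first, then average several runs.
inline fun benchmark(label: String, warmups: Int = 5, runs: Int = 10, block: () -> Unit) {
    repeat(warmups) { block() }
    val averageNanos = (1..runs).sumOf { measureNanoTime { block() } } / runs
    println("$label: ${averageNanos / 1_000_000.0} ms average")
}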

2) Pre-Allocate

If you can, pre-allocate the space that your computation needs outside of the hot code path. Keep it allocated and reuse it for each run.

Perhaps you need a large bitmap to process an image (like in the gif at the top of this article), or maybe you’re processing audio with the last 30 seconds of real-time capture. Get the space you need up-front, and keep it in the old generation of the Heap.
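As a hypothetical sketch of that shape (RippleProcessor and its buffer are illustrative names):

// Allocate working memory once, outside the hot path, and reuse it per frame.
class RippleProcessor(width: Int, height: Int) {
    // Lives as long as the processor does, so it settles into the old
    // generation instead of churning the young one every frame.
    private val heightBuffer = FloatArray(width * height)

    fun processFrame(pixels: IntArray) {
        for (i in heightBuffer.indices) {
            heightBuffer[i] *= 0.98f // e.g., dampen last frame's ripple
        }
        // ... refract pixels using heightBuffer, writing back in place ...
    }
}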

And finally, what if you’ve pre-allocated, you’re not thrashing the heap, each subroutine in your hot code path is reasonably fast in isolation, and it’s still slow? Is it time to write C++?

Maybe, but try this first:

3) Memory Layout

Think about the operations you imagine are fastest in Kotlin:

get()
set()

Getters and setters. Those are lightning fast, right? As far as no-cost abstractions go, these may qualify.
The same is often true for chaining them together:

get().get().get().set()

This amounts to following a few pointers on the heap. A hash map accessor is a similar case:

hashMap[key]

But what if you’re chaining them all together several times per loop on a very hot code path? What if you’re doing this a lot?

No allocations, but still slow?
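Something like this hypothetical chain of pre-allocated objects (Alpha, Beta, and Gamma are illustrative names):

class Gamma(var value: Int)
class Beta(val gamma: Gamma)
class Alpha(val beta: Beta)

fun hotLoop(alphas: List<Alpha>) {
    for (alpha in alphas) {
        // Each step of the chain is another pointer to chase on the heap,
        // and potentially another cache miss.
        alpha.beta.gamma.value += 1
    }
}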

Assume everything here is pre-allocated. All we’re doing is moving around a few numbers. Is this something you’d typically label as costly, given just the innards of the loop? What if the getters are slightly more complex than standard (perhaps they have backing fields, delegates, or state assertions)? What are you supposed to do if these few operations add up to a janky slow-down?

Oh yeah, data structures.

This is something you might see more in games, but imagine the case where you have a large collection of objects (say, game entities), and they all have internal pointers like the previous example of alpha, gamma, and beta.

Each entity has some position in 2D space,

class Position(var x: Float, var y: Float)

class GameEntity(
    var pos: Position
)

And suppose you have some operation to move all game entities upwards a bit — perhaps the water level is rising and everything ascends in your simulation.
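A sketch of what that pass might look like over the object layout above (elevate is a hypothetical name):

fun elevate(entities: List<GameEntity>, amount: Float) {
    for (entity in entities) {
        // One pointer chase per entity just to nudge a single float.
        entity.pos.y += amount
    }
}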

Everything in my game wears water wings.

This is a compact example, but imagine each positional lookup were piped through functions or data structures that compound the number of accessor methods needed for each loop. This is a plausible performance concern.

Can we re-arrange the input data to speed up this method?

Try this:
Stack each positional x/y value into a fixed arrangement:

Instead of a List<GameEntity>, a vertically stacked buffer.
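A hypothetical sketch of that layout (EntityPositions is my own name for it):

// Structure-of-arrays: all x values in one flat primitive array, all y
// values in another, instead of one object per entity.
class EntityPositions(capacity: Int) {
    val xs = FloatArray(capacity)
    val ys = FloatArray(capacity)
}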

You can still pre-allocate as much space as you’d like, but a method like elevate(), where a single operation is applied to a large collection of entities, now just modifies the contents of a single array in one pass.
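Sketched against the hypothetical EntityPositions above:

fun elevate(positions: EntityPositions, amount: Float) {
    // One contiguous pass over a primitive array: cache-friendly,
    // no per-entity pointer chasing.
    for (i in positions.ys.indices) {
        positions.ys[i] += amount
    }
}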

It shouldn’t be a surprise that GPU inputs are often structured as contiguous blocks of homogeneous data (arrays) rather than mixed pointers and values. Supposing you want to parallelize something like this, the arrangement divides neatly into chunks of work.

So, is Kotlin slow?

If you’ve read this far, you know the answer is contextual and, more importantly, measurable. Concerns about value boxing, allocations, nullable types, and capturing lambdas depend on execution context, and in the common case they fall within reach of compiler optimizations. When it matters to you, take measurements and try the simple techniques above.

Collin writes Kotlin and does occasional performance tuning at Livefront.
