When, what and how to optimize (Go) software

Yuri Brito
Wildlife Studios Tech Blog
14 min read · Apr 30, 2020

I haven’t met a developer who purposefully builds non-performant software. On the contrary, premature optimization is one of the most common pitfalls we have to be aware of at all times.

If I had to pick just two of the possible undesired consequences of premature optimization, I’d choose:

  • Acquired complexity that is hard to change later on
  • Time wasted with non-problems

Acquired complexity that is hard to change later on

You’re building a brand new system and want it to behave incredibly well for your users when it hits production. Time to pick the database. Let’s go with the one that all the latest benchmarks show scales the most; we’d have to be crazy not to, right?

Turns out this is a completely new thing to you. It’s hard to be aware of all the caveats at this point, so they start to show up one after the other. Maybe weeks have passed, and it’s possible you’ll start to design or even limit the software around this choice.

Cassandra, for example, is an incredible choice when you really need to support a massive throughput of writes and they need to be real-time. But you shouldn’t expect CQL, its query language, to have the same capabilities as Postgres’ SQL. With CQL you need at least one equality condition in WHERE clauses, and it doesn’t support not-equals; data modeling is very different depending on whether you choose Cassandra, Mongo, or Postgres.

Significant time and effort are required to code all the particular business logic of your system in a way that works with the chosen database. If we come to regret any of these decisions in some months, it’s possible that this will be a huge pain for everyone involved.

Time wasted with non-problems

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. — Donald Knuth

Being constantly on the lookout for optimization opportunities incurs a non-negligible time cost. It’s hard to know ahead of time if a non-trivial improvement is really worth it, so you might well end up tackling non-problems.

One way to fall into this trap, for example, is searching for the most performant library to solve the problem at hand, even if you don’t have stringent performance requirements. There are many json parsing libraries that compete with encoding/json, and many are faster at it. Using them is also a bit more involved: you usually have to be aware of memory reuse imposed by these libraries, have to explicitly parse each key with a function specific to its type, and lose the handy struct tagging that makes encoding/decoding so straightforward with encoding/json.

For that reason, it’s important to figure out strategies to know when to do it.

When

Throughout this article, I assume that you care about observability and you’re keeping track of many important metrics of your system. So an easy answer for when is: any of the key metrics is in an odd state, either unusually high or low. The cause could be many things, and some of them will require code inspection and optimization.

Continuing with the database example from above, it’s much better to make a decision you know will be hard to change if you have information showing that the probability of regret is low. If you’re completely guessing, take the easy road. If you do less, you will have less to replace later on.

There are two moments to optimize: in anticipation of a problem (during system design or based on trends in metrics), or as a response to issues that already exist.

It’s not black or white, but each of the items below for when aims to represent one of these two moments, respectively.

  • As a result of estimates for usage patterns and how they will evolve over time
  • Due to compliance with system requirements or good user experience

As a result of estimates for usage patterns and how they will evolve over time

Handling 100 requests per second (RPS) in a backend system is very different than handling 10,000 RPS. The strain exerted on the system also varies wildly with the kind of request we’re talking about. An operation that talks to many different external systems to create records is very different than one that depends on a single cache with a very high hit rate to return read-only data, for example.

Imagine we want to add a social feature to one of our games where players can add other players as friends, and all recent relevant events from those friends show up in a new section we’re adding.

Thinking about some aspects of your system can guide you through the difficult task of figuring out the best moment to worry about performance. Let’s try this exercise.

  • What are the different operations in the system?
  1. Add friends; triggered by a player
  2. Get the timeline of events; triggered by a player
  3. Add player game-related relevant event; triggered by the game
  • How can we split them between reads and writes?

Reads:

  1. Get the timeline of events

Writes:

  1. Add friends
  2. Add player game-related relevant event
  • What’s the expected throughput for each?
  1. Add friends

Suppose the game has 1MM daily active players, we expect 10% of them to engage with the social feature, and on average they will add 5 new friends a day. If we spread this rate evenly throughout the day, we’ll have close to 6 RPS.

2. Get the timeline of events

Now, on average, this engaged player base will probably check the timeline 2 times an hour and go up to the fifth page of events, so 10 requests per user per hour, or 240 per user per day. Rounding up, this is 280 RPS, or 16,800 RPM.

3. Add player game-related relevant event

Now, the entire active player base is included here. On average, each player has 1 new relevant event per hour. This also comes out to roughly 280 RPS.

  • How will this throughput evolve over time?

We expect a larger share of our player base to engage. Optimistically it’ll double within a year. The behavior of engaged players might not change significantly.
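Putting the three estimates together, here’s a quick back-of-the-envelope sketch of the arithmetic as a tiny Go program. All figures come straight from the assumptions above; nothing new is introduced.

```go
package main

import "fmt"

func main() {
	const (
		dailyActive   = 1_000_000 // 1MM daily active players
		engagedShare  = 0.10      // 10% engage with the social feature
		secondsPerDay = 24 * 60 * 60
	)
	engaged := dailyActive * engagedShare // 100,000 engaged players

	addFriends := engaged * 5 / secondsPerDay             // 5 adds per day each    -> ~6 RPS
	getTimeline := engaged * 24 * 10 / secondsPerDay      // 10 requests/hour each  -> ~280 RPS
	addEvent := float64(dailyActive) * 24 / secondsPerDay // 1 event/hour per player -> ~280 RPS

	fmt.Printf("add friends:  %.1f RPS\n", addFriends)
	fmt.Printf("get timeline: %.1f RPS\n", getTimeline)
	fmt.Printf("add event:    %.1f RPS\n", addEvent)
}
```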

Now that we have answered those questions, we know that the design choices we make only need to accommodate these estimates.

Due to compliance with system requirements or good user experience

Sometimes you’ll have stringent requirements. For example, if the latency of some operation is above a predetermined threshold, bad things will happen. Or there are some paths of major importance to your end-users, so it’s best to make them as fast as possible. But when should you start or stop optimizing? Will weeks of effort to reduce latency from 500ms to 100ms be noticeable? In many cases, probably. What about from 40ms to 20ms? It really depends; in many situations this won’t make any real difference.

In our previous example, depending on the approach, maybe only “get the timeline of events” will have a noticeable impact on the players’ perception of responsiveness.

Other factors might also affect the general system. It’s possible that memory usage has been growing over time. This could cause two things: 1) risk of crashes due to “out of memory” errors, and 2) increased garbage collection time in garbage-collected languages like Go.

The first is more apparent. It’s even possible that crashes are already happening. In some cases, there’s not a lot to do development-wise and you just have to spin up more replicas; in other cases, it’s completely possible to cut out unneeded intermediate state.

In the second case, you could be growing long-lived state, or you could just be creating garbage at a higher rate (think increased throughput). More garbage is also more work for the garbage collector. In Go, this means collection will happen more frequently and could take more time.

What

You know you have performance issues that must be tackled. You might even have some clues about what the bottlenecks or improvement opportunities are. Now you have to come up with concrete ideas of what to change in your software.

The best ideas are backed by data. To know the state of your code, or how a change will affect it relative to the current version, you have to measure! There are many ways to do this. Below are the four I like the most.

  • USE method
  • Go’s benchmark testing
  • Go’s pprof
  • Load testing

USE method

A powerful strategy that relies on utilization, saturation, and errors (USE) metrics of any system. This can be translated into many different forms depending on what you’re analyzing. For example, when a high volume of parallel operations depends on the same mutex, the system might suffer from lock contention. Utilization, in this case, could be the time the lock was held, and saturation the number of operations waiting on this lock.
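To make that concrete, here’s a minimal sketch (my own illustration, not from the original article) of instrumenting a mutex with USE-style metrics: utilization as the cumulative time the lock was held, saturation as the number of goroutines currently waiting for it.

```go
package usemetrics

import (
	"sync"
	"sync/atomic"
	"time"
)

// instrumentedMutex wraps sync.Mutex and tracks USE-style signals:
// heldNanos approximates utilization, waiting approximates saturation.
type instrumentedMutex struct {
	mu        sync.Mutex
	waiting   int64 // saturation: goroutines currently blocked in Lock
	heldNanos int64 // utilization: cumulative time the lock was held
	lockedAt  time.Time
}

func (m *instrumentedMutex) Lock() {
	atomic.AddInt64(&m.waiting, 1)
	m.mu.Lock()
	atomic.AddInt64(&m.waiting, -1)
	m.lockedAt = time.Now()
}

func (m *instrumentedMutex) Unlock() {
	atomic.AddInt64(&m.heldNanos, int64(time.Since(m.lockedAt)))
	m.mu.Unlock()
}

// Waiting and HeldTime would typically be exported to your metrics system.
func (m *instrumentedMutex) Waiting() int64 { return atomic.LoadInt64(&m.waiting) }

func (m *instrumentedMutex) HeldTime() time.Duration {
	return time.Duration(atomic.LoadInt64(&m.heldNanos))
}
```

In practice, a Go service would more likely get this signal from pprof’s mutex profile (pprof is covered below) or from existing runtime metrics rather than a hand-rolled wrapper, but the mapping to utilization and saturation stays the same.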

It’s also easy to apply USE to CPU. High utilization is expected to result in saturation: tasks get preempted frequently and many operations end up waiting for some CPU time of their own.

However, saturation might happen even when utilization seems low. The reason is that every time-series visualization aggregates over a time window: the wider the window, the more it can hide abrupt variations that occurred during only a small part of that time.

Take a look at the two graphs below of the exact same metric, over the same time period, viewed through different time windows. The first one has p95 data averaged over 5 minutes, and the second uses 20 seconds.

The system is still reasonably well behaved but has arbitrary moments with peaks of 2x latency that don’t show up when looking at larger time windows.

p95 of a real production system endpoint latency averaged over 5min time window
Same metric, same period but with a 20s time window

Only looking at utilization and saturation is not enough. In some situations, a higher than usual error rate in a set of operations can lead you to the root cause of a performance issue.

Go’s benchmark testing

Go makes it very easy to evaluate how a block of code performs. You can know not only the time cost of an operation but also how many allocations it did and how much memory it allocated.

It’s still not without caveats. Compiler optimizations are applied when building benchmarks, so you must be aware of ways to prevent integral parts of your test from being optimized away.

This kind of test is very useful to catch regressions in parts that are important to keep fast.

Example of benchmark comparing JSON parsing with encoding/json vs github.com/buger/jsonparser
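The original code embed isn’t reproduced here, but a benchmark along these lines might look like the sketch below. genPayload is a hypothetical helper that builds a small JSON document, and it runs inside the loop, as discussed further down.

```go
package parsing

import (
	"encoding/json"
	"fmt"
	"testing"

	"github.com/buger/jsonparser"
)

// genPayload is a hypothetical helper that builds a small JSON document.
func genPayload(i int) []byte {
	return []byte(fmt.Sprintf(`{"id": %d, "name": "player-%d"}`, i, i))
}

type player struct {
	ID   int64  `json:"id"`
	Name string `json:"name"`
}

func BenchmarkEncodingJSON(b *testing.B) {
	for i := 0; i < b.N; i++ {
		payload := genPayload(i) // payload generation inside the loop
		var p player
		if err := json.Unmarshal(payload, &p); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkJSONParser(b *testing.B) {
	for i := 0; i < b.N; i++ {
		payload := genPayload(i)
		// With jsonparser, each key is extracted with a type-specific function.
		if _, err := jsonparser.GetInt(payload, "id"); err != nil {
			b.Fatal(err)
		}
		if _, err := jsonparser.GetString(payload, "name"); err != nil {
			b.Fatal(err)
		}
	}
}

// BenchmarkGenPayload isolates the fixed cost shared by the two benchmarks above.
func BenchmarkGenPayload(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = genPayload(i)
	}
}
```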

In the example above, we’re parsing the same payload format with two different json libraries.

For this specific case, buger/jsonparser is more than three times faster than encoding/json, doing fewer allocations while at it. If parsing json is a big part of your hot path and you need to do it faster, this kind of test might be something to consider.

It’s important to understand what overlap in fixed costs exists between different tests as well. In this case, I’ve made another one that does just the payload generation, since it happens inside the for loop we’re benchmarking.

You can try to find ways to make these shared operations less relevant in comparison to the part of interest. Payload generation could be done outside of the loop, and we could cycle through our sample space, which still isn’t free but is not as costly.

Go’s pprof

Pprof is a rich profiling tool that is easy to integrate into any Go program and is also safe to run for short periods of time in production environments.

A safe way to use it in production, outside of http.DefaultServeMux, is to give it its own HTTP server. Do not open this address to the external world.
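A sketch of what that might look like (the port and handler layout are illustrative choices, not from the original embed):

```go
package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

func main() {
	// Register the pprof handlers on a dedicated mux instead of
	// http.DefaultServeMux, so they never leak into the public server.
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.Handle("/debug/pprof/heap", pprof.Handler("heap"))

	// Bind to localhost (or an internal-only interface); never expose it publicly.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", mux))
	}()

	// ... the rest of the application serves its public traffic elsewhere ...
	select {}
}
```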

From its documentation page, you’ll notice that there are several types of profiles available. The ones I most frequently use are heap and cpu; the latter is just the default profile route.

Mixing pprof and benchmark tests is a possibility. All you have to do is use the -memprofile and -cpuprofile flags, as in: go test -bench=. -benchmem -memprofile mem.out -cpuprofile cpu.out.

To analyze the output: go tool pprof cpu.out.

Learning more about this tool has great value; this is a good getting-started guide.

Load testing

Seeing how your system behaves under artificial load can be useful to identify bottlenecks and scalability risks.

Ideally, you’ll run the target isolated from as much external noise as possible. Meaning that it should not be running alongside other unrelated processes.

As a result of your load tests, you’ll have information on how the software handles the amount of load exerted: 1) CPU usage, 2) memory usage, 3) average and percentile latencies, and whatever other metrics are key for you.

These techniques are not, and should not be, mutually exclusive. You can add artificial load to a program while running pprof and analyze results through the USE method optic as well.

If you want to load test HTTP services, vegeta might be a handy tool to use.
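vegeta can be driven from the command line or used as a Go library. Here’s a sketch of the library usage, roughly following its README; the target URL, rate, and duration are placeholders, not values from the article.

```go
package main

import (
	"fmt"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// Fire 100 requests per second for 30 seconds at a placeholder endpoint.
	rate := vegeta.Rate{Freq: 100, Per: time.Second}
	duration := 30 * time.Second
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "GET",
		URL:    "http://localhost:8080/timeline",
	})

	attacker := vegeta.NewAttacker()
	var metrics vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, duration, "timeline load test") {
		metrics.Add(res)
	}
	metrics.Close()

	fmt.Printf("p95 latency: %s, success rate: %.2f%%\n",
		metrics.Latencies.P95, metrics.Success*100)
}
```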

How

Depending on what the discoveries were so far, we’ll be able to broadly categorize the issues that were found. Below is a list of common types of performance problems and scenarios or strategies for how to tackle them.

Expensive lookup

Either through sheer query complexity, or that combined with a high frequency of lookups.

  • Caching

While not a silver bullet, since highly dynamic content poses a challenge to caching, it can be a night and day difference in many situations.

In-memory cache is the simplest approach and should be chosen when:

  • The order of magnitude of the data is feasible to store in RAM
  • The data is quasi-static and/or can have a low TTL
  • The cost to refresh from the source is low

Otherwise, an external cache is a safer bet, since you can scale its storage independently from your replicas and you’ll have only one place to invalidate state when it comes to that.
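As an illustration, here’s a minimal sketch of the in-memory flavor with a per-entry TTL. It’s a hypothetical example; production code would likely reach for an existing library or the external cache discussed above.

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     interface{}
	expiresAt time.Time
}

// Cache is a minimal in-memory cache with per-entry TTL, guarded by a RWMutex.
type Cache struct {
	mu   sync.RWMutex
	data map[string]entry
	ttl  time.Duration
}

func New(ttl time.Duration) *Cache {
	return &Cache{data: make(map[string]entry), ttl: ttl}
}

func (c *Cache) Get(key string) (interface{}, bool) {
	c.mu.RLock()
	e, ok := c.data[key]
	c.mu.RUnlock()
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false // miss or expired: the caller refreshes from the source
	}
	return e.value, true
}

func (c *Cache) Set(key string, value interface{}) {
	c.mu.Lock()
	c.data[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
}
```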

Pressure on the garbage collector

This is a specific case of “bad resource utilization” that deserves its own class of solutions.

Go is a garbage-collected language, meaning that from time to time its garbage collector will mark and sweep the memory that the program no longer needs. Marking can be very CPU intensive when you have a large collection of references living on the heap.

A backend server that handles a large throughput of operations is very likely to do a lot of temporary allocation. And the way Go paces the GC is, by default, to trigger the next collection when the heap reaches 2x the live size measured during the previous mark phase.

So relatively small fluctuations in total occupied RAM by current standards, hundreds of megabytes for example, can make the GC trigger several times a second if they involve many references and happen at a fast rate.

  • Memory ballast

A way to trade space for performance is to pay a fixed memory cost that raises the amount of garbage you’re willing to hold before the next GC cycle is triggered.

You could start your program with a reference that allocates a chosen amount of virtual memory, which won’t immediately translate into physical memory allocation. Then, if you make your ballast 1GB and your software only used 100MB before, your garbage can grow to at least 1GB before it’s collected.
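The original code embed isn’t reproduced here; a sketch of the idea, using a 1GB ballast to match the figure above, could look like this:

```go
package main

import "runtime"

func main() {
	// Allocate a 1GiB ballast. The bytes are never written, so the OS doesn't
	// immediately back them with physical memory, but the Go runtime counts
	// them as live heap when pacing the next GC cycle.
	ballast := make([]byte, 1<<30)

	run() // hypothetical stand-in for the rest of the program

	// Referencing the ballast after the real work keeps it alive for the
	// whole run instead of letting it be optimized out or collected early.
	runtime.KeepAlive(ballast)
}

func run() { /* ... */ }
```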

A final reference to the ballast (the runtime.KeepAlive call in the sketch above) is needed to prevent the compiler from optimizing out the otherwise unused ballast variable.

  • Object pool

If you have to do large amounts of allocation of short-lived objects of the same type, it’s possible to create a pool to borrow from instead of always reallocating.

While very handy, object pooling requires careful manipulation. Since the soul of this technique is reuse, always assume that something retrieved from the pool is dirty and reset it to a clean state before use.
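A minimal sketch using sync.Pool from the standard library, with bytes.Buffer standing in for whatever short-lived object you allocate a lot of:

```go
package pool

import (
	"bytes"
	"sync"
)

var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func render(payload []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // always assume a pooled object is dirty: reset before use
	defer bufPool.Put(buf)

	buf.Write(payload)
	return buf.String()
}
```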

  • Fewer allocations

If you have an operation that is heavy on allocations and it turns out it can do without many of them, this is the best medicine.

A way to figure out how many allocations are being done, and to compare alternatives, is the already mentioned benchmark test.

Ever growing memory utilization

This kind of problem can cause the program to use so much memory that the operating system has none left and has to kill the program when it asks for more: the dreaded “out of memory” error.

  • Leak of no longer used objects

The good old memory leak. This requires a discovery process and a good starting point is pprof’s heap profiling. You can compare the output of a cold process with one that’s been running for a while.

Here is a good reference on how to read the heap profile.

  • Go map

Objects of the map type in Go have non-obvious behavior when it comes to deletion. Calling delete(m, key) doesn’t make the map shrink.

If your application uses one or more highly dynamic maps where you’re frequently inserting and deleting entries, you might end up using a lot more memory than you might expect.

Unfortunately, I don’t know an elegant solution to this problem. One thing to try is rebuilding the map from a snapshot from time to time, which might not be desirable depending on how many elements it holds.
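A sketch of the rebuild approach (illustrative; the key and value types are arbitrary):

```go
package mapshrink

// rebuild copies the surviving entries into a fresh map so the memory held
// by the old map's buckets can eventually be reclaimed by the GC once the
// old map is no longer referenced.
func rebuild(old map[string]int) map[string]int {
	fresh := make(map[string]int, len(old))
	for k, v := range old {
		fresh[k] = v
	}
	return fresh
}
```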

  • Higher throughput

It’s expected that when your application is doing more work than it used to, it’ll also need more resources, memory included.

The object pool strategy mentioned before is very valuable here. In situations where you’re dealing with many instances of the same ephemeral operation occurring over time, you’re able to extensively reuse temporary objects, preventing most of them from becoming garbage that would only be collected after some delta of time.

This has limited applicability. In several situations, the easiest way to solve this is to rely on the horizontal scale.

High latency

When timing problems arise, you’ll have to either look inward at your application or change the perspective to its ecosystem.

  • Too much parallelism

Goroutines are lightweight, and in many situations you’ll be able to have thousands running without a problem.

Depending on the way you’re using them, however, you might be deteriorating overall responsiveness and turning it into an unpredictable mess.

A hugely exaggerated example of nested parallelism
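The original embed isn’t reproduced here; a sketch of the shape it describes, with hypothetical expand and fetch helpers standing in for real work, is below.

```go
package nested

// expand and fetch are hypothetical stand-ins for real work.
func expand(item string) []string { return []string{item + "/a", item + "/b"} }

func fetch(s string) {}

// handle spawns one goroutine per user-supplied item, and each of those
// spawns more: the total number of goroutines is unbounded and driven
// entirely by the size of the input.
func handle(items []string) {
	for _, item := range items {
		go func(it string) {
			for _, s := range expand(it) {
				go fetch(s)
			}
		}(item)
	}
}
```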

Scenario 1) Trigger a number of goroutines that depends on a list of highly variable length provided by the user

Your system might behave well in the average case, but it has the potential for damaging peaks of unresponsiveness, possibly followed by eventual “out of memory” crashes, if the parallel calls handle a considerable amount of data.

Scenario 2) There’s no variance in the number of goroutines triggered by the root operation, but throughput is on an increasing trend.

It’s possible that the application will deteriorate in a more well-behaved manner in this scenario. Eventually, though, the number of goroutines running simultaneously will still impact the performance of their neighbors.

For both scenarios, the best solution is to be mindful of what should be parallelized and how to prevent unbounded goroutines from spinning up.

If you end up having to limit parallelism, one way to achieve this is by working with a fixed pool of goroutines and synchronizing calls and results through channels.
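A sketch of that pattern, with a hypothetical process function as the unit of work:

```go
package workers

import "sync"

// process is a hypothetical unit of work.
func process(job int) int { return job * 2 }

// RunPool bounds parallelism to a fixed number of worker goroutines that
// read jobs from one channel and write results to another.
func RunPool(jobs []int, workers int) []int {
	jobCh := make(chan int)
	resCh := make(chan int, len(jobs))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobCh {
				resCh <- process(j)
			}
		}()
	}

	for _, j := range jobs {
		jobCh <- j
	}
	close(jobCh)
	wg.Wait()
	close(resCh)

	results := make([]int, 0, len(jobs))
	for r := range resCh {
		results = append(results, r)
	}
	return results
}
```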

  • Too little parallelism

Now we’re talking about the other side of the same coin. You have a set of independent operations, they still run sequentially, and each of them has a similar time cost that, when aggregated, is non-trivial. An incredible optimization opportunity.

The simplest way to achieve this is by leveraging sync.WaitGroup and goroutines.
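A sketch of that approach, with two hypothetical independent calls (fetchProfile and fetchEvents) running concurrently instead of sequentially:

```go
package fanout

import "sync"

// fetchProfile and fetchEvents are hypothetical stand-ins for independent,
// similarly priced operations.
func fetchProfile(id string) string { return "profile:" + id }

func fetchEvents(id string) string { return "events:" + id }

func Load(id string) (profile, events string) {
	var wg sync.WaitGroup
	wg.Add(2)

	go func() {
		defer wg.Done()
		profile = fetchProfile(id)
	}()
	go func() {
		defer wg.Done()
		events = fetchEvents(id)
	}()

	wg.Wait() // total latency is roughly the slowest call, not the sum of both
	return profile, events
}
```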

  • Slow dependencies

When routine tasks handled by third-party libraries are not that light anymore, it might be time to search for an alternative that better fits your new requirements.

We already saw a very simple example comparing encoding/json versus github.com/buger/jsonparser on time and allocations. This same idea applies to many different areas, and you always have to be aware of the catch.

For example, github.com/valyala/fasthttp promises to handle more throughput than net/http while being faster at it. But it doesn’t support HTTP/2.

Finishing words

Of course, there’s much more to each of these three guiding topics than what we had a chance to cover here. The most important takeaway is the awareness that not all optimization opportunities are worth pursuing; for those that are, I hope this article added some new tricks to your toolset.
