Iterative Optimization on Hot Paths in Go Apps
--
At Samsara, we provide real-time data to our customers while ingesting millions of events per minute from thousands of connected devices. The software that powers this is thunder, an open-source GraphQL framework built in-house. Thunder consists of multiple parts, including a SQL generator called sqlgen. This post covers how we added a feature to sqlgen, the 33% increase in allocations and 66% increase in memory usage that resulted, and how we returned to our baseline numbers after a round of optimizations.
Building a new feature: Adding support for JSON columns
What is sqlgen?
sqlgen is a lightweight pseudo-ORM that maps database tables to their Go representations and reduces boilerplate code. Before we dive into the feature we added or its design, let's establish a base-level understanding of sqlgen.
A given table and its corresponding Go model are represented in sqlgen like this:
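As a minimal sketch, assume a `users` table with `id` and `name` columns. The model and registration would look roughly like the following; the registration call mirrors thunder's public sqlgen API, though details may differ from our internal setup:

```go
package models

import "github.com/samsarahq/thunder/sqlgen"

// User mirrors a hypothetical `users` table with `id` and `name` columns.
type User struct {
	Id   int64
	Name string
}

// NewSchema registers the model so sqlgen knows how to build queries for it.
func NewSchema() *sqlgen.Schema {
	schema := sqlgen.NewSchema()
	schema.MustRegisterType("users", sqlgen.UniqueId, User{})
	return schema
}
```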
Storing JSON in the database
On the product infrastructure team, part of what we do is identify points of friction that prevent our engineering team from using good practices, and eliminate them. One such pattern we saw was using JSON blobs to store data in the database. sqlgen didn't have an easy way to do this, so people were forced to do something like this:
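Here is a sketch of that pattern. The names (`FeatureConfig`, `Feature`) and the DB helper calls are illustrative rather than our actual code:

```go
package models

import (
	"context"
	"encoding/json"

	"github.com/samsarahq/thunder/sqlgen"
)

// FeatureConfig is the configuration we actually want to work with in Go.
type FeatureConfig struct {
	Enabled   bool
	Threshold int64
}

// Feature is the sqlgen model: the config has to travel as a raw JSON blob,
// a second representation of the same data.
type Feature struct {
	Id        int64
	RawConfig []byte
}

// Update has to re-encode the config on every write.
func Update(ctx context.Context, db *sqlgen.DB, feature *Feature, config FeatureConfig) error {
	raw, err := json.Marshal(config)
	if err != nil {
		return err
	}
	feature.RawConfig = raw
	return db.UpdateRow(ctx, feature)
}

// ById has to decode the blob on every read.
func ById(ctx context.Context, db *sqlgen.DB, id int64) (*Feature, FeatureConfig, error) {
	var feature *Feature
	if err := db.QueryRow(ctx, &feature, sqlgen.Filter{"id": id}, nil); err != nil {
		return nil, FeatureConfig{}, err
	}
	var config FeatureConfig
	if err := json.Unmarshal(feature.RawConfig, &config); err != nil {
		return nil, FeatureConfig{}, err
	}
	return feature, config, nil
}
```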
As you can see, the simple act of encoding and decoding some configuration data added a second representation for the configuration, as well as an additional step on any read or write paths (in this case, Update
and ById
).
We instead wanted our API to be simple and to use one representation for data outside of the database. What we wanted was something like this:
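Something along these lines, continuing the sketch above (the `,json` tag spelling is illustrative of the feature, and the DB helper is the same hypothetical one as before):

```go
// Feature stores the typed config directly; sqlgen (de)serializes it for us.
type Feature struct {
	Id     int64
	Config FeatureConfig `sql:",json"` // illustrative tag: persisted as a JSON column
}

// Update no longer needs a manual marshalling step, and reads no longer need
// a manual unmarshalling step.
func Update(ctx context.Context, db *sqlgen.DB, feature *Feature, config FeatureConfig) error {
	feature.Config = config
	return db.UpdateRow(ctx, feature)
}
```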
Planning it out: Wrangling Go interfaces
To add JSON support in sqlgen, we had to wrap our (de)serialization layers to allow for arbitrary transformations. Thankfully, Go's sql package has support for this.
In the statement below, there are two serialization directions being traveled:
- Go → SQL (id)
- SQL → Go (name)
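A hypothetical statement that exercises both directions, using the standard database/sql API:

```go
// nameById: `id` travels Go → SQL as a query parameter, and `name` travels
// SQL → Go when Scan populates it from the result row.
func nameById(db *sql.DB, id int64) (string, error) {
	var name string
	err := db.QueryRow("SELECT name FROM users WHERE id = ?", id).Scan(&name)
	return name, err
}
```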
Go → SQL
When id is serialized into a SQL value by a SQL driver's parameter converter, the driver will automatically handle conversion for the driver.Value types (int64, float64, bool, []byte, string, and time.Time), as well as for anything implementing the driver.Valuer interface.
SQL → Go
Similarly, on the read path, Scan supports pointers to the driver.Value types, as well as the sql.Scanner interface.
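For reference, these are the two standard-library interfaces involved:

```go
// database/sql/driver: the write path (Go → SQL).
type Valuer interface {
	Value() (Value, error)
}

// database/sql: the read path (SQL → Go).
type Scanner interface {
	Scan(src interface{}) error
}
```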
Armed with this information, we now knew what interfaces to implement to handle JSON (de)serialization, as well as what types to work with!
After discussion and prototyping, we landed on this design for sqlgen's type system:
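A compilable sketch of the idea follows; the names are simplified and the bodies illustrative rather than the exact thunder code:

```go
package sqlgen

import (
	"database/sql/driver"
	"encoding/json"
	"fmt"
	"reflect"
)

// Descriptor holds the type information for a single column, including
// whether the field is stored as a JSON blob.
type Descriptor struct {
	Type reflect.Type
	Json bool
}

// Valuer turns a struct field into a driver.Value on the write path.
type Valuer struct {
	descriptor *Descriptor
	value      reflect.Value
}

// Valuer builds a Valuer proxy for a field value.
func (d Descriptor) Valuer(value reflect.Value) Valuer {
	return Valuer{descriptor: &d, value: value}
}

// Value implements driver.Valuer, JSON-encoding the field when required.
func (v Valuer) Value() (driver.Value, error) {
	if v.descriptor.Json {
		return json.Marshal(v.value.Interface())
	}
	return v.value.Interface(), nil
}

// Scanner turns a driver.Value back into a struct field on the read path.
type Scanner struct {
	descriptor *Descriptor
	value      reflect.Value
}

// Scanner builds a Scanner proxy that decodes into a fresh scratch value.
func (d Descriptor) Scanner() *Scanner {
	return &Scanner{descriptor: &d, value: reflect.New(d.Type).Elem()}
}

// Scan implements sql.Scanner.
func (s *Scanner) Scan(src interface{}) error {
	if s.descriptor.Json {
		bytes, ok := src.([]byte)
		if !ok {
			return fmt.Errorf("cannot JSON-decode %T", src)
		}
		return json.Unmarshal(bytes, s.value.Addr().Interface())
	}
	s.value.Set(reflect.ValueOf(src).Convert(s.descriptor.Type))
	return nil
}

// CopyTo moves the scanned scratch value into the destination struct field.
func (s *Scanner) CopyTo(dest reflect.Value) {
	dest.Set(s.value)
}
```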
The Descriptor keeps track of all the type information needed, and provides the ability to generate both Scanner and Valuer proxies for values. The Valuer translates a struct value into a driver.Value. Finally, the Scanner translates a driver.Value back into our struct value.
Making it real: Identifying and fixing performance issues
We implemented the plan above, and added tests and benchmarks. We then deployed this change to our canary GraphQL server to gauge its performance under a real workload. As it turns out, there’s no data like production data. The code was unacceptably inefficient when it came to memory allocation.
Our benchmarks had indicated a 33% increase in allocations. We deemed this potentially acceptable if the real-world impact wasn't drastic. However, our canary test indicated a spike of 66% in memory usage over time. It was time to go back to the drawing board.
Using pprof to find memory problems
The first thing we did was take a look at some of our profiling tools. For local profiling, we highly recommend checking out this post on the official Go blog. At Samsara, we automatically run pprof when our servers are under higher than expected load. This provides us with a representative insight into our memory allocations.
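For a generic sketch of the setup (not our exact configuration), the net/http/pprof package exposes profiles from a running server:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// In a real server this runs alongside the application; here it just
	// exposes the profiling endpoints on a side port.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

From there, something like `go tool pprof -svg http://localhost:6060/debug/pprof/heap` renders the heap profile as an SVG like the ones we looked at, and flags such as `-inuse_space` select which sample type to report.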
Taking a look at an SVG representation of our pprof reports, we can see a few hints of where the problem might be. In our inuse_space report, a new 3% of allocations comes from Descriptor.Scanner, and another 2.8% from Scanner.Scan.
These numbers might seem low, but around 75% of our heap is consumed by our cache, so a few percent of the total is a large share of the remaining short-lived allocations. We suspect that these additional allocations are putting a strain on Go's garbage collector. We also think that we can provide better hints to Go's escape analysis to prevent some of these allocations.
Starting with a benchmark
A great starting point when trying to find issues or improve performance is to write a benchmark. This is something Go makes very easy. For this task, we created two benchmarks. Our Go benchmark, an integration test of our CRUD path in sqlgen, served as our micro-benchmark. Our macro-benchmark was a test server against which we ran simulated traffic using a tool we built specifically for this.
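As a rough illustration, the micro-benchmark looked something like the following, where setupTestDB and readUsers are hypothetical stand-ins for our test fixtures and sqlgen's read path. Running it with `go test -bench=. -benchmem` produces the B/op and allocs/op columns shown below:

```go
func BenchmarkRead(b *testing.B) {
	db := setupTestDB(b) // hypothetical helper: spins up a seeded test database
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := readUsers(db); err != nil { // hypothetical helper: sqlgen read path
			b.Fatal(err)
		}
	}
}
```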
Our original allocations looked something like this:
| Name | Times | Memory used | Allocations |
| --- | --- | --- | --- |
| Benchmark/Read-8 | 5000 | 1766 B/op | 45 allocs/op |
| Benchmark/Create-8 | 2000 | 1104 B/op | 23 allocs/op |
| Benchmark/Update-8 | 2000 | 1488 B/op | 30 allocs/op |
| Benchmark/Delete-8 | 2000 | 592 B/op | 16 allocs/op |
whereas our new code had 33% more allocations on the read path:
| Name | Times | Memory used | Allocations |
| --- | --- | --- | --- |
| Benchmark/Read-8 | 5000 | 2239 B/op | 60 allocs/op |
| Benchmark/Create-8 | 2000 | 1568 B/op | 39 allocs/op |
| Benchmark/Update-8 | 2000 | 1976 B/op | 49 allocs/op |
| Benchmark/Delete-8 | 2000 | 1000 B/op | 28 allocs/op |
When running a simulated load of read-only traffic against the test server, we were able to reproduce the 66% increase in memory caused by these allocations. We set our goal: reduce allocations and get within 10% of the original memory usage. Our main focus would be the read path, as this is our hottest code path.
Optimization 1: Pass by value
The first optimization became clear from just looking at the code. Scanner and Valuer hold onto Descriptor pointers. However, our methods for creating new Valuers and Scanners didn't use *Descriptor. Go functions are pass by value, meaning all values passed into functions, even pointers, are copies. Because we were copying the value, rather than a pointer to the value, we were allocating an entirely new Descriptor on the heap for each Valuer and Scanner.
By making a few quick tweaks:
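In terms of the sketch above, the tweak is to take and store a *Descriptor instead of copying the struct; the before and after are shown together for comparison, and the Scanner factory gets the same treatment:

```go
// Before: the value receiver copies the whole Descriptor on every call, and
// the copy escapes to the heap when we take its address.
func (d Descriptor) Valuer(value reflect.Value) Valuer {
	return Valuer{descriptor: &d, value: value}
}

// After: the pointer receiver shares the single long-lived Descriptor.
func (d *Descriptor) Valuer(value reflect.Value) Valuer {
	return Valuer{descriptor: d, value: value}
}
```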
we were able to significantly reduce both the objects and bytes allocated:
| Name | Times | Memory used | Allocations |
| --- | --- | --- | --- |
| Benchmark/Read-8 | 5000 | 1994 B/op | 52 allocs/op |
| Benchmark/Create-8 | 3000 | 1376 B/op | 35 allocs/op |
| Benchmark/Update-8 | 3000 | 1784 B/op | 45 allocs/op |
| Benchmark/Delete-8 | 2000 | 808 B/op | 24 allocs/op |
and achieve a 10% decrease in memory usage, bringing us to 50% higher than baseline (from 66%).
Optimization 2: Cutting out the middleperson
We noticed in our pprof reports that our reflect.New allocations had jumped by 15%. Our initial approach was to initialize Descriptor.Scanner with a reflect.New value, which we would move to our model struct with a CopyTo method. By crossing a couple of method boundaries, however, we were causing our model's intermediary values to be heap allocated.
We decided on a new approach: allocating directly into the final model struct.
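Continuing the sketch, the Scanner stops allocating its own scratch value and instead exposes a way to point it at the model's field:

```go
// The Scanner no longer allocates a scratch value up front.
func (d *Descriptor) Scanner() *Scanner {
	return &Scanner{descriptor: d}
}

// Target points the Scanner directly at the destination struct field, so
// Scan decodes into the model rather than into an intermediate value.
func (s *Scanner) Target(value reflect.Value) {
	s.value = value
}
```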
Rather than scanning into a new value and then copying it to our model, we now target our final destination directly:
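Roughly, again in terms of the sketch and for a single column (the helper names are hypothetical):

```go
// Before: scan into the Scanner's scratch value, then copy it into the model.
func scanColumnOld(rows *sql.Rows, descriptor *Descriptor, field reflect.Value) error {
	scanner := descriptor.Scanner()
	if err := rows.Scan(scanner); err != nil {
		return err
	}
	scanner.CopyTo(field)
	return nil
}

// After: aim the Scanner at the model's field and scan straight into it.
func scanColumnNew(rows *sql.Rows, descriptor *Descriptor, field reflect.Value) error {
	scanner := descriptor.Scanner()
	scanner.Target(field)
	return rows.Scan(scanner)
}
```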
By cutting out the middleperson, we were able to shave off an additional allocation per column on a model.
| Name | Times | Memory used | Allocations |
| --- | --- | --- | --- |
| Benchmark/Read-8 | 3000 | 1908 B/op | 47 allocs/op |
| Benchmark/Create-8 | 2000 | 1376 B/op | 35 allocs/op |
| Benchmark/Update-8 | 2000 | 1784 B/op | 45 allocs/op |
| Benchmark/Delete-8 | 2000 | 808 B/op | 24 allocs/op |
This resulted in another memory decrease of 7.5%, bringing us to 39% above baseline. Steady progress towards our goal!
Optimization 3: Re-use of allocations
pprof also tells us that about 3% of our non-cache memory is being used for Descriptor.Scanner. If we look at our code, we can see that our Scanners are escaping to the heap, since we are passing them to rows.Scan. Go's standard library has sync.Pool, an API that allows us to re-use allocations, only creating new ones when the pool has nothing available to hand out (for instance, when it's being accessed concurrently).
What do we know? We know that our columns are almost always accessed together. We also suspect that we rarely de-serialize data concurrently, since I/O timing tends to spread accesses apart. So, we should be able to create a sync.Pool per Table and re-use the same scanners.
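A sketch of what that could look like, with a hypothetical Table type holding one pool of column Scanners:

```go
// Table groups the column Descriptors for one registered model.
type Table struct {
	Columns     []*Descriptor
	scannerPool sync.Pool
}

func NewTable(columns []*Descriptor) *Table {
	t := &Table{Columns: columns}
	t.scannerPool = sync.Pool{
		// New is only called when the pool has nothing to hand out, e.g. on
		// first use or under concurrent access.
		New: func() interface{} {
			scanners := make([]*Scanner, len(columns))
			for i, d := range columns {
				scanners[i] = d.Scanner()
			}
			return scanners
		},
	}
	return t
}

// Scanners borrows a reusable slice of column Scanners from the pool.
func (t *Table) Scanners() []*Scanner {
	return t.scannerPool.Get().([]*Scanner)
}

// PutScanners returns the slice once the row has been copied out.
func (t *Table) PutScanners(scanners []*Scanner) {
	t.scannerPool.Put(scanners)
}
```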
If we run our benchmarks, we’re immediately down 4 allocations per benchmark row. This makes sense as our benchmark isn’t concurrent, so we should be re-using the pool 100% of the time. When we run our traffic simulation, we see CPU and response times stay the same, and get an 8% dip in memory usage (28% above baseline). Another step closer to our goal.
Optimization 4: Preventing escape of complex data types
At this point, we have gone through the hints provided by pprof and made optimizations where we can. We have mostly focused on the Scanner, as it's the most obvious candidate for read-path optimizations.
However, our other Valuer abstraction is something we haven't really inspected, and it's used extensively on the read path when making WHERE clauses.
It's worth noting here that interface{} acts like a pointer across function boundaries. That means that when we allocate our Valuer's value, it's going to escape onto the heap.
Because we pass interface{} values to our SQL driver, we know that a value has to be allocated on the heap. However, our Valuer is more expensive than a plain driver.Value would be, as it includes at least the wrapped value itself and a Descriptor reference.
What we can do here is make a change so that we only allow the simpler data types to escape to the heap:
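One way to do that, sketched with a hypothetical helper: resolve each Valuer to its plain driver.Value before handing arguments to the driver, so that only the small primitive value (an int64, string, []byte, and so on) escapes rather than the whole wrapper:

```go
// whereArgs builds the argument list for a WHERE clause.
func whereArgs(columns []*Descriptor, fields []reflect.Value) ([]interface{}, error) {
	args := make([]interface{}, len(columns))
	for i, d := range columns {
		// Previously: args[i] = d.Valuer(fields[i]), which made the whole Valuer
		// (Descriptor pointer plus reflect.Value) escape to the heap.
		value, err := d.Valuer(fields[i]).Value()
		if err != nil {
			return nil, err
		}
		args[i] = value // now only the plain driver.Value escapes
	}
	return args, nil
}
```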
and verify with our benchmarks:
| Name | Times | Memory used | Allocations |
| --- | --- | --- | --- |
| Benchmark/Read-8 | 5000 | 1683 B/op | 43 allocs/op |
| Benchmark/Create-8 | 2000 | 1120 B/op | 26 allocs/op |
| Benchmark/Update-8 | 2000 | 1496 B/op | 32 allocs/op |
| Benchmark/Delete-8 | 2000 | 808 B/op | 25 allocs/op |
We don't see a decrease in our read benchmark, which matches our expectations, since allocations should only go down when WHERE clauses are included. What we can do is add an additional "read where" benchmark:
| Name | Times | Time | Memory used | Allocations |
| --- | --- | --- | --- | --- |
| Benchmark/Read_Where-8 (before) | 3000 | 506274 ns/op | 2394 B/op | 58 allocs/op |
| Benchmark/Read_Where-8 (after) | 2000 | 589746 ns/op | 2346 B/op | 56 allocs/op |
Each filter value causes a 2-alloc decrease. But what’s the impact with real traffic? We run our traffic simulation again to find…
A whopping 17% decrease in memory usage, which means we are now using only 5.6% more memory than our baseline. Mission accomplished!
Other optimizations
There are other optimizations we considered but did not implement. We could make our benchmarks more informative by comparing them to using a basic SQL driver. We could also pre-compute custom types on initial run, saving CPU cycles on subsequent code paths.
There is probably a plethora of other optimizations we can make. It’s always a trade-off between moving fast and making code fast. Since we’d already met our goal of getting within 10% of original memory usage, we were happy to call it for now.
Moving forward
Since making these allocation improvements, we did another production canary release to test how the new implementation performs. The CPU usage and memory consumption were virtually identical to our master branch.
Since starting down this optimization path, we decreased our GraphQL server’s memory usage by 50% to match what it was before adding the new feature. We even brought down read path allocations, which should result in less garbage collection over time.
Most importantly, we were able to ship sqlgen support for JSON fields!
What did we learn?
Building and optimizing this feature taught us:
- Heap allocations are expensive. Repetitive, short-lived heap allocations are especially expensive, even when the values themselves are small, since they take time to be garbage collected.
- Wrapping values on a hot path, combined with the above, can be a problem.
- sync.Pool can help us re-use allocations on hot paths.
- Benchmarks are useful, and establishing baselines upfront can help avoid performance regressions.
- There is no replacement for production data.
Over to you
As programmers, we find battle stories like this one to be great learning experiences, both for those involved and for those we tell them to. Do you have your own story to tell? We'd love to hear it in the comments below!
If you want to try out our open-source GraphQL framework, you can find thunder on GitHub. We’d love to hear about your experience with it too.
Finally, if you’re looking for a job, Samsara has many more challenges like this one and we’re hiring!
Special thanks to everyone who reviewed the code in question: Changping Chen, Jelle van den Hooff, Stephen Wan & Will Hughes. And an extra special thanks to the main editor of this post, Kavya Joshi.