A learning and exploratory analysis of the new
sync.Map type in Go 1.9.
Go 1.9 is out now and I was eager to get my hands on the new
sync.Map container that exists in the
sync package. First of all — why does this even exist and when should you use it? Unfortunately the answer is: you really should not use it unless a.) you’ve identified a bottleneck in your application with the pre-existing and type-safe map and b.) your use case matches the exact use case of which the new
sync.Map was designed for.
Let’s be real. If you reach for the new
sync.Map construct without having first identified and measured that you’ve got contention with the existing map implementation you might be trading some headaches and pain for more headaches and pain. A few guarantees go out the window when you use the hot new concurrent implementation.
First you lose the obvious one which is type safety. I don’t know about you but with the limited set of data structures provided in the Go standard library, the obvious choice for anything key/value based is your basic
map builtin. Let’s not trade type safety and a simple generic data structure for one that is not.
Also, if you read the docs on the iteration functionality of the new
sync.Map you’ll notice that some additional guarantees are thrown out the window in terms of the observed snapshot that you’ll see as you
Range over the collection of data.
Range does not necessarily correspond to any consistent snapshot of the Map’s contents
The next thing on the list that you may lose is…performance? In many cases, the existing implementation of the builtin
map is well performant enough and will many times beat the
sync.Map implementation hands down in sheer operations per time-unit. In fact if you’re able to leverage a builtin
map without the use of locking semantics you may well already have a winner in your hands.
And I bring up locking for a reason: The builtin
map has a dirty little secret that most of you should know by now. It’s not thread-safe (or concurrent safe) when reading to/writing to happens with more than one goroutine. This is why the standard library provides a few tools at your disposal in the form of synchronization primitives. You’ve got channels, you’ve got mutexes, you’ve got r/w mutexes and you’ve got atomic operations.
And, depending on your use case it’s not uncommon to reach for one or two of these constructs so that you can enjoy production code that is free of data-races. Which tools you use are beyond the scope of this article but let’s go back to the builtin
map when faced with concurrent updates of this data structure.
I like to choose my synchronization primitives carefully before I even think about writing code and you should too. First, if I’m not sharing any state, than I can write code that is completely void of synchronization primitives because I have chosen a
shared-nothing architecture. This usually is always the ideal case, but not always the reachable dream that it is.
Anything that is reasonably complex is usually always going to fall into some choke-point where code needs coordination and/or synchronization…that is unless you live in the purest of pure worlds (I’m looking at you Haskell). And here is where a common idiom shows up to wrap up the basic and builtin
map into a nice little package where synchronization is baked in.
In those cases where you are mostly doing mutation, you can reach for the tried and true
sync.Mutex. This data-structure will provide you with the necessary locking to protect a built-in
map from concurrent updates.
The problem with the
sync.Mutex arises however when you need to read from your builtin
map. Whether it’s getting a count of the number of elements, simply iterating over the map or just checking if a value exists in a map — all of those are read operations that technically should be allowed to make progress with other read operations so long as no mutation occurs.
sync.RWMutex where Go again has your back. The contract proposed here is that all your read operations must promise to take a read-lock:
RLock when they’re guaranteed to NOT mutate the state of the map. Conversely any operation that mutates the map must agree to take the regular write-lock:
Lock such that writers are guaranteed exclusivity under update scenarios.
The great thing with the
sync.RWMutex is we can help to reduce some contention when we have scenarios where high-read rates occur. At least, that was the hope…
Here’s an example of a builtin
map that has been wrapped in a similar API as that of the
Notice in the code above that
Load is the only receiver method that does not mutate the state of the internal
map. This is important because now
Load is free to take a read-lock now affording concurrent progress to those goroutines that are executing this method.
That was until under additional scrutiny and performance analysis of the standard lib, that various members of the Go team discovered less than ideal performance when a
sync.RWMutex was utilized in systems that were deployed on many, many cores. And by many I’m not really talking about your typical 8 or 16 core setup but primarily those server setups that are well beyond that. Where the number of cores really starts to show in the form of highly contentious code such as when the
sync.RWMutex is used.
And now we’ve arrived as to why the
sync.Map was created. The Go team identified situations in the standard lib where performance wasn’t great. There were cases where items were fetched from data structures wrapped in a
sync.RWMutex, under high read scenarios while deployed on very high multi-core setups and performance suffered considerably.
In fact, there’s a quite nice story about it here that was done as a lightning talk at the 2017 GopherCon: An overview of sync.Map. If you’re considering using the implementation please watch this talk as there are some possible performance traps with it. Also, this talk while short definitely goes much more in-depth of why this was created and where it was designed to shine.
Let’s examine the usage of the
sync.Map in the following code:
Notice that the API is clearly different from the regular built-in
map usage. Additionally, because this implementation of the map is not type-safe when we fetch (
Load) an item from the map we must do a type conversion…which may succeed or fail. Be careful.
In addition to
Store we also have methods for
Delete, LoadOrStore and finally
Range which was mentioned earlier. It is an exercise for the reader to understand the usage of those methods in detail.
So how does the
Well, my performance analysis is on-going currently. I’ve only begun to start analyzing this thing and I’m trying my best to write correct benchmarks that mimic different scenarios. Furthermore, I’m trying to find the target scenario that this container was designed for so the best performance case can be demonstrated for the ideal scenarios.
The documentation on the performance characteristics are as stated below:
It is optimized for use in concurrent loops with keys that are stable over time, and either few steady-state stores, or stores localized to one goroutine per key.
For use cases that do not share these attributes, it will likely have comparable or worse performance and worse type safety than an ordinary map paired with a read-write mutex.
So as I understand this above, I might have the chance to use this if I have an existing map that shows signs of contention where the keys are not constantly changing for the life of the application. To me this practically translates to high-read scenarios with relatively few updates to the map or where many updates may occur in bursty scenario.
The first benchmark shows how the regular builtin
map wrapped with a
RWMutex compares with the
Next we show a benchmark of
Next we show a benchmark of
Load operations, where the
Found case should always find a result in the maps. And the
NotFound case will almost always not find a result in the maps.
As you can see, with these benchmarks in many of the cases the regular built-in
map protected by a
RWMutex almost always wins out and so far and we haven’t introduced benchmarks with more than a single goroutine. Let’s keep going with this to see what happens when we write a benchmark that is designed to measure performance where scaling across cores becomes a factor.
Let’s fire up a 32-core High CPU box on Digital Ocean to see if any trends are revealed. For this I created a box in San Francisco, running Ubuntu 16.04.3 (x64) and I picked the most expensive system they have which costs a whopping $640 a month for this box. For me, running these benchmarks I paid for just maybe 2 hours worth so maybe just a couple bucks. This is not a commercial for Digital Ocean but I think their stack is pretty slick and they use Go internally so why not?
For this benchmark, I want to measure both map implementations under load in the ideal scenario at least for now. In this scenario I’ll setup both maps to store a random set of data and I’ll set GOMAXPROCS to something different on each variation of the benchmark and create N number of goroutine workers that match the GOMAXPROCS setting.
Again, I’ll be running these tests to mimic a high-read scenario where the set of keys is populated and fixed before the clock on the benchmark starts. On the next update of this post, I might create a benchmark to mimic perhaps an equal number of writes but for now let’s consider this a somewhat artificial but ideal use-case.
On with the show. Here’s how we’ll define our concurrent benchmarks but first a quick synopsis of the code:
The benchmarks are actually functions invoked by other functions where we have a
workerCount hardcoded. This allows us to configure the GOMAXPROCS and number of workers per benchmark.
To ensure we don’t hit Go’s complier optimizations, we capture the output of the
Load methods. We don’t necessarily care what the results are but we want to ensure that Go doesn’t perform any dead code elimination because returned results are unused.
The primary part of the code will kick off a goroutine as a worker which will iterate as far as necessary to satisfy Go’s benchmarking facility through the use of the
b.N value. With each goroutine in flight, we’ll perform our
Load operation and eventually signal that we’re done using a
As you can see right off the bat, we have great performance with our regular map which is protected by a
RWMutex. Things are moving along great until we start hitting 4 cores where not only contention becomes an issue but our scaling factor across cores as well. If you keep your eyes on the red line, once we start getting passed around 8 cores forget it. We have way too much read contention with a
RWMutex to the point where performance suffers greatly by the time we hit the 32 core mark.
The blue line however is the
sync.Map showing a fairly predicable behavior as we continue to scale across cores. I would say at this point that with my rudimentary testing and analysis we can see where the
sync.Map shines…in the ideal scenario it was built for.
So how is
sync.Map able to realize this performance? Well that is a story for another blog-post and explores more advanced topics such as lock-free programming and atomic operations at the cost of reasoning and readability. Maybe we’ll get into it…maybe we won’t as I’m not sure at this point.
If you’re still with me you might have questions about this benchmark or suggestions on improving it. Please help me make this benchmark better by forking it or creating issues: https://github.com/deckarep/sync-map-analysis.
There are additional test scenarios that I’d like to analyze and for this blog-post I wanted to start with a basic ideal scenario of what I think the
sync.Map was designed for. And at any rate, this was just an excuse to play with the new data structure and learn more about it.
Please leave feedback and comments and remember this overly used but still very relevant quote BEFORE you even think about using the
“Premature optimization is the root of all evil.” — Donald Knuth
Happy Go Programming!