Hash maps that don’t hate you
The map data structure is incredibly useful when programming. The first time I used Perl, the thing that most impressed me was how easy it was to use hashes, as they call them. I used them for everything! Map is a central feature in Go too — it’s one of the built-in object types with syntax support and generics. I think most people who program in Python have a similar experience. They call them dictionaries, and they are everywhere. In JavaScript, too, every object is a sort of hash map. When I worked on V8 we had to go to huge lengths to hide the performance consequences of that decision, but I’m still a fan of easy-to-use hash maps in languages.
But all these languages initially had hash maps that were unordered. You can put all your key-value pairs in the hash map, but when you want to go through the content of your map you get them back in some strange order. It’s not the order you put them in, and it’s almost certainly not the order you wanted. Go and Perl decided to fix this by making hash ordering more random. JavaScript and Python went the other way and fixed the ordering to be the same order as insertion. (There’s one exception in JS objects — numeric property names are ordered numerically, not by insertion order. We got a lot of hate on V8’s bug tracker for that one!)
The best way to make hash maps efficiently insertion-ordered is some variation of what Python calls the “compact dictionary”, popularized/rediscovered by Raymond Hettinger. (According to Hettinger there is some prior art, implemented in Java, and it’s actually similar to the internal object implementation of V8, invented by Lars Bak, also one of Toit’s co-founders.) In a compact dictionary, the keys and values are appended to a dense linear array in insertion order, and a more conventional open-addressing hash table indexes into that array.
This is what a hash map looks like in Toit, too. In fact, the hash numbers in the index table are not needed, but they make lookup faster. Storing the last few bits of the hash code lets us detect most of the collisions in the index without having to look at the keys in the backing array. The other advantage of having hash codes in the index is that it speeds up rebuilding the index when it becomes too full.
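To make the layout concrete, here is a minimal Python sketch of the idea: a dense entries list in insertion order, plus a sparse index whose slots cache a few bits of each key’s hash so that most index collisions are rejected without touching the entries array. This is an invented illustration, not Toit’s or CPython’s actual code; resizing the index and deletion are omitted, so it only works for a handful of entries.

```python
EMPTY = -1  # marker for a free slot in the sparse index

class CompactDict:
    def __init__(self, index_size=8):     # index_size must be a power of two
        self.entries = []                 # dense [(key, value), ...] in insertion order
        self.index = [EMPTY] * index_size # sparse slots: EMPTY or (hash_bits, position)

    def _slots(self, h):
        # Open addressing with simple linear probing over the sparse index.
        mask = len(self.index) - 1
        i = h & mask
        while True:
            yield i
            i = (i + 1) & mask

    def _find(self, key):
        h = hash(key)
        frag = h & 0xFF                   # the cached hash fragment
        for slot in self._slots(h):
            entry = self.index[slot]
            if entry == EMPTY:
                return slot, None         # free slot found, key is absent
            stored_frag, pos = entry
            # Cheap rejection: compare hash fragments before the real keys,
            # so most collisions never touch the entries array.
            if stored_frag == frag and self.entries[pos][0] == key:
                return slot, pos

    def __setitem__(self, key, value):
        slot, pos = self._find(key)
        if pos is None:
            self.entries.append((key, value))
            self.index[slot] = (hash(key) & 0xFF, len(self.entries) - 1)
        else:
            self.entries[pos] = (key, value)

    def __getitem__(self, key):
        _, pos = self._find(key)
        if pos is None:
            raise KeyError(key)
        return self.entries[pos][1]

    def __iter__(self):                   # insertion-ordered iteration comes for free
        return (k for k, _ in self.entries)
```

Note that rebuilding the index never needs the keys themselves when the fragments are stored in the index, which is the speedup mentioned above.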
If the user removes an entry from a hash map of this type we replace the key with a marker that indicates this entry has been removed. Perhaps a little morbidly, we call these markers “tombstones”. When there are enough of these tombstones, the hash map gets rebuilt, but that’s OK. Due to the magic of amortized constant time, this doesn’t slow you down on average.
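A self-contained sketch of how tombstone deletion and the amortized rebuild can work, using a simplified ordered set of keys. The names and the rebuild threshold are invented for illustration; a plain dict stands in for the hash index. The point is that each rebuild is paid for by the deletions that caused it, which keeps deletion amortized constant time.

```python
TOMBSTONE = object()        # unique marker that can never be a real key

class OrderedSet:
    def __init__(self):
        self.backing = []   # keys in insertion order; holes are TOMBSTONE
        self.positions = {} # key -> position (stands in for the hash index)
        self.tombstones = 0

    def add(self, key):
        if key not in self.positions:
            self.positions[key] = len(self.backing)
            self.backing.append(key)

    def remove(self, key):
        pos = self.positions.pop(key)       # raises KeyError if absent
        self.backing[pos] = TOMBSTONE       # leave a tombstone, don't shift
        self.tombstones += 1
        # Rebuild once at least half the backing array is tombstones.
        if self.tombstones * 2 >= len(self.backing):
            self._rebuild()

    def _rebuild(self):
        # Compact the backing array, preserving insertion order.
        self.backing = [k for k in self.backing if k is not TOMBSTONE]
        self.positions = {k: i for i, k in enumerate(self.backing)}
        self.tombstones = 0

    def __iter__(self):
        return (k for k in self.backing if k is not TOMBSTONE)
```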
This is all very cool because now there’s an efficient way to make hash maps with insertion ordering. These are hash maps that don’t hate you, and that’s why we used them in the Dart project too, now probably best known as the engine behind Flutter. (Some of the Dart founders went on to found Toit.)
Other ways to know your hash maps don’t hate you
The other feature you quickly find yourself needing is the ability to use arbitrary objects as keys in a hash map. In languages where that doesn’t work, or only works for some types (Perl, Python, Go, JavaScript until recently), the fallback is to use strings as keys, generating a unique string for each key object. This isn’t very elegant, and your program creates a lot of strings that didn’t really need to exist.
Somewhat more advanced is being able to specify the equality function for the keys in a map. In Java and Toit, for example, you can specify a custom equals method for the class of the keys in a collection. In contrast, the Map implementation in JavaScript only uses object identity: you can’t construct a new object and then discover whether an equivalent key is already in the map. Again the inelegant workaround is to generate a string for each key object and use those strings as keys.
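Python illustrates the same per-type equality mechanism as Java’s equals/hashCode: define `__eq__` and `__hash__` and a freshly constructed, equivalent key finds the existing entry. The `Point` class here is an invented example.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        # Value equality: two Points with the same coordinates are equal.
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Must be consistent with __eq__: equal objects get equal hashes.
        return hash((self.x, self.y))

visited = {Point(1, 2): "first"}
# A brand-new object with the same coordinates is found by value...
assert Point(1, 2) in visited
# ...whereas a JavaScript Map, comparing object keys by identity, would not
# find it; there the workaround would be a key like the string "1,2".
```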
Going even further, it is sometimes necessary to specify the equality function per map instead of per-key type. This is available in C++ in the form of the Hash and Pred template parameters on unordered_map. (But as the name unordered_map indicates, the standard C++ collections hate you, so this advanced feature is little comfort.) Currently, we don’t support per-map equality functions in Toit, but it’s coming soon! In Java, you would have to wrap the keys in your map or use Trove.
The fly in the ointment and how to remove it
Even if your language has compact dictionaries, there’s still a tiny fly in the ointment. Under certain circumstances, your hash maps will still act as if they hate you, but in a new way! Some people like to use hash maps (and sets, which have the same implementation) as a sort of deduplicating work queue. You can add “tasks” to the set, and it will automatically ignore duplicates because it’s a set. A task queue (also called a FIFO) needs an append-to-the-end operation and a remove-from-the-start operation. The append-to-the-end operation we already have, and it’s easy to make a remove-first operation by looking at the start of the backing array. If the Toit language didn’t already have this function, you could add it as follows:
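In Python terms, such a remove-first might be sketched like this, assuming the backing is a dense insertion-ordered list with TOMBSTONE holes and a `positions` dict standing in for the hash index (all names invented; the real Toit method works on the set’s internal backing array):

```python
TOMBSTONE = object()  # marker left behind by deletions

def remove_first(backing, positions):
    """Remove and return the oldest live key in the set.

    Scans from the front of the backing array, past any tombstones,
    to find the first entry that is still live."""
    for pos, key in enumerate(backing):
        if key is not TOMBSTONE:
            backing[pos] = TOMBSTONE  # delete by leaving a tombstone
            del positions[key]        # and drop it from the index
            return key
    raise LookupError("set is empty")
```

Each call scans over any leading tombstones before reaching a live entry, which is exactly where the trouble described next comes from.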
The problem arises after a while of using a set or map like this. There will be a lot of tombstones near the start of the backing, and suddenly the remove-from-the-start operation isn’t so fast. It needs to skip over more and more tombstones, and the run time to insert-then-remove n items becomes quadratic. This issue can’t be solved by rebuilding the hash map more frequently, because if you do that, you lose the property of amortized constant time operations.
An ad-hoc solution would be to record for each hash map where the first entry is in the backing array. This feels like a band-aid though. What if users want to repeatedly remove the last entry using a set as a sort of deduplicating stack (or LIFO)? The hateful quadratic hash map problem would be back. What if they want to repeatedly remove the second entry in the set? One thing I learned on V8 is that it’s hard to predict what your users will do with a programming language implementation you have built.
And we built Toit to work on tiny devices. Do we really want to spend an extra word per hash map to record the number of leading tombstones? The solution we came up with was tombstones with distance markers. Each tombstone additionally records the distance to the next real entry, either on the left or the right. We update this lazily when scanning through the backing, so it doesn’t cost any time on other operations, but it lets us skip long stretches of tombstones wherever they appear.
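A sketch of the distance-marker idea in Python, simplified to rightward skips only (the representation is invented; the real Toit implementation also records distances to the left and stores them differently). A tombstone records how far to jump to get past a run of dead entries, and the distance is fixed up lazily as scans discover longer runs.

```python
class Tombstone:
    def __init__(self, skip=1):
        self.skip = skip  # distance to the next possibly-live entry

def first_live(backing, start=0):
    """Return the index of the first live entry at or after `start`,
    lazily updating tombstone skip distances along the way."""
    pos = start
    while pos < len(backing) and isinstance(backing[pos], Tombstone):
        pos += backing[pos].skip  # jump as far as this tombstone knows
    if start < pos and isinstance(backing[start], Tombstone):
        # Lazy update: the next scan from `start` skips the whole run at once.
        backing[start].skip = pos - start
    return pos
```

Because the update happens during scans that were needed anyway, other operations pay nothing, yet long tombstone runs are crossed in a single step on later scans.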
For the common case where there are no deletions, or the deletions are randomly distributed, nothing changes, but for the problematic cases we retain the desirable property of amortized constant time insertion and deletion.
Testing languages for this bug
I wrote a tiny program to test whether language implementations have quadratic slowdowns when using a set as a FIFO. I’m only testing languages that have an insertion-ordered set. For JavaScript I’m using the new(ish) Set, rather than abusing ordinary objects as hash maps. In Java I used LinkedHashMap, which has a completely different internal representation. I suspect LinkedHashMap of being fairly profligate in memory use (I’ll be blogging about this issue soon), but it’s the fastest insertion-ordered set I tested, especially when the adaptive JIT kicks in.
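Here is a rough Python version of such a test (my own sketch, not the author’s benchmark). A dict with ignored values stands in for an insertion-ordered set, since CPython’s built-in set does not guarantee ordering. If remove-from-the-start has to skip over an ever-growing run of leading tombstones, the time grows faster than linearly in n.

```python
import time

def churn(n):
    """Push n new items through a dict-used-as-FIFO; return elapsed seconds."""
    fifo = dict.fromkeys(range(n))  # preload keys 0..n-1 in insertion order
    start = time.perf_counter()
    for i in range(n, 2 * n):
        oldest = next(iter(fifo))   # peek the oldest entry
        del fifo[oldest]            # remove-from-the-start
        fifo[i] = None              # append-to-the-end (deduplicating)
    return time.perf_counter() - start

# If times grow much faster than n does, the implementation has the bug.
for n in (1_000, 4_000, 16_000):
    print(n, round(churn(n), 4))
```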
So here, for your enjoyment, is the graph for various languages. Currently, all the compact-dictionary-based implementations (except Toit) have this issue. Hopefully, this blog post will inspire the implementers to fix it.

