Dissecting khash.h (Part 2: Scouting)

In which the author gets his bearings in the .h file

OK, so after Part 1, I’ve decided to go along with the newest implementation on github. This one is somewhat different than the one in potion but it’ll do. It also has a bit more documentation — specifically docs for the functions — which I’m going to ignore because it’s fun to discover stuff.

Let’s start from the highest level and go downwards from there. The code has an example of how to use the thing, and that must be a good place to start:

A bit arcane, but workable. Studying it, I don’t immediately get everything. Stuff I expect to be there is there — put, get, del, existence check, iteration. But it’s not like a python dictionary. There are two issues:

  1. The function signatures are weird for the things that I do know.
  2. There is other random stuff that I don’t know what it does.

Let’s ignore the hard part now and start with item 1.

  • First of all, why does everything take a 32? Perhaps this is a statically-sized hash table, in which case is that the size of the hash table? Perhaps. Or the size of each element, which seems to make a bit more sense to me. Anyway, why does it need to be reminded of that number every time?
  • Second, why does everything looks like it returns a khiter_t? This looks like some sort of iterator type. But why do I need an iterator if I’m doing a put, for instance? Because of that, the return value has to be passed in by reference (&ret). That’s OK I guess.
  • Third, why does it look like put() is missing a parameter? Put should need a key and a value. We have the mysterious 32, the hash it’s going to be looked up in, what I assume is the key, and then a parameter that the function will return a success / failure in. Where the heck is the value I want to store “in” the key?
  • Fourth, oh shit. What in the name of Dennis Ritchie’s beard is the kh_value()=blah stuff? I think I found the answer to the previous step though. Perhaps put() actually just marks an entry for a key. Then you update the value with kh_value. How the hell does one have a function call on the left hand side of an assignment statement in C???? More macro magic? Regardless, this would also explain why we need the iterator —to be able to set the value for the key we inserted with kh_put.

OK, so now this isn’t completely clear in my mind but at least I feel like I know a few details now.

With this, I think I’ve also solved most of item 2 also. We’ve figured out kh_iter and kh_value. So what about KHASH_MAP_INIT_INT? Why is this separate from kh_init?

The interesting thing about it is that it includes a type declaration (“char”). So our hash tables are fixed type. But what is the type for? For the key or the value? I could imagine it being for the key, and you could perhaps use pointers for the values, pointing out to external objects — though would they be typed or void pointers?

So the python equivalent of the above example would be something like:

Much cleaner. The main differences here are:

  1. The python version lacks typing for the hash — keys and values can be any type.
  2. The python version doesn’t have the concept of a key with no value, so the kh_exist() is redundant. You can sort of simulate it by saying the value is None.

OK, now we can start cheating a little bit by looking at the actual code. Looking further down the code into the init macros, we see that KHASH_MAP_INIT_INT takes a “name” parameter, which explains the mysterious 32. This is probably because the macros generate a different set of functions for each run of KHASH_MAP_INIT_INT, specific for the types passed in, etc. It’s a kind of hack to emulate generics in containers.

The type parameter is specified as “type of values” — in this case “char”. There seem to be different INIT macros for different key types, including INT, INT64 and STR, and they all call the same base INIT macro using different hash and equals functions based on the type of keys, which makes sense. In default cases when you don’t specify the type of values, it’s “char”.

Following the KHASH_MAP_INIT_INT down, it goes through:

  1. KHASH_MAP_INIT_INT(32, char)
  2. KHASH_INIT(name, khint32_t, char, 0, kh_int_hash_func, kh_int_hash_equal)
  3. KHASH_INIT2(name, static kh_inline klib_unused, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal)
  4. __KHASH_IMPL(name, SCOPE, khkey_t, khval_t, kh_is_map, __hash_func, __hash_equal)

Each invocation gets slightly deeper. The SCOPE parameter is interesting, it ends up evaluating to “static kh_inline klib_unused” which really is just “static inline”, plus a compiler-specific attribute that means that these functions could go unused, and the compiler shouldn’t generate warnings about it. That’s nice.

There is also KHASH_TYPE which is a typedef struct for the hashmap itself, generated each time per named hashmap. Here’s the first example we see of the name being used in the generated functions:

And then __KHASH_IMPL is basically the main body, where most of the actual code that contain the definitions of the get / put / etc is. Phew, we’re getting somewhere!

Leave me a comment if you enjoyed this - more coming soon!