What does a hash function look like?
You send data into a function and different data comes out on the other side. That’s what a function is: a machine.
Well, there’s a type of function that has a special property: knowing the output doesn’t give you much information about what the input was. This is a hash function.
It’s deterministic, trivial to compute, hard to reverse engineer. That’s it. Oh, and the output is of a fixed length.
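To make that concrete, here's a minimal sketch of those properties in Python, using the standard library's SHA-256 (any modern hash function would behave the same way):

```python
import hashlib

def h(data: bytes) -> str:
    # Deterministic, cheap to compute, and always 64 hex characters,
    # regardless of the input's length.
    return hashlib.sha256(data).hexdigest()

print(h(b"hello"))                          # the same input, the same output
print(len(h(b"x")), len(h(b"x" * 10_000)))  # 64 64: fixed-length output
```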
The Question
So, you may be wondering, what does the input-output space of a hash function look like?
Of course, any modern, useful hash function has a space so gigantic that we can’t actually hold all the hashes in all the memory on earth, let alone visualize them. Perhaps this is a fool’s errand, but there may be some insights we can deduce about the space without actually looking at it.
First of all, we’ve got to define the question a little. It doesn’t make much sense as-is, because you can throw an input of any length into a hash function.
So, to simplify things, we’ll restrict the inputs to strings that look like outputs: same alphabet, same fixed length. In one way this is actually a very natural way to analyze the space of all hashes, because it makes the input space exactly the same size as the output space.
Once we assume this limitation — all inputs are hashes — we can start to wonder what the space looks like.
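Here's one hypothetical way to build such a restricted toy function; this is my own construction for illustration, not necessarily the one behind the figures below. It hashes the input with SHA-256 and re-encodes the digest into N base-36 characters, so inputs and outputs live in the same space of 36**N strings:

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase  # 0-9 then a-z: 36 characters
N = 2  # characters per toy hash, so the space holds 36**2 = 1,296 hashes

def toy_hash(s: str) -> str:
    # Hash the input, then re-encode part of the digest in base 36 so the
    # output is an N-character string drawn from the same alphabet.
    digest = int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")
    chars = []
    for _ in range(N):
        digest, r = divmod(digest, 36)
        chars.append(ALPHABET[r])
    return "".join(chars)

print(toy_hash("a7"))  # always another 2-character base-36 string
```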
The Shape
There are a couple of things we can deduce. Since every input leads to one and only one output, we can imagine a line, a hash chain, where each output is fed back in as the next input.
However, since the space of all possible hashes is finite, at some point a hash must lead to a hash previously discovered in the chain. So we’re guaranteed to find a loop.
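We can watch a chain fall into its loop with a short walk, building on the hypothetical toy_hash above:

```python
def walk_to_loop(start: str):
    # Repeatedly hash, recording each node's position in the chain, until we
    # revisit a node. The finite space guarantees this terminates.
    seen = {}
    chain = [start]
    node = start
    while node not in seen:
        seen[node] = len(chain) - 1
        node = toy_hash(node)
        chain.append(node)
    tail_length = seen[node]       # steps taken before entering the loop
    loop = chain[seen[node]:-1]    # the cycle itself, listed once
    return tail_length, loop

tail, loop = walk_to_loop("00")
print(f"tail of {tail} nodes, then a loop of {len(loop)} nodes")
```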
However, there’s no guarantee the loop will include every possible hash. In fact, a “perfect” function like that, one giant cycle visiting the whole space with no collisions, would be difficult to engineer. Rather, the space would most likely look something more like this (forgive me for the lack of direction, but imagine the arrows pointing toward the loop).
Let’s take a moment to explore this. Notice there are terminal ends: hashes that appear as inputs but are never produced as outputs anywhere in the space. Notice also that there are collisions: two different inputs producing the same output. The number of terminal nodes (2) equals the number of collisions (2), and that’s no coincidence: every node has exactly one outgoing arrow, so n nodes send out exactly n arrows, and each terminal node that receives no arrow forces some other node to receive an extra one, i.e. a collision. Finally, notice that there is always a guaranteed loop, even if that loop is a single hash that produces itself.
It kind of looks like a Directed Acyclic Graph (DAG), except that at what could be considered the origin there’s a guaranteed cycle. Considering the loop as a single node, the origin node, it is a DAG. All hashes, if hashed enough times, lead to the loop.
Let's notice one more thing before we see some examples. There’s no reason the hash space couldn’t be made up of multiple loops, but each of these loops would necessarily be independent of one another. This follows directly from the fact that every input has only one output (produces one hash deterministically), though every output can have multiple inputs that lead to it (collisions).
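That deduction is easy to check by brute force on a toy space: walk from every node and record each distinct cycle encountered. A sketch, reusing walk_to_loop from above:

```python
from itertools import product

def all_loops():
    # Walk from every node in the 36**N space and record each cycle once.
    # Two cycles can never share a node (each node has exactly one outgoing
    # edge), so checking a single member of each found loop is enough.
    loops, nodes_on_loops = [], set()
    for chars in product(ALPHABET, repeat=N):
        _, loop = walk_to_loop("".join(chars))
        if loop[0] not in nodes_on_loops:
            loops.append(loop)
            nodes_on_loops.update(loop)
    return loops

loops = all_loops()
print(len(loops), "independent loops with sizes", sorted(len(l) for l in loops))
```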
Anyway, that’s pretty fun. Maybe we should see this in action.
Examples
We’ll play with a 1-digit, a 2-digit, and a 3-digit hash function to get a feel for this. I have produced a notebook, which I’ll make available after I clean it up a bit; the images below come from it.
First, let’s look at the 1-digit hash function. I explored its entire space so it could be displayed. A digit here can be any of the 26 lowercase letters or 10 numeric digits, so 36 characters in all.
Such a small space had only one loop.
However, as we discussed above, the loop doesn’t have to include every node of the hash space. This one contains 12 of the 36 nodes: exactly one third.
Things are a bit more interesting when the ‘tails’ are also included in the graph. Here, in blue, you can see all the other nodes which are not included in the loop.
Notice every blue node represents one collision. Consider the y node: y leads to 9, just as 7 does, so it collides with 7. Furthermore, h and q collide on a node that isn’t itself in the loop: 2. And the longest tail, the longest chain containing no loop nodes, is 3 nodes long.
Ok, on to the 2-digit toy hash function. Remember, each hash now has 2 digits, and every combination is explored. The whole space consists of 36*36, which is 1,296 hashes, or nodes.
We’ll inspect the loops first.
Interestingly, and perhaps naturally, there isn’t just one loop anymore. There are several, the smallest being 1e, which leads to 08, which leads back to 1e. We might also find it interesting that the sizes of the loops appear to follow a power-law distribution: here the largest is well over twice the size of the second largest.
Also fascinating is how few nodes sit on these loops: 16 out of 1,296, or 1.23%. Just by increasing the number of digits in the space, we dropped the share of loop nodes from 33 percent to roughly 1 percent.
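With the sketches above (and N = 2), the same kind of statistics fall out directly; the exact figures will differ from the notebook’s, since my toy_hash is only a stand-in:

```python
loops = all_loops()
on_loops = sum(len(loop) for loop in loops)
total = 36 ** N
print(f"{len(loops)} loops; {on_loops} of {total} nodes "
      f"({100 * on_loops / total:.2f}%) sit on a loop")
```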
Something we should mention here: see how the loops are necessarily isolated? We can deduce right now that any tails connected to these loops are also isolated and independent, since every hashed hash leads to exactly one other hash. The loops can in no way be connected.
Well, let's bring in the tails and see what else is going on.
Wow. Long tails. Lots and lots of collisions, pretty uniformly distributed.
I haven’t actually verified it, and it’s hard to tell from the graph above, but I would guess the lengths of the tails follow the same kind of distribution as the sizes of the loops, namely a power law: many small tails, many small loops, one exceedingly long tail, one exceedingly large loop (relative to the size of the other loops).
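That guess is easy to test on a toy space: measure every node’s distance to its loop (a close proxy for tail length) and tally the distribution. Again a stand-in sketch, so it checks the shape of the idea rather than these exact figures:

```python
from collections import Counter
from itertools import product

distances = Counter(
    walk_to_loop("".join(chars))[0] for chars in product(ALPHABET, repeat=N)
)
for distance, count in sorted(distances.items()):
    print(f"{count} nodes are {distance} steps away from their loop")
```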
Take a look at the right side of the graph: see that d4, b3, 23 loop at the bottom? Follow its tails all the way up and we see what we deduced earlier: they’re isolated from the other loops. That small 3-hash loop has a huge tail attached to it.
Let’s map the entire space of a 3-digit function: 36*36*36, or 46,656 nodes.
Again, the loop sizes follow the power-law distribution, and here we finally see nodes that lead directly to themselves: 00c, ddb, dc5.
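A node that leads directly to itself is just a one-node loop, so it’s easy to scan for with the stand-in toy_hash (set N = 3 for the 46,656-node space); for a random function we’d expect about one fixed point on average, whatever the size of the space:

```python
from itertools import product

fixed_points = []
for chars in product(ALPHABET, repeat=N):
    s = "".join(chars)
    if toy_hash(s) == s:  # a one-node loop: s hashes to itself
        fixed_points.append(s)
print(fixed_points or "no fixed points in this toy space")
```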
Unfortunately, the graph of all nodes was too complex to render, but we can imagine it would just be a sea of blue tails sparsely peppered with the above yellow loops.
So what?
Yeah, so what? The logical steps I walked through above seem pretty elementary, so I’m sure this analysis isn’t anything new to a real cryptographer.
I put it all together several months ago, exploring these concepts in a Jupyter notebook, then saved it and walked away, because I’m not really sure what other conclusions can come from it. What kind of useful insights can we glean? Not being a cryptographer, I don’t know.
But I can speculate. One speculation: if public-private key encryption is at all analogous to a hashed hash, as I’ve heard it might be in some cryptographic schemes (elliptic curve cryptography, I think), then this analysis suggests there could be many, many private keys (absolutely speaking, though not relatively speaking) leading to each public key. It would also require every public key to be a non-terminal node, leading to the conclusion that there are far fewer possible public keys than perhaps expected. That is to say: not every hash can serve as a public key.
But again, I don’t really know how private-public key encryption works, so those conclusions may be entirely fallacious.
These kinds of insights don’t let us break cryptography or anything since the space of any useful hash function is so large we can never hope to find a loop or a collision on any specific hash. But what if the hash space was much smaller?
Another thought process I’ve entertained related to all this is the idea that the search for these loops (and even tails) is something that requires a lot of work.
I wonder if this logical structure of the hash space could lead directly to a new proof-of-work algorithm, one better equipped to fight centralization: the resource required to produce a hash (computation) is trivial and therefore ubiquitous; the resource required to know whether a found hash is eventually useful (memory) is limited and therefore scarce; and the resource required to coordinate a shared structure (bandwidth) demands collaboration between the wealthy few and the many poor.
Bandwidth is, of course, the master resource, so it stands to reason that a distributed consensus algorithm that costs bandwidth would be the one best suited to remain distributed. But that, too, is my own intuition and speculation.
I hope this has been an interesting topic for you. Pure deductive logic from first principles is an exercise I always find enjoyable or enlightening, and this journey, deducing the nature and shape of the hash function state-space, turned out to be both.